Window, window on the wall …
  
November 20th, 2012 by depesz | Tags: aggregate, array_agg, postgresql, primer, string_agg, window, window functions | 9 comments »

And maybe not on the wall, but instead in your SQLz, eating your data.

But a bit more seriously. Ever since PostgreSQL 8.4 we have window functions, but still I see people which do not know it or are wary to use it.

That’s why I decided to write a piece on window functions. How they work and what they can be used for.

Before we start even thinking about window functions, we need some test data. So I made a simple table:

create table people (
    id          serial  primary  key,
    username    text,
    department  text,
    salary      int4
);

and filled it with 250 rows based on Polish, random, names, department names from very cool movie, and some random() salaries. Looks like this:

select * from people limit 10;
 id |    username     |     department     | salary
----+-----------------+--------------------+--------
  1 | Blanka Pawlak   | Psychology         |  25200
  2 | Fabian Krawczyk | Chem               |  47900
  3 | Tomasz Górski   | Zoology            |  39600
  4 | Oliwier Woźniak | Data Archives      |  93500
  5 | Rafał Olszewski | Maintennance       |  77500
  6 | Maciej Mazur    | Distribution       |  57700
  7 | Justyna Wróbel  | Internal Logistics |  82800
  8 | Nikodem Wójcik  | Wranglers          |  90500
  9 | Lena Szewczyk   | Electrical         |  66400
 10 | Filip Kamiński  | Engineering        |  42300
(10 rows)

( full table dump, for your own testing purposes, can be downloaded. )

To write more I have to assume some things:

  • You know what grouping is
  • You know how grouping works

Window functions, are functions which work on windows. That’s a surprise. And what is window? Well, for starters we can assume that windows are more or less the same as “GROUP BY” groups, but defined in such a way that each row can have it’s own set of groups.

Let’s see simplest possible example, and let’s work from there:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
SELECT
    department,
    username,
    salary,
    count(*) over ()
FROM
    people
order by department, username
limit 20;
   department   |        username        | salary | count
----------------+------------------------+--------+-------
 Accounting     | Alan Jaworski          | 102900 |   250
 Accounting     | Aleksander Nowicki     |  78900 |   250
 Accounting     | Anna Nowakowska        |  77800 |   250
 Accounting     | Hubert Zieliński       |  27600 |   250
 Accounting     | Konrad Nowak           |  85400 |   250
 Accounting     | Mateusz Zając          |  87600 |   250
 Accounting     | Michał Olszewski       |  86100 |   250
 Accounting     | Paweł Olszewski        |  46700 |   250
 Accounting     | Rafał Dąbrowski        |  46600 |   250
 Accounting     | Rafał Kowalczyk        |  48600 |   250
 Accounting     | Zofia Szymańska        |  95200 |   250
 Administration | Antonina Zalewska      |  72700 |   250
 Administration | Dominik Kowalczyk      |  96000 |   250
 Administration | Fabian Wieczorek       |  28500 |   250
 Administration | Jan Walczak            |  27400 |   250
 Administration | Maksymilian Piotrowski |  89600 |   250
 Administration | Nadia Nowak            | 109800 |   250
 Administration | Stanisław Mazur        |  74800 |   250
 Archives       | Adam Wojciechowski     |  51200 |   250
 Archives       | Aleksandra Zając       |  31200 |   250
(20 rows)

As you can see I got count of all people in separate column. Of course – this example is not very interesting, as I could have done:

select *, (select count(*) from people) from people;

and would get more or less the same thing. That’s true. Lots of stuff that I’ll show with window functions can be done also differently. The point of window functions is that they are easier to write, and they tend to avoid additional table scans.

Anyway. Let’s see what exactly was happening. The interesting thing is line 5:

count(*) over ()

This is basic window function syntax. There is function call (count(*) in this case), and then there is keyword “OVER” and so called “window definition” in parentheses. In this case window definition is empty, which basically means: treat whole resultset as your “virtual group”.

You might notice that I used count(*), which is commonly used aggregate. The great thing is that while there specific window functions, you can also use any aggregate functions (like: min, max, avg, count, sum, array_agg, string_agg and so on) as window functions!

So the expression “count(*) over ()” means – get count of all rows. Simple. But we can do better. Let’s say, we’d want to compare given person salary to average salary in her department. Query:

select
    department,
    username,
    salary,
    avg( salary ) over ( partition by department )
from
    people
order by department, username
limit 20;
   department   |        username        | salary |        avg
----------------+------------------------+--------+--------------------
 Accounting     | Alan Jaworski          | 102900 | 71218.181818181818
 Accounting     | Aleksander Nowicki     |  78900 | 71218.181818181818
 Accounting     | Anna Nowakowska        |  77800 | 71218.181818181818
 Accounting     | Hubert Zieliński       |  27600 | 71218.181818181818
 Accounting     | Konrad Nowak           |  85400 | 71218.181818181818
 Accounting     | Mateusz Zając          |  87600 | 71218.181818181818
 Accounting     | Michał Olszewski       |  86100 | 71218.181818181818
 Accounting     | Paweł Olszewski        |  46700 | 71218.181818181818
 Accounting     | Rafał Dąbrowski        |  46600 | 71218.181818181818
 Accounting     | Rafał Kowalczyk        |  48600 | 71218.181818181818
 Accounting     | Zofia Szymańska        |  95200 | 71218.181818181818
 Administration | Antonina Zalewska      |  72700 | 71257.142857142857
 Administration | Dominik Kowalczyk      |  96000 | 71257.142857142857
 Administration | Fabian Wieczorek       |  28500 | 71257.142857142857
 Administration | Jan Walczak            |  27400 | 71257.142857142857
 Administration | Maksymilian Piotrowski |  89600 | 71257.142857142857
 Administration | Nadia Nowak            | 109800 | 71257.142857142857
 Administration | Stanisław Mazur        |  74800 | 71257.142857142857
 Archives       | Adam Wojciechowski     |  51200 | 69228.571428571429
 Archives       | Aleksandra Zając       |  31200 | 69228.571428571429
(20 rows)

The window function call:

avg( salary ) over ( partition by department )

means that we want average salary for windows, where window, for each row is taken by “grouping” all rows with department being the same as department in row that is being processed.

So, to write it using group by, I could:

select
    p.department,
    p.username,
    p.salary,
    d.avg
from
    people p
    join (
        select department, avg(salary)
        from people
        group by department
    ) as d using (department)
order by p.department, p.username
limit 20;

The thing is that PostgreSQL does the “join” automatically based on values of columns in row that uses window function (not sure if that’s clear).

Within window definition, you can use multiple columns (or expressions) for “PARTITION BY”. For example, we could do something like this:

select
    department,
    username,
    salary,
    count(*) over (partition by department, salary < 50000 )
from
    people
order by department, username
limit 20;
   department   |        username        | salary | count
----------------+------------------------+--------+-------
 Accounting     | Alan Jaworski          | 102900 |     7
 Accounting     | Aleksander Nowicki     |  78900 |     7
 Accounting     | Anna Nowakowska        |  77800 |     7
 Accounting     | Hubert Zieliński       |  27600 |     4
 Accounting     | Konrad Nowak           |  85400 |     7
 Accounting     | Mateusz Zając          |  87600 |     7
 Accounting     | Michał Olszewski       |  86100 |     7
 Accounting     | Paweł Olszewski        |  46700 |     4
 Accounting     | Rafał Dąbrowski        |  46600 |     4
 Accounting     | Rafał Kowalczyk        |  48600 |     4
 Accounting     | Zofia Szymańska        |  95200 |     7
 Administration | Antonina Zalewska      |  72700 |     5
 Administration | Dominik Kowalczyk      |  96000 |     5
 Administration | Fabian Wieczorek       |  28500 |     2
 Administration | Jan Walczak            |  27400 |     2
 Administration | Maksymilian Piotrowski |  89600 |     5
 Administration | Nadia Nowak            | 109800 |     5
 Administration | Stanisław Mazur        |  74800 |     5
 Archives       | Adam Wojciechowski     |  51200 |    11
 Archives       | Aleksandra Zając       |  31200 |     3
(20 rows)

In here, the count is number of people, including current person, that are in the same salary group, where group is defined as < 50000, or >= 50000, but all within the same department.

The other, very important, and commonly used, part of window definition, is sorting. This is something absolutely great.

When dealing with groups, values within the group where basically tossed together in a bag, without any order.

But within windows – rows can be sorted. And the sorting can be used for interesting things:

select
    department,
    username,
    count(*) over (partition by department),
    count(*) over (partition by department order by username)
from people
order by department, username limit 20;
   department   |        username        | count | count
----------------+------------------------+-------+-------
 Accounting     | Alan Jaworski          |    11 |     1
 Accounting     | Aleksander Nowicki     |    11 |     2
 Accounting     | Anna Nowakowska        |    11 |     3
 Accounting     | Hubert Zieliński       |    11 |     4
 Accounting     | Konrad Nowak           |    11 |     5
 Accounting     | Mateusz Zając          |    11 |     6
 Accounting     | Michał Olszewski       |    11 |     7
 Accounting     | Paweł Olszewski        |    11 |     8
 Accounting     | Rafał Dąbrowski        |    11 |     9
 Accounting     | Rafał Kowalczyk        |    11 |    10
 Accounting     | Zofia Szymańska        |    11 |    11
 Administration | Antonina Zalewska      |     7 |     1
 Administration | Dominik Kowalczyk      |     7 |     2
 Administration | Fabian Wieczorek       |     7 |     3
 Administration | Jan Walczak            |     7 |     4
 Administration | Maksymilian Piotrowski |     7 |     5
 Administration | Nadia Nowak            |     7 |     6
 Administration | Stanisław Mazur        |     7 |     7
 Archives       | Adam Wojciechowski     |    14 |     1
 Archives       | Aleksandra Zając       |    14 |     2
(20 rows)

Please note that adding “order by” to window definition changes scope of window function – it doesn’t work on all rows in given partition – it works just on rows from start to given row, based on ordering.

This means that doing cumulative sum gets trivial:

select
    department,
    username,
    salary,
    sum(salary) over (partition by department order by username)
from people
order by department, username
limit 20;
   department   |        username        | salary |  sum
----------------+------------------------+--------+--------
 Accounting     | Alan Jaworski          | 102900 | 102900
 Accounting     | Aleksander Nowicki     |  78900 | 181800
 Accounting     | Anna Nowakowska        |  77800 | 259600
 Accounting     | Hubert Zieliński       |  27600 | 287200
 Accounting     | Konrad Nowak           |  85400 | 372600
 Accounting     | Mateusz Zając          |  87600 | 460200
 Accounting     | Michał Olszewski       |  86100 | 546300
 Accounting     | Paweł Olszewski        |  46700 | 593000
 Accounting     | Rafał Dąbrowski        |  46600 | 639600
 Accounting     | Rafał Kowalczyk        |  48600 | 688200
 Accounting     | Zofia Szymańska        |  95200 | 783400
 Administration | Antonina Zalewska      |  72700 |  72700
 Administration | Dominik Kowalczyk      |  96000 | 168700
 Administration | Fabian Wieczorek       |  28500 | 197200
 Administration | Jan Walczak            |  27400 | 224600
 Administration | Maksymilian Piotrowski |  89600 | 314200
 Administration | Nadia Nowak            | 109800 | 424000
 Administration | Stanisław Mazur        |  74800 | 498800
 Archives       | Adam Wojciechowski     |  51200 |  51200
 Archives       | Aleksandra Zając       |  31200 |  8240


gipoco.com is neither affiliated with the authors of this page nor responsible for its contents. This is a safe-cache copy of the original web site.