And maybe not on the wall, but instead in your SQLz, eating your data.
But a bit more seriously. Ever since PostgreSQL 8.4 we have window functions, but still I see people which do not know it or are wary to use it.
That’s why I decided to write a piece on window functions. How they work and what they can be used for.
Before we start even thinking about window functions, we need some test data. So I made a simple table:
create table people ( id serial primary key, username text, department text, salary int4 ); |
and filled it with 250 rows based on Polish, random, names, department names from very cool movie, and some random() salaries. Looks like this:
select * from people limit 10; id | username | department | salary ----+-----------------+--------------------+-------- 1 | Blanka Pawlak | Psychology | 25200 2 | Fabian Krawczyk | Chem | 47900 3 | Tomasz Górski | Zoology | 39600 4 | Oliwier Woźniak | Data Archives | 93500 5 | Rafał Olszewski | Maintennance | 77500 6 | Maciej Mazur | Distribution | 57700 7 | Justyna Wróbel | Internal Logistics | 82800 8 | Nikodem Wójcik | Wranglers | 90500 9 | Lena Szewczyk | Electrical | 66400 10 | Filip Kamiński | Engineering | 42300 (10 rows) |
( full table dump, for your own testing purposes, can be downloaded. )
To write more I have to assume some things:
- You know what grouping is
- You know how grouping works
Window functions, are functions which work on windows. That’s a surprise. And what is window? Well, for starters we can assume that windows are more or less the same as “GROUP BY” groups, but defined in such a way that each row can have it’s own set of groups.
Let’s see simplest possible example, and let’s work from there:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 | SELECT department, username, salary, count(*) over () FROM people order by department, username limit 20; department | username | salary | count ----------------+------------------------+--------+------- Accounting | Alan Jaworski | 102900 | 250 Accounting | Aleksander Nowicki | 78900 | 250 Accounting | Anna Nowakowska | 77800 | 250 Accounting | Hubert Zieliński | 27600 | 250 Accounting | Konrad Nowak | 85400 | 250 Accounting | Mateusz Zając | 87600 | 250 Accounting | Michał Olszewski | 86100 | 250 Accounting | Paweł Olszewski | 46700 | 250 Accounting | Rafał Dąbrowski | 46600 | 250 Accounting | Rafał Kowalczyk | 48600 | 250 Accounting | Zofia Szymańska | 95200 | 250 Administration | Antonina Zalewska | 72700 | 250 Administration | Dominik Kowalczyk | 96000 | 250 Administration | Fabian Wieczorek | 28500 | 250 Administration | Jan Walczak | 27400 | 250 Administration | Maksymilian Piotrowski | 89600 | 250 Administration | Nadia Nowak | 109800 | 250 Administration | Stanisław Mazur | 74800 | 250 Archives | Adam Wojciechowski | 51200 | 250 Archives | Aleksandra Zając | 31200 | 250 (20 rows) |
As you can see I got count of all people in separate column. Of course – this example is not very interesting, as I could have done:
select *, (select count(*) from people) from people; |
and would get more or less the same thing. That’s true. Lots of stuff that I’ll show with window functions can be done also differently. The point of window functions is that they are easier to write, and they tend to avoid additional table scans.
Anyway. Let’s see what exactly was happening. The interesting thing is line 5:
count(*) over () |
This is basic window function syntax. There is function call (count(*) in this case), and then there is keyword “OVER” and so called “window definition” in parentheses. In this case window definition is empty, which basically means: treat whole resultset as your “virtual group”.
You might notice that I used count(*), which is commonly used aggregate. The great thing is that while there specific window functions, you can also use any aggregate functions (like: min, max, avg, count, sum, array_agg, string_agg and so on) as window functions!
So the expression “count(*) over ()” means – get count of all rows. Simple. But we can do better. Let’s say, we’d want to compare given person salary to average salary in her department. Query:
select department, username, salary, avg( salary ) over ( partition by department ) from people order by department, username limit 20; department | username | salary | avg ----------------+------------------------+--------+-------------------- Accounting | Alan Jaworski | 102900 | 71218.181818181818 Accounting | Aleksander Nowicki | 78900 | 71218.181818181818 Accounting | Anna Nowakowska | 77800 | 71218.181818181818 Accounting | Hubert Zieliński | 27600 | 71218.181818181818 Accounting | Konrad Nowak | 85400 | 71218.181818181818 Accounting | Mateusz Zając | 87600 | 71218.181818181818 Accounting | Michał Olszewski | 86100 | 71218.181818181818 Accounting | Paweł Olszewski | 46700 | 71218.181818181818 Accounting | Rafał Dąbrowski | 46600 | 71218.181818181818 Accounting | Rafał Kowalczyk | 48600 | 71218.181818181818 Accounting | Zofia Szymańska | 95200 | 71218.181818181818 Administration | Antonina Zalewska | 72700 | 71257.142857142857 Administration | Dominik Kowalczyk | 96000 | 71257.142857142857 Administration | Fabian Wieczorek | 28500 | 71257.142857142857 Administration | Jan Walczak | 27400 | 71257.142857142857 Administration | Maksymilian Piotrowski | 89600 | 71257.142857142857 Administration | Nadia Nowak | 109800 | 71257.142857142857 Administration | Stanisław Mazur | 74800 | 71257.142857142857 Archives | Adam Wojciechowski | 51200 | 69228.571428571429 Archives | Aleksandra Zając | 31200 | 69228.571428571429 (20 rows) |
The window function call:
avg( salary ) over ( partition by department ) |
means that we want average salary for windows, where window, for each row is taken by “grouping” all rows with department being the same as department in row that is being processed.
So, to write it using group by, I could:
select p.department, p.username, p.salary, d.avg from people p join ( select department, avg(salary) from people group by department ) as d using (department) order by p.department, p.username limit 20; |
The thing is that PostgreSQL does the “join” automatically based on values of columns in row that uses window function (not sure if that’s clear).
Within window definition, you can use multiple columns (or expressions) for “PARTITION BY”. For example, we could do something like this:
select department, username, salary, count(*) over (partition by department, salary < 50000 ) from people order by department, username limit 20; department | username | salary | count ----------------+------------------------+--------+------- Accounting | Alan Jaworski | 102900 | 7 Accounting | Aleksander Nowicki | 78900 | 7 Accounting | Anna Nowakowska | 77800 | 7 Accounting | Hubert Zieliński | 27600 | 4 Accounting | Konrad Nowak | 85400 | 7 Accounting | Mateusz Zając | 87600 | 7 Accounting | Michał Olszewski | 86100 | 7 Accounting | Paweł Olszewski | 46700 | 4 Accounting | Rafał Dąbrowski | 46600 | 4 Accounting | Rafał Kowalczyk | 48600 | 4 Accounting | Zofia Szymańska | 95200 | 7 Administration | Antonina Zalewska | 72700 | 5 Administration | Dominik Kowalczyk | 96000 | 5 Administration | Fabian Wieczorek | 28500 | 2 Administration | Jan Walczak | 27400 | 2 Administration | Maksymilian Piotrowski | 89600 | 5 Administration | Nadia Nowak | 109800 | 5 Administration | Stanisław Mazur | 74800 | 5 Archives | Adam Wojciechowski | 51200 | 11 Archives | Aleksandra Zając | 31200 | 3 (20 rows) |
In here, the count is number of people, including current person, that are in the same salary group, where group is defined as < 50000, or >= 50000, but all within the same department.
The other, very important, and commonly used, part of window definition, is sorting. This is something absolutely great.
When dealing with groups, values within the group where basically tossed together in a bag, without any order.
But within windows – rows can be sorted. And the sorting can be used for interesting things:
select department, username, count(*) over (partition by department), count(*) over (partition by department order by username) from people order by department, username limit 20; department | username | count | count ----------------+------------------------+-------+------- Accounting | Alan Jaworski | 11 | 1 Accounting | Aleksander Nowicki | 11 | 2 Accounting | Anna Nowakowska | 11 | 3 Accounting | Hubert Zieliński | 11 | 4 Accounting | Konrad Nowak | 11 | 5 Accounting | Mateusz Zając | 11 | 6 Accounting | Michał Olszewski | 11 | 7 Accounting | Paweł Olszewski | 11 | 8 Accounting | Rafał Dąbrowski | 11 | 9 Accounting | Rafał Kowalczyk | 11 | 10 Accounting | Zofia Szymańska | 11 | 11 Administration | Antonina Zalewska | 7 | 1 Administration | Dominik Kowalczyk | 7 | 2 Administration | Fabian Wieczorek | 7 | 3 Administration | Jan Walczak | 7 | 4 Administration | Maksymilian Piotrowski | 7 | 5 Administration | Nadia Nowak | 7 | 6 Administration | Stanisław Mazur | 7 | 7 Archives | Adam Wojciechowski | 14 | 1 Archives | Aleksandra Zając | 14 | 2 (20 rows) |
Please note that adding “order by” to window definition changes scope of window function – it doesn’t work on all rows in given partition – it works just on rows from start to given row, based on ordering.
This means that doing cumulative sum gets trivial:
select department, username, salary, sum(salary) over (partition by department order by username) from people order by department, username limit 20; department | username | salary | sum ----------------+------------------------+--------+-------- Accounting | Alan Jaworski | 102900 | 102900 Accounting | Aleksander Nowicki | 78900 | 181800 Accounting | Anna Nowakowska | 77800 | 259600 Accounting | Hubert Zieliński | 27600 | 287200 Accounting | Konrad Nowak | 85400 | 372600 Accounting | Mateusz Zając | 87600 | 460200 Accounting | Michał Olszewski | 86100 | 546300 Accounting | Paweł Olszewski | 46700 | 593000 Accounting | Rafał Dąbrowski | 46600 | 639600 Accounting | Rafał Kowalczyk | 48600 | 688200 Accounting | Zofia Szymańska | 95200 | 783400 Administration | Antonina Zalewska | 72700 | 72700 Administration | Dominik Kowalczyk | 96000 | 168700 Administration | Fabian Wieczorek | 28500 | 197200 Administration | Jan Walczak | 27400 | 224600 Administration | Maksymilian Piotrowski | 89600 | 314200 Administration | Nadia Nowak | 109800 | 424000 Administration | Stanisław Mazur | 74800 | 498800 Archives | Adam Wojciechowski | 51200 | 51200 Archives | Aleksandra Zając | 31200 | 8240 |