Introduction to Window Functions in SQL
It is interesting that many people working with data have no clue about window functions in SQL. During a long period of time instead of using window functions, I preferred coding in Python and pandas. Today I would like to introduce you to the concept of "window" and how it is related to data extraction from a SQL database.
Window functions are applied to a subset/window of rows related to one another. In comparison to
GROUP BY operation, window functions do not decrease the number of rows. Aggregate functions like
COUNT could be used as window functions as well. Usually, window functions are used to do analytical tasks. The following examples of queries will be performed on the PostgreSQL database.
<function>(<expression>) OVER ( <window> <sorting> <window range> -- optional )
Looks creepy 😲 We need more practice. Let's assume that we have a salary table:
One day your boss approaches you, he wants to know who is the highest-paid employee by the department. We can use the
MAX function to select the highest salary for each department.
SELECT department, MAX(gross_salary) as max_salary FROM Salary GROUP BY 1;
The result is:
In order to select the person with the highest salary by department, we can use a subquery and
SELECT id, first_name, department, t.gross_salary FROM Salary JOIN ( SELECT department, MAX(gross_salary) as gross_salary FROM Salary GROUP BY 1 ) t USING(gross_salary, department);
The result is:
But the query looks verbose and dirty, maybe we can do the same with window functions? Sure. All you need to do is to set up a window for
MAX aggregate function:
SELECT id, first_name, department, gross_salary, MAX(gross_salary) OVER (PARTITION BY department) as max_gross_salary FROM Salary;
department is a window. It means that employees from the same department is a subset or window of related rows. The result of the query:
As you can see we have the new column called
max_gross_salary which shows the highest salary in each department. In order to retrieve the list of most paid employees we do the following:
SELECT * FROM ( SELECT id, first_name, department, gross_salary, MAX(gross_salary) OVER (PARTITION BY department) as max_gross_salary FROM Salary ) t WHERE max_gross_salary = gross_salary ORDER BY id;
Now your boss asks you to prepare the report which shows the ratio of salaries by department and by the total wage fund because some departments have relatively low-income employees in comparison with other departments. You can solve it using subqueries:
WITH gross_by_departments AS ( SELECT department, SUM(gross_salary) as dep_gross_salary FROM Salary GROUP BY 1 ) SELECT id, first_name, department, gross_salary, ROUND(CAST(gross_salary AS numeric(9, 2)) / dep_gross_salary * 100, 2) as dep_ratio, ROUND(CAST(gross_salary AS numeric(9, 2)) / (SELECT SUM(gross_salary) FROM Salary) * 100, 2) as total_ratio FROM Salary JOIN gross_by_departments USING(department) ORDER BY department, dep_ratio DESC
The result is:
As you can see Нина earns 71.4% out of HR department, but it is only 10.7% out of the total wage fund. Аркадий on the other hand has 21.4% out of total wage fund and 41% out of IT department. Can we refactor the query to make it smaller and more readable? Yes! Window functions!
SELECT id, first_name, department, gross_salary, ROUND(CAST(gross_salary AS numeric(9,2)) / SUM(gross_salary) OVER (PARTITION BY department) * 100, 2) as dep_ratio, ROUND(CAST(gross_salary AS numeric(9,2)) / SUM(gross_salary) OVER () * 100, 2) as total_ratio FROM Salary ORDER BY department, dep_ratio DESC;
OVER() means that the "window" is all rows.
Window functions only
We used the aggregate functions
SUM as window functions above. But SQL standard consists of functions that cannot be used as aggregates hence no way to apply them during grouping. Here is the list of them:
A comprehensive list of all window functions supported in PostgreSQL you can find here.
Using window functions
Let's use the
first_value function in order to solve the very first problem where we were asked to get the name of the highest-paid employee by each department. The function returns the very first values according to the provided window.
SELECT id, first_name, department, gross_salary, first_value(first_name) OVER (PARTITION BY department ORDER BY gross_salary DESC ) as highest_paid_employee FROM Salary
We are sorting the rows in our window by salary in descending order, which means that the very first row will be a person with the highest salary.
Now let's use
last_value. It does the opposite of what
SELECT id, first_name, department, gross_salary, last_value(first_name) OVER (PARTITION BY department ORDER BY gross_salary DESC) AS lowest_paid_employee FROM Salary
The result is:
If you look at the table you will see that the result is wrong. Why? Because of window ranges. By default, if you provide the
ORDER BY clause the window range is all preceding rows and current row, in SQL terms it is
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. It means that the
last_value for a particular column in a particular row is a column in the same row. In order to fix this problem, we need to change the range to
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING:
SELECT id, first_name, department, gross_salary, last_value(first_name) OVER ( PARTITION BY department ORDER BY gross_salary DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) as lowest_paid_employee FROM Salary
This is how it looks visually:
The possible values:
- N PRECEDING, N number of rows until current row
- CURRENT ROW
- UNBOUNDED PRECEDING
- UNBOUNDED FOLLOWING
- N FOLLOWING**, N number of rows following the current row