SQL Window functions
What are Window Functions?
Window functions are functions that allow you to perform calculations over a subset of data within a larger table. They are different from regular aggregate functions like SUM and AVG, which calculate a single value for an entire column. Window functions, on the other hand, calculate a value for each row based on a specified window or subset of rows.
This may sound simple, yet confusing. Let’s break it down into smaller pieces and gain a better understanding.
In this post, we will use the Sakila sample database provided by MySQL
Suppose you want to check the average rental amount for all rentals.
Average rental amount:
|
|
Output:
avg_rental_amount |
---|
4.20 |
This is an aggregate value for all rows, and the result set includes only one value.
Average rental amount per each film category(GROUP BY
):
To aggregate the average value for each film category, use the GROUP BY
function. The resulting table should have one row per film category.
|
|
Output:
name | avg_rental_amount |
---|---|
Comedy | 4.66 |
New | 4.63 |
Sports | 4.51 |
Games | 4.42 |
Horror | 4.40 |
What if we want to display each rental detail alongside the category average? Here’s where the WINDOW
function comes in.
Common Terms
- Partitioning: Partitioning involves dividing a table into smaller groups based on one or more columns. Window functions can then be applied to each partition separately, enabling you to perform calculations on a subset of rows within each partition.
- Order by: The
ORDER BY
clause is used to sort the rows within each partition in a specific order. - Frame clauses: Frame clauses specify the range of rows to include in the window function calculation. There are two types of frame clauses: rows and range.
Average rental amount per each film category(OVER
& PARTITION BY
):
|
|
Output:
rental_id | inventory_id | film_category | avg_rental_amount |
---|---|---|---|
1 | 367 | Family | 3.86 |
2 | 1525 | Music | 4.12 |
3 | 1711 | Children | 3.87 |
4 | 2452 | Horror | 4.40 |
5 | 2079 | Children | 3.87 |
6 | 2792 | Comedy | 4.66 |
7 | 3995 | Horror | 4.40 |
8 | 2346 | Animation | 3.99 |
9 | 2580 | Foreign | 4.13 |
10 | 1824 | Drama | 4.33 |
The result includes each rental, as well as the average rental amount for each category.
Average rental amount per each film category(OVER
& PARTITION BY
& ORDER BY
) :
|
|
rental_id | inventory_id | film_category | avg_rental_amount |
---|---|---|---|
1 | 367 | Family | 4.54 |
361 | 6 | Documentary | 1.75 |
424 | 2815 | Drama | 3.84 |
38 | 2540 | Family | 3.65 |
485 | 2392 | Comedy | 4.63 |
15 | 3049 | Documentary | 3.83 |
565 | 1494 | Action | 3.97 |
700 | 500 | Action | 3.36 |
3434 | 2295 | Family | 3.67 |
**Output: (**Please note only a subset of the output is shown below)
Here, we ordered the rows within each film category according to their inventory ID. This resulted in a change in the average rental amount for each inventory.
ORDER BY
clause in a window function while calculating aggregates.Running total and Running average(also referred as moving average)(OVER
& PARTITION BY
& ORDER BY
& FRAME
):
|
|
A running total is the continuous sum of a series of values over time. For example, if you have a list of numbers (1, 2, 3, 4), the running total would be (1, 3, 6, 10). Each number in the running total is the sum of all the previous numbers in the series.
Running average is the calculation of the average of a specified period in a time series. In our example, we use a time period of 3 days to calculate the average rental amount, taking into account the previous day, current day, and next day.
To calculate the Running Average of May 26, 2005, for the Action category, we will take the average of the rental amounts for May 25th, 26th, and 27th. For the first row, there won’t be any preceding row, and for the last row, there won’t be any following row. In these cases, we will consider a 2-day average. Since our data starts on May 25th, 2005, and there is no data for May 24th, we will calculate the running average for May 25th and 26th.
Output:
To better understand the running average values, all of the output data has been embedded in this document. Instead of providing just a couple of sample rows for explanation, you can validate or manually calculate the running average values yourself.
In summary, window functions allow you to perform calculations over a subset of data within a larger table, instead of calculating a single value for an entire column like regular aggregate functions. Window functions enable you to perform calculations on a subset of rows within each partition of a table based on a specified window or subset of rows. By using the OVER
, PARTITION BY
, ORDER BY
, and frame clauses, you can create more complex queries that include calculated values for each row alongside the subset’s aggregate value. Overall, window functions are a powerful tool for data analysis and can significantly simplify complex queries.
My next post will cover non-aggregate window functions such as RANK
, DENSE_RANK
, and NTILE
etc
References: MySQL window functions doc, MySQL Frame clause doc, Running Average