Contents

SQL Window functions

What are Window Functions?

Window functions are functions that allow you to perform calculations over a subset of data within a larger table. They are different from regular aggregate functions like SUM and AVG, which calculate a single value for an entire column. Window functions, on the other hand, calculate a value for each row based on a specified window or subset of rows.

This may sound simple, yet confusing. Let’s break it down into smaller pieces and gain a better understanding.

In this post, we will use the Sakila sample database provided by MySQL

Suppose you want to check the average rental amount for all rentals.

Average rental amount:

1
2
SELECT ROUND(AVG(p.amount), 2) as avg_rental_amount
FROM payment p;

Output:

avg_rental_amount
4.20

This is an aggregate value for all rows, and the result set includes only one value.

Average rental amount per each film category(GROUP BY):

To aggregate the average value for each film category, use the GROUP BY function. The resulting table should have one row per film category.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
SELECT c.name,
    ROUND(AVG(p.amount), 2) as avg_rental_amount
FROM rental r
    JOIN payment p on r.rental_id = p.rental_id
    JOIN inventory i on r.inventory_id = i.inventory_id
    JOIN film f on i.film_id = f.film_id
    JOIN film_category fc on f.film_id = fc.film_id
    JOIN category c on fc.category_id = c.category_id
GROUP BY c.name
ORDER BY avg_rental_amount DESC
LIMIT 5;

Output:

name avg_rental_amount
Comedy 4.66
New 4.63
Sports 4.51
Games 4.42
Horror 4.40

What if we want to display each rental detail alongside the category average? Here’s where the WINDOW function comes in.

Common Terms

  • Partitioning: Partitioning involves dividing a table into smaller groups based on one or more columns. Window functions can then be applied to each partition separately, enabling you to perform calculations on a subset of rows within each partition.
  • Order by: The ORDER BY clause is used to sort the rows within each partition in a specific order.
  • Frame clauses: Frame clauses specify the range of rows to include in the window function calculation. There are two types of frame clauses: rows and range.

Average rental amount per each film category(OVER & PARTITION BY):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
SELECT r.rental_id,
    r.inventory_id,
    c.name as film_category,
    ROUND(AVG(p.amount) OVER(PARTITION BY c.name), 2) as avg_rental_amount
FROM rental r
    JOIN payment p on r.rental_id = p.rental_id
    JOIN inventory i on r.inventory_id = i.inventory_id
    JOIN film f on i.film_id = f.film_id
    JOIN film_category fc on f.film_id = fc.film_id
    JOIN category c on fc.category_id = c.category_id
ORDER BY r.rental_id
LIMIT 10;

Output:

rental_id inventory_id film_category avg_rental_amount
1 367 Family 3.86
2 1525 Music 4.12
3 1711 Children 3.87
4 2452 Horror 4.40
5 2079 Children 3.87
6 2792 Comedy 4.66
7 3995 Horror 4.40
8 2346 Animation 3.99
9 2580 Foreign 4.13
10 1824 Drama 4.33

The result includes each rental, as well as the average rental amount for each category.

Average rental amount per each film category(OVER & PARTITION BY & ORDER BY) :

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
SELECT r.rental_id,
    r.inventory_id,
    c.name as film_category,
    ROUND(
        AVG(p.amount) OVER(
            PARTITION BY c.name
            ORDER BY i.inventory_id
        ),
        2
    ) as avg_rental_amount
FROM rental r
    JOIN payment p on r.rental_id = p.rental_id
    JOIN inventory i on r.inventory_id = i.inventory_id
    JOIN film f on i.film_id = f.film_id
    JOIN film_category fc on f.film_id = fc.film_id
    JOIN category c on fc.category_id = c.category_id
ORDER BY r.rental_id;
rental_id inventory_id film_category avg_rental_amount
1 367 Family 4.54
361 6 Documentary 1.75
424 2815 Drama 3.84
38 2540 Family 3.65
485 2392 Comedy 4.63
15 3049 Documentary 3.83
565 1494 Action 3.97
700 500 Action 3.36
3434 2295 Family 3.67

**Output: (**Please note only a subset of the output is shown below)

Here, we ordered the rows within each film category according to their inventory ID. This resulted in a change in the average rental amount for each inventory.

Warning
Hence, we should be mindful when adding an ORDER BY clause in a window function while calculating aggregates.

Running total and Running average(also referred as moving average)(OVER & PARTITION BY & ORDER BY & FRAME):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
SELECT CAST(DATE_FORMAT(r.rental_date, '%Y-%m-%d') AS date) as rental_date,
    c.name as film_category,
    p.amount,
    SUM(p.amount) OVER(
        PARTITION BY c.name
        ORDER BY DATE_FORMAT(r.rental_date, '%Y-%m-%d') ROWS UNBOUNDED PRECEDING
    ) as running_total,
    AVG(p.amount) OVER(
        PARTITION BY c.name
        ORDER BY CAST(DATE_FORMAT(r.rental_date, '%Y-%m-%d') AS date)
    ) as avg_per_cat_per_day,
    AVG(p.amount) OVER(
        PARTITION BY c.name
        ORDER BY CAST(DATE_FORMAT(r.rental_date, '%Y-%m-%d') AS date) RANGE BETWEEN INTERVAL 1 DAY PRECEDING
            AND INTERVAL 1 DAY FOLLOWING
    ) as rolling_avg
FROM rental r
    JOIN payment p on r.rental_id = p.rental_id
    JOIN inventory i on r.inventory_id = i.inventory_id
    JOIN film f on i.film_id = f.film_id
    JOIN film_category fc on f.film_id = fc.film_id
    JOIN category c on fc.category_id = c.category_id;

A running total is the continuous sum of a series of values over time. For example, if you have a list of numbers (1, 2, 3, 4), the running total would be (1, 3, 6, 10). Each number in the running total is the sum of all the previous numbers in the series.

Running average is the calculation of the average of a specified period in a time series. In our example, we use a time period of 3 days to calculate the average rental amount, taking into account the previous day, current day, and next day.

To calculate the Running Average of May 26, 2005, for the Action category, we will take the average of the rental amounts for May 25th, 26th, and 27th. For the first row, there won’t be any preceding row, and for the last row, there won’t be any following row. In these cases, we will consider a 2-day average. Since our data starts on May 25th, 2005, and there is no data for May 24th, we will calculate the running average for May 25th and 26th.

Output:

To better understand the running average values, all of the output data has been embedded in this document. Instead of providing just a couple of sample rows for explanation, you can validate or manually calculate the running average values yourself.

In summary, window functions allow you to perform calculations over a subset of data within a larger table, instead of calculating a single value for an entire column like regular aggregate functions. Window functions enable you to perform calculations on a subset of rows within each partition of a table based on a specified window or subset of rows. By using the OVER, PARTITION BY, ORDER BY, and frame clauses, you can create more complex queries that include calculated values for each row alongside the subset’s aggregate value. Overall, window functions are a powerful tool for data analysis and can significantly simplify complex queries.

My next post will cover non-aggregate window functions such as RANK, DENSE_RANK, and NTILE etc

References: MySQL window functions doc, MySQL Frame clause doc, Running Average