Any person that has worked with data analytics has had a bad day when they sighed over a problem that was intuitively simple but practically hard to crack using pure SQL.
What is the revenue growth month over month and running total revenue? Can we trust the metric, or does the data have some accidental duplicates that affect it? What are the top N orders for every month? What is the repeat purchase behavior? All these questions have to be translated from business language to programming language.
An intuitive solution in a countless number of cases like these is, “If only I could just loop over the results of my query I would be able to get the answer right away.” This opens a door to a world of workarounds: writing complex joins, data ending up in another spreadsheet, using procedural extensions of SQL, or even moving data processing outside the database. Not all alternatives are viable, others are just ugly.
At the same time, there’s a pure SQL implementation called “window functions” that is very readable, performant, and easy to debug.
In this SQL window functions tutorial, we will describe how these functions work in general, what is behind their syntax, and show how to answer these questions with pure SQL.
Learn by doing
For people unfamiliar with SQL window functions, the official database documentation can be hard to decipher, so we’ll go with real examples, starting from very basic queries and increasing the degree of complexity.
Questions that were listed in the introduction above are real business questions that practically any company faces, and window functions are a huge help with answering them. Imagine we have the following table of orders:
This dataset is very simple and small, however it is sufficient to illustrate the power of SQL window functions. Let’s start with the revenue growth.
In business terms, revenue growth in month M1 is calculated as:100*(m1-m0)/m0
where m1 is the revenue in the given month and m0 is the revenue in the previous month. So, technically we would need to find what the revenue is for each month and then somehow relate every month to the previous one to be able to do the calculation above. A very easy way to do this would be to calculate the monthly revenue in SQL:
then copy the output into the spreadsheet and use a formula to produce a growth metric:
When you copy the formula from cell C3 to cell C4 and so on, references are automatically shifted down, that’s what spreadsheets are good at. But what if you’d like to have this metric as a part of a nice dashboard that is fed by data coming directly from the database? Such tools as Statsbot can help you with that:
In this case, a spreadsheet is definitely not what you want to end up with. Let’s try to calculate this in SQL. Without window functions, a query that gets you the final result would look like this:
You’d have to calculate monthly revenue, then get the result of the previous month using self-join and use it in the final formula. The logic is broken down into 3 steps for clarity. You can’t break it down further, but even so the second step can be quite confusing.
What you have to keep in mind about datediff is that that the minuend (i.e. what you subtract from) is the third parameter of the function and the subtrahend (i.e. what you subtract) is the second parameter. I personally think that’s a bit counterintuitive for subtraction, and the self join concept itself is not basic. There’s actually a much better way to express the same logic:
Let’s break down the lag(… line of code:
- lag is a window function that gets you the previous row
- revenue is the expression of what exactly you would like to get from that row
- over (order by month) is your window specification that tells how exactly you would like to sort the rows to identify which row is the previous one (there’s plenty of options). In our case, we told the database to get the previous month’s row for every given month.
This is a generic structure of all SQL window functions: function itself, expression and other parameters, and window specification:
function (expression, [parameters]) OVER (window specification)
It is very clean and powerful and has countless opportunities. For example, you can add partition by to your window specification to look at different groups of rows individually:
Calculating revenue by state in the first subquery and updating the window specification to take the new grouping into account, you can look at each state individually and see what the revenue growth by state is.
Partitions are extremely useful when you need to calculate the same metric over different segments.
Another common request is to calculate running total over some period of time. This is the sum of the current element and all previous elements, for example, this is how the running total monthly revenue would look in our dataset:
And the query to get this is below:
The new thing here is the rows between unbounded preceding and current row part that is called “frame clause.” It’s a way to tell you which subset of other rows of your result set you’re interested in, relative to the current row. The general definition of the frame clause is:
rows between frame_start and frame_end
where frame_start can be one of the following:
- unbounded preceding which is “starting from the first row or ending on the last row of the window”
- N preceding is N rows you’re interested in
- current row
and frame_end can be:
- unbounded following which is “starting from the first row or ending on the last row of the window”
- N following is N rows you’re interested in
- current row
It’s very flexible, except that you have to make sure the first part of between is higher than the second part, i.e. between 7 preceding and current row is totally fine, between 7 preceding and 3 preceding is fine too, but between 3 preceding and 7 preceding would throw an error. You can sum, count, and average values within the selected window. You can see a few examples in the query below:
Every combination makes sense for particular use cases:
- amount_total is specified without a window and returns the entire total of $3400. Use it when you want to get the total in the same query as individual rows without grouping and joining back the grouped result.
- running_sum is the running total. Use it when you want to see how some variable such as revenue or website visits is accumulated over a period of time.
- running_sum_by_customer is the same as above but broken down by segment (you can see how revenue from each individual customer grows, and on bigger data it can be cities or states).
- trailing_avg shows the average amount of the last 5 orders. Use trailing average when you want to learn the trend and disguise volatility.
The ordering is critical here since the database needs to know how to sum the values. The result can be completely different when different ordering is applied. The picture below shows the result of the query:
The arrow next to running_sum tells us how the total amount is accumulated over time. The colored arrows next to running_sum_by_customer interconnect orders done by the same customer and the values show the total order amount of the given customer at the point of every order. Finally, the grey brackets next to trailing_avg reference the moving window of the last 5 rows.
Dealing with duplicate data
If you paid attention to the dataset you would probably notice that both orders for customer D on 02–05 have the same order_id=5 which doesn’t look right. How did this happen? It turns out that the original order was $250, then the customer decided to spend $50 more. The new record was inserted but the old record was not deleted. Such things happen in one or another table.
A long term solution in this case is to rectify the existing data flows and increase data awareness among developers, and what you can do right away is roll your sleeves up and clean the mess on your side:
That would get you all order records except one that you’d like to filter out. Let’s review the function that allows us to do so: row_number() returns the incremental counter of the row within the given window: 1,2,3, etc. It doesn’t take any parameters, that’s why it ends with empty brackets. The given window here is a set of rows that share a common order_id (partition by … is what separates these sets from each other), sorted by datetime descending, so the intermediary result of the subquery in the middle looks like this:
Every partition here is represented by a single row, except order_id=5. Inside the partition, you sorted the rows by datetime descending, so the latest row gets 1 and the earlier row gets 2. Then you filter only rows that have 1 to get rid of duplicates.
This is very useful for all sorts of duplicate problems. You might see that duplicates inflate the revenue number, so to calculate correct metrics we have to clean them out by combining the duplicate filtering query with the revenue growth query like this:
Now the data is clean and the revenue and growth metrics are correct. To avoid using orders_cleaned step in every query, you can create a view and use it as a table reference in other queries:
Besides filtering duplicates, window partitions are very useful when you need to identify top N rows in every group.
Top N rows in every group
Finding top rows in every group is a typical task for a data analyst. Finding out who your best customers are and reaching out to them is a good way to know what people especially like about your company and make it a standard. It is equally useful for employees, as leaderboards can be very good motivation for any team in the company. Let’s see how it works. Considering our dataset is tiny, let’s get the top 2 orders for every month:
The intermediary result of the orders_ranked statement would look like this (rows that appear in the final result are highlighted):
You can use any expression to separate partitions in window specification, not only column name (here we separated them by month, every partition is highlighted by its own color).
There are orders with the same month and amount, like order_id=1 and order_id=3, so we decided to resolve this by picking up the earliest order, adding datetime column to the sorting. If you’re interested in pulling both rows in case of conflict you can use rank function instead of row_number.
Repeat purchase behavior
Repeat purchase behavior is the key for a successful business, and investors totally love companies that can retain customers and make them spend more and more. We can translate this to a more precise data question: what is the repeat purchase rate and the typical difference between the first order and the second order amount? That would be expressed as:
The result of the intermediary customer_orders statement would look like this:
Every customer’s partition is highlighted by its own color. The brackets next to customer_id outline our partitions, the arrows next to datetime show sorting direction, the arrows that point to amount values show where prev_order_amount values come from, and customer_order_n values for repeat purchases are underlined. Since customers D, E, and F have only one order they are out of the analysis.
The final result looks like this:
This time we partitioned by customer_id to isolate sets of order rows that belong to the same customer, and identified the row order and the previous row’s amount value. In the final query, we have used conditional aggregates to calculate desired metrics which tell us that half of customers buy again and they spend almost twice as much on the second order. Imagine it’s not a dummy dataset, what a great business would that be!
As you have seen in our tutorial, SQL window functions are a powerful concept that allows advanced calculations. A typical window function consists of expression and window specification with optional partitioning and frame clause, and the opportunities to slice, smooth, interconnect, and deduplicate your data are truly boundless. There are plenty of interesting use cases where these functions give a great value. We’ll continue talking about them in the next few tutorials. Stay tuned!