Implementing Data Snapshots
For a dataset that contains a sequence of changes to a property over time, we
want to be able to get the most recent state of said property at any given date.
In this recipe, we'll learn how to calculate snapshots of statuses at any given
date for a cube with Product Id
, Status
, and Changed At
dimensions.
We can consider the status property to be a slowly changing dimension (SCD) of type 2. Modeling data schemas with slowly changing dimensions is an essential part of the data engineering skillset.
Let's explore the Statuses
cube that contains data like this:
[
{
"Statuses.orderId": 1,
"Statuses.status": "shipped",
"Statuses.changedAt": "2019-01-19T00:00:00.000",
},
{
"Statuses.orderId": 1,
"Statuses.status": "processing",
"Statuses.changedAt": "2019-03-14T00:00:00.000",
},
{
"Statuses.orderId": 1,
"Statuses.status": "completed",
"Statuses.changedAt": "2019-01-25T00:00:00.000",
},
{
"Statuses.orderId": 2,
"Statuses.status": "processing",
"Statuses.changedAt": "2019-08-21T00:00:00.000",
},
{
"Statuses.orderId": 2,
"Statuses.status": "completed",
"Statuses.changedAt": "2019-04-13T00:00:00.000",
},
{
"Statuses.orderId": 2,
"Statuses.status": "shipped",
"Statuses.changedAt": "2019-03-18T00:00:00.000",
},
];
We can see that statuses change occasionally. How do we count orders that
remained in the shipped
status at a particular date?
First, we need to generate a range with all dates of interest, from the earliest to the latest. Second, we need to join the dates with the statuses and leave only the most recent statuses to date.
cube(`StatusSnapshots`, {
extends: Statuses,
sql: `
-- Create a range from the earlist date to the latest date
WITH range AS (
SELECT date
FROM GENERATE_SERIES(
(SELECT MIN(changed_at) FROM ${Statuses.sql()} AS statuses),
(SELECT MAX(changed_at) FROM ${Statuses.sql()} AS statuses),
INTERVAL '1 DAY'
) AS date
)
-- Calculate snapshots for every date in the range
SELECT range.date, statuses.*
FROM range
LEFT JOIN ${Statuses.sql()} AS statuses
ON range.date >= statuses.changed_at
AND statuses.changed_at = (
SELECT MAX(changed_at)
FROM ${Statuses.sql()} AS sub_statuses
WHERE sub_statuses.order_id = statuses.order_id
)
`,
dimensions: {
date: {
sql: `date`,
type: `time`,
},
},
});
To generate a range of dates, here we use the
GENERATE_SERIES
function
which is Postgres-specific. Other databases have similar functions, e.g.,
GENERATE_DATE_ARRAY
in BigQuery.
Please note that it makes sense to make the StatusSnapshots
cube
extend the
original Statuses
cube in order to reuse the dimension definitions. We only
need to add a new dimension that indicates the date
of a snapshot. We're also
referencing the definition of the Statuses
cube with the
sql()
property.
To count orders that remained in the shipped
status at a particular date, we
will send a query that selects a snapshot by this date and also filters by the
status:
{
"measures": ["StatusSnapshots.count"],
"filters": [
{
"member": "StatusSnapshots.date",
"operator": "equals",
"values": ["2019-04-01"]
},
{
"member": "StatusSnapshots.status",
"operator": "equals",
"values": ["shipped"]
}
]
}
If we execute a couple of such queries for distinct dates, we'll spot the change:
// Shipped as of April 1, 2019:
[
{
"StatusSnapshots.count": 16,
},
];
// Shipped as of May 1, 2019:
[
{
"StatusSnapshots.count": 25,
},
];
Please feel free to check out the
full source code
or run it with the docker-compose up
command. You'll see the result, including
queried data, in the console.
Did you find this page useful?