Refreshing Select Partitions

We have a dataset with orders and we want to aggregate data while having decent performance. Orders have a creation time, so we can use partitioning by time to optimize pre-aggregations build and refresh time. The problem is that the order's status can change after a long period. In this case, we want to rebuild only partitions associated with this order.

In the recipe below, we'll learn how to use the refreshKey together with the FITER_PARAMS for partition separately.

Let's explore the Orders cube data that contains various information about orders, including number and status:

idnumberstatuscreated_atupdated_at
11processing2021-08-10 14:26:402021-08-10 14:26:40
22completed2021-08-20 13:21:382021-08-22 13:10:38
33shipped2021-09-01 10:27:382021-09-02 01:12:38
44completed2021-09-20 10:27:382021-09-20 10:27:38

In our case, each order has created_at and updated_at properties. The updated_at property is the last order update timestamp. To create a pre-aggregation with partitions, we need to specify the partitionGranularity property. Partitions will be split monthly by the created_at dimension.

preAggregations: {
    orders: {
      type: `rollup`,
      external: true,
      dimensions: [CUBE.number, CUBE.status, CUBE.createdAt, CUBE.updatedAt],
      timeDimension: CUBE.createdAt,
      granularity: `day`,
      partitionGranularity: `month`, // this is where we specify the partition
      refreshKey: {
        sql: `SELECT max(updated_at) FROM public.orders` // check for updates of the updated_at property
      },
    },
  },

As you can see, we defined custom a refreshKey that will check for new values of the updated_at property. The refresh key is evaluated for each partition separately. For example, if we update orders from august and update their updated_at property, the current refresh key will update for all partitions. There is how it looks in the Cube logs:

Executing SQL: 5b4c517f-b496-4c69-9503-f8cd2b4c73b6
--
  SELECT max(updated_at) FROM public.orders
--
Performing query completed: 5b4c517f-b496-4c69-9503-f8cd2b4c73b6 (15ms)
Performing query: 5b4c517f-b496-4c69-9503-f8cd2b4c73b6
Performing query: 5b4c517f-b496-4c69-9503-f8cd2b4c73b6
Executing SQL: 5b4c517f-b496-4c69-9503-f8cd2b4c73b6
--
  select min(("orders".created_at::timestamptz AT TIME ZONE 'UTC')) from public.orders AS "orders"
--
Executing SQL: 5b4c517f-b496-4c69-9503-f8cd2b4c73b6
--
  select max(("orders".created_at::timestamptz AT TIME ZONE 'UTC')) from public.orders AS "orders"
--

Note that the query for two partitions is the same. It's the reason why all partitions will be updated.

How do we fix this and update only the partition for august? We can use the FITER_PARAMS for that!

Let's update our pre-aggregation definition:

preAggregations: {
    orders: {
      type: `rollup`,
      external: true,
      dimensions: [CUBE.number, CUBE.status, CUBE.createdAt, CUBE.updatedAt],
      timeDimension: CUBE.createdAt,
      granularity: `day`,
      partitionGranularity: `month`,
      refreshKey: {
        sql: `SELECT max(updated_at) FROM public.orders WHERE ${FILTER_PARAMS.Orders.createdAt.filter('created_at')}`
      },
    },
  },

Cube will filter data by the created_at property and then apply the refresh key for the updated_at property. Here's how it looks in the Cube logs:

Executing SQL: e1155b2f-859b-4e61-a760-17af891f5f0b
--
  select min(("updated_orders".created_at::timestamptz AT TIME ZONE 'UTC')) from public.orders AS "updated_orders"
--
Executing SQL: e1155b2f-859b-4e61-a760-17af891f5f0b
--
  select max(("updated_orders".created_at::timestamptz AT TIME ZONE 'UTC')) from public.orders AS "updated_orders"
--
Performing query completed: e1155b2f-859b-4e61-a760-17af891f5f0b (10ms)
Performing query completed: e1155b2f-859b-4e61-a760-17af891f5f0b (13ms)
Performing query: e1155b2f-859b-4e61-a760-17af891f5f0b
Performing query: e1155b2f-859b-4e61-a760-17af891f5f0b
Executing SQL: e1155b2f-859b-4e61-a760-17af891f5f0b
--
  SELECT max(updated_at) FROM public.orders WHERE created_at >= '2021-08-01T00:00:00.000Z'::timestamptz AND created_at <= '2021-08-31T23:59:59.999Z'::timestamptz
--
Executing SQL: e1155b2f-859b-4e61-a760-17af891f5f0b
--
  SELECT max(updated_at) FROM public.orders WHERE created_at >= '2021-09-01T00:00:00.000Z'::timestamptz AND created_at <= '2021-09-30T23:59:59.999Z'::timestamptz

Note that Cube checks the refresh key value using a date range over the created_at property. With this refresh key, only one partition will be updated.

We have received orders from two partitions of a pre-aggregation and only one of them has been updated when an order changed its status:

// Orders before update:
[
  {
    "Orders.number": "1",
    "Orders.status": "processing",
    "Orders.createdAt": "2021-08-10T14:26:40.000",
    "Orders.updatedAt": "2021-08-10T14:26:40.000"
  },
  {
    "Orders.number": "2",
    "Orders.status": "completed",
    "Orders.createdAt": "2021-08-20T13:21:38.000",
    "Orders.updatedAt": "2021-08-20T13:21:38.000"
  },
  {
    "Orders.number": "3",
    "Orders.status": "shipped",
    "Orders.createdAt": "2021-09-01T10:27:38.000",
    "Orders.updatedAt": "2021-09-01T10:27:38.000"
  },
  {
    "Orders.number": "4",
    "Orders.status": "completed",
    "Orders.createdAt": "2021-09-20T10:27:38.000",
    "Orders.updatedAt": "2021-09-20T10:27:38.000"
  }
]
// Pre-aggregations for orders before update:
{
  "dev_pre_aggregations.orders__orders": {
    "targetTableName": "(
      SELECT * FROM dev_pre_aggregations.orders__orders20210801_qgajzwit_mdtjpixm_1glan84 UNION ALL 
      SELECT * FROM dev_pre_aggregations.orders__orders20210901_bvzl43q1_py2oudte_1glan84)",
    "refreshKeyValues": [
      {},
      {}
    ]
  }
}
// Orders after update:
[
  {
    "Orders.number": "1",
    "Orders.status": "shipped",
    "Orders.createdAt": "2021-08-10T14:26:40.000",
    "Orders.updatedAt": "2021-09-30T06:45:28.000"
  },
  {
    "Orders.number": "2",
    "Orders.status": "completed",
    "Orders.createdAt": "2021-08-20T13:21:38.000",
    "Orders.updatedAt": "2021-08-20T13:21:38.000"
  },
  {
    "Orders.number": "3",
    "Orders.status": "shipped",
    "Orders.createdAt": "2021-09-01T10:27:38.000",
    "Orders.updatedAt": "2021-09-01T10:27:38.000"
  },
  {
    "Orders.number": "4",
    "Orders.status": "completed",
    "Orders.createdAt": "2021-09-20T10:27:38.000",
    "Orders.updatedAt": "2021-09-20T10:27:38.000"
  }
]
// Pre-aggregations for orders after update:
{
  "dev_pre_aggregations.orders__orders": {
    "targetTableName": "(
      SELECT * FROM dev_pre_aggregations.orders__orders20210801_lx4b2bkg_mdtjpixm_1glana3 UNION ALL 
      SELECT * FROM dev_pre_aggregations.orders__orders20210901_bvzl43q1_py2oudte_1glan84)",
    "refreshKeyValues": [
      {},
      {}
    ]
  }
}

Please feel free to check out the full source code or run it with the docker-compose up command. You'll see the result, including queried data, in the console.

Did you find this page useful?