Using pre-aggregations
Pre-aggregations are a powerful way to speed up your Cube queries. There are many configuration options to consider. Please make sure to also check the Pre-Aggregations reference in the data modeling section.
Refresh Strategy
The refresh strategy can be customized by setting the `refresh_key` property for the pre-aggregation.

The default value of `refresh_key` is `every: 1 hour`, if neither of the cubes overrides its `refresh_key` parameter. It can be redefined either by overriding the default value of the `every` property:
```yaml
cubes:
  - name: orders
    # ...

    pre_aggregations:
      - name: amount_by_created
        type: rollup
        measures:
          - amount
        time_dimension: created_at
        granularity: month
        refresh_key:
          every: 12 hour
```
Or by providing a `sql` property instead, and leaving `every` unchanged from its default value:
```yaml
cubes:
  - name: orders
    # ...

    pre_aggregations:
      - name: amount_by_created
        measures:
          - amount
        time_dimension: created_at
        granularity: month
        refresh_key:
          # every will default to `10 seconds` here
          sql: SELECT MAX(created_at) FROM orders
```
Or both `every` and `sql` can be defined together:
```yaml
cubes:
  - name: orders
    # ...

    pre_aggregations:
      - name: amount_by_created
        measures:
          - amount
        time_dimension: created_at
        granularity: month
        refresh_key:
          every: 12 hour
          sql: SELECT MAX(created_at) FROM orders
```
When `every` and `sql` are used together, Cube will run the query from the `sql` property on an interval defined by the `every` property. If the query returns new results, the pre-aggregation will be refreshed.
Rollup Only Mode
To make Cube only serve requests from pre-aggregations, the `CUBEJS_ROLLUP_ONLY` environment variable can be set to `true` on an API instance. This will prevent serving data on API requests from the source database.

When using this configuration in a single-node deployment (where the API instance and Refresh Worker are configured on the same host), requests made to the API that cannot be satisfied by a rollup will throw an error. Scheduled refreshes will continue to work in the background.
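For instance, in a Docker Compose deployment this could be set on the API instance as follows. This is a minimal sketch; the service name, image tag, and omitted variables are illustrative:

```yaml
# docker-compose.yml (illustrative sketch)
services:
  cube_api:
    image: cubejs/cube:latest
    environment:
      # Serve API requests from pre-aggregations in Cube Store only
      - CUBEJS_ROLLUP_ONLY=true
      # ... data source and Cube Store configuration goes here ...
    ports:
      - "4000:4000"
```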
Partitioning
Partitioning is an extremely effective optimization for accelerating pre-aggregation build and refresh times. It effectively "shards" the data between multiple tables, splitting them by a defined attribute. Cube can be configured to incrementally refresh only the last set of partitions through the `update_window` property. This leads to faster refresh times because unnecessary data is not reloaded, and can even reduce cost for some databases such as BigQuery or AWS Athena.
Any `rollup` pre-aggregation can be partitioned by time using the `partition_granularity` property in a pre-aggregation definition. In the example below, the `partition_granularity` is set to `month`, which means Cube will generate separate tables for each month's worth of data. Once built, it will continue to refresh the last 3 months of data on a daily basis.
```yaml
cubes:
  - name: orders
    # ...

    pre_aggregations:
      - name: category_and_date
        measures:
          - count
          - revenue
        dimensions:
          - category
        time_dimension: created_at
        granularity: day
        partition_granularity: month
        refresh_key:
          every: 1 day
          incremental: true
          update_window: 3 months
```
Using Indexes
When to use indexes?
Indexes are great when you filter large amounts of data across one or several dimension columns. You can read more about them here.
Best Practices
To maximize performance, you can introduce an index per type of query so that the set of dimensions used in the query overlaps as much as possible with the ones defined in the index. Measures are traditionally only used in indexes if you plan to filter on a measured value and the cardinality of the possible values of the measure is low.

The order in which columns are specified in the index is very important; suboptimal ordering can significantly degrade performance.
The rule of thumb for index column order is:
- Single-value filters come first.
- `GROUP BY` columns come second.
- Everything else used in the query comes afterward.
Example:
Suppose you have a pre-aggregation that has millions of rows with the following structure:
| timestamp | product_name | product_category | zip_code | order_total |
| --- | --- | --- | --- | --- |
| 2023-01-01 10:00:00 | Plastic Chair | Furniture | 88523 | 2000 |
| 2023-01-01 10:00:00 | Keyboard | Electronics | 88523 | 1000 |
| 2023-01-01 10:00:00 | Mouse | Electronics | 88523 | 800 |
| 2023-01-01 11:00:00 | Plastic Chair | Furniture | 88524 | 3000 |
| 2023-01-01 11:00:00 | Keyboard | Electronics | 88524 | 2000 |
The pre-aggregation code would look as follows:
```yaml
cubes:
  - name: orders
    # ...

    pre_aggregations:
      - name: main
        measures:
          - order_total
        dimensions:
          - product_name
          - product_category
          - zip_code
        time_dimension: timestamp
        granularity: hour
        partition_granularity: day
        allow_non_strict_date_range_match: true
        refresh_key:
          every: 1 hour
          incremental: true
          update_window: 1 day
        build_range_start:
          sql: SELECT DATE_SUB(NOW(), 365)
        build_range_end:
          sql: SELECT NOW()
```
You run the following query on a regular basis, with the only difference between queries being the filter values:
```json
{
  "measures": ["orders.order_total"],
  "timeDimensions": [
    {
      "dimension": "orders.timestamp",
      "granularity": "hour",
      "dateRange": ["2022-12-14T06:00:00.000", "2023-01-13T06:00:00.000"]
    }
  ],
  "order": {
    "orders.timestamp": "asc"
  },
  "filters": [
    {
      "member": "orders.product_category",
      "operator": "equals",
      "values": ["Electronics"]
    },
    {
      "member": "orders.product_name",
      "operator": "equals",
      "values": ["Keyboard", "Mouse"]
    }
  ],
  "dimensions": ["orders.zip_code"],
  "limit": 10000
}
```
After running this on a dataset with millions of records, you find that it's taking a long time to run, so you decide to add an index to target this specific query. Taking into account the best practices mentioned previously, you should define an index as follows:
```yaml
cubes:
  - name: orders
    # ...

    pre_aggregations:
      - name: main
        # ...

        indexes:
          - name: category_productname_zipcode_index
            columns:
              - product_category
              - zip_code
              - product_name
```
Then the data within `category_productname_zipcode_index` would look like:
| product_category | product_name | zip_code | timestamp | order_total |
| --- | --- | --- | --- | --- |
| Furniture | Plastic Chair | 88523 | 2023-01-01 10:00:00 | 2000 |
| Electronics | Keyboard | 88523 | 2023-01-01 10:00:00 | 1000 |
| Electronics | Mouse | 88523 | 2023-01-01 10:00:00 | 800 |
| Furniture | Plastic Chair | 88524 | 2023-01-01 11:00:00 | 3000 |
| Electronics | Keyboard | 88524 | 2023-01-01 11:00:00 | 2000 |
The `product_category` column comes first as it's a single-value filter. Then `zip_code`, as it's a `GROUP BY` column. `product_name` comes last as it's a multiple-value filter.
It might sound counter-intuitive to have `GROUP BY` columns before filter ones; however, Cube Store always performs scans on sorted data. If the `GROUP BY` matches the index ordering, merge sort-based algorithms are used for querying, which are usually much faster than the hash-based `GROUP BY` used when the index ordering doesn't match the query. If in doubt, use `EXPLAIN` and `EXPLAIN ANALYZE` in Cube Store to figure out the final query plan.
Aggregated indexes
Aggregated indexes can be defined as well. You can read more about them here.
Example:
```yaml
cubes:
  - name: orders
    # ...

    pre_aggregations:
      - name: main
        # ...

        indexes:
          # ...

          - name: zip_code_index
            columns:
              - zip_code
            type: aggregate
```
And the data for `zip_code_index` would look like the following:
| zip_code | order_total |
| --- | --- |
| 88523 | 3800 |
| 88524 | 5000 |
Inspecting Pre-Aggregations
Cube Store partially supports the MySQL protocol. This allows you to execute simple queries using a familiar SQL syntax. You can connect using the MySQL CLI client, for example:
```bash
mysql -h <CUBESTORE_IP> --user=cubestore -pcubestore
```
Only Linux and macOS versions of the MySQL client are supported at this time. You can install one on Ubuntu with the `apt-get install default-mysql-client` command, or with `brew install mysql-client` on macOS. Windows versions of the MySQL client aren't supported.
To check which pre-aggregations are managed by Cube Store, you could run the following query:
```sql
SELECT * FROM information_schema.tables;
```

```
+----------------------+-----------------------------------------------+
| table_schema         | table_name                                    |
+----------------------+-----------------------------------------------+
| dev_pre_aggregations | orders_main20190101_23jnqarg_uiyfxd0f_1gifflf |
| dev_pre_aggregations | orders_main20190301_24ph0a1c_utzntnv_1gifflf  |
| dev_pre_aggregations | orders_main20190201_zhrh5kj1_rkmsrffi_1gifflf |
| dev_pre_aggregations | orders_main20191001_mdw2hxku_waxajvwc_1gifflf |
| dev_pre_aggregations | orders_main20190701_izc2tl0h_bxsf1zlb_1gifflf |
+----------------------+-----------------------------------------------+
5 rows in set (0.01 sec)
```
These pre-aggregations are stored as Parquet files under the `.cubestore/` folder in the project root during development.
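For example, during local development you can list these files directly on disk (the file names are generated by Cube Store):

```bash
# Inspect the Parquet files backing pre-aggregations in a dev environment
ls .cubestore/data/*.parquet
```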
EXPLAIN queries
Cube Store's MySQL protocol also supports `EXPLAIN` and `EXPLAIN ANALYZE` queries, both of which are useful for determining how much processing a query will require.
`EXPLAIN` queries show the logical plan for a query:
```sql
EXPLAIN SELECT orders__platform, orders__gender, sum(orders__count)
FROM dev_pre_aggregations.orders_general_o32v4dvq_vbyemtl2_1h5hs8r
GROUP BY orders__gender, orders__platform;
```

```
+--------------------------------------------------------------------------------------------------------------------------------------+
| logical plan                                                                                                                           |
+--------------------------------------------------------------------------------------------------------------------------------------+
| Projection, [dev_pre_aggregations.orders_general_o32v4dvq_vbyemtl2_1h5hs8r.orders__platform, dev_pre_aggregations.orders_general_o32v4dvq_vbyemtl2_1h5hs8r.orders__gender, SUM(dev_pre_aggregations.orders_general_o32v4dvq_vbyemtl2_1h5hs8r.orders__count)]
    Aggregate
      ClusterSend, indices: [[96]]
        Scan dev_pre_aggregations.orders_general_o32v4dvq_vbyemtl2_1h5hs8r, source: CubeTable(index: orders_general_plat_gender_o32v4dvq_vbyemtl2_1h5hs8r:96:[123, 126]), fields: [orders__gender, orders__platform, orders__count] |
+--------------------------------------------------------------------------------------------------------------------------------------+
```
`EXPLAIN ANALYZE` queries show the physical plan for the router and all workers used for query processing:
```sql
EXPLAIN ANALYZE SELECT orders__platform, orders__gender, sum(orders__count)
FROM dev_pre_aggregations.orders_general_o32v4dvq_vbyemtl2_1h5hs8r
GROUP BY orders__gender, orders__platform;
```

```
+-----------+-----------------+--------------------------------------------------------------------------------------------------------------------------+
| node type | node name       | physical plan                                                                                                            |
+-----------+-----------------+--------------------------------------------------------------------------------------------------------------------------+
| router    |                 | Projection, [orders__platform, orders__gender, SUM(dev_pre_aggregations.orders_general_o32v4dvq_vbyemtl2_1h5hs8r.orders__count)@2:SUM(orders__count)]
    FinalInplaceAggregate
      ClusterSend, partitions: [[123, 126]]                                                                                               |
| worker    | 127.0.0.1:10001 | PartialInplaceAggregate
    Merge
      Scan, index: orders_general_plat_gender_o32v4dvq_vbyemtl2_1h5hs8r:96:[123, 126], fields: [orders__gender, orders__platform, orders__count]
        Projection, [orders__gender, orders__platform, orders__count]
          ParquetScan, files: /.cubestore/data/126-0qtyakym.parquet                                                                       |
+-----------+-----------------+--------------------------------------------------------------------------------------------------------------------------+
```
Pre-Aggregations Storage
The default pre-aggregations storage in Cube is its own purpose-built storage layer: Cube Store.
Alternatively, you can store pre-aggregations internally in the source database. To store a pre-aggregation internally, set `external: false` in the pre-aggregation definition.
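A minimal sketch of an internally stored pre-aggregation (the cube, measure, and dimension names here are illustrative):

```yaml
cubes:
  - name: orders
    # ...

    pre_aggregations:
      - name: orders_by_status
        # Store this pre-aggregation in the source database, not Cube Store
        external: false
        measures:
          - count
        dimensions:
          - status
```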
Please note that `original_sql` pre-aggregations are stored internally by default. It is not recommended to store `original_sql` pre-aggregations in Cube Store.
Joins between pre-aggregations
When making a query that joins data from two different cubes, Cube can use pre-aggregations instead of running the base SQL queries. To get started, first ensure both cubes have valid pre-aggregations:
```yaml
cubes:
  - name: orders
    # ...

    pre_aggregations:
      - name: orders_rollup
        measures:
          - CUBE.count
        dimensions:
          - CUBE.user_id
          - CUBE.status
        time_dimension: CUBE.created_at
        granularity: day

    joins:
      - name: users
        sql: "${CUBE.user_id} = ${users.id}"
        relationship: many_to_one

  - name: users
    # ...

    pre_aggregations:
      - name: users_rollup
        dimensions:
          - CUBE.id
          - CUBE.name
```
Before we continue, let's add an index to the `orders_rollup` pre-aggregation so that the `rollup_join` pre-aggregation can work correctly:
```yaml
cubes:
  - name: orders
    # ...

    pre_aggregations:
      - name: orders_rollup
        # ...

        indexes:
          - name: user_index
            columns:
              - CUBE.user_id
```
Now we can add a new pre-aggregation of type `rollup_join` to the `orders` cube:
```yaml
cubes:
  - name: orders
    # ...

    pre_aggregations:
      # ...

      - name: orders_with_users_rollup
        type: rollup_join
        measures:
          - CUBE.count
        dimensions:
          - users.name
        time_dimension: CUBE.created_at
        granularity: day
        rollups:
          - users.users_rollup
          - CUBE.orders_rollup
```
With all of the above set up, making a query such as the following will now use `orders.orders_rollup` and `users.users_rollup`, avoiding a database request:
```json
{
  "dimensions": ["users.name"],
  "timeDimensions": [
    {
      "dimension": "orders.created_at",
      "dateRange": "This month"
    }
  ],
  "order": {
    "orders.count": "desc"
  },
  "measures": ["orders.count"]
}
```
Pre-Aggregation Build Strategies
For ideal performance, pre-aggregations should be built using a dedicated Refresh Worker. See here for more details.
Cube supports three different strategies for building pre-aggregations. To see which strategies your database supports, please refer to its individual page from Connecting to the Database.
Simple
When using the simple strategy, Cube will use the source database as a temporary staging area for writing pre-aggregations to determine column types. The data is loaded back into memory before being written to Cube Store (or an external database).
For larger datasets, we strongly recommend using the Batching or Export Bucket strategies instead.
Batching
Batching is a more performant strategy where Cube sends compressed CSVs for Cube Store to ingest.
Performance scales with the amount of memory available on the Cube instance. Batching is automatically enabled for any databases that can support it.
Export bucket
The export bucket strategy requires permission to execute `CREATE TABLE` statements in the data source as part of the pre-aggregation build process.
Do not confuse the export bucket with the Cube Store storage bucket. They are two separate storage locations and should never be mixed.
When dealing with larger pre-aggregations (more than 100k rows), performance can be significantly improved by using an export bucket. This allows the source database to temporarily materialize the data locally, which is then loaded into Cube Store in parallel:
Enabling the export bucket functionality requires extra configuration; please refer to the database-specific documentation for more details.
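As an illustration, export bucket configuration is generally provided through environment variables on the Cube instance. The exact set of variables depends on the data source and bucket type; the values below are placeholders:

```bash
# Illustrative sketch: using an S3 bucket as the export bucket
CUBEJS_DB_EXPORT_BUCKET_TYPE=s3
CUBEJS_DB_EXPORT_BUCKET=my-export-bucket
CUBEJS_DB_EXPORT_BUCKET_AWS_KEY=<AWS_KEY>
CUBEJS_DB_EXPORT_BUCKET_AWS_SECRET=<AWS_SECRET>
CUBEJS_DB_EXPORT_BUCKET_AWS_REGION=us-east-1
```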
When using cloud storage, it is important to correctly configure any data retention policies to clean up the data in the export bucket, as Cube does not currently manage this. For most use cases, 1 day is sufficient.
Streaming pre-aggregations
Streaming pre-aggregations are different from traditional pre-aggregations in the way they are being updated. Traditional pre-aggregations follow the “pull” model — Cube pulls updates from the data source based on some cadence and/or condition. Streaming pre-aggregations follow the “push” model — Cube subscribes to the updates from the data source and always keeps pre-aggregation up to date.
You don’t need to define a `refresh_key` for streaming pre-aggregations. Whether a pre-aggregation is streaming or not is defined by the data source.
Currently, Cube supports only one streaming data source: ksqlDB. All pre-aggregations whose data source is ksqlDB are streaming.
We are working on supporting more data sources for streaming pre-aggregations; please let us know if you are interested in early access to any of these drivers or would like Cube to support any other SQL streaming engine.
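For illustration, a streaming pre-aggregation is defined like any other rollup, just without a `refresh_key`. This minimal sketch assumes a ksqlDB data source named `ksql` and a stream exposing the fields used below; all names are illustrative:

```yaml
cubes:
  - name: page_views
    # Assumed ksqlDB data source name; pre-aggregations on it are streaming
    data_source: ksql
    sql: "SELECT * FROM PAGE_VIEWS_STREAM"

    pre_aggregations:
      # No refresh_key: updates are pushed from the data source
      - name: views_by_city
        measures:
          - count
        dimensions:
          - city
        time_dimension: timestamp
        granularity: hour

    measures:
      - name: count
        type: count

    dimensions:
      - name: city
        sql: city
        type: string

      - name: timestamp
        sql: timestamp
        type: time
```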