Pre-aggregations
Pre-aggregations are materialized query results persisted as tables. Cube.js has an ability to analyze queries against a defined set of pre-aggregation rules in order to choose the optimal one that will be used to create pre-aggregation table.
If Cube.js finds a suitable pre-aggregation rule, database querying becomes a multi-stage process:
Cube.js checks if an up-to-date copy of the pre-aggregation exists.
Cube.js will execute a query against the pre-aggregated tables instead of the raw data.
Pre-aggregations can be defined in the preAggregations
available on each cube.
A more in-depth diagram detailing how pre-aggregations are selected can be
found here.
Pre-aggregations must have, at minimum, a name and a type. This name, along with the name of the cube will be used as a prefix for pre-aggregation tables created in the database.
Pre-aggregation names should:
- Be unique within a cube
- Start with a lowercase letter
- Consist of numbers, letters and
_
cube(`Orders`, {
sql: `SELECT * FROM orders`,
...,
preAggregations: {
ordersByStatus: {
dimensions: [CUBE.status],
measures: [CUBE.count],
},
},
});
Pre-aggregations must include all dimensions, measures, and filters you will query with.
Cube supports three types of pre-aggregations:
The default type is rollup
.
rollup
Rollup pre-aggregations are the most effective way to boost performance of any analytical application. The blazing fast performance of tools like Google Analytics or Mixpanel are backed by a similar concept. The theory behind it lies in multi-dimensional analysis, and a rollup pre-aggregation is the result of a roll-up operation on an OLAP cube. A rollup pre-aggregation is essentially the summarized data of the original cube grouped by any selected dimensions of interest.
The most performant kind of rollup pre-aggregation is an additive rollup:
all measures of which are based on decomposable aggregate
functions. Additive measure types are: count
, sum
,
min
, max
or countDistinctApprox
. The performance boost in this case is
based on two main properties of additive rollup pre-aggregations:
A rollup pre-aggregation table usually contains many fewer rows than its' corresponding original fact table. The fewer dimensions that are selected for roll-up means fewer rows in the materialized result. A smaller number of rows therefore means less time to query rollup pre-aggregation tables.
If your query is a subset of dimensions and measures of an additive rollup, then it can be used to calculate a query without accessing the raw data. The more dimensions and measures are selected for roll-up, the more queries can use this particular rollup.
Rollup definitions can contain members from a single cube as well as from multiple cubes. In case of multiple cubes being involved, the join query will be built according to the standard rules of cubes joining.
Rollups are selected for querying based on properties found in queries made to the Cube.js REST API. A thorough explanation can be found under Getting Started with Pre-Aggregations.
cube(`Orders`, {
sql: `SELECT * FROM orders`,
...,
preAggregations: {
ordersByCompany: {
measures: [CUBE.count],
dimensions: [CUBE.status],
},
},
});
originalSql
As the name suggests, it persists the results of the sql
property of the cube.
Pre-aggregations of type originalSql
should only be used when the cube's
sql
is a complex query (i.e. nested, window functions and/or multiple joins).
We strongly recommend only persisting results of originalSql
back to the
source database. They often do not provide much in the way of performance
directly, but there are two specific applications:
They can be used in tandem with the
useOriginalSqlPreAggregations
option in other rollup pre-aggregations.Situations where it is not possible to use a
rollup
pre-aggregations, such as funnels.
For example, to pre-aggregate all completed orders, you could do the following:
cube(`CompletedOrders`, {
sql: `SELECT * FROM orders WHERE completed = true`,
...,
preAggregations: {
main: {
type: `originalSql`
},
},
});
rollupJoin
rollupJoin
is currently in Preview, and the API may change in a future
version.
Cube is capable of performing joins between pre-aggregations from different data sources to avoid making excessive queries to them.
Currently pre-aggregations of type rollupJoin
do not support background
refreshing. Setting scheduledRefresh
to
true
will result in an error.
In the following example, we have a Users
cube with a usersRollup
pre-aggregation, and an Orders
cube with an ordersRollup
pre-aggregation,
and an ordersWithUsersRollup
pre-aggregation. Note the following:
- Both cubes have different values for
dataSource
. - The type of
ordersWithUsersRollup
isrollupJoin
. - This pre-aggregation has a special property
rollups
which is an array containing references to both "source" rollups.
cube(`Users`, {
dataSource: 'postgres',
sql: `SELECT * FROM public.users`,
preAggregations: {
usersRollup: {
dimensions: [CUBE.id, CUBE.name],
},
},
measures: {
count: {
type: `count`,
},
},
dimensions: {
id: {
sql: `id`,
type: `number`,
// We need to set this field as the primary key for joins to work
primaryKey: true,
},
name: {
sql: `first_name || last_name`,
type: `string`,
},
},
});
cube('Orders', {
dataSource: 'mssql',
sql: `SELECT * FROM orders`,
preAggregations: {
ordersRollup: {
measures: [CUBE.count],
dimensions: [CUBE.userId, CUBE.status],
timeDimension: CUBE.createdAt,
granularity: `day`,
},
// Here we add a new pre-aggregation of type `rollupJoin`
ordersWithUsersRollup: {
type: `rollupJoin`,
measures: [CUBE.count],
dimensions: [Users.name],
rollups: [Users.usersRollup, CUBE.ordersRollup],
},
},
joins: {
Users: {
relationship: `belongsTo`,
// Make sure the join uses dimensions on the cube, rather than
// the column names from the underlying SQL
sql: `${CUBE.userId} = ${Users.id}`,
},
},
measures: {
count: {
type: `count`,
},
},
dimensions: {
id: {
sql: `id`,
type: `number`,
primaryKey: true,
},
userId: {
sql: `user_id`,
type: `number`,
},
status: {
sql: `status`,
type: `string`,
},
createdAt: {
sql: `created_at`,
type: `time`,
},
},
});
rollupJoin
is not required to join cubes from the same data source; instead
include the foreign cube's dimensions/measures in the rollup definition
directly:
cube('Orders', {
sql: `SELECT * FROM orders`,
preAggregations: {
ordersRollup: {
measures: [CUBE.count],
dimensions: [Users.name, CUBE.status],
timeDimension: CUBE.createdAt,
granularity: `day`,
},
},
});
The measures
property is an array of measures from the
cube that should be included in the pre-aggregation:
cube('Orders', {
sql: `SELECT * FROM orders`,
measures: {
count: {
type: `count`,
},
},
preAggregations: {
usersRollup: {
measures: [CUBE.count],
},
},
});
The dimensions
property is an array of dimensions from the
cube that should be included in the pre-aggregation:
cube('Orders', {
sql: `SELECT * FROM orders`,
dimensions: {
status: {
type: `string`,
sql: `status`,
},
},
preAggregations: {
usersRollup: {
dimensions: [CUBE.status],
},
},
});
The timeDimension
property can be any dimension
of
type time
. All other measures and dimensions in
the schema are aggregated This property is an extremely useful tool for
improving performance with massive datasets.
cube('Orders', {
sql: `SELECT * FROM orders`,
measures: {
count: {
type: `count`,
},
},
dimensions: {
status: {
type: `string`,
sql: `status`,
},
createdAt: {
type: `time`,
sql: `created_at`,
},
},
preAggregations: {
ordersByStatus: {
measures: [CUBE.count],
dimensions: [CUBE.status],
timeDimension: CUBE.createdAt,
granularity: `day`,
},
},
});
A granularity
must also be included in the
pre-aggregation definition.
The granularity
property defines the granularity of data within the
pre-aggregation. If set to week
, for example, then Cube.js will pre-aggregate
the data by week and persist it to Cube Store.
cube('Orders', {
sql: `SELECT * FROM orders`,
...,
preAggregations: {
usersRollupByWeek: {
measures: [CUBE.count],
dimensions: [CUBE.status],
timeDimension: CUBE.createdAt,
granularity: `week`,
},
},
});
The value can be one of second
, minute
, hour
, day
, week
, month
,
quarter
or year
. This property is required when using
timeDimension
.
The segments
property is an array of segments from the
cube that can target the pre-aggregation:
cube(`Orders`, {
sql: `SELECT * FROM orders`,
measures: {
count: {
type: `count`,
},
},
segments: {
onlyComplete: {
sql: `${CUBE}.status = 'completed'`,
},
},
preAggregations: {
main: {
measures: [CUBE.count],
segments: [CUBE.onlyComplete],
},
},
});
The partitionGranularity
defines the granularity for each
partition of the pre-aggregation:
cube('Orders', {
sql: `SELECT * FROM orders`,
...,
preAggregations: {
usersRollup: {
measures: [CUBE.count],
dimensions: [CUBE.status],
timeDimension: CUBE.createdAt,
granularity: `day`,
partitionGranularity: `month`,
},
},
});
The value can be one of hour
, day
, week
, month
, quarter
, year
. A
timeDimension
and granularity
must also be included in the pre-aggregation definition. This property is
required when using partitioned pre-aggregations.
Cube.js can also take care of keeping pre-aggregations up to date with the
refreshKey
property. By default, it is set to every: '1 hour'
.
When using partitioned pre-aggregations, the refresh key is evaluated for each partition separately.
sql
You can set up a custom refresh check strategy by using the sql
property:
cube(`Orders`, {
sql: `SELECT * FROM orders`,
preAggregations: {
main: {
measures: [CUBE.count],
refreshKey: {
sql: `SELECT MAX(created_at) FROM orders`,
},
},
},
});
In the above example, the refresh key SQL will be executed every 10 seconds, as
every
is not defined. If the results of the SQL
refresh key differ from the last execution, then the pre-aggregation will be
refreshed.
every
The refreshKey
can define an every
property which can be used to refresh
pre-aggregations based on a time interval. By default, it is set to 1 hour
unless the sql
property is also defined, in which case
it is set to 10 seconds
. For example:
cube(`Orders`, {
sql: `SELECT * FROM orders`,
preAggregations: {
main: {
measures: [CUBE.count],
refreshKey: {
every: `1 day`,
},
},
},
});
For possible every
parameter values please refer to
refreshKey
documentation.
You can also use every
with sql
:
cube(`Orders`, {
sql: `SELECT * FROM orders`,
preAggregations: {
main: {
measures: [CUBE.count],
refreshKey: {
every: `1 hour`,
sql: `SELECT MAX(created_at) FROM orders`,
},
},
},
});
In the above example, the refresh key SQL will be executed every hour. If the results of the SQL refresh key differ from the last execution, then the pre-aggregation will be refreshed.
incremental
You can incrementally refresh partitioned rollups by setting
incremental: true
. This option defaults to false
.
Partition tables are refreshed as a whole. When a new partition table is available, it replaces the old one. Old partition tables are collected by a garbage collection mechanism. Append is never used to add new rows to the existing tables.
cube(`Orders`, {
sql: `SELECT * FROM orders`,
...,
preAggregations: {
main: {
measure: [CUBE.count],
timeDimension: CUBE.createdAt,
granularity: `day`,
partitionGranularity: `day`,
refreshKey: {
every: `1 day`,
incremental: true,
},
},
},
});
updateWindow
Incremental refreshes without a defined updateWindow
will only update the last
partition as determined by the pre-aggregation's partitionGranularity
.
The incremental: true
flag generates a special refreshKey
SQL query which
triggers a refresh for partitions where the end date lies within the
updateWindow
from the current time.
Because incremental refreshes generate their own SQL, you must not use the
sql
property here.
In the example below, it will refresh today's and the last 7 days of partitions
once a day. Partitions before the 7 day
interval will not be refreshed
once they are built unless the rollup SQL is changed.
cube(`Orders`, {
sql: `SELECT * FROM orders`,
...,
preAggregations: {
main: {
measure: [CUBE.count],
timeDimension: CUBE.createdAt,
granularity: `day`,
partitionGranularity: `day`,
refreshKey: {
every: `1 day`,
incremental: true,
updateWindow: `7 day`,
},
},
},
});
This property is required when using incremental
refreshes.
allowNonStrictDateRangeMatch
The allowNonStrictDateRangeMatch
property can be used to allow queries to
match this pre-aggregation when using non-strict date ranges. This is useful
when using a visualization tools such as Apache
Superset that use loose date ranges by default.
cube(`Orders`, {
sql: `SELECT * FROM orders`,
...,
preAggregations: {
main: {
measure: [CUBE.count],
timeDimension: CUBE.createdAt,
granularity: `day`,
partitionGranularity: `day`,
allowNonStrictDateRangeMatch: true,
},
},
});
Cube.js supports multi-stage pre-aggregations by reusing original SQL
pre-aggregations in rollups through the useOriginalSqlPreAggregations
property. It is helpful in cases where you want to re-use a heavy SQL query
calculation in multiple rollup
pre-aggregations. Without
useOriginalSqlPreAggregations
enabled, Cube.js will always re-execute all
underlying SQL calculations every time it builds new rollup tables.
originalSql
pre-aggregations must only be used when storing
pre-aggregations on the source database.
This also means that originalSql
pre-aggregations require readOnly: false
to
be set on their respective database driver.
cube(`Orders`, {
sql: `
SELECT * FROM orders1
UNION ALL
SELECT * FROM orders2
UNION ALL
SELECT * FROM orders3
`,
...,
preAggregations: {
main: {
type: `originalSql`,
},
statuses: {
measures: [CUBE.count],
dimensions: [CUBE.status],
useOriginalSqlPreAggregations: true,
},
completedOrders: {
measures: [CUBE.count],
timeDimension: CUBE.completedAt,
granularity: `day`,
useOriginalSqlPreAggregations: true,
},
},
});
To always keep pre-aggregations up-to-date, you can set
scheduledRefresh: true
. This option defaults to true
. If set to false
,
pre-aggregations will always be built on-demand. The
refreshKey
is used to determine if there's a need to update
specific pre-aggregations on each scheduled refresh run. For partitioned
pre-aggregations, min
and max
dates for
timeDimension
are checked to determine range for the
refresh.
Each time a scheduled refresh is run, it takes every pre-aggregation partition
starting with most recent ones in time and checks if the
refreshKey
has changed. If a change was detected, then that
partition will be refreshed.
In development mode, Cube.js runs the background refresh by default and will
refresh all pre-aggregations which have scheduledRefresh: true
.
Please consult Production Checklist for best practices on running background refresh in production environments.
cube(`Orders`, {
sql: `SELECT * FROM orders`,
// ...
preAggregations: {
ordersByStatus: {
measures: [CUBE.count],
dimensions: [CUBE.status],
timeDimension: CUBE.createdAt,
granularity: `day`,
partitionGranularity: `month`,
},
},
});
The build range defines what partitions should be built by a scheduled refresh.
Data will never be queried beyond this range. By default, the build range is
defined as the minimum and maximum values possible for the
timeDimension
used in the rollup.
The SQL queries for the build range (as defined by the sql
property) are
executed based on the refreshKey
settings of the
pre-aggregation.
cube(`Orders`, {
sql: `SELECT * FROM orders`,
...,
preAggregations: {
main: {
measures: [CUBE.count],
timeDimension: CUBE.createdAt,
granularity: `day`,
buildRangeStart: {
sql: `SELECT NOW() - interval '300 day'`,
},
buildRangeEnd: {
sql: `SELECT NOW()`,
},
},
},
});
It can be used together with the pre-aggregation's refreshKey
to define
granular update settings. Set refreshKey.updateWindow
to the interval in which
your data can change and buildRangeStart
to the earliest point of time when
history should be available.
In the following example, refreshKey.updateWindow
is 1 week
and
buildRangeStart
is SELECT NOW() - interval '365 day'
, so the scheduled
refresh will build historical partitions for 365 days in the past and will only
refresh last week's data.
cube(`Orders`, {
sql: `SELECT * FROM orders`,
...,
preAggregations: {
main: {
measures: [CUBE.count],
timeDimension: CUBE.createdAt,
granularity: `day`,
partitionGranularity: `month`,
buildRangeStart: {
sql: `SELECT NOW() - interval '365 day'`,
},
buildRangeEnd: {
sql: `SELECT NOW()`,
},
refreshKey: {
updateWindow: `1 week`,
},
},
},
});
In case of pre-aggregation tables having significant cardinality, you might want to create indexes for them in databases which support it. This is can be done as follows:
cube(`Orders`, {
sql: `SELECT * FROM orders`,
...,
preAggregations: {
categoryAndDate: {
measures: [CUBE.count],
dimensions: [CUBE.category],
timeDimension: CUBE.createdAt,
granularity: `day`,
indexes: {
categoryIndex: {
columns: [CUBE.category],
},
},
},
},
});
For originalSql
pre-aggregations, the original column names as strings can be
used:
cube(`Orders`, {
sql: `SELECT * FROM orders`,
...,
preAggregations: {
main: {
type: `originalSql`,
indexes: {
timestampIndex: {
columns: ['timestamp'],
},
},
},
},
});
Did you find this page useful?