Pre-aggregations
Pre-aggregations can be defined using the pre_aggregations
parameter of a cube.
Pre-aggregations must have, at minimum, a name and a type. Pre-aggregations must include all dimensions and measures you will query with.
A pre-aggregation is typically placed in a cube that defines most of dimensions and measures that this pre-aggregation includes. If a pre-aggregation doesn't include any dimensions or measures from a cube it's defined in, this pre-aggregation is ignored.
Parameters
name
The name
parameter serves as the identifier of a pre-aggregation. It must be
unique among all pre-aggregations within a cube and follow the naming
conventions.
cubes:
- name: orders
sql_table: orders
pre_aggregations:
- name: orders_by_status
dimensions:
- CUBE.status
measures:
- CUBE.count
# ...
This name
, along with the name of the cube, will be used as a prefix for
pre-aggregation tables created in the database.
type
Cube supports the following types of pre-aggregations:
The default type is rollup
.
rollup
Rollup pre-aggregations are the most effective way to boost performance of any analytical application. The blazing fast performance of tools like Google Analytics or Mixpanel are backed by a similar concept. The theory behind it lies in multi-dimensional analysis, and a rollup pre-aggregation is the result of a roll-up operation on an OLAP cube (opens in a new tab). A rollup pre-aggregation is essentially the summarized data of the original cube grouped by any selected dimensions of interest.
The most performant kind of rollup pre-aggregation is an additive rollup:
all measures of which are based on decomposable aggregate
functions (opens in a new tab). Additive measure types are: count
, sum
,
min
, max
or count_distinct_approx
. The performance boost in this case is
based on two main properties of additive rollup pre-aggregations:
-
A rollup pre-aggregation table usually contains many fewer rows than its' corresponding original fact table. The fewer dimensions that are selected for roll-up means fewer rows in the materialized result. A smaller number of rows therefore means less time to query rollup pre-aggregation tables.
-
If your query is a subset of dimensions and measures of an additive rollup, then it can be used to calculate a query without accessing the raw data. The more dimensions and measures are selected for roll-up, the more queries can use this particular rollup.
Rollup definitions can contain members from a single cube as well as from multiple cubes. In case of multiple cubes being involved, the join query will be built according to the standard rules of cubes joining.
Rollups are selected for querying based on properties found in queries made to the Cube REST API. A thorough explanation can be found under Getting Started with Pre-Aggregations.
cubes:
- name: orders
sql_table: orders
pre_aggregations:
- name: orders_by_company
measures:
- count
dimensions:
- status
# ...
original_sql
As the name suggests, it persists the results of the sql
property of the cube
in the data source (rather than Cube Store). original_sql
pre-aggregations should
only be used when the cube's sql
is a complex query (i.e., nested subqueries,
window functions, and/or multiple joins).
originalSql
pre-aggregations must only be stored in the data source.
While you can set external: true
for original_sql
pre-aggregation, this is
not recommended or generally supported.
They often do not provide much in the way of performance directly, but there are two specific applications:
-
They can be used in tandem with the
use_original_sql_pre_aggregations
option in other rollup pre-aggregations. -
Situations where it is not possible to use a
rollup
pre-aggregations, such as funnels.
For example, to pre-aggregate all completed orders, you could do the following:
cubes:
- name: completed_orders
sql: >
SELECT *
FROM orders
WHERE completed = true
pre_aggregations:
- name: main
type: original_sql
# ...
rollup_join
rollup_join
is currently in Preview, and the API may change in a future
version.
Rollup join is designed to join data across different data sources. To join data within the same data source you can list members from different cubes within a regular rollup. Rollup join can be used only to join two tables, and one can't contain more than one partition so that the join can operate correctly. It means that at least one table should be less than 1M of rows. There's work in progress to remove those limitations.
Cube is capable of performing joins between pre-aggregations from different data sources to avoid making excessive queries to them.
rollup_join
is an ephemeral pre-aggregation that relies on referenced rollups
when queries are executed. Setting
scheduled_refresh
to true
is unnecessary
for rollup_join
and will result in an error. Appropriate freshness controls
should be set on referenced rollups instead.
In the following example, we have a users
cube with a users_rollup
pre-aggregation, and an orders
cube with an orders_rollup
pre-aggregation,
and an orders_with_users_rollup
pre-aggregation. Note the following:
- Both cubes have different values for
data_source
. - The type of
orders_with_users_rollup
isrollup_join
. - This pre-aggregation has a special property
rollups
which is an array containing references to both "source" rollups.
cubes:
- name: users
data_source: postgres
sql_table: users
pre_aggregations:
- name: users_rollup
dimensions:
- CUBE.id
- CUBE.name
measures:
- name: count
type: count
dimensions:
- name: id
sql: id
type: number
primary_key: true
- name: name
sql: "{CUBE}.first_name || {CUBE}.last_name"
type: string
- name: orders
data_source: mssql
sql_table: orders
pre_aggregations:
- name: orders_rollup
measures:
- CUBE.count
dimensions:
- CUBE.user_id
- CUBE.status
time_dimension: CUBE.created_at
granularity: day
- name: orders_with_users_rollup
type: rollup_join
measures:
- CUBE.count
dimensions:
- users.name
rollups:
- users.users_rollup
- CUBE.orders_rollup
joins:
- name: users
relationship: many_to_one
sql: "{CUBE.user_id} = {users.id}"
measures:
- name: count
type: count
dimensions:
- name: id
sql: id
type: number
primary_key: true
- name: user_id
sql: user_id
type: number
- name: status
sql: status
type: string
- name: created_at
sql: created_at
type: time
rollup_join
is not required to join cubes from the same data source; instead,
include the foreign cube's dimensions/measures in the rollup definition
directly:
cubes:
- name: orders
sql_table: orders
pre_aggregations:
- name: orders_rollup
measures:
- CUBE.count
dimensions:
- users.name
- CUBE.status
time_dimension: CUBE.created_at
granularity: day
rollup_lambda
rollup_lambda
pre-aggregations must be defined before any other
pre-aggregations in a cube.
A rollup_lambda
pre-aggregation is a special type of pre-aggregation that can
combine data from data sources and other rollups. It is extremely useful in
scenarios where real-time data is required.
Lambda pre-aggregations can be used to combine data from a data source and a pre-aggregation, or even from multiple pre-aggregations across different cubes that share the same dimensions and measures.
measures
The measures
property is an array of measures from the
cube that should be included in the pre-aggregation:
cubes:
- name: orders
sql_table: orders
pre_aggregations:
- name: users_rollup
measures:
- CUBE.count
measures:
- name: count
type: count
dimensions
The dimensions
property is an array of dimensions from the
cube that should be included in the pre-aggregation:
cubes:
- name: orders
sql_table: orders
pre_aggregations:
- name: users_rollup
dimensions:
- CUBE.status
dimensions:
- name: status
type: string
sql: status
time_dimension
The time_dimension
property can be any dimension
of
type time
. All other measures and dimensions in
the data model are aggregated. This property is an extremely useful tool for
improving performance with massive datasets.
cubes:
- name: orders
sql_table: orders
pre_aggregations:
- name: orders_by_status
measures:
- CUBE.count
dimensions:
- CUBE.status
time_dimension: CUBE.created_at
granularity: day
measures:
- name: count
type: count
dimensions:
- name: status
type: string
sql: status
- name: created_at
type: time
sql: created_at
A granularity
must also be included in the
pre-aggregation definition.
granularity
The granularity
property defines the time dimension granularity of data
within the pre-aggregation. If set to week
, for example, then Cube will
pre-aggregate the data by week and persist it to Cube Store.
cubes:
- name: orders
sql_table: orders
pre_aggregations:
- name: users_rollup_by_week
measures:
- CUBE.count
dimensions:
- CUBE.status
time_dimension: CUBE.created_at
granularity: week
The value can be either a default granularity (i.e., second
, minute
, hour
,
day
, week
, month
, quarter
, or year
) or a custom
granularity. This property is required when using
time_dimension
.
segments
The segments
property is an array of segments from the
cube that can target the pre-aggregation:
cubes:
- name: orders
sql_table: orders
pre_aggregations:
- name: main
measures:
- CUBE.count
segments:
- CUBE.only_complete
measures:
- name: count
type: count
segments:
- name: only_complete
sql: "{CUBE}.status = 'completed'"
partition_granularity
The partition_granularity
defines the granularity for each
partition of the pre-aggregation:
cubes:
- name: orders
sql_table: orders
pre_aggregations:
- name: users_rollup
measures:
- CUBE.count
dimensions:
- CUBE.status
time_dimension: CUBE.created_at
granularity: day
partition_granularity: month
# ...
The value can be one of hour
, day
, week
, month
, quarter
, year
. A
time_dimension
and granularity
must also be included in the pre-aggregation definition. This property is
required when using partitioned pre-aggregations.
Number of partitions to be built per cube is calculated as
build_range divided by partition_granularity
. Number of
partitions to be built per cube is multiplied by the count of time zones and
tenants in case different tenants have different pre-aggregation SQL.
Choose the count of partitions wisely as those consume memory and CPU resources. As a rule of thumb, you do not want to go over 500-1,000 partitions per pre-aggregation in total to keep the partitioning overhead low. Too many partitions will most likely cause out of memory. In case of very long build ranges please consider use Lambda pre-aggregations to reduce partition count per pre-aggregation.
refresh_key
Cube can also take care of keeping pre-aggregations up to date with the
refresh_key
property. By default, it is set to every: '1 hour'
,
if neither of the cubes' pre-aggregation references don't override refresh_key
.
When using partitioned pre-aggregations, the refresh key is evaluated for each partition separately.
sql
You can set up a custom refresh check strategy by using the sql
property:
cubes:
- name: orders
sql_table: orders
pre_aggregations:
- name: main
measures:
- CUBE.count
refresh_key:
sql: SELECT MAX(created_at) FROM orders
# ...
In the above example, the refresh key SQL will be executed every 10 seconds, as
every
is not defined. If the results of the SQL
refresh key differ from the last execution, then the pre-aggregation will be
refreshed.
every
The refresh_key
can define an every
property which can be used to refresh
pre-aggregations based on a time interval. By default, it is set to 1 hour
unless the sql
property is also defined in any of cubes pre-aggregation references, in which case
it is set to 10 seconds
. For example:
cubes:
- name: orders
sql_table: orders
pre_aggregations:
- name: main
measures:
- CUBE.count
refresh_key:
every: 1 day
# ...
To have a pre-aggregation rebuild at a specific time of day, you can use a CRON string with some limitations. For more details about values that can be used with the every
parameter, please refer to the
refreshKey
documentation.
You can also use every
with sql
:
cubes:
- name: orders
sql_table: orders
pre_aggregations:
- name: main
measures:
- CUBE.count
refresh_key:
every: 1 hour
sql: SELECT MAX(created_at) FROM orders
In the above example, the refresh key SQL will be executed every hour. If the results of the SQL refresh key differ from the last execution, then the pre-aggregation will be refreshed.
incremental
You can incrementally refresh partitioned rollups by setting
incremental: true
. This option defaults to false
.
Partition tables are refreshed as a whole. When a new partition table is available, it replaces the old one. Old partition tables are collected by a garbage collection mechanism. Append is never used to add new rows to the existing tables.
Because incremental refreshes generate their own SQL, you must not use the
sql
property here.
cubes:
- name: orders
sql_table: orders
pre_aggregations:
- name: main
measures:
- CUBE.count
time_dimension: CUBE.created_at
granularity: day
partition_granularity: day
refresh_key:
every: 1 day
incremental: true
# ...
update_window
Incremental refreshes without a defined update_window
will only update the
last partition as determined by the pre-aggregation's partition_granularity
.
The incremental: true
flag generates a special refresh_key
SQL query which
triggers a refresh for partitions where the end date lies within the
update_window
from the current time.
Because incremental refreshes generate their own SQL, you must not use the
sql
property here.
In the example below, it will refresh today's and the last 7 days of partitions
once a day. Partitions before the 7 day
interval will not be refreshed
once they are built unless the rollup SQL is changed.
cubes:
- name: orders
sql_table: orders
pre_aggregations:
- name: main
measures:
- CUBE.count
time_dimension: CUBE.created_at
granularity: day
partition_granularity: day
refresh_key:
every: 1 day
incremental: true
update_window: 7 day
# ...
This property is required when using incremental
refreshes.
allow_non_strict_date_range_match
The allow_non_strict_date_range_match
property can be used to allow queries to
match this pre-aggregation when using non-strict date ranges. This is useful
when using a visualization tools such as Apache
Superset that use loose date ranges by default.
cubes:
- name: orders
sql_table: orders
pre_aggregations:
- name: main
measure:
- CUBE.count
time_dimension: CUBE.created_at
granularity: day
partition_granularity: day
allow_non_strict_date_range_match: true
use_original_sql_pre_aggregations
Cube supports multi-stage pre-aggregations by reusing original SQL
pre-aggregations in rollups through the use_original_sql_pre_aggregations
property. It is helpful in cases where you want to re-use a heavy SQL query
calculation in multiple rollup
pre-aggregations. Without
use_original_sql_pre_aggregations
enabled, Cube will always re-execute all
underlying SQL calculations every time it builds new rollup tables.
originalSql
pre-aggregations must only be stored in the data source.
While you can set external: true
for original_sql
pre-aggregation, this is
not recommended or generally supported.
cubes:
- name: orders
sql: >
SELECT * FROM orders1 UNION ALL SELECT * FROM orders2 UNION ALL SELECT *
FROM orders3
pre_aggregations:
- name: main
type: original_sql
- name: statuses
measures:
- CUBE.count
dimensions:
- CUBE.status
use_original_sql_pre_aggregations: true
- name: completed_orders
measures:
- CUBE.count
time_dimension: CUBE.completed_at
granularity: day
use_original_sql_pre_aggregations: true
# ...
scheduled_refresh
To always keep pre-aggregations up-to-date, you can set
scheduled_refresh: true
. This option defaults to true
. If set to false
,
pre-aggregations will always be built on-demand. The
refresh_key
is used to determine if there's a need to
update specific pre-aggregations on each scheduled refresh run. For partitioned
pre-aggregations, min
and max
dates for
time_dimension
are checked to determine range for the
refresh.
Each time a scheduled refresh is run, it takes every pre-aggregation partition
starting with most recent ones in time and checks if the
refresh_key
has changed. If a change was detected, then
that partition will be refreshed.
In development mode, Cube runs the background refresh by default and will
refresh all pre-aggregations which have scheduled_refresh: true
.
Please consult Production Checklist for best practices on running background refresh in production environments.
cubes:
- name: orders
sql_table: orders
pre_aggregations:
- name: orders_by_status
measures:
- CUBE.count
dimensions:
- CUBE.status
time_dimension: CUBE.created_at
granularity: day
partition_granularity: month
# ...
build_range_start
and build_range_end
Currently build_range_start
and build_range_end
doesn't have any effect on
pre-aggregations without partition_granularity
. This behavior can be changed
in future versions.
Cube will not return results outside of the defined build range. Data will only be queried within this range, which can lead to an empty result set, depending on the query. Consider using Lambda pre-aggregations if you want to query data outside of the build range.
The build range defines what partitions should be built by a scheduled refresh.
By default, the build range is defined as the minimum and maximum values
possible for the time_dimension
used in the rollup.
The SQL queries for the build range (as defined by the sql
property) are
executed based on the refresh_key
settings of the
pre-aggregation.
In case of very small update_window
or FILTER_PARAMS
are used in the refresh_key
definition
and the current timestamp is used as build_range_end
, there's a possibility
to write refresh_key
which won't refresh due to cycle dependency on each other.
To address such cases, you can use relative date in the future for the build_range_end
.
For example, you can add one day to the current timestamp and use it as build_range_end
.
The actual value to be added depends on how frequently refresh_key
is updated.
cubes:
- name: orders
sql_table: orders
pre_aggregations:
- name: main
measures:
- CUBE.count
time_dimension: CUBE.created_at
granularity: day
partition_granularity: month
build_range_start:
sql: SELECT NOW() - INTERVAL '300 day'
build_range_end:
sql: SELECT NOW()
# ...
It can be used together with the pre-aggregation's refresh_key
to define
granular update settings. Set refresh_key.update_window
to the interval in
which your data can change and build_range_start
to the earliest point of time
when history should be available.
In the following example, refresh_key.update_window
is 1 week
and
build_range_start
is SELECT NOW() - INTERVAL '365 day'
, so the scheduled
refresh will build historical partitions for 365 days in the past and will only
refresh last week's data.
cubes:
- name: orders
sql_table: orders
pre_aggregations:
- name: main
measures:
- CUBE.count
time_dimension: CUBE.created_at
granularity: day
partition_granularity: month
build_range_start:
sql: SELECT NOW() - INTERVAL '365 day'
build_range_end:
sql: SELECT NOW()
refresh_key:
update_window: 1 week
union_with_source_data
This option allows combining a pre-aggregation with fresh data retrieved from the data source; this is extremely useful in scenarios where latency can be sacrificed for accuracy.
To configure a pre-aggregation to behave in this way, ensure the pre-aggregation
is of type rollup_lambda
, and then set union_with_source_data
to true
:
cubes:
- name: orders
# ...
pre_aggregations:
- name: lambda
type: rollup_lambda
union_with_source_data: true
rollups:
- orders.main
indexes
This option allows to define indexes. Indexes are used to
fine-tune pre-aggregation performance when pre-aggregations have significant
cardinality. (For aggregating indexes, see the
type
option below.)
Here's how you can define an index:
cubes:
- name: orders
sql_table: orders
pre_aggregations:
- name: category_and_date
measures:
- count
dimensions:
- category
time_dimension: created_at
granularity: day
indexes:
- name: category_index
columns:
- category
# ...
In the case that you want to reference the time dimension in the index,
the column name as a string in the following format can be used:
<cube name>__<time dimension name>_<granularity>
:
cubes:
- name: orders
sql_table: orders
pre_aggregations:
- name: category_and_date
measures:
- CUBE.count
dimensions:
- CUBE.category
time_dimension: CUBE.created_at
granularity: day
indexes:
- name: category_created_at_index
columns:
- CUBE.category
- orders__created_at_day
# ...
The same approach works for original_sql
pre-aggregations:
cubes:
- name: orders
sql_table: orders
pre_aggregations:
- name: main
type: original_sql
indexes:
- name: timestamp_index
columns:
- timestamp
In some cases, indexes would not work with original_sql
pre-aggregations.
Please track this issue (opens in a new tab).
type
This option is used to define aggregating indexes that contain only dimensions and pre-aggregated measures from the pre-aggregation definition.
Here's how you can define an aggregating index:
cubes:
- name: orders
sql_table: orders
pre_aggregations:
- name: category_and_date
measures:
- CUBE.count
dimensions:
- CUBE.status
time_dimension: CUBE.created_at
granularity: day
indexes:
- name: category_index
columns:
- CUBE.status
- Products.name
- name: aggregating_index
columns:
- CUBE.status
type: aggregate
# ...