Pre-aggregations

You can use the pre_aggregations parameter within cubes to define pre-aggregations.

Pre-aggregations must have, at minimum, a name and a type. Pre-aggregations must include all dimensions and measures you will query with.

A pre-aggregation is typically placed in a cube that defines most of dimensions and measures that this pre-aggregation includes. If a pre-aggregation doesn't include any dimensions or measures from a cube it's defined in, this pre-aggregation is ignored.

Parameters

`name`

The name parameter serves as the identifier of a pre-aggregation. It must be unique among all pre-aggregations within a cube and follow the naming conventions.

YAML

JavaScript

cube(`orders`, {
  sql_table: `orders`,
 
  pre_aggregations: {
    orders_by_status: {
      dimensions: [
        status
      ],
      measures: [
        count
      ]
    }
  },
 
  // ...
})

This name, along with the name of the cube, will be used as a prefix for pre-aggregation tables created in the database.

`type`

Cube supports the following types of pre-aggregations:

The default type is rollup.

`rollup`

Rollup pre-aggregations are the most effective way to boost performance of any analytical application. The blazing fast performance of tools like Google Analytics or Mixpanel are backed by a similar concept. The theory behind it lies in multi-dimensional analysis, and a rollup pre-aggregation is the result of a roll-up operation on an OLAP cube (opens in a new tab). A rollup pre-aggregation is essentially the summarized data of the original cube grouped by any selected dimensions of interest.

The most performant kind of rollup pre-aggregation is an additive rollup: all measures of which are based on decomposable aggregate functions (opens in a new tab). Additive measure types are: count, sum, min, max or count_distinct_approx. The performance boost in this case is based on two main properties of additive rollup pre-aggregations:

A rollup pre-aggregation table usually contains many fewer rows than its' corresponding original fact table. The fewer dimensions that are selected for roll-up means fewer rows in the materialized result. A smaller number of rows therefore means less time to query rollup pre-aggregation tables.
If your query is a subset of dimensions and measures of an additive rollup, then it can be used to calculate a query without accessing the raw data. The more dimensions and measures are selected for roll-up, the more queries can use this particular rollup.

Rollup definitions can contain members from a single cube as well as from multiple cubes. In case of multiple cubes being involved, the join query will be built according to the standard rules of cubes joining.

Rollups are selected for querying based on properties found in queries made to the Cube REST API. A thorough explanation can be found under Getting Started with Pre-Aggregations.

YAML

JavaScript

cube(`orders`, {
  sql_table: `orders`,
 
  pre_aggregations: {
    orders_by_company: {
      measures: [
        count
      ],
      dimensions: [
        status
      ]
    }
  },
 
  // ...
})

`original_sql`

As the name suggests, it persists the results of the sql property of the cube in the data source (rather than Cube Store). original_sql pre-aggregations should only be used when the cube's sql is a complex query (i.e., nested subqueries, window functions, and/or multiple joins).

originalSql pre-aggregations must only be stored in the data source. While you can set external: true for original_sql pre-aggregation, this is not recommended or generally supported.

They often do not provide much in the way of performance directly, but there are two specific applications:

They can be used in tandem with the use_original_sql_pre_aggregations option in other rollup pre-aggregations.
Situations where it is not possible to use a rollup pre-aggregations, such as funnels.

For example, to pre-aggregate all completed orders, you could do the following:

YAML

JavaScript

cube(`completed_orders`, {
  sql: `
    SELECT *
    FROM orders
    WHERE completed = true
  `,
 
  pre_aggregations: {
    main: {
      type: `original_sql`
    }
  },
 
  // ...
})

`rollup_join`

⚠️🐣 Preview

rollup_join is currently in Preview, and the API may change in a future version.

Rollup join is designed to join data across different data sources. To join data within the same data source you can list members from different cubes within a regular rollup. Rollup join can be used only to join two tables, and one can't contain more than one partition so that the join can operate correctly. It means that at least one table should be less than 1M of rows. There's work in progress to remove those limitations.

Cube is capable of performing joins between pre-aggregations from different data sources to avoid making excessive queries to them.

rollup_join is an ephemeral pre-aggregation that relies on referenced rollups when queries are executed. Setting scheduled_refresh to true is unnecessary for rollup_join and will result in an error. Appropriate freshness controls should be set on referenced rollups instead.

In the following example, we have a users cube with a users_rollup pre-aggregation, and an orders cube with an orders_rollup pre-aggregation, and an orders_with_users_rollup pre-aggregation. Note the following:

Both cubes have different values for data_source.
The type of orders_with_users_rollup is rollup_join.
This pre-aggregation has a special property rollups which is an array containing references to both "source" rollups.

YAML

JavaScript

cube(`users`, {
  data_source: `postgres`,
  sql_table: `users`,
 
  pre_aggregations: {
    users_rollup: {
      dimensions: [CUBE.id, CUBE.name]
    }
  },
 
  measures: {
    count: {
      type: `count`
    }
  },
 
  dimensions: {
    id: {
      sql: `id`,
      type: `number`,
      // We need to set this field as the primary key for joins to work
      primary_key: true
    },
 
    name: {
      sql: `first_name || last_name`,
      type: `string`
    }
  }
})
 
cube(`orders`, {
  data_source: `mssql`,
  sql_table: `orders`,
 
  pre_aggregations: {
    orders_rollup: {
      measures: [CUBE.count],
      dimensions: [CUBE.user_id, CUBE.status],
      time_dimension: CUBE.created_at,
      granularity: `day`
    },
 
    // Here we add a new pre-aggregation of type `rollup_join`
    orders_with_users_rollup: {
      type: `rollup_join`,
      measures: [CUBE.count],
      dimensions: [users.name],
      rollups: [users.users_rollup, CUBE.orders_rollup]
    }
  },
 
  joins: {
    users: {
      relationship: `many_to_one`,
      // Make sure the join uses dimensions on the cube, rather than
      // the column names from the underlying SQL
      sql: `${CUBE.user_id} = ${users.id}`
    }
  },
 
  measures: {
    count: {
      type: `count`
    }
  },
 
  dimensions: {
    id: {
      sql: `id`,
      type: `number`,
      primary_key: true
    },
 
    user_id: {
      sql: `user_id`,
      type: `number`
    },
 
    status: {
      sql: `status`,
      type: `string`
    },
 
    created_at: {
      sql: `created_at`,
      type: `time`
    }
  }
})

rollup_join is not required to join cubes from the same data source; instead, include the foreign cube's dimensions/measures in the rollup definition directly:

YAML

JavaScript

cube(`orders`, {
  sql_table: `orders`,
 
  pre_aggregations: {
    orders_rollup: {
      measures: [CUBE.count],
      dimensions: [users.name, CUBE.status],
      time_dimension: CUBE.created_at,
      granularity: `day`
    }
  }
})

`rollup_lambda`

rollup_lambda pre-aggregations must be defined before any other pre-aggregations in a cube.

A rollup_lambda pre-aggregation is a special type of pre-aggregation that can combine data from data sources and other rollups. It is extremely useful in scenarios where real-time data is required.

Lambda pre-aggregations can be used to combine data from a data source and a pre-aggregation, or even from multiple pre-aggregations across different cubes that share the same dimensions and measures.

`measures`

The measures property is an array of measures from the cube that should be included in the pre-aggregation:

YAML

JavaScript

cube(`orders`, {
  sql_table: `orders`,
 
  pre_aggregations: {
    users_rollup: {
      measures: [CUBE.count]
    }
  },
 
  measures: {
    count: {
      type: `count`
    }
  }
})

`dimensions`

The dimensions property is an array of dimensions from the cube that should be included in the pre-aggregation:

YAML

JavaScript

cube(`orders`, {
  sql_table: `orders`,
 
  pre_aggregations: {
    users_rollup: {
      dimensions: [CUBE.status]
    }
  },
 
  dimensions: {
    status: {
      type: `string`,
      sql: `status`
    }
  }
})

`time_dimension`

The time_dimension property can be any dimension of type time. All other measures and dimensions in the data model are aggregated. This property is an extremely useful tool for improving performance with massive datasets.

YAML

JavaScript

cube(`orders`, {
  sql_table: `orders`,
 
  pre_aggregations: {
    orders_by_status: {
      measures: [CUBE.count],
      dimensions: [CUBE.status],
      time_dimension: CUBE.created_at,
      granularity: `day`
    }
  },
 
  measures: {
    count: {
      type: `count`
    }
  },
 
  dimensions: {
    status: {
      type: `string`,
      sql: `status`
    },
 
    created_at: {
      type: `time`,
      sql: `created_at`
    }
  }
})

A granularity must also be included in the pre-aggregation definition.

`granularity`

The granularity property defines the time dimension granularity of data within the pre-aggregation. If set to week, for example, then Cube will pre-aggregate the data by week and persist it to Cube Store.

YAML

JavaScript

cube(`orders`, {
  sql_table: `orders`,
 
  pre_aggregations: {
    users_rollup_by_week: {
      measures: [CUBE.count],
      dimensions: [CUBE.status],
      time_dimension: CUBE.created_at,
      granularity: `week`
    }
  },
 
  // ...
})

The value can be either a default granularity (i.e., second, minute, hour, day, week, month, quarter, or year) or a custom granularity. This property is required when using time_dimension.

`segments`

The segments property is an array of segments from the cube that can target the pre-aggregation:

YAML

JavaScript

cube(`orders`, {
  sql_table: `orders`,
 
  pre_aggregations: {
    main: {
      measures: [CUBE.count],
      segments: [CUBE.only_complete]
    }
  },
 
  measures: {
    count: {
      type: `count`
    }
  },
 
  segments: {
    only_complete: {
      sql: `${CUBE}.status = 'completed'`
    }
  }
})

`partition_granularity`

The partition_granularity defines the granularity for each partition of the pre-aggregation:

YAML

JavaScript

cube(`orders`, {
  sql_table: `orders`,
 
  pre_aggregations: {
    users_rollup: {
      measures: [CUBE.count],
      dimensions: [CUBE.status],
      time_dimension: CUBE.created_at,
      granularity: `day`,
      partition_granularity: `month`
    }
  },
 
  // ...
})

The value can be one of hour, day, week, month, quarter, year. A time_dimension and granularity must also be included in the pre-aggregation definition. This property is required when using partitioned pre-aggregations.

Number of partitions to be built per cube is calculated as build_range divided by partition_granularity. Number of partitions to be built per cube is multiplied by the count of time zones and tenants in case different tenants have different pre-aggregation SQL.

Choose the count of partitions wisely as those consume memory and CPU resources. As a rule of thumb, you do not want to go over 500-1,000 partitions per pre-aggregation in total to keep the partitioning overhead low. Too many partitions will most likely cause out of memory. In case of very long build ranges please consider use Lambda pre-aggregations to reduce partition count per pre-aggregation.

`refresh_key`

Cube can also take care of keeping pre-aggregations up to date with the refresh_key property. By default, it is set to every: '1 hour', if neither of the cubes' pre-aggregation references don't override refresh_key.

When using partitioned pre-aggregations, the refresh key is evaluated for each partition separately.

`sql`

You can set up a custom refresh check strategy by using the sql property:

YAML

JavaScript

cube(`orders`, {
  sql_table: `orders`,
 
  pre_aggregations: {
    main: {
      measures: [CUBE.count],
      refresh_key: {
        sql: `SELECT MAX(created_at) FROM orders`
      }
    }
  },
 
  // ...
})

In the above example, the refresh key SQL will be executed every 10 seconds, as every is not defined. If the results of the SQL refresh key differ from the last execution, then the pre-aggregation will be refreshed.

`every`

The refresh_key can define an every property which can be used to refresh pre-aggregations based on a time interval. By default, it is set to 1 hour unless the sql property is also defined in any of cubes pre-aggregation references, in which case it is set to 10 seconds. For example:

YAML

JavaScript

cube(`orders`, {
  sql_table: `orders`,
 
  pre_aggregations: {
    main: {
      measures: [CUBE.count],
      refresh_key: {
        every: `1 day`
      }
    }
  },
 
  // ...
})

To have a pre-aggregation rebuild at a specific time of day, you can use a CRON string with some limitations. For more details about values that can be used with the every parameter, please refer to the refreshKey documentation.

You can also use every with sql:

Using CRON strings and sql is not supported and will result in a compilation error.

YAML

JavaScript

cube(`orders`, {
  sql_table: `orders`,
 
  pre_aggregations: {
    main: {
      measures: [CUBE.count],
      refreshKey: {
        every: `1 hour`,
        sql: `SELECT MAX(created_at) FROM orders`
      }
    }
  }
})

In the above example, the refresh key SQL will be executed every hour. If the results of the SQL refresh key differ from the last execution, then the pre-aggregation will be refreshed.

`incremental`

You can incrementally refresh partitioned rollups by setting incremental: true. This option defaults to false.

Partition tables are refreshed as a whole. When a new partition table is available, it replaces the old one. Old partition tables are collected by a garbage collection mechanism. Append is never used to add new rows to the existing tables.

Because incremental refreshes generate their own SQL, you must not use the sql property here.

YAML

JavaScript

cube(`orders`, {
  sql_table: `orders`,
 
  pre_aggregations: {
    main: {
      measures: [CUBE.count],
      time_dimension: CUBE.created_at,
      granularity: `day`,
      partition_granularity: `day`,
      refresh_key: {
        every: `1 day`,
        incremental: true
      }
    }
  },
 
  // ...
})

`update_window`

Incremental refreshes without a defined update_window will only update the last partition as determined by the pre-aggregation's partition_granularity.

The incremental: true flag generates a special refresh_key SQL query which triggers a refresh for partitions where the end date lies within the update_window from the current time.

Because incremental refreshes generate their own SQL, you must not use the sql property here.

In the example below, it will refresh today's and the last 7 days of partitions once a day. Partitions before the 7 day interval will not be refreshed once they are built unless the rollup SQL is changed.

YAML

JavaScript

cube(`orders`, {
  sql_table: `orders`,
 
  pre_aggregations: {
    main: {
      measures: [CUBE.count],
      time_dimension: CUBE.created_at,
      granularity: `day`,
      partition_granularity: `day`,
      refresh_key: {
        every: `1 day`,
        incremental: true,
        update_window: `7 day`
      }
    }
  },
 
  // ...
})

This property is required when using incremental refreshes.

`allow_non_strict_date_range_match`

The allow_non_strict_date_range_match parameter is used to allow queries to match a pre-aggregation even if a query contains a non-strict date range. It is set to true by default via the CUBEJS_PRE_AGGREGATIONS_ALLOW_NON_STRICT_DATE_RANGE_MATCH environment variable.

If this flag is set to false, Cube would check if requested date range exactly matches pre-aggregation granularity. For example, if you're requesting half of a day or your date range filter is just one millisecond off for a pre-aggregation with the daily granularity, Cube would not use such a pre-aggregation.

With this flag set to true, that strict check is lifted. It allows queries from BI tools to still match pre-aggregations at the cost of a slight potential data discrepancy. This is convenient when using Cube with visualization tools such as Tableau or Apache Superset that use loose date ranges.

YAML

JavaScript

cube(`orders`, {
  sql: `orders`,
 
  pre_aggregations: {
    main: {
      measure: [CUBE.count],
      time_dimension: CUBE.created_at,
      granularity: `day`,
      partition_granularity: `day`,
      allow_non_strict_date_range_match: true
    }
  },
 
  // ...
})

`use_original_sql_pre_aggregations`

Cube supports multi-stage pre-aggregations by reusing original SQL pre-aggregations in rollups through the use_original_sql_pre_aggregations property. It is helpful in cases where you want to re-use a heavy SQL query calculation in multiple rollup pre-aggregations. Without use_original_sql_pre_aggregations enabled, Cube will always re-execute all underlying SQL calculations every time it builds new rollup tables.

originalSql pre-aggregations must only be stored in the data source. While you can set external: true for original_sql pre-aggregation, this is not recommended or generally supported.

YAML

JavaScript

cube(`orders`, {
  sql: `
    SELECT * FROM orders1
    UNION ALL
    SELECT * FROM orders2
    UNION ALL
    SELECT * FROM orders3
  `,
 
  pre_aggregations: {
    main: {
      type: `original_sql`
    },
 
    statuses: {
      measures: [CUBE.count],
      dimensions: [CUBE.status],
      use_original_sql_pre_aggregations: true
    },
 
    completed_orders: {
      measures: [CUBE.count],
      time_dimension: CUBE.completed_at,
      granularity: `day`,
      use_original_sql_pre_aggregations: true
    }
  },
 
  // ...
})

`scheduled_refresh`

To always keep pre-aggregations up-to-date, you can set scheduled_refresh: true. This option defaults to true. If set to false, pre-aggregations will always be built on-demand. The refresh_key is used to determine if there's a need to update specific pre-aggregations on each scheduled refresh run. For partitioned pre-aggregations, min and max dates for time_dimension are checked to determine range for the refresh.

Each time a scheduled refresh is run, it takes every pre-aggregation partition starting with most recent ones in time and checks if the refresh_key has changed. If a change was detected, then that partition will be refreshed.

In development mode, Cube runs the background refresh by default and will refresh all pre-aggregations which have scheduled_refresh: true.

Please consult Production Checklist for best practices on running background refresh in production environments.

YAML

JavaScript

cube(`orders`, {
  sql_table: `orders`,
 
  pre_aggregations: {
    orders_by_status: {
      measures: [CUBE.count],
      dimensions: [CUBE.status],
      time_dimension: CUBE.created_at,
      granularity: `day`,
      partition_granularity: `month`
    }
  },
 
  // ...
})

`build_range_start` and `build_range_end`

Currently build_range_start and build_range_end doesn't have any effect on pre-aggregations without partition_granularity. This behavior can be changed in future versions.

Cube will not return results outside of the defined build range. Data will only be queried within this range, which can lead to an empty result set, depending on the query. Consider using Lambda pre-aggregations if you want to query data outside of the build range.

The build range defines what partitions should be built by a scheduled refresh. By default, the build range is defined as the minimum and maximum values possible for the time_dimension used in the rollup.

The SQL queries for the build range (as defined by the sql property) are executed based on the refresh_key settings of the pre-aggregation.

In case of very small update_window or FILTER_PARAMS are used in the refresh_key definition and the current timestamp is used as build_range_end, there's a possibility to write refresh_key which won't refresh due to cycle dependency on each other. To address such cases, you can use relative date in the future for the build_range_end. For example, you can add one day to the current timestamp and use it as build_range_end. The actual value to be added depends on how frequently refresh_key is updated.

YAML

JavaScript

cube(`orders`, {
  sql_table: `orders`,
 
  pre_aggregations: {
    main: {
      measures: [CUBE.count],
      time_dimension: CUBE.created_at,
      granularity: `day`,
      partition_granularity: `month`,
      build_range_start: {
        sql: `SELECT NOW() - INTERVAL '300 day'`
      },
      build_range_end: {
        sql: `SELECT NOW()`
      }
    }
  },
 
  // ...
})

It can be used together with the pre-aggregation's refresh_key to define granular update settings. Set refresh_key.update_window to the interval in which your data can change and build_range_start to the earliest point of time when history should be available.

In the following example, refresh_key.update_window is 1 week and build_range_start is SELECT NOW() - INTERVAL '365 day', so the scheduled refresh will build historical partitions for 365 days in the past and will only refresh last week's data.

YAML

JavaScript

cube(`orders`, {
  sql_table: `orders`,
 
  pre_aggregations: {
    main: {
      measures: [CUBE.count],
      time_dimension: CUBE.created_at,
      granularity: `day`,
      partition_granularity: `month`,
      build_range_start: {
        sql: `SELECT NOW() - INTERVAL '365 day'`
      },
      build_range_end: {
        sql: `SELECT NOW()`
      },
      refresh_key: {
        update_window: `1 week`
      }
    }
  },
 
  // ...
})

`union_with_source_data`

This option allows combining a pre-aggregation with fresh data retrieved from the data source; this is extremely useful in scenarios where latency can be sacrificed for accuracy.

To configure a pre-aggregation to behave in this way, ensure the pre-aggregation is of type rollup_lambda, and then set union_with_source_data to true:

YAML

JavaScript

cube(`orders`, {
  // ...
 
  pre_aggregations: {
    lambda: {
      type: `rollup_lambda`,
      union_with_source_data: true,
      rollups: [orders.main]
    }
  }
})

`indexes`

This option allows to define indexes. Indexes are used to fine-tune pre-aggregation performance when pre-aggregations have significant cardinality. (For aggregating indexes, see the type option below.)

Here's how you can define an index:

YAML

JavaScript

cube(`orders`, {
  sql_table: `orders`,
 
  pre_aggregations: {
    category_and_date: {
      measures: [
        count
      ],
      dimensions: [
        category
      ],
      time_dimension: created_at,
      granularity: `day`,
      indexes: {
        category_index: {
          columns: [
            category
          ]
        }
      }
    }
  },
 
  // ...
})

In the case that you want to reference the time dimension in the index, the column name as a string in the following format can be used: <cube name>__<time dimension name>_<granularity>:

YAML

JavaScript

cube(`orders`, {
  sql_table: `orders`,
 
  pre_aggregations: {
    category_and_date: {
      measures: [
        CUBE.count
      ],
      dimensions: [
        CUBE.category
      ],
      time_dimension: CUBE.created_at,
      granularity: `day`,
      indexes: {
        category_created_at_index: {
          columns: [
            CUBE.category,
            `orders__created_at_day`
          ]
        }
      }
    }
  },
 
  // ...
})

The same approach works for original_sql pre-aggregations:

YAML

JavaScript

cube(`orders`, {
  sql_table: `orders`,
 
  pre_aggregations: {
    main: {
      type: `original_sql`,
      indexes: {
        timestamp_index: {
          columns: [
            `timestamp`
          ]
        }
      }
    }
  }
})

In some cases, indexes would not work with original_sql pre-aggregations. Please track this issue (opens in a new tab).

`type`

This option is used to define aggregating indexes that contain only dimensions and pre-aggregated measures from the pre-aggregation definition.

Here's how you can define an aggregating index:

YAML

JavaScript

cube(`orders`, {
  sql_table: `orders`,
 
  pre_aggregations: {
    category_and_date: {
      measures: [
        CUBE.count
      ],
      dimensions: [
        CUBE.status
      ],
      time_dimension: CUBE.created_at,
      granularity: `day`,
      indexes: {
        category_index: {
          columns: [
            CUBE.status,
            products.name
          ]
        },
        aggregating_index: {
          columns: [
            CUBE.status
          ],
          type: `aggregate`
        }
      }
    }
  },
 
  // ...
})

Joins Data access policies

Pre-aggregations

Parameters

name

type

rollup

original_sql

rollup_join

rollup_lambda

measures

dimensions

time_dimension

granularity

segments

partition_granularity

refresh_key

sql

every

incremental

update_window

allow_non_strict_date_range_match

use_original_sql_pre_aggregations

scheduled_refresh

build_range_start and build_range_end

union_with_source_data

indexes

type

`name`

`type`

`rollup`

`original_sql`

`rollup_join`

`rollup_lambda`

`measures`

`dimensions`

`time_dimension`

`granularity`

`segments`

`partition_granularity`

`refresh_key`

`sql`

`every`

`incremental`

`update_window`

`allow_non_strict_date_range_match`

`use_original_sql_pre_aggregations`

`scheduled_refresh`

`build_range_start` and `build_range_end`

`union_with_source_data`

`indexes`

`type`