Pre-aggregations

Pre-aggregations are materialized query results persisted as tables. Cube has an ability to analyze queries against a defined set of pre-aggregation rules in order to choose the optimal one that will be used to create pre-aggregation table.

If Cube finds a suitable pre-aggregation rule, database querying becomes a multi-stage process:

  1. Cube checks if an up-to-date copy of the pre-aggregation exists.

  2. Cube will execute a query against the pre-aggregated tables instead of the raw data.

Pre-aggregations can be defined in the preAggregations available on each cube. A more in-depth diagram detailing how pre-aggregations are selected can be found here.

Pre-aggregations must have, at minimum, a name and a type. This name, along with the name of the cube will be used as a prefix for pre-aggregation tables created in the database.

Pre-aggregation names should:

  • Be unique within a cube
  • Start with a lowercase letter
  • Consist of numbers, letters and _
JS
YAML
cube(`Orders`, {
  sql: `SELECT * FROM orders`,

  preAggregations: {
    orders_by_status: {
      dimensions: [CUBE.status],
      measures: [CUBE.count],
    },
  },
});

Pre-aggregations must include all dimensions, measures, and filters you will query with.

Cube supports the following types of pre-aggregations:

The default type is rollup.

rollup

Rollup pre-aggregations are the most effective way to boost performance of any analytical application. The blazing fast performance of tools like Google Analytics or Mixpanel are backed by a similar concept. The theory behind it lies in multi-dimensional analysis, and a rollup pre-aggregation is the result of a roll-up operation on an OLAP cube. A rollup pre-aggregation is essentially the summarized data of the original cube grouped by any selected dimensions of interest.

The most performant kind of rollup pre-aggregation is an additive rollup: all measures of which are based on decomposable aggregate functions. Additive measure types are: count, sum, min, max or countDistinctApprox. The performance boost in this case is based on two main properties of additive rollup pre-aggregations:

  1. A rollup pre-aggregation table usually contains many fewer rows than its' corresponding original fact table. The fewer dimensions that are selected for roll-up means fewer rows in the materialized result. A smaller number of rows therefore means less time to query rollup pre-aggregation tables.

  2. If your query is a subset of dimensions and measures of an additive rollup, then it can be used to calculate a query without accessing the raw data. The more dimensions and measures are selected for roll-up, the more queries can use this particular rollup.

Rollup definitions can contain members from a single cube as well as from multiple cubes. In case of multiple cubes being involved, the join query will be built according to the standard rules of cubes joining.

Rollups are selected for querying based on properties found in queries made to the Cube REST API. A thorough explanation can be found under Getting Started with Pre-Aggregations.

JS
YAML
cube(`Orders`, {
  sql: `SELECT * FROM orders`,

  ...,

  preAggregations: {
    orders_by_company: {
      measures: [CUBE.count],
      dimensions: [CUBE.status],
    },
  },
});

originalSql

As the name suggests, it persists the results of the sql property of the cube. Pre-aggregations of type originalSql should only be used when the cube's sql is a complex query (i.e. nested, window functions and/or multiple joins). We strongly recommend persisting results of originalSql only back to the source database.

originalSql pre-aggregations must only be used when storing pre-aggregations in the source database. While you can set external: true for originalSql pre-aggregation, this is not recommended or generally supported.

They often do not provide much in the way of performance directly, but there are two specific applications:

  1. They can be used in tandem with the useOriginalSqlPreAggregations option in other rollup pre-aggregations.

  2. Situations where it is not possible to use a rollup pre-aggregations, such as funnels.

For example, to pre-aggregate all completed orders, you could do the following:

JS
YAML
cube(`CompletedOrders`, {
  sql: `SELECT * FROM orders WHERE completed = true`,

  preAggregations: {
    main: {
      type: `originalSql`,
    },
  },
});

rollupJoin

🐣 Preview

rollupJoin is currently in Preview, and the API may change in a future version.

Rollup join can be used only to join two tables, and one can't contain more than one partition so that the join can operate correctly. It means that at least one table should be less than 1M of rows. There's work in progress to remove those limitations.

Cube is capable of performing joins between pre-aggregations from different data sources to avoid making excessive queries to them.

rollupJoin is ephemeral pre-aggregation that relies on referenced rollups when queries are executed. Setting scheduledRefresh to true is unnecessary for rollupJoin and will result in an error. Appropriate freshness controls should be set on referenced rollups instead.

In the following example, we have a Users cube with a users_rollup pre-aggregation, and an Orders cube with an orders_rollup pre-aggregation, and an orders_with_users_rollup pre-aggregation. Note the following:

  • Both cubes have different values for dataSource.
  • The type of orders_with_users_rollup is rollupJoin.
  • This pre-aggregation has a special property rollups which is an array containing references to both "source" rollups.
JS
YAML
cube(`Users`, {
  dataSource: 'postgres',
  sql: `SELECT * FROM public.users`,

  preAggregations: {
    users_rollup: {
      dimensions: [CUBE.id, CUBE.name],
    },
  },

  measures: {
    count: {
      type: `count`,
    },
  },

  dimensions: {
    id: {
      sql: `id`,
      type: `number`,
      // We need to set this field as the primary key for joins to work
      primaryKey: true,
    },
    name: {
      sql: `first_name || last_name`,
      type: `string`,
    },
  },
});

cube('Orders', {
  dataSource: 'mssql',
  sql: `SELECT * FROM orders`,

  preAggregations: {
    orders_rollup: {
      measures: [CUBE.count],
      dimensions: [CUBE.user_id, CUBE.status],
      timeDimension: CUBE.created_at,
      granularity: `day`,
    },
    // Here we add a new pre-aggregation of type `rollupJoin`
    orders_with_users_rollup: {
      type: `rollupJoin`,
      measures: [CUBE.count],
      dimensions: [Users.name],
      rollups: [Users.users_rollup, CUBE.orders_rollup],
    },
  },

  joins: {
    Users: {
      relationship: `belongsTo`,
      // Make sure the join uses dimensions on the cube, rather than
      // the column names from the underlying SQL
      sql: `${CUBE.user_id} = ${Users.id}`,
    },
  },

  measures: {
    count: {
      type: `count`,
    },
  },

  dimensions: {
    id: {
      sql: `id`,
      type: `number`,
      primaryKey: true,
    },
    user_id: {
      sql: `user_id`,
      type: `number`,
    },
    status: {
      sql: `status`,
      type: `string`,
    },
    created_at: {
      sql: `created_at`,
      type: `time`,
    },
  },
});

rollupJoin is not required to join cubes from the same data source; instead include the foreign cube's dimensions/measures in the rollup definition directly:

JS
YAML
cube('Orders', {
  sql: `SELECT * FROM orders`,

  preAggregations: {
    orders_rollup: {
      measures: [CUBE.count],
      dimensions: [Users.name, CUBE.status],
      timeDimension: CUBE.created_at,
      granularity: `day`,
    },
  },
});

rollupLambda

rollupLambda pre-aggregations must be defined before any other pre-aggregations in a cube.

A rollupLambda pre-aggregation is a special type of pre-aggregation that can combine data from data sources and other rollups. It is extremely useful in scenarios where real-time data is required.

Lambda pre-aggregations can be used to combine data from a data source and a pre-aggregation, or even from multiple pre-aggregations across different schema-compatible cubes.

The measures property is an array of measures from the cube that should be included in the pre-aggregation:

JS
YAML
cube('Orders', {
  sql: `SELECT * FROM orders`,

  measures: {
    count: {
      type: `count`,
    },
  },

  preAggregations: {
    users_rollup: {
      measures: [CUBE.count],
    },
  },
});

The dimensions property is an array of dimensions from the cube that should be included in the pre-aggregation:

JS
YAML
cube('Orders', {
  sql: `SELECT * FROM orders`,

  dimensions: {
    status: {
      type: `string`,
      sql: `status`,
    },
  },

  preAggregations: {
    users_rollup: {
      dimensions: [CUBE.status],
    },
  },
});

The timeDimension property can be any dimension of type time. All other measures and dimensions in the schema are aggregated This property is an extremely useful tool for improving performance with massive datasets.

JS
YAML
cube('Orders', {
  sql: `SELECT * FROM orders`,

  measures: {
    count: {
      type: `count`,
    },
  },

  dimensions: {
    status: {
      type: `string`,
      sql: `status`,
    },
    created_at: {
      type: `time`,
      sql: `created_at`,
    },
  },

  preAggregations: {
    orders_by_status: {
      measures: [CUBE.count],
      dimensions: [CUBE.status],
      timeDimension: CUBE.created_at,
      granularity: `day`,
    },
  },
});

A granularity must also be included in the pre-aggregation definition.

The granularity property defines the granularity of data within the pre-aggregation. If set to week, for example, then Cube will pre-aggregate the data by week and persist it to Cube Store.

JS
YAML
cube('Orders', {
  sql: `SELECT * FROM orders`,

  preAggregations: {
    users_rollupByWeek: {
      measures: [CUBE.count],
      dimensions: [CUBE.status],
      timeDimension: CUBE.created_at,
      granularity: `week`,
    },
  },
});

The value can be one of second, minute, hour, day, week, month, quarter or year. This property is required when using timeDimension.

The segments property is an array of segments from the cube that can target the pre-aggregation:

JS
YAML
cube(`Orders`, {
  sql: `SELECT * FROM orders`,

  measures: {
    count: {
      type: `count`,
    },
  },

  segments: {
    only_complete: {
      sql: `${CUBE}.status = 'completed'`,
    },
  },

  preAggregations: {
    main: {
      measures: [CUBE.count],
      segments: [CUBE.only_complete],
    },
  },
});

The partitionGranularity defines the granularity for each partition of the pre-aggregation:

JS
YAML
cube('Orders', {
  sql: `SELECT * FROM orders`,

  preAggregations: {
    users_rollup: {
      measures: [CUBE.count],
      dimensions: [CUBE.status],
      timeDimension: CUBE.created_at,
      granularity: `day`,
      partitionGranularity: `month`,
    },
  },
});

The value can be one of hour, day, week, month, quarter, year. A timeDimension and granularity must also be included in the pre-aggregation definition. This property is required when using partitioned pre-aggregations.

Number of partitions to be built per cube is calculated as buildRange divided by partitionGranularity. Number of partitions to be built per cube is multiplied by the count of time zones and tenants in case different tenants have different pre-aggregation SQL.

Choose the count of partitions wisely as those consume memory and CPU resources. As a rule of thumb, you do not want to go over 5,000-10,000 partitions in total to keep the partitioning overhead low. Too many partitions will most likely cause out of memory.

Cube can also take care of keeping pre-aggregations up to date with the refreshKey property. By default, it is set to every: '1 hour'.

When using partitioned pre-aggregations, the refresh key is evaluated for each partition separately.

sql

You can set up a custom refresh check strategy by using the sql property:

JS
YAML
cube(`Orders`, {
  sql: `SELECT * FROM orders`,

  preAggregations: {
    main: {
      measures: [CUBE.count],
      refreshKey: {
        sql: `SELECT MAX(created_at) FROM orders`,
      },
    },
  },
});

In the above example, the refresh key SQL will be executed every 10 seconds, as every is not defined. If the results of the SQL refresh key differ from the last execution, then the pre-aggregation will be refreshed.

every

The refreshKey can define an every property which can be used to refresh pre-aggregations based on a time interval. By default, it is set to 1 hour unless the sql property is also defined, in which case it is set to 10 seconds. For example:

JS
YAML
cube(`Orders`, {
  sql: `SELECT * FROM orders`,

  preAggregations: {
    main: {
      measures: [CUBE.count],
      refreshKey: {
        every: `1 day`,
      },
    },
  },
});

For possible every parameter values please refer to refreshKey documentation.

You can also use every with sql:

JS
YAML
cube(`Orders`, {
  sql: `SELECT * FROM orders`,

  preAggregations: {
    main: {
      measures: [CUBE.count],
      refreshKey: {
        every: `1 hour`,
        sql: `SELECT MAX(created_at) FROM orders`,
      },
    },
  },
});

In the above example, the refresh key SQL will be executed every hour. If the results of the SQL refresh key differ from the last execution, then the pre-aggregation will be refreshed.

incremental

You can incrementally refresh partitioned rollups by setting incremental: true. This option defaults to false.

Partition tables are refreshed as a whole. When a new partition table is available, it replaces the old one. Old partition tables are collected by a garbage collection mechanism. Append is never used to add new rows to the existing tables.

JS
YAML
cube(`Orders`, {
  sql: `SELECT * FROM orders`,

  preAggregations: {
    main: {
      measures: [CUBE.count],
      timeDimension: CUBE.created_at,
      granularity: `day`,
      partitionGranularity: `day`,
      refreshKey: {
        every: `1 day`,
        incremental: true,
      },
    },
  },
});

updateWindow

Incremental refreshes without a defined updateWindow will only update the last partition as determined by the pre-aggregation's partitionGranularity.

The incremental: true flag generates a special refreshKey SQL query which triggers a refresh for partitions where the end date lies within the updateWindow from the current time.

Because incremental refreshes generate their own SQL, you must not use the sql property here.

In the example below, it will refresh today's and the last 7 days of partitions once a day. Partitions before the 7 day interval will not be refreshed once they are built unless the rollup SQL is changed.

JS
YAML
cube(`Orders`, {
  sql: `SELECT * FROM orders`,

  preAggregations: {
    main: {
      measures: [CUBE.count],
      timeDimension: CUBE.created_at,
      granularity: `day`,
      partitionGranularity: `day`,
      refreshKey: {
        every: `1 day`,
        incremental: true,
        updateWindow: `7 day`,
      },
    },
  },
});

This property is required when using incremental refreshes.

allowNonStrictDateRangeMatch

The allowNonStrictDateRangeMatch property can be used to allow queries to match this pre-aggregation when using non-strict date ranges. This is useful when using a visualization tools such as Apache Superset that use loose date ranges by default.

JS
YAML
cube(`Orders`, {
  sql: `SELECT * FROM orders`,

  preAggregations: {
    main: {
      measure: [CUBE.count],
      timeDimension: CUBE.created_at,
      granularity: `day`,
      partitionGranularity: `day`,
      allowNonStrictDateRangeMatch: true,
    },
  },
});

Cube supports multi-stage pre-aggregations by reusing original SQL pre-aggregations in rollups through the useOriginalSqlPreAggregations property. It is helpful in cases where you want to re-use a heavy SQL query calculation in multiple rollup pre-aggregations. Without useOriginalSqlPreAggregations enabled, Cube will always re-execute all underlying SQL calculations every time it builds new rollup tables.

originalSql pre-aggregations must only be used when storing pre-aggregations in the source database. This also means that originalSql pre-aggregations require readOnly: false to be set on their respective database driver.

JS
YAML
cube(`Orders`, {
  sql: `
    SELECT * FROM orders1
    UNION ALL
    SELECT * FROM orders2
    UNION ALL
    SELECT * FROM orders3
    `,

  preAggregations: {
    main: {
      type: `originalSql`,
    },
    statuses: {
      measures: [CUBE.count],
      dimensions: [CUBE.status],
      useOriginalSqlPreAggregations: true,
    },
    completed_orders: {
      measures: [CUBE.count],
      timeDimension: CUBE.completedAt,
      granularity: `day`,
      useOriginalSqlPreAggregations: true,
    },
  },
});

To always keep pre-aggregations up-to-date, you can set scheduledRefresh: true. This option defaults to true. If set to false, pre-aggregations will always be built on-demand. The refreshKey is used to determine if there's a need to update specific pre-aggregations on each scheduled refresh run. For partitioned pre-aggregations, min and max dates for timeDimension are checked to determine range for the refresh.

Each time a scheduled refresh is run, it takes every pre-aggregation partition starting with most recent ones in time and checks if the refreshKey has changed. If a change was detected, then that partition will be refreshed.

In development mode, Cube runs the background refresh by default and will refresh all pre-aggregations which have scheduledRefresh: true.

Please consult Production Checklist for best practices on running background refresh in production environments.

JS
YAML
cube(`Orders`, {
  sql: `SELECT * FROM orders`,

  preAggregations: {
    ordersByStatus: {
      measures: [CUBE.count],
      dimensions: [CUBE.status],
      timeDimension: CUBE.created_at,
      granularity: `day`,
      partitionGranularity: `month`,
    },
  },
});

Currently buildRangeStart and buildRangeEnd doesn't have any effect on pre-aggregations without partitionGranularity. This behavior can be changed in future versions.

The build range defines what partitions should be built by a scheduled refresh. Data will never be queried beyond this range. By default, the build range is defined as the minimum and maximum values possible for the timeDimension used in the rollup.

The SQL queries for the build range (as defined by the sql property) are executed based on the refreshKey settings of the pre-aggregation.

JS
YAML
cube(`Orders`, {
  sql: `SELECT * FROM orders`,

  preAggregations: {
    main: {
      measures: [CUBE.count],
      timeDimension: CUBE.created_at,
      granularity: `day`,
      partitionGranularity: `month`,
      buildRangeStart: {
        sql: `SELECT NOW() - interval '300 day'`,
      },
      buildRangeEnd: {
        sql: `SELECT NOW()`,
      },
    },
  },
});

It can be used together with the pre-aggregation's refreshKey to define granular update settings. Set refreshKey.updateWindow to the interval in which your data can change and buildRangeStart to the earliest point of time when history should be available.

In the following example, refreshKey.updateWindow is 1 week and buildRangeStart is SELECT NOW() - interval '365 day', so the scheduled refresh will build historical partitions for 365 days in the past and will only refresh last week's data.

JS
YAML
cube(`Orders`, {
  sql: `SELECT * FROM orders`,

  preAggregations: {
    main: {
      measures: [CUBE.count],
      timeDimension: CUBE.created_at,
      granularity: `day`,
      partitionGranularity: `month`,
      buildRangeStart: {
        sql: `SELECT NOW() - interval '365 day'`,
      },
      buildRangeEnd: {
        sql: `SELECT NOW()`,
      },
      refreshKey: {
        updateWindow: `1 week`,
      },
    },
  },
});

This option allows combining a pre-aggregation with fresh data retrieved from the data source; this is extremely useful in scenarios where latency can be sacrificed for accuracy.

To configure a pre-aggregation to behave in this way, ensure the pre-aggregation is of type rollupLambda, and then set unionWithSourceData to true:

JS
YAML
cube(`Orders`, {
  preAggregations: {
    lambda: {
      type: `rollupLambda`,
      unionWithSourceData: true,
      rollups: [Orders.main],
    },
  },
});

In case of pre-aggregation tables having significant cardinality, you might want to create indexes for them in databases which support it. This can be done as follows:

JS
YAML
cube(`Orders`, {
  sql: `SELECT * FROM orders`,

  preAggregations: {
    categoryAndDate: {
      measures: [CUBE.count],
      dimensions: [CUBE.category],
      timeDimension: CUBE.created_at,
      granularity: `day`,
      indexes: {
        categoryIndex: {
          columns: [CUBE.category],
        },
      },
    },
  },
});

To maximize performance, you can introduce an index per query, so the set of dimensions used in the query is the same as in the index. There's no need to define measure columns in the index. The index column order is the sort order of all data within the index. Under the hood, Cube Store will create table dataset copy and will sort it according to index specification. Cube Store works best when you define all your equals filter dimensions first and all dimensions from GROUP BY last. For range filter columns like contains, >, <, >=, <= with high selectivity, typically less than 5% of records keep those columns before GROUP BY dimensions. Otherwise, for low selectivity range filters, consider putting these columns after GROUP BY dimensions.

For originalSql pre-aggregations, the original column names as strings can be used:

JS
YAML
cube(`Orders`, {
  sql: `SELECT * FROM orders`,

  preAggregations: {
    main: {
      type: `originalSql`,
      indexes: {
        timestampIndex: {
          columns: ['timestamp'],
        },
      },
    },
  },
});

Cube Store also supports aggregating indexes that contain only specified columns and pre-aggregated measures in the pre-aggregation definition. Queries that can target aggregating indexes must have the following characteristics:

  1. They cannot make use of any filters other than for those dimensions that are included in that index
  2. All dimensions used in the query must be defined in the aggregating index

Queries that do not have the above can still make use of regular indexes so performance can still be optimized.

Basically, the aggregating index is a rollup of data in a rollup table itself. Data needs to be downloaded from the source DB as many times as many rollups you have. Given that having a single rollup and multiple aggregate indexes gives you pretty much the same performance from the Cube Store side but multiple times less cost from a data warehouse perspective.

Aggregating indexes can be defined by adding type: 'aggregate' to the index definition:

JS
YAML
cube(`Orders`, {
  preAggregations: {
    categoryAndDate: {
      measures: [CUBE.count],
      dimensions: [CUBE.status],
      timeDimension: CUBE.created_at,
      granularity: `day`,
      indexes: {
        categoryIndex: {
          columns: [CUBE.status, Products.name],
        },
        aggregatedIndex: {
          columns: [CUBE.status],
          type: `aggregate`,
        },
      },
    },
  },
});

Did you find this page useful?