Using Pre-Aggregations

Pre-aggregations are a powerful way to speed up your Cube.js queries. There are many configuration options to consider; please make sure to also check the Pre-Aggregations reference in the data schema section.

The refresh strategy can be customized by setting the refreshKey property on the pre-aggregation.

The default value of refreshKey is every: '1 hour'. It can be redefined either by overriding the default value of the every property:

cube(`Orders`, {

  ...,

  preAggregations: {
    amountByCreated: {
      type: `rollup`,
      measures: [amount],
      timeDimension: createdAt,
      granularity: `month`,
      refreshKey: {
        every: `12 hour`,
      },
    },
  },
});

Or by providing a sql property instead, and leaving every unchanged from its default value:

cube(`Orders`, {

  ...,

  preAggregations: {
    amountByCreated: {
      measures: [amount],
      timeDimension: createdAt,
      granularity: `month`,
      refreshKey: {
        // every will default to `10 seconds` here
        sql: `SELECT MAX(created_at) FROM orders`,
      },
    },
  },
});

Or both every and sql can be defined together:

cube(`Orders`, {

  ...,

  preAggregations: {
    amountByCreated: {
      measures: [amount],
      timeDimension: createdAt,
      granularity: `month`,
      refreshKey: {
        every: `12 hour`,
        sql: `SELECT MAX(created_at) FROM orders`,
      },
    },
  },
});

When every and sql are used together, Cube.js will run the query from the sql property on the interval defined by the every property. If the query returns a different result from the previous run, the pre-aggregation will be refreshed.

To make Cube.js only serve requests from pre-aggregations, set the CUBEJS_ROLLUP_ONLY environment variable to true on an API instance. This prevents API requests from being served with data from the source database.

When using this configuration in a single-node deployment (where the API instance and Refresh Worker are configured on the same host), requests made to the API that cannot be satisfied by a rollup will throw an error. Scheduled refreshes will continue to work in the background.
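
For example, a minimal sketch of the relevant environment configuration, assuming a multi-node deployment where a separate Refresh Worker builds the rollups:

# Set on the API instance only; the Refresh Worker still needs
# access to the source database to build pre-aggregations
CUBEJS_ROLLUP_ONLY=true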

Partitioning is an extremely effective optimization for accelerating pre-aggregation build and refresh times. It effectively "shards" the data between multiple tables, splitting it by a defined attribute. Cube can be configured to incrementally refresh only the most recent partitions through the updateWindow property. This leads to faster refresh times because unchanged data is not reloaded, and can even reduce cost for some databases, such as BigQuery or AWS Athena.

Any rollup pre-aggregation can be partitioned by time using the partitionGranularity property in a pre-aggregation definition. In the example below, partitionGranularity is set to month, which means Cube will generate a separate table for each month's worth of data. Once built, it will continue to refresh the last 3 months of data on a daily basis.

cube(`Orders`, {
  sql: `select * from orders`,

  ...,

  preAggregations: {
    categoryAndDate: {
      measures: [Orders.count, revenue],
      dimensions: [category],
      timeDimension: createdAt,
      granularity: `day`,
      partitionGranularity: `month`,
      refreshKey: {
        every: `1 day`,
        incremental: true,
        updateWindow: `3 months`
      }
    },
  },
});

Cube Store partially supports the MySQL protocol, which allows you to execute simple queries using familiar SQL syntax. You can connect using the MySQL CLI client, for example:

mysql -h <CUBESTORE_IP> --user=cubestore -pcubestore

To check which pre-aggregations are managed by Cube Store, you can run the following query:

SELECT * FROM information_schema.tables;
+----------------------+-----------------------------------------------+
| table_schema         | table_name                                    |
+----------------------+-----------------------------------------------+
| dev_pre_aggregations | orders_main20190101_23jnqarg_uiyfxd0f_1gifflf |
| dev_pre_aggregations | orders_main20190301_24ph0a1c_utzntnv_1gifflf  |
| dev_pre_aggregations | orders_main20190201_zhrh5kj1_rkmsrffi_1gifflf |
| dev_pre_aggregations | orders_main20191001_mdw2hxku_waxajvwc_1gifflf |
| dev_pre_aggregations | orders_main20190701_izc2tl0h_bxsf1zlb_1gifflf |
+----------------------+-----------------------------------------------+
5 rows in set (0.01 sec)

These pre-aggregations are stored as Parquet files under the .cubestore/ folder in the project root during development.
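
During development, you can inspect these files directly. A quick sketch, assuming the default development setup (the exact subdirectory layout may vary by Cube Store version):

# Recursively list the Parquet files backing the pre-aggregations
ls -R .cubestore/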

The default pre-aggregations storage in Cube is its own purpose-built storage layer: Cube Store.

Alternatively, you can store pre-aggregations internally in the source database. To store a pre-aggregation internally, set external: false in the pre-aggregation definition.
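
A minimal sketch, assuming a hypothetical ordersByStatus rollup (the cube and member names are illustrative):

cube(`Orders`, {
  sql: `SELECT * FROM orders`,

  ...,

  preAggregations: {
    ordersByStatus: {
      measures: [CUBE.count],
      dimensions: [CUBE.status],
      // Store this rollup in the source database instead of Cube Store
      external: false,
    },
  },
});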

Please note that originalSql pre-aggregations are stored internally by default. Storing originalSql pre-aggregations in Cube Store is not recommended.

This feature is in Preview and the API may change in a future release. Joining pre-aggregations only works with databases of the same type; support for joining pre-aggregations from different databases is coming soon.

When making a query that joins data from two different cubes, Cube can use pre-aggregations instead of running the base SQL queries. To get started, first ensure both cubes have valid pre-aggregations:

// Orders
cube(`Orders`, {
  sql: `SELECT * FROM public.orders`,

  ...,

  preAggregations: {
    ordersRollup: {
      measures: [CUBE.count],
      dimensions: [CUBE.userId, CUBE.status],
      timeDimension: CUBE.createdAt,
      granularity: `day`,
    },
  },

  joins: {
    Users: {
      sql: `${CUBE.userId} = ${Users.id}`,
      relationship: `belongsTo`
    },
  },
});

// Users
cube(`Users`, {
  sql: `SELECT * FROM public.users`,

  ...,

  preAggregations: {
    usersRollup: {
      dimensions: [CUBE.id, CUBE.name],
    },
  },
});

Before we continue, let's add an index to the ordersRollup pre-aggregation so that the rollupJoin pre-aggregation can work correctly:

cube(`Orders`, {
  ...,

  preAggregations: {
    ordersRollup: {
      ...,
      indexes: {
        userIndex: {
          columns: [CUBE.userId],
        },
      },
    },
  },
});

Now we can add a new pre-aggregation of type rollupJoin to the Orders cube:

cube(`Orders`, {
  ...,

  preAggregations: {
    ordersWithUsersRollup: {
      type: `rollupJoin`,
      measures: [CUBE.count],
      dimensions: [Users.name],
      timeDimension: CUBE.createdAt,
      granularity: `day`,
      rollups: [Users.usersRollup, CUBE.ordersRollup],
    },
  },
});

With all of the above set up, making a query such as the following will now use Orders.ordersRollup and Users.usersRollup, avoiding a database request:

{
  "dimensions": ["Users.name"],
  "timeDimensions": [
    {
      "dimension": "Orders.createdAt",
      "dateRange": "This month"
    }
  ],
  "order": {
    "Orders.count": "desc"
  },
  "measures": ["Orders.count"]
}

For ideal performance, pre-aggregations should be built using a dedicated Refresh Worker. See the Refresh Worker documentation for more details.
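
As a sketch, a dedicated Refresh Worker can be enabled on a separate instance via the CUBEJS_REFRESH_WORKER environment variable (shown here in isolation; a real deployment needs the usual database and API configuration as well):

# Set on the Refresh Worker instance to build scheduled
# pre-aggregation refreshes in the background
CUBEJS_REFRESH_WORKER=true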

Cube supports three different strategies for building pre-aggregations. To see which strategies your database supports, please refer to its individual page from Connecting to the Database.

When using the simple strategy, Cube will use the source database as a temporary staging area for writing pre-aggregations to determine column types. The data is then loaded back into memory before being written to Cube Store (or an external database).

For larger datasets, we strongly recommend using the Batching or Export Bucket strategies instead.

Batching is a more performant strategy where Cube.js sends compressed CSVs for Cube Store to ingest.

Performance scales with the amount of memory available on the Cube.js instance. Batching is automatically enabled for any database that supports it.

When dealing with larger pre-aggregations (more than 100k rows), performance can be significantly improved by using an export bucket. This allows the source database to temporarily materialize the data locally, which is then loaded into Cube Store in parallel.

Enabling the export bucket functionality requires extra configuration; please refer to the database-specific documentation for more details.
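
As an illustrative sketch for an S3 export bucket (the bucket name and credentials are placeholders, and the exact set of variables depends on your database):

CUBEJS_DB_EXPORT_BUCKET_TYPE=s3
CUBEJS_DB_EXPORT_BUCKET=my-export-bucket
CUBEJS_DB_EXPORT_BUCKET_AWS_KEY=<AWS_KEY>
CUBEJS_DB_EXPORT_BUCKET_AWS_SECRET=<AWS_SECRET>
CUBEJS_DB_EXPORT_BUCKET_AWS_REGION=<AWS_REGION>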

When using cloud storage, it is important to configure data retention policies correctly to clean up the data in the export bucket, as Cube.js does not currently manage this. For most use cases, a retention period of 1 day is sufficient.
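
On S3, for example, one way to do this is a bucket lifecycle rule that expires objects after a day (an illustrative sketch using the AWS CLI; my-export-bucket is a placeholder):

# Expire everything in the export bucket after 1 day
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-export-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-cube-exports",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Expiration": {"Days": 1}
    }]
  }'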
