Lambda pre-aggregations

Lambda pre-aggregations follow the Lambda architecture design to union real-time and batch data. Cube acts as a serving layer and uses pre-aggregations as the batch layer and source data or other pre-aggregations, usually streaming, as the speed layer. Due to this design, lambda pre-aggregations only work with data that is newer than the existing batched pre-aggregations.

Lambda pre-aggregations only work with Cube Store.

Use cases

Below are the most common use cases for lambda pre-aggregations.

Batch and source data

Batch data comes from a pre-aggregation and real-time data comes from the data source.

Lambda pre-aggregation batch and source diagram

First, you need to create a pre-aggregation that will contain your batch data. In the following example, we call it batch. Please note that it must have time_dimension and partition_granularity specified. Cube will use these properties to union batch data with freshly-retrieved source data.

You can also use the build_range_start and build_range_end properties of a pre-aggregation to define a specific window for your batch data, as shown in the example and in the sketch that follows it.

Next, you need to create a lambda pre-aggregation. To do that, create a pre-aggregation with the rollup_lambda type, specify the rollups you would like to use with the rollups property, and finally set union_with_source_data: true to use source data as the real-time layer.

Please make sure that the lambda pre-aggregation definition comes first when defining your pre-aggregations.

YAML
cubes:
  - name: users
    # ...
 
    pre_aggregations:
      - name: lambda
        type: rollup_lambda
        union_with_source_data: true
        rollups:
          - CUBE.batch
 
      - name: batch
        measures:
          - users.count
        dimensions:
          - users.name
        time_dimension: users.created_at
        granularity: day
        partition_granularity: day
        build_range_start:
          sql: SELECT '2020-01-01'
        build_range_end:
          sql: SELECT '2022-05-30'
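
If you don't want a fixed batch window, build_range_end can also be computed dynamically so the batch layer keeps advancing on its own. The following is a sketch rather than a drop-in configuration: it assumes a data source whose SQL dialect supports NOW() and interval arithmetic, and it keeps the batch layer one day behind the current time so that the most recent day is always served from the source data.

YAML
      - name: batch
        measures:
          - users.count
        dimensions:
          - users.name
        time_dimension: users.created_at
        granularity: day
        partition_granularity: day
        build_range_start:
          sql: SELECT '2020-01-01'
        build_range_end:
          # Assumption: the source database understands this SQL dialect;
          # keep the batch layer one day behind "now"
          sql: SELECT NOW() - INTERVAL '1 day'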

Batch and streaming data

In this scenario, batch data comes from one pre-aggregation and real-time data comes from a streaming pre-aggregation.

Lambda pre-aggregation batch and streaming diagram

You can use lambda pre-aggregations to combine data from multiple pre-aggregations, where one pre-aggregation contains batch data and another contains streaming data. Please note that the build ranges of all rollups referenced by a lambda rollup should overlap enough to account for the time it takes to build partitions for those rollups. Cube will maximize the coverage of the requested date range with partitions from different rollups. For a particular date range, the first rollup in the list of referenced rollups that has a fully built partition will be used to serve it. The last rollup in the list will be used to cover the remaining, uncovered part of the date range; its partitions will be used even if they are not completely built.

YAML
cubes:
  - name: streaming_users
    # This cube uses a streaming SQL data source such as ksqlDB
    # ...
 
    pre_aggregations:
      - name: streaming
        type: rollup
        measures:
          - CUBE.count
        dimensions:
          - CUBE.name
        time_dimension: CUBE.created_at
        granularity: day
        partition_granularity: day
 
  - name: users
    # This cube uses a data source such as ClickHouse or BigQuery
    # ...
 
    pre_aggregations:
      - name: batch_streaming_lambda
        type: rollup_lambda
        rollups:
          - users.batch
          - streaming_users.streaming
 
      - name: batch
        type: rollup
        measures:
          - users.count
        dimensions:
          - users.name
        time_dimension: users.created_at
        granularity: day
        partition_granularity: day
        build_range_start:
          sql: SELECT '2020-01-01'
        build_range_end:
          sql: SELECT '2022-05-30'
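
For completeness, here is a sketch of how the streaming_users cube itself might be defined. The stream name (USERS_STREAM), the data source name (ksql), and the column names are assumptions for illustration; substitute the stream and schema exposed by your own streaming data source.

YAML
cubes:
  - name: streaming_users
    # Assumption: a streaming data source named `ksql` (e.g. ksqlDB)
    # is configured separately and exposes a stream called USERS_STREAM
    sql: SELECT * FROM USERS_STREAM
    data_source: ksql

    measures:
      - name: count
        type: count

    dimensions:
      - name: name
        sql: name
        type: string

      - name: created_at
        sql: created_at
        type: time

The streaming pre-aggregation shown earlier would then be defined on this cube and referenced from the batch_streaming_lambda pre-aggregation in the users cube.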