External Rollups: Using Cube Store as an Acceleration Layer for BigQuery


BigQuery is one of the best-known data warehouses, and it's great at handling large datasets. However, due to its distributed architecture, it will never give you sub-second responses, regardless of dataset size.

For application developers, building an app directly on top of BigQuery means substantial wait times on dashboards and charts, especially dynamic ones where users can select different date ranges or change filters. We can certainly tolerate that for internal apps, but not for the public-facing apps we ship to customers.

Still, we want to leverage BigQuery's cheap data storage and its ability to process large datasets, without giving up on performance. BigQuery will act as the single source of truth and store the raw data, while we add an acceleration layer on top of it. This layer will be based on Cube.js: we'll store only small, aggregated tables (so-called "external rollups" or "pre-aggregations") in Cube Store, which will give us the desired sub-second responses. We'll also build an example front-end app with a dashboard:

front-end app

You can check out the demo and the source code on GitHub. Make sure to play with the date range and the switchers: dynamic dashboards benefit the most from external rollups (pre-aggregations).

Cube.js provides external rollups (pre-aggregations) as part of its two-level caching system. The diagram below shows how Cube.js and Cube Store interact with BigQuery: the raw data is pre-aggregated asynchronously and put into Cube Store; when a query comes to the API, it is served from the pre-aggregations, without any requests to BigQuery.

schema

To use the acceleration layer, we need to configure Cube.js to connect to BigQuery and specify the pre-aggregations we want to build. If you are new to Cube.js, I recommend checking out this 101-style tutorial first and then coming back here. We are going to use the public Hacker News dataset in BigQuery for our app.

First, please make sure you have Docker installed on your machine. Running Cube.js with Docker is the recommended way.

Second, let's create a new folder for your Cube.js app and navigate to it:

mkdir external-rollups
cd external-rollups

Then, create a new docker-compose.yml file with the Cube.js configuration. We'll use environment variables from the .env file and instruct Cube.js to connect to BigQuery:

cat > docker-compose.yml << EOL
version: '2.2'
services:
  cube:
    image: cubejs/cube:latest
    ports:
      - 4000:4000
      - 3000:3000
    env_file: .env
    volumes:
      - .:/cube/conf
EOL

In the .env file, we'll set the credentials for BigQuery. You can learn more about how to obtain them in the docs. Note that Cube.js needs write access to a schema in BigQuery where pre-aggregations will be temporarily stored before being loaded into Cube Store.

cat > .env << EOL
CUBEJS_DB_BQ_CREDENTIALS=<JSON_KEY_FILE_IN_BASE64>
CUBEJS_DB_BQ_PROJECT_ID=<PROJECT_ID>
CUBEJS_DB_TYPE=bigquery
CUBEJS_PREAGGREGATIONS_SCHEMA=<WRITABLE_SCHEMA_NAME>
CUBEJS_EXTERNAL_DEFAULT=true
CUBEJS_API_SECRET=SECRET
CUBEJS_DEV_MODE=true
EOL
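To fill in CUBEJS_DB_BQ_CREDENTIALS, the service account JSON key file needs to be base64-encoded. A sketch with the standard base64 utility (key.json is a hypothetical filename standing in for the key you download from Google Cloud):

```shell
# Create a stand-in key file for illustration; in practice this is the
# JSON key downloaded from Google Cloud for your service account.
printf '{"project_id":"demo"}' > key.json

# Encode it as a single base64 line suitable for the .env file.
# -w 0 disables line wrapping in GNU coreutils; on macOS use `base64 -i key.json`.
base64 -w 0 key.json
```

Paste the resulting string as the value of CUBEJS_DB_BQ_CREDENTIALS.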

That is all we need to let Cube.js connect to BigQuery. Also note that Cube Store is readily available for development purposes when the CUBEJS_DEV_MODE environment variable is set to true. Learn more about running Cube Store in production in the docs.
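For production, you would run Cube Store as a separate service next to Cube.js instead of relying on dev mode. A minimal docker-compose sketch, assuming the cubejs/cubestore image and the CUBEJS_CUBESTORE_HOST variable as described in Cube's deployment docs (verify the details against the current documentation; the volume path is illustrative):

```yaml
version: '2.2'
services:
  cube:
    image: cubejs/cube:latest
    ports:
      - 4000:4000
    env_file: .env
    environment:
      # Point Cube.js at the standalone Cube Store service below
      - CUBEJS_CUBESTORE_HOST=cubestore
    volumes:
      - .:/cube/conf
    depends_on:
      - cubestore
  cubestore:
    image: cubejs/cubestore:latest
    volumes:
      # Persist pre-aggregation data between restarts (illustrative path)
      - .cubestore:/cube/data
```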

Now we're ready to create our first Cube.js data schema file. Cube.js uses the data schema to generate the SQL that will be executed in the database. Let's create the schema/Stories.js file with the following contents:

cube(`Stories`, {
  sql: `select * from \`fh-bigquery.hackernews.full_partitioned\` WHERE type = 'story'`,

  measures: {
    count: {
      type: `count`,
    },
  },

  dimensions: {
    category: {
      type: `string`,
      case: {
        when: [
          { sql: `STARTS_WITH(title, "Show HN")`, label: `Show HN` },
          { sql: `STARTS_WITH(title, "Ask HN")`, label: `Ask HN` },
        ],
        else: { label: `Other` },
      },
    },

    time: {
      sql: `timestamp`,
      type: `time`,
    },
  },
});

Now we're ready to run Cube.js:

docker compose up

Then, let's navigate to the Developer Playground, a convenient web-based tool for exploring the data, at localhost:4000. Select the Stories Count measure and the Stories Category dimension there, along with the Stories Time dimension grouped by month over all time, to build a chart as shown below:
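The query the Playground builds can also be sent straight to the Cube.js REST API. A sketch, assuming the default localhost:4000 dev deployment (the JWT placeholder is hypothetical; in dev mode the Playground shows a usable token):

```shell
# Express the same query as Cube.js REST API JSON: one measure,
# one dimension, and a time dimension with monthly granularity.
cat > query.json << 'EOL'
{
  "measures": ["Stories.count"],
  "dimensions": ["Stories.category"],
  "timeDimensions": [
    { "dimension": "Stories.time", "granularity": "month" }
  ]
}
EOL

# Send it to the /load endpoint (uncomment and supply a valid token):
# curl -G http://localhost:4000/cubejs-api/v1/load \
#   -H 'Authorization: <JWT>' \
#   --data-urlencode "query=$(cat query.json)"
```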

Developer Playground

If we inspect the generated SQL by clicking the SQL button, we'll see the following:

SELECT
  CASE
    WHEN STARTS_WITH(title, "Show HN") THEN 'Show HN'
    WHEN STARTS_WITH(title, "Ask HN") THEN 'Ask HN'
    ELSE 'Other'
  END `stories__category`,
  DATETIME_TRUNC(DATETIME(`stories`.timestamp, 'UTC'), MONTH) `stories__time_month`,
  count(*) `stories__count`
FROM
  (
    select *
    from `fh-bigquery.hackernews.full_partitioned`
    WHERE type = 'story'
  ) AS `stories`
GROUP BY 1, 2
ORDER BY 2 ASC
LIMIT 10000

This SQL shows that the query runs against the raw data inside BigQuery. Now, let's make it run against a pre-aggregated table inside Cube Store. To do that, we'll define a pre-aggregation. Usually this is done inside the same cube, but for the sake of the tutorial, so we can compare the performance with and without the pre-aggregation, let's create a new cube. We can do it in the same file.

Add the following code to the schema/Stories.js file to declare a pre-aggregation and specify which measures and dimensions to include in it:

cube(`StoriesPreAgg`, {
  extends: Stories,

  preAggregations: {
    main: {
      measures: [count],
      dimensions: [category],
      granularity: `month`,
      timeDimension: time,
    },
  },
});

Now, go to the Developer Playground, refresh the page, and select the same measures and dimensions as before (count, category, and time grouped by month), but from the Stories Pre Agg cube. The first time you run this query, Cube.js will build the pre-aggregation and load it into Cube Store; all subsequent requests will go directly to Cube Store. You can inspect the generated SQL, which should look like the following:

SELECT
  `stories_pre_agg__category` `stories_pre_agg__category`,
  `stories_pre_agg__time_month` `stories_pre_agg__time_month`,
  sum(`stories_pre_agg__count`) `stories_pre_agg__count`
FROM
  stb_pre_aggregations.stories_pre_agg_main
GROUP BY 1, 2
ORDER BY 2 ASC
LIMIT 10000

As you can see, the query now goes to the stb_pre_aggregations.stories_pre_agg_main table inside Cube Store. You can play around with filters to see the performance boost of the accelerated query compared to the raw one.
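For instance, a filtered query against the pre-aggregated cube might look like this in the REST API's JSON format (the category value and date range are illustrative; such a query is still served from Cube Store):

```shell
# A Cube.js query that adds a category filter and a date range
# on top of the monthly pre-aggregation.
cat > filtered-query.json << 'EOL'
{
  "measures": ["StoriesPreAgg.count"],
  "timeDimensions": [
    {
      "dimension": "StoriesPreAgg.time",
      "granularity": "month",
      "dateRange": ["2014-01-01", "2015-12-31"]
    }
  ],
  "filters": [
    {
      "member": "StoriesPreAgg.category",
      "operator": "equals",
      "values": ["Ask HN"]
    }
  ]
}
EOL
```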

You can also check out this demo dashboard with multiple charts and compare its performance with and without pre-aggregations. The source code of the example dashboard is available on GitHub.

Happy hacking!
