ClickHouse scalability and power for building data-intensive applications

This article demonstrates how to build analytics dashboards showing summaries of data stored in ClickHouse and how Cube can provide a semantic layer with high-level metrics definitions over raw data as well as a caching layer shielding your ClickHouse instance from repetitive requests by end users.

What is ClickHouse?

ClickHouse is a column-oriented database optimized for online analytical data processing. Because of data compression, storing values in sorted files, and a query engine optimized for dealing with calculating aggregations, it can easily handle trillions of data points in a single table.

Like all column-oriented databases, ClickHouse works best with denormalized data when it doesn’t need to join multiple tables. Additionally, we can use data skipping indexes to speed up data retrieval even more.

However, ClickHouse isn’t only a database. It integrates with various data sources, such as S3, Kafka, and relational databases, so we can ingest data into ClickHouse without using additional tools.

What is Cube?

Cube is the headless BI platform for accessing, organizing, and delivering data. Cube connects to many data warehouses, databases, or query engines, including ClickHouse, and allows you to quickly build data applications or analyze your data in BI tools. It serves as the single source of truth for your business metrics.

In this article, we demonstrate the data modeling layer, the caching layer, and interoperability between Cube’s REST and SQL APIs.

Example: analytics dashboards for a game studio and a high score table for the players

Suppose we work at a mobile game development studio. We have released a new game recently, and we would like to check whether the users like it.

Metrics. Tracking the number of registered users would be easy but definitely not enough. What about people who installed the game, played for one hour, and never ran it again? We want to visualize the user churn to see how many people still play the game after a few weeks since registering for the first time. Of course, we’ll group the users into cohorts. After all, during the first week, dozens of gaming websites wrote about our game. Millions of people installed the app. Are they still playing?

In addition to churn visualization, we want charts of daily, weekly, and monthly active users (DAU, WAU, MAU). Speaking of active users. The marketing team wants to send t-shirts to the most active players every week. We need to figure out who should receive those t-shirts.

Visualization. We must display all of those charts in a dashboard. We don’t know which BI tool or data notebook will be used to build the dashboard. Our data analysts consider Tableau, Deepnote, and Jupyter. We must be flexible and don’t limit their ability to pick the tool.

One more thing. The product manager asked if we could implement a REST API endpoint to populate the high score table with data. The players get one point if they play the game on a given day or two points if they play for at least one hour. Our users can access the high score table to check their position. They do it all the time, and the current implementation keeps crashing. We need to build something better.

architecture diagram

Pipeline. Our game sends a ping to a tracking endpoint every time a user starts playing a level. The ping contains the user identifier, and on the server-side, we add the current time to the event. We have already built an extraction pipeline to load those pings with user data into an S3 bucket.

Right now, our users generate several billions of events every day. The number of players grows constantly, and we hope it will keep growing. Fortunately, ClickHouse can handle such a large amount of data without problems. After all, it was designed to power the second largest web analytics platform in the world. ClickHouse can handle several trillions of data points while generating on the fly reports from non-aggregated data.

By the way, we need all of those features by the end of the week. Don’t panic. We can build all of them with ClickHouse and Cube in no time.

Let us show you how to finish the task quickly and have enough free time to install the game on your phone and play it too 😀

Why does it make sense to use ClickHouse with Cube?

ClickHouse is a blazing fast OLAP database. It can process millions of data points in a few hundred milliseconds (as we will see soon in one of the examples). With ClickHouse, we can prepare an analytics database for quickly accessing the data required to display our dashboards.

However, we don't wanna anyone to access ClickHouse directly. We want to retain control over queries used to generate the dashboards. After all, we have all seen dashboards of user activities where every chart used a slightly different definition of “active user” to make the data look better. Such dashboards looked nice and reassuring, but such mixed definitions are useless when you want to run a successful business. Even if the data presentation mechanism changes, the charts should always display the same data.

Data modeling. Because of that, we will use Cube as an intermediate service between ClickHouse and the dashboards. Cube will be our data modeling layer where we define what the term “active user” means. We will use it also to specify the scoring rules for our high score dashboard.

APIs. We'll also use the rich set of APIs provided by Cube to deliver data to end-user apps: the SQL API to power the internal dashboard (in a BI tool or a data notebook) and the REST API to feed data to the high score table (in a front-end app).

How to setup ClickHouse on EC2

ClickHouse Cloud, built by the creators of ClickHouse, is coming soon. You can even sign up for the private preview waitlist by the link above.

In this tutorial, we will use a single EC2 instance to run the database. First, we have to start the EC2 instance. For that, we need to login to the AWS console and open the EC2 view.

On the EC2 page, we click the “Launch instances” button and configure the instance. We use a t2.large instance with 200 GB of storage space. Our security group allows access to port 8321 from any IP address.

(Later, if you want to limit the access, you can look at the list of IP addresses used by Cube Cloud that is available on the Settings page.)

After starting the instance, we follow the official ClickHouse tutorial to start the server. In the next step, we create two tables to store our data.

Creating the data model

Before we store any data in our ClickHouse database, we need to create the tables. However, this obvious first step differs slightly in the case of ClickHouse because our table definition must include a database engine we want to use. Optionally, we can specify the primary key, partitioning key, and ordering key. We won’t need those parameters in our table definitions.

If you need help deciding which engine you should use, please look at the ClickHouse documentation on table engines. In this tutorial, we will stick to the most universal one: MergeTree.

In the customers table, we want to store the player data. We will group the players by their age, so for our needs, it’s sufficient to store only the customer_id and their age.

create table customers (
  id integer,
  age integer
) ENGINE = MergeTree()

In the activity table, we store the customer_id and the timestamp of the event:

create table activity (
  id integer,
  timestamp DateTime('UTC')
) ENGINE = MergeTree()

Importing data from S3

Suppose we have the data stored in CSV files in an S3-based data lake. We want to copy the events and customer data into ClickHouse directly from the S3 buckets.

To import the data, we will retrieve the content of a file stored in S3 using the S3 integration feature. It allows us to query the S3 files as if they were a table. For example, we can run the following query:

SELECT *
FROM s3(
  'http://s3.eu-central-1.amazonaws.com/BUCKET_NAME/FILE_PATH.csv',
  'AWS_KEY',
  'AWS_SECRET',
  'CSV',
  'id String, age String’
)
LIMIT 5

The result set will be:

id	age
id	age
979863	26
780123	60
614848	45
775046	62

Note that our file contains the header, so we cannot convert the value to numbers while retrieving them. We will need an additional projection for data conversion:

SELECT toInt32OrNull(id) AS id, toInt32OrNull(age) AS age
FROM s3(
  'http://s3.eu-central-1.amazonaws.com/BUCKET_NAME/FILE_PATH.csv',
  'AWS_KEY',
  'AWS_SECRET',
  'CSV',
  'id String, age String’
)
LIMIT 5

id	age
979863	26
780123	60
614848	45
775046	62
991601	23

Finally, we can use the INSERT FROM SELECT feature to populate the ClickHouse tables with data retrieved from S3:

INSERT INTO customers
SELECT toInt32OrNull(id) AS id, toInt32OrNull(age) AS age
FROM s3(
  'http://s3.eu-central-1.amazonaws.com/BUCKET_NAME/FILE_PATH.csv',
  'AWS_KEY',
  'AWS_SECRET',
  'CSV',
  'id String, age String’
)
LIMIT 5

ClickHouse queries that we want to run with Cube

Let’s take a look at the queries we need to run in ClickHouse.

DAU, WAU, and MAU

To calculate the number of daily active users, we need to group the events by date and sum the number of customers who produced events within every time window. Note that we can refer to the alias, defined in the projection section of the SQL query, in the grouping and ordering clauses!

SELECT
  toDate(timestamp) AS date,
  uniq(id) AS DAU
FROM activity
GROUP BY date
ORDER BY date

Example result:

date	DAU
2017-05-16	174
2017-05-17	402
2017-05-18	564
2017-05-19	664
2017-05-20	851

Calculating the weekly active users requires an additional step of converting a given date to the Monday in the corresponding week:

SELECT
  toMonday(timestamp) AS date,
  uniq(id) AS WAU
FROM activity
GROUP BY date
ORDER BY date

date	WAU
2017-05-15	1382
2017-05-22	2478
2017-05-29	5480
2017-06-05	8468
2017-06-12	18630

Similarly, to calculate the monthly active users, we need to group the values by two columns – year and month:

SELECT
  toYear(timestamp) AS year,
  toMonth(timestamp) AS month,
  uniq(id) AS MAU
FROM activity
GROUP BY year, month
ORDER BY year, month

year	month	MAU
2017	5	9340
2017	6	38098
2017	7	56833
2017	8	53981
2017	9	61115

Most active users

Now, we can retrieve a list of the top most active users from our database. This time, we group by the customer identifier and count the number of events:

SELECT id FROM (
  SELECT id, count(*) AS number_of_events
  FROM activity
  GROUP BY id
  ORDER BY number_of_events DESC
  LIMIT 10
)

id
320227
2094597
627176
919150
232551

Churn

Finally, we can switch to more interesting queries. Let’s calculate the churn in cohorts – the number of users who registered in a given week but stopped using the application after X days.

We will start with calculating the number of days the person spent playing our game and the day when they started playing it for the first time:

SELECT
  id,
  min(timestamp) AS joined_timestamp,
  ceil((max(timestamp) - min(timestamp)) / 3600 / 24) AS days_of_activity
FROM activity
GROUP BY id

To get the churn in user cohorts, we will use the query defined above as a subquery in the SQL statement that groups people by their first day and the number of days they spent playing the game:

WITH days_of_activity AS (
  SELECT
    id,
    min(timestamp) AS joined_timestamp,
    ceil((max(timestamp) - min(timestamp)) / 3600 / 24) AS days_of_activity
  FROM activity
  GROUP BY id
)

SELECT
  toMonday(joined_timestamp) AS start,
  days_of_activity,
  count(*) AS number_of_users
FROM days_of_activity
GROUP BY start, days_of_activity
ORDER BY start, days_of_activity

start	days_of_activity	number_of_users
2017-05-15	1	1457
2017-05-15	2	289
2017-05-15	3	173
2017-05-15	4	97
2017-05-15	5	57

We can extend the definition of our user cohorts and group them by the day when they started playing the game and by their age. Let’s define a query that assigns the player to an age group:

SELECT
  id,
  concat(toString(floor(age/10)*10), '_', toString(ceil(age/10)*10)) AS age_group
FROM customers

Now, we can add the age group to our previous query and join the days of activity with the demographic data:

WITH age_groups AS (
  SELECT
    id,
    concat(toString(floor(age/10)*10), '_', toString(ceil(age/10)*10)) AS age_group
    FROM customers
),

days_of_activity AS (
  SELECT
    id,
    min(timestamp) AS joined_timestamp,
    ceil((max(timestamp) - min(timestamp)) / 3600 / 24) AS days_of_activity
  FROM activity
  GROUP BY id
),

joined AS (
  SELECT
    doa.id,
    joined_timestamp,
    days_of_activity,
    age_group
  FROM days_of_activity doa
  JOIN age_groups ag ON doa.id = ag.id
)

SELECT
  age_group,
  toMonday(joined_timestamp) AS start,
  days_of_activity,
  count() AS number_of_users
FROM joined
GROUP BY age_group, start, days_of_activity
ORDER BY age_group, start, days_of_activity

age_group	start	days_of_activity	number_of_users
18_25	2017-05-15	1	591
18_25	2017-05-15	2	113
18_25	2017-05-15	3	98
18_25	2017-05-15	4	56
18_25	2017-05-15	5	31

High scores

In the last query, we calculate the data required for our high score table. We must assign one point to a person who played the game on a given day and two points to a person who played for at least one hour.

Let’s begin with retrieving the identifiers of players who played the game on a given day:

SELECT
  DISTINCT toDate(timestamp) AS day,
  id
FROM activity

Now, we retrieve the players who played for at least one hour:

SELECT
  toDate(timestamp) AS day,
  id,
  min(timestamp) AS start,
  max(timestamp) AS end,
  (end - start) / 3600 AS hours
FROM activity
GROUP BY day, id HAVING hours > 1

In the final step, we merge both datasets and calculate the number of points:

WITH once AS (
  SELECT
    DISTINCT toDate(timestamp) AS day,
    id
  FROM activity
),

at_least_one_hour AS (
  SELECT
    toDate(timestamp) AS day,
    id,
    min(timestamp) AS start,
    max(timestamp) AS end,
    (end - start) / 3600 AS hours
  FROM activity
  GROUP BY day, id
  HAVING hours > 1
),

merged AS (
  SELECT day, id FROM at_least_one_hour
  UNION ALL
  SELECT * FROM once
)

SELECT id, count(*) AS points
FROM merged
GROUP BY id
ORDER BY points DESC

id	points
605793	105
903949	105
598094	104
719487	104
592054	104

Our last query looks quite complex, but it wasn’t a big problem for ClickHouse even when we ran it on underpowered hardware.

When we tested it while running the ClickHouse server on a single machine with two vCPUs and 8 GB of RAM, it needed approximately 5 seconds to process over 400 million rows, aggregate the values, calculate the number of points, and sort the results:

Elapsed: 4.825 sec, read 423.48 million rows

Using Cube to scale ClickHouse concurrent query performance

OLAP solutions are designed to process relatively small number of very complex analytical queries rather than many simple queries at once. ClickHouse can easily handle several hundred simultaneous queries on a single server, but it's not like in every use case you'd want to expose ClickHouse to end users so they can run hundreds or thousands of concurrent queries per second.

When accessing the high score table, millions of our players will send the exact same query. It doesn't make much sense to run the same query many times per second because the results of each run will be practically the same. How can we handle them?

Fortunately, we can use the Cube caching mechanism to scale up the throughput without increasing the costs. Instead of deploying several ClickHouse servers to handle the load, we will store the query results for a few minutes. Our users shouldn’t notice the difference.

Of course, we don’t want to cache the data for internal dashboards. It won’t be a problem because Cube lets us specify the caching configuration separately for every data model.

Connecting Cube to ClickHouse

Let’s create a new Cube instance in Cube Cloud. First, please sign up for a Cube Cloud account with a free tier.

On the next page, we choose the name of our new deployment, the cloud provider, and the region where we want to deploy a new Cube application.

Next, we click the “Create button”...

And select ClickHouse on the page of database connections:

Finally, we specify the connection parameters and click “Apply”:

In the Generate Data Schema view, we can select all existing tables. When we do it, Cube will generate the initial data schema. However, we will use it only as our starting point. Later, we will define a separate schema for every aggregation we want to support.

After generating the schema, we should see the message that Cube deploys a new instance connected to our ClickHouse database.

Implementing ClickHouse queries in Cube

After deploying a Cube application, we open the Schema view and start defining data models.

In the Activity model, we add measurements and dimensions that allow us to build the ClickHouse queries we prepared earlier. To create the daily active users query, we need a way to measure the number of unique users, and we must convert a given timestamp into a date.

Because of that, we add the count_unique property to measures:

    count_unique: {
      sql: `uniq(id)`,
      type: `number`
    }

And the date property to dimensions:

    date: {
      sql: `toDate(timestamp)`,
      type: `time`
    }

When we have both fields added to the schema, we can save the changes in the Schema window:

After saving the changes, we open the Playground window and test our query. In the Measures field, we add the Activity Count Unique variable. In the Dimensions, we put the Activity Date value. Optionally, we can also use the Order form to sort the values by date.

Now, we can add the remaining properties required to retrieve weekly and monthly active users. When we finish, our dimensions property should look like this:

  dimensions: {
    user_id: {
      sql: `id`,
      type: `number`
    },    
    timestamp: {
      sql: `timestamp`,
      type: `time`
    },
    date: {
      sql: `toDate(timestamp)`,
      type: `time`
    },
    start_of_week: {
      sql: `toMonday(timestamp)`,
      type: `time`
    },
    year: {
      sql: `toYear(timestamp)`,
      type: `time`
    },
    month: {
      sql: `toMonth(timestamp)`,
      type: `time`
    }
  }

The properties we have defined so far allow us to retrieve a list of the top active users. For this query, we need to put the Count property in measures and the User Id into dimensions. Additionally, we must specify the limit and sorting configuration:

The churn and high score queries will require more customization. Instead of adding their properties to an existing data model, we will create a separate schema for each of those queries.

Let’s start with the churn. In the Schema view, we must add a new file and put the entire ClickHouse query inside the sql parameter:

We need to add a dimension for every column used in the GROUP BY clause and a measure for each aggregation. In this case, our dimensions and measures look like this:

  measures: {
    number_of_users: {
      sql: `number_of_users`,
      type: `sum`
    }
  },

  dimensions: {
    start_day: {
      sql: `start`,
      type: `time`
    },
    days_of_activity: {
      sql: `days_of_activity`,
      type: `number`
    }
  }

Note that we defined the number_of_users measure as a sum of values. We do it because Cube automatically adds a grouping clause to the query. Because of that, we need an aggregate function in measure. We can safely use sum as such aggregation because it won’t affect the final value.

After saving the changes, we can test our solution in the Playground. Remember to use the sorting feature to make the results user-friendly.

Similarly, we can define a separate schema for the second churn calculator. This time, our query uses data from multiple tables, but we don’t need to define any joins if we put the entire SQL statement into the sql parameter:

cube('churn_by_age', {
  sql: `WITH age_groups AS (
SELECT id, concat(toString(floor(age/10)*10), '_', toString(ceil(age/10)*10)) as age_group FROM customers
), days_of_activity AS (
SELECT id, min(timestamp) as joined_timestamp, ceil((max(timestamp) - min(timestamp)) / 3600 / 24) as days_of_activity FROM activity GROUP BY id
), joined AS (
SELECT doa.id, joined_timestamp, days_of_activity, age_group FROM days_of_activity doa JOIN age_groups ag ON doa.id = ag.id
)
SELECT age_group, toMonday(joined_timestamp) as start, days_of_activity, count() as number_of_users
FROM joined
GROUP BY age_group, start, days_of_activity
ORDER BY age_group, start, days_of_activity`,

  preAggregations: {
  },

  joins: {
  },

  measures: {
    number_of_users: {
      sql: `number_of_users`,
      type: `sum`
    }
  },

  dimensions: {
    age_group: {
      sql: `age_group`,
      type: `string`
    },
    start_day: {
      sql: `start`,
      type: `time`
    },
    days_of_activity: {
      sql: `days_of_activity`,
      type: `number`
    }
  }
});

After saving the changes, we can test it in the Playground view:

Finally, we can prepare a Cube schema for the high score calculation. Again, we copy the entire ClickHouse SQL query into the sql parameter and add the required dimensions and measures:

cube('high_score', {
  sql: `WITH once AS (
SELECT DISTINCT toDate(timestamp) as day, id FROM activity
), at_least_one_hour AS (
SELECT toDate(timestamp) as day, id, min(timestamp) as start, max(timestamp) as end, (end-start) / 3600 as hours FROM activity GROUP BY day, id HAVING hours > 1
), merged AS (
SELECT day, id FROM at_least_one_hour UNION ALL SELECT * FROM once
)
SELECT id, count(*) as points FROM merged GROUP BY id ORDER BY points DESC`,

  preAggregations: {},

  joins: {},

  measures: {
    points: {
      sql: `points`,
      type: `sum`
    }
  },

  dimensions: {
    user_id: {
      sql: `id`,
      type: `number`
    }
  }
});

Now, we can retrieve the high score table in the Playground view:

Adding caching to the highscore table data

As we mentioned at the beginning, we expect the high score table to be requested quite often by many users simultaneously. Instead of sending all queries to the underlying ClickHouse database, we will cache the results for several minutes and return the cached value.

To cache results, we need to configure a pre-aggregation in the high_score model. (Note that in the case of ClickHouse pre-aggregations, we must also define an index.)

  preAggregations: {
    high_score: {
      measures: [ points ],
      dimensions: [ user_id ],
      refreshKey: {
        every: `15 minute`,
      },
      indexes: {
        userIndex: {
          columns: [ CUBE.user_id ],
        },
      },
    }
  },

After adding the pre-aggregation, we have to commit the changes and push them to the Cube repository:

After pushing the changes, we must wait a few minutes until the new Cube application gets deployed in production. When it happens, we can open the “Pre-Aggregations” view.

Let’s go back to the Playground window. We will rerun the high score table query. However, this time, we should see a message saying that Cube accelerated the query with a pre-aggregation:

Retrieving data via the REST API

Finally, we can start using our data.

This tutorial will show you how to access the Cube REST API via the requests library in Python. If you write in JavaScript, don’t worry. The query syntax is the same, no matter what language or library you use.

Let’s open the Overview tab and copy the REST endpoint URL.

To use the API, we will need to add the /load suffix to the URL. For example, if our URL looks like this: https://cubeisgreat.aws-eu-central-1.cubecloudapp.dev/cubejs-api/v1, we must use https://cubeisgreat.aws-eu-central-1.cubecloudapp.dev/cubejs-api/v1/load to connect to the REST API.

After copying the endpoint, we click the “How to connect your application” link and copy the authentication header from the code snippet displayed.

We want to display the high score table on a website. As we have seen before, to get the table, we need to retrieve the “high score points” measure calculated over the dimension “high score user id” and ordered by the number of points in descending order. Let’s define the query as a JSON object and send it to the API.

Our JSON query looks like this:

cube_query = {
  "limit": 50,
  "measures": [
    "high_score.points"
  ],
  "dimensions": [
    "high_score.user_id"
  ],
  "order": {
    "high_score.points": "desc"
  }
}

Here is the entire Python code required to retrieve the high score table:

import requests
import json

url = "https://YOUR_ENDPOINT_URL"
auth_header_value = "AUTH_HEADER"

cube_query = {
  "limit": 50,
  "measures": [
    "high_score.points"
  ],
  "dimensions": [
    "high_score.user_id"
  ],
  "order": {
    "high_score.points": "desc"
  }
}

response = requests.get(
    url,
    params={'query': json.dumps(cube_query)},
    headers={'Authorization': auth_header_value},
)

if response.status_code == 200:
    response_json = response.json()
    high_score_table = response_json['data']

After running the code, our high_score_table variable should contain the user ids and their number of points:

Retrieving data via the SQL API

In the beginning, we decided that our data analysts would create internal dashboards using Tableau, Deepnote, or Jupyter Notebooks. In our example, we will use a Jupyter Notebook. However, we want to be ready to switch to Tableau, so we will use the Cube SQL API instead of REST endpoints.

First, we must enable the SQL API feature in the Cube settings:

After flipping the switch, we must wait a few minutes until the SQL API service gets deployed. When the SQL API is ready to use, we can click the “How to connect your BI tool” link in the Cube Overview tab.

We will see the following window:

We see that the Cube SQL API is compatible with PostgreSQL, so we can access the data using the psycopg2 library in Python. In combination with Pandas read_sql function, it allows us to quickly get all the data we need in our Jupyter notebook.

We will begin by retrieving the high score table again. This time, we use the SQL API:

First, we have to connect to the Cube database:

import pandas as pd
import psycopg2 as pg

def connect_to_cube():
    engine = pg.connect("dbname='db' user='YOUR_USERNAME' host='YOUR_HOST_URL' port='5432' password='YOUR_PASSWORD'")
    return engine

After that, we can retrieve the high score table using the SQL API:

df = pd.read_sql('select * from public.high_score', con=connect_to_cube())

In our example, we want to create a dashboard with charts showing the daily, weekly, and monthly active users. We will use the data schema defined earlier to retrieve the relevant information from Cube. We already have the count_unique measure and date, start_of_week, month, and year dimensions. Let's use them to create charts.

Daily active users:

daily_active_users = pd.read_sql(
    'select date, sum(count_unique) as cnt from public.activity group by 1 order by date asc',
    con=connect_to_cube()
)

daily_active_users.plot(x='date', y='cnt')

Below, we demonstrate the SQL queries for the remaining active user charts and one of the churn charts.

Weekly active users:

pd.read_sql(
    'select start_of_week, sum(count_unique) as cnt from public.activity group by 1 order by start_of_week asc',
    con=connect_to_cube()
)

Monthly active users:

pd.read_sql(
    'select year, month, sum(count_unique) as cnt from public.activity group by 1, 2 order by 1, 2',
    con=connect_to_cube()
)

Finally, let’s retrieve the churn data grouped by the week when the user started playing the game and their age group:

pd.read_sql(
    """select age_group, start_day, days_of_activity, sum(number_of_users)
    from public.churn_by_age group by 1, 2, 3 order by 1, 2, 3""",
    con=connect_to_cube()
)

So, here's the number of users who stopped playing the game after a given number of days_of_activity grouped by the start day and an age group:

What to do next

If you would like to learn more about using the Cube SQL API, please look at our documentation, supported downstream tools, and give it a try by registering a new Cube account.

If you want to try ClickHouse without the burden of setting up EC2 instances, register to the private preview waitlist of ClickHouse Cloud!