This article demonstrates how to build analytics dashboards showing summaries of data stored in ClickHouse and how Cube can provide a semantic layer with high-level metrics definitions over raw data as well as a caching layer shielding your ClickHouse instance from repetitive requests by end users.
What is ClickHouse?
ClickHouse is a column-oriented database optimized for online analytical data processing. Because of data compression, storing values in sorted files, and a query engine optimized for dealing with calculating aggregations, it can easily handle trillions of data points in a single table.
Like all column-oriented databases, ClickHouse works best with denormalized data when it doesn’t need to join multiple tables. Additionally, we can use data skipping indexes to speed up data retrieval even more.
However, ClickHouse isn’t only a database. It integrates with various data sources, such as S3, Kafka, and relational databases, so we can ingest data into ClickHouse without using additional tools.
What is Cube?
Cube is the headless BI platform for accessing, organizing, and delivering data. Cube connects to many data warehouses, databases, or query engines, including ClickHouse, and allows you to quickly build data applications or analyze your data in BI tools. It serves as the single source of truth for your business metrics.
Example: analytics dashboards for a game studio and a high score table for the players
Suppose we work at a mobile game development studio. We have released a new game recently, and we would like to check whether the users like it.
Metrics. Tracking the number of registered users would be easy but definitely not enough. What about people who installed the game, played for one hour, and never ran it again? We want to visualize the user churn to see how many people still play the game after a few weeks since registering for the first time. Of course, we’ll group the users into cohorts. After all, during the first week, dozens of gaming websites wrote about our game. Millions of people installed the app. Are they still playing?
In addition to churn visualization, we want charts of daily, weekly, and monthly active users (DAU, WAU, MAU). Speaking of active users. The marketing team wants to send t-shirts to the most active players every week. We need to figure out who should receive those t-shirts.
Visualization. We must display all of those charts in a dashboard. We don’t know which BI tool or data notebook will be used to build the dashboard. Our data analysts consider Tableau, Deepnote, and Jupyter. We must be flexible and don’t limit their ability to pick the tool.
One more thing. The product manager asked if we could implement a REST API endpoint to populate the high score table with data. The players get one point if they play the game on a given day or two points if they play for at least one hour. Our users can access the high score table to check their position. They do it all the time, and the current implementation keeps crashing. We need to build something better.
Pipeline. Our game sends a ping to a tracking endpoint every time a user starts playing a level. The ping contains the user identifier, and on the server-side, we add the current time to the event. We have already built an extraction pipeline to load those pings with user data into an S3 bucket.
Right now, our users generate several billions of events every day. The number of players grows constantly, and we hope it will keep growing. Fortunately, ClickHouse can handle such a large amount of data without problems. After all, it was designed to power the second largest web analytics platform in the world. ClickHouse can handle several trillions of data points while generating on the fly reports from non-aggregated data.
By the way, we need all of those features by the end of the week. Don’t panic. We can build all of them with ClickHouse and Cube in no time.
Let us show you how to finish the task quickly and have enough free time to install the game on your phone and play it too 😀
Why does it make sense to use ClickHouse with Cube?
ClickHouse is a blazing fast OLAP database. It can process millions of data points in a few hundred milliseconds (as we will see soon in one of the examples). With ClickHouse, we can prepare an analytics database for quickly accessing the data required to display our dashboards.
However, we don't wanna anyone to access ClickHouse directly. We want to retain control over queries used to generate the dashboards. After all, we have all seen dashboards of user activities where every chart used a slightly different definition of “active user” to make the data look better. Such dashboards looked nice and reassuring, but such mixed definitions are useless when you want to run a successful business. Even if the data presentation mechanism changes, the charts should always display the same data.
Data modeling. Because of that, we will use Cube as an intermediate service between ClickHouse and the dashboards. Cube will be our data modeling layer where we define what the term “active user” means. We will use it also to specify the scoring rules for our high score dashboard.
APIs. We'll also use the rich set of APIs provided by Cube to deliver data to end-user apps: the SQL API to power the internal dashboard (in a BI tool or a data notebook) and the REST API to feed data to the high score table (in a front-end app).
How to setup ClickHouse on EC2
ClickHouse Cloud, built by the creators of ClickHouse, is coming soon. You can even sign up for the private preview waitlist by the link above.
In this tutorial, we will use a single EC2 instance to run the database. First, we have to start the EC2 instance. For that, we need to login to the AWS console and open the EC2 view.
On the EC2 page, we click the “Launch instances” button and configure the instance. We use a
t2.large instance with 200 GB of storage space. Our security group allows access to port 8321 from any IP address.
(Later, if you want to limit the access, you can look at the list of IP addresses used by Cube Cloud that is available on the Settings page.)
After starting the instance, we follow the official ClickHouse tutorial to start the server. In the next step, we create two tables to store our data.
Creating the data model
Before we store any data in our ClickHouse database, we need to create the tables. However, this obvious first step differs slightly in the case of ClickHouse because our table definition must include a database engine we want to use. Optionally, we can specify the primary key, partitioning key, and ordering key. We won’t need those parameters in our table definitions.
If you need help deciding which engine you should use, please look at the ClickHouse documentation on table engines. In this tutorial, we will stick to the most universal one:
customers table, we want to store the player data. We will group the players by their age, so for our needs, it’s sufficient to store only the
customer_id and their
activity table, we store the
customer_id and the
timestamp of the event:
Importing data from S3
Suppose we have the data stored in CSV files in an S3-based data lake. We want to copy the events and customer data into ClickHouse directly from the S3 buckets.
To import the data, we will retrieve the content of a file stored in S3 using the S3 integration feature. It allows us to query the S3 files as if they were a table. For example, we can run the following query:
The result set will be:
Note that our file contains the header, so we cannot convert the value to numbers while retrieving them. We will need an additional projection for data conversion:
Finally, we can use the
INSERT FROM SELECT feature to populate the ClickHouse tables with data retrieved from S3:
ClickHouse queries that we want to run with Cube
Let’s take a look at the queries we need to run in ClickHouse.
DAU, WAU, and MAU
To calculate the number of daily active users, we need to group the events by date and sum the number of customers who produced events within every time window. Note that we can refer to the alias, defined in the projection section of the SQL query, in the grouping and ordering clauses!
Calculating the weekly active users requires an additional step of converting a given date to the Monday in the corresponding week:
Similarly, to calculate the monthly active users, we need to group the values by two columns – year and month:
Most active users
Now, we can retrieve a list of the top most active users from our database. This time, we group by the customer identifier and count the number of events:
Finally, we can switch to more interesting queries. Let’s calculate the churn in cohorts – the number of users who registered in a given week but stopped using the application after X days.
We will start with calculating the number of days the person spent playing our game and the day when they started playing it for the first time:
To get the churn in user cohorts, we will use the query defined above as a subquery in the SQL statement that groups people by their first day and the number of days they spent playing the game:
We can extend the definition of our user cohorts and group them by the day when they started playing the game and by their age. Let’s define a query that assigns the player to an age group:
Now, we can add the age group to our previous query and join the days of activity with the demographic data:
In the last query, we calculate the data required for our high score table. We must assign one point to a person who played the game on a given day and two points to a person who played for at least one hour.
Let’s begin with retrieving the identifiers of players who played the game on a given day:
Now, we retrieve the players who played for at least one hour:
In the final step, we merge both datasets and calculate the number of points:
Our last query looks quite complex, but it wasn’t a big problem for ClickHouse even when we ran it on underpowered hardware.
When we tested it while running the ClickHouse server on a single machine with two vCPUs and 8 GB of RAM, it needed approximately 5 seconds to process over 400 million rows, aggregate the values, calculate the number of points, and sort the results:
Using Cube to scale ClickHouse concurrent query performance
OLAP solutions are designed to process relatively small number of very complex analytical queries rather than many simple queries at once. ClickHouse can easily handle several hundred simultaneous queries on a single server, but it's not like in every use case you'd want to expose ClickHouse to end users so they can run hundreds or thousands of concurrent queries per second.
When accessing the high score table, millions of our players will send the exact same query. It doesn't make much sense to run the same query many times per second because the results of each run will be practically the same. How can we handle them?
Fortunately, we can use the Cube caching mechanism to scale up the throughput without increasing the costs. Instead of deploying several ClickHouse servers to handle the load, we will store the query results for a few minutes. Our users shouldn’t notice the difference.
Of course, we don’t want to cache the data for internal dashboards. It won’t be a problem because Cube lets us specify the caching configuration separately for every data model.
Connecting Cube to ClickHouse
Let’s create a new Cube instance in Cube Cloud. First, please sign up for a Cube Cloud account with a free tier.
On the next page, we choose the name of our new deployment, the cloud provider, and the region where we want to deploy a new Cube application.
Next, we click the “Create button”...
And select ClickHouse on the page of database connections:
Finally, we specify the connection parameters and click “Apply”:
In the Generate Data Schema view, we can select all existing tables. When we do it, Cube will generate the initial data schema. However, we will use it only as our starting point. Later, we will define a separate schema for every aggregation we want to support.
After generating the schema, we should see the message that Cube deploys a new instance connected to our ClickHouse database.
Implementing ClickHouse queries in Cube
After deploying a Cube application, we open the Schema view and start defining data models.
Activity model, we add measurements and dimensions that allow us to build the ClickHouse queries we prepared earlier. To create the daily active users query, we need a way to measure the number of unique users, and we must convert a given timestamp into a date.
Because of that, we add the
count_unique property to measures:
date property to dimensions:
When we have both fields added to the schema, we can save the changes in the Schema window:
After saving the changes, we open the Playground window and test our query. In the Measures field, we add the Activity Count Unique variable. In the Dimensions, we put the Activity Date value. Optionally, we can also use the Order form to sort the values by date.
Now, we can add the remaining properties required to retrieve weekly and monthly active users. When we finish, our dimensions property should look like this:
The properties we have defined so far allow us to retrieve a list of the top active users. For this query, we need to put the
Count property in measures and the
User Id into dimensions. Additionally, we must specify the limit and sorting configuration:
The churn and high score queries will require more customization. Instead of adding their properties to an existing data model, we will create a separate schema for each of those queries.
Let’s start with the churn. In the Schema view, we must add a new file and put the entire ClickHouse query inside the sql parameter:
We need to add a dimension for every column used in the GROUP BY clause and a measure for each aggregation. In this case, our dimensions and measures look like this:
Note that we defined the
number_of_users measure as a sum of values. We do it because Cube automatically adds a grouping clause to the query. Because of that, we need an aggregate function in measure. We can safely use sum as such aggregation because it won’t affect the final value.
After saving the changes, we can test our solution in the Playground. Remember to use the sorting feature to make the results user-friendly.
Similarly, we can define a separate schema for the second churn calculator. This time, our query uses data from multiple tables, but we don’t need to define any joins if we put the entire SQL statement into the sql parameter:
After saving the changes, we can test it in the Playground view:
Finally, we can prepare a Cube schema for the high score calculation. Again, we copy the entire ClickHouse SQL query into the sql parameter and add the required dimensions and measures:
Now, we can retrieve the high score table in the Playground view:
Adding caching to the highscore table data
As we mentioned at the beginning, we expect the high score table to be requested quite often by many users simultaneously. Instead of sending all queries to the underlying ClickHouse database, we will cache the results for several minutes and return the cached value.
After adding the pre-aggregation, we have to commit the changes and push them to the Cube repository:
After pushing the changes, we must wait a few minutes until the new Cube application gets deployed in production. When it happens, we can open the “Pre-Aggregations” view.
Let’s go back to the Playground window. We will rerun the high score table query. However, this time, we should see a message saying that Cube accelerated the query with a pre-aggregation:
Retrieving data via the REST API
Finally, we can start using our data.
Let’s open the Overview tab and copy the REST endpoint URL.
To use the API, we will need to add the
/load suffix to the URL. For example, if our URL looks like this:
https://cubeisgreat.aws-eu-central-1.cubecloudapp.dev/cubejs-api/v1, we must use
https://cubeisgreat.aws-eu-central-1.cubecloudapp.dev/cubejs-api/v1/load to connect to the REST API.
After copying the endpoint, we click the “How to connect your application” link and copy the authentication header from the code snippet displayed.
We want to display the high score table on a website. As we have seen before, to get the table, we need to retrieve the “high score points” measure calculated over the dimension “high score user id” and ordered by the number of points in descending order. Let’s define the query as a JSON object and send it to the API.
Our JSON query looks like this:
Here is the entire Python code required to retrieve the high score table:
After running the code, our
high_score_table variable should contain the user ids and their number of points:
Retrieving data via the SQL API
In the beginning, we decided that our data analysts would create internal dashboards using Tableau, Deepnote, or Jupyter Notebooks. In our example, we will use a Jupyter Notebook. However, we want to be ready to switch to Tableau, so we will use the Cube SQL API instead of REST endpoints.
First, we must enable the SQL API feature in the Cube settings:
After flipping the switch, we must wait a few minutes until the SQL API service gets deployed. When the SQL API is ready to use, we can click the “How to connect your BI tool” link in the Cube Overview tab.
We will see the following window:
We see that the Cube SQL API is compatible with PostgreSQL, so we can access the data using the psycopg2 library in Python. In combination with Pandas read_sql function, it allows us to quickly get all the data we need in our Jupyter notebook.
We will begin by retrieving the high score table again. This time, we use the SQL API:
First, we have to connect to the Cube database:
After that, we can retrieve the high score table using the SQL API:
In our example, we want to create a dashboard with charts showing the daily, weekly, and monthly active users. We will use the data schema defined earlier to retrieve the relevant information from Cube. We already have the
count_unique measure and
year dimensions. Let's use them to create charts.
Daily active users:
Below, we demonstrate the SQL queries for the remaining active user charts and one of the churn charts.
Weekly active users:
Monthly active users:
Finally, let’s retrieve the churn data grouped by the week when the user started playing the game and their age group:
So, here's the number of users who stopped playing the game after a given number of
days_of_activity grouped by the start day and an age group:
What to do next
If you want to try ClickHouse without the burden of setting up EC2 instances, register to the private preview waitlist of ClickHouse Cloud!