Getting Started with Data Modeling
Please reach out to us if you’d like to sign up for early access.
A Cube Data Schema is used to model raw data into meaningful business definitions and pre-aggregate data for optimal results. The data schema is exposed through the querying API that allows end-users to query a wide variety of analytical queries without modifying the schema itself.
Let’s use a users table with the following columns as an example:
id | paying | city | company_name |
---|---|---|---|
1 | true | San Francisco | Pied Piper |
2 | true | Palo Alto | Raviga |
3 | true | Redwood | Aviato |
4 | false | Mountain View | Bream-Hall |
5 | false | Santa Cruz | Hooli |
We can start with a set of simple questions about users we want to answer:
- How many users do we have?
- How many paying users?
- What is the percentage of paying users out of the total?
- How many users, paying or not, are from different cities and companies?
We don’t need to write SQL queries for every question, since the data schema allows building well-organized and reusable SQL.
In Cube, cubes are used to organize entities and
connections between entities. Usually one cube is created for each table in the
database, such as users
, orders
, products
, etc. In the sql
parameter of
the cube we define a base table for this cube. In our case, the base table is
simply our users
table.
cube(`Users`, {
sql: `SELECT * FROM users`,
});
Once the base table is defined, the next step is to add measures and dimensions to the cube.
Measures are referred to as quantitative data, such as number of units sold, number of unique visits, profit, and so on.
Dimensions are referred to as categorical data, such as state, gender, product name, or units of time (e.g., day, week, month).
Let's go ahead and create our first measure and two dimensions:
cube(`Users`, {
sql: `SELECT * FROM users`,
measures: {
count: {
sql: `id`,
type: `count`,
},
},
dimensions: {
city: {
sql: `city`,
type: `string`,
},
company_name: {
sql: `company_name`,
type: `string`,
},
},
});
Let's break down the above code snippet piece-by-piece. After defining the base
table for the cube (with the sql
property), we create a count
measure in the
measures
block. The type count
and sql id
means that when this measure will be requested via an API, Cube will generate
and execute the following SQL:
SELECT count(id) from users;
When we apply a city dimension to the measure to see "Where are users based?",
Cube will generate SQL with a GROUP BY
clause:
SELECT city, count(id) from users GROUP BY 1;
You can add as many dimensions as you want to your query when you perform grouping.
Now let's answer the next question – "How many paying users do we have?". To accomplish this, we will introduce measure filters:
cube(`Users`, {
measures: {
count: {
sql: `id`,
type: `count`,
},
paying_count: {
sql: `id`,
type: `count`,
filters: [{ sql: `${CUBE}.paying = 'true'` }],
},
},
});
It is best practice to prefix references to table columns with the name of the
cube or with the CUBE
constant when referencing the current cube's column.
That's it! Now we have the paying_count
measure, which shows only our paying
users. When this measure is requested, Cube will generate the following SQL:
SELECT
count(
CASE WHEN (users.paying = 'true') THEN users.id END
) "users.paying_count"
FROM users
Since the filters
property is an array, you can apply as many filters as
required. paying_count
can be used with dimensions the same way as a simple
count
. We can group paying_count
by city
and companyName
simply by adding
these dimensions alongside measures in the requested query.
To answer "What is the percentage of paying users out of the total?", we need to
calculate the paying users ratio, which is basically paying_count / count
.
Cube makes it extremely easy to perform this kind of calculation. Let's add a
new measure to our cube called paying_percentage
:
cube(`Users`, {
measures: {
count: {
sql: `id`,
type: `count`,
},
paying_count: {
sql: `id`,
type: `count`,
filters: [{ sql: `${CUBE}.paying = 'true'` }],
},
paying_percentage: {
sql: `100.0 * ${paying_count} / ${count}`,
type: `number`,
format: `percent`,
},
},
});
Here we defined a calculated measure paying_percentage
, which divides
paying_count
by count
. This example shows how you can reference measures
inside other measure definitions. When you request the paying_percentage
measure via an API, the following SQL will be generated:
SELECT
100.0 * count(
CASE WHEN (users.paying = 'true') THEN users.id END
) / count(users.id) "users.paying_percentage"
FROM users
As with other measures, paying_percentage
can be used with dimensions.
Did you find this page useful?