Cube’s Open Source Data Modeling Language

An introduction to open source data modeling in Cube’s headless BI platform.

Author avatarArtyom KeydunovApril 21, 2022Deep Dives
Cube’s Open Source Data Modeling Language
Show Original

Previously, we’ve discussed the recent emergence of headless BI, and reviewed headless BI’s four essential components. Now, let’s take a more in-depth look at one of those layers: data modeling. Here is an introduction to open source data modeling in Cube’s headless BI platform.

How data is modeled is the first fundamental element of the business intelligence stack, because it organizes and streamlines querying a data source. Additionally, the data model defines the relationships between each data entity, which forms the basis of all analysis of that data.

Data modeling is a vital component of the BI stack and must be consistent and universally accessible. And so, its logic should be defined only once, and should be defined in code. Furthermore, this code should be governed by version control, to enable safe iteration as an organization grows.

A data modeling language for Headless BI

open source data modeling language for headless BI

Cube’s entire headless BI platform is open source—and this includes our open source data modeling language. This language is used to describe measures, dimensions, and relationships in a data source. Then, Cube servers use the model written in this language to generate SQL queries against data sources when responding to requests made to our REST, GraphQL, and SQL APIs.

Unlike other data modeling languages, Cube’s data modeling language is not tied to any specific interface, visualization, or user experience. This decoupling makes it possible to use the same model definitions across every application. Each data consumer can have a custom-tailored experience in the data application of their choice while still working with consistent and standardized definitions.

The design of Cube’s data modeling language

Cube’s data model is a relational graph of the entities that contain measures and dimensions.

In Cube, entities are called “cubes”—inspired by OLAP cubes. In essence, they are data tables with a semantic meta layer that describes both the cube’s measures and dimensions and its relationships to other cubes.

open source data modeling - Cube

Cube operates on the foundational primitives of dimensions and measures. Users can define metrics by combining these primitives.

Dimensions are properties of entity objects, e.g., a user’s country or a product’s category.

cube(`Products`, {
dimensions: {
category: {
sql: `category`,
type: `string`
}
}
});

Measures are aggregations, typically defined as SQL expressions or algebraic operations over other measures:

cube(`Orders`, {
measures: {
totalValue: {
sql: `value`,
type: `sum`
}
}
});

Finally, cubes are connected to each other and constitute a data graph.

cube(`Orders`, {
joins: {
Products: {
relationship: `belongsTo`
sql: `${Orders}.product_id = ${Products}.id`,
}
}
});

When users construct metrics, the dimensions and measures they query don’t have to belong to the same cube. They can be located in different places on the data graph. Cube will generate appropriate SQL, including taking care of chasm / fan-outs, to join the underlying tables to respond to the request.

open source data modeling - Cube

Cube’s approach is different from a query-oriented approach in which metrics are defined by a custom query and a bunch of attributes are assigned to that query. That approach essentially results in single cubes, each with a single measure and a subset of dimensions and filters. Because these cubes aren’t connected to one another, we can’t reuse dimensions, filters, joins, access controls, caching configuration, etc.

If we were to continue answering more questions and building more metrics, we would have to build more queries, with duplicated definitions and different join paths. At best, this would be repetitive; at worst, it would lead to inconsistent definitions and a combinatorial explosion of metrics and dimensions.

Instead, we’re focused on letting users define relationships and reuse definitions to build maintainable and scalable data models.

Cube’s data modeling language is based on JavaScript. This may come as a surprise, given that many data modeling frameworks and languages are based on YAML. However, as a markup language, YAML has no runtime and only limited capacity for code organization. This is a big limitation for complex and dynamic data models.

By running a data model in a JavaScript virtual machine, Cube can mutate the model at runtime both reactively and proactively, which enables dynamic data definitions, caching, and access control rules.

Cube’s data modeling language also manages caching. Users can define a set of measures and dimensions that are commonly used together, and materialize them as rollup tables. Since the caching rules are defined in code, it’s possible to manage them using best practices including version control, code reviews, and tests in isolated environments.

open source data modeling - Cube

In Cube, data processing and storage are written in Rust, based on the Datafusion query execution framework, which uses Apache Arrow as its in-memory format.

A data modeling example

Let’s take a look at how to use Cube’s data modeling language.

We’ll model data for an imaginary e-commerce website. Assume the data warehouse has tables like orders and users with information about transactions and who placed them.

Orders

idID of the order
user_idID of the user
valueValue of the order in USD
statusStatus of the order, either “processing” or “completed”

Users

idID of the user
countryCountry of the user

We can represent these two tables in Cube’s modeling language as cubes with this configuration:

cube(`Orders`, {
sql: `SELECT * FROM orders`,
measures: {
totalValue: {
sql: `value`,
type: `sum`,
description: 'Revenue from completed orders placed on our website.',
}
},
dimensions: {
id: {
sql: `id`,
type: `number`,
primaryKey: true
},
status: {
sql: `status`,
description: 'Status of the order, either processing or completed',
type: `string`
},
time: {
sql: `timestamp`,
type: `time`
}
}
});
cube(`Users`, {
sql: `select * from users`,
dimensions: {
id: {
sql: `id`,
type: `number`,
primaryKey: true
},
country: {
sql: `country`,
type: `string`,
description: `User's country of residence`
}
}
});

Let’s look at the major elements of the model.

The sql property describes the table of data that will be used in cube, which references the underlying table within the data warehouse. It also can support arbitrary SQL queries in complex cases.

The cubes include defined measures and dimensions:

  • Measures: quantitative data, such as the number of units sold, number of unique visits, revenue, and so forth
  • Dimensions: properties of data entities, such as state, gender, product name, or units of time

To build data relationships, we define the joins property on one of the cubes:

cube(`Orders`, {
joins: {
Users: {
relationship: `belongsTo`
sql: `${Orders}.user_id = ${Users}.id`,
}
}
});

Querying the data model

With the above data model, it’s now possible to query Cube’s APIs to learn total value grouped by country, or a specific company‘s total value from completed orders.

To find out the total value of orders by country, we’d use the following JSON query:

{
measures: [Orders.totalValue],
dimensions: [Users.country]
}

You also can explore the data model from within a BI tool:

querying a Cube data model using Superset

These examples are relatively simple, but a data model and queries can get very complex, with many measures, dimensions, and relationships.

By creating a data model foundation and exposing metrics via many APIs, Cube ensures that data is accessible and consistent for every downstream application.

share this article