# Using Cube with dbt
dbt is a popular data transformation tool used by many Cube users to transform data in the data warehouse before bringing it to the semantic layer.
Usually, dbt projects focus on creating staging and intermediate models and then using them to define normalized data marts for the semantic layer. Often, these marts are described in `.yml` files as models with columns and other properties.
Cube projects usually define cubes on top of dbt models (materialized as tables or views in your data warehouse), bring their columns in as dimensions, and enrich them with measures, joins, and pre-aggregations. Finally, a facade of the data model is created with views and exposed to downstream tools via a rich set of APIs:
```yaml
cubes:
  - name: orders
    sql_table: my_dbt_schema.my_dbt_orders_model

    dimensions: []
    measures: []
    joins: []
    pre_aggregations: []
```
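The facade can then be defined with views on top of such cubes. Here's a minimal sketch, assuming a hypothetical `orders_view` exposing `count` and `status` members of the `orders` cube above:

```yaml
views:
  - name: orders_view

    cubes:
      - join_path: orders
        includes:
          # Hypothetical members of the orders cube
          - count
          - status
```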
Cube provides a convenient `cube_dbt` package to simplify the data model integration.
## Data transformation
### Data types
Cube provides a single `time` type to model time dimensions. However, raw data can contain temporal columns in various formats, e.g., `TIMESTAMP`, `DATETIME`, `DATE`, etc. It's a best practice to convert all temporal columns in dbt models into the `TIMESTAMP` format.
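For example, a dbt model can cast temporal columns as it selects from upstream models. Here's a minimal sketch, assuming a hypothetical `stg_orders` staging model with `created_at` and `updated_at` columns:

```sql
-- models/marts/orders.sql (hypothetical dbt model)
SELECT
  id,
  status,
  -- Normalize all temporal columns to TIMESTAMP
  CAST(created_at AS TIMESTAMP) AS created_at,
  CAST(updated_at AS TIMESTAMP) AS updated_at
FROM {{ ref('stg_orders') }}
```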
### Refresh keys
It's convenient to have an `updated_at` column in every dbt model so that the refresh keys of cubes can be defined as follows:
```sql
SELECT MAX(updated_at)
FROM orders
```
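In a cube definition, such a query goes under the `refresh_key` parameter. Here's a sketch, reusing the `orders` cube from above:

```yaml
cubes:
  - name: orders
    sql_table: my_dbt_schema.my_dbt_orders_model

    refresh_key:
      # Invalidate the cache whenever the most recent update changes
      sql: SELECT MAX(updated_at) FROM orders
```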
When the source data doesn't have a column that can be used to track updates reliably and dbt snapshots are used, the `dbt_valid_from` and `dbt_valid_to` columns can be used to define refresh keys, e.g.:

```sql
SELECT CONCAT(COUNT(*), MAX(dbt_valid_from), MIN(dbt_valid_to))
FROM orders_snapshot
```
### Pre-aggregations
When pre-aggregations are implemented, they are refreshed using their own refresh keys. If applicable, consider configuring a data orchestration tool like Airflow, Dagster, or Prefect to trigger a pre-aggregation refresh via the Orchestration API whenever dbt models are updated.
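For example, a task that runs after `dbt run` completes could call the Orchestration API to kick off pre-aggregation builds. Here's a minimal sketch, assuming a hypothetical deployment URL and API token; check the Orchestration API reference for the exact endpoint and payload:

```python
import requests

# Hypothetical deployment URL and API token
CUBE_API_URL = "https://cube.example.com/cubejs-system/v1"
CUBE_API_TOKEN = "..."

def trigger_pre_aggregation_builds():
    """Ask Cube to rebuild pre-aggregations, e.g., right after dbt models are updated."""
    response = requests.post(
        f"{CUBE_API_URL}/pre-aggregations/jobs",
        headers={"Authorization": CUBE_API_TOKEN},
        json={
            "action": "post",
            "selector": {
                "contexts": [{"securityContext": {}}],
                "timezones": ["UTC"],
            },
        },
    )
    response.raise_for_status()
    return response.json()  # Identifiers of the triggered jobs
```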
## Data model integration
The `cube_dbt` package simplifies defining the data model on top of dbt models. It provides convenient tools for loading the metadata of a dbt project, inspecting dbt models, and rendering them as cubes with dimensions. It's designed to work with dynamic data models built with YAML, Jinja, and Python.
Usually, the integration includes the following steps:
- Load the metadata of your dbt project.
- Select relevant dbt models.
- Render dbt models as cubes with dimensions.
- Enrich these cubes with measures, joins, pre-aggregations, etc.
### Installation
Add the `cube_dbt` package to the `requirements.txt` file in the root directory of your Cube project. See the `cube_dbt` package documentation for details.
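In the simplest case, `requirements.txt` contains a single line (you may want to pin a specific version):

```
cube_dbt
```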
### Loading dbt projects
The `cube_dbt` package loads the metadata of a dbt project from the `manifest.json` file. Any dbt command that parses the project produces this file, including `dbt build` and `dbt run`.
Loading metadata from dbt Cloud might be implemented in future releases of the `cube_dbt` package.
In the most basic scenario, you can commit the `manifest.json` file to the source code of your Cube project and load it as a local file. Here's an example `globals.py` file in your data model folder:
```python
from cube_dbt import Dbt

dbt = Dbt.from_file('./manifest.json')
```
In a less basic scenario, you can put the `manifest.json` file in remote storage and load it by its URL:
```python
from cube_dbt import Dbt

manifest_url = 'https://cube-dbt-integration.s3.amazonaws.com/manifest.json'
dbt = Dbt.from_url(manifest_url)
```
In fact, you can obtain the contents of the `manifest.json` file in any way that works for you, parse the JSON into a dictionary, and load it directly. This is particularly useful when loading `manifest.json` from non-public remote storage using packages like `smart-open` or `boto3`:
```python
import json

from cube_dbt import Dbt
from smart_open import open

manifest_url = 'https://cube-dbt-integration.s3.amazonaws.com/manifest.json'

manifest = None
with open(manifest_url) as file:
    manifest_json = file.read()
    manifest = json.loads(manifest_json)

dbt = Dbt(manifest)
```
### Selecting dbt models
By default, `cube_dbt` loads all dbt models, excluding ephemeral ones. You can also cherry-pick dbt models and load only some of them.
For example, you can load only the models under one or more directories. This is particularly useful if you keep your data marts under `models/marts/`, as advised by dbt:
```python
from cube_dbt import Dbt

manifest_url = 'https://cube-dbt-integration.s3.amazonaws.com/manifest.json'
dbt = Dbt.from_url(manifest_url).filter(paths=['marts/'])
```
You can also load only models with certain tags. This is useful if you'd like more granular control over which dbt models are exposed to the semantic layer:
```python
from cube_dbt import Dbt

manifest_url = 'https://cube-dbt-integration.s3.amazonaws.com/manifest.json'
dbt = Dbt.from_url(manifest_url).filter(tags=['cube'])
```
Finally, you can load only select models by their names:
```python
from cube_dbt import Dbt

manifest_url = 'https://cube-dbt-integration.s3.amazonaws.com/manifest.json'
dbt = Dbt.from_url(manifest_url).filter(names=['orders', 'customers'])
```
### Rendering dbt models
`cube_dbt` provides convenient tools for rendering dbt models in YAML files with Jinja. For that, dbt models should be made accessible from Jinja templates. Here, they are exposed via the `dbt_models` and `dbt_model` functions:
```python
from cube import TemplateContext
from cube_dbt import Dbt

manifest_url = 'https://cube-dbt-integration.s3.amazonaws.com/manifest.json'
dbt = Dbt.from_url(manifest_url).filter(paths=['marts/'])

template = TemplateContext()

@template.function('dbt_models')
def dbt_models():
    return dbt.models

@template.function('dbt_model')
def dbt_model(name):
    return dbt.model(name)
```
You can use the `dbt_models` function in a Jinja template to iterate over all models and output them as YAML. Note the convenient `as_cube` and `as_dimensions` methods that render each model with the correct `name` and `sql_table` and also render each column as a dimension:
```yaml
cubes:
  {% for model in dbt_models() %}
  - {{ model.as_cube() }}

    dimensions:
      {{ model.as_dimensions(skip=['very_private_column']) }}

    measures:
      - name: count
        type: count
  {% endfor %}
```
`cube_dbt` does its best to translate the `data_type` properties of columns into correct dimension types. Also, if a column is tagged as `primary_key` in its properties, it automatically becomes a primary key dimension. You can also prevent select columns from being rendered as dimensions with `skip`.
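For example, a dbt properties file might tag such a column like this. This is a sketch with a hypothetical `orders` model; see the `cube_dbt` documentation for the exact convention:

```yaml
models:
  - name: orders
    columns:
      - name: id
        data_type: numeric
        tags:
          # Rendered as a primary key dimension by cube_dbt
          - primary_key
      - name: status
        data_type: varchar
```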
You can even iterate over columns on your own and render them as you wish:
```yaml
cubes:
  {% for model in dbt_models() %}
  - {{ model.as_cube() }}

    dimensions:
      {% for column in model.columns %}
      - name: "{{ column.name }}"
        sql: "{{ column.sql }}"
        type: "{{ column.type }}"
        description: "{{ column.description }}"
        meta:
          source: dbt
      {% endfor %}

    measures:
      - name: count
        type: count
  {% endfor %}
```
However, in a real-world scenario, iterating over all dbt models and rendering them in a single YAML file is inconvenient since it would be difficult to enrich each rendered cube with its own measures, joins, etc.
### Enriching the data model
It's best practice to render each dbt model (or data mart) as a cube in its own file. This simplifies Jinja templates and improves the maintainability of the data model. Consider the following cube:
```yaml
{% set model = dbt_model('orders') %}

cubes:
  - {{ model.as_cube() }}

    dimensions:
      {{ model.as_dimensions() }}

    # Model-specific measures
    measures:
      - name: count
        type: count

      - name: statuses
        sql: "STRING_AGG({status})"
        type: string

    # Model-specific joins
    joins:
      - name: users
        sql: "{CUBE.user_id} = {users.id}"
        relationship: many_to_one

    # Model-specific pre-aggregations
    pre_aggregations:
      - name: count_by_users
        measures:
          - CUBE.count
        dimensions:
          - users.full_name
```
You can easily customize the measures, joins, and pre-aggregations that apply to this very cube, mapped to the `orders` dbt model, and have it side by side with other cubes, even if they don't come from dbt.
To inspect rendered cubes, you can use Data Graph, Playground, or the `meta` endpoint of the REST API.
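For example, here's a quick sketch of fetching the rendered data model over the REST API with `requests`, assuming a hypothetical deployment URL and API token:

```python
import requests

# Hypothetical deployment URL and API token
CUBE_API_URL = "https://cube.example.com/cubejs-api/v1"
CUBE_API_TOKEN = "..."

# Fetch the metadata of all cubes and views in the rendered data model
response = requests.get(
    f"{CUBE_API_URL}/meta",
    headers={"Authorization": CUBE_API_TOKEN},
)
response.raise_for_status()

for cube in response.json()["cubes"]:
    print(cube["name"])
```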