Syntax
Entities within the data model (e.g., cubes, views, etc.) should be placed under
the model
folder, follow naming
conventions, and be defined using a supported
syntax.
Folder structure
Data model files should be placed inside the model
folder. You can use the
schema_path
configuration option to override the
folder name or the repository_factory
configuration
option to dynamically define the folder name and
data model file contents.
It's recommended to place each cube or view in a separate file, in model/cubes
and model/views
folders, respectively. Example:
model
├── cubes
│ ├── orders.yml
│ ├── products.yml
│ └── users.yml
└── views
└── revenue.yml
Model syntax
Cube supports two ways to define data model files: with YAML (opens in a new tab) or
JavaScript syntax. YAML data model files should have the .yml
extension,
whereas JavaScript data model files should end with .js
. You can mix YAML and
JavaScript files within a single data model.
cubes:
- name: orders
sql: >
SELECT *
FROM orders, line_items
WHERE orders.id = line_items.order_id
You can define the data model statically or build dynamic data models programmatically. YAML data models use Jinja and Python whereas JavaScript data models use JavaScript.
It is recommended to default to YAML syntax because of its simplicity and readability. However, JavaScript might provide more flexibility for dynamic data modeling.
Naming
Common rules apply to names of entities within the data model. All names must:
- start with a letter
- consist of letters, numbers, and underscore (
_
) symbols only
It is also recommended that names use snake case (opens in a new tab).
Good examples of names:
orders
,stripe_invoices
, orbase_payments
(cubes)opportunities
,cloud_accounts
, orarr
(views)count
,avg_price
, ortotal_amount_shipped
(measures)name
,is_shipped
, orcreated_at
(dimensions)main
,orders_by_status
, orlambda_invoices
(pre-aggregations)
SQL expressions
When defining cubes, you would often provide SQL snippets in sql
and
sql_table
parameters.
Provided SQL expressions should match your database SQL dialect, e.g.,
to aggregate a list of strings, you would probably pick the LISTAGG
function (opens in a new tab) in Snowflake and the STRING_AGG
function (opens in a new tab) in BigQuery.
cubes:
- name: orders
sql_table: orders
measures:
- name: statuses
sql: "STRING_AGG(status)"
type: string
dimensions:
- name: status
sql: "UPPER(status)"
type: string
User-defined functions
If you have created a user-defined function (opens in a new tab) (UDF) in your data
source, you can use it in the sql
parameter as well.
Case sensitivity
If your database uses case-sensitive identifiers, make sure to properly quote table and column names. For example, here's how you can reference a Postgres table that contains uppercase letters in its name:
cubes:
- name: orders
sql_table: 'public."Orders"'
References
To write versatile data models, it is important to be able to reference members of cubes and views, such as measures or dimensions, as well as table columns. Cube supports the following syntax for references.
column
Most commonly, you would use bare column names in the sql
parameter of
measures or dimensions. In the following example, name
references
the respective column of the users
table.
cubes:
- name: users
sql_table: users
dimensions:
- name: name
sql: name
type: string
This syntax works great for simple use cases. However, if your cubes have joins and joined cubes have columns with the same name, the generated SQL query might become ambiguous. See below how to work around that.
{member}
When defining measures and dimensions, you can also reference other members
of the same cube by wrapping their names in curly braces. In the following
example, the full_name
dimension references name
and surname
dimensions
of the same cube.
cubes:
- name: users
sql_table: users
dimensions:
- name: name
sql: name
type: string
- name: surname
sql: "UPPER(surname)"
type: string
- name: full_name
sql: "CONCAT({name}, ' ', {surname})"
type: string
This syntax works great for simple use cases. However, there are cases (like subquery) when you'd like to reference members of other cubes. See below how to do that.
{time_dimension.granularity}
When referencing a time dimension, you can specificy a
granularity to refer to a time value with that specific granularity. It can be
one of the default granularities (e.g., year
or
week
) or a custom granularity:
cubes:
- name: users
sql_table: users
dimensions:
- name: created_at
sql: created_at
type: time
granularities:
- name: sunday_week
interval: 1 week
offset: -1 day
- name: created_at__year
sql: "{created_at.year}"
type: time
- name: created_at__sunday_week
sql: "{created_at.sunday_week}"
type: time
{cube}.column
, {cube.member}
You can qualify column and member names with the name of a cube to remove the ambiguity when cubes are joined and reference members of other cubes.
cubes:
- name: users
sql_table: users
joins:
- name: contacts
sql: "{users}.contact_id = {contacts.id}"
relationship: one_to_one
dimensions:
- name: id
sql: "{users}.id"
type: number
primary_key: true
- name: name
sql: "COALESCE({users.name}, {contacts.name})"
type: string
- name: contacts
sql_table: contacts
dimensions:
- name: id
sql: "{contacts}.id"
type: number
primary_key: true
- name: name
sql: "{contacts}.name"
type: string
In production, using fully-qualified names is generally encouraged since it removes the ambiguity and keeps data model code maintainable as it grows. However, always referring to the current cube by its name leads to code repetition and violates the DRY principle. See below how to solve that.
{CUBE}
variable
You can use a handy {CUBE}
context variable
(mind the uppercase) to reference the current cube so you don't have to
repeat the its name over and over. It works both for column and member
references.
cubes:
- name: users
sql_table: users
joins:
- name: contacts
sql: "{CUBE}.contact_id = {contacts.id}"
relationship: one_to_one
dimensions:
- name: id
sql: "{CUBE}.id"
type: number
primary_key: true
- name: name
sql: "COALESCE({CUBE.name}, {contacts.name})"
type: string
- name: contacts
sql_table: contacts
dimensions:
- name: id
sql: "{CUBE}.id"
type: number
primary_key: true
- name: name
sql: "{CUBE}.name"
type: string
Check the {users.name}
dimension. Referencing another cube in the
dimension definition instructs Cube to make an implicit join to that cube.
For example, using the data model above, we can make the following query:
{
"dimensions": ["users.name"]
}
The resulting generated SQL query would look like this:
SELECT COALESCE("users".name, "contacts".name) "users__name"
FROM users "users"
LEFT JOIN contacts "contacts"
ON "users".contact_id = "contacts".id
{cube.sql()}
function
When defining a cube, you can reference the sql
parameter of another cube,
effectively reusing the SQL query it's defined on. This is particularly useful
when defining polymorphic cubes or using data blending.
Consider the following data model:
cube(`organisms`, {
sql_table: `organisms`
})
cube(`animals`, {
sql: `
SELECT *
FROM ${organisms.sql()}
WHERE kingdom = 'animals'
`
})
cube(`dogs`, {
sql: `
SELECT *
FROM ${animals.sql()}
WHERE species = 'dogs'
`,
measures: {
count: {
type: `count`
}
}
})
If you query for dogs.count
, Cube will generate the following SQL:
SELECT count(*) "dogs__count"
FROM (
SELECT *
FROM (
SELECT *
FROM organisms
WHERE kingdom = 'animals'
)
WHERE species = 'dogs'
) AS "dogs"
Currently, {cube.sql()}
function is only supported in JavaScript data models.
It is not supported in YAML data models.
Please track this issue (opens in a new tab).
As a workaround, you can use JavaScript data models, put a SQL query in a Jinja variable, or load it from the template context.
Non-SQL references
Outside SQL expressions, column
is not recognized
as a column name; it is rather recognized as a member name. It means that,
outside sql
and sql_table
parameters, you can skip the curly braces and
reference members by their names directly: member
, cube_name.member
, or
CUBE.member
.
cubes:
- name: orders
sql_table: orders
dimensions:
- name: status
sql: status
type: string
measures:
- name: count
type: count
pre_aggregations:
- name: orders_by_status
dimensions:
- CUBE.status
measures:
- CUBE.count
Context variables
In addition to the CUBE
variable, you can also use a few more context
variables within your data model. They are
generally useful for two purposes: optimizing generated SQL queries and
defining dynamic data models.