Joins
The joins
parameter declares a block to define relationships between cubes. It
allows users to access and compare fields from two or more cubes at the same
time.
cubes:
- name: my_cube
# ...
joins:
- name: target_cube
relationship: one_to_one || one_to_many || many_to_one
sql: SQL ON clause
All joins are generated as LEFT JOIN
. The cube which defines the join serves
as a main table, and any cubes referenced inside the joins
property are used
in the LEFT JOIN
clause. Learn more about direction of joins
here.
The semantics of INNER JOIN
can be achieved with additional filtering. For
example, a simple check of whether the column value IS NOT NULL
by using set
filter satisfies this requirement.
There's also no way to define FULL OUTER JOIN
and RIGHT OUTER JOIN
for the
sake of join modeling simplicity. To get RIGHT OUTER JOIN
semantics just
define join from other side of relationship.
The FULL OUTER JOIN
can be built inside cube sql
parameter. Quite frequently, FULL OUTER JOIN
is used to solve Data
Blending or similar problems. In that case, it's best
practice to have a separate cube for such an operation.
Parameters
name
The name must match the name of the joined cube and, thus, follow the naming conventions.
For example, when the products
cube is joined on to the orders
cube, we
would define the join as follows:
cubes:
- name: orders
# ...
joins:
- name: products
relationship: many_to_one
sql: "{CUBE.id} = {products.order_id}"
relationship
The relationship
property is used to describe the type of the relationship
between joined cubes. It’s important to properly define the type of relationship
so Cube can accurately calculate measures.
The cube that declares the join is considered left in terms of the left join (opens in a new tab) semantics, and the joined cube is considered right. It means that all rows of the left cube are selected, while only those rows of the right cube that match the condition are selected as well. For more information and specific examples, please see join directions.
The join does not need to be defined on both cubes, but the definition can affect the join direction.
You can use the following types of relationships:
one_to_one
for one-to-one (opens in a new tab) relationshipsone_to_many
for one-to-many (opens in a new tab) relationshipsmany_to_one
for the opposite of one-to-many (opens in a new tab) relationships
The types of relationships listed above were introduced in v0.32.19 for clarity as they are commonly used in the data space. The following aliases were used before and are still valid, so there's no need to update existing data models:
one_to_one
was known ashas_one
orhasOne
one_to_many
was known ashas_many
orhasMany
many_to_one
was known asbelongs_to
orbelongsTo
One-to-one
The one_to_one
type indicates a one-to-one (opens in a new tab) relationship between
the declaring cube and the joined cube. It means that one row in the declaring
cube can match only one row in the joined cube.
For example, in a data model containing users
and their profiles
, the
users
cube would declare the following join:
cubes:
- name: users
# ...
joins:
- name: profiles
relationship: one_to_one
sql: "{users}.id = {profiles.user_id}"
One-to-many
The one_to_many
type indicates a one-to-many (opens in a new tab) relationship between
the declaring cube and the joined cube. It means that one row in the declaring
cube can match many rows in the joined cube.
For example, in a data model containing authors
and the books
they have
written, the authors
cube would declare the following join:
cubes:
- name: authors
# ...
joins:
- name: books
relationship: one_to_many
sql: "{authors}.id = {books.author_id}"
Many-to-one
The many_to_one
type indicates the many-to-one relationship between the
declaring cube and the joined cube. You’ll often find this type of relationship
on the opposite side of the one-to-many (opens in a new tab) relationship. It means that
one row in the declaring cube matches a single row in the joined cube, while a
row in the joined cube can match many rows in the declaring cube.
For example, in a data model containing orders
and customers
who made them,
the orders
cube would have the following join:
cubes:
- name: orders
# ...
joins:
- name: customers
relationship: many_to_one
sql: "{orders}.customer_id = {customers.id}"
sql
sql
is necessary to indicate a related column between cubes. It is important
to properly specify a matching column when creating joins. Take a look at the
example below:
cubes:
- name: orders
# ...
joins:
- name: customers
relationship: many_to_one
sql: "{orders}.customer_id = {customers.id}"
Setting a primary key
In order for a join to work, it is necessary to define a primary_key
as
specified below. It is a requirement when a join is defined so that Cube can
handle row multiplication issues such as chasm and fan traps.
Let's imagine you want to calculate Order Amount
by Order Item Product Name
.
In this case, Order
rows will be multiplied by the Order Item
join due to
the one_to_many
relationship. In order to produce correct results, Cube will
select distinct primary keys from Order
first and then will join these primary
keys with Order
to get the correct Order Amount
sum result. Please note that
primary_key
should be defined in the dimensions
section.
cubes:
- name: orders
# ...
dimensions:
- name: customer_id
sql: id
type: number
primary_key: true
Setting primary_key
to true
will change the default value of the public
parameter to false
. If you still want public
to be true
— set it manually.
cubes:
- name: orders
# ...
dimensions:
- name: customer_id
sql: id
type: number
primary_key: true
public: true
If you don't have a single column in a cube's table that can act as a primary key, you can create a composite primary key as shown below.
The example uses Postgres string concatenation; note that SQL may be different depending on your database.
cubes:
- name: users
# ...
dimensions:
- name: id
sql:
"{CUBE}.user_id || '-' || {CUBE}.signup_week || '-' ||
{CUBE}.activity_week"
type: string
primary_key: true
Chasm and fan traps
Cube automatically detects chasm and fan traps based on the many_to_one
and one_to_many
relationships defined in join.
When detected, Cube generates a deduplication query that evaluates all distinct primary keys within the multiplied measure's cube and then joins distinct primary keys to this cube on itself to calculate the aggregation result.
If there's more than one multiplied measure in a query, then such query is generated for every such multiplied measure, and results are joined.
Cube solves for chasm and fan traps during query time.
If there's pre-aggregregation that fits measure multiplication requirements it'd be leveraged to serve such a query.
Such pre-aggregations and queries are always considered non-additive for the purpose of pre-aggregation matching.
Let's consider an example data model:
cubes:
- name: orders
sql_table: orders
dimensions:
- name: id
sql: id
type: number
primary_key: true
- name: city
sql: city
type: string
joins:
- name: customers
relationship: many_to_one
sql: "{orders}.customer_id = {customers.id}"
- name: customers
sql_table: customers
dimensions:
- name: id
sql: id
type: number
primary_key: true
measures:
- name: average_age
sql: age
type: avg
If we try to query customers.average_age
by orders.city
, the Cube detects that the average_age
measure in the customers
cube would be multiplied by orders
to customers
and would generate SQL similar to:
SELECT
"keys"."orders__city",
avg("customers_key__customers".age) "customers__average_age"
FROM
(
SELECT
DISTINCT "customers_key__orders".city "orders__city",
"customers_key__customers".id "customers__id"
FROM
orders AS "customers_key__orders"
LEFT JOIN customers AS "customers_key__customers" ON "customers_key__orders".customer_id = "customers_key__customers".id
) AS "keys"
LEFT JOIN customers AS "customers_key__customers" ON "keys"."customers__id" = "customers_key__customers".id
GROUP BY
1
CUBE reference
When you have several joined cubes, you should accurately use columns’ names to
avoid any mistakes. One way to make no mistakes is to use the CUBE
reference.
It allows you to specify columns’ names in cubes without any ambiguity. During
the implementation of the query, this reference will be used as an alias for a
basic cube. Take a look at the following example:
cubes:
- name: users
# ...
dimensions:
- name: name
sql: "{CUBE}.name"
type: string
Transitive joins
Join graph is directed and a → b
join is different from b → a
. Learn more
about it here.
Cube automatically takes care of transitive joins. For example, consider the following data model:
cubes:
- name: a
# ...
joins:
- name: b
sql: "{a}.b_id = {b.id}"
relationship: many_to_one
measures:
- name: count
type: count
- name: b
# ...
joins:
- name: c
sql: "{b}.c_id = {c.id}"
relationship: many_to_one
- name: c
# ...
dimensions:
- name: category
sql: category
type: string
Assume that the following query is run:
{
"measures": ["a.count"],
"dimensions": ["c.category"]
}
Joins a → b
and b → c
will be resolved automatically. Cube uses the
Dijkstra algorithm (opens in a new tab) to find a join path between cubes given
requested members.
In case there are multiple join paths that can be used to join the same set of cubes, Cube will collect cube names from members in the following order:
- Measures
- Dimensions
- Segments
- Time dimensions
Cube makes join trees as predictable and stable as possible, but this isn't guaranteed in case multiple join paths exist. Please use views to address join predictability and stability.