Working with Joins
A join creates a relationship between two cubes in your Cube project. Cube supports three types of join relationships often found in SQL databases:
one_to_one
one_to_many
many_to_one
To use an example, let's use two cubes, customers
and orders
:
cubes:
- name: customers
# ...
dimensions:
- name: id
sql: id
type: number
primary_key: true
- name: company
sql: company
type: string
- name: Orders
# ...
dimensions:
- name: id
sql: id
type: number
primary_key: true
- name: customer_id
sql: customer_id
type: number
We could add a join to the customers
cube:
cubes:
- name: customers
# ...
joins:
- name: orders
relationship: one_to_many
sql: "{CUBE}.id = {orders.customer_id}"
The join above means a customer has many orders. Let's send the following JSON query:
{
"dimensions": ["orders.status", "customers.company"],
"measures": ["orders.count"],
"timeDimensions": [
{
"dimension": "orders.created_at"
}
],
"order": { "customers.company": "asc" }
}
The query above will generate the following SQL:
SELECT
"orders".status "orders__status",
"customers".company "customers__company",
count("orders".id) "orders__count"
FROM
public.customers AS "customers"
LEFT JOIN public.orders AS "orders"
ON "customers".id = "orders".customer_id
GROUP BY 1, 2
ORDER BY 2 ASC
LIMIT 10000
However, if we have guest checkouts, that would mean we would have orders with
no matching customer. Looking back at the one_to_many
relationship and its'
resulting SQL, any guest checkouts would be excluded from the results. To remedy
this, we'll remove the join from the customers
cube and instead define a join
with a many_to_one
relationship on the orders
cube:
cubes:
- name: orders
# ...
joins:
- name: customers
relationship: many_to_one
sql: "{CUBE}.customer_id = {customers.id}"
In the above data model, our orders
cube defines the relationship between
itself and the customer
cube. The same JSON query now results in the following
SQL query:
SELECT
"orders".status "orders__status",
"customers".company "customers__company",
count("orders".id) "orders__count"
FROM
public.orders AS "orders"
LEFT JOIN public.customers AS "customers"
ON "orders".customer_id = "customers".id
GROUP BY 1, 2
ORDER BY 2 ASC
LIMIT 10000
As we can see, the base table in the query is orders
, and customers
is in
the LEFT JOIN
clause; this means any orders without a customer will also be
retrieved.
In Cube, joins only need to be defined from one direction. In the above example,
we explicitly removed the one_to_many
relationship from the customer
cube;
not doing so would cause the query to fail as Cube would be unable to determine
a valid join path. Click here to learn more about how the
direction of joins affects query results.
Many-to-many joins
A many-to-many relationship occurs when multiple records in a cube are associated with multiple records in another cube.
For example, let's say we have two cubes, topics
and posts
, pointing to the
topics
and posts
tables in our database, respectively. A post
can have
more than one topic
, and a topic
may have more than one post
.
In a database, you would most likely have an associative table (also known as a
junction table or cross-reference table). In our example, this table name might
be post_topics
.
You can jump to this section if you don't have an associative table in your database.
The diagram below shows the tables posts
, topics
, post_topics
, and their
relationships.
In the same way the post_topics
table was specifically created to handle this
association in the database, we need to create an associative cube
post_topics
, and declare the relationships from it to topics
cube and from
posts
to post_topics
.
The following example uses the one_to_many
relationship on the post_topics
cube; this causes the direction of joins to be posts → post_topics → topics
.
Read more about direction of joins here.
cubes:
- name: posts
sql_table: posts
joins:
- name: post_topics
relationship: many_to_one
sql: "{CUBE}.id = {post_topics.post_id}"
- name: topics
sql_table: topics
dimensions:
- name: post_id
sql: id
type: string
primary_key: true
- name: post_topics
sql_table: post_topics
joins:
- name: topic
relationship: one_to_many
sql: "{CUBE}.topic_id = {topics.id}"
dimensions:
- name: post_id
sql: post_id
type: string
In scenarios where a table doesn't define a primary key, one can be generated using SQL:
cubes:
- name: post_topics
# ...
dimensions:
- name: id
sql: "CONCAT({CUBE}.post_id, {CUBE}.topic_id)"
type: number
primary_key: true
Using virtual associative cube
Sometimes there is no associative table in the database, when in reality, there is a many-to-many relationship. In this case, the solution is to extract some data from existing tables and create a virtual (not backed by a real table in the database) associative cube.
Let’s consider the following example. We have tables emails
and
transactions
. The goal is to calculate the amount of transactions per
campaign. Both emails
and transactions
have a campaign_id
column. We don’t
have a campaigns table, but data about campaigns is part of the emails
table.
Let’s take a look at the emails
cube first:
cubes:
- name: emails
sql_table: emails
measures:
- name: count
type: count
dimensions:
- name: id
sql: id
type: number
primary_key: true
- name: campaign_name
sql: campaign_name
type: string
- name: campaign_id
sql: campaign_id
type: number
We can extract campaigns data into a virtual campaigns
cube:
cubes:
- name: campaigns
sql: >
SELECT
campaign_id,
campaign_name,
customer_name,
MIN(created_at) AS started_at
FROM emails GROUP BY 1, 2, 3
measures:
- name: count
type: count
dimensions:
- name: id
sql: campaign_id
type: string
primary_key: true
- name: name
sql: campaign_name
type: string
The following diagram shows our data model with the Campaigns
cube:
The last piece is to finally declare a many-to-many relationship. This should be
done by declaring a one_to_many
relationship on the associative cube,
campaigns
in our case.
cubes:
- name: emails
sql_table: emails
joins:
- name: campaigns
relationship: many_to_one
sql: >
{CUBE}.campaign_id = {campaigns.id} AND {CUBE}.customer_name =
{campaigns.customer_name}
measures:
- name: count
type: count
dimensions:
- name: id
sql: id
type: number
primary_key: true
- name: campaign_name
sql: campaign_name
type: string
- name: campaign_id
sql: campaign_id
type: number
- name: campaigns
joins:
- name: transactions
relationship: one_to_many
sql: >
{CUBE}.customer_name = {transactions.customer_name} AND
{CUBE}.campaign_id = {transactions.campaign_id}
dimensions:
- name: id
sql: id
type: number
primary_key: true
- name: customer_name
sql: customer_name
type: string
Directions of joins
The direction of joins greatly influences the final result set. It can be explicitly controlled on a view level.
As an example, let's take two cubes, orders
and customers
:
cubes:
- name: orders
sql_table: orders
measures:
- name: count
sql: id
type: count
- name: total_revenue
sql: revenue
type: sum
dimensions:
- name: id
sql: id
type: number
primary_key: true
- name: customer_id
sql: customer_id
type: number
- name: customers
sql_table: customers
measures:
- name: count
sql: id
type: count
dimensions:
- name: id
sql: id
type: number
primary_key: true
With the given data model, we have two valid analytics use cases that require different join directions.
The first case is to calculate the total revenue per customer. To do this,
we'll use the total_revenue
measure that is defined on the orders cube. We
need to be aware that orders can be placed without customer registration
(anonymous customers/guest checkouts). Therefore, we should start the join from
the orders
cube onto the customers
cube to ensure that we do not lose data
from anonymous orders.
views:
- name: total_revenue_per_customer
description: Total revenue per customer
cubes:
- join_path: orders
includes:
- total_revenue
- created_at
- join_path: orders.customers
includes:
- company
We can query this view as follows:
{
"dimensions": ["total_revenue_per_customer.company"],
"measures": ["total_revenue_per_customer.total_revenue"],
"timeDimensions": [
{
"dimension": "total_revenue_per_customer.created_at"
}
]
}
The second case is to find customers who have not placed any orders. We will use the count measure on the customers cube for that. In this case, we should join the customers cube with the orders cube to find customers with zero orders placed. The reverse order of joins would result in a dataset without data for customers with no orders.
views:
- name: customers_without_orders
description: Customers without orders
cubes:
- join_path: customers
includes:
- company
- join_path: customers.orders
prefix: true
includes:
- created_at
- count
We can then query the cube as follows:
{
"dimensions": ["customers_without_orders.company"],
"timeDimensions": [
{
"dimension": "customers_without_orders.orders_created_at"
}
],
"filters": [
{
"member": "customers_without_orders.orders_count",
"operator": "equals",
"values": ["0"]
}
]
}
Transitive join pitfalls
Let's consider an example where we have a many-to-many relationship between
users
and companies
through the companies_to_users
cube:
cubes:
- name: users
sql: >
SELECT 1 AS id, 'Ali' AS name UNION ALL
SELECT 2 AS id, 'Bob' AS name UNION ALL
SELECT 3 AS id, 'Eve' AS name
measures:
- name: count
type: count
dimensions:
- name: id
sql: id
type: string
primary_key: true
- name: companies
sql: >
SELECT 11 AS id, 'Acme Corporation' AS name UNION ALL
SELECT 12 AS id, 'Stark Industries' AS name
dimensions:
- name: id
sql: id
type: string
primary_key: true
- name: name
sql: name
type: string
- name: companies_to_users
sql: >
SELECT 11 AS company_id, 1 AS user_id UNION ALL
SELECT 11 AS company_id, 2 AS user_id UNION ALL
SELECT 12 AS company_id, 3 AS user_id
joins:
- name: users
sql: "{CUBE}.user_id = {users.id}"
relationship: one_to_many
- name: companies
sql: "{CUBE}.company_id = {companies.id}"
relationship: one_to_many
dimensions:
- name: id
# Joins require a primary key, so we'll create one on-the-fly
sql: "CONCAT({CUBE}.user_id, ':', {CUBE}.company_id)"
type: string
primary_key: true
With this data model, querying users.count
as a measure and companies.name
as a dimension would yield the following error: Can't find join path to join 'users', 'companies'
.
The root cause is that joins are directed and there's no
join path that goes by join definitions in the data model from users
to
companies
or in the opposite direction.
In Data Graph, you can see that both users
and companies
are to the right of companies_to_users
, meaning that there's no way to go
from users
to companies
moving left to right or right to left:
One possible solution is to move one of two joins from the companies_to_users
cube to either users
or companies
cube. Please note that it would affect
the query semantics and thus the final result:
cubes:
- name: users
joins:
- name: companies_to_users
sql: "{CUBE}.id = {companies_to_users}.user_id"
relationship: one_to_many
# ...
- name: companies_to_users
joins:
# - name: users
# sql: "{CUBE}.user_id = {users.id}"
# relationship: one_to_many
- name: companies
sql: "{CUBE}.company_id = {companies.id}"
relationship: one_to_many
# ...
Now there's a join path that goes by join definitions in the data model from
users
to companies_to_users
to companies
. Data Graph
also shows that you can reach companies
from cubes
by going left to right.