Working with Joins

A join creates a relationship between two cubes in your Cube project. Cube supports three types of join relationships often found in SQL databases:

one_to_one
one_to_many
many_to_one

To use an example, let's use two cubes, customers and orders:

YAML

JavaScript

cube(`customers`, {
  // ...
 
  dimensions: {
    id: {
      sql: `id`,
      type: `number`,
      primary_key: true
    },
 
    company: {
      sql: `company`,
      type: `string`
    }
  }
})
 
cube(`orders`, {
  // ...
 
  dimensions: {
    id: {
      sql: `id`,
      type: `number`,
      primary_key: true
    },
 
    customer_id: {
      sql: `customer_id`,
      type: `number`
    }
  }
})

We could add a join to the customers cube:

YAML

JavaScript

cube(`customers`, {
  // ...
 
  joins: {
    orders: {
      relationship: `one_to_many`,
      sql: `${CUBE}.id = ${orders.customer_id}`
    }
  }
})

The join above means a customer has many orders. Let's send the following JSON query:

{
  "dimensions": ["orders.status", "customers.company"],
  "measures": ["orders.count"],
  "timeDimensions": [
    {
      "dimension": "orders.created_at"
    }
  ],
  "order": { "customers.company": "asc" }
}

The query above will generate the following SQL:

SELECT
  "orders".status "orders__status",
  "customers".company "customers__company",
  count("orders".id) "orders__count"
FROM
  public.customers AS "customers"
  LEFT JOIN public.orders AS "orders"
    ON "customers".id = "orders".customer_id
GROUP BY 1, 2
ORDER BY 2 ASC
LIMIT 10000

However, if we have guest checkouts, that would mean we would have orders with no matching customer. Looking back at the one_to_many relationship and its' resulting SQL, any guest checkouts would be excluded from the results. To remedy this, we'll remove the join from the customers cube and instead define a join with a many_to_one relationship on the orders cube:

YAML

JavaScript

cube(`orders`, {
  // ...
 
  joins: {
    customers: {
      relationship: `many_to_one`,
      sql: `${CUBE}.customer_id = ${customers.id}`
    }
  }
})

In the above data model, our orders cube defines the relationship between itself and the customer cube. The same JSON query now results in the following SQL query:

SELECT
  "orders".status "orders__status",
  "customers".company "customers__company",
  count("orders".id) "orders__count"
FROM
  public.orders AS "orders"
  LEFT JOIN public.customers AS "customers"
    ON "orders".customer_id = "customers".id
GROUP BY 1, 2
ORDER BY 2 ASC
LIMIT 10000

As we can see, the base table in the query is orders, and customers is in the LEFT JOIN clause; this means any orders without a customer will also be retrieved.

In Cube, joins only need to be defined from one direction. In the above example, we explicitly removed the one_to_many relationship from the customer cube; not doing so would cause the query to fail as Cube would be unable to determine a valid join path. Click here to learn more about how the direction of joins affects query results.

Many-to-many joins

A many-to-many relationship occurs when multiple records in a cube are associated with multiple records in another cube.

For example, let's say we have two cubes, topics and posts, pointing to the topics and posts tables in our database, respectively. A post can have more than one topic, and a topic may have more than one post.

In a database, you would most likely have an associative table (also known as a junction table or cross-reference table). In our example, this table name might be post_topics.

You can jump to this section if you don't have an associative table in your database.

The diagram below shows the tables posts, topics, post_topics, and their relationships.

Many-to-Many Entity Diagram for posts, topics and post_topics

In the same way the post_topics table was specifically created to handle this association in the database, we need to create an associative cube post_topics, and declare the relationships from it to topics cube and from posts to post_topics.

YAML

JavaScript

cube(`posts`, {
  sql_table: `posts`,
 
  joins: {
    post_topics: {
      relationship: `one_to_many`,
      sql: `${CUBE}.id = ${post_topics.post_id}`
    }
  }
})
 
cube(`topics`, {
  sql_table: `topics`,
 
  dimensions: {
    post_id: {
      sql: `id`,
      type: `string`,
      primary_key: true
    }
  }
})
 
cube(`post_topics`, {
  sql_table: `post_topics`,
 
  joins: {
    topic: {
      relationship: `many_to_one`,
      sql: `${CUBE}.topic_id = ${topics.id}`
    }
  },
 
  dimensions: {
    post_id: {
      sql: `post_id`,
      type: `string`
    }
  }
})

The following example uses the many_to_one relationship on the post_topics cube; this causes the direction of joins to be posts → post_topics → topics. Read more about direction of joins here.

In scenarios where a table doesn't define a primary key, one can be generated using SQL:

YAML

JavaScript

cube(`post_topics`, {
  // ...
 
  dimensions: {
    id: {
      sql: `CONCAT(${CUBE}.post_id, ${CUBE}.topic_id)`,
      type: `number`,
      primary_key: true
    }
  }
})

Using virtual associative cube

Sometimes there is no associative table in the database, when in reality, there is a many-to-many relationship. In this case, the solution is to extract some data from existing tables and create a virtual (not backed by a real table in the database) associative cube.

Let’s consider the following example. We have tables emails and transactions. The goal is to calculate the amount of transactions per campaign. Both emails and transactions have a campaign_id column. We don’t have a campaigns table, but data about campaigns is part of the emails table.

Let’s take a look at the emails cube first:

YAML

JavaScript

cube(`emails`, {
  sql_table: `emails`,
 
  measures: {
    count: {
      type: `count`
    }
  },
 
  dimensions: {
    id: {
      sql: `id`,
      type: `number`,
      primary_key: true
    },
 
    campaign_name: {
      sql: `campaign_name`,
      type: `string`
    },
 
    campaign_id: {
      sql: `campaign_id`,
      type: `number`
    }
  }
})

We can extract campaigns data into a virtual campaigns cube:

YAML

JavaScript

cube(`campaigns`, {
  sql: `
    SELECT
      campaign_id,
      campaign_name,
      customer_name,
      MIN(created_at) AS started_at
    FROM emails
    GROUP BY 1, 2, 3
  `,
 
  measures: {
    count: {
      type: `count`
    }
  },
 
  dimensions: {
    id: {
      sql: `campaign_id`,
      type: `string`,
      primary_key: true
    },
 
    name: {
      sql: `campaign_name`,
      type: `string`
    }
  }
})

The following diagram shows our data model with the Campaigns cube:

Many-to-Many Entity Diagram for emails, campaigns and transactions

The last piece is to finally declare a many-to-many relationship. This should be done by declaring a one_to_many relationship on the associative cube, campaigns in our case.

YAML

JavaScript

cube(`emails`, {
  sql_table: `emails`,
 
  joins: {
    campaigns: {
      relationship: `many_to_one`,
      sql: `${CUBE}.campaign_id = ${campaigns.id}
      AND ${CUBE}.customer_name = ${campaigns.customer_name}`
    }
  },
 
  measures: {
    count: {
      type: `count`
    }
  },
 
  dimensions: {
    id: {
      sql: `id`,
      type: `number`,
      primary_key: true
    },
 
    campaign_name: {
      sql: `campaign_name`,
      type: `string`
    },
 
    campaign_id: {
      sql: `campaign_id`,
      type: `number`
    }
  }
})
 
cube(`campaigns`, {
  joins: {
    transactions: {
      relationship: `one_to_many`,
      sql: `${CUBE}.customer_name = ${transactions.customer_name}
      AND ${CUBE}.campaign_id = ${transactions.campaign_id}`
    }
  },
 
  dimensions: {
    id: {
      sql: `id`,
      type: `number`,
      primary_key: true
    },
 
    customer_name: {
      sql: `customer_name`,
      type: `string`
    }
  }
})

Directions of joins

The direction of joins greatly influences the final result set. It can be explicitly controlled on a view level.

As an example, let's take two cubes, orders and customers:

YAML

JavaScript

cube(`orders`, {
  sql_table: `orders`,
 
  measures: {
    count: {
      sql: `id`,
      type: `count`
    }
  },
 
  dimensions: {
    id: {
      sql: `id`,
      type: `number`,
      primary_key: true
    },
 
    customer_id: {
      sql: `customer_id`,
      type: `number`
    }
  }
})
 
cube(`customers`, {
  sql_table: `customers`,
 
  measures: {
    count: {
      sql: `id`,
      type: `count`
    },
 
    total_revenue: {
      sql: `revenue`,
      type: `sum`
    }
  },
 
  dimensions: {
    id: {
      sql: `id`,
      type: `number`,
      primary_key: true
    }
  }
})

With the given data model, we have two valid analytics use cases that require different join directions.

The first case is to calculate the total revenue per customer. To do this, we'll use the total_revenue measure that is defined on the orders cube. We need to be aware that orders can be placed without customer registration (anonymous customers/guest checkouts). Therefore, we should start the join from the orders cube onto the customers cube to ensure that we do not lose data from anonymous orders.

YAML

JavaScript

views:
  - name: total_revenue_per_customer
    description: Total revenue per customer
 
    cubes:
      - join_path: orders
        includes:
          - total_revenue
          - created_at
 
      - join_path: orders.customers
        includes:
          - company

We can query this view as follows:

{
  "dimensions": ["total_revenue_per_customer.company"],
  "measures": ["total_revenue_per_customer.total_revenue"],
  "timeDimensions": [
    {
      "dimension": "total_revenue_per_customer.created_at"
    }
  ]
}

The second case is to find customers who have not placed any orders. We will use the count measure on the customers cube for that. In this case, we should join the customers cube with the orders cube to find customers with zero orders placed. The reverse order of joins would result in a dataset without data for customers with no orders.

YAML

JavaScript

views:
  - name: customers_without_orders
    description: Customers without orders
 
    cubes:
      - join_path: customers
        includes:
          - company
 
      - join_path: customers.orders
        prefix: true
        includes:
          - created_at
          - count

We can then query the cube as follows:

{
  "dimensions": ["customers_without_orders.company"],
  "timeDimensions": [
    {
      "dimension": "customers_without_orders.orders_created_at"
    }
  ],
  "filters": [
    {
      "member": "customers_without_orders.orders_count",
      "operator": "equals",
      "values": ["0"]
    }
  ]
}

Troubleshooting

`Can't find join path`

Sometimes, you might come across the following error message: Can't find join path to join 'cube_a', 'cube_b'.

It indicates that a query failed because it includes members from cubes that can't be joined in order to generate a valid query to the upstream data source. Please check that you've defined necessary joins and that they have correct directions.

Also, please consider using views since they incapsulate join paths and completely remove the possibility of the error in question. You might also consider setting the public parameter to false on your cubes to hide them from end users.

If you’re building a custom data application, you might use the meta endpoint of the REST API. It groups cubes into connectedComponents to help select those ones that can be joined together.

`Primary key is required when join is defined`

Sometimes, you might come across the following error message: cube_a cube: primary key for 'cube_a' is required when join is defined in order to make aggregates work properly.

It indicates that you have a cube with joins and pre-aggregations. However, that cube doesn't have a primary key.

When generating SQL queries, Cube uses primary keys to avoid fanouts. A fanout happens when two tables are joined and a single value gets duplicated in the end result, meaning that some values can be double counted.

Please define a primary key dimension in this cube to make joins and pre-aggregations work correctly.

If your data doesn't have a natural primary key, e.g., id, you can define a composite primary key by concatenating most or all of the columns in the table. Example:

cubes:
  - name: cube_a
    # ...
 
    dimensions:
      - name: composite_key
        sql: CONCAT(column_a, '-', column_b, '-', column_c)
        type: string
        primary_key: true

Transitive join pitfalls

Let's consider an example where we have a many-to-many relationship between users and companies through the companies_to_users cube:

YAML

JavaScript

cube(`users`, {
  sql: `
    SELECT 1 AS id, 'Ali' AS name UNION ALL
    SELECT 2 AS id, 'Bob' AS name UNION ALL
    SELECT 3 AS id, 'Eve' AS name
  `,
 
  measures: {
    count: {
      type: `count`
    }
  },
 
  dimensions: {
    id: {
      sql: `id`,
      type: `string`,
      primary_key: true
    }
  }
})
 
cube(`companies`, {
  sql: `
    SELECT 11 AS id, 'Acme Corporation' AS name UNION ALL
    SELECT 12 AS id, 'Stark Industries' AS name
  `,
 
  dimensions: {
    id: {
      sql: `id`,
      type: `string`,
      primary_key: true
    },
 
    name: {
      sql: `name`,
      type: `string`
    }
  }
})
 
cube(`companies_to_users`, {
  sql: `
    SELECT 11 AS company_id, 1 AS user_id UNION ALL
    SELECT 11 AS company_id, 2 AS user_id UNION ALL
    SELECT 12 AS company_id, 3 AS user_id
  `,
 
  joins: {
    users: {
      sql: `${CUBE}.user_id = ${users.id}`,
      relationship: `one_to_many`
    },
 
    companies: {
      sql: `${CUBE}.company_id = ${companies.id}`,
      relationship: `one_to_many`
    }
  },
 
  dimensions: {
    id: {
      // Joins require a primary key, so we'll create one on-the-fly
      sql: `CONCAT(${CUBE}.user_id, ':', ${CUBE}.company_id)`,
      type: `string`,
      primary_key: true
    }
  }
})

With this data model, querying users.count as a measure and companies.name as a dimension would yield the following error: Can't find join path to join 'users', 'companies'.

The root cause is that joins are directed and there's no join path that goes by join definitions in the data model from users to companies or in the opposite direction.

In Visual Model, you can see that both users and companies are to the right of companies_to_users, meaning that there's no way to go from users to companies moving left to right or right to left:

One possible solution is to move one of two joins from the companies_to_users cube to either users or companies cube. Please note that it would affect the query semantics and thus the final result:

YAML

JavaScript

cube(`users`, {
  joins: {
    companies_to_users: {
      sql: `${CUBE}.id = ${companies_to_users}.user_id`,
      relationship: `one_to_many`
    }
  }
 
  // ...
})
 
cube(`companies_to_users`, {
  joins: {
    // users: {
    //   sql: `${CUBE}.user_id = ${users.id}`,
    //   relationship: `one_to_many`
    // },
 
    companies: {
      sql: `${CUBE}.company_id = ${companies.id}`,
      relationship: `one_to_many`
    }
  }
 
  // ...
})

Now there's a join path that goes by join definitions in the data model from users to companies_to_users to companies. Visual Model also shows that you can reach companies from cubes by going left to right.

Data blending Syntax