Implementing Entity-Attribute-Value Model (EAV)
Use case
We want to create a cube for a dataset which uses the Entity-Attribute-Value (opens in a new tab) model (EAV). It stores entities in a table that can be joined to another table with numerous attribute-value pairs. Each entity is not guaranteed to have the same set of associated attributes, thus making the entity-attribute-value relation a sparse matrix. In the cube, we'd like every attribute to be modeled as a dimension.
Data modeling
Let's explore the users
cube that contains the entities:
cube(`users`, {
sql_table: `users`,
joins: {
orders: {
relationship: "one_to_many",
sql: `${CUBE}.id = ${orders.user_id}`,
},
},
dimensions: {
name: {
sql: `first_name || ' ' || last_name`,
type: `string`,
},
},
});
The users
cube is joined with the orders
cube to reflect that there might be
many orders associated with a single user. The orders remain in various
statuses, as reflected by the status
dimension, and their creation dates are
available via the created_at
dimension:
cube(`orders`, {
sql_table: `orders`,
dimensions: {
user_id: {
sql: `user_id`,
type: `string`,
},
status: {
sql: `status`,
type: `string`,
},
created_at: {
sql: `created_at`,
type: `time`,
},
},
});
Currently, the dataset contains orders in the following statuses:
[
{
"orders.status": "completed",
},
{
"orders.status": "processing",
},
{
"orders.status": "shipped",
},
];
Let's say that we'd like to know, for each user, the earliest creation date for their orders in any of these statuses. In terms of the EAV model:
- the users serve as entities and they should be modeled with a cube
- order statuses serve as attributes and they should be modeled as dimensions
- the earliest creation dates for each status serve as attribute values and they will be modeled as dimension values
Let's explore some possible ways to model that.
Static attributes
We already know that the following statuses are present in the dataset:
completed
, processing
, and shipped
. Let's assume this set of statuses is
not going to change often.
Then, modeling the cube is as simple as defining a few joins (one join per attribute):
cube(`users_statuses_joins`, {
sql: `
SELECT
users.first_name,
users.last_name,
MIN(cOrders.created_at) AS cCreatedAt,
MIN(pOrders.created_at) AS pCreatedAt,
MIN(sOrders.created_at) AS sCreatedAt
FROM public.users AS users
LEFT JOIN public.orders AS cOrders
ON users.id = cOrders.user_id AND cOrders.status = 'completed'
LEFT JOIN public.orders AS pOrders
ON users.id = pOrders.user_id AND pOrders.status = 'processing'
LEFT JOIN public.orders AS sOrders
ON users.id = sOrders.user_id AND sOrders.status = 'shipped'
GROUP BY 1, 2
`,
dimensions: {
name: {
sql: `first_name || ' ' || last_name`,
type: `string`,
},
completed_created_at: {
sql: `cCreatedAt`,
type: `time`,
},
processing_created_at: {
sql: `pCreatedAt`,
type: `time`,
},
shipped_created_at: {
sql: `sCreatedAt`,
type: `time`,
},
},
});
Querying the cube would yield data like this. As we can see, every user has attributes that show the earliest creation date for their orders in all three statuses. However, some attributes don't have values (meaning that a user doesn't have orders in this status).
[
{
"users_statuses_joins.name": "Ally Blanda",
"users_statuses_joins.completed_created_at": "2019-03-05T00:00:00.000",
"users_statuses_joins.processing_created_at": null,
"users_statuses_joins.shipped_created_at": "2019-04-06T00:00:00.000",
},
{
"users_statuses_joins.name": "Cayla Mayert",
"users_statuses_joins.completed_created_at": "2019-06-14T00:00:00.000",
"users_statuses_joins.processing_created_at": "2021-05-20T00:00:00.000",
"users_statuses_joins.shipped_created_at": null,
},
{
"users_statuses_joins.name": "Concepcion Maggio",
"users_statuses_joins.completed_created_at": null,
"users_statuses_joins.processing_created_at": "2020-07-14T00:00:00.000",
"users_statuses_joins.shipped_created_at": "2019-07-19T00:00:00.000",
},
];
The drawback is that when the set of statuses changes, we'll need to amend the cube definition in several places: update selected values and joins in SQL as well as update the dimensions. Let's see how to work around that.
Static attributes, DRY version
We can embrace the Don't Repeat Yourself (opens in a new tab) principle and eliminate the repetition by generating the cube definition dynamically based on the list of statuses. Let's create a new JavaScript model so we can move all repeated code patterns into handy functions and iterate over statuses in relevant parts of the cube's code.
const statuses = ["completed", "processing", "shipped"];
const createValue = (status, index) =>
`MIN(orders_${index}.created_at) AS created_at_${index}`;
const createJoin = (status, index) =>
`LEFT JOIN public.orders AS orders_${index}
ON users.id = orders_${index}.user_id
AND orders_${index}.status = '${status}'`;
const createDimension = (status, index) => ({
[`${status}_created_at`]: {
sql: (CUBE) => `created_at_${index}`,
type: `time`,
},
});
cube(`users_statuses_DRY`, {
sql: `
SELECT
users.first_name,
users.last_name,
${statuses.map(createValue).join(",")}
FROM public.users AS users
${statuses.map(createJoin).join("")}
GROUP BY 1, 2
`,
dimensions: Object.assign(
{
name: {
sql: `first_name || ' ' || last_name`,
type: `string`,
},
},
statuses.reduce(
(all, status, index) => ({
...all,
...createDimension(status, index),
}),
{}
)
),
});
The new users_statuses_DRY
cube is functionally identical to the
users_statuses_joins
cube above. Querying this new cube would yield the same
data. However, there's still a static list of statuses present in the cube's
source code. Let's work around that next.
Dynamic attributes
We can eliminate the list of statuses from the cube's code by loading this list
from an external source, e.g., the data source. Here's the code from the
fetch.js
file that defines the fetchStatuses
function that would load the
statuses from the database. Note that it uses the pg
package (Node.js client
for Postgres) and reuses the credentials from Cube.
const { Pool } = require("pg");
const pool = new Pool({
host: process.env.CUBEJS_DB_HOST,
port: process.env.CUBEJS_DB_PORT,
user: process.env.CUBEJS_DB_USER,
password: process.env.CUBEJS_DB_PASS,
database: process.env.CUBEJS_DB_NAME,
});
const statusesQuery = `
SELECT DISTINCT status
FROM public.orders
`;
exports.fetchStatuses = async () => {
const client = await pool.connect();
const result = await client.query(statusesQuery);
client.release();
return result.rows.map((row) => row.status);
};
In the cube file, we will use the fetchStatuses
function to load the list of
statuses. We will also wrap the cube definition with the asyncModule
built-in
function that allows the data model to be created
dynamically.
const fetchStatuses = require("../fetch").fetchStatuses;
asyncModule(async () => {
const statuses = await fetchStatuses();
const createValue = (status, index) =>
`MIN(orders_${index}.created_at) AS created_at_${index}`;
const createJoin = (status, index) =>
`LEFT JOIN public.orders AS orders_${index}
ON users.id = orders_${index}.user_id
AND orders_${index}.status = '${status}'`;
const createDimension = (status, index) => ({
[`${status}_created_at`]: {
sql: (CUBE) => `created_at_${index}`,
type: `time`,
},
});
cube(`users_statuses_dynamic`, {
sql: `
SELECT
users.first_name,
users.last_name,
${statuses.map(createValue).join(",")}
FROM public.users AS users
${statuses.map(createJoin).join("")}
GROUP BY 1, 2
`,
dimensions: Object.assign(
{
name: {
sql: `first_name || ' ' || last_name`,
type: `string`,
},
},
statuses.reduce(
(all, status, index) => ({
...all,
...createDimension(status, index),
}),
{}
)
),
});
});
Again, the new users_statuses_dynamic
cube is functionally identical to the
previously created cubes. So, querying this new cube would yield the same data
too.
Source code
Please feel free to check out the
full source code (opens in a new tab)
or run it with the docker-compose up
command. You'll see the result, including
queried data, in the console.