The ongoing story of the digital age is still about taking products apart. The wave of bundling, unbundling, and rebundling in the data space has subsided, but the point remains. Generations of tools in the modern data stack, however you define it, succeed each other primarily for two reasons:

  • Emerging tools provide more flexibility and a lower entry barrier for practitioners.
  • Alternatively, teams escape into the safe harbor of a well-known integrated solution that promises big returns on even bigger investments. Indeed, has anyone ever been fired for buying Looker?

Today, we’re going to talk about Cube and Apache Superset—a bundle of tools representing a finer alternative to Looker in many use cases.

We've had a webinar on replacing Looker with Cube and Superset. Check the recording and slides.

Free as in “open source”

Cube and Superset belong to different categories in the data stack. The former is the headless business intelligence platform, and the latter serves as the data visualization and exploration platform. However, being open source and standing on the shoulders of giants are among many things that they have in common.

Both tools are licensed under Apache License 2.0 and actively developed on GitHub. Over the last 3 years, Cube’s repository has acquired almost 14,000 stars (an unmatched leader in the category). Written mostly in Rust, Cube’s data processing and storage are based on the Arrow DataFusion query execution framework which uses Apache Arrow as its in-memory format. Superset has a longer history (born at Airbnb in 2015 where it replaced Tableau, joined the Apache Incubator program in 2017), has a repository with 48,000 stars, and is powered by Apache ECharts, one of the most mature and popular charting libraries.

Cube and Superset

You’re free to use Cube and Superset at your own discretion, including self-hosting them in your private or public cloud. You’re also free to explore and get started with Cube Cloud and Preset Cloud, fully managed platforms with freemium pricing models and plenty of quality-of-life features on top of the open source products. Compare that to Looker’s “request a demo” paywall that would cost your team a few weeks of qualification calls on Google Meet before you start building.

Simple as in “separation of concerns”

Cube is intentionally headless and agnostic to the data presentation layer. However, it does a number of things and does them well:

  • connects to a plethora of data sources, from cloud data warehouses like BigQuery, Snowflake, and Redshift to streaming platforms like Kafka and Redpanda;
  • provides a data modeling layer as the centralized source of truth for metrics definitions;
  • secures data with its access control layer;
  • keeps queries fast, typically well under a second, with its caching layer;
  • delivers data to any downstream tool, including BI tools like Superset, via its API layer that provides SQL API, REST API, and GraphQL API.

Superset works by connecting to a data source and querying the data where it lives. The ideal way to use Superset is with a dataset-centric approach to modeling and visualization.

A dataset-centric approach means that a well-normalized dataset is used to “derive” richer datasets with extra semantics, which makes them valuable for analysis and visualization. Creating derived datasets that incorporate aggregate metric calculations earlier in the ETL/ELT process, before the data is surfaced for visualization, significantly reduces many of the issues with change management, maintenance, and logic reuse within dashboards.

Cube complements Superset in exactly this way: Superset treats a Cube environment as a PostgreSQL database connection, so Superset-generated SQL queries work against Cube without issues.
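Because Superset sees Cube as a PostgreSQL database, connecting the two comes down to a PostgreSQL-style SQLAlchemy URI. Here is a minimal sketch of assembling one; the host, port, user, password, and database name are placeholders, not values from this post, so take the real ones from your Cube deployment's SQL API settings.

```javascript
// Placeholder connection settings (assumptions for illustration only).
const connection = {
  user: 'cube',
  password: 'p@ss/word',
  host: 'localhost',
  port: 15432,
  database: 'db',
};

// Build a PostgreSQL-style SQLAlchemy URI.
// Credentials are percent-encoded so characters like "@" or "/" do not
// break the URI structure.
function buildSqlAlchemyUri({ user, password, host, port, database }) {
  const credentials = `${encodeURIComponent(user)}:${encodeURIComponent(password)}`;
  return `postgresql://${credentials}@${host}:${port}/${encodeURIComponent(database)}`;
}

const uri = buildSqlAlchemyUri(connection);
console.log(uri);
```

The resulting string is what you would paste into Superset's database connection form as the SQLAlchemy URI.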

Looker vs Cube + Superset

You’re free to configure and fine-tune each tool and each layer within every tool when building with Cube and Superset. Want to build and schedule a report? Check Superset dashboards. Need to expose new metrics in that report? Cube’s data modeling layer is at your service.

Compatible as in “no vendor lock-in”

Speaking of data modeling, Cube provides a data modeling language that is not tied to any specific interface, visualization, or user experience. Unlike proprietary LookML definitions that require the Looker IDE for syntax highlighting and linting, Cube’s data model is written in JavaScript (with YAML support coming soon) and feels native in your editor of choice. On top of that, you can generate a data model dynamically or even fetch it from a remote endpoint.
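To give a feel for dynamic generation, here is a rough sketch that builds measure definitions from a list of columns. The `cube()` function below is a stand-in defined only so the snippet runs on its own in Node.js, the `is_paid` column is hypothetical, and the table and column names are borrowed from the Cube example later in this post. The `sql` properties are written as functions on the assumption that dynamically generated members need that form; check the Cube documentation for the exact rules.

```javascript
// Stand-in for the cube() global available inside Cube data model files;
// it only records definitions so this sketch is self-contained.
const definedCubes = {};
function cube(name, definition) {
  definedCubes[name] = definition;
}

// Boolean flag columns to turn into measures ("is_paid" is hypothetical).
const flagColumns = ['is_active', 'is_paid'];

// Build one countDistinct measure per flag column, e.g. n_active_workspaces.
function buildFlagMeasures(columns) {
  const measures = {};
  for (const column of columns) {
    const suffix = column.replace(/^is_/, '');
    measures[`n_${suffix}_workspaces`] = {
      sql: () => `workspace_id`,
      type: `countDistinct`,
      filters: [{ sql: () => `${column} IS TRUE` }],
    };
  }
  return measures;
}

cube(`WorkspaceActivation`, {
  sql: () => `SELECT * FROM public.active_workspace_details`,
  measures: buildFlagMeasures(flagColumns),
});

console.log(Object.keys(definedCubes.WorkspaceActivation.measures));
```

Adding a new flag-based metric then means appending a column name to the list rather than copying a measure block by hand.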

Consider an excerpt of a Looker view from the recent deep dive by Pedram:

view: workspace_activation {
  sql_table_name: "METRICS"."WORKSPACE_ACTIVATION" ;;

  dimension_group: date {
    type: time
    timeframes: [
      raw,
      time,
      date,
      week,
      month,
      quarter,
      year
    ]
    sql: ${TABLE}."REPORTING_DATE" ;;
  }

  dimension: is_active_workspace {
    type: yesno
    sql: ${TABLE}."IS_ACTIVE_WORKSPACE" ;;
  }

  dimension: workspace_id {
    type: string
    primary_key: yes
    sql: ${TABLE}."WORKSPACE_ID" ;;
  }

  measure: count_workspaces {
    type: count_distinct
    description: "# of Workspaces"
    sql: ${workspace_id} ;;
  }

  measure: count_active_workspaces {
    type: count_distinct
    description: "# of Unique Workspaces Active within 1 Day"
    sql: ${workspace_id} ;;
    filters: [is_active_workspace: "yes"]
  }

  measure: activation_rate {
    type: number
    sql: ${count_active_workspaces} / ${count_workspaces} ;;
    value_format_name: percent_1
  }
}

Compare it with the equivalent Cube data model (with far fewer ;;, to say the least):

cube(`WorkspaceActivation`, {
  sql: `SELECT * FROM public.active_workspace_details`,

  measures: {
    count: {
      type: `count`
    },
    n_workspaces: {
      sql: `workspace_id`,
      type: `countDistinct`,
    },
    n_active_workspaces: {
      sql: `workspace_id`,
      type: `countDistinct`,
      filters: [ {
        sql: `${is_active_workspace} IS TRUE`
      } ],
    },
    activation_rate: {
      sql: `ROUND(100 * (1.0 * ${n_active_workspaces} / ${n_workspaces}), 2)`,
      type: `number`
    }
  },

  dimensions: {
    workspace_id: {
      sql: `workspace_id`,
      type: `number`,
      primaryKey: true
    },
    reporting_day: {
      sql: `reporting_day`,
      type: `time`
    },
    is_active_workspace: {
      sql: `is_active`,
      type: `boolean`
    }
  },

  preAggregations: {
    main: {
      measures: [ activation_rate ],
      timeDimension: reporting_day,
      granularity: `week`,
      refreshKey: {
        every: `1 day`
      }
    }
  }
});
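To make the measure definitions concrete, here is a small worked example that mirrors `n_workspaces`, `n_active_workspaces`, and `activation_rate` in plain JavaScript. The sample rows are made up for illustration, and the code only imitates the arithmetic; it is not how Cube evaluates measures internally.

```javascript
// Made-up sample rows shaped like the columns the cube reads.
const rows = [
  { workspace_id: 1, is_active: true },
  { workspace_id: 1, is_active: true },
  { workspace_id: 2, is_active: false },
  { workspace_id: 3, is_active: true },
];

// Count distinct values of a key, like a countDistinct measure.
function countDistinct(items, key) {
  return new Set(items.map((row) => row[key])).size;
}

// n_workspaces: distinct workspace IDs across all rows.
const nWorkspaces = countDistinct(rows, 'workspace_id');

// n_active_workspaces: distinct workspace IDs among active rows only.
const nActiveWorkspaces = countDistinct(
  rows.filter((row) => row.is_active === true),
  'workspace_id'
);

// activation_rate: percentage rounded to two decimals,
// like ROUND(100 * (1.0 * active / total), 2).
const activationRate =
  Math.round(100 * (nActiveWorkspaces / nWorkspaces) * 100) / 100;

console.log({ nWorkspaces, nActiveWorkspaces, activationRate });
```

With these rows, two of three distinct workspaces are active, so the rate comes out at 66.67.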

In a similar fashion, Cube’s API layer is compatible with practically any data consumer. When you realize that, alongside business users working with dashboards and reports in Superset, you’d also like data analysts to crunch numbers and seek insights in data notebooks, Cube will be there to help liberate your data.

Data consumers

You can expect any SQL query generated by your data exploration tool like Superset to work with Cube’s SQL API. Needless to say, equivalent queries to Cube’s REST API and GraphQL API will yield the same data:

SELECT
  DATE_TRUNC('WEEK', reporting_day) AS reporting_week,
  activation_rate
FROM WorkspaceActivation
ORDER BY 1 DESC;
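For comparison, here is a sketch of how the same request might look as a REST API query: a JSON object sent to the `/v1/load` endpoint in the `query` parameter. The base URL is a placeholder for a local deployment, and a real request would also need an authorization token, which is omitted here.

```javascript
// Cube query mirroring the SQL above: weekly activation rate, newest first.
const query = {
  measures: ['WorkspaceActivation.activation_rate'],
  timeDimensions: [
    { dimension: 'WorkspaceActivation.reporting_day', granularity: 'week' },
  ],
  order: { 'WorkspaceActivation.reporting_day': 'desc' },
};

// Placeholder base URL (an assumption; use your own deployment's address).
const CUBE_API_URL = 'http://localhost:4000/cubejs-api/v1';

// Serialize the query as JSON and URL-encode it into the "query" parameter.
function buildLoadUrl(apiUrl, cubeQuery) {
  const params = new URLSearchParams({ query: JSON.stringify(cubeQuery) });
  return `${apiUrl}/load?${params}`;
}

const loadUrl = buildLoadUrl(CUBE_API_URL, query);
console.log(loadUrl);
```

Members are addressed as `CubeName.member_name`, so the same measure and time dimension from the data model appear here without any SQL.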

Let’s have a bird’s-eye view here.

The generation of BI tools like Looker encouraged end users to invest heavily in building out large numbers of LookML models to populate the semantic layer. While this established Looker as an organization’s source of truth for business metrics, it also created an immense amount of lock-in to the BI tool. You couldn’t take your LookML models with you and use them with another tool if you decided to switch. Google’s acquisition of Looker heightened organizations’ anxieties around this lock-in, to the point that Google was actually forced to integrate LookML with Tableau. This way, organizations could use LookML for transformation (Looker’s strength) and Tableau for visualization (Tableau’s strength).

The future, however, looks very different from the past:

  • Open-source headless BI platforms like Cube provide a unified semantic layer that sits between the database and the BI layer, eliminating the need for, or providing alternatives to, expensive solutions like Looker that lock you in.
  • Open-source BI tools like Superset with an intentionally thin semantic layer complement Cube and enable last-mile data transformation for the explicit purpose of data visualization in your BI tool.

When the need for a complex yet flexible metrics/semantic layer arises, Cube and Superset make a winning open-source combo!

Ready-to-go as in “today”

Now, you’re very much welcome to try Cube and Superset today and see them in action. Please join the Slack communities of Cube and Superset to share your feedback.

Please tune in to the webinar on October 4 with Shreesham Mukherjee, Developer Relations Engineer at Preset, and Igor Lukanin, Head of Developer Relations at Cube. We’ll discuss Cube and Superset as an open source alternative to Looker in popular use cases, demo both tools, and take your questions.

See you!