Working with Databricks and Cube is a great way to ensure your data engineering team can build and maintain a unified semantic layer for your organization. More broadly, it makes cloud data accessible and consistent for every data consumer inside and outside your company.

By combining Databricks’ unified analytics and data store with Cube’s data modeling, access control, caching, and APIs, it’s now possible to deliver consistent metrics securely to business intelligence tools, embedded analytics use cases, and AI agents.

Why a Semantic Layer?

A semantic layer functions as a middleware component between all of your data sources, including cloud-based data lakehouses such as Databricks, and downstream applications. Its primary role is to abstract complicated physical data models and present them in a more user-friendly manner, with improved field labels, field descriptions, and dynamically calculated metrics. It serves as a contextual bridge between the data sources and the diverse analytical tools businesses use to slice, dice, and analyze data.
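
To make this concrete, here's a minimal sketch of a Cube data model. Cube models can be written in YAML or JavaScript; this JavaScript example assumes a hypothetical `ecom.orders` table in Databricks, and the field and measure names are illustrative:

```javascript
// model/cubes/orders.js — a sketch over a hypothetical `ecom.orders` table
cube(`orders`, {
  sql_table: `ecom.orders`,

  dimensions: {
    status: {
      sql: `status`,
      type: `string`,
      title: `Order Status`, // friendly label shown to data consumers
      description: `Current fulfillment status of the order`,
    },
    created_at: {
      sql: `created_at`,
      type: `time`,
    },
  },

  measures: {
    count: {
      type: `count`,
      title: `Total Orders`,
    },
    completed_count: {
      type: `count`,
      filters: [{ sql: `${CUBE}.status = 'completed'` }],
    },
    completion_rate: {
      // a dynamically calculated metric, defined once and reused everywhere
      sql: `${completed_count} / NULLIF(${count}, 0) * 100.0`,
      type: `number`,
      format: `percent`,
      title: `Completion Rate`,
    },
  },
});
```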

Why Cube and Databricks?

Unify your data stack. Databricks has established itself as a high-performance Lakehouse at the center of the modern data stack, unifying all of your data types. Combined with Cube, you get consistently modeled metrics and performant queries for all your downstream data consumers.

The modern data stack follows several foundational principles and architectural designs:

  • Using a Lakehouse like Databricks as a central data platform
  • Favoring ELT over ETL
  • Applying software engineering best practices to data management

The Cube platform embodies software engineering best practices by letting data engineers manage data models as code. This makes it possible to use version control to evolve data models safely, collaborate, and streamline code reviews. It also allows you to integrate the semantic layer into a CI/CD pipeline and to maintain distinct development, staging, and production environments.

Harness the capabilities and scalability of Databricks with Cube. Cube integrates with Databricks directly, querying your data in place without extracting it, so your data never leaves the Databricks environment. Cube leverages the computing power of your Databricks instance, fully exploiting the scalability of your cloud data lakehouse.

Additionally, you can create pre-aggregations in Cube, which replace small, repetitive queries against your Databricks instance with background aggregate updates. Cube's aggregate-aware pre-aggregations are fundamentally different from the table or view materialization strategies commonly employed for performance. With Cube pre-aggregations, users don't need to know which objects were materialized for performance reasons; they simply issue their queries, and Cube automatically chooses the optimal way to answer each one: from the cache, from a pre-aggregation, or from the source.
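
To illustrate, here's what a pre-aggregation might look like on the hypothetical `orders` cube from the earlier sketch (repeating only the members the rollup needs); the rollup name and refresh interval are illustrative:

```javascript
// model/cubes/orders.js — adding a rollup to the hypothetical `orders` cube.
// Cube builds this aggregate in the background and transparently routes
// matching queries to it; users never reference it directly.
cube(`orders`, {
  sql_table: `ecom.orders`,

  dimensions: {
    status: { sql: `status`, type: `string` },
    created_at: { sql: `created_at`, type: `time` },
  },

  measures: {
    count: { type: `count` },
  },

  pre_aggregations: {
    orders_by_status_daily: {
      measures: [CUBE.count],
      dimensions: [CUBE.status],
      time_dimension: CUBE.created_at,
      granularity: `day`,
      refresh_key: {
        every: `1 hour`, // refresh in the background, not per user query
      },
    },
  },
});
```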

Use your Databricks Lakehouse data in any application. Cube supports your current and future use cases with first-class application programming interfaces (APIs). In addition to our SQL API, Cube also speaks REST and GraphQL, which are often preferred by application developers working with common front-end frameworks to build embedded analytics applications. With Cube, you can take your investment in your Databricks ecosystem and build anything you can dream of.
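
For instance, here's a minimal sketch of querying the REST API from JavaScript, assuming the hypothetical `orders` cube from earlier; the deployment URL and API token are placeholders:

```javascript
// Querying Cube's REST API (Node.js 18+ or a browser, which provide fetch).
const apiUrl = 'https://example.cubecloud.dev/cubejs-api/v1'; // placeholder
const apiToken = 'YOUR_CUBE_API_TOKEN'; // placeholder

// The same measures and dimensions defined once in the data model
const query = {
  measures: ['orders.count'],
  dimensions: ['orders.status'],
};

async function loadOrdersByStatus() {
  const response = await fetch(
    `${apiUrl}/load?query=${encodeURIComponent(JSON.stringify(query))}`,
    { headers: { Authorization: apiToken } }
  );
  const { data } = await response.json();
  // rows come back keyed by member name,
  // e.g. { "orders.status": ..., "orders.count": ... }
  console.log(data);
}

loadOrdersByStatus();
```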

A step-by-step guide to building a semantic layer with Cube and Databricks

In the section below, we'll walk through a quickstart tutorial on getting started with Cube and Databricks. Remember: Cube Cloud is free to start (there's a sign-up button at the top right). No credit card or calls with salespeople required! :)

You can find a detailed guide to getting started with Cube Cloud and Databricks in the Cube documentation.

Create Databricks and Cube accounts

First, create Cube Cloud and Databricks accounts.

If you don’t have a dataset that you would like to use, you can follow this guide to create a test dataset in your Databricks instance.

Connect Databricks to Cube

Please follow these steps to create your first deployment in Cube Cloud and connect it to your Databricks instance.

  1. After you sign in to your Cube Cloud account, click “Create Deployment”.

  2. Give the deployment a name, select the cloud provider and region of your choice, and click “Next”.

  3. Next, click “Create” to create a new project from scratch.

  4. Select Databricks from the list of supported data sources and enter your Databricks credentials to connect to it.

  5. Once Databricks is connected, Cube can generate a basic data model from your data warehouse schema, which helps you get started with data modeling faster. We'll inspect these generated files in the next section and start changing them.

Create your data model

Cube follows a dataset-oriented data modeling approach, which is inspired by and expands upon dimensional modeling.

When building a data model in Cube, you work with two dataset-centric objects: cubes and views. Cubes usually represent business entities such as customers, line items, and orders; in them, you define all the calculations as the measures and dimensions of these entities. Views sit on top of cubes and expose a curated selection of their measures and dimensions to data consumers.
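
For example, a view over the hypothetical `orders` cube sketched earlier might look like this; the view name matches the `orders_view` dataset referenced later in this guide:

```javascript
// model/views/orders_view.js — a curated slice of the hypothetical
// `orders` cube, exposed to downstream data consumers.
view(`orders_view`, {
  cubes: [
    {
      join_path: orders,
      includes: [`status`, `created_at`, `count`, `completion_rate`],
    },
  ],
});
```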

To begin building your data model, click on “Enter Development Mode” in Cube Cloud. This will take you to your personal developer space.

Once you are in development mode, navigate to the “Data Model” screen and click on any file in the left sidebar to open it.

You can follow this detailed guide on how to make changes to your data model and build new measures and dimensions.

When you are ready to test updates to your data model, you can navigate to Cube’s Playground. The Playground is a web-based tool that allows you to query your data without connecting any tools or writing any code.

Query your data

You can query Cube using a BI or visualization tool through the Cube SQL API. For a better experience, we recommend mapping your BI tool's data model to Cube's semantic layer. This can be done automatically with Semantic Layer Sync or manually.

Semantic Layer Sync

Semantic Layer Sync will synchronize all public cubes and views with connected BI tools.

  1. Create a new sync by navigating to the “Semantic Layer Sync” tab on the “BI Integrations” page and clicking “+ Create Sync”.
  2. Follow the steps in the wizard to create a sync.
  3. Replace the fields for user, password, and URL with your BI tool credentials, then click on “Save All”. You can now go to the “BI Integrations” page and trigger the synchronization of your newly created semantic layer.
  4. After running the sync, navigate to your BI tool instance. You should see the orders_view dataset that was created in the tool. Cube automatically maps all metrics and dimensions in the BI tool to measures and dimensions in the Cube data model.
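
Syncs can also be defined in code via the `semanticLayerSync` configuration option. Here's a minimal sketch, assuming a hypothetical Superset instance; the name and credentials are placeholders:

```javascript
// cube.js — declaring a Semantic Layer Sync in configuration.
// The BI tool type, name, and credentials below are placeholders.
module.exports = {
  semanticLayerSync: () => [
    {
      type: 'superset',
      name: 'Superset Sync',
      config: {
        user: 'bi_user@example.com',
        password: 'YOUR_SUPERSET_PASSWORD',
        url: 'https://superset.example.com',
      },
    },
  ],
};
```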

Manual setup

Alternatively, you can connect to Cube and create all the mappings manually.

  1. You can find the credentials to connect to Cube on the “BI Integrations” page under the “SQL API Connection” tab.
  2. After connecting, create a new dataset in the BI tool and select “orders_view” as a table.
  3. Now you can map the BI tool metrics and columns to Cube's measures and dimensions.
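
Whichever route you take, the SQL API speaks the Postgres wire protocol, so any Postgres-compatible client can connect. Here's a minimal Node.js sketch, with placeholder host and credentials standing in for the values from the “SQL API Connection” tab:

```javascript
// Querying Cube's SQL API with node-postgres (`npm install pg`).
const { Client } = require('pg');

async function main() {
  const client = new Client({
    host: 'example.sql.cubecloud.dev', // placeholder
    port: 5432,
    user: 'cube',
    password: 'YOUR_SQL_API_PASSWORD', // placeholder
    database: 'db',
  });

  await client.connect();

  // MEASURE() asks Cube to aggregate using the measure's own definition
  const { rows } = await client.query(
    `SELECT status, MEASURE(count) AS count
       FROM orders_view
      GROUP BY status`
  );
  console.log(rows);

  await client.end();
}

main().catch(console.error);
```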

That was a high-level preview. You can create both your Databricks and Cube accounts for free and try it out in your own environment.

Great work!

More resources

  • Recorded workshop: Getting Started with Cube on Databricks
  • Details on connecting to Databricks