Core to our business at Simon Data is a data platform that enables our clients to access, operationalize, and centralize their marketing data. The ultimate goal of Simon Data is to give marketers an intuitive, non-technical platform in which to manage sophisticated data operations required to drive great customer marketing outcomes.
The warehouse is powered by a complex, multi-tenant environment comprising datasets of varying lineages, schemas, and business purposes. Some datasets are common across client accounts — such as data we produce or import through marketing channel integrations — while other datasets are client-specific. While this delivers significant leverage for our customers to orchestrate bespoke end-user impressions, it also introduces a brain-bending taxonomical challenge for the engineers looking to drive analytical insights across the entire system.
Since the product’s creation, the platform has seen several analytics products built into it. Early last year, we organized a focused team to consolidate them on a standard foundation. At the same time, we also wanted to rethink how we integrated analytics in our application to streamline the development and deployment processes.
We knew the best path forward was to build a framework that could be used both now and in the future for analytics product development on any part of our core platform. To pull this off, we had some goals for the system.
We wanted a seamless development experience, support for querying arbitrary data, fluency in managing various schemas, and an ability to rapidly prototype and develop the user experience. Equally important was the ability to transparently present the queries and transformations used to produce results to facilitate QA by multiple stakeholders.
What our users wanted
When we talked with users and business stakeholders, we discovered several pain points:
- Users complained about difficulty configuring custom reports and limited dashboarding capabilities.
- Developers spent too much time maintaining multiple different and complex implementations of analytics products.
- Salespeople wanted an easy-to-use and effective out-of-the-box reporting dashboard to present to prospective clients.
- Client Solutions Managers wanted more transparency in how the system calculated results.
- Product managers and engineers were concerned about the level of effort required to build new analytics products.
We started with a holistic review of the analytics products we already had to gather requirements for the new reporting suite and took a close look at how clients were (or were not) using them. Alongside that, we also did a systematic review of technologies we could use to build a new reporting platform.
We looked at many commercial products and open-source projects and narrowed the list down to a handful of contenders, including Apache Kylin and Druid from the open-source space and Looker from the enterprise. Cube.js quickly rose to the top of that list.
What we liked most about Cube.js (no other offering provided all of these):
- It was designed for precisely the problem we were trying to solve (embedding analytics into an existing application), and it fit nicely into our existing technology stack.
- First-class support for the tools we use: Snowflake on the backend and React/Recharts on the frontend.
- Pluggable authentication and schema configuration hooks we could use to integrate with the rest of our platform. Specifically, we use driverFactory to provide credentials for Snowflake and MySQL (for external pre-aggregations).
- Out-of-the-box support for deploying as a serverless application on AWS Lambda with API Gateway, ElastiCache, and Aurora/MySQL.
- A fantastic developer sandbox app, Developer Playground, which provides interactive query development, troubleshooting/debugging, the export of queries, and even ready-made Recharts components.
- Multiple layers of caching to provide the performance necessary for interactive data visualization in our application.
- An integrated framework for building frontend data components leveraging our massive data warehouse, one that was lightweight and enabled rapid iteration without a lot of boilerplate, nor the need to wire together lots of different libraries.
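As an illustrative sketch of the integration hooks mentioned above (the tenant-lookup helper `getTenantCredentials` and the credential fields are assumptions, not our production code), a multi-tenant `cube.js` configuration wiring Snowflake for queries and MySQL for external pre-aggregations might look like:

```javascript
// cube.js -- illustrative sketch, not our production configuration.
// getTenantCredentials() is a hypothetical helper that looks up
// per-client Snowflake credentials from the security context.
const SnowflakeDriver = require('@cubejs-backend/snowflake-driver');
const MySQLDriver = require('@cubejs-backend/mysql-driver');

module.exports = {
  // One Cube.js app per client account keeps schemas and caches isolated.
  contextToAppId: ({ securityContext }) =>
    `CUBEJS_APP_${securityContext.clientId}`,

  // Per-tenant Snowflake connection for queries.
  driverFactory: ({ securityContext }) => {
    const creds = getTenantCredentials(securityContext.clientId); // hypothetical
    return new SnowflakeDriver({
      account: creds.account,
      username: creds.username,
      password: creds.password,
      warehouse: creds.warehouse,
      database: creds.database,
    });
  },

  // External pre-aggregations are stored in MySQL.
  externalDbType: 'mysql',
  externalDriverFactory: () =>
    new MySQLDriver({
      host: process.env.CUBEJS_EXT_DB_HOST,
      database: process.env.CUBEJS_EXT_DB_NAME,
      user: process.env.CUBEJS_EXT_DB_USER,
      password: process.env.CUBEJS_EXT_DB_PASS,
    }),
};
```
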
Where we had to invest in Cube.js
Marketers interact with Simon Data through our easy-to-use web application. We needed a way to query data from multiple independent data warehouses, each with its own schema, and present the results in a common interface. This was a challenge on two fronts: development and deployment.
First, we had to invest in integrating Cube.js with our infrastructure to provide credentials and schema configuration for each individual client's data warehouse. Cube.js provided hooks we could use, so this was reasonably straightforward.
Second, we needed to give developers a way to run Cube.js that mirrored the production environment while letting them rapidly iterate on schema, query, and frontend component development. This meant building a common application bootstrap and configuration for Cube.js that we could use to launch it in dev, staging, and production environments.
One neat trick we pulled off here was getting Cube.js to seamlessly serve both its Playground application and API queries to our web app when running in development mode. This lets us run everything (both Cube.js and the Simon Data web app) on our laptops and have the web app send queries to the Cube.js Playground running locally.
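Conceptually, the local setup is simple (the port and origin below are illustrative assumptions): with `CUBEJS_DEV_MODE=true`, one Cube.js process serves both the Playground UI and the `/cubejs-api/v1` REST endpoints on the same port, and a small CORS tweak lets the locally running web app query it:

```javascript
// cube.js -- dev-mode sketch (port and origin are illustrative assumptions).
// With CUBEJS_DEV_MODE=true, a single process serves the Playground UI and
// the /cubejs-api/v1 endpoints on the same port (4000 by default), so the
// web app and the Playground share one local backend.
module.exports = {
  http: {
    cors: {
      // Allow the locally running Simon Data web app to query this instance.
      origin: 'http://localhost:8000',
      credentials: true,
    },
  },
};
```
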
Another challenge we faced was safely and securely deploying configurations to production without impacting what was already running. We achieved this by making schemas immutable.
When we want to deploy a new schema or patch an existing one, we create a whole new schema or increment the current schema with the changes. Once we have the latest schema working in our development environment, we can test it out in the staging environment and deploy it to production. We can do all of this in the background, so we only switch the production web application over to the new schema once we have complete confidence in it (see Figures 2 & 3 below).
Our repeatable deploy process sets up the new schema in Snowflake and Cube.js simultaneously, which gives strong consistency between the Snowflake and Cube.js configurations.
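A minimal sketch of the immutable-schema naming idea (the `_v2`-style names and this helper are assumptions for illustration, not our deploy tooling): each deploy derives a brand-new versioned schema name rather than mutating the one that is already live.

```javascript
// Hypothetical helper: derive the next immutable schema name from the
// current one, so a deploy never mutates a schema that is already live.
function nextSchemaVersion(current) {
  const match = current.match(/^(.*_v)(\d+)$/);
  if (!match) {
    // First versioned deploy of this schema family.
    return `${current}_v1`;
  }
  return `${match[1]}${Number(match[2]) + 1}`;
}

// Example: successive deploys of a "reporting" schema family.
// nextSchemaVersion('reporting')    -> 'reporting_v1'
// nextSchemaVersion('reporting_v1') -> 'reporting_v2'
```

The old schema keeps serving the production web app until the new one is validated, at which point the app is switched over.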
We have defined schemas as a set of Snowflake views and Cube.js cubes that are developed and deployed together. We create a schema by writing view templates that map onto the data in Simon’s warehouse designed for the given customer to provide a standard interface to create the cubes on top of (see Figure 1 below). Our developers write these view and cube templates using Jinja from their IDEs and can rapidly iterate on them locally using our end-to-end development environment integration. Once the schema is complete and committed, our deployer tool dynamically renders the templates by filling in variables specific to each customer’s datasets. This pattern was the key to building repeatable schemas and deploying them consistently in our complex multi-tenant environment.
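To make the view-plus-cube pattern concrete, here is a hedged sketch (the view, cube, column, and variable names are invented for illustration): the deployer renders a Jinja view template with tenant-specific variables, and the cube is defined on top of the rendered, versioned view.

```javascript
// campaign_stats.js -- illustrative cube sketch (all names are invented).
// The deployer renders a Jinja view template per tenant, e.g.
//   CREATE VIEW {{ schema_name }}.CAMPAIGN_STATS AS
//   SELECT campaign_id, sent_at, opens, clicks
//   FROM {{ tenant_dataset }}.SENDS;
// producing a versioned view such as REPORTING_V2.CAMPAIGN_STATS,
// which this cube is then defined on top of.
cube('CampaignStats', {
  sql: 'SELECT * FROM REPORTING_V2.CAMPAIGN_STATS',

  measures: {
    opens: { sql: 'opens', type: 'sum' },
    clicks: { sql: 'clicks', type: 'sum' },
  },

  dimensions: {
    campaignId: { sql: 'campaign_id', type: 'string', primaryKey: true },
    sentAt: { sql: 'sent_at', type: 'time' },
  },
});
```
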
Finally, we created a comprehensive, automated end-to-end integration test suite (written in Python) that exercises each cube across all schemas that are live in Cube.js. This serves two important purposes:
- It validates our production environment.
- It keeps the Cube.js pre-aggregations database up to date ahead of user interaction.
The integration test runs periodically and also after the deployment of any new schemas. Once the test passes, we can safely switch the web application to the new schema with confidence.
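As a hedged sketch of what a per-cube smoke check can look like (our actual suite is written in Python; this illustrative JavaScript version only shows how a minimal query can be built for each cube from the metadata Cube.js exposes at `/cubejs-api/v1/meta`):

```javascript
// Illustrative sketch: build a minimal "does this cube answer at all?"
// query from Cube.js /v1/meta output. Our real suite is in Python; the
// returned object follows the Cube.js JSON query format.
function smokeQuery(cubeMeta) {
  const firstMeasure = cubeMeta.measures[0] && cubeMeta.measures[0].name;
  const firstDimension = cubeMeta.dimensions[0] && cubeMeta.dimensions[0].name;
  return {
    measures: firstMeasure ? [firstMeasure] : [],
    dimensions: firstDimension ? [firstDimension] : [],
    limit: 1, // one row is enough to validate the cube and warm caches
  };
}

// Each query would then be POSTed to /cubejs-api/v1/load for every cube
// in every live schema, both periodically and after each deploy.
```
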
How Cube.js is deployed at Simon Data
Figure 1: Schema is the unit of deployment
- Each schema gets deployed into both Cube.js and Snowflake at the same time.
- A schema is composed of one or more sources, each of which defines a particular data source.
- Sources are where developers work. During development, the Cube, View, and Table definitions are created as a single unit (a source) that powers one or more in-app features.
- In a source, each Cube (a Cube.js config file) is defined on top of a dependent View (a SQL query), and each View has one or more dependent Tables (a name and list of columns that are verified during validation).
- Tables represent the upstream data in the data warehouse that we want to bring into the web application with an in-app reporting feature.
Figure 2: Deployment process
- Create a new schema in the Cube.js Playground.
- Deploy the schema into Snowflake & Cube.js.
- Switch the Simon Web App to use the new schema.
Figure 3: Schema migration
We can develop a new schema and deploy it in the background, then switch the Simon web app over when ready.
How are we better off now?
We delivered three major wins to our clients and business stakeholders with this platform:
Speed. We can develop and ship analytics products without dealing with any net new infrastructure development going forward. Our efforts are focused on what matters: the data and the UI. Integrating the Cube.js Playground into our development stack enables us to rapidly build a schema and query in Cube.js and then export that for implementation in the front end, all of which a developer can quickly do on a laptop.
Flexibility. We have a framework that enables us to seamlessly query any data in Snowflake and present it in the web application. The real win here is that we won't have to choose between maintainability and time to market to get this flexibility -- we can have both.
Stability (Antifragility). When we're ready to ship a new product feature, we can deploy it with the confidence that we will not introduce regression on existing functionality because our schemas are immutable. We can also roll out new features to client accounts in a controlled manner, one by one or in batches, and even keep the changes hidden to perform production validation before they are released to end users.
Results and future plans
We’ve already built net-new analytics products on this platform that solve fundamental client pain points, proving that the platform works. We’re also starting to realize the velocity gains when iterating on these products to add functionality and refine the user experience.
Now we’re starting to look at consolidating some of the earlier analytics products onto the new framework. Finally, our business stakeholders now have transparency into the calculations that power analytics products in the application, giving them confidence in the data and helping them more easily answer client questions.