Ever since ChatGPT became widely available in early 2023, Large Language Models (LLMs) have been a subject of intense interest and investment in nearly every industry. As we all try to wrap our heads around exactly how generative AI will transform the world, it was obvious to us here at Cube that semantic layers would have a part to play. We saw this taking shape as the integration of the Cube semantic layer into applications that retrieve information from data stores, perform analysis, and return results to a user who interacts with them in natural language, interpreted by an LLM. The results themselves could be returned in many forms: generated natural language, tabular result sets, or even dashboards produced by chaining into visualization libraries. The big idea is “ask a question, get an answer”. We weren’t sure exactly what it would look like, but we knew it was coming.

To prepare for this tectonic shift, we focused on learning how builders were thinking about this problem, and we released features we felt would empower them, such as our semantic document loader for LangChain and our expanded meta endpoints, which make it easier to provide the useful context within Cube to an LLM, improving consistency and reducing hallucinations. We also engaged with many builders working on next-generation analytics experiences driven by LLM-powered natural language query interfaces, such as our partners at Patterson Consulting and Delphi Labs, who have been instrumental in delivering AI-enabled analytics use cases for our customers.
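
As an example of what that context looks like in practice, here is a minimal sketch of pulling Cube’s semantic metadata into LangChain documents with the CubeSemanticLoader. The deployment URL and API token are placeholders, and exact import paths may vary by LangChain version:

```python
# A minimal sketch of loading Cube's semantic metadata into LangChain documents.
# The deployment URL and API token are placeholders, not real credentials.
# (Older LangChain releases expose the loader under langchain.document_loaders.)
from langchain_community.document_loaders import CubeSemanticLoader

loader = CubeSemanticLoader(
    cube_api_url="https://<your-deployment>.cubecloudapp.dev/cubejs-api/v1/meta",
    cube_api_token="<CUBE_API_TOKEN>",
)

# Each returned document describes a member of the data model (its cube,
# title, description, and type), ready to be embedded or injected into a
# prompt as grounding context for the LLM.
documents = loader.load()
print(documents[0].page_content, documents[0].metadata)
```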

We all felt it made sense that an LLM-based application would answer questions over a database more accurately with the aid of a well-designed semantic layer than with simpler zero-shot text-to-SQL methods. But we will be the first to tell you that we were long on customer anecdotes and light on empirical evidence.

Recently, a group of researchers from data.world (Juan F. Sequeda, Dean Allemang, and Bryon Jacob) published a ground-breaking paper studying the accuracy of LLMs answering questions with a text-to-SQL pattern and comparing those results with the accuracy achieved when a knowledge graph is introduced into the process. For those who may not be familiar, knowledge graphs are somewhat similar to semantic layers in that they, too, provide an additional layer of business context that can inform how data is used.

In addition to the results, the paper puts forward a detailed methodology for evaluating the accuracy of LLMs in answering business questions asked of a database. The authors provide a schema, sample data, natural language questions, and expected answers in a format that makes it fairly easy to replicate their experiment. Our partners at Delphi, who build an AI-enabled analytics application on top of semantic layers, quickly realized that this methodology could be adapted to benchmark the accuracy of their own solution. They recently published some interesting findings of their own: they were able to achieve an outstanding 100% accuracy with Delphi connected to a subset of the data.world test modeled in Cube as the semantic layer. This is an incredible result. It is not only a validation of the ability of semantic layers to provide the context needed to improve LLM accuracy, but also speaks to their thoughtful prompt engineering and design choices.

Why are semantic layers the missing piece?

The two driving themes for why we need to use semantic layers with LLMs are context and constraint. We see more and more teams moving towards providing context through prompt engineering and doing things like giving the LLM a knowledge graph to interface with. This is definitely a step in the right direction, but you also need to constrain the LLM’s output.

Semantic layers provide both a knowledge graph and a constrained interface for an LLM. As a special type of knowledge graph, a semantic layer gives the LLM a model of a world composed of entities and their measures and dimensions. LLMs have been trained on language, and all language is composed of entities (nouns) and their dimensions and attributes (adjectives). Language also contains mathematical terms such as ‘total’, ‘running total’, and ‘average’. As a result, LLMs are well suited to working with knowledge graphs.

LLMs are also quite good at SQL because so many of the documents and articles they were trained on (think Stack Overflow) contain SQL. Still, SQL is a minuscule fraction of that training data. An interface closer to natural language is a better fit for LLMs, and a constrained one, which reduces the chance of error and hallucination, is even better. Requests to Cube’s REST API are both close to natural language (a requested list of metrics, dimensions, and filters with few, if any, additional terms) and highly constrained (there is really only one way to form a request). If the LLM generates the request incorrectly, the result is almost always an error rather than a hallucination.
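
To make this concrete, here is a minimal sketch of such a request sent from Python. The deployment URL, API token, and member names (orders.count, orders.status) are illustrative placeholders, not part of the benchmark:

```python
# Sketch of a Cube REST API request. URL, token, and member names are
# placeholders chosen for illustration only.
import requests

CUBE_API_URL = "https://<your-deployment>.cubecloudapp.dev/cubejs-api/v1"
CUBE_API_TOKEN = "<CUBE_API_TOKEN>"  # a JWT signed with your API secret

query = {
    "measures": ["orders.count"],
    "dimensions": ["orders.status"],
    "filters": [
        {"member": "orders.status", "operator": "equals", "values": ["completed"]}
    ],
}

response = requests.post(
    f"{CUBE_API_URL}/load",
    headers={"Authorization": CUBE_API_TOKEN},
    json={"query": query},
)
response.raise_for_status()

# An invalid member name comes back as an explicit error payload,
# not as a plausible-looking but wrong answer.
print(response.json()["data"])
```

The request is little more than a structured restatement of the question, which is exactly the kind of output an LLM can be prompted to produce reliably.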

Until now, however, this has been theory, backed by some anecdotal evidence from using Delphi. What we needed was a benchmark, and now that data.world has published one, we have it.

Testing Methodology

We undertook an ambitious benchmark test, following in the footsteps of data.world and dbt Labs, who tested their semantic layer on a subset of 11 questions from the original benchmark. Our goal was to replicate dbt Labs’s methodology on our platform, using Delphi in conjunction with the Cube semantic layer. The outcome was remarkable: while dbt Labs achieved an impressive 83% across 8 questions (excluding 3 with multi-hop joins), Delphi + Cube reached a perfect score of 100% on all questions.

We began the benchmark by setting up a Snowflake data warehouse and creating the necessary tables from the data.world benchmark using the DDL available in our GitHub repository.
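
This step is easy to script. Below is a minimal sketch using the Snowflake Python connector, with placeholder credentials and an assumed local copy of the DDL file (create_tables.sql):

```python
# Sketch: run the benchmark DDL against Snowflake with the official Python
# connector. Credentials and the DDL file name are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
    database="<database>",
    schema="<schema>",
)

with open("create_tables.sql") as f, conn.cursor() as cur:
    # Execute each CREATE TABLE statement in the DDL script individually.
    for statement in f.read().split(";"):
        if statement.strip():
            cur.execute(statement)
```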

Once the database structure was in place, we populated it with the data files from the insurance dataset available on data.world’s GitHub, originally developed by the Object Management Group. This model, a 3NF structure requiring multi-hop joins, was crucial in replicating the complexity of real-world data, setting the stage for an authentic benchmarking experience.
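
Loading the files can likewise be scripted with a standard PUT and COPY INTO sequence. The sketch below reuses the connection from the previous snippet and assumes one local CSV per table, named after its target table (the table list shown is illustrative, not the full benchmark schema):

```python
# Sketch: stage and load one CSV per table using the same Snowflake connection.
# Table and file names are assumptions about how the data files are laid out.
TABLES = ["policy", "claim", "agent"]  # illustrative subset

with conn.cursor() as cur:
    for table in TABLES:
        # Upload the file to the table's internal stage...
        cur.execute(f"PUT file://data/{table}.csv @%{table} OVERWRITE = TRUE")
        # ...then copy it into the table itself.
        cur.execute(
            f"COPY INTO {table} FROM @%{table} "
            "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
        )
```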

Next, we started building out the Cube semantic layer. This involved signing up for Cube’s free tier and creating a new project using the credentials from our Snowflake account. Using Cube Cloud’s built-in IDE, we developed the Cube schema available in our GitHub repository, including all tables and dimensions from the benchmark (even “extra” ones our questions did not use) and some basic measures.
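
One convenient way to sanity-check the deployed model is Cube’s /meta endpoint, which lists every cube along with its dimensions and measures. A short sketch, reusing the placeholder URL and token from the earlier request example:

```python
# Sketch: list the cubes, dimensions, and measures exposed by the deployment
# via Cube's /meta endpoint, to confirm the model matches the benchmark schema.
import requests

meta = requests.get(
    f"{CUBE_API_URL}/meta",
    headers={"Authorization": CUBE_API_TOKEN},
).json()

for cube in meta["cubes"]:
    dimensions = [d["name"] for d in cube["dimensions"]]
    measures = [m["name"] for m in cube["measures"]]
    print(cube["name"], dimensions, measures)
```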

Finally, we created a new free account on Delphi, connected Cube via its REST API integration, and were off to the races. Working from the benchmark results sheet created by dbt Labs, we ran each question in a separate Delphi conversation and tracked the results. We repeated this five separate times to ensure that our results were consistent and not affected by the non-determinism that LLMs often exhibit.
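
Scoring the repeated runs is simple bookkeeping. A minimal sketch of such a tally, with hypothetical per-question results rather than our actual data:

```python
# Sketch: tally correctness per question across repeated runs to check for
# consistency. The booleans below are placeholders, not our recorded results.
runs = [
    {"Q1": True, "Q2": True, "Q3": True},   # run 1
    {"Q1": True, "Q2": True, "Q3": True},   # run 2
    # ... one dict per run, five in total
]

for q in runs[0]:
    correct = sum(run[q] for run in runs)
    print(f"{q}: {correct}/{len(runs)} runs correct")

overall = sum(sum(run.values()) for run in runs) / (len(runs) * len(runs[0]))
print(f"Overall accuracy: {overall:.0%}")
```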

Our results are publicly available, including the Cube queries Delphi generated. Delphi answered 100% of the questions correctly, including some where it calculated the needed measures on the fly.

In line with our commitment to transparency and community collaboration, we have made our Cube schema and benchmark instructions publicly available in our GitHub repository, Delphi Semantic Layer LLM Benchmark, and we invite others to replicate our benchmark and share their findings.

Conclusion

There is now empirical evidence that using a semantic layer with an LLM for analytics is more accurate than zero-shot text-to-SQL approaches. On top of this, a number of proprietary methods are chained together to deliver the best-in-class performance we see with Delphi. We invite everyone to take a look at the testing methodology above and take both Cube and Delphi for a spin. We’d love to show you how easy it is to start asking questions of your data, confident that the answers your team receives are grounded in the reality of your data.