Why semantic layers make LLM analytics reliable: a paired benchmark across three frontier models
If you've ever had an AI analytics tool confidently return the wrong number — the SQL ran, the rows came back, nothing looked broken, but the answer was wrong — you've seen the problem we set out to measure.
This is the most common failure mode in LLM-powered analytics, and it's not a model-quality problem. Switching from one frontier model to another doesn't fix it. The model is being asked to do a job it can't reliably do: translate a fuzzy human question into correct SQL against a database whose business definitions aren't written down anywhere it can see.
We argue that supplying those definitions — what we call a semantic layer — is the structural fix. Today we're releasing the research, the benchmark, and the dataset to back that up.
What we tested
We ran a controlled, paired-comparison benchmark on 100 natural-language questions over the Cleaned Contoso Retail Dataset loaded into ClickHouse. Three frontier LLMs — Claude Opus 4.7, Claude Sonnet 4.6, and GPT-5.4 — were each evaluated under two conditions:
- Schema only. The model receives the warehouse DDL and the question.
- + Semantic layer. The model additionally receives a 4 KB hand-authored markdown document describing the dataset's measures, conventions, and disambiguation rules. No code, no runtime, no tool calls. Just the document, pasted into the prompt.
Same questions, same harness, same single-shot protocol, same reasoning effort, same judge across all six configurations. The only thing that changes between conditions is whether the markdown is present.
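To make the setup concrete, here is a minimal sketch of what the two prompt conditions look like in a harness. The identifiers, the DDL excerpt, and the semantic-layer excerpt are hypothetical stand-ins, not the released benchmark code or the actual 4 KB document.

```python
# Illustrative sketch of the two prompt conditions; everything below is a
# placeholder, not the released harness or the real semantic-layer document.

SCHEMA_DDL = """\
CREATE TABLE sales (order_key UInt64, product_key UInt64, order_date Date,
                    quantity UInt32, net_price Decimal(18, 2))
    ENGINE = MergeTree ORDER BY order_key;
CREATE TABLE products (product_key UInt64, category String)
    ENGINE = MergeTree ORDER BY product_key;
"""

SEMANTIC_LAYER = """\
## Measures
- revenue: SUM(sales.net_price * sales.quantity), after discounts, excluding tax
- order_count: COUNT(DISTINCT sales.order_key)

## Conventions
- "last year" means the most recent complete calendar year present in the data
- product questions join sales to products on product_key
"""

def build_prompt(question: str, with_semantic_layer: bool) -> str:
    """Assemble the single-shot prompt. The only difference between the two
    conditions is whether the semantic-layer markdown is included."""
    parts = [
        "Write a single ClickHouse SQL query that answers the question.",
        f"Schema:\n{SCHEMA_DDL}",
    ]
    if with_semantic_layer:
        parts.append(f"Semantic layer:\n{SEMANTIC_LAYER}")
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)

# Same question, same harness, same model call; only the document changes.
schema_only = build_prompt("What was total revenue last year?", with_semantic_layer=False)
augmented = build_prompt("What was total revenue last year?", with_semantic_layer=True)
```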
What we found
Adding the semantic layer improves accuracy by +17 to +23 percentage points across all three models.
| Model | Schema only | + Semantic layer | Δ |
|---|---|---|---|
| Claude Opus 4.7 | 50.5% [40.8, 60.1] | 67.7% [58.0, 76.1] | +17.2 pp |
| Claude Sonnet 4.6 | 46.5% [37.0, 56.2] | 68.7% [59.0, 77.0] | +22.2 pp |
| GPT-5.4 | 45.5% [36.0, 55.2] | 68.7% [59.0, 77.0] | +23.2 pp |
All three paired improvements are statistically significant under a two-sided exact McNemar test (p ≤ 0.0015, n = 99).
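The test is straightforward to reproduce from per-question correctness. A minimal sketch using statsmodels, assuming two boolean lists of results for one model under the two conditions (the lists below are short placeholders, not the benchmark data):

```python
# Exact (binomial) McNemar test on paired per-question correctness.
# In the benchmark each list has 99 entries, one per scored question.
from statsmodels.stats.contingency_tables import mcnemar

schema_only    = [True, False, False, True, False]
semantic_layer = [True, True,  False, True, True]

# 2x2 table of paired outcomes; the test only uses the discordant cells b and c.
a = sum(x and y for x, y in zip(schema_only, semantic_layer))              # both correct
b = sum(x and not y for x, y in zip(schema_only, semantic_layer))          # schema-only correct, layer wrong
c = sum((not x) and y for x, y in zip(schema_only, semantic_layer))        # layer correct, schema-only wrong
d = sum((not x) and (not y) for x, y in zip(schema_only, semantic_layer))  # both wrong

result = mcnemar([[a, b], [c, d]], exact=True)  # two-sided exact test
print(f"p = {result.pvalue:.4f}")
```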
The more interesting result is what happens across models. With the semantic layer present, the three models are statistically indistinguishable from one another (pairwise p ≥ 0.79). Without it, they are also indistinguishable (pairwise p ≥ 0.42). Every comparison across the two clusters (any schema-only configuration against any semantic-layer configuration) is significant at p < 0.01.
In plain language: the presence or absence of the semantic-layer document accounts for essentially all of the significant variance in the benchmark. Model choice within tier does not.
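The pairwise claims come from the same exact test applied to each pair of the six (model, condition) configurations on the same questions. A sketch with hypothetical configuration labels and placeholder results:

```python
# Pairwise exact McNemar tests across configurations; the labels and
# per-question results are placeholders standing in for the six runs.
from itertools import combinations
from statsmodels.stats.contingency_tables import mcnemar

results = {
    ("opus", "schema_only"):      [True, False, True, False],
    ("opus", "semantic_layer"):   [True, True, True, False],
    ("sonnet", "schema_only"):    [False, False, True, False],
    ("sonnet", "semantic_layer"): [True, True, True, True],
}

def paired_pvalue(x, y):
    """Two-sided exact McNemar p-value for two paired correctness vectors."""
    a = sum(p and q for p, q in zip(x, y))
    b = sum(p and not q for p, q in zip(x, y))
    c = sum(q and not p for p, q in zip(x, y))
    d = sum(not p and not q for p, q in zip(x, y))
    return mcnemar([[a, b], [c, d]], exact=True).pvalue

for cfg_a, cfg_b in combinations(results, 2):
    print(cfg_a, "vs", cfg_b, f"p = {paired_pvalue(results[cfg_a], results[cfg_b]):.3f}")
```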
Why this matters
The data model is the upper bound on AI quality. The markdown we used was 4 KB — an afternoon of an analyst's time. But it has to capture what your metrics actually mean. If your team hasn't agreed on what revenue means, or CAC, or a qualified lead, no amount of model upgrading will produce correct answers.
Treat this as an architecture decision, not a model choice. With a semantic layer, you can pick whichever frontier model best fits your cost and latency budget without losing accuracy. Without one, switching to a stronger or more expensive model will not recover the gap. The gap is about the context, not the model.
This finding sits inside a consistent band of independent results, such as BIRD's external-knowledge ablation (+20 pp) and the dbt Labs paired benchmark (+15 pp). Different domains, different teams; same direction.
What we're releasing
- The paper. Full methodology, statistical analysis, and discussion: https://arxiv.org/abs/2604.25149.
- The benchmark. 100 retail-analytics questions, reference SQL, and the semantic-layer markdown — at https://github.com/cubedevinc/semantic-layer-benchmark.
- The dataset. Public on Kaggle — credit to Bhanu Thakur for the cleaned re-export of Microsoft's ContosoRetailDW.