Introducing Cube Evals

Today we are launching Cube Evals.

Something changed this past year about what a data team is on the hook for. Questions that used to land in your queue—in Slack, in a ticket, as a "quick favor"—now go to an agent instead. People ask Cube's Analytics Chat, they ask through a copilot embedded in your product, they ask Claude or ChatGPT pointed at your Remote MCP server. The agent answers in seconds and the person moves on. That's the win. It's also the problem: the answer went out grounded in your data model, and you never saw it.

The only way to check whether those answers are right has been to read them—open the chat, eyeball the SQL, spot-check a number against something you trust. That catches almost nothing. You can't read every answer, you certainly can't read every answer a second time after you change a join or swap out a model, and "looked fine when I checked it last month" is not how you want to run the company's reporting.

An agent in production is a production system

If an AI agent is answering business questions in production, it is a production system. And production systems have tests—a suite you run on every change that tells you, objectively, whether the thing still works. Not vibes, not a spot-check, not a good feeling about the demo. Your agent has never had that.

That's what Cube Evals is. You write eval cases—a natural-language question paired with its known-correct answer—and run your agent against them on any branch. You get back an objective accuracy score, like "86% (43/50)", and a per-case breakdown of which questions failed and why. Run it on main to see where you stand. Change a model or an agent config on a branch, re-run, and compare: did accuracy go up, or did you just introduce a regression you'd otherwise have shipped?
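
To make the score and the branch comparison concrete, here is a minimal Python sketch. It is not Cube's implementation: `CaseResult`, `accuracy_label`, and `regressions` are hypothetical names, and the per-case results are made up so the arithmetic matches the "86% (43/50)" example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """Outcome of one eval case (hypothetical shape, for illustration only)."""
    name: str
    passed: bool


def accuracy_label(results):
    """Format a run as 'NN% (passed/total)'."""
    passed = sum(r.passed for r in results)
    total = len(results)
    pct = round(100 * passed / total) if total else 0
    return f"{pct}% ({passed}/{total})"


def regressions(baseline, candidate):
    """Names of cases that passed on the baseline run but fail on the candidate run."""
    baseline_passes = {r.name for r in baseline if r.passed}
    return sorted(
        r.name for r in candidate if not r.passed and r.name in baseline_passes
    )


# Made-up data: 50 cases on main, 43 of them passing.
main_run = [CaseResult(f"case_{i:02d}", i < 43) for i in range(50)]
# Same suite on a branch, where case_00 now fails.
branch_run = [CaseResult(r.name, r.passed and r.name != "case_00") for r in main_run]

print(accuracy_label(main_run))          # 86% (43/50)
print(accuracy_label(branch_run))        # 84% (42/50)
print(regressions(main_run, branch_run)) # ['case_00']
```

The headline number tells you where you stand; the regression list is what you would actually act on before merging.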

How it works

An eval case lives in your data model repo as code. The easiest place to start is a single file, agents/eval_questions.yml: each case has a name, the question in plain English, and the ground truth—either inline SQL or a reference to a certified query you've already vetted. As the suite grows you can split it across agents/eval_questions/*.yml.

# agents/eval_questions.yml
eval_questions:
  - name: revenue_by_quarter
    question: What was revenue by quarter over the last two years?
    certifiedQuery: revenue_by_quarter

  - name: completed_revenue_q4
    question: What was total completed revenue in Q4 2025?
    sql: |
      SELECT SUM(amount) AS revenue
      FROM orders
      WHERE status = 'completed'
        AND created_at >= '2025-10-01'
        AND created_at < '2026-01-01'
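
Because cases are plain data, they are easy to lint before a run. The Python sketch below checks cases shaped like the file above after some YAML parser has turned them into dictionaries. The rule that a case carries exactly one of `sql` or `certifiedQuery` is our reading of "either inline SQL or a reference to a certified query", not a documented constraint, and `validate_case` is a hypothetical helper.

```python
def validate_case(case):
    """Return a list of problems with one eval case (empty list if it looks well-formed)."""
    problems = []
    for field in ("name", "question"):
        value = case.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing {field}")
    # Assumption: ground truth is exactly one of inline SQL or a certified query reference.
    ground_truth = [key for key in ("sql", "certifiedQuery") if case.get(key)]
    if len(ground_truth) != 1:
        problems.append("expected exactly one of sql or certifiedQuery")
    return problems


# The same two cases as the YAML above, as already-parsed dictionaries.
cases = [
    {
        "name": "revenue_by_quarter",
        "question": "What was revenue by quarter over the last two years?",
        "certifiedQuery": "revenue_by_quarter",
    },
    {
        "name": "completed_revenue_q4",
        "question": "What was total completed revenue in Q4 2025?",
        "sql": "SELECT SUM(amount) AS revenue FROM orders WHERE status = 'completed'",
    },
]

for case in cases:
    print(case["name"], validate_case(case))
```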

When you run a suite, Cube invokes your own agent on each question: the same agent, the same LLM provider, and the same config that power normal chat. Running evals sends nothing out of your environment that normal agent use doesn't already send. The agent produces a query. Cube executes both that query and your ground-truth query against your warehouse, through the same Cube API every other query goes through, and compares the two result sets.
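
The shape of a single run can be sketched in a few lines of Python. Everything here is a stand-in: an in-memory SQLite database plays the warehouse, `stub_agent` plays the agent, and `run_case` is a hypothetical function. In Cube both queries go through the Cube API rather than a direct database connection.

```python
import sqlite3


def run_case(conn, ask_agent, case):
    """Ask the agent for a query, run it and the ground truth, compare result sets."""
    agent_sql = ask_agent(case["question"])
    agent_rows = conn.execute(agent_sql).fetchall()
    truth_rows = conn.execute(case["sql"]).fetchall()
    return {
        "name": case["name"],
        "passed": sorted(agent_rows) == sorted(truth_rows),
        "agent_sql": agent_sql,
    }


# Stand-in warehouse with a handful of made-up orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount REAL, status TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (100.0, "completed", "2025-10-15"),
        (250.0, "completed", "2025-12-31"),
        (75.0, "cancelled", "2025-11-02"),
        (40.0, "completed", "2026-01-03"),
    ],
)

case = {
    "name": "completed_revenue_q4",
    "question": "What was total completed revenue in Q4 2025?",
    "sql": (
        "SELECT SUM(amount) AS revenue FROM orders "
        "WHERE status = 'completed' "
        "AND created_at >= '2025-10-01' AND created_at < '2026-01-01'"
    ),
}


def stub_agent(question):
    """Pretend agent: different SQL text and alias, same answer."""
    return (
        "SELECT SUM(amount) AS total FROM orders "
        "WHERE status = 'completed' "
        "AND created_at BETWEEN '2025-10-01' AND '2025-12-31'"
    )


def careless_agent(question):
    """Pretend agent that forgets the status filter."""
    return (
        "SELECT SUM(amount) AS total FROM orders "
        "WHERE created_at >= '2025-10-01' AND created_at < '2026-01-01'"
    )


print(run_case(conn, stub_agent, case)["passed"])      # True
print(run_case(conn, careless_agent, case)["passed"])  # False
```

The first agent writes different SQL from the ground truth and still passes, because what gets compared is the result set, not the query text.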

When a case fails, you see the agent's SQL next to the ground-truth SQL, tagged with what went wrong—a row-count mismatch, a missing column, a value that's off—plus a plain-English read on it. That points you straight at the model definition or agent instruction that needs fixing.
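
As a rough illustration of how a failure might be tagged, the Python sketch below derives the three kinds of mismatch named above from two result sets. The tag strings and the `failure_tags` function are hypothetical, not Cube's actual labels or logic, and the comparison is exact rather than tolerance-based to keep the example short.

```python
def failure_tags(agent_cols, agent_rows, truth_cols, truth_rows):
    """Return hypothetical failure tags for a pair of result sets (empty list if they agree)."""
    tags = []
    # Columns are counted rather than matched by name, so aliases don't matter.
    if len(agent_cols) < len(truth_cols):
        tags.append("missing_column")
    if len(agent_rows) != len(truth_rows):
        tags.append("row_count_mismatch")
    # Only compare values when the shapes agree; sorting ignores row order.
    if not tags and sorted(agent_rows, key=repr) != sorted(truth_rows, key=repr):
        tags.append("value_mismatch")
    return tags


truth_cols = ["quarter", "revenue"]
truth_rows = [("Q1", 10), ("Q2", 25)]

print(failure_tags(["q", "rev"], [("Q2", 25), ("Q1", 10)], truth_cols, truth_rows))  # []
print(failure_tags(["q", "rev"], [("Q1", 10), ("Q2", 20)], truth_cols, truth_rows))  # ['value_mismatch']
print(failure_tags(["q", "rev"], [("Q1", 10)], truth_cols, truth_rows))              # ['row_count_mismatch']
print(failure_tags(["rev"], [(10,), (25,)], truth_cols, truth_rows))                 # ['missing_column']
```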

Grading is deterministic by default

No model grades the answer. Result sets are compared deterministically: sort-invariant, numeric-tolerant to four significant figures, and agnostic to column aliases. Cube checks the rows, not the prose.
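
One way such a comparison could work, sketched in Python under stated assumptions: rows are plain tuples compared by position (which makes aliases irrelevant), numbers are finite and rounded to four significant figures before comparison, and row order is ignored by comparing multisets. The function names are hypothetical, and Cube's actual rules (for example, how it treats column order or values that straddle a rounding boundary) may differ.

```python
from collections import Counter
from math import floor, log10


def round_sig(value, digits=4):
    """Round a finite number to the given count of significant figures."""
    if value == 0:
        return 0.0
    return round(value, digits - 1 - floor(log10(abs(value))))


def normalize_row(row):
    """Round numeric cells; leave strings, None, and booleans untouched."""
    return tuple(
        round_sig(float(cell))
        if isinstance(cell, (int, float)) and not isinstance(cell, bool)
        else cell
        for cell in row
    )


def results_match(agent_rows, truth_rows):
    """Order-insensitive, alias-agnostic, numerically tolerant comparison."""
    normalized_agent = Counter(normalize_row(r) for r in agent_rows)
    normalized_truth = Counter(normalize_row(r) for r in truth_rows)
    return normalized_agent == normalized_truth


truth = [("Q1", 10.0), ("Q2", 20)]

# Different row order and a difference beyond the fourth significant figure: match.
print(results_match([("Q2", 20.0), ("Q1", 10.0004)], truth))  # True
# A difference within the first four significant figures: no match.
print(results_match([("Q1", 10.1), ("Q2", 20.0)], truth))     # False
```

Using a multiset rather than a set keeps duplicate rows significant, so an agent query that fans out a join and doubles every row does not pass by accident.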

We started with deterministic grading because we wanted a number you don't have to second-guess. The same eval run returns the same verdict every time, and when a case fails you can see exactly why: a row count that's off, a missing column, a value that doesn't match. That means you can reproduce a result, show a failure to whoever owns the metric, and gate a merge on it in CI without anyone arguing about how the grader felt that day.
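
Gating a merge can be as simple as turning the score into an exit code. The sketch below is a generic Python pattern, not a Cube CLI or API: the `gate` function and the 0.85 threshold are hypothetical, and how you obtain the pass counts from a run is left out.

```python
def gate(passed, total, min_accuracy=0.85):
    """Return a process exit code: 0 if accuracy meets the threshold, 1 otherwise."""
    accuracy = passed / total if total else 0.0
    return 0 if accuracy >= min_accuracy else 1


# In a CI job you would pass this value to sys.exit(); here we just print it.
print(gate(passed=43, total=50))  # 0 (86% meets the 85% bar)
print(gate(passed=40, total=50))  # 1 (80% does not)
```

Because the verdicts are deterministic, a red build here means the data model or agent config changed an answer, not that a grader changed its mind.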

There's a real case for model-based grading too. It catches things a result comparison can't, like whether a written answer reads well or whether the agent picked a sensible chart, and we're looking at adding it as an option you can turn on. Deterministic grading stays the default, and you decide where a model check is worth the tradeoff.

It lives where you already work

Evals run where you already do this kind of work: in the IDE, next to the model, scoped to a branch. They're part of AI Studio, the place in Cube where you tune and ship the agent—observe what it's doing, evaluate it against ground truth, improve the model, ship. Testing belongs in that loop, on a branch, in front of a code review—not in a separate console you remember to open once a quarter.

Customers were already doing this by hand

We didn't guess at this one. Customers came to us already doing it the hard way—tracking eval questions and agent results by hand in spreadsheets, comparing runs manually after every change—and asked us, in so many words, for a Cube eval. When the people running an agent in production are building this for themselves out of a spreadsheet, the feature is already validated. We just made it native.

Get started

Cube Evals is available now. The docs walk through authoring eval cases and running a suite. If you're already on Cube, open the Evals tab in AI Studio and add your first eval case. If you're new to Cube, request a demo and we'll show you the whole loop: model, agent, and the tests that keep it honest.