After ChatGPT was released in late 2022, almost everyone in data and analytics rushed to leverage GenAI models for text-to-SQL solutions. Two years later, it became clear that text-to-SQL alone was simply not enough.

Enterprises need intelligent AI teammates to amplify their data and analytics teams' productivity. This is part of a larger trend as enterprises adopt agents and digital teammates across different roles and functions in the organization. Future AI data engineers will work alongside human data teams to carry out tasks such as building data assets, investigating ongoing issues, and optimizing costs.

For an AI agent to be a helpful teammate, it is not enough to take a natural language query as input and output a SQL statement. AI agents for data and analytics need a comprehensive understanding of the existing underlying data assets, from the raw data through transformations and semantic modeling to reporting. More importantly, they need the ability to make changes to these assets as they reason through the chain-of-thought process.

For example, when the agent is tasked with adding a new report to an existing dashboard, it needs to evaluate the existing semantic model to see whether the current model can support the requested report. If the model lacks the required dimensions or measures, the agent needs to go into the transformation and raw data layers to see what is available and what needs to be built or modified. It then investigates, builds, tests, validates, and collects feedback from the human as it progresses through the task, finally delivering the requested report.
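The gap analysis at the start of that loop can be sketched in a few lines. This is a hypothetical illustration: the dictionary-based model shape and the `find_gaps` helper are assumptions made for the example, not any particular tool's API.

```python
# Hypothetical sketch: how an agent might check whether an existing
# semantic model can support a requested report before deciding
# whether to descend into the transformation or raw data layers.

def find_gaps(semantic_model: dict, report_request: dict) -> dict:
    """Return the dimensions and measures the report needs
    but the semantic model does not yet define."""
    missing_dimensions = [
        d for d in report_request["dimensions"]
        if d not in semantic_model["dimensions"]
    ]
    missing_measures = [
        m for m in report_request["measures"]
        if m not in semantic_model["measures"]
    ]
    return {"dimensions": missing_dimensions, "measures": missing_measures}

# Example: a report asking for revenue by region, against a model
# that only tracks order counts by status.
model = {"dimensions": ["status"], "measures": ["count"]}
request = {"dimensions": ["region"], "measures": ["revenue"]}

gaps = find_gaps(model, request)
print(gaps)  # {'dimensions': ['region'], 'measures': ['revenue']}
```

If `find_gaps` returns anything non-empty, the agent knows it must extend the model rather than simply generate a query against it.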

Use cases

AI data engineers can perform a variety of tasks, but the deliverables typically come down to either the creation of a new data asset or the modification of an existing one. The agent can modify extract-load pipelines, transformations, semantic models, and, ultimately, dashboards and reports.

The value added to the business is increased productivity. AI will level up every data professional: senior data engineers will be able to do more within the same timeframe as before, and junior data engineers will be able to tackle more challenging tasks. It will also enable less technical team members to contribute to areas they couldn't before, e.g., dashboard designers contributing to data transformation pipelines.

Agents will be able to handle most of the use cases data teams are working on today – creating or modifying reports and dashboards, optimizing queries for both runtime and cost, debugging data issues, refactoring codebases, etc.

Architecture

We still need to build many building blocks to enable this future. Some require advances in the fundamental enabling technology, LLMs; others require infrastructure for AI agents to retrieve, understand, and modify data assets.

Advanced reasoning

In September 2024, OpenAI introduced o1, a new series of models with a reasoning architecture that spends more time thinking through problems. This architecture and chain-of-thought approach are prerequisites for AI agents that can complete sophisticated tasks, such as making changes to an organization's data assets. We are still in the very early stages of this next frontier in AI, but the pace of development looks very promising.

AI data engineers will rely on reasoning and chain-of-thought processes as they receive input from humans and delve into semantic models, transformations, and the data itself to ultimately perform actions for the given task.

Data Assets as Code

The data and analytics field has experienced a massive trend in recent years of applying software engineering practices to data management. This has led to the rise of code-first workflows as the primary way to manage data assets. Code-first workflows provide many benefits for humans, such as collaboration, version control, and CI/CD, but they are absolutely necessary for AI agents.

AI agents consume code as input and produce it as output. This will become the major interface layer between the agents and data tools for managing transformations, semantic modeling, and BI-as-code data assets.

AI agents will work on multiple tasks simultaneously. They will create branches within the version control system, spin up containers, make changes to the codebase, run CI, and present results to humans. Humans will review code changes and provide feedback through the code review process. This way, the code-based workflows data teams use today can seamlessly integrate new AI teammates.
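The branch–change–CI–review loop described above can be sketched as follows. This is a toy illustration: the `TaskRun` class and its method names are assumptions for the example; a real agent would invoke git and the CI system directly.

```python
# Illustrative sketch of one agent task moving through the code-based
# workflow: create a branch, commit changes, run CI, open a review.

from dataclasses import dataclass, field

@dataclass
class TaskRun:
    task: str
    branch: str = ""
    steps: list = field(default_factory=list)

    def create_branch(self):
        # One isolated branch per agent task, as in a human workflow.
        self.branch = f"agent/{self.task.replace(' ', '-')}"
        self.steps.append(f"git checkout -b {self.branch}")

    def commit_changes(self, files):
        self.steps.append(f"commit {len(files)} file(s) on {self.branch}")

    def run_ci(self) -> bool:
        self.steps.append("run CI")
        return True  # a real run would report actual test results

    def open_review(self):
        # Humans give feedback through the normal code review process.
        self.steps.append(f"open pull request for {self.branch}")

run = TaskRun("add revenue report")
run.create_branch()
run.commit_changes(["models/orders.yml"])
if run.run_ci():
    run.open_review()
print(run.steps)
```

Because every step maps onto existing version control and CI primitives, nothing about the team's review process has to change to accommodate the agent.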

Agent Workspace

Just as humans need laptops and access to data tools, AI agents need containers to execute workflows and API access to data systems. AI data engineers will write code faster than humans do today, which sets a high bar for infrastructure.

On one hand, the infrastructure needs to scale up massively, since agents can work on thousands of threads simultaneously. On the other, it should scale down to zero once agents are done with their work. The speed of provisioning containers for AI agents will be crucial.
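The scale-up/scale-to-zero behavior can be modeled with a toy pool: workspaces are provisioned on demand and released the moment an agent finishes, so idle capacity drops back to zero. The `WorkspacePool` class is an assumption for illustration, not a real orchestration API.

```python
# Toy sketch of elastic agent workspaces: acquire on demand,
# release immediately when done, so idle count returns to zero.

class WorkspacePool:
    def __init__(self):
        self.active = set()
        self._next_id = 0

    def acquire(self) -> int:
        """Provision an isolated workspace for one agent task."""
        self._next_id += 1
        self.active.add(self._next_id)
        return self._next_id

    def release(self, workspace_id: int):
        """Tear the workspace down once the task is done."""
        self.active.discard(workspace_id)

pool = WorkspacePool()
ids = [pool.acquire() for _ in range(3)]  # three tasks in flight
assert len(pool.active) == 3
for wid in ids:
    pool.release(wid)
assert len(pool.active) == 0  # scaled back down to zero
```

In production this acquire/release pair would map onto container provisioning and teardown, which is why provisioning speed becomes the critical metric.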

Interoperability of the existing data tools

Human data teams use many different data tools in today's workflows: extract-load, transformation, warehousing, semantic modeling, cataloging, observability, visualization, etc. Agents would need a way to programmatically manipulate data asset states within these systems. I believe code will become the universal way to manage the state of data assets across the board, and data tools will integrate with existing version control and CI/CD systems to enable development-deployment cycles for both humans and agents on data teams.

At Cube, we are excited about the future of AI agents for data teams and the opportunity to contribute to making it happen. We're investing in building the fundamental blocks to enable AI agents to manage data assets across the data stack. Please reach out if you would like to discuss the future of data and analytics!