Enterprises trying to use large language models (LLMs) to access their corporate data are hitting a brick wall. It is easy for a vendor to demo a natural language interface to corporate data. In the demo, a user types something like: “show me my most profitable customers” and the model quickly returns a list of customers after accessing a demo version of a corporate database. That looks great in the sales pitch, but the reality is very different.
First, consider how you would identify “my most profitable customers.” To answer this, you need a list of all the customers and the revenue that has been booked against each. However, to determine which are actually profitable you need more information. How are costs allocated to customers? Some customers may be large and require expensive account management, and with their high sales volume they probably negotiate discounts. Such a customer may actually be less profitable than a mid-sized customer that pays the list price, doesn’t haggle and doesn’t require extensive hand-holding from expensive account managers. All this information is hidden away in various corporate systems that include cost allocation rules, assuming that the data is properly accounted for at all. Even a fluent AI system will struggle to identify the correct data sources and business rules needed to answer this question.
Even if you pare things back to a much simpler case, it can still be tricky. In corporate systems there is rarely a neatly labelled database called “customer” with all the information you need to answer questions about customers. Most enterprises have a multiplicity of systems holding customer data. You might have a customer relationship management system, but customer data will also be stored in an ERP system, various marketing systems, customer support systems, etc. A survey by Salesforce showed that the average enterprise uses over a thousand different business applications, and dozens of these will contain versions of customer data. Even identifying the right database columns is not trivial. Within SAP, for example, the customer master table is called “KNA1”, the customer number is “KUNNR” and the customer name is “NAME1”. How exactly will your AI infer that “KUNNR” is the customer number?
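The scale of the guesswork is easy to illustrate. The sketch below shows a hand-maintained lookup from SAP’s technical column names to business terms; the column names follow genuine SAP conventions, but the mapping itself is an illustrative example, and in practice such a mapping has to be curated by someone who knows the system:

```python
# Raw SAP technical names carry no semantic hints a model could infer.
# The column names below follow real SAP conventions (KNA1 customer
# master); the business-term mapping is an illustrative example only.
SAP_CUSTOMER_COLUMNS = {
    "KUNNR": "customer_number",  # from the German "Kundennummer"
    "NAME1": "customer_name",
    "LAND1": "country_key",
    "ORT01": "city",
}

def business_name(technical_name: str) -> str:
    """Translate a technical column name into a business-friendly term."""
    return SAP_CUSTOMER_COLUMNS.get(technical_name, "unknown")

print(business_name("KUNNR"))  # customer_number
```

Nothing in the string “KUNNR” tells a model it means “customer number”; that knowledge has to come from somewhere outside the schema.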
Making sense of this labyrinth of systems is a challenge for any large company, and many attempts have been made to get better control over it, from ERP to data warehouses to master data management to data fabric and more. These various initiatives have had mixed success at best. Master data management (MDM) as a market has been around since at least 2004 (arguably well before that), and yet in 2025 McKinsey was still publishing research about how to do it, noting that less than a third of companies surveyed had an MDM system that was integrated with source systems, along with the data governance processes to manage it.
The latest approach has been to build a “semantic layer”, a business-friendly interface that links common terms like “customer” and “revenue” back to the source data buried away inside corporate systems. A semantic layer is a governed mapping between business concepts and the underlying tables, joins and business rules needed to calculate them.
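In code terms, one entry in a semantic layer can be pictured as a small governed record. The structure below is a hypothetical sketch, not any vendor’s actual format; the SAP table and column names (KNA1, VBRK, KUNNR, NETWR) follow real conventions, but the join and calculation shown are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SemanticConcept:
    """One governed business concept in a semantic layer (illustrative)."""
    name: str             # business-friendly term
    source_tables: tuple  # where the raw data actually lives
    join_condition: str   # how those tables relate
    calculation: str      # the agreed business rule for the number

# Hypothetical entry: "customer revenue" pinned to SAP-style sources.
customer_revenue = SemanticConcept(
    name="customer_revenue",
    source_tables=("KNA1", "VBRK"),            # customer master, billing headers
    join_condition="KNA1.KUNNR = VBRK.KUNAG",  # customer number to sold-to party
    calculation="SUM(VBRK.NETWR)",             # net value of billing documents
)
```

An AI system handed this record no longer has to guess which tables to join or which column holds revenue; those decisions were made once, by people who understand the systems, and are now governed.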
In a related initiative, a “knowledge graph” is designed to allow business users to explore data and the relationships within it. A knowledge graph needs some form of semantic layer to function effectively. If you have such a layer, then an AI system has a far greater chance of succeeding when you ask it a natural language question.
If an LLM is pointed at a raw database schema, it has to guess the relationships between tables, may misinterpret columns, and can hallucinate SQL that is syntactically correct but semantically wrong. In tests, the accuracy of LLM answers involving several joins has been found to be around 20% or worse. With a semantic layer, accuracy rises dramatically: in one set of tests, simple queries were answered with close to 100% accuracy using a semantic layer, compared with around 50–60% without it.
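The failure mode is easy to reproduce. The sketch below uses a tiny hypothetical schema (the table and column names are invented; SQLite is used only to keep it self-contained) to show a join that is syntactically valid but semantically wrong: it executes cleanly and silently returns nothing useful.

```python
import sqlite3

# Hypothetical mini-schema, not from any real system.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (cust_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, cust_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0);
""")

# A plausible-looking guess that joins customer id to ORDER id.
# It is valid SQL and runs without error -- and matches no rows.
wrong = conn.execute("""
    SELECT c.name, SUM(o.amount) FROM customers c
    JOIN orders o ON c.cust_id = o.order_id
    GROUP BY c.name
""").fetchall()  # []

# The join a semantic layer would have encoded explicitly.
right = conn.execute("""
    SELECT c.name, SUM(o.amount) FROM customers c
    JOIN orders o ON c.cust_id = o.cust_id
    GROUP BY c.name
""").fetchall()  # Acme -> 150.0, Globex -> 75.0
```

The database raises no error either way; only knowledge of which columns actually relate, exactly what a semantic layer captures, distinguishes the right answer from the empty one.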
This basic issue is why the previously somewhat obscure subject of semantic layers is now fashionable at AI seminars, with consultants keen to sell you their expertise in the area. The market for semantic layers and knowledge graphs was estimated at $2.2 billion in 2024 and is projected by one report to grow at a rapid 23% compound annual growth rate. Data management geeks have been trying for years to draw attention to its importance. Now, with AI, the area may finally emerge into the daylight. Enterprises investing in AI without investing in a semantic layer are effectively just automating confusion.