The worlds of data management and artificial intelligence (AI) are gradually converging. There are many segments of the huge $92 billion enterprise data management market, including data governance, master data management, data quality, data integration, databases and data analytics. Every one of these has been touched by AI to some extent.
Enterprise data management continues to evolve. Companies today must deal with data spread across servers in their own data centres, public clouds, private clouds or (most likely) hybrids of all of these. Data volumes continue to grow rapidly: according to IDC, 90% of the world’s data was generated in the last two years, and data storage doubles every four years. Despite considerable investment in assorted technologies, the quality of enterprise data has barely budged. In numerous surveys, barely a third of executives claim to trust their data, a figure that has scarcely moved over the last decade; a Capgemini survey of 500 executives found that “only 20% of business executives trust their data”. This is a particular problem since large language models (LLMs) depend heavily on the data that they are exposed to, either in their initial training or when accessing additional corporate data through techniques such as retrieval augmented generation (RAG).
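To make the RAG idea concrete, here is a minimal sketch: rank corporate documents against a question with a toy word-overlap retriever, then splice the best match into the prompt sent to an LLM. The documents and the scoring scheme are invented for illustration; real systems retrieve with embedding-based vector search rather than word counts.

```python
def tokens(text: str) -> set[str]:
    """Split text into lowercase alphanumeric words."""
    return set("".join(c if c.isalnum() else " " for c in text.lower()).split())

def build_prompt(question: str, documents: list[str], top_k: int = 2) -> str:
    """Retrieve the top_k most relevant documents and assemble the LLM prompt."""
    ranked = sorted(documents, key=lambda d: len(tokens(question) & tokens(d)),
                    reverse=True)
    context = "\n".join(ranked[:top_k])
    return f"Context:\n{context}\n\nQuestion: {question}"

# Toy corporate documents (illustrative content only).
docs = [
    "Refund policy: customers may return goods within 30 days.",
    "Office locations: London, Paris and Singapore.",
    "Shipping times: standard delivery takes five working days.",
]
prompt = build_prompt("What is our refund policy?", docs, top_k=1)
```

The quality point in the text follows directly: whatever documents the retriever surfaces, good or bad, become the context the LLM reasons over.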
AI has been applied in various areas to try to help. In data governance, AI can be used to identify sensitive data, predict policy violations, and recommend governance actions. It can also automatically detect personally identifiable data and enforce masking and encryption rules. Generative AI and natural language processing can be used to automatically classify data, tag metadata, summarise assets and generate business glossary content. AI models themselves need managing as data assets, and data catalogues increasingly support them alongside model documentation such as model cards. We can observe this in data governance and privacy products such as Collibra, Alation, BigID and OneTrust.
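A hedged sketch of the sensitive-data detection and masking idea: scan sampled column values with simple patterns and flag columns that look like personally identifiable information. Commercial products use trained classifiers rather than regular expressions; the patterns, column names and masking rule below are hypothetical.

```python
import re

# Toy PII patterns; real detectors combine many signals, not just regexes.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s-]{8,}\d"),
}

def classify_column(name: str, samples: list[str]) -> set[str]:
    """Return the set of PII types detected in a column's sample values."""
    found = set()
    for value in samples:
        for pii_type, pattern in PII_PATTERNS.items():
            if pattern.search(value):
                found.add(pii_type)
    return found

def mask(value: str) -> str:
    """Masking rule: keep first and last character, hide the rest."""
    return value[0] + "*" * (len(value) - 2) + value[-1] if len(value) > 2 else "**"

tags = classify_column("contact", ["alice@example.com", "bob@example.org"])
```

Once a column is tagged, a governance policy can then enforce `mask()` (or encryption) on every value it serves downstream.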
In data integration, AI can be used to auto-detect schema relationships, suggest joins and improve the efficiency of data flows. This approach has been taken in products like Informatica CLAIRE, Microsoft Fabric Copilot and Talend AI.
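One way join suggestion can work, sketched under simple assumptions: score candidate column pairs across two tables by combining name similarity with value overlap, and suggest the best-scoring pair as the join key. The tables, weights and scoring formula below are invented for illustration; products like those named use far richer models.

```python
from difflib import SequenceMatcher

def join_score(col_a, values_a, col_b, values_b):
    """Blend column-name similarity with the overlap of the columns' values."""
    name_sim = SequenceMatcher(None, col_a.lower(), col_b.lower()).ratio()
    overlap = len(set(values_a) & set(values_b)) / max(len(set(values_a)), 1)
    return 0.5 * name_sim + 0.5 * overlap

# Two toy tables, represented as column -> values mappings.
orders = {"customer_id": [1, 2, 3, 4], "amount": [10, 20, 30, 40]}
customers = {"cust_id": [1, 2, 3, 99], "region": ["EU", "US", "EU", "APAC"]}

# Rank every (orders column, customers column) pair; the top pair is the
# suggested join key.
best = max(
    ((a, b, join_score(a, va, b, vb))
     for a, va in orders.items() for b, vb in customers.items()),
    key=lambda t: t[2],
)
```

Here the differently named but semantically identical `customer_id`/`cust_id` pair wins because its values largely coincide, which is exactly the case where name matching alone fails.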
In data quality, machine learning models can analyse data pipelines, identify anomalies in data flows and check for data freshness. Other models can detect outliers, identify potential duplicate records and help merge and match records from multiple source systems, while missing values can be filled in to enrich records. This enhanced, more trusted data can feed master data management hubs, which themselves can use AI to match data entities and infer relationships between them. These approaches are taken in products like Informatica, SAP MDG, Ataccama, Monte Carlo AI, Anomalo and more.
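The duplicate-matching step can be sketched minimally: normalise records, then flag pairs whose similarity exceeds a threshold as likely matches. Real matching engines learn per-field weights and use phonetic and trained similarity measures; the records and the 0.85 threshold here are illustrative.

```python
from difflib import SequenceMatcher

def normalise(name: str) -> str:
    """Lowercase and collapse punctuation/whitespace before comparing."""
    cleaned = "".join(c if c.isalnum() else " " for c in name.lower())
    return " ".join(cleaned.split())

def likely_duplicates(records: list[str], threshold: float = 0.85):
    """Return index pairs of records whose normalised similarity is high."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            sim = SequenceMatcher(None, normalise(records[i]),
                                  normalise(records[j])).ratio()
            if sim >= threshold:
                pairs.append((i, j))
    return pairs

dupes = likely_duplicates(["ACME Corp.", "Acme Corp", "Apex Industries"])
```

Flagged pairs would then go to a merge/survivorship step (or a human steward) before the golden record lands in the MDM hub.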
Database management systems increasingly allow machine learning algorithms to run directly within their engines, avoiding the overhead of moving data out to a separate compute tier. The database vendors are also using AI to help with various aspects of database management, such as self-tuning, query optimisation, automated index and query tuning, resource allocation, and even identifying security threats by flagging unusual access patterns or SQL injection attempts. Databases increasingly support vector search, the similarity-based retrieval mechanism through which LLMs typically access enterprise data. We see this within database products like Oracle, PostgreSQL, Snowflake, Databricks, InterSystems IRIS and Teradata, amongst others.
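Vector search reduces to a simple idea, sketched here with hand-made toy vectors: represent each item as a point in an embedding space and return the nearest neighbours to a query by cosine similarity. Production databases use approximate indexes (such as HNSW) over real embedding models rather than the brute-force scan below.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": each document is a hand-placed point in 3-D space.
documents = {
    "refund policy": [0.9, 0.1, 0.0],
    "office locations": [0.0, 0.9, 0.2],
    "delivery times": [0.1, 0.2, 0.9],
}

def nearest(query_vec, docs, k=1):
    """Brute-force k-nearest-neighbour search by cosine similarity."""
    ranked = sorted(docs, key=lambda name: cosine(query_vec, docs[name]),
                    reverse=True)
    return ranked[:k]

hits = nearest([0.8, 0.2, 0.1], documents, k=1)
```

In a RAG pipeline, the query vector would come from embedding the user's question, and the returned documents become the LLM's context.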
Analytics tools can use AI to provide a natural language interface to end users, insulating them from having to understand physical database schemas and relationships and enabling them to deal instead with curated “data products” that they can more easily understand.
Using AI for data management tasks promises various benefits. The sheer amount of data today means that manual approaches to tasks such as defining data quality rules or identifying anomalous data are increasingly challenged, and machine learning algorithms can help automate many of them. Databases can be far more self-maintaining if machine learning algorithms detect performance bottlenecks and fix them by tuning indexes. Generative AI can help speed up the population of business glossaries in data catalogues, extract semantic meaning from data and allow easy access to data by business users.
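A minimal sketch of the self-tuning idea, under deliberately simplistic assumptions: mine a query log for the columns most often filtered in WHERE clauses and suggest indexing them. Real optimisers weigh cost models, data distributions and whole workloads; the log, regular expression and threshold here are all illustrative.

```python
import re
from collections import Counter

# Hypothetical query log; a real one would come from the database engine.
QUERY_LOG = [
    "SELECT * FROM orders WHERE customer_id = 42",
    "SELECT * FROM orders WHERE customer_id = 7 AND status = 'open'",
    "SELECT * FROM orders WHERE status = 'closed'",
    "SELECT * FROM orders WHERE customer_id = 99",
]

def recommend_indexes(log, min_hits=2):
    """Suggest indexing columns filtered at least min_hits times in the log."""
    counts = Counter()
    for query in log:
        # Crude heuristic: any `column =` comparison counts as a filter.
        counts.update(re.findall(r"(\w+)\s*=", query))
    return [col for col, n in counts.most_common() if n >= min_hits]

suggested = recommend_indexes(QUERY_LOG)
```

A self-tuning engine would go one step further and create or drop the indexes itself, then verify the effect on query latency.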
There are challenges, too. Generative AI is vulnerable to prompt injection and assorted security weaknesses, such as data poisoning. LLMs are resource-intensive and are black boxes, unable to explain their decisions, which is a problem in heavily regulated industries. Nonetheless, judicious use of AI technology, such as machine learning algorithms, can bring clear benefits. Machine learning algorithms do not suffer the widespread hallucination problem of LLMs, since many of them (such as linear regression or decision trees) are deterministic in nature. We can expect AI to continue to play a substantial role in most areas of data management in the years to come.