The Data Quality Landscape – Q1 2024
Data quality has been a thorny and rather intractable problem since the early days of computing. As soon as humans start entering data into a computer system, there is a risk of error: names can be mistyped, postal codes left incomplete, addresses misspelled. If you are entering a new customer record, is “Sue Wright” the same person as the existing customer “Susan Wright”? Maybe not, but what if both live at the same address? What if they share the same social security number or date of birth? Welcome to the world of data quality, where decades of software development have gone into building algorithms that can detect common misspellings and that know that “Richard”, “Dick” and “Rick” may be the same name in different forms. Although the industry started out in the customer name and address domain, the same quality issues clearly exist for other data domains such as product, asset and supplier.
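To make this concrete, here is a minimal sketch, in Python, of the kind of matching logic such algorithms apply: a small nickname table plus fuzzy string similarity, used to flag possible duplicate customer records. The nickname table, field names and thresholds are illustrative assumptions rather than any particular vendor’s method; commercial matching engines use far richer rules and reference data.

```python
from difflib import SequenceMatcher

# Hypothetical nickname map: canonical first name -> common variants.
NICKNAMES = {
    "richard": {"dick", "rick", "richie"},
    "susan": {"sue", "susie"},
}

def canonical(first_name: str) -> str:
    """Map a nickname to its canonical form, if we know one."""
    name = first_name.strip().lower()
    for full, variants in NICKNAMES.items():
        if name == full or name in variants:
            return full
    return name

def similarity(a: str, b: str) -> float:
    """Fuzzy string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def possible_duplicate(rec_a: dict, rec_b: dict) -> bool:
    """Flag two customer records as possible duplicates for human review."""
    same_first = canonical(rec_a["first"]) == canonical(rec_b["first"])
    close_last = similarity(rec_a["last"], rec_b["last"]) > 0.85
    close_address = similarity(rec_a["address"], rec_b["address"]) > 0.85
    return same_first and close_last and close_address

print(possible_duplicate(
    {"first": "Sue", "last": "Wright", "address": "12 High St, Leeds"},
    {"first": "Susan", "last": "Wright", "address": "12 High Street, Leeds"},
))  # True -> refer to a data steward for review
```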
Having duplicated or inaccurate data can have much more serious consequences than being sent your bank statement twice. A simple decimal point error in a displacement calculation left the Spanish S-80 class of submarines (the Isaac Peral was the first) some 70 tons overweight, requiring a redesign, delays and a cost overrun of up to €2 billion. Google Maps errors have led to demolition teams knocking down the wrong houses. In highly regulated industries such as finance, insurance and pharmaceuticals, data quality errors have led to substantial fines for many companies. Such incidents make the news headlines, but mundane failures, such as shipping an order to the wrong customer address or sending an invoice to the wrong place, quietly cost companies all over the world a great deal of money. Despite investments in data quality software, the state of data quality remains poor: a series of tests reported in Harvard Business Review found that 47% of newly created data records had at least one critical error. Executives are aware of this: in survey after survey, only a third (or fewer) of executives say they fully trust their own data.
The software industry has certainly tried to provide solutions. Modern data quality software can “profile” data, detecting outliers and finding obvious issues such as null values. It can be configured with business rules, defined manually or with the help of artificial intelligence, which are then used to validate data. Software suites can cleanse data, validate fields such as names and addresses, and detect possible duplicates in corporate systems. They can enrich data too: latitude and longitude can be added to an address, and some suites go far beyond this, noting that an address is in a certain voting district, in a flood plain, or at a particular elevation, all very handy in certain use cases. A business address can be enriched with information such as the number of employees who work there, or the credit rating of the company. Data quality suites are sometimes sold as stand-alone best-of-breed solutions, but are often wrapped up in broader software suites that may include master data management, data integration or data catalogs.
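As an illustration of what profiling and rule-based validation involve at their simplest, the sketch below computes null counts, flags numeric outliers, and applies one hand-written validation rule. The sample data, column names and the five-digit postal-code rule are illustrative assumptions; real products apply such checks across entire databases and can suggest the rules themselves.

```python
from statistics import mean, stdev

records = [
    {"id": 1, "postal_code": "75001", "annual_spend": 1200.0},
    {"id": 2, "postal_code": None,    "annual_spend": 950.0},
    {"id": 3, "postal_code": "7500",  "annual_spend": 1100.0},
    {"id": 4, "postal_code": "31000", "annual_spend": 1300.0},
    {"id": 5, "postal_code": "69002", "annual_spend": 1250.0},
    {"id": 6, "postal_code": "13001", "annual_spend": 980.0},
    {"id": 7, "postal_code": "44000", "annual_spend": 1400.0},
    {"id": 8, "postal_code": "06000", "annual_spend": 87000.0},  # likely a data-entry error
]

# Profiling: how many records are missing each field?
fields = ["id", "postal_code", "annual_spend"]
null_counts = {f: sum(1 for r in records if r[f] in (None, "")) for f in fields}
print("null counts:", null_counts)        # {'id': 0, 'postal_code': 1, 'annual_spend': 0}

# Profiling: flag values more than two standard deviations from the mean.
spend = [r["annual_spend"] for r in records]
mu, sigma = mean(spend), stdev(spend)
outliers = [r["id"] for r in records if abs(r["annual_spend"] - mu) > 2 * sigma]
print("spend outliers:", outliers)        # [8]

# Validation rule (illustrative): postal codes must be exactly five digits.
bad_postcodes = [
    r["id"] for r in records
    if not (isinstance(r["postal_code"], str)
            and r["postal_code"].isdigit()
            and len(r["postal_code"]) == 5)
]
print("failing postal-code rule:", bad_postcodes)   # [2, 3]
```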
Modern data quality suites are available in the cloud as well as on-premises, and frequently use artificial intelligence to help with tasks such as merging and matching records. For example, some software refers possible duplicate records for human review, observes the behaviour of the human domain experts, and can then suggest new data quality rules or improve its ability to spot likely duplicates. Such machine learning has been incorporated into some data quality products for several years now. It should be noted that data quality is not a one-off exercise: the quality of data decays over time, so clean-up efforts need to be continuous rather than occasional. This aspect of data quality monitoring, or observability, has been given additional impetus in the last few years by a new generation of vendors, such as Monte Carlo, which use artificial intelligence to generate data quality rules automatically and then produce alerts and reports when exceptions and anomalies appear.
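A very rough sketch of that monitoring idea follows: track a single daily quality metric and alert when the latest value deviates sharply from its recent history. The metric, window size and threshold are illustrative assumptions, not a description of how any specific product (such as Monte Carlo) works.

```python
from statistics import mean, stdev

# Hypothetical history: daily percentage of customer records missing a postal code.
history = [1.8, 2.1, 1.9, 2.0, 2.2, 1.7, 2.0, 1.9, 2.1, 2.0, 1.8, 2.1, 2.0, 9.4]

def check_latest(series, window=7, threshold=3.0):
    """Alert if the latest value deviates from the trailing window
    by more than `threshold` standard deviations."""
    recent, latest = series[-(window + 1):-1], series[-1]
    mu, sigma = mean(recent), stdev(recent)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold

if check_latest(history):
    print("ALERT: null rate for postal_code looks anomalous:", history[-1])
```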
A further impetus to the industry has come from the interest in generative AI sparked in November 2022 by the general release of ChatGPT. Most large companies are concerned about privacy issues in using public AIs such as those from OpenAI, Google or Anthropic, so many have taken the approach of fine-tuning a pre-trained large language model (such as Llama from Meta) on their own corporate data. For example, a customer-support chatbot might be trained on previously given answers and the customer database, while an engineering application might be trained on the company’s engineering documents and procedures. It turns out that AIs, perhaps unsurprisingly, produce much better results when they are trained on high-quality data than on low-quality data. Given the levels of trust in data noted earlier, many companies are having to revisit data quality in preparation for their artificial intelligence projects. The sheer amount of money being invested in AI may translate into a quite substantial boost for the data quality industry.
The diagram that follows shows the major data quality vendors, positioned along three dimensions, which are defined later in this document.