The Data Quality Landscape – Q1 2023
Data quality has been an issue ever since data was first captured on computers. Despite huge investments in technology across industries, trust in the quality of data remains consistently and depressingly low. A Deloitte survey found that 67% of executives are not comfortable using data from their own corporate systems, while a May 2022 survey of 500 companies by the market research firm Pollfish found 77% of respondents admitting to problems with their data quality. These findings are consistent with earlier surveys: a 2021 Precisely survey of over 300 executives found that 82% of C-level executives regarded data quality as a barrier to successful data integration projects. Such issues cut across industries, often with serious consequences: a US government study found that up to 10% of patients in US hospitals were misidentified, with duplicate patient records running at 12%. Prescription errors in the US healthcare system are reckoned to cost $21 billion and cause 7,000 deaths annually, according to the Network for Excellence in Health Innovation.
There are many reasons for this state of affairs, with human nature playing a major part: if employees are asked to enter data into a computer system that they see no direct use for, they will inevitably be less careful about its accuracy than with something that affects them directly. An employee will pay close attention to their payroll slip and check that their expenses have been paid on time, but filling out general background information on a customer that only benefits an unknown person in another department is liable to involve less diligence. These days data quality is often considered an important part of broader data governance initiatives, with business people taking ownership of their data rather than delegating it to IT departments, which often lack the knowledge (or the authority) to do the job effectively.
Data quality tools emerged to improve matters, both by improving data capture at source and by scanning large volumes of data for likely errors. Data quality software initially focused on customer name and address data, which is common to virtually every industry, with algorithms designed to spot common misspellings and errors. A modern data quality suite can scan (“profile”) data to spot likely errors based on statistics, and can examine data records to identify possible duplicates. Despite best efforts to ensure that customer or product records are unique, hard reality shows that duplication rates of 10% to 30% are common; one customer master system that this author examined some years ago had 80% duplicates. Good data quality software can help diagnose this issue, highlight likely errors and duplicates, and help combine duplicate records into a high-quality system of record. Such tools can also suggest business rules to help keep data quality high, and can monitor systems to track progress over time. These days data quality software can be applied to other data domains such as product or material data, not just customer name and address records. Extensive third-party databases can be used to enrich name and address records: for example, an insurance company can check whether a house is built on a flood plain, or is in a high-crime area, and adjust a quotation accordingly.
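To make the duplicate-detection idea concrete, here is a minimal illustrative sketch, not any vendor's actual algorithm: it normalises customer records and flags pairs whose fuzzy string similarity exceeds a threshold. The record strings, the `likely_duplicates` function name, and the 0.85 threshold are all assumptions chosen for illustration.

```python
# Hypothetical sketch of fuzzy duplicate detection; real data quality suites
# use far more sophisticated parsing, standardisation, and matching rules.
from difflib import SequenceMatcher

def normalise(record):
    # Lower-case and strip punctuation so trivial differences don't hide duplicates
    cleaned = "".join(ch for ch in record.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())

def similarity(a, b):
    # Ratio in [0, 1]: 1.0 means the normalised strings are identical
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

def likely_duplicates(records, threshold=0.85):
    # Compare every pair of records and keep those above the threshold
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = similarity(records[i], records[j])
            if score >= threshold:
                pairs.append((records[i], records[j], round(score, 2)))
    return pairs

customers = [
    "John Smith, 12 High Street, London",
    "Jon Smith, 12 High St., London",
    "Mary Jones, 4 Park Lane, Leeds",
]
print(likely_duplicates(customers))
```

Here the two "Smith" records are flagged as a likely duplicate pair despite the misspelt first name and abbreviated street, while the unrelated "Jones" record is not. Real products also block records (e.g. by postcode) to avoid comparing every pair.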
In recent times many vendors have adopted machine learning techniques to help with this process. Systems can observe human domain experts resolving possible duplicate records, then suggest more refined business rules and automate the resolution of common errors, freeing up human time for more useful activities. The use of machine learning in merging and matching records is now becoming common, with software considerably smarter at this than it was a few years ago.
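The idea of learning from expert decisions can be sketched in miniature: given pairs of records that a domain expert has labelled as match or non-match, fit the similarity threshold that best reproduces those decisions rather than hard-coding one. This is a deliberately simple stand-in for the machine learning that commercial tools apply; the `learn_threshold` function, the training pairs, and the use of a single similarity feature are all assumptions for illustration.

```python
# Hypothetical sketch: tuning a match threshold from expert-labelled pairs.
# Commercial ML matching uses many features and richer models than this.
from difflib import SequenceMatcher

def score(a, b):
    # Single similarity feature; real systems combine many such features
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def learn_threshold(labelled_pairs):
    # labelled_pairs: list of (record_a, record_b, is_match) tuples from an expert.
    # Try each observed score as a candidate threshold; keep the most accurate one.
    candidates = sorted({score(a, b) for a, b, _ in labelled_pairs})
    best_t, best_acc = 0.5, -1.0
    for t in candidates:
        acc = sum((score(a, b) >= t) == m for a, b, m in labelled_pairs) / len(labelled_pairs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

training = [
    ("ACME Corp", "ACME Corporation", True),
    ("ACME Corp", "Acme Corp Ltd", True),
    ("ACME Corp", "Zenith Trading", False),
    ("Baker & Co", "Baker and Co", True),
    ("Baker & Co", "Carter Holdings", False),
]
t = learn_threshold(training)
```

The learned threshold then classifies new candidate pairs automatically, so the expert need only review borderline cases, which is the time saving the vendors advertise.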
Data quality is a persistent issue, and is not going to be magically resolved by a software fix. However, there is no doubt that the industry is evolving, and the use of machine learning in particular shows promise in spotting and resolving data quality problems with less need for costly human intervention.
The diagram that follows shows the major data quality vendors, displayed along three dimensions, which are defined later in this document.