The Data Quality Landscape – Q1 2024

Data quality has been a thorny and rather intractable problem since the early days of computing. As soon as humans start entering data into a computer system there is the risk of error: names can be mistyped, postal codes left incomplete, addresses misspelled. If you are entering a new customer record, is “Sue Wright” the same as the existing customer “Susan Wright”? Maybe not, but what if both live at the same address? What if they share the same social security number or date of birth? Welcome to the world of data quality, where decades of software development have gone into building algorithms that can detect common misspellings and that are aware that “Richard”, “Dick” and “Rick” may be the same name in different forms. Although the industry started out in the customer name and address domain, the same quality issues clearly exist for other data domains such as product, asset and supplier.
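
To make the matching idea concrete, here is a minimal sketch of one common approach: expand known nicknames to a canonical form, then compare names with a fuzzy string similarity measure. The nickname table, the 0.85 threshold and the use of Python’s standard difflib module are illustrative assumptions for this sketch, not a description of any particular vendor’s algorithm.

```python
# Illustrative sketch of fuzzy duplicate detection for customer names.
from difflib import SequenceMatcher

# Tiny, invented subset of a nickname-to-canonical-name table.
NICKNAMES = {"dick": "richard", "rick": "richard", "sue": "susan"}

def normalise(name: str) -> str:
    """Lower-case each token and expand known nicknames."""
    return " ".join(NICKNAMES.get(t, t) for t in name.lower().split())

def likely_same(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two names as probable duplicates if, after nickname
    expansion, their character-level similarity exceeds a threshold."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio() >= threshold

print(likely_same("Sue Wright", "Susan Wright"))   # True
print(likely_same("Rick Smith", "Richard Smith"))  # True
print(likely_same("Sue Wright", "Sam Wrong"))      # False
```

Real products combine many more signals before merging records, such as shared addresses or dates of birth and phonetic matching, rather than relying on a single string comparison.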

Having duplicated or inaccurate data can have much more serious consequences than being sent your bank statement twice. A simple decimal point error in a displacement calculation left Spain’s S-80 class of submarines (the Isaac Peral was the first) 70 tons overweight, requiring a redesign, delays and cost overruns of up to €2 billion. Google Maps errors have led to demolition teams knocking down the wrong houses. In highly regulated industries like finance, insurance and pharmaceuticals, data quality errors have led to substantial fines for many companies. Such incidents make the news headlines, but mundane mistakes like shipping an order to the wrong customer address, or sending an invoice to the wrong place, cost companies around the world a great deal of money. Despite investments in data quality software, the state of data quality remains poor: research published in Harvard Business Review found that 47% of newly created data records have at least one critical error. Executives are aware of this: in survey after survey, only a third (or fewer) of executives say they fully trust their own data.

The software industry has certainly tried to provide solutions. Modern data quality software can “profile” data, detecting outliers and finding obvious issues such as null values. Data quality software can be set up with business rules, defined manually or with the help of artificial intelligence, that are then used to validate data. Software suites can cleanse data, validate things like names and addresses, and detect possible duplicates in corporate systems. They can also enrich data: latitude and longitude can be added to an address, and some suites go far beyond this, noting that an address is in a certain voting district, in a flood plain, or at a certain elevation, all very handy in particular use cases. A business address can be enriched with information such as the number of employees who work there, or the credit rating of the company. Data quality suites are sometimes sold as stand-alone best-of-breed solutions, but are often wrapped up in broader software suites that may include master data management, data integration or data catalogs.
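
As a rough illustration of the profiling step, the short sketch below counts null values per field and flags numeric values that sit an order of magnitude away from the median. The field names, the sample records and the crude median-based outlier rule are invented for this example; commercial profiling tools apply far more sophisticated statistics.

```python
# Minimal data-profiling sketch: null counts and a crude outlier check.
from statistics import median

records = [
    {"name": "Sue Wright",   "postcode": "SW1A 1AA", "order_value": 120.0},
    {"name": "Susan Wright", "postcode": None,       "order_value": 95.0},
    {"name": "Rick Smith",   "postcode": "M1 2AB",   "order_value": 110.0},
    {"name": None,           "postcode": "M1 2AB",   "order_value": 9500.0},
]

# Check 1: count null values per field.
for field in ("name", "postcode", "order_value"):
    nulls = sum(1 for r in records if r[field] is None)
    print(f"{field}: {nulls} null(s) in {len(records)} records")

# Check 2: flag values an order of magnitude away from the median.
values = [r["order_value"] for r in records if r["order_value"] is not None]
mid = median(values)
outliers = [v for v in values if v > 10 * mid or v < mid / 10]
print("possible outliers:", outliers)  # [9500.0]
```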

Modern data quality suites are available in the cloud as well as on-premises, and frequently use artificial intelligence to help with tasks such as merging and matching records. For example, some software refers possible duplicate records for human review, observes the behaviour of the human domain experts, and can then suggest new data quality rules or improve its ability to spot likely duplicates. Such machine learning has been incorporated into some data quality products for several years now. Data quality is also not a one-off exercise: the quality of data decays over time, so efforts need to be continuous rather than occasional clean-ups. This aspect of data quality monitoring, or observability, has been given additional impetus in the last few years by a new generation of vendors, such as Monte Carlo, who use artificial intelligence to generate data quality rules automatically and then produce alerts and reports when exceptions and anomalies appear.
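
The monitoring idea can be illustrated in miniature: learn a baseline for a simple metric, such as the daily null rate of a column, and raise an alert when a new observation deviates sharply from it. The history values, the choice of metric and the three-sigma alert rule below are illustrative assumptions, not a description of how Monte Carlo or any other product actually works.

```python
# Sketch of metric-based data quality monitoring with anomaly alerts.
from statistics import mean, stdev

# Daily fraction of null postcodes observed over the last two weeks.
history = [0.021, 0.019, 0.022, 0.020, 0.018, 0.021, 0.023,
           0.020, 0.019, 0.022, 0.021, 0.020, 0.018, 0.021]

def check_today(today, baseline, k=3.0):
    """Alert if today's metric is more than k standard deviations
    from the mean of the baseline period."""
    mu, sigma = mean(baseline), stdev(baseline)
    if abs(today - mu) > k * sigma:
        print(f"ALERT: null rate {today:.3f} vs baseline {mu:.3f} "
              f"(threshold ±{k * sigma:.4f})")
    else:
        print(f"OK: null rate {today:.3f} is within the expected range")

check_today(0.020, history)  # OK
check_today(0.150, history)  # ALERT: perhaps an upstream pipeline change
```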

A further impetus to the industry has been given by the interest in generative AI that was sparked in November 2022 by the general release of ChatGPT. Most large companies are concerned about privacy issues in using public AIs like those from OpenAI, Google or Anthropic, so many have taken the approach of taking a pre-trained large language model (such as Llama from Meta) and further training it on their corporate data. For example, a customer help chatbot might be trained on prior answers given and the customer database, while an engineering application might be trained on the company’s engineering documents and procedures. It turns out that AIs, perhaps unsurprisingly, produce much better results when they are trained on high-quality data than on low-quality data. Given the levels of trust in data noted earlier, many companies are having to revisit data quality in preparation for their artificial intelligence projects. The sheer amount of money being invested in AI projects may mean a substantial boost for the data quality industry.

The diagram that follows shows the major data quality vendors, displayed in three dimensions; these dimensions are defined in the Research Methodology section below.

It is important to understand that this is a high-level representation of the market: the vendors represented on the chart specialise in different areas and sit at very different price points. If you are considering data quality software, it is important to tailor your selection process to your particular needs rather than relying on high-level diagrams such as this. The Information Difference has various detailed models that can assist you in vendor selection and evaluation.

As part of the landscape process, each vendor was asked to provide at least ten reference customers (some vendors provided many times that number), who were surveyed to determine their satisfaction with the vendor’s data quality software. The happiest customers in this survey were those of Experian. Congratulations to them.

Main Vendors

Below is a list of the main data quality vendors.

Vendor | Brief Description | Website
Acceldata | Data observability vendor. | https://www.acceldata.io/
Address Doctor | Vendor that specialises in providing name and address information; now owned by Informatica. | www.informatica.com/addressdoctor.html-fbid=-gz2yeRJkyH
Anomalo | Data quality detection and monitoring vendor. | https://www.anomalo.com/
Ataccama | Prague-based company with a modern data quality suite. | www.ataccama.com
ActivePrime | US-based vendor of data quality solutions for CRM systems. | www.activeprime.com
Bigeye | Data observability vendor. | https://www.bigeye.com/
Capscan | London-based provider of address management and data integrity services; now owned by GB Group. | www.gbgplc.com/uk
Datactics | UK-based vendor of data quality and matching software for banking, finance, government, healthcare and industry. | www.datactics.com
Datras | Munich-based vendor with wide-ranging data quality functionality. | www.datras.de
DQ Global | UK data quality and address verification software. | www.dqglobal.com
Experian | UK-based vendor specialising in data quality, offering contact data validation, data observability and governance solutions. | www.edq.com/
Google | The search engine giant also does data quality. | github.com/OpenRefine
360 Science/helpIT | Vendor of integrated contact data quality solutions; now part of Syniti. | www.helpit.com
Human Inference | Dutch data quality vendor. | www.humaninference.com
IBM | Data quality software from the industry giant. | www.ibm.com
Informatica | California-based vendor, a major player in data quality. | www.informatica.com
Infogix | Illinois-based vendor specialising in controls and compliance; now part of Precisely. | www.infogix.com
Infoglide | US vendor specialising in identity resolution. | www.infoglide.com
Infoshare | UK data quality vendor specialising in the public sector market. | infoshare-is.com
Innovative Systems | Long-established data management vendor with extensive offerings including data profiling, data quality, address validation/geocoding, 360° view, and risk management solutions. | www.innovativesystems.com
Intelligent Search | Identity management company now with a more general data quality capability; now part of Experian. | www.intelligentsearch.com
Irion | Italian data quality vendor specialising in financial services. | www.irion.it/index.php/en
Melissa Data | US/German global data quality vendor offering address verification, geocoding and matching solutions. | www.melissadata.com
Microsoft | DQS is the data quality offering of the Redmond software behemoth. | www.microsoft.com
Monte Carlo | US-based data quality and observability vendor. | www.montecarlodata.com
Netrics | New Jersey vendor of matching software; now owned by Tibco. | www.tibco.com/products/automation/application-integration/pattern-matching
Oracle | The software giant’s data quality offerings are based on the acquisitions of Datanomic and SilverCreek. | www.oracle.com
Pitney Bowes | Global technology company providing data quality solutions through its Spectrum product. | https://www.pitneybowes.com/
Precisely | Data quality vendor. | https://www.precisely.com
Prospecta | Data quality vendor. | https://www.prospecta.com/
SAP | The software giant is a major data quality player. | www.sap.com
SAS | One of the leading players in data quality. | www.sas.com/
Satori Software | Provider of address management solutions; now part of BCC. | www.satorisoftware.com
Soda Data | Data quality vendor. | https://www.soda.io/
Talend | Open-source vendor with quality functions that are tied to data integration and MDM. | www.talend.com
TAMR | Vendor that applies machine learning to the data quality problem. | www.tamr.com
Telmai | Data observability and data quality vendor. | https://www.telm.ai/
Trillium Software | One of the leading data quality vendors, acquired by Syncsort and now part of Precisely. | www.trilliumsoftware.com
Uniserv | Large German data quality vendor. | www.uniserv.com

Other vendors of data quality software include:

Data Lever | http://www.redpoint.net/
Infosolve | http://www.infosolvetech.com
Ixsight | http://www.ixsight.com
TIQ Solutions | http://www.tiq-solutions.com
Winpure | http://www.winpure.com
Wizsoft | http://www.wizsoft.com

Research Methodology

The Information Difference Landscape diagram shows three dimensions of a vendor:

  • Market strength
  • Technology
  • Customer base.

“Market strength” is made up of a weighted set of five factors: revenues, growth, financial strength, geographic scope and partner network. Each of these individual elements is scored, the total producing the “market strength” figure. Similarly, “technology” is made up of four factors: “technology breadth” (the coverage of the vendors in various data quality areas, as illustrated below), the longevity of the software in the market, analyst perception of the product via briefings, and feedback from the reference customers we surveyed (this last factor carries a high weighting). In each case the scoring is on a scale of 0 (worst) to 6 (best).
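
By way of illustration, the sketch below computes a weighted “market strength” score on the 0 to 6 scale described above. The factor scores and the weights are hypothetical numbers invented for this example; the actual weightings used by The Information Difference are not stated in this report.

```python
# Hypothetical example of combining factor scores into "market strength".
factor_scores = {            # each factor scored 0 (worst) to 6 (best)
    "revenues": 4.5,
    "growth": 3.0,
    "financial_strength": 5.0,
    "geographic_scope": 4.0,
    "partner_network": 3.5,
}
weights = {                  # invented weights, normalised to sum to 1
    "revenues": 0.30,
    "growth": 0.20,
    "financial_strength": 0.20,
    "geographic_scope": 0.15,
    "partner_network": 0.15,
}

market_strength = sum(factor_scores[f] * weights[f] for f in factor_scores)
print(f"market strength: {market_strength:.2f} / 6")  # 4.08 / 6
```

The same weighted-sum pattern applies to the “technology” dimension, with its four factors.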

Vendors were asked to submit answers to various questions via a questionnaire. Vendors were interviewed directly by an analyst, and their software was demonstrated and assessed. Reference customers were surveyed to report their experience of each vendor’s software. The technology functions that the vendors were asked about are shown below. These are drawn from the Information Difference vendor functionality model; if you are interested in more detail, please contact The Information Difference.

Functional Areas