The Data Quality Landscape – Q1 2019
Data quality has been an issue in computing ever since people first started to store data on computers. Data may be incomplete, out of date, inconsistent, misspelt, unavailable or just plain wrong. When companies and governments started to maintain name and address lists of customers, citizens and prospects it became clear that getting clean and accurate name and address data was a thorny problem; in the USA alone around 45 million people move address each year, so one-off data clean-up exercises are insufficient. An industry of software vendors has sprung up to address this problem, using algorithms designed to detect common misspellings and others to detect likely matches amongst multiple records that may, or may not, be duplicates. Despite this, a 2002 PWC study found that almost a quarter of mail is incorrectly addressed.
The issue is by no means restricted to customer name and address data. A typical materials master file will have errors in 20-30% of entries, and product data is usually more complex than address data. The same accuracy issues occur with data about suppliers, assets, contracts, locations and staff. The consequences of data quality issues in such data can be much more serious than a marketing flyer being misaddressed. Government agencies need to be confident about the accuracy of terrorist watch lists, for example, and data quality is a key element in addressing financial fraud. A 2015 review of financial fraud studies in nine countries including the USA by the University of Plymouth found that the annual cost of financial fraud was over $4 trillion, with around 6% of transactions affected. This study went beyond banks and covered areas like pensions, insurance, social security, construction and education. The good news is that active measures to address the issue can have a significant effect. The US Department of Agriculture reduced its losses by 28% in a $12 billion program over a three-year period, and Britain’s National Health Service reduced its losses to fraud by 60% over a ten-year period.
Errors in location data can have expensive consequences, as a major oil company discovered in the 1990s when an exploration well in the North Sea was drilled into an existing production well due to inaccurate geospatial data. A unit of measure error involving imperial rather than metric units caused the Mars Climate Orbiter satellite to crash in 1999, but such things are not new. Christopher Columbus accidentally landed in America when he based his route on calculations using the shorter 4,856 foot Roman mile rather than the 7,091 foot Arabic mile of the Persian geographer that he was relying on.
Data quality software these days goes beyond simple name and address validation. Profiling data shows statistics about data files and suggests inter-relationships between content. Data matching can be based on specific business rules or based on probability. Data can be enriched, for example by adding latitude and longitude to address data, or by bringing in additional data such as household income, or whether a building lies in a flood plain, which is clearly helpful to the insurance industry. Although most data quality tools work on data stored in databases, there are early attempts to apply data quality tools to so-called “big data”, which usually resides in file systems like the Hadoop Distributed File System. Trying to spot, for example, customer data that may be lurking within a weblog or social media is a very different problem to parsing a customer name field in a file, yet data quality is as real an issue in the big data world as it is in traditional databases. Even though a lot of big data is machine generated, quality issues can still occur, even in more esoteric structures such as images and sensor logs. The data quality industry is just beginning to address some of these emerging challenges, even as the volumes of data involved are growing at an exponential pace. Regulatory changes such as the European GDPR privacy legislation create further challenges for organisations, as Google discovered in January 2019 when it was hit with a €50 million fine for two GDPR violations related to advert personalisation.
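As a rough illustration of the profiling and probability-based matching described above, the following Python sketch computes a few column statistics and flags likely duplicate records. It is not how any vendor's product works: the function names, the sample records and the 0.85 threshold are all assumptions made for illustration, and the similarity measure is simply the ratio from the Python standard library's difflib module rather than a true probabilistic matching model.

```python
from collections import Counter
from difflib import SequenceMatcher


def profile_column(values):
    """Return simple profiling statistics for one column of raw values."""
    # Treat None and blank strings as missing.
    non_empty = [v for v in values if v is not None and str(v).strip()]
    counts = Counter(non_empty)
    return {
        "rows": len(values),
        "missing": len(values) - len(non_empty),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0] if counts else None,
    }


def normalise(text):
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Similarity between 0.0 and 1.0, using difflib's SequenceMatcher ratio."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def likely_duplicates(records, threshold=0.85):
    """Return (record, record, score) for pairs at or above the threshold.

    The threshold is an illustrative assumption, not a recommended value.
    Comparing every pair is quadratic, so this only suits small lists.
    """
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = similarity(records[i], records[j])
            if score >= threshold:
                pairs.append((records[i], records[j], round(score, 2)))
    return pairs


# Hypothetical sample data, invented for this example.
cities = ["London", "london", "", None, "Paris", "London"]
customers = [
    "Jon Smith, 12 High Street",
    "John Smith, 12 High Street",
    "Mary Jones, 4 Mill Lane",
]

print(profile_column(cities))
print(likely_duplicates(customers))
```

A rule-based matcher would replace the similarity score with explicit conditions, for example requiring an identical postcode before names are compared at all.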
The diagram that follows shows the major data quality vendors, displayed in three dimensions. See later for definitions of these.
It is important to understand that this is a high-level representation of the market, with vendors represented on the chart specialising in different areas and at very different price-points. If you are considering data quality software, it is important to tailor your selection process to the particular needs that you have rather than relying on high-level diagrams such as this. The Information Difference has various detailed models that can assist you in vendor selection and evaluation.
As part of the landscape process, each vendor was asked to provide at least ten reference customers (some vendors provided many times that number), which were surveyed to determine their satisfaction with the data quality software of the vendor. The happiest customers based on this survey were those of Innovative Systems, fractionally ahead of those of Syncsort (formerly Trillium), and Datactics. Congratulations to those vendors.
Below is a list of the main data quality vendors.
|ActivePrime||US-based vendor of data quality for CRM systems.||www.activeprime.com|
|Address Doctor||Vendor that specialises in providing wide coverage of name and address information; now owned by Informatica.||www.informatica.com/addressdoctor.html|
|Ataccama||Prague-based company with a modern data quality suite.||www.ataccama.com|
|Capscan||London-based provider of address management and data integrity services, now owned by GB Group.||www.gbgplc.com/uk/|
|Data Mentors||Long-established US data quality vendor.||www.datamentors.com|
|Datactics||UK-based vendor of data quality and matching software to banking, finance, government, healthcare and industry.||www.datactics.com|
|Datiris||Colorado vendor of data profiling technology.||www.datiris.com|
|Datras||Munich-based vendor with wide ranging data quality functionality.||www.datras.de|
|DQ Global||UK data quality and address verification software.||www.dqglobal.com|
|Experian||UK-based vendor specialising in customer name and address validation, data profiling and data enrichment.||www.edq.com|
|Google||The search engine giant does data quality.||github.com/OpenRefine|
|helpIT/360 Science||US/UK vendor of integrated contact data quality solutions including matching and address validation.||www.helpit.com|
|Human Inference||Dutch data quality vendor.||www.humaninference.com|
|IBM||Data quality software from the industry giant.||www.ibm.com|
|Infogix||Illinois-based vendor specialising in controls and compliance.||www.infogix.com|
|Infoglide||US vendor specialising in identity resolution.||www.infoglide.com|
|Informatica||California-based vendor, a major player in data quality.||www.informatica.com|
|Infoshare||UK data quality vendor specialising in the public sector market.||www.infoshare-is.com|
|Innovative Systems||Long-established data management vendor with extensive offerings including data profiling, data quality, address validation/geocoding, 360° view, and risk management solutions.|||
|Inquera||Israeli company with an approach to product data quality using machine-learning technology based on subject domain experts' knowledge.||www.inquera.com|
|Intelligent Search||Identity management company now with a more general data quality capability.||www.intelligentsearch.com|
|Irion||Italian data quality vendor specialising in financial services.||www.irion.it/index.php/en/|
|Melissa||US/German global data quality vendor offering address verification, geocoding and matching solutions.||www.melissa.com|
|Microsoft||DQS is the data quality offering of the Redmond software behemoth.||www.microsoft.com|
|Netrics||New Jersey vendor of matching software. Now owned by Tibco.||www.tibco.com/products/automation/application-integration/pattern-matching|
|Oracle||The software giant's data quality offerings are based on the acquisitions of Datanomic and SilverCreek.||www.oracle.com|
|Pitney Bowes||Pitney Bowes, a global technology company, provides data quality solutions through its Customer Information Management (CIM) unit, which is part of its Digital Commerce Solutions division.||www.pitneybowes.com/us/customer-information-management/data-quality.html|
|Postcode Anywhere||UK vendor of web-based addressing software.||www.postcodeanywhere.co.uk|
|SAP||The software giant is a major data quality player.||www.sap.com|
|SAS||One of the leading players in data quality.||www.sas.com/en_us/software/data-management/data-quality.html|
|Satori Software||Seattle-based provider of address management solutions.||www.satorisoftware.com|
|Syncsort||Trillium Software, one of the leading data quality vendors, now acquired by Syncsort.||www.syncsort.com|
|Talend||Open source vendor with a wide range of data quality functions that are tied to data integration and MDM.||www.talend.com|
|tamr||Vendor that applies machine learning to the data quality problem.||www.tamr.com|
|Trillium Software||One of the leading data quality vendors, now acquired by Syncsort.||www.trilliumsoftware.com|
|Uniserv||Large German data quality vendor.||www.uniserv.com|
Other vendors of data quality software include:
The Information Difference Landscape diagram shows three dimensions of a vendor:
- Market strength
- Technology
- Customer base
“Market strength” is made up of a weighted set of five factors: revenues, growth, financial strength, geographic scope and partner network. Each of these individual elements is scored, the total producing the “market strength” figure. Similarly “technology” is made up of four factors: “technology breadth” (the coverage of the vendors in various data quality areas as illustrated below), the longevity of the software in the market, analyst perception of the product via briefings, and customer feedback from reference customers (this has a high weighting), which we surveyed. In each case the scoring is on a scale of 0 (worst) to 6 (best).
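The weighted scoring described above can be sketched as follows. The actual weights used in the landscape are not published here, so the weights and factor scores below are hypothetical, and dividing by the total weight so that the result stays on the same 0 to 6 scale is an assumption about how the total is formed.

```python
def weighted_score(scores, weights):
    """Combine per-factor scores (each 0 to 6) into one weighted figure.

    The result is normalised by the total weight, so it also falls
    between 0 and 6. That normalisation is an assumption of this sketch.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same factors")
    for factor, value in scores.items():
        if not 0 <= value <= 6:
            raise ValueError(f"{factor} score {value} is outside 0 to 6")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[f] * weights[f] for f in scores) / total_weight


# Hypothetical weights and scores, invented purely for illustration.
market_weights = {
    "revenues": 30,
    "growth": 20,
    "financial strength": 20,
    "geographic scope": 15,
    "partner network": 15,
}
market_scores = {
    "revenues": 4,
    "growth": 5,
    "financial strength": 3,
    "geographic scope": 4,
    "partner network": 2,
}

print(weighted_score(market_scores, market_weights))
```

The same function would apply to the four “technology” factors, with a larger weight on customer feedback to reflect the high weighting mentioned above.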
Vendors were asked to submit answers to various questions via a questionnaire. Vendors were interviewed directly by an analyst and their software demonstrated and assessed. Reference customers were surveyed to give their experience of the software of each vendor. The technology functions which the vendors were asked about are as shown below. These are drawn from the Information Difference vendor functionality model; if you are interested in more detail on this then please contact The Information Difference.