The Data Quality Landscape – Q1 2018
Data quality has been an issue since the first explorers brought back inaccurate maps of distant lands, with such minor issues as California being shown as an island on one antique map. Human nature means that mistakes are made when data is entered into computer systems, and all the validation rules in the world will not prevent that entirely. Consequently, we live in a world where our details are entered into supplier databases multiple times with inconsistencies such as old addresses and phone numbers, misspellings and missing data. A 2002 PWC study found that almost a quarter of mail is incorrectly addressed, and experienced consultants reckon that a typical materials master file will have errors in 20-30% of entries.
The job of data quality software is to address these data imperfections as much as possible, using algorithms and business rules to identify likely duplicate records, correct obvious misspellings, and complete and consolidate records where possible. The industry has historically focused on customer name and address data, partly because every business has customers and so the problem is common to all businesses, and partly because the problem is relatively tractable. There are plenty of published algorithms (such as Soundex and Metaphone) that can spot likely matches in similar-sounding names, e.g. Smith and Smythe, and more elaborate statistical processes that can be applied to records to predict likely matches or duplicates. These days, data quality software goes much further, in some cases providing glossaries of common terms in various languages that enable software to recognise that “Richard”, “Dick” and “Ricky”, or “Kate”, “Kathie” and “Katherine”, are likely the same name. Vendor software can similarly be applied to address data, using postal codes to check addresses, and potentially enriching that data with latitude and longitude information, or with whether an address is in a certain voting district or even within a flood plain.
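As a rough illustration of the matching ideas above, the sketch below combines a plain implementation of the classic American Soundex code with a tiny nickname glossary. The glossary contents and the helper names `canonical_first_name` and `likely_same_person` are assumptions made for this example only; commercial products use far larger glossaries and more elaborate statistical matching.

```python
def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a name."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    chars = [ch for ch in name.upper() if ch.isalpha()]
    if not chars:
        return ""

    result = chars[0]
    prev = codes.get(chars[0], "")
    for ch in chars[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code after it counts
    return (result + "000")[:4]  # pad or truncate to one letter plus three digits


# Hypothetical, deliberately tiny glossary mapping nicknames to a canonical name.
NICKNAMES = {
    "dick": "richard", "ricky": "richard",
    "kate": "katherine", "kathie": "katherine",
}


def canonical_first_name(name: str) -> str:
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def likely_same_person(first_a, last_a, first_b, last_b) -> bool:
    """Flag a candidate match; a person should still review borderline cases."""
    return (canonical_first_name(first_a) == canonical_first_name(first_b)
            and soundex(last_a) == soundex(last_b))


print(soundex("Smith"), soundex("Smythe"))  # S530 S530
print(likely_same_person("Dick", "Smith", "Richard", "Smythe"))  # True
```

A phonetic code on its own produces many false positives, which is why it is paired here with a second signal (the first name) before a pair is flagged as a likely duplicate.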
As time has passed, data quality software has developed an elaborate ability to “profile” data to spot likely errors, e.g. data that does not conform to an expected pattern or data that is outside an expected range. Some products provide functionality for entering business-specific data quality rules, and for managing the workflow around alerting people to likely data errors and correcting them. Some products can parse textual data, an important feature for handling product rather than customer data, where many source files are in free text rather than a more structured format. Matching algorithms in particular have grown more sophisticated, though there is usually a need for human intervention: the consequences of a false positive (or false negative) match in the case of a drug test or a terrorist watch alert are very different from those of a mis-addressed piece of direct mail.
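The two profiling checks mentioned above, pattern conformance and range checks, can be sketched in a few lines. This is a minimal illustration and not how any particular product works; the function name `profile_column`, the US ZIP code pattern and the age limits are all assumptions chosen for the example.

```python
import re


def profile_column(values, pattern=None, min_value=None, max_value=None):
    """Return (index, value, reason) tuples for values that look suspicious.

    Use `pattern` for text columns and `min_value`/`max_value` for numeric
    columns; mixing them on one column is outside the scope of this sketch.
    """
    regex = re.compile(pattern) if pattern else None
    issues = []
    for i, value in enumerate(values):
        if value is None or value == "":
            issues.append((i, value, "missing"))
            continue
        if regex is not None and not regex.fullmatch(str(value)):
            issues.append((i, value, "pattern mismatch"))
            continue
        if min_value is not None and value < min_value:
            issues.append((i, value, "below expected range"))
        elif max_value is not None and value > max_value:
            issues.append((i, value, "above expected range"))
    return issues


# Assumed example data: ZIP codes checked against a pattern, ages against a range.
zip_issues = profile_column(["80202", "8020", "", "80301-1234"],
                            pattern=r"\d{5}(-\d{4})?")
age_issues = profile_column([34, 210, -1], min_value=0, max_value=120)
print(zip_issues)  # [(1, '8020', 'pattern mismatch'), (2, '', 'missing')]
print(age_issues)  # [(1, 210, 'above expected range'), (2, -1, 'below expected range')]
```

The function only reports suspect values; deciding whether each one is really an error is left to the business rules and review workflow described above.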
Over the last few years many data quality suites have moved beyond simple profiling or name and address validation and developed broader functionality of the type described above. The last year has seen greater interest in applying the techniques of machine learning to data quality problems. This is partly a response to the general increase in interest in the field, and some vendors are now using “artificial intelligence” or “machine learning” labels about their software rather creatively. Nonetheless, developments in machine learning open up new possibilities for data quality software. Another area that has seen recent interest is the application of data quality techniques to “big data” such as Hadoop files rather than just to traditional databases. Although much of this data is machine generated, searching for meaningful content within it, such as customer and product identifiers, is an area in which some vendors have started to develop functionality.
The diagram that follows shows the major data quality vendors, displayed in three dimensions. See later for definitions of these.
It is important to understand that this is a high-level representation of the market, with vendors represented on the chart specialising in different areas and at very different price-points. If you are considering data quality software, it is important to tailor your selection process to the particular needs that you have rather than relying on high-level diagrams such as this. The Information Difference has various detailed models that can assist you in vendor selection and evaluation.
As part of the landscape process, each vendor was asked to provide at least ten reference customers (some vendors provided many times that number), which were surveyed to determine their satisfaction with the data quality software of the vendor. The happiest customers based on this survey were those of Datactics followed by ActivePrime, then those of Innovative Systems, Experian and Syncsort (formerly Trillium). Congratulations to those vendors.
Below is a list of the main data quality vendors.
|ActivePrime||US-based vendor of data quality for CRM systems.||www.activeprime.com|
|Address Doctor||Vendor that specialises in providing wide coverage of name and address information; now owned by Informatica.||www.informatica.com/addressdoctor.html|
|Ataccama||Prague-based company with a modern data quality suite.||www.ataccama.com|
|Capscan||London-based provider of address management and data integrity services, now owned by GB Group.||www.gbgplc.com/uk/|
|Data Mentors||Long-established US data quality vendor.||www.datamentors.com|
|Datactics||UK-based vendor of data quality and matching software to banking, finance, government, healthcare and industry.||www.datactics.com|
|Datiris||Colorado vendor of data profiling technology.||www.datiris.com|
|Datras||Munich-based vendor with wide-ranging data quality functionality.||www.datras.de|
|DQ Global||UK data quality and address verification software.||www.dqglobal.com|
|Experian||UK-based vendor specialising in customer name and address validation, data profiling and data enrichment.||www.edq.com|
|Google||The search engine giant does data quality.||github.com/OpenRefine|
|helpIT/360 Science||US/UK vendor of integrated contact data quality solutions including matching and address validation.||www.helpit.com|
|Human Inference||Dutch data quality vendor.||www.humaninference.com|
|IBM||Data quality software from the industry giant.||www.ibm.com|
|Infogix||Illinois-based vendor specialising in controls and compliance.||www.infogix.com|
|Infoglide||US vendor specialising in identity resolution.||www.infoglide.com|
|Informatica||California-based vendor, a major player in data quality.||www.informatica.com|
|Infoshare||UK data quality vendor specialising in the public sector market.||www.infoshare-is.com|
|Innovative Systems||Long-established data management vendor with extensive offerings including data profiling, data quality, address validation/geocoding, 360° view, and risk management solutions.|
|Inquera||Israeli company with an approach to product data quality using machine-learning technology based on subject domain experts' knowledge.||www.inquera.com|
|Intelligent Search||Identity management company now with a more general data quality capability.||www.intelligentsearch.com|
|Irion||Italian data quality vendor specialising in financial services.||www.irion.it/index.php/en/|
|Melissa Data||US/German global data quality vendor offering address verification, geocoding and matching solutions.||www.melissadata.com|
|Microsoft||DQS is the data quality offering of the Redmond software behemoth.||www.microsoft.com|
|Netrics||New Jersey vendor of matching software. Now owned by Tibco.||www.tibco.com/products/automation/application-integration/pattern-matching|
|Oracle||The software giant's data quality offerings are based on the acquisitions of Datanomic and SilverCreek.||www.oracle.com|
|Pitney Bowes||Pitney Bowes, a global technology company, provides data quality solutions through its Customer Information Management (CIM) unit, which is part of its Digital Commerce Solutions division.||www.pitneybowes.com/us/customer-information-management/data-quality.html|
|Postcode Anywhere||UK vendor of web-based addressing software.||www.postcodeanywhere.co.uk|
|SAP||The software giant is a major data quality player.||www.sap.com|
|SAS||One of the leading players in data quality.||www.sas.com/en_us/software/data-management/data-quality.html|
|Satori Software||Seattle-based provider of address management solutions.||www.satorisoftware.com|
|Syncsort||Trillium Software, one of the leading data quality vendors, now acquired by Syncsort.||www.syncsort.com|
|Talend||Open source vendor with a wide range of quality functions that are tied to data integration and MDM.||www.talend.com|
|TAMR||Vendor that applies machine learning to the data quality problem.||www.tamr.com|
|Uniserv||Large German data quality vendor.||www.uniserv.com|
Other vendors of data quality software include:
The Information Difference Landscape diagram shows three dimensions of a vendor:
▪ Market strength
▪ Technology
▪ Customer base
“Market strength” is made up of a weighted set of five factors: revenues, growth, financial strength, geographic scope and partner network. Each of these individual elements is scored, the total producing the “market strength” figure. Similarly “technology” is made up of four factors: “technology breadth” (the coverage of the vendors in various data quality areas as illustrated below), the longevity of the software in the market, analyst perception of the product via briefings, and customer feedback from reference customers (this has a high weighting), which we surveyed. In each case the scoring is on a scale of 0 (worst) to 6 (best).
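The way a weighted set of factor scores might combine into a single figure on the same 0 to 6 scale can be sketched as a weighted average. The actual weights and the exact combination method are not given in this document, so the weights, the example scores and the function name `weighted_score` below are purely hypothetical.

```python
def weighted_score(scores, weights):
    """Combine per-factor scores (each 0 to 6) into one weighted figure.

    Dividing by the total weight keeps the result on the same 0 to 6 scale.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same factors")
    for factor, score in scores.items():
        if not 0 <= score <= 6:
            raise ValueError(f"score for {factor!r} must be between 0 and 6")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[f] * weights[f] for f in scores) / total_weight


# Hypothetical weights and scores for the five market strength factors.
weights = {"revenues": 3, "growth": 2, "financial strength": 2,
           "geographic scope": 1, "partner network": 1}
scores = {"revenues": 4, "growth": 5, "financial strength": 3,
          "geographic scope": 2, "partner network": 6}
print(round(weighted_score(scores, weights), 2))
```

The same function could be applied to the four technology factors, with customer feedback given the largest weight to reflect its high weighting.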
Vendors were asked to submit answers to various questions via a questionnaire. Vendors were interviewed directly by an analyst and their software demonstrated and assessed. Reference customers were surveyed to give their experience of the software of each vendor. The technology functions which the vendors were asked about are as shown below. These are drawn from the Information Difference vendor functionality model; if you are interested in more detail on this then please contact The Information Difference.