The Big Data Warehouse Landscape – Q4 2018
The concept of the data warehouse dates back to the 1980s. At that time databases were designed primarily to support the automation of core business operations: transaction processing of things like invoices, payments, sales orders and payroll. Databases were optimised for that type of processing, which made them much less efficient at work that needed to read large swathes of the database rather than update or insert individual records. Yet reporting requires calculating totals and similar aggregates, which means read-only processing across much of the database content. This type of analytic processing interfered with transaction processing and slowed it down, so the idea of the data warehouse was to separate the two types of processing. Transaction databases remained as they were, but a copy of them was taken, usually overnight in batch, for reporting and analytic purposes. This was the data warehouse.
As applications proliferated it became clear that common data such as customer, product, location and asset were often held in duplicate across transaction systems, so the design of the data warehouse started to diverge from the schemas of the transaction databases in order to resolve the duplication. Different database products appeared that were optimised for query processing rather than transaction processing.
Data warehouses grew and grew. In 2003 the largest data warehouse in the world was 30 TB in size, yet a decade later there were examples of petabyte-sized data warehouses, roughly a 30-fold increase in ten years. Database technologies had to adapt to meet these challenges. Parallel processing allowed complex queries to be split across multiple processors, which meant dividing the problem into smaller bundles, and this in turn required quite different database optimisation. Another approach was to invert the shape of the database itself. Traditional row-oriented relational databases started to give way to column-oriented databases, which increased query efficiency at the cost of load and update speeds. These columnar databases could often operate in parallel too, bringing greater processing power to bear on complex queries across very large databases.
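The two ideas above, column orientation and splitting a query across processors, can be illustrated with a minimal Python sketch. The sales records, the function names and the use of a thread pool are all assumptions made for illustration; no particular vendor's product works exactly this way.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sales records, stored row by row as a transactional database would.
rows = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0},
    {"order_id": 2, "region": "APAC", "amount": 80.0},
    {"order_id": 3, "region": "EMEA", "amount": 200.0},
    {"order_id": 4, "region": "AMER", "amount": 50.0},
]

def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    return {name: [r[name] for r in records] for name in records[0]}

def partition(values, parts):
    """Split a column into roughly equal contiguous chunks."""
    size = max(1, -(-len(values) // parts))  # ceiling division, at least 1
    return [values[i:i + size] for i in range(0, len(values), size)]

def parallel_sum(values, workers=2):
    """Sum each chunk on its own worker, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum, partition(values, workers)))

columns = to_columns(rows)

# Row-oriented: the query walks every whole record to pick out one field.
row_total = sum(r["amount"] for r in rows)

# Column-oriented: the query reads only the "amount" column, in parallel chunks.
column_total = parallel_sum(columns["amount"])
```

Both totals are the same; the difference is how much data has to be read to produce them. Adding a new order shows the cost side of the trade-off: the row layout appends one record, while the column layout has to append to every column list.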
The next major challenge for data warehouses was to handle more complex datatypes, beyond just numbers and descriptive text. Time series data was one example, and as e-commerce grew, data such as web logs became important to analyse. Far more devices started to have sensors, which themselves generated considerable volumes of data. Smart electricity meters, cars and aeroplanes started to churn out sizeable volumes of data, as did mobile telephone towers and call logs, along with seismic data in the energy industry. The volumes of such data stretched the capabilities of traditional databases. Indeed, around 90% of all the data in existence today was created in the last two years. Although the prices of disk storage and memory have plummeted, the sheer volume of data being generated has swamped the ability of conventional databases to handle it.
The invention of a programming technique called MapReduce by Google in 2004 led to Hadoop, an open source distributed processing framework that manages data processing and storage for “big data” applications running in clustered systems. Modern data warehouse technology embraces these various approaches in different ways, with traditional data warehouses often sitting alongside “data lakes” of big data managed by Hadoop.
Database and data warehouse vendors have continued to adapt, for example by allowing statistical processes to be embedded as native database functions, and by providing native support for time series data. Data warehouses are now commonly deployed in the cloud rather than just on-premises, and this has led to the development of cloud-only data warehouses such as Amazon Redshift and Snowflake Computing.
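The MapReduce technique mentioned above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce phases, not Hadoop itself; the log format and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical web log lines: "<client> <HTTP status> <path>".
log_lines = [
    "10.0.0.1 200 /home",
    "10.0.0.2 404 /missing",
    "10.0.0.1 200 /products",
    "10.0.0.3 500 /checkout",
    "10.0.0.2 200 /home",
]

def map_phase(line):
    """Emit one (key, value) pair per record: (status code, 1)."""
    _client, status, _path = line.split()
    yield status, 1

def shuffle(pairs):
    """Group all values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse the values for one key into a single result."""
    return key, sum(values)

mapped = [pair for line in log_lines for pair in map_phase(line)]
status_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Because each call to `map_phase` looks at one line only and each call to `reduce_phase` looks at one key only, a framework is free to run them on different machines in a cluster, which is what makes the approach suitable for very large volumes of data.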
The main vendors in the market are summarised in the diagram below.
The landscape diagram represents the market in three dimensions. The size of the bubble represents the customer base of the vendor, i.e. the number of corporations it has sold data warehouse software to, adjusted for deal size. The larger the bubble, the broader the customer base, though this is not to scale. The technology score is made up of a weighted set of scores derived from: customer satisfaction as measured by a survey of reference customers¹, analyst impression of the technology, maturity of the technology in terms of its time in the market and the breadth of the technology in terms of its coverage against our functionality model. Market strength is made up of a weighted set of scores derived from: data warehouse revenue, growth, financial strength, size of partner ecosystem, customer base (revenue adjusted) and geographic coverage. The Information Difference maintains profiles of vendors that go into more detail. Customers are encouraged to carefully look at their own specific requirements rather than high-level assessments such as the Landscape diagram when assessing their needs.
A significant part of the “technology” dimension scoring is assigned to customer satisfaction, as determined by a survey of vendor customers. In this annual research cycle the vendor with the happiest customers was Teradata, based on a sample of 47 completed customer reference surveys. Our congratulations to them.
¹ In the absence of sufficient completed references, a neutral score was assigned to this factor.
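As a rough illustration of how a weighted technology score of the kind described above might be combined, including the neutral score used when customer references are missing, consider the following sketch. The weights, the 0 to 10 scale and the neutral value of 5 are assumptions made for the example; the actual weights and scale are not published here.

```python
# Hypothetical weights and scale; the source does not publish either.
TECHNOLOGY_WEIGHTS = {
    "customer_satisfaction": 4,
    "analyst_impression": 2,
    "maturity": 2,
    "breadth": 2,
}
NEUTRAL_SCORE = 5.0  # assumed midpoint of an assumed 0-10 scale

def technology_score(scores, weights=TECHNOLOGY_WEIGHTS, neutral=NEUTRAL_SCORE):
    """Weighted average of factor scores; a missing factor gets the neutral score."""
    total_weight = sum(weights.values())
    weighted = 0.0
    for factor, weight in weights.items():
        value = scores.get(factor)
        weighted += weight * (neutral if value is None else value)
    return weighted / total_weight

# A vendor with enough completed reference surveys.
with_references = technology_score({
    "customer_satisfaction": 9.0,
    "analyst_impression": 7.0,
    "maturity": 8.0,
    "breadth": 6.0,
})

# The same vendor without enough references: the neutral score is substituted.
without_references = technology_score({
    "customer_satisfaction": None,
    "analyst_impression": 7.0,
    "maturity": 8.0,
    "breadth": 6.0,
})
```

Under these assumed weights, a missing customer satisfaction score pulls a strong vendor towards the middle of the scale rather than penalising it outright, which is the effect a neutral score is meant to have.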
Below is a list of the significant data warehouse vendors.
|1010 Data||Provides a column-oriented database and a web-based data analysis platform.||www.1010data.com|
|Actian||Actian's product is an analytic database on commodity hardware.||www.actian.com|
|Amazon Redshift||Cloud-based data warehouse solution.||www.aws.amazon.com/redshift/|
|Exasol||German data warehouse appliance vendor.||www.exasol.com|
|Greenplum||Appliance vendor aiming at high-end warehouses, now part of Pivotal, a subsidiary of EMC, itself acquired by Dell in 2015.||pivotal.io/big-data/pivotal-greenplum|
|HPCC||An open-source, massively parallel platform for big data processing, developed by LexisNexis Risk Solutions.||www.hpccsystems.com|
|IBM||Infosphere Balanced Warehouse (formerly DB2) is the data warehouse software offering from the industry giant, which also offers two appliances: PureData for Operational Analytics (based on DB2) and PureData for Analytics powered by Netezza technology.||www.ibm.com|
|InfoBright||Provides a columnar-database analytics platform.||www.infobright.com|
|jSonar||Boston-based NoSQL data warehouse vendor.||www.jsonar.com|
|Kognitio||Mature data warehouse appliance, offering its data warehouse as a service.||www.kognitio.com|
|Magnitude||Now part of Magnitude Software, Kalido is an application to automate building and maintaining data warehouses that adapt to change, running on various database platforms.||www.kalido.com|
|MarkLogic||Enterprise NoSQL database vendor.||www.marklogic.com|
|Microsoft||As well as offering its SQL Server relational database, Microsoft acquired DATAllegro and at the end of 2010 launched its Parallel Data Warehouse based on this technology.||www.microsoft.com|
|MonetDB||MonetDB is an open-source columnar database system for high-performance applications.||www.monetdb.cwi.nl|
|Neo4j||Open source graph database.||www.neo4j.org|
|Oracle||Database and applications giant with its own data warehouse appliance.||www.oracle.com|
|ParStream||Columnar, in-memory, MPP database vendor aimed at analytic processing.||www.parstream.com|
|Pivotal||Owners of the Greenplum massively parallel data warehouse solution, now an open-source solution.||pivotal.io/big-data/pivotal-greenplum|
|Qubole||Markets the Qubole Data Service, which accelerates analytics workloads working on data stored in cloud databases.||www.qubole.com|
|Sand||Focuses on allowing customers to effectively retain massive amounts of compressed data in a near-line repository for extended periods.||www.sand.com|
|SAP/Sybase||Sybase was a pioneer in column-oriented analytic database technology, acquired in mid-2010 by giant SAP. SAP also offers the in-memory database technology HANA.||www.sap.com|
|SAS Institute||Comprehensive data warehouse technology from the largest privately owned software company in the world.||www.sas.com|
|Snowflake||Snowflake Computing sells a cloud-based data storage and analytics service called Snowflake.||www.snowflake.com|
|Teradata||Database giant with its own data warehouse solutions.||www.teradata.com|
|Vertica||Appliance vendor Vertica was purchased by HP in 2011.||www.vertica.com|
|WhereScape||Not an appliance, but a framework for the development and support of data warehouses.||www.wherescape.com|
|XtremeData||US vendor that provides a highly scalable cloud database platform.||www.xtremedata.com|