The Big Data Warehouse Landscape – Q4 2017
In the early days of databases, vendors focused on transaction processing in large enterprises for systems such as accounts receivable and sales order processing, which were the main use cases for deployment. As companies began to analyse information about their business performance, it became clear that data would have to be extracted from multiple transaction systems, which usually had incompatible structures and classifications of common data such as customer and product hierarchies. Beyond the complication of having to rationalise and aggregate data from many systems, the underlying databases themselves were unsuited to analytic processing. Relational databases were designed for high performance under heavy concurrent update from many users, and struggled with queries that stretched across large swathes of the database. Enterprises started to deploy separate databases for analytic purposes, even when the vendor was the same. As data volumes grew and processing demands mounted, specialist databases appeared that were optimised for analytic-style processing, often at the expense of the ability to handle highly concurrent updates, which for data warehouses was not really a requirement.
Data warehouses grew greatly in size. In 2003 the largest data warehouse in the world was 30 TB; a decade later there were examples of petabyte-sized data warehouses, more than a 30-fold increase in ten years. Database technologies developed to meet these challenges. Parallel processing allowed a complex query to be split into smaller bundles of work spread across multiple processors, which in turn required quite different database optimisation. Another approach was to invert the shape of the database itself: traditional row-oriented relational databases started to give way to column-oriented databases, which increased query efficiency at the cost of load and update speeds. These columnar databases could often operate in parallel too, bringing more processing power to bear on complex queries across large databases.
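The row-versus-column trade-off described above can be illustrated with a toy sketch (not any vendor's actual implementation): the same small sales table held in both layouts, showing why an analytic scan favours the columnar form while loads favour the row form.

```python
# Toy illustration of row-oriented vs column-oriented storage.
# The table, names and values are invented for the example.

rows = [  # row-oriented: one record per sale
    {"customer": "A", "region": "EU", "amount": 120.0},
    {"customer": "B", "region": "US", "amount": 75.5},
    {"customer": "C", "region": "EU", "amount": 210.0},
]

# Column-oriented: one contiguous list per attribute.
columns = {
    "customer": ["A", "B", "C"],
    "region": ["EU", "US", "EU"],
    "amount": [120.0, 75.5, 210.0],
}

# An analytic query ("total sales amount") touches only one column in the
# columnar layout, instead of dragging every field of every row through memory.
total_row_store = sum(r["amount"] for r in rows)
total_col_store = sum(columns["amount"])
assert total_row_store == total_col_store == 405.5

# The cost side of the trade-off: inserting one new sale into the columnar
# layout means appending to every column list, which is why loads and updates
# are slower than appending a single row record.
columns["customer"].append("D")
columns["region"].append("US")
columns["amount"].append(50.0)
```

In a real columnar engine the per-column storage also compresses far better than mixed-type rows, which compounds the scan advantage on large tables.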
The next major issue was the need to handle data beyond tables of numbers to be added up. The rise of the internet meant that large e-commerce systems now had to store text descriptions and images of the products being sold, sometimes with video alongside. These datatypes were not well handled by traditional databases, and we saw the rise of new platforms for very large, distributed datasets. Hadoop, built around the HDFS distributed file system, and later Apache Spark, a distributed processing engine, are frameworks rather than databases, but traditional vendors needed to access these upstarts and combine their content with traditional structured, numeric content. The corporate data warehouse started to see "data lakes" of so-called big data files springing up alongside it. Established vendors have reacted by providing connectors to such file systems, and SQL interface layers have been developed on top of them so that programmers can use a language familiar to them. In reality, layering a SQL interface on a file system raises many issues, not least performance, so results from early data lake deployments have been mixed. Nonetheless the need to bridge the different data types remains.

Deployment of databases in the cloud rather than on-premise adds a further level of complication, and of opportunity. Internal IT departments now have to consider whether data should be deployed on-premise, in a public or private cloud, or in some hybrid arrangement, and analytic processing needs to run across this range of platforms. Greater use of in-memory processing has to an extent allowed vendors to deal with increasing demands, though the software needs to handle the transition of queries from in-memory to disk-based processing seamlessly.
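The "SQL layer over files" idea can be sketched in miniature with the Python standard library: raw CSV data is loaded into an in-memory SQLite table so it can be queried with familiar SQL. Real SQL-on-Hadoop engines such as Hive or Spark SQL are far more sophisticated, but the caveat the text raises is visible even here: the SQL layer must first read and interpret the underlying file before any query can run.

```python
# Minimal sketch: putting a SQL interface over flat-file data.
# The file contents and table name are invented for the example.

import csv
import io
import sqlite3

# Stand-in for a file sitting in a data lake.
raw_file = io.StringIO("product,clicks\nwidget,10\ngadget,25\nwidget,5\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (product TEXT, clicks INTEGER)")

# The expensive step real SQL-on-file engines also pay: parsing and loading
# the raw file before SQL can be applied to it.
reader = csv.DictReader(raw_file)
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(r["product"], int(r["clicks"])) for r in reader],
)

# Programmers can now use SQL even though the source was just a flat file.
result = dict(
    conn.execute("SELECT product, SUM(clicks) FROM events GROUP BY product")
)
```

The design point is that the familiarity of SQL is bought at the price of an extra interpretation layer, which is where much of the mixed performance experience with early data lakes came from.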
Recently there has been a trend towards pushing more complex types of processing into the database. Beyond adding up columns of numbers and calculating totals, many industries need to deal with time-series data or apply statistical processing to large volumes of data. Machine learning techniques add to the processing mix, and so we are now seeing traditional database vendors embed algorithmic processing engines within their databases in a bid to handle these new demands within reasonable timeframes. A number of smaller vendors have sprung up to offer alternatives where the traditional databases either struggle technically with such use cases or become unaffordable.
It is clear that, as data grows remorselessly in volume, the demands on data warehouses continue to mount, both for traditional numeric data and for more complex data types such as images and video. The recent desire to apply artificial intelligence and machine learning to data only adds to that pressure. Vendors large and small continue to innovate to try to keep pace with the increasingly complex processing burden demanded by customers.
The main vendors in the market are summarised in the diagram below.
The landscape diagram represents the market in three dimensions. The size of the bubble represents the customer base of the vendor, i.e. the number of corporations it has sold data warehouse software to, adjusted for deal size; the larger the bubble, the broader the customer base, though this is not to scale. The technology score is a weighted set of scores derived from: customer satisfaction as measured by a survey of reference customers, analyst impression of the technology, maturity of the technology in terms of its time in the market, and breadth of the technology in terms of its coverage against our functionality model. Market strength is a weighted set of scores derived from: data warehouse revenue, growth, financial strength, size of partner ecosystem, customer base (revenue adjusted) and geographic coverage. The Information Difference maintains vendor profiles that go into more detail. Customers are encouraged to assess their own specific requirements carefully rather than relying on high-level summaries such as the Landscape diagram.
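The mechanics of such a weighted composite score can be sketched in a few lines. The weights and factor scores below are illustrative assumptions only; the report does not publish its actual weightings.

```python
# Illustrative sketch of a weighted composite score, as used for the
# "technology" dimension. All numbers here are invented.

def weighted_score(scores, weights):
    """Combine per-factor scores (0-10) into one composite via given weights."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(scores[factor] * w for factor, w in weights.items())

# Hypothetical factor scores for one vendor, mirroring the four factors
# named in the text.
scores = {"satisfaction": 8.0, "analyst_view": 7.0, "maturity": 9.0, "breadth": 6.0}
weights = {"satisfaction": 0.4, "analyst_view": 0.2, "maturity": 0.2, "breadth": 0.2}

technology_score = weighted_score(scores, weights)
```

The same function applied with a different factor list and weight set would yield the market-strength score; the satisfaction weight is deliberately the largest here to reflect the text's statement that a significant part of the technology dimension is assigned to customer satisfaction.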
A significant part of the "technology" dimension scoring is assigned to customer satisfaction, as determined by a survey of vendor customers. In this annual research cycle the vendor with the happiest customers was Teradata. Our congratulations to them.
In the absence of sufficient completed references, a neutral score was assigned to this factor.
Below is a list of the significant data warehouse vendors.
|Vendor|Description|Website|
|---|---|---|
|1010 Data|Provides a column-oriented database and web-based data analysis platform.|www.1010data.com|
|Actian|Actian's product is an analytic database running on commodity hardware.|www.actian.com|
|Amazon Redshift|Cloud-based data warehouse solution.|www.aws.amazon.com/redshift/|
|Exasol|German data warehouse appliance vendor.|www.exasol.com|
|Greenplum|Appliance vendor aimed at high-end warehouses, now part of Pivotal, a subsidiary of EMC, itself acquired by Dell in 2016.|pivotal.io/big-data/pivotal-greenplum|
|HPCC|An open-source, massively parallel platform for big data processing, developed by LexisNexis Risk Solutions.|www.hpccsystems.com|
|IBM|InfoSphere Balanced Warehouse (formerly the DB2 Balanced Warehouse) is the data warehouse software offering from the industry giant, which also offers two appliances: PureData for Operational Analytics (based on DB2) and PureData for Analytics, powered by Netezza technology.|www.ibm.com|
|InfoBright|Provides a columnar-database analytics platform.|www.infobright.com|
|jSonar|Boston-based NoSQL data warehouse vendor.|www.jsonar.com|
|Kalido|Now part of Magnitude Software, Kalido is an application that automates building and maintaining data warehouses that adapt to change, running on various database platforms.|www.kalido.com|
|Kognitio|Mature data warehouse appliance vendor, also offering its data warehouse as a service.|www.kognitio.com|
|MarkLogic|Enterprise NoSQL database vendor.|www.marklogic.com|
|Microsoft|As well as its SQL Server relational database, Microsoft acquired DATAllegro and at the end of 2010 launched its Parallel Data Warehouse based on this technology.|www.microsoft.com|
|MonetDB|MonetDB is an open-source columnar database system for high-performance applications.|www.monetdb.cwi.nl|
|Neo4j|Open-source graph database.|www.neo4j.org|
|Oracle|Database and applications giant with its own data warehouse appliance.|www.oracle.com|
|ParStream|Columnar, in-memory, MPP database vendor aimed at analytic processing.|www.parstream.com|
|Pivotal|Owner of the Greenplum massively parallel data warehouse solution, now open source.|pivotal.io/big-data/pivotal-greenplum|
|Qubole|Markets the Qubole Data Service, which accelerates analytics workloads on data stored in cloud databases.|www.qubole.com|
|Sand|Focuses on allowing customers to retain massive amounts of compressed data in a near-line repository for extended periods.|www.sand.com|
|SAP/Sybase|Sybase was a pioneer in column-oriented analytic database technology, acquired in mid-2010 by giant SAP. SAP also offers the in-memory database technology HANA.|www.sap.com|
|SAS Institute|Comprehensive data warehouse technology from the largest privately owned software company in the world.|www.sas.com|
|Teradata|Database giant with its own data warehouse solutions.|www.teradata.com|
|Vertica|Appliance vendor Vertica was purchased by HP in 2011.|www.vertica.com|
|WhereScape|Not an appliance, but a framework for the development and support of data warehouses.|www.wherescape.com|
|XtremeData|US vendor providing a highly scalable cloud database platform.|www.xtremedata.com|