Several analytic frameworks have been announced in the last year. Among them are inexpensive data-warehousing solutions based on traditional Massively Parallel Processor (MPP) architectures (Redshift), systems which impose MPP-like execution engines on top of Hadoop (Impala, HAWQ), and systems which optimize MapReduce to improve performance on analytical workloads (Shark, Stinger/Tez). This benchmark provides quantitative and qualitative comparisons of five systems. It is entirely hosted on EC2 and can be reproduced directly from your computer.

Redshift - a hosted MPP database offered by Amazon.com based on the ParAccel data warehouse. We tested Redshift on HDDs.
Hive - a Hadoop-based data warehousing system. (v0.12)
Shark - a Hive-compatible SQL engine which runs on top of the Spark computing framework. (v0.8.1)
Impala - a Hive-compatible* SQL engine with its own MPP-like execution engine. (v1.2.3)
Stinger/Tez - Tez is a next-generation Hadoop execution engine currently in development. (v0.2.0)
This remains a work in progress and will evolve to include additional frameworks and new capabilities. We welcome contributions.

What this benchmark is not
This benchmark is not intended to provide a comprehensive overview of the tested platforms. We are aware that by choosing default configurations we have excluded many optimizations. The choice of a simple storage format, compressed SequenceFile, omits optimizations included in columnar formats such as ORCFile and Parquet. For now, we've targeted a simple comparison between these systems with the goal that the results are understandable and reproducible.

What is being evaluated?
This benchmark measures response time on a handful of relational queries: scans, aggregations, joins, and UDFs, across different data sizes. Keep in mind that these systems have very different sets of capabilities. MapReduce-like systems (Shark/Hive) target flexible and large-scale computation, supporting complex User Defined Functions (UDFs), tolerating failures, and scaling to thousands of nodes. Traditional MPP databases are strictly SQL-compliant and heavily optimized for relational queries. The workload here is simply one set of queries that most of these systems can complete.
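To make the four query classes concrete, here is a minimal sketch that times one query of each kind against an in-memory SQLite database. It is an illustration only, not the benchmark's actual workload: the tables `pages` and `visits`, their columns, the sample rows, and the `host_of` UDF are all hypothetical, and SQLite stands in for the systems under test.

```python
import sqlite3
import time

# Hypothetical toy schema and data; these are NOT the benchmark's tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages (url TEXT, page_rank INTEGER);
    CREATE TABLE visits (source_ip TEXT, dest_url TEXT, revenue REAL);
""")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [("http://a.example/x", 10),
     ("http://b.example/y", 200),
     ("http://a.example/z", 50)],
)
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [("10.0.0.1", "http://a.example/x", 1.5),
     ("10.0.0.1", "http://b.example/y", 2.0),
     ("10.0.0.2", "http://a.example/z", 0.5)],
)


def host_of(url):
    """Hypothetical UDF: return the host part of an http:// URL."""
    return url.split("/")[2]


# Register the Python function so SQL can call it as host_of(...).
conn.create_function("host_of", 1, host_of)

# One illustrative query per class named in the text.
QUERIES = {
    "scan": "SELECT url, page_rank FROM pages "
            "WHERE page_rank > 40 ORDER BY url",
    "aggregation": "SELECT source_ip, SUM(revenue) FROM visits "
                   "GROUP BY source_ip ORDER BY source_ip",
    "join": "SELECT v.source_ip, p.page_rank FROM visits v "
            "JOIN pages p ON v.dest_url = p.url ORDER BY p.page_rank",
    "udf": "SELECT host_of(url), COUNT(*) FROM pages "
           "GROUP BY host_of(url) ORDER BY 1",
}


def run_timed(sql):
    """Run a query to completion and return (rows, elapsed seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return rows, time.perf_counter() - start


results = {name: run_timed(sql) for name, sql in QUERIES.items()}
for name, (rows, seconds) in results.items():
    print(f"{name}: {len(rows)} rows in {seconds:.6f}s")
```

Response time here means wall-clock time from submitting the query until all rows are fetched. On data this small the timings are dominated by overhead, so only the shape of the four queries carries over to a real cluster.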