Here I recreate the analysis of part 2 of my series, A Picture is Worth 1000 Words, but with updated information for Redshift using the ra3.16xlarge instances with managed storage that were first announced in late 2019. The difference is eye-opening.
As a teaser, the high-order bit: the difference in total workload execution time (Q1..Q99):
|  | Redshift 30x dc2.8xlarge (GigaOm2019) | Redshift 16x ra3.16xlarge (rethinkio2020) |
| --- | --- | --- |
| TPC-DS whole workload (Q1..Q99) total execution time (lower is better) | 7143 s | 1269 s |
| Configuration | 30-node dc2.8xlarge | 16-node ra3.16xlarge |
| Schema* | unknown | best practice: replicated dimensions, sort-ordered facts |
| Maximum capacity | 78 TB | 1024 TB |
| Published on-demand hourly cost (current) | $144.00 ($4.80/node/hr) | $208.64 ($13.04/node/hr + $0.024/GB/month) |
| Whole Workload Execution Cost | $286 | $74 |
Table 1. Results Summary
* We note here that schema differences can have a significant effect on query performance and efficiency, so this is not strictly an apples-to-apples comparison. But it does amount to a roughly 5.5x difference in workload execution time compared with the original GigaOm2019 numbers, and we think it is reflective of achievable Redshift performance on the TPC-DS-like sf30000 scenario. We explore this further below.
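As a quick sanity check on the execution-cost row in Table 1, here is the back-of-the-envelope arithmetic, assuming compute is billed only for the duration of the run at the published on-demand rates:

$$
\begin{aligned}
\text{DC2 (30 nodes):}\quad & \$144.00/\text{hr} \times \frac{7143\ \text{s}}{3600\ \text{s/hr}} \approx \$286\\
\text{RA3 (16 nodes):}\quad & \$208.64/\text{hr} \times \frac{1269\ \text{s}}{3600\ \text{s/hr}} \approx \$74
\end{aligned}
$$

Managed storage for the RA3 run adds on the order of $0.024/GB-month × ~30,000 GB ≈ $720/month, or roughly $1/hour (assuming storage on the order of the 30 TB dataset size); amortized over the ~21-minute run that is only a few tens of cents, so it disappears in the rounding.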
A New (Architectural) Beginning
Caveats aside, these performance results drastically change the price-performance characteristics of the Redshift platform as depicted in the original GigaOm report. Note also that with RA3 instance types, the per-node capacity is now 64 TB of managed storage. This provides users with greatly increased flexibility in balancing capacity and performance requirements.
RA3 instances afford much greater flexibility in deploying configurations based on compute and IO performance requirements, without having to make significant compromises due to per-node storage constraints. Under the pure MPP model of the DC2, per-node storage was limited to 2.6 TB, a balance point which, as the quick capacity sketch below illustrates, would likely push a deployment into a choice between:
- a higher node count (increased cost) to support the additional capacity, or
- use of a legacy high capacity instance type (16TB HDD per node) with reduced performance
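As a quick illustration of that capacity math (a hypothetical 60 TB warehouse, ignoring compression and temp-space headroom):

$$
\left\lceil \frac{60\ \text{TB}}{2.6\ \text{TB/node}} \right\rceil = 24\ \text{dc2.8xlarge nodes}
\qquad \text{vs.} \qquad
\left\lceil \frac{60\ \text{TB}}{64\ \text{TB/node}} \right\rceil = 1\ \text{ra3.16xlarge node}
$$

On DC2, the 60 TB alone dictates at least 24 nodes whether or not the workload needs that much compute; on RA3, the node count can be chosen for compute and concurrency requirements instead.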
And the shared storage architecture is the foundation for additional improvements promised by the Advanced Query Accelerator (AQUA). Having worked extensively with distributed storage technologies and computational storage approaches, including some of the early uses of SmartSSD technology, I can tell you that the general data warehouse architecture is ripe with opportunities for optimization/acceleration in the storage layer – including offline vacuuming, data (re-)ordering, dynamic creation of additional indexes, tuning of bloom filters, offline garbage collection, and more.
The REDUX
For the redux, I begin with the original GigaOm data, replacing the published Redshift results with results from a 16-node ra3.16xlarge configuration to which I was provided access. I'll note that I haven't been able to get the code (schema/queries) used in GigaOm's original tests. The analysis below is based on the AWSLabs-published Cloud Data Warehouse Benchmark Derived from TPC-DS repo, which represents a best-practice schema for TPC-DS, consistent with efficient implementations of the dimensional model. This schema uses replicated dimension tables and sort-ordered fact tables, consistent with the optimizations one would make on a wide range of MPP-based DW deployments — appliance, cloud, or SQL-on-Hadoop. And I use a copy of the 30 TB dataset provided to me by the Redshift team in an S3 bucket, which I then loaded into the ra3.16xlarge cluster.
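To make "replicated dimensions, sort-ordered facts" concrete, here is a minimal sketch of the kind of Redshift DDL that schema implies. The table and column names are from TPC-DS, but the column lists are abbreviated and the specific key choices are illustrative rather than the AWSLabs repo's exact settings; the S3 path, IAM role, and file-format options in the COPY are placeholders.

```sql
-- Dimension table: small enough to replicate a full copy to every node
CREATE TABLE date_dim (
    d_date_sk INTEGER NOT NULL,
    d_date    DATE
    -- ... remaining TPC-DS columns elided ...
) DISTSTYLE ALL;

-- Fact table: hash-distributed on a join key, physically ordered by the date key
CREATE TABLE store_sales (
    ss_sold_date_sk INTEGER,
    ss_item_sk      INTEGER NOT NULL,
    ss_net_paid     DECIMAL(7,2)
    -- ... remaining TPC-DS columns elided ...
)
DISTKEY (ss_item_sk)
SORTKEY (ss_sold_date_sk);

-- Load from the S3 copy of the 30 TB dataset (bucket, prefix, role, and format are placeholders)
COPY store_sales
FROM 's3://my-tpcds-bucket/sf30000/store_sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
DELIMITER '|' GZIP;
```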
So, to be fully transparent, we’re looking at results:
- on a different platform: 16-node ra3.16xlarge vs 30-node dc2.8xlarge
- likely using different schema directives (distribution, sort keys)
- and possibly with slight differences in data and queries, even though both appear to use version 2.10 of the TPC-DS tools.
One last note on these potential differences before we dive into the numbers, again in the interest of transparency. Different seeds or chunking approaches using the dsdgen tool can generate data with slightly different value distributions, and can certainly affect the initial ordering of data due to parallel load operations. And different parameterizations of qgen can result in slightly different query filter parameterizations for a handful of queries.
But my experience has been that these differences tend to be negligible when compared with the impact of properly distributed, replicated, and ordered data in a dimensional schema. Data ordering in particular — whether done explicitly with CREATE TABLE ... AS SELECT (CTAS) and/or INSERT ... SELECT operations, or with the support of system-based "vacuum-like" tools — can significantly alter query plans and affect the amount of data physically scanned or network-distributed/replicated at query execution time. I'll look at the effect of these optimizations in a followup article, in an effort to tease apart the contributions of these conflated factors.
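For illustration, here is a sketch of those two ordering mechanisms in Redshift terms, reusing the table and key names from the earlier sketch. The deep-copy-and-swap pattern shown is one common approach, not necessarily what GigaOm or AWSLabs did.

```sql
-- Explicit re-ordering: rebuild the fact table as a "deep copy" with the
-- desired distribution and sort keys, then swap it in for the original
CREATE TABLE store_sales_sorted
DISTKEY (ss_item_sk)
SORTKEY (ss_sold_date_sk)
AS SELECT * FROM store_sales;

ALTER TABLE store_sales RENAME TO store_sales_unsorted;
ALTER TABLE store_sales_sorted RENAME TO store_sales;

-- System-based alternative: restore sort order in place and refresh statistics
VACUUM SORT ONLY store_sales;
ANALYZE store_sales;
```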
OK, appropriate disclosures out of the way, I’ll dive right into the results. Additional background on my approach is in the original article.
Visual Insights
I begin this time with the comparative results of Redshift, Snowflake, and Azure DataWarehouse (now Synapse); i.e., I’ve dropped the BQ results for the visual exploration.
As a reminder, the comparative system characteristics from the original GigaOm2019 report “Table 4: GigaOm Field Test Price-Performance @30TB ($ per Query, per Hour) Comparison” are shown in Table 2 — here alongside the Redshift ra3.16xlarge details.
| System | Hourly rate ($) |
| --- | --- |
| Redshift ra3.16xlarge (16-node) | 208.64 |
| Snowflake 3XL Enterprise (64-node) | 192.00 |
| Azure SQL Data Warehouse DW15000c (30-node) | 183.86 |
Table 2. Configuration Comparison
Again, we plot the cumulative execution time of the Q99 run on the y-axis (lower is better), with the x-axis simply ordered by query number.
The most obvious difference here is that the Redshift results are significantly better than both the Azure and Snowflake results. There is also an obvious difference in the execution of query 67.
Reordering the queries on the x-axis by average query execution time (short queries to the left, longer to the right), as was done in Figure 3 of the original article, reveals some new insights:
- Redshift significantly outperforms the other two platforms on roughly 90% of the shortest-running queries, which I believe is a nice validation of some of the short query acceleration (SQA) work done over the past 12-18 months.
- On the longest running queries, Redshift also seems to slightly outperform Azure, and significantly outperform Snowflake – with a big difference in query 67 also visible here.
So, we have a completely different story with this Redshift run compared with the original GigaOm DC2 results.
A note here is in order. Data distribution and ordering can make a big difference in query performance. In this comparison, we are looking at both a different instance type (16-node ra3.16xlarge vs. 30-node dc2.8xlarge) and likely a different schema and data ordering strategy on the Redshift system as well. Without additional information regarding the schema specifics for the GigaOm2019 report, all we can do at this point is highlight the difference, which is significant. Remember the Twain quote … "lies, damned lies, and benchmarks".
Ideally, one would like to apply similar best-practice approaches equally across the platforms. In fact, each platform has CREATE TABLE variants that specify data distribution and expected ordering, and automated or manual approaches to ensure proper ordering. In a followup article, we'll look to tease apart the instance type and schema variants that are conflated in this initial analysis.
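For illustration, the rough shape of those CREATE TABLE variants on the three platforms, using a drastically cut-down two-column version of the fact table; the key choices are again illustrative, not the configurations used in any of the benchmark runs.

```sql
-- Amazon Redshift: distribution and sort order declared on the table
CREATE TABLE store_sales (ss_sold_date_sk INT, ss_item_sk INT NOT NULL)
DISTKEY (ss_item_sk)
SORTKEY (ss_sold_date_sk);

-- Snowflake: a clustering key expresses the expected ordering, maintained
-- automatically in the background
CREATE TABLE store_sales (ss_sold_date_sk INT, ss_item_sk INT NOT NULL)
CLUSTER BY (ss_sold_date_sk);

-- Azure SQL Data Warehouse / Synapse: hash distribution plus an ordered
-- clustered columnstore index
CREATE TABLE store_sales (ss_sold_date_sk INT, ss_item_sk INT NOT NULL)
WITH (
    DISTRIBUTION = HASH (ss_item_sk),
    CLUSTERED COLUMNSTORE INDEX ORDER (ss_sold_date_sk)
);
```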
A/B Comparisons
Here we recreate the original A/B comparisons by simply plugging in the updated Redshift RA3 numbers. We now move to the chart form that combines cumulative execution time (line graphs, with secondary y-axis on the right) with individual query speedups in the bars. Note that we still compute our “directional” speedups as we did before to preserve the magnitude (a slowdown by 50% is turned into a “negative speedup” of 2.0).
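In formula terms, with $t_A(q)$ and $t_B(q)$ the runtimes of query $q$ on the two platforms being compared, the directional speedup described above amounts to:

$$
\text{speedup}(q) =
\begin{cases}
\dfrac{t_B(q)}{t_A(q)} & \text{if } t_B(q) \ge t_A(q) \\[8pt]
-\dfrac{t_A(q)}{t_B(q)} & \text{otherwise}
\end{cases}
$$

So a query that runs twice as fast on A shows as +2.0, and one that takes twice as long shows as -2.0 rather than 0.5, keeping the magnitudes comparable in both directions.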
Recall that here we order the queries by the difference between their runtimes on Redshift and Snowflake (the “delta”). Queries to the left are ones where Redshift (significantly) outperforms Snowflake; queries to the right are ones where the results are closer. But we note here that all the speedup bars on this chart are upward-facing, meaning that every Redshift query individually outperforms the corresponding Snowflake query. This is yet another indicator that we may be looking at less-than-optimal data layouts (distribution, ordering) in the original GigaOm tests — of at least some of the systems tested.
Now we’ll also look at an A/B comparison between Redshift and Azure DataWarehouse, this time using the RA3-based execution time ordering on the x-axis, i.e., the shortest running Redshift queries are to the left.
Here, we see strong speedups (5x or more) through the first 50% of the queries, with closer-to-parity speedups over the last 10% of the queries or so (minus Q67, as mentioned). Again, possible validation of Redshift’s SQA work; but also possibly due in part to schema and data ordering differences (replicated vs. non-replicated dimension tables, sorted vs. unsorted fact tables).
Additional Performance Metrics
We now present the tabular results from the latter part of the original article, updated with the Redshift RA3 results.
|  | SQL DW (GigaOm2019) | Redshift RA3 (rethinkio2020) | Snowflake (GigaOm2019) | BigQuery (GigaOm2019) |
| --- | --- | --- | --- | --- |
| Total Exe (s) | 2996 | 1269 | 5790 | 37283 |
| mean (s) | 30.3 | 12.8 | 58.5 | 376.6 |
| geomean (s) | 15.7 | 3.8 | 28.2 | 97.7 |
| median (s) | 15.5 | 3.1 | 31.7 | 81.1 |
Table 3. Whole workload runtime statistics, single stream (original article Table 1)

Query Throughput (QpH) and Price-performance (QpH-per-$)

We now look at the effect of these improved performance numbers on query throughput and price-performance, as originally presented in the GigaOm article. We note here that the RA3 configuration (based on the $13.04 per node on-demand cost, plus $0.024/GB/month managed storage) now shows a cost of $74 to run the Q99 workload, vs. $286 for GigaOm's DC2 implementation in the original analysis: a nearly four-fold improvement in price-performance and a corresponding 75% reduction in whole workload execution cost. Storage cost for the run is negligible when amortized over the duration of the run. The resulting price-performance comparison chart:
Response Time Distributions
The response time table now looks like:
| category | N queries | SQL DW | RA3 | Snowflake | BigQuery |
| --- | --- | --- | --- | --- | --- |
| interactive | 4 | 4.9 | 0.6 | 20.6 | 26.9 |
| reporting | 4 | 7.0 | 2.7 | 28.6 | 41.5 |
| analytic | 2 | 50.0 | 8.8 | 63.5 | 117.1 |
| complex | 4 | 65.5 | 54.3 | 181.2 | 1664.0 |
Table 4. Average execution time (s) cross-tab for Snowflake categorization (category X platform) (original article Table 3)
| category | N queries | SQL DW | RA3 | Snowflake | BigQuery |
| --- | --- | --- | --- | --- | --- |
| frequent | 8 | 18.5 | 6.5 | 32.8 | 259.0 |
| long-running | 2 | 62.9 | 71.6 | 240.6 | 2440.2 |
Table 5. Average execution time (s) cross-tab for GigaOm categorization (category X platform) (original article Table 4)
Conclusion
These results, representing a best-practices approach to data distribution and ordering on Redshift, drastically change the comparative performance picture painted in the original GigaOm Cloud Data Warehouse Performance Testing Report of 2019. The analysis also highlights the like-for-like comparison issue discussed earlier in this article, and raises questions about possible optimizations on the other platforms as well. Although we may not be able to completely eliminate the differences without similar schema-optimized runs on the other platforms, we can explore at least one of these avenues on the Redshift platform – looking at the performance difference between DC2 and RA3 implementations using the same optimized schema we used here, and then comparing the DC2 optimized schema performance with both the RA3 results, and with the original (unknown schema) GigaOm2019 results. That should allow us to tease apart the contributions of schema and platform to the 5.5x performance improvement.
This comparison of configuration variants on the same schema and query set is one of the well-defined use cases for A/B testing identified in my original write-up, and we will explore it in a subsequent article.