A Picture is Worth 1000 Words

This is Part 2 of a 5-part series on DW/Analytics Performance Benchmarking and its use in non-traditional ways, including DW/Analytics Engineering and Maintenance.

Part 2. A visual, whole-workload approach to benchmarking

Over the past ten years, I've had a number of roles that have required me to formalize and communicate performance-oriented metrics to a wide range of audiences -- from engineers and engineering managers, to marketing and C-suite executives, to board members and members of the investment community.  And I have learned -- often standing on the shoulders of others, or with their feedback -- that, by far, the most compelling and effective means of communicating the implications of those metrics is through creative, well-crafted visualizations.

A picture is truly worth 1000 words (and often far more).

It was true when I was working on SSD-enabled storage and big data solutions, and it is true with current Data Warehouse and Analytics approaches -- whether on-prem, Cloud, or even hybrid/multi-cloud (including federated approaches).  And effective performance analysis and visualization can do far more than serve as an evaluation/procurement vehicle.  If used properly, it can:

  • inform schema design tradeoffs
  • identify the most impactful areas for query optimization through rewrites and/or pre-aggregations
  • help with effective scheduling of various workloads to improve overall platform effectiveness
  • highlight issues in meeting SLAs (formal or informal) and guide remedies
  • understand and evaluate the potential impact of new analytic campaigns on current operations

A Quick Example

As an initial example of some of these uses, I want to look at the TPC-DS-like performance data initially published by GigaOm in their Cloud Data Warehouse Performance Testing Report of 2019.  In that paper, they present:

  • the individual query performance of all 103 queries in the TPC-DS spec against four selected Cloud Data Warehouse platform configurations:
    • Azure SQL Data Warehouse (now Synapse) DW15000C
    • Snowflake 3XL
    • Amazon Redshift 30-node dc2.8xlarge (we note the availability since then of newer RA3 nodes, with reportedly significant performance improvements)
    • Google BigQuery standard configuration / flat rate (which we believe to be a 500 "slot" configuration)
  • the "total execution" time of each of the platforms (summing up individual query times)
  • per query performance thumbnails of the implementations, a variant of a rank-based heat map presentation also used by Datamond
  • selected query performance comparisons for an "interesting" set of queries (though, like beauty, "interesting" is in the eye of the beholder)

To be clear, in this article I use the results as originally published by GigaOm to introduce a number of concepts and analytic approaches that I'll use throughout the rest of this series.  As discussed here, and in two related excursions, A Picture is Worth 1000 Words REDUX and Schema Matters: Data Distribution and Ordering, there are additional platform-specific schema optimization and data ordering considerations that can significantly affect performance results in an MPP Data Warehouse implementation. Without additional details about the original GigaOm implementation for each of the platforms, it is difficult to say if their original results are representative of "best practice deployments", the standard (unoptimized) schema distributed with the TPC-DS tools, or something in between. Nevertheless, I use those results here to illustrate the types of insights that can be gained through some fairly straightforward analyses and visualizations.

Visual Insights

To get started, we look at some simple but effective analytics and visualizations that can be used to augment our understanding of the four platforms for this benchmark - using the data that GigaOm generated and some creative organizations of that data.

We begin by extracting GigaOm's published results from Table 3 of the Report, building an Excel spreadsheet and CSV file with the per query execution times.

We note here that we don't have the actual DDL (schema definitions) or query text for the GigaOm tests.  The standard TPC-DS tools "kit" available from the TPC includes a basic schema with primary/foreign key constraints, and can generate the standard query set along with an approved set of "variants" to accommodate systems lacking ROLLUP or GROUPING SET support, or with limited correlated subquery support.  While those details are significant in a more thorough treatment, we simply don't have the information to address them here.  We plan a deeper dive into this topic, looking at the effect of different schema and query variants using the TPC-DS benchmark, perhaps in parallel with the rest of this series.

For this analysis, we compute cumulative execution times for each of the platforms, i.e. Q1, Q1+Q2, ..., Q1+Q2+...+Qn, and then plot the cumulative times of the four platforms.  We note that queries 14, 23, 24, and 39 each include two SELECT statements, for a total of 103 queries; we sum the two sub-queries for each of these and report against what we refer to as the Q99 set (the standard set of 99 TPC-DS query files).  A minimal sketch of this computation is shown below.
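The sketch below (Python/pandas) shows one way to assemble the cumulative series; the CSV file name and column layout are assumptions about how the transcribed GigaOm data might be organized, not part of the original report:

    # Assumed layout: one row per published result (103 rows, e.g. "q14a", "q14b", ...)
    # and one execution-time column per platform, in seconds.
    import pandas as pd
    import matplotlib.pyplot as plt

    platforms = ["SQL DW", "Redshift", "Snowflake", "BigQuery"]
    df = pd.read_csv("gigaom_tpcds_times.csv")   # columns: query, SQL DW, Redshift, Snowflake, BigQuery

    # Collapse the two-part queries (14, 23, 24, 39) into single Q99 entries by
    # stripping any trailing letter from the query label and summing the parts.
    df["qnum"] = df["query"].str.extract(r"(\d+)", expand=False).astype(int)
    q99 = df.groupby("qnum")[platforms].sum().sort_index()

    # Cumulative execution time in query-number order (Q1, Q1+Q2, ...).
    cumulative = q99.cumsum()
    cumulative.plot(title="Cumulative Q99 Execution Times")
    plt.xlabel("query number")
    plt.ylabel("cumulative execution time (s)")
    plt.show()

Plotting the cumulative execution times, we see the following: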

Figure 1. Cumulative Q99 Execution Times

In this view, we see the cumulative query execution times plotted across the set of queries, Q1, Q2, ... Q99.  Immediately, we see a number of interesting characteristics:

  • Google BigQuery is immediately seen as an "outlier" in the comparison
  • SQLDW, Snowflake, and Redshift all show relatively similar profiles
  • We can spot differences in how the various platforms handle some of the queries by the sections of the curves with the steeper slopes.  These are many of the longer running queries in the Q99 set, but we can see that they differ across platforms.  E.g.:
    • Redshift's performance on query72 appears to be significantly worse than both Snowflake and Azure
    • Azure looks to outperform both Redshift and Snowflake on query14 and query78

So this is a useful "quick glance" in itself.  For now, we'll dismiss BigQuery from the next few analyses, but its obvious outlier status also highlights an immediate need for greater understanding.

Why does BQ fare so poorly in this test?  Is it a general platform issue or a configuration issue?  Can it be explained by a subset of the queries (say the 5-10 marked by the steepest parts of the curve ... 14, 23, 64, 72, 95)?  What do these have in common?  Is it a benchmark implementation issue (e.g. maybe missing some schema directives), a BQ feature issue (e.g. lack of support for replicated tables), or a query optimization issue (e.g. poor join strategies)?

There are some additional insights I've been able to glean from a deeper dive informed by the above, but I'll table that treatment of BQ for now and concentrate on the other three platforms, which, without the BQ results, look like this:

Figure 2. Drop the Google BQ results

This now sheds some more light on the comparison between AzureDW, Redshift, and Snowflake based on the curve "discontinuities" (higher slopes, longer-running queries).  But there are further insights to be gleaned from this data.  In the charts above, queries were plotted in query-number order, Q1 .. Q99 (which may or may not be the actual order of execution in the test).  This ordering is, in effect, arbitrary.  So we now compute the "average" query execution time across the remaining three platforms and re-sort the data by that average, retaining the cumulative execution time view (a short sketch follows).  In essence, this nominally "sorts" short-running queries to the left of the graph and long-running queries to the right.
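Continuing from the q99 DataFrame in the earlier sketch (same assumed column names), the re-sorted cumulative view might be built like this:

    # Drop BigQuery as an outlier and order the remaining queries by their
    # average execution time across the three platforms.
    remaining = ["SQL DW", "Redshift", "Snowflake"]
    order = q99[remaining].mean(axis=1).sort_values().index

    # Recompute the cumulative view in that order: short-running queries on the
    # left, long-running queries on the right.
    sorted_cumulative = q99.loc[order, remaining].cumsum().reset_index(drop=True)
    sorted_cumulative.plot(title="Queries ordered by average execution time")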

Figure 3. Queries ordered by average execution time

And a number of additional insights emerge, similar to the manner in which appropriate clustering and sorting of 2-D heat maps provides additional visual cues:

  • All three implementations perform similarly on the shortest running ~ 1/3 of the query set, with Redshift and AzureDW outperforming Snowflake
  • Redshift slightly outperforms AzureDW for the first ~ 1/2 of the query set; its performance on the ~ 10 longest running queries is poor compared with both Snowflake and AzureDW.  This highlights the need for additional investigation.

Is there something in common across those queries (e.g. a large number of joins)?  Are these dimensional joins that aren't using replicated dimension tables?  Are the table statistics up to date?  Are we using the right columnstore indexes and/or fact table distribution?  A quick look at these issues, and at the EXPLAIN plans if we had them, would be in order.

  • Snowflake generally doesn't seem to perform as well on the second ~ 1/2 of the query set (a widening gap between it and the other two curves, apart from the Redshift anomalies described above) and has particular trouble with query78 and query23.  Other than that, it seems to "follow" AzureDW, but with the widening gap.  Some of the same questions arise again here.

Is there an issue with columnar sort/cluster keys or statistics?  Or does Snowflake just not "scale" as well for longer-running queries on larger configurations?  We note that a Snowflake 3XL is a 64-node configuration, and that costs like query startup, shutdown, and synchronization (during final aggregation and sort, or during distribution/exchange operations, for example) are governed by Amdahl's Law.  We also note that Snowflake doesn't have a REPLICATE directive for replicating the smaller dimension tables, meaning some of its joins need to do explicit broadcasts or data redistribution that the other two platforms may not be doing.

So this very simple act of reordering the queries in the cumulative, whole workload presentation gives us a lot of additional insight, and some clues as to where to look for explanations and/or possible optimizations.

A/B Comparisons

We're going to look at one more variant of this visualization that can provide additional insights, especially when doing A/B comparisons.  Here, we look at both Redshift vs. Snowflake and Azure vs. Snowflake head-to-head comparisons using the published TPC-DS data.  But it should be noted that this approach generalizes, and can be effectively applied to any number of relevant A/B investigations (comparing schema variants, planner variants or directives, configuration scaling, etc.).

Figure 4. Redshift vs. Snowflake A/B comparison

Here we order the queries by the difference between their runtime on Redshift and their runtime on Snowflake (the "delta").  Queries to the left are ones where Redshift outperforms Snowflake; queries to the right are ones where Snowflake (significantly) outperforms Redshift.  We can break the set of queries into groupings we generally label by color in these comparative views -- in this case as "RED" or "BLUE".  Here, the "RED" grouping is the set of queries where Redshift outperforms Snowflake; the "BLUE" grouping is where Snowflake outperforms Redshift, in this case by a significant amount.  The systems perform comparably in the middle region (very similar slope).  Further analysis can then focus attention on the "bookends" for a deeper dive into some of the possible differences described above (replicated dimension tables, network distribution overhead, join complexity/ordering, ...) -- in conjunction with additional system-provided execution metrics (CPU utilization, IO rates, network distribution, cache hit ratios, ...) and EXPLAIN plans.  A sketch of the delta ordering follows.
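Again using the assumed q99 DataFrame from the earlier sketches, the delta ordering behind Figure 4 might look like this:

    # Order queries by the per-query delta: large negative deltas (Redshift much
    # faster) land on the left; large positive deltas (Snowflake much faster) on
    # the right.
    ab = q99[["Redshift", "Snowflake"]].copy()
    ab["delta"] = ab["Redshift"] - ab["Snowflake"]
    ab = ab.sort_values("delta")

    ab[["Redshift", "Snowflake"]].cumsum().reset_index(drop=True).plot(
        title="Redshift vs. Snowflake, ordered by per-query delta")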

Figure 5. Snowflake vs. Azure A/B comparison

Here we order the queries by their execution time on Snowflake.  I.e., the queries that run the fastest on Snowflake are to the left; those that run longest on Snowflake are to the right.  Azure queries are correspondingly ordered.  We see here that the queries that run "well" on Snowflake also run well on Azure (the first ~ 40% of queries).  But the slower queries on Snowflake are significantly slower than they are on Azure.  So Azure appears to handle the more complex queries better in this test.  Again, questions remain re: schema specifics and data sorting, which could explain this difference.

Both of these charts also introduce "speedup" bars (a column graph), where "speedup" is the ratio of execution times for each individual query.  We do a little trick on the speedup values, negating and inverting speedups that are less than one, which both preserves the "scale" of the speedup/slowdown (2.0 vs. -2.0, rather than 2.0 vs. 0.5) and does a better job of highlighting speedups (bars pointing up) vs. slowdowns (bars pointing down).  Speedups near 1.0 (or -1.0) are effectively parity.  In this presentation, the cumulative execution time axis is moved to the right-hand side, and the left axis shows speedup values.  This makes it quite easy to pick off the queries that exhibit the most significant differences on the two platforms, focusing additional investigation on schema, data ordering, and query plans.
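A minimal sketch of this signed-speedup transform (the direction of the ratio is an assumption; the report doesn't spell it out):

    def signed_speedup(t_a: float, t_b: float) -> float:
        """Speedup of platform A over platform B for one query: t_b / t_a,
        with values below 1.0 negated and inverted so that a 2x slowdown is
        reported as -2.0 rather than 0.5."""
        ratio = t_b / t_a
        return ratio if ratio >= 1.0 else -1.0 / ratio

    # e.g. signed_speedup(10, 20) == 2.0 (A twice as fast); signed_speedup(20, 10) == -2.0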

We also note that Snowflake's higher "speedups" largely occur with the shortest running queries (to the left of the chart), where large speedups are far less impactful on overall execution time than they are for long running queries.  In contrast, speedups towards the right side of the chart are for longer running queries, so even relatively small speedups/slowdowns can lead to significant impact on overall performance, as we see here.

Additional Insights

There are a number of additional insights to be gleaned from some simple modifications / additions to the way benchmarking is often conducted.

Weighted Cumulative Execution Time and Geomean

Total execution time (summing the execution time of each of the Q99 queries) is probably the most frequently referenced metric in comparative benchmarking.

But total execution time (or mean query execution time) can be misleading, especially for workloads that may be skewed in terms of their use of short-running and long-running queries.  TPC-DS is such a case, where ~ 10 % of the queries account for about 50% of the total execution time.  In such a case, alternative metrics are often useful.

What you'd really like to use is a set of queries representative of your production workload -- with individual query runtimes weighted by the frequency at which those respective queries are run to determine whole workload execution time.  In cases where benchmarking workloads are constructed from in-house operational summaries or sampling, this can be done directly.  In other cases, some simple alternatives to total or mean query execution time may be useful.
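As a minimal sketch of that weighting (the query names, runtimes, and frequencies below are purely illustrative, not measured data):

    # Hypothetical per-query runtimes (s) and daily run frequencies.
    runtimes = {"dashboard_a": 4.2, "adhoc_b": 11.0, "monthly_rollup_c": 410.0}
    frequencies = {"dashboard_a": 500, "adhoc_b": 120, "monthly_rollup_c": 2}

    # Frequency-weighted whole-workload execution time, in seconds per day.
    weighted_total = sum(runtimes[q] * frequencies[q] for q in runtimes)
    print(f"weighted whole-workload runtime: {weighted_total:.0f} s/day")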

Geomean is a metric sometimes used to better characterize workloads with skewed execution time distributions, as discussed briefly in a Fivetran post on Y Combinator's Hacker News.  The median can also be useful.  For the data we've been looking at, the differences are:

Table 1. Whole workload runtime statistics (single stream)

                  SQL DW   Redshift   Snowflake   BigQuery
Total Exe (s)       2996       7143        5790      37283
mean (s)            30.3       72.2        58.5      376.6
geomean (s)         15.7       18.2        28.2       97.7
median (s)          15.5       16.5        31.7       81.1
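These statistics can be computed directly from the per-query data; a sketch, continuing with the assumed q99 DataFrame:

    import numpy as np

    stats = pd.DataFrame({
        "Total Exe (s)": q99.sum(),
        "mean (s)": q99.mean(),
        "geomean (s)": np.exp(np.log(q99).mean()),   # geometric mean
        "median (s)": q99.median(),
    }).T   # transpose so statistics are rows and platforms are columns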

While all results show some skew (median and geomean significantly less than the mean), Redshift's metrics again show that its mean (and total) execution times are being adversely affected by a few long-running queries: its mean is well above its median and geomean (roughly 4x, vs. roughly 2x for SQL DW and Snowflake).  The Redshift geomean and median are actually much closer to the SQLDW values here.

Query Throughput (QpH) and Price-performance (QpH-per-$)

It is straightforward to compute query throughput and price-performance metrics for these single-stream tests using the GigaOm published data, and in fact they have done that.  But there are additional insights to be gleaned from these metrics when examined in conjunction with query response times in concurrency scenarios.  We defer that to the next two installments of the series.

Response time Distributions

Although response time distributions are most useful in concurrency scenarios (where they represent the statistical effect of queueing theory), we introduce the concept here as a foundation for additional analysis, and find that they can be useful even in looking at single stream query execution times.

Here, we chart the 50th (median), 75th, 90th, 95th, and 99th percentile response times for each of the platforms.  Conceptually, this is done by lining up each platform's queries in increasing execution time and picking the response time below which 50%, 75%, 90%, 95%, or 99% of the queries fall.  In practical terms, this can be done using Excel's PERCENTILE function; we use PERCENTILE.INC here because of the relatively limited set of query results.
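In Python, pandas' quantile (whose default linear interpolation matches PERCENTILE.INC) gives the same profiles from the assumed q99 DataFrame:

    # Rows are percentiles, columns are platforms.
    profiles = q99.quantile([0.50, 0.75, 0.90, 0.95, 0.99])
    print(profiles.round(1))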

We note that this presentation also shows the relative strength of Redshift for the bulk of shorter queries.  As with any statistics (as per "lies, damned lies, and statistics"), one needs to be careful with response time metrics.  We only have 103 queries here (and are really only tracking the Q99 set), so the 99th percentile response can be strongly skewed by just one "slow" query.  We'll see a more effective use of response time distribution profiles in the next couple of parts in the series -- where we look at performance results across a much larger number of query iterations, and where we'll use the PERCENTILE.EXC function to eliminate outliers.

Table 2. Response time profiles

Response times (s):

percentile   SQL DW   Redshift   Snowflake   BigQuery
        50     15.5       16.5        31.7       81.1
        75     26.7       43.5        59.7      213.7
        90     59.0      124.6       100.7      897.4
        95     87.7      444.5       174.7     2002.1
        99    162.2      797.9       506.3     3618.4

Metrics by Category

Another useful tool is the presentation of summary metrics by query categories.  These could be different query types submitted by different user groups (dashboard reporting vs. ad hoc analytics), against different portions of the schema, or something as simple as a "binning" of queries by execution time.  Or they could represent various attributes of the query texts or query plans themselves (e.g. queries using different join strategies, queries targeting different fact tables, etc.)

One of the original papers on TPC-DS, "Why You Should Run TPC-DS: A Workload Analysis", introduced the notion of a query class categorization.  Unfortunately, the categorization (as far as I can tell) was never explicitly defined -- i.e., which of the 99/103 queries fall into which query class "groups" (there are more class counts than queries).  Their original breakdown shows:

Figure 6.  TPC-DS categorization from the original paper.

Other TPC-DS benchmarking writeups have provided their own categorizations; e.g., Snowflake's TPC-DS results writeup presents the following figure:

Figure 7.  TPC-DS query categorization

and the GigaOm report referenced here identifies a couple of query categories of interest:

  • long-running: 14a, 80
  • frequently run: 6, 29, 30, 39a, 44, 47, 93, 94

If we look at cross-tabs (contingency tables) for these two categorizations, and examine the statistics of each category, we see:

Table 3. Average execution time cross-tab for Snowflake categorization (category X platform)

category      N queries   SQL DW   Redshift   Snowflake   BigQuery
interactive           4      4.9        7.2        20.6       26.9
reporting             4      7.0        8.8        28.6       41.5
analytic              2     50.0       30.6        63.5      117.1
complex               4     65.5      220.3       181.2     1664.0

Table 4.  Average execution time cross-tab for GigaOm categorization (category X platform)

(Note: we use both subqueries for 14 and 39; the original paper included the first subquery only for each of these).

category       N queries   SQL DW   Redshift   Snowflake   BigQuery
frequent                8     18.5       31.4        32.8      259.0
long-running            2     62.9      447.5       240.6     2440.2
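A cross-tab like these can be produced in a few lines of pandas; the category assignments below are hypothetical placeholders (neither paper publishes a complete query-to-category mapping), and q99 is the assumed per-query DataFrame from the earlier sketches:

    # Map query numbers to categories (placeholder assignments only).
    categories = {1: "reporting", 6: "frequent", 14: "complex", 23: "complex"}  # ... etc.

    tagged = q99.copy()
    tagged["category"] = tagged.index.map(categories)
    tagged = tagged.dropna(subset=["category"])

    crosstab = tagged.groupby("category").mean()    # average execution time per cell
    crosstab.insert(0, "N queries", tagged.groupby("category").size())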

This provides insights similar to those above for this limited test set.  But again, in a more extensive benchmarking effort -- using, e.g., thousands of queries across different user groups hitting different schemas or portions of the schema -- there are likely to be additional insights from such treatments.  It is important, however, to remember that there is no "one" way to do this, as categories are effectively in the eye of the beholder.

Use of Additional Metrics and Statistics/ML

The categorization cross-tabs above are one example of common statistics-based approaches to further understanding.  Cross-tabs are useful with categorical attributes.

Additional insights may be available using continuous variable attributes, including those generally available in most DW deployments.  These include metrics like CPU utilization, IO (bytes read, written), network utilization, caching metrics, and possibly additional metrics available in the various query plan stages.  These can be useful for generating 2D and 3D scatter plots, with additional "dimensions" mapped to glyph type and size.  Still further analysis can be done using some of these metrics for regression analysis or other statistical learning techniques, or machine learning (ML) algorithms.  We'll explore some of these in a future article or series where we have access to this level of system statistics.

Additional Uses of A/B Testing

There are numerous uses of the above techniques in a range of A/B testing scenarios.  The approach can be used to compare, e.g.:

  • the impact of different schema directives (distribution, replication, sort/order, ...) on whole workload throughput or response time profiles;
  • product version comparison (i.e. product regression testing), comparing successive generations of a product and its effect on whole workload throughput and response time;
  • more detailed engineering design and implementation priorities and effects, including the impact of query optimizer changes or use of "hints", the use of higher fidelity statistics (more expensive to create/maintain), or effective use of bloom filters;
  • the impact of platform architecture changes (e.g., Redshift RA3 / shared storage and AQUA), the use of networked storage IO accelerators, and asynchronous "backend" data characterization and manipulation activities (background garbage collection, vacuuming, etc.)

And as we'll see in the next two installments of the series, they can be used to gain valuable insights into more complex "whole workload" scenarios -- including different query mixes, concurrency scaling, impact of background operations like LOADs and ELT, impact of system background operations (GC, vacuum, stats updates, ...), and even the effectiveness of horizontal scaling (e.g. with load balancing) across multiple warehouse instances.

