Schema Matters: Data Distribution and Ordering

In A Picture is Worth 1000 Words REDUX, I compared the performance of a 16-node ra3.16xlarge Redshift cluster to the 30-node dc2.8xlarge results originally published by GigaOm in 2019, the primary goal being to explore the attributes and capabilities of the new Redshift RA3 nodes against a published comparison point.  I soon discovered that a simple apples-to-apples comparison was not possible: my tests used a best-practice schema and data distribution/ordering, while the original GigaOm treatment did not publish its schema or data ordering details, so the two likely differ on exactly those points.

As a reminder, the fundamental differences as presented in the "REDUX" are (1) the node type; and (2) (presumably) the "key" attributes of the schema, notably the distribution keys and sort keys.  Those differences are summarized here:

Study                              Configuration          Schema                                                      Whole workload (Q1 .. Q99) execution time (lower is better)
GigaOm-published results (2019)    30-node dc2.8xlarge    unknown                                                     7143 s
rethinkio2020 "REDUX"              16-node ra3.16xlarge   best practice: replicated dimensions, sort-ordered facts    1269 s

This is a 5.5-fold performance improvement, due, as discussed previously, to a combination of platform and schema differences.

In this article, I use the same A/B comparison methodology I've been developing through this series to tease apart these two vectors of differentiation, giving us a best-practice implementation on both DC2 and RA3 for comparison with each other, and with the original GigaOm results.

But first, a note about the schema variant I used.

Standard TPC-DS Schema

The standard TPC-DS tools distribution provides a tpcds.sql file with the DDL for creating the TPC-DS tables.  Each table includes a PRIMARY KEY definition.  For example, the STORE_SALES fact table is defined as:

create table store_sales
(
    ss_sold_date_sk           integer                       ,
    ss_sold_time_sk           integer                       ,
    ss_item_sk                integer               not null,
    ss_customer_sk            integer                       ,
    ss_cdemo_sk               integer                       ,
    ss_hdemo_sk               integer                       ,
    ss_addr_sk                integer                       ,
    ss_store_sk               integer                       ,
    ss_promo_sk               integer                       ,
    ss_ticket_number          integer               not null,
    ss_quantity               integer                       ,
    ss_wholesale_cost         decimal(7,2)                  ,
    ss_list_price             decimal(7,2)                  ,
    ss_sales_price            decimal(7,2)                  ,
    ss_ext_discount_amt       decimal(7,2)                  ,
    ss_ext_sales_price        decimal(7,2)                  ,
    ss_ext_wholesale_cost     decimal(7,2)                  ,
    ss_ext_list_price         decimal(7,2)                  ,
    ss_ext_tax                decimal(7,2)                  ,
    ss_coupon_amt             decimal(7,2)                  ,
    ss_net_paid               decimal(7,2)                  ,
    ss_net_paid_inc_tax       decimal(7,2)                  ,
    ss_net_profit             decimal(7,2)                  ,
    primary key (ss_item_sk, ss_ticket_number)
);

Workload-optimized TPC-DS Schema

Most scalable Data Warehouse implementations these days include variants of the CREATE TABLE statement that afford additional control over the layout and organization of data.  In the case of Redshift, there are options to define compression encodings, distribution keys, and sort keys.  These options can have a significant effect on both storage efficiency (compression ratio) and query performance -- depending on the characteristics of the query and the data being queried -- and they can affect both the query plan and the runtime execution characteristics.  This is particularly true with schemas for well-structured dimensional data models, like TPC-DS.
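Once data is loaded, these choices are easy to inspect.  A minimal sketch, using Redshift's SVV_TABLE_INFO system view (the table names in the filter assume the TPC-DS schema used here):

-- Show distribution style, first sort key, whether columns are compressed,
-- table size (1 MB blocks), row skew across slices, and percent unsorted.
select "table", diststyle, sortkey1, encoded, size, skew_rows, unsorted
from svv_table_info
where "table" in ('store_sales', 'item', 'date_dim')
order by size desc;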

The AWS Labs team has published an optimized schema for TPC-DS that includes these options and is consistent with recommendations made in the new Amazon Redshift Advisor.  In this schema, the bulk of the (smaller) dimension tables are replicated (DISTSTYLE ALL).  The CATALOG_SALES/RETURNS and STORE_SALES/RETURNS tables are distributed (DISTKEY) on their *_ITEM_SK columns and sorted on their *_SOLD/RETURNED_DATE_SK columns.  The WEB_SALES and WEB_RETURNS tables are distributed on *_ORDER_NUMBER and sorted on *_SOLD/RETURNED_DATE_SK.  For the scale factor 30000 (30TB) test, the DDL is here.
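To make the pattern concrete, here is an abbreviated sketch of a replicated dimension and a distributed, sorted fact table (illustrative only -- not the exact AWS Labs DDL; column lists and compression encodings are elided):

-- Illustrative sketch; see the AWS Labs DDL for the full column lists and encodings.
create table item
(
    i_item_sk                 integer               not null,
    i_item_id                 char(16)              not null
    -- ... remaining ITEM columns ...
)
diststyle all;                 -- smaller dimension: replicate a full copy to every node

create table store_sales
(
    ss_sold_date_sk           integer                       ,
    ss_item_sk                integer               not null,
    ss_ticket_number          integer               not null
    -- ... remaining STORE_SALES columns ...
)
distkey (ss_item_sk)           -- co-locate fact rows (and item joins) by item
sortkey (ss_sold_date_sk);     -- order blocks by sale date for date-range pruning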

It should be noted that these options are generally not required -- it is certainly possible to run the TPC-DS workload with the default TPC-provided schema, though for larger datasets and more complex queries the system's ability to return results in reasonable time (or at all) may suffer.  Typically, however, one will have some insight into the relative sizes of tables, the cardinality of at least some of the "key" columns (e.g. the PRIMARY KEY), and a general idea of how the data could be distributed and/or sorted to aid parallel/distributed processing.  Additionally, Redshift includes a query plan review facility that integrates the Advisor hints, highlighting recommended optimizations for distribution and sort order.  I plan to write about these in future articles on performance tuning.
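In the meantime, EXPLAIN alone already shows how distribution choices surface in a plan -- for example, a DS_DIST_ALL_NONE or DS_DIST_NONE join step when the joined rows are already local to each node, versus DS_BCAST_INNER or DS_DIST_BOTH when data must be broadcast or redistributed.  A simple sketch (this ad-hoc query is mine, not one of the 99 TPC-DS queries):

-- Review the join and distribution strategy for a fact/dimension join.
explain
select d_year, sum(ss_ext_sales_price) as total_sales
from store_sales
join date_dim on ss_sold_date_sk = d_date_sk
group by d_year
order by d_year;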

DC2 vs. RA3

In this article, we want to understand how much of the performance improvement found in A Picture is Worth 1000 Words REDUX is due to the Redshift node type (DC2 vs. RA3), and how much is due to the best-practices approach to schema design on modern DW platforms.

We'll do this by first re-running the identical 99-query (Q1 .. Q99) workload against the same sf30000 (30TB) dataset we used in the REDUX article, this time on a 30-node dc2.8xlarge configuration (matching that of the original GigaOm article).

The results are summarized here:

Study                              Configuration          Schema           Whole workload execution (s)
GigaOm (2019)                      30-node dc2.8xlarge    unknown          7143
This study (rethinkio2020)         30-node dc2.8xlarge    best practice    1233
"REDUX" results (rethinkio2020)    16-node ra3.16xlarge   best practice    1269

The DC2 test with the same best-practice schema and data ordering provides nearly identical results to those of the RA3 configuration.  That is encouraging, as the near-2:1 ratio of DC2 to RA3 nodes here (30:16) is consistent with general Redshift sizing recommendations.  Recall that the ra3.16xlarge nodes have 50% more vCPUs (48 vs. 32), 57% more memory (384GB vs. 244GB), and a different IO architecture than the dc2 nodes -- including networking and networked storage efficiencies delivered by the AWS Nitro System.

Both configurations provide a roughly 5.5x speedup over the originally published GigaOm results.  In addition, the DC2 implementation provides a roughly 33% price-performance improvement over the RA3 configuration for deployments that do not need the large amount of flexible managed storage afforded by RA3.
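(For reference -- and assuming on-demand us-east-1 pricing of roughly $4.80/hr per dc2.8xlarge node and $13.04/hr per ra3.16xlarge node; your region and pricing model may differ -- the DC2 run works out to about 30 nodes x $4.80/hr x (1233/3600) hr ≈ $49, versus 16 nodes x $13.04/hr x (1269/3600) hr ≈ $74 for RA3: roughly 33% less spend for the same work, consistent with Figure 1.)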

Figure 1. Updated Whole Workload Execution Cost

It should be noted that the intent here is not to pit DC2 against RA3; those are two options with different price-performance and capacity "centers of mass".  Nor is the intent to cast DC2 as "legacy".  As this comparison points out, there are well-defined use cases for each, at their respective price points.  For warehouses that fit within the DC2 capacity envelope, that solution may be more cost effective.  For larger or growing datasets, the flexibility of RA3 managed storage may outweigh DC2's price advantage.  And the significant increase in available storage may enable additional optimizations throughout the business -- including support for multiple line-of-business applications, in-database ELT, schema evolution and query optimization tradeoffs, and development of new analytic approaches.

Visual Insights

We can now repeat some of the analyses we've done earlier to compare the DC2 and RA3 results.  First, let's show the schema-optimized DC2 and RA3 results vs. the original GigaOm numbers.  The schema-optimized results track well with each other and show the same 5.5x performance improvement over the originally published results.  Results in this chart are ordered by their execution time in the original GigaOm results -- i.e. the longer-running queries in that result set sort to the right of the chart.  With the added "bookend" we can plainly see a very large difference in the execution time of the longest-running queries from the initial GigaOm set (marching in from the right ... 72, 16, 14, 23, 67, 78, ...).  In fact, the twelve longest-running queries account for almost 5/6 of the difference between the GigaOm results and the optimized-schema results.
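That "bookend" calculation is straightforward to reproduce.  A minimal sketch, assuming the per-query timings have been loaded into a hypothetical table query_times(query_id, gigaom_s, dc2_s):

-- query_times is a hypothetical staging table of per-query timings, in seconds.
-- Rank queries by their original GigaOm execution time and compute each query's
-- running share of the total (GigaOm minus optimized-DC2) difference.
select query_id,
       gigaom_s - dc2_s as delta_s,
       sum((gigaom_s - dc2_s)::float) over (order by gigaom_s desc rows unbounded preceding)
         / sum(gigaom_s - dc2_s) over () as cumulative_share
from query_times
order by gigaom_s desc
limit 12;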

The DC2 and RA3 results track pretty well with one another.  Additional A/B analysis could be done to try to understand subtle differences in those results that might stem from the way different queries are affected by the difference in "balance" of the two deployments (30 vs. 16 nodes, 960 vs. 768 cores, differences in memory or NVMe I/O, ...).  But at this difference in execution time (less than 3%), we're likely to be splitting hairs.  We'll save additional A/B analyses of RA3 and DC2 for future comparisons (possibly other workloads) where the architectural differences may be more visible.

The similar whole workload execution profiles also translate into very similar overall performance metrics, response times, and categorized performance metrics as explored in the REDUX.

Conclusion

Here we find results similar to our analysis in A Picture is Worth 1000 Words REDUX.  These analyses demonstrate the value of proper distribution and ordering of data for efficient (and cost-effective) use of parallel data warehouse architectures.  We reran a TPC-DS-like workload on Redshift DC2 and RA3 configurations and demonstrated a 5.5x performance improvement over the results previously published by GigaOm.  We believe this improvement is due largely to differences in schema (distribution and sort keys) and data ordering between our schema-optimized tests and the original GigaOm tests.  Together these factors can have significant effects on both query plans (e.g. join strategies) and runtime optimizations (data skipping), resulting in the compelling performance improvements shown here.

 
