Benchmarking and Performance Engineering

Philosophy and Approach

For performance characterization and optimization projects, RethinkIO employs a Benchmarking and Performance Engineering Methodology based on the following key principles:

  • Development of a repeatable execution framework (scripts, tools) to support performance analysis across multiple system configurations and software revisions (including performance regression).
    • Shell scripting, Python, Perl, multi-threaded C/C++, SQL (including procedural/UDFs), Java/Scala, and others as required
  • Systematic collection of metrics, using standard tools and tabular formats, to support long-term retention in databases and detailed analysis and visualization in tools such as Excel and R
    • Where appropriate, coordinated collection of system/application metrics through facilities like collectd, Graphite, InfluxDB, Prometheus, and visualization through Grafana
    • Innovative visualizations to demonstrate scalability (especially under concurrent workloads) and “whole workload” characteristics
  • Leveraging (“wrapping”) standard benchmarking tools and approaches to streamline framework development and support cross-platform / competitive analyses; augmenting where necessary to illustrate key product strengths or explore performance issues (a minimal wrapper sketch follows this list)
    • fio, IOzone, vdbench, IOR, and SPEC SFS for file systems
    • TPC-DS, TPC-H, SSB and other standard database/analytics benchmarks
    • YCSB for NoSQL / KV benchmarks
    • Emerging object storage benchmarks (e.g., COSBench) for on-prem and cloud-based object storage systems (S3, Swift)
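
As a minimal sketch of the “wrap” approach, the following Python fragment drives a small fio parameter sweep and lands the headline results in a CSV for later analysis.  The device path, sweep parameters, and file names are illustrative assumptions, not taken from any particular engagement.

    #!/usr/bin/env python3
    # Illustrative sketch: "wrap" a standard benchmark (fio) so each run is
    # repeatable and its results land in a tabular format for later analysis.
    import csv, json, subprocess, datetime

    BLOCK_SIZES = ["4k", "64k", "1m"]     # illustrative sweep
    QUEUE_DEPTHS = [1, 8, 32]

    def run_fio(bs, iodepth):
        """Run one fio job and return its parsed JSON output."""
        cmd = [
            "fio", "--name=sweep", "--filename=/dev/nvme0n1",   # assumed test device
            "--rw=randread", "--direct=1", "--time_based", "--runtime=30",
            f"--bs={bs}", f"--iodepth={iodepth}",
            "--ioengine=libaio", "--output-format=json",
        ]
        return json.loads(subprocess.run(cmd, capture_output=True, check=True).stdout)

    with open("fio_results.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "bs", "iodepth", "read_iops", "read_bw_kib"])
        for bs in BLOCK_SIZES:
            for qd in QUEUE_DEPTHS:
                # JSON key layout varies somewhat by fio version
                job = run_fio(bs, qd)["jobs"][0]["read"]
                writer.writerow([datetime.datetime.now().isoformat(), bs, qd,
                                 job.get("iops"), job.get("bw")])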

Performance Engineering Fundamentals

We look at Performance Engineering Fundamentals based on the following model.  While initially used by RethinkIO to describe fundamentals associated with storage devices and systems (as here), it applies generally to any fixed system or application resource that may come under contention by multiple threads, processes, or users – CPU, memory, networking, or disk throughput/IOPS.  It is effectively the operational implication of queueing theory.

In this model, as a system (or system component) is presented with increasing load (x-axis), its response time (y-axis) increases.  There are three operational regions of interest:

  1. underutilized: the system resources are more than sufficient to meet the requested demand; response times are well within the target service level
  2. optimally provisioned: system resources are well matched to the requested demand; response times are nominally at or near the target service level and the system is effectively utilized
  3. oversubscribed: system resources are overwhelmed by demand that exceeds the system’s design point; response times climb beyond the target service level, and may grow without bound under sustained overload.

In storage devices and storage systems, response time represents the time to complete a physical IO request under various workloads (increasing IOPS or MB/s).  In database systems, response time would be query execution time (possibly to first row or to all result rows) in a multi-user/multi-query concurrency scenario.  And in object storage systems, response time would be time to first byte of an object, with object throughput (objects/s or MB/s) as the load metric, in a multi-client request workload.
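
As a back-of-the-envelope illustration only (not part of the model description above), the textbook M/M/1 approximation R = S / (1 - U) reproduces the shape of this curve.  The service time, target service level, and region thresholds in the sketch below are assumed values.

    # Illustrative only: a textbook M/M/1 queueing approximation of the
    # load-vs-response-time curve.  S is the per-request service time; as
    # utilization approaches 1.0 the response time grows without bound.
    SERVICE_TIME_MS = 0.5      # assumed per-request service time
    TARGET_SLA_MS = 3.0        # assumed response-time service level

    def mm1_response_time(utilization, service_time=SERVICE_TIME_MS):
        """R = S / (1 - U) for an M/M/1 queue; valid for 0 <= U < 1."""
        return service_time / (1.0 - utilization)

    for u in [0.2, 0.5, 0.8, 0.9, 0.95, 0.99]:
        r = mm1_response_time(u)
        region = ("underutilized" if r < 0.5 * TARGET_SLA_MS
                  else "optimally provisioned" if r <= TARGET_SLA_MS
                  else "oversubscribed")
        print(f"utilization={u:.2f}  response={r:6.2f} ms  ({region})")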

Examples

Below are some representative analyses and visualizations created in the past:

Database Workload Performance

The left chart is a “Whole Workload A/B comparison” view, augmented with individual query speedup information. In this case, the workload consists of running each query in a defined set (e.g. TPC-DS) serially, in succession.  The chart shows two comparative executions of the same workload on different systems.  The line graphs present the cumulative runtimes (all queries) of the two systems; the difference between the two line endpoints on the right-hand side is the difference in overall workload execution time.

The bars are the individual query “speedups” (values on the left Y-axis).  The queries (X-axis) are “sorted” by their runtime, so short-running queries are to the left of the chart and long-running queries to the right.  Big speedups to the left of the graph tend not to matter in overall workload execution time (as query runtimes are short) as much as big speedups to the right.
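
A minimal matplotlib sketch of this kind of chart is shown below, assuming per-query runtimes for the two systems are already available in a CSV; the file and column names (query_times.csv, baseline_s, candidate_s) are hypothetical.

    # Sketch of a "whole workload A/B comparison" chart: cumulative runtimes as
    # lines, per-query speedups as bars, queries sorted by baseline runtime.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("query_times.csv")              # hypothetical input file
    df = df.sort_values("baseline_s").reset_index(drop=True)
    df["speedup"] = df["baseline_s"] / df["candidate_s"]

    fig, ax_speedup = plt.subplots(figsize=(10, 4))
    ax_speedup.bar(df["query"], df["speedup"], color="lightgray", label="per-query speedup")
    ax_speedup.set_ylabel("speedup (x)")
    ax_speedup.tick_params(axis="x", rotation=90)
    ax_speedup.legend(loc="upper right")

    ax_cum = ax_speedup.twinx()                      # cumulative runtime lines
    ax_cum.plot(df["query"], df["baseline_s"].cumsum(), label="baseline (cumulative s)")
    ax_cum.plot(df["query"], df["candidate_s"].cumsum(), label="candidate (cumulative s)")
    ax_cum.set_ylabel("cumulative runtime (s)")
    ax_cum.legend(loc="upper left")
    fig.tight_layout()
    fig.savefig("whole_workload_ab.png")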

The right chart is a “Query Concurrency” view.  It shows the average execution time of each of 12 queries (the different lines) as query concurrency is increased from 1 to 12.  Depending on the resource constraints and the “appetite” of the query, the impact of concurrent execution may vary widely – from no impact (12 copies of the query run as fast as 1) to significant impact (the line with the pronounced “knee”).
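
One simple way to gather the data behind such a view is to run N concurrent copies of a query and average their execution times at each concurrency level.  The sketch below does this with a thread pool; run_query() is a hypothetical stand-in for a real client call.

    # Sketch of a "query concurrency" measurement: run N concurrent copies of a
    # query and record the average execution time at each concurrency level.
    import time
    from concurrent.futures import ThreadPoolExecutor
    from statistics import mean

    def run_query():
        """Placeholder: submit one query and block until it completes."""
        time.sleep(0.1)        # replace with a real client/driver call

    def timed_run():
        start = time.perf_counter()
        run_query()
        return time.perf_counter() - start

    for concurrency in range(1, 13):
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            durations = list(pool.map(lambda _: timed_run(), range(concurrency)))
        print(f"concurrency={concurrency:2d}  avg_exec_s={mean(durations):.3f}")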

The use of these analyses and visualizations is explored further in the series Benchmarking: It’s not just for breakfast anymore.

Latency Comparisons

The two charts above provide different views of Latency Distributions.  In this case, these were SSD response latencies, but the same methodology could be used to analyze the distribution of completion times (execution-time variability) of one or more queries run many times in an operational setting.  The chart on the left shows a histogram-like view of the latency of two different devices (x-axis on a log scale).  The chart on the right is an alternative view of the same data showing cumulative latencies.  In this case, the blue line shows better latency through 80% of the events, but then spikes, whereas the red line is more consistent over the full range of events.
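
Both views can be produced directly from raw latency samples.  The sketch below builds a log-scale histogram and a cumulative view with numpy/matplotlib; the input files (one latency sample in microseconds per line, per device) are hypothetical.

    # Sketch of the two latency views: a histogram on a log-scale x-axis and a
    # cumulative (CDF-style) view of the same samples.
    import numpy as np
    import matplotlib.pyplot as plt

    dev_a = np.loadtxt("device_a_latencies_us.txt")  # hypothetical input files
    dev_b = np.loadtxt("device_b_latencies_us.txt")

    fig, (ax_hist, ax_cdf) = plt.subplots(1, 2, figsize=(10, 4))

    bins = np.logspace(np.log10(min(dev_a.min(), dev_b.min())),
                       np.log10(max(dev_a.max(), dev_b.max())), 50)
    ax_hist.hist(dev_a, bins=bins, alpha=0.5, label="device A")
    ax_hist.hist(dev_b, bins=bins, alpha=0.5, label="device B")
    ax_hist.set_xscale("log")
    ax_hist.set_xlabel("latency (us)")
    ax_hist.legend()

    for data, label in [(dev_a, "device A"), (dev_b, "device B")]:
        xs = np.sort(data)
        ax_cdf.plot(xs, np.arange(1, len(xs) + 1) / len(xs), label=label)
    ax_cdf.set_xlabel("latency (us)")
    ax_cdf.set_ylabel("fraction of events completed")
    ax_cdf.legend()
    fig.tight_layout()
    fig.savefig("latency_distributions.png")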

Latency vs. Throughput Tradeoffs

As mentioned above, basic queueing theory leaves us with a Latency-vs-Throughput tradeoff in the face of fixed system resources.  That is illustrated in this chart, where the bars represent throughput of an SSD, and the lines represent various latency metrics (mean, 99th percentile, and 99.9th percentile).  A fixed workload is run at concurrencies of 1, 2, 4, … 512 to incrementally increase load on the system.  The result is well-behaved latencies up to a concurrency of about 256, where there is a latency “knee”.  Throughput (bars) also levels off at this point, as device saturation is approached.
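
When fio is the load generator, the throughput and tail-latency series behind a chart like this can be pulled from its JSON output.  The sketch below assumes one JSON result file per concurrency (iodepth) level; the file names are hypothetical, and the exact key names vary with fio version and percentile configuration.

    # Sketch: extract IOPS and latency percentiles from fio JSON output, one
    # file per concurrency (iodepth) level.
    import json

    for iodepth in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]:
        with open(f"randread_qd{iodepth}.json") as f:   # hypothetical file names
            read = json.load(f)["jobs"][0]["read"]
        clat_us = {k: v / 1000.0 for k, v in read["clat_ns"]["percentile"].items()}
        print(f"qd={iodepth:3d}  iops={read['iops']:10.0f}  "
              f"mean={read['clat_ns']['mean'] / 1000.0:8.1f} us  "
              f"p99={clat_us['99.000000']:8.1f} us  "
              f"p99.9={clat_us['99.900000']:8.1f} us")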

These and other analyses and visualizations are explored in the later installments of the series Benchmarking: It’s not just for breakfast anymore.

System Monitoring Infrastructure and Tools

In addition to analytics and visualization in Excel and R, we often use open source system monitoring tools to capture system resource information in conjunction with benchmark or application execution.  This often provides correlating information between application or benchmark performance and underlying system resource utilization – useful for identifying resource contention issues or for understanding how much capacity (“headroom”) a system may have for concurrent operations, mixed workloads, etc.

The charts above and the architecture diagram represent an implementation of Grafana (a metric charting package) using Graphite to accumulate system metrics (CPU, memory, network, and disk activity) gathered from collectd daemons on a multi-server software-defined storage deployment.  We have also deployed Prometheus and Riemann.

Similar facilities can be used to capture and archive application-level metrics, e.g. serving as a historical repository of benchmarking results for regression analysis.
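
For example, a benchmark harness can archive its own results alongside the collectd system metrics by writing to Graphite’s Carbon plaintext protocol (one “path value timestamp” line per sample, TCP port 2003 by default).  The host and metric names in the sketch below are illustrative.

    # Sketch: push an application/benchmark metric into Graphite via the
    # Carbon plaintext protocol.
    import socket
    import time

    GRAPHITE_HOST = ("graphite.example.local", 2003)   # assumed Carbon endpoint

    def send_metric(path, value, when=None):
        """Send one metric sample as a 'path value timestamp' line."""
        line = f"{path} {value} {int(when or time.time())}\n"
        with socket.create_connection(GRAPHITE_HOST, timeout=5) as sock:
            sock.sendall(line.encode("ascii"))

    # e.g. archive a benchmark result for later regression analysis
    send_metric("benchmarks.tpcds.q17.runtime_seconds", 42.7)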
