

Properly managing storage utilization is critical to performance and optimizing the cost of your Amazon Redshift cluster. We’ve talked before about how important it is to keep an eye on your disk-based queries, and in this post we’ll discuss in more detail the ways in which Amazon Redshift uses the disk when executing queries, and what this means for query performance.

Amazon Redshift uses storage in two ways during query execution:

  1. Disk-based Queries. When a query runs out of memory, the overflow “spills” to the disk and the query goes “disk-based”.
  2. Intermediate Storage. When a query needs to save the results of an intermediate operation, to use as input for a future operation.

Using excessive storage impacts your cluster because:

  1. Query performance suffers, because disk is 100 times slower than memory.
  2. Other queries will be slower, because increased I/O impacts the commit queue which is a shared resource.
  3. The cluster might hit 100% disk utilization, causing queries to fail or requiring the provisioning of more nodes.

The worst case is (3), when the cluster fills up. It can happen to all the nodes in a cluster at once, or start with just one node and then propagate through the entire cluster, as the image below shows.

[Chart: disk utilization by node]

So let’s look into what we can do to fix and prevent this type of situation. 

The WLM and Disk-Based Queries

If you’re not already familiar with how Redshift allocates memory for queries, you should first read through our article on configuring your WLM.

The gist is that Redshift allows you to set the amount of memory that every query should have available when it runs. This value is defined by allocating a percentage of memory to each WLM queue, which is then split evenly among the number of concurrency slots you define.  When a query executes, it is allocated the resulting amount of memory, regardless of whether it needs more (or less). Hence allocating too much memory is wasteful (since each node in the cluster obviously has finite memory), whereas allocating too little memory can cause queries to spill to disk.
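
If you want to see what this works out to on your own cluster, the WLM system tables expose the slot count and per-slot memory for each queue. Here’s a minimal sketch (column meanings per the Redshift system-table docs; verify the values against your own WLM configuration):

```sql
-- Inspect concurrency slots and per-slot working memory for each WLM queue.
-- Service classes 6 and above are typically the user-defined queues.
SELECT service_class,
       name,                 -- queue name
       num_query_tasks,      -- concurrency slots in the queue
       query_working_mem     -- working memory (MB) allocated to each slot, per node
FROM   stv_wlm_service_class_config
WHERE  service_class > 5
ORDER  BY service_class;
```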

Queries which overflow their allocated WLM memory are “disk-based”. These queries usually suffer from significantly degraded performance since disk I/O is orders of magnitude slower than memory I/O. 

There are six types of internal operations that Redshift can spill to disk when executing a query:

  1. Aggregations
  2. Hashing for joins
  3. Saving intermediate rows for future query steps
  4. Sorting
  5. Removing duplicates from intermediate or final results (unique)
  6. Window functions

If any of these operations are processing more rows (i.e. more bytes) than will fit into allocated memory, Redshift has to start swapping data out to disk, resulting in a significant slowdown of the query.
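
You can check whether this is happening on your cluster directly from the system tables. The sketch below (the time window and limit are arbitrary) pulls recent queries that had at least one disk-based step according to svl_query_summary:

```sql
-- Recent queries with at least one disk-based step, ranked by working memory.
SELECT q.query,
       TRIM(q.querytxt)  AS sql_text,
       SUM(s.workmem)    AS total_workmem_bytes
FROM   stl_query q
JOIN   svl_query_summary s ON s.query = q.query
WHERE  s.is_diskbased = 't'
AND    q.starttime > DATEADD(day, -1, GETDATE())
GROUP  BY q.query, q.querytxt
ORDER  BY total_workmem_bytes DESC
LIMIT  20;
```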

AWS recommends that you keep the percentage of disk-based queries to under 10%.  On our own fleet of clusters, we’re usually running well under one percent:

[Chart: percentage of disk-based queries across our fleet]

How to Prevent Queries From Going Disk-based

  1. Allocate more memory to the query. There are two approaches to this:

Within the intermix.io dashboard, viewing Recommendations for an individual query will surface the exact touched tables and how to update them, as well as how much memory the query used and the amount of memory capacity in the WLM queue.

[Screenshot: query memory usage and WLM queue capacity in the intermix.io dashboard]
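
One standard Redshift technique for giving a single heavy query more memory (shown here as an illustrative sketch; the slot count and the query itself are made up) is to let its session temporarily claim several WLM slots, so the query inherits their combined memory:

```sql
-- Temporarily claim 3 slots' worth of the queue's memory for this session.
SET wlm_query_slot_count TO 3;

-- Run the memory-hungry query here (illustrative table and columns).
SELECT product_id, COUNT(*) AS events
FROM   clickstream_events
GROUP  BY product_id;

-- Return to the default of one slot per query.
RESET wlm_query_slot_count;
```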

Intermediate Storage

Redshift also uses the disks in each node for another type of temporary query data called “Intermediate Storage”, which is conceptually unrelated to the temporary storage used when disk-based queries spill over their memory allocation.  

Intermediate Storage is used when Redshift saves the results of an intermediate operation to disk to use as input for a future operation. Intermediate Storage can become important if your query stores a large amount of data between query operations, since that storage may cause your cluster to run out of disk space. It also introduces additional I/O, which can lead to slower execution times.

Ways to Limit the Amount of Intermediate Storage Used

Since intermediate storage is used to carry results from one part of the query execution to another, the best way to reduce intermediate storage is to use predicates (e.g. WHERE clauses, JOIN … ON clauses, etc) on intermediate steps of your query (subqueries, CTEs, etc) to ensure that you are not carrying unnecessary data through your query processing.

For example, consider this query which joins on the results of two CTEs:
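
(The original snippet isn’t reproduced here, so the following is a hedged stand-in using hypothetical orders and customers tables. Both CTEs pull entire tables and all filtering happens after the join, so every row is carried through intermediate storage.)

```sql
-- Illustrative only: unfiltered CTEs force all rows through the join.
WITH all_orders AS (
    SELECT * FROM orders
),
all_customers AS (
    SELECT * FROM customers
)
SELECT c.customer_id,
       SUM(o.amount) AS total_spend
FROM   all_orders o
JOIN   all_customers c ON c.customer_id = o.customer_id
WHERE  o.order_date >= '2019-01-01'
AND    c.region = 'EMEA'
GROUP  BY c.customer_id;
```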

This query could be re-written as follows to limit the amount of data brought forth into the JOIN.
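
Continuing the hypothetical example above, pushing the predicates into the CTEs means far less data is materialized and carried into the join:

```sql
-- Same result, but each CTE is filtered (and trimmed to the needed columns)
-- before the join, reducing intermediate storage and I/O.
WITH recent_orders AS (
    SELECT customer_id, amount
    FROM   orders
    WHERE  order_date >= '2019-01-01'
),
emea_customers AS (
    SELECT customer_id
    FROM   customers
    WHERE  region = 'EMEA'
)
SELECT c.customer_id,
       SUM(o.amount) AS total_spend
FROM   recent_orders o
JOIN   emea_customers c ON c.customer_id = o.customer_id
GROUP  BY c.customer_id;
```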

We have a great new feature called Query Recommendations that proactively lets you know if your query is using a significant amount of intermediate storage.

[Screenshot: Query Recommendations]

Impact to Cluster Storage Utilization

The amount of intermediate storage a query uses is important when diagnosing spikes in cluster storage utilization.

For example, let’s assume you see your cluster storage spiking over some time period, but don’t see a corresponding increase in data transferred (via the COPY command) into your cluster. You could search for all queries which have a large Memory to Disk value to identify which queries contributed to your cluster running out of disk space.

Here’s a real-world example. The following chart shows the actual disk space used in a cluster, over a 2 week period, broken down by schema. This particular chart shows consistent storage utilization over time, with small variation.

[Chart: disk space used per schema over a two-week period]

The chart of % disk utilization tells a different story. On the same cluster, over the same period, the disk utilization hits 100% quite frequently. This is caused by some queries using an extraordinary amount of intermediate storage.

[Chart: % disk utilization over the same period]

One of the cool features we recently released, Cluster Recommendations, will surface queries with high disk utilization immediately. For this cluster, it appears that queries were using close to two terabytes of disk for intermediate results!

[Screenshot: Cluster Recommendations]

Conclusion

Monitoring both “Disk-based Queries” and “Intermediate Storage” is crucial to keeping your cluster healthy. Keeping on top of this temporary disk utilization prevents your Amazon Redshift disks from filling up due to misbehaved queries, which would result in queries being killed and your users being interrupted.

One of the things we hear most often from our customers is that data is becoming more and more critical to their business, necessitating Data SLAs.

Take for example one customer who builds hardware devices for consumers. They use their Amazon Redshift cluster to monitor and provide support for customers on the health of those devices (billions of events per day) and also to manage the complex supply chain for their overseas production lines.

This company relies on the data warehouse to run their business, and if the data pipelines don’t complete in time, or queries are slow or fail – the business loses revenue and customer satisfaction plummets.

We see this trend across all of our customers. In response – businesses are beginning to demand service level agreements (SLAs) for queries and reports built on top of the data warehouse. These SLAs have a number of different service level indicators – the most prominent are:

In fact, we have one customer that demands a 10 second SLA for any query that is executed through their BI reporting tool. Analysts get fed up with the “spinning wheel of death” while waiting for a dashboard to load or a query to complete.

Measuring & Improving Data SLAs for Query Latency

Measuring service levels for query latency is trickier than it seems. To understand why, consider the sheer scale of workloads that a typical data warehouse handles. Our average customer runs over one million queries per week on their Amazon Redshift cluster. These queries span three categories of applications.

At the same time, each category of workload is growing.

So a single SLA across all the queries is meaningless. What you really need is to measure SLAs on a very granular basis – for each data application, down to the individual user level. For a real-world example, consider a company using Looker as their BI application. How do you easily filter out all the noise to see only the Looker queries? How do you monitor the performance of a single model?

Measuring SLAs is hard, but it is only the beginning of the journey. Assuming you are able to measure the SLA for each of your data applications, the next step is to improve it by finding and fixing the issues impacting your workloads and users.

Introducing the Cluster Health Dashboard

The Cluster Health Dashboard helps data teams measure and improve SLAs. It does this by surfacing:

  1. SLA measures for any connected app, down to individual users
  2. Proactive Recommendations to improve performance and optimize costs
  3. The top 5 slowest queries, with quick links to Query Optimization Recommendations
[Screenshot: Cluster Health Dashboard]

Measure Data SLAs for Any App or User

To see latency distribution for any connected app (see here for a list of supported apps), simply modify the filter at the top-right of the screen.

Query latency follows a power-law distribution: most queries in the warehouse run quickly, with a long tail of slow queries.

Measuring query latency is best done by looking at different points across that distribution – the most common being the P95 and P99 latencies. These are shown prominently at the top of the page.
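
If you want to approximate these percentiles straight from Redshift’s system tables (intermix.io computes them per app and user for you), a rough sketch looks like this; stl_query only retains a few days of history, so the window is kept short:

```sql
-- Approximate P95 and P99 query latency per user over the last day.
SELECT userid,          -- join userid to pg_user separately to resolve user names
       APPROXIMATE PERCENTILE_DISC(0.95)
         WITHIN GROUP (ORDER BY DATEDIFF(millisecond, starttime, endtime)) AS p95_ms,
       APPROXIMATE PERCENTILE_DISC(0.99)
         WITHIN GROUP (ORDER BY DATEDIFF(millisecond, starttime, endtime)) AS p99_ms
FROM   stl_query
WHERE  starttime > DATEADD(day, -1, GETDATE())
GROUP  BY userid
ORDER  BY p99_ms DESC;
```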

Further, we show you a distribution of latencies across the entire cluster, grouped by user (by default). Click on the chart to quickly go to the queries on the Discover page.

Example of SLA metrics for “Looker PDT” queries.

[Chart: latency distribution for Looker PDT queries]

Cluster Recommendations

Our team has significant experience running Amazon Redshift at scale. In fact, we run a very large fleet of Amazon Redshift clusters as part of our product architecture. Based on this experience, we surface observations and recommendations on how to optimize the cost and performance of the cluster. The Recommendations cover things like:

[Screenshot: Cluster Recommendations in the intermix.io dashboard]

Find & Fix Slow Queries

The “Top 5 Longest Running Queries” list defaults to showing you the slowest queries in your cluster over the past day.

[Screenshot: Top 5 Longest Running Queries]

Click on any of these queries to go to a detail page showing recommendations for optimizing the query.

[Screenshot: query detail with optimization recommendations]

Next, let’s look at a query optimization that dramatically reduces runtime for queries which use window functions.

The Simplified Problem

A common problem we see our customers solving in SQL is how to compare two rows per group in a dataset.  For example, say you have the following data in the product_prices table:

| Product  | updated_on          | Price |
|----------|---------------------|-------|
| widget_A | 2019-07-03 12:00:00 | $105  |
| widget_A | 2019-07-03 11:00:00 | $110  |
| widget_A | 2019-07-03 10:00:00 | $100  |
| widget_B | 2019-07-02 06:00:00 | $255  |
| widget_B | 2019-07-02 05:00:00 | $235  |

How do you efficiently SELECT the price change at each updated_on time for each Product?  I.e. we’d like to see a result set of:

| Product  | Most recent updated time | Price change |
|----------|--------------------------|--------------|
| widget_A | 2019-07-03 12:00:00      | -$5          |
| widget_A | 2019-07-03 11:00:00      | $10          |
| widget_B | 2019-07-02 06:00:00      | $20          |

This type of problem arises in a lot of domains.  Another common example of this problem occurs when you want to compare the time difference between subsequent events. For example, you may have a table that logs timestamps for each page visit of a user to your website and you want to do an analysis on the time difference between visits.

A common but sub-optimal way we see customers solve this problem is by using the ROW_NUMBER() window function together with a self join.  The algorithm is straightforward: first select all your product prices and order them within each product by updated_on using the ROW_NUMBER() window function.  Then self join on the table, mapping each row to the row preceding it.  In SQL:
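
(The post’s original snippet isn’t shown here; the sketch below reconstructs the pattern against the product_prices table, using the ordered_prices CTE and row_number column that the discussion below refers to.)

```sql
-- Label each price per product with ROW_NUMBER(), newest first, then
-- self-join each row to the row that precedes it in time.
WITH ordered_prices AS (
    SELECT product,
           updated_on,
           price,
           ROW_NUMBER() OVER (PARTITION BY product
                              ORDER BY updated_on DESC) AS row_number
    FROM   product_prices
)
SELECT curr.product,
       curr.updated_on         AS most_recent_updated_time,
       curr.price - prev.price AS price_change
FROM   ordered_prices curr
JOIN   ordered_prices prev
  ON   prev.product    = curr.product
 AND   prev.row_number = curr.row_number + 1
ORDER  BY curr.product, curr.updated_on DESC;
```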

This query produces the desired result set, but at the cost of a join.  For real-world (i.e. less contrived) examples, solutions like this also often need to incorporate additional sub-selects or other expensive operations which compound the effect of the join and can make the query run excruciatingly slow.

This class of problem can manifest itself in other ways as well: sometimes instead of a self join, the query may just select from ordered_prices where row_number=1 to get the most recent updated_on and price.  But the general pattern is the same: using ROW_NUMBER() to label rows for further processing/joining.

A Better Solution

If you’re using ROW_NUMBER() in a query just to apply filters or joins later on in your transformation, there may be a better way.  Rather than co-opting ROW_NUMBER() to compare neighboring rows, a better solution is to apply a different window function that was designed to solve this problem:  LAG().  The LAG window function returns the value from a row offset from the current row.  The resulting SQL becomes:
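
(The post’s snippet isn’t reproduced here; this is a minimal sketch against the same product_prices table.)

```sql
-- For each product, order rows by updated_on and subtract the previous
-- row's price from the current row's price.
SELECT product,
       updated_on,
       price - LAG(price) OVER (PARTITION BY product
                                ORDER BY updated_on) AS price_change
FROM   product_prices;
```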

This SQL says that we should order the rows within each product by updated_on, and take the difference between the price in the current row and the price in the previous row within that group. This eliminates the need for the join and can greatly simplify the query plan. One note: this result set will include NULL rows at the start of each product (i.e. where there is no previous row), but these can be trivially eliminated with a WHERE clause of price_change is not null when the results are used (or this query can be wrapped in a subselect to remove those rows).
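
For completeness, a sketch of that wrapped form (same hypothetical snippet as above):

```sql
-- Wrap the LAG query and drop the leading NULL row for each product.
SELECT product, updated_on, price_change
FROM (
    SELECT product,
           updated_on,
           price - LAG(price) OVER (PARTITION BY product
                                    ORDER BY updated_on) AS price_change
    FROM   product_prices
) AS diffs
WHERE  price_change IS NOT NULL
ORDER  BY product, updated_on DESC;
```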

A Real-World Example

The seed for this blog post was planted when a potential customer came to us with the same question we hear all the time:  “Why are our queries slow?”  After instrumenting their cluster with intermix.io, they were able to quickly identify their worst queries and find out why they were slow and what to do about them.  In their case, one of the worst offending queries used this ROW_NUMBER() pattern and triggered a large number of Query Recommendations in their intermix.io Dashboard, including:

[Screenshot: Query Recommendations block from intermix.io]

Since the query exhibited the ROW_NUMBER() pattern above, it caused multiple unnecessary table scans, an expensive join, significant IO skew, and a large amount of intermediate data to be written to disk during execution.  The query operated over a single table which had 876 million rows–not a huge table, but large enough to bring Amazon Redshift to its knees if not optimized properly–and was taking over 5.5 hours to execute!

After reviewing our Recommendations, they reimplemented their query using the LAG function to eliminate the join (which greatly simplified their query by removing two additional sub-selects), and the query execution time dropped to 30 seconds.  From 5.5 hours to 30 seconds simply by reviewing targeted Query Recommendations and implementing a straightforward solution–talk about a good ROI on time spent optimizing SQL.  Their exact feedback after seeing the new query runtimes? “OMFG. takes 30s. Wow”

Lessons Learned

Amazon Redshift is a fantastic general-purpose data warehouse. But since it is so easy to add data and new workloads to your cluster, it can be very hard to know where and how to start optimizing SQL queries. 

Analysts and data engineers often wonder which queries are the worst offenders and where limited SQL tuning effort should go to give the biggest bang-for-the-buck. Knowing where to put query tuning effort is crucial since seemingly simple-looking queries can sometimes have inefficiencies lurking which can compound to slow a query by several orders of magnitude. 

The good news: this is the problem that intermix.io Query Recommendations was designed to solve. We help you pinpoint the root cause of query slowness so that you can spend your limited time optimizing the right queries in the right way.

Summary

The Amazon Redshift architecture allows you to scale up by adding more nodes to a cluster, which can lead to over-provisioning nodes just to handle peak query volume. Unlike adding nodes, Concurrency Scaling adds more query processing power on an as-needed basis.

Concurrency Scaling for Amazon Redshift gives clusters additional capacity to handle bursts in query load. It works by off-loading queries to new, “parallel” clusters in the background. Queries are routed based on your WLM configuration and rules.

Pricing for Concurrency Scaling is based on a credit model, with a free usage tier and beyond that a charge based on the time that a Concurrency Scaling cluster runs queries.

We tested Concurrency Scaling for one of our internal clusters. In this post, we present the results of our usage of the feature and information on how to get started.


Cluster Requirements

To be eligible for concurrency scaling, an Amazon Redshift cluster must meet the following three requirements:

Eligible Query Types

Concurrency scaling does not work on all query types. For the first release, it handles read-only queries that meet three conditions:

To be routed to a Concurrency Scaling cluster, a query needs to encounter queueing. Also, queries eligible for the SQA (Short Query Acceleration) queue will not run on Concurrency Scaling clusters.

Queuing and SQA are a function of a proper set-up of Redshift’s workload management (WLM). We recommend first optimizing your WLM because it will reduce the need for Concurrency Scaling. And that matters because Concurrency Scaling comes free only up to a certain number of hours. In fact, AWS claims that Concurrency Scaling will be free for 97% of customers, which takes us to pricing.

We’ve also tested enabling Redshift’s automatic WLM and captured our experience with it in this blog post: “Should I Enable Amazon Redshift’s Automatic WLM?”


Concurrency Scaling Pricing

For Concurrency Scaling, AWS has introduced a credit model. Each active Amazon Redshift cluster earns credits on an hourly basis, up to one hour of free Concurrency Scaling credits per day.

You only pay when your use of the Concurrency Scaling clusters exceeds the credits you’ve accrued.

The fee equals the per-second on-demand rate for a transient cluster used in excess of the free credits – charged only while it’s serving your queries – with a one-minute minimum charge each time a Concurrency Scaling cluster is activated. The per-second on-demand rate is based on general Amazon Redshift pricing, i.e. it depends on the node type and number of nodes in your cluster.
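
To make that concrete with purely illustrative numbers: assume a 4-node cluster at a hypothetical on-demand rate of $4.80 per node-hour. The per-second rate is $4.80 / 3,600 ≈ $0.0013 per node. If a Concurrency Scaling cluster serves queries for 10 minutes (600 seconds) beyond your free credits, the charge would be roughly 4 × $0.0013 × 600 ≈ $3.20. Check the current Amazon Redshift pricing page for your node type’s actual rate.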

Enabling Concurrency Scaling

Concurrency scaling is enabled on a per-WLM queue basis. Go to the AWS Redshift Console and click on “Workload Management” from the left-side navigation menu. Select your cluster’s WLM parameter group from the subsequent pull-down menu.

You should see a new column called “Concurrency Scaling Mode” next to each queue. The default is ‘off’. Click ‘Edit’ and you’ll be able to modify the settings for each queue.

Configuration

Concurrency scaling works by routing eligible queries to new, dedicated clusters. The new clusters have the same size (node type and number) as the main cluster.  

The number of clusters used for concurrency scaling defaults to one (1), with the option to configure up to ten (10) total clusters.

The total number of clusters that should be used for concurrency scaling can be set by the parameter max_concurrency_scaling_clusters.  Increasing the value of this parameter provisions additional standby clusters.

Monitoring

There are a few additional charts in the AWS Redshift console. There is a chart called “Max Configured Concurrency Scaling Clusters” which plots the value of max_concurrency_scaling_clusters over time.

The number of Active Scaling clusters is also shown in the UI under Concurrency Scaling Activity:

The Queries tab in the UI also has a column to show if the query ran on the Main cluster or on the Concurrency Scaling cluster:

Whether a particular query ran on the main cluster or via a Concurrency Scaling cluster is stored in stl_query.concurrency_scaling_status.  

A value of 1 means the query ran on a Concurrency Scaling cluster, and other values mean it ran on the main cluster.

Example:
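
(The original example isn’t shown; here’s a sketch that summarizes where queries ran over the last day.)

```sql
-- Count queries by where they ran: 1 = Concurrency Scaling cluster,
-- anything else = main cluster.
SELECT CASE WHEN concurrency_scaling_status = 1
            THEN 'concurrency scaling cluster'
            ELSE 'main cluster'
       END      AS ran_on,
       COUNT(*) AS query_count
FROM   stl_query
WHERE  starttime > DATEADD(day, -1, GETDATE())
GROUP  BY 1
ORDER  BY query_count DESC;
```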


Concurrency Scaling info is also stored in some other tables/views, e.g. SVCS_CONCURRENCY_SCALING_USAGE.  There’s a list of catalog tables that store concurrency scaling information.


Our Results

We enabled Concurrency Scaling for a single queue on an internal cluster at approximately 2019-03-29 18:30:00 GMT. We changed the max_concurrency_scaling_clusters parameter to 3 at approximately 2019-03-29 20:30:00.

To simulate query queuing, we lowered the number of slots for the queue from 15 to 5.

Below is a chart from the intermix.io dashboard, showing the running versus queuing queries for this queue, after cranking down the number of slots.

We observed that the queueing time for queries went up, peaking at just over 5 minutes.

Here’s the corresponding summary in the AWS console of what happened during that time:


Redshift spun up three (3) concurrency scaling clusters as requested. It appears that these clusters were not fully utilized, even though our cluster had many queries that were queuing.

The usage chart correlates closely with the scaling activity chart:

After a few hours, we checked and it looked like 6 queries ran with concurrency scaling.  We also spot-checked two queries against the UI. We haven’t checked how this value may be used if multiple concurrency clusters are active.

Conclusions

Concurrency Scaling may mitigate queue times during bursts in queries.

From this basic test, it appears that a portion of our query load improved as a result. However, simply enabling Concurrency Scaling didn’t fix all of our concurrency problems.  The limited impact is likely due to the limitations on the types of queries that can use Concurrency Scaling. For example, we have a lot of tables with interleaved sort keys, and much of our workload is writes.

While concurrency scaling doesn’t appear to be a silver bullet solution for WLM tuning in all cases, using the feature is easy and transparent.

We do recommend enabling the feature on your WLM queues. Start with a single concurrency cluster, and monitor the peak load via the console to determine whether the new clusters are being fully utilized.

As AWS adds support for additional query/table types, Concurrency Scaling should become more and more effective.

The number one driver of change for our customers is that they are experiencing huge growth in data and query volume.

We observe three main drivers behind that growth:

  1. Growth in the number of data sources connected to a Redshift cluster and the volume of data coming from those sources. We observe data volumes growing by a factor of roughly 2x every year, which compounds to more than 30x over 5 years.
  2. Growth in the number of workflows that are running once data resides inside a cluster. The reason is simple – the more data and sources you have, the more ways you have to join and combine that data.
  3. Growth in the number of data consumers that want more data in more combinations at a higher frequency. Much of that is driven by how easy dashboard tools make it to explore data.  

When building data pipelines with Amazon Redshift, that growth makes it quite challenging to monitor and detect anomalies across all of the queries running on the system. For example, a lot of our customers use Looker as their tool for data exploration, and a frequent issue is slow Looker dashboards.

One way of approaching that problem is to use the best practices for setting up your workload management in Amazon Redshift the right way. You get a more granular view of your users and the resources they use, and the WLM gives you the power to isolate them from each other.

But even after optimizing your WLM configuration, you’re still lacking critical insights, for example:

Until now it’s been quite hard to surface this information in a single place. So we built a way to visualize that information in a more intuitive and powerful way. It’s a new feature that makes your life as a data engineer easier when working with Amazon Redshift.

It’s called “Query Insights” and it’s available immediately in your dashboard.

Check out some examples of using Query Insights.

Detect a huge increase in query volume

This example shows how you can use Query Insights to quickly identify that a huge spike in query volume happened and the specific queries that caused it.

  1. Notice that the query count spikes at 8:59pm on Jan 24
  2. Group by the default ‘Query Group’ attribute to segment the chart by query group
  3. Click on the chart to reveal a slide-in list of query groups in descending order of count
  4. Click the “view queries” action to jump to the query details page and view the individual queries
[Screenshot: query count spike and query groups in the intermix.io dashboard]


Identify which query group is causing a batch pipeline to slow down

This example shows how you can use Query Insights to find the cause of an overall increase in latency.

  1. Detect that there is an overall increase in latency and start investigating by grouping the data by “App”
  2. Notice the “intermix_collector” App is experiencing the largest increase in latency. Click on it and then group by a custom attribute
  3. The “intermix_collector” App has custom attributes “DAG” and “Task” that appear in the left pane. This is because these queries are tagged using the Intermix annotation format (see the sketch after the screenshot below).
  4. Group by “DAG” and then “Task” to isolate the specific Task that is seeing a latency spike
  5. Finally, group by ‘Query Group’ to see the queries
[Screenshot: Query Insights grouped by DAG and Task]
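
For context, query annotation generally works by prepending a structured SQL comment to each query so that monitoring tools can attribute it; the key names, format, and query below are illustrative, not intermix.io’s exact spec:

```sql
/* {"app": "intermix_collector", "dag": "hourly_rollup", "task": "load_events"} */
-- The comment above travels with the query text, so system tables and
-- monitoring tools can group this (illustrative) query by app, DAG, and task.
SELECT COUNT(*)
FROM   collector_events
WHERE  event_time > DATEADD(hour, -1, GETDATE());
```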


Find the query that caused a Looker PDT latency spike

In this example, let’s monitor our Looker PDT queries to find the queries which are the slowest.

  1. Click on the “Looker PDT” app so we’re only viewing data for Looker PDT queries
  2. The Looker PDT app has a few custom attributes. Let’s click on “model” to group by model name
  3. Notice the “supply chain” model is the slowest; click on it, then group by “Query Group” to find the individual queries
[Screenshot: Looker PDT queries grouped by model]


What’s Next

Look for “Transfer Insights” soon. This will allow you to monitor the users & apps which are loading data and rows into your Amazon Redshift cluster.  

Schedule a Personalized Demo

I’d appreciate the opportunity to give your team a personalized tour of Query Insights, as well as share our product roadmap and hear your feature requests and feedback.

Please find a time on my calendar at this link.