A Quick Guide to Using Short Query Acceleration and WLM in Amazon Redshift for Faster Queries
In November 2017, AWS introduced Short Query Acceleration (SQA) for Amazon Redshift. The promise of SQA is to speed up the execution of short-running queries, with more predictable query execution times. That, of course, sounds amazing. So in this post, we’ll cover what SQA is, how you enable it, and what trade-offs you’re making with SQA.
The Problem of Slow Queries
A common issue with Amazon Redshift is that it slows down as your workload volume increases. Amplitude wrote about it. So did IronSource. And surveys of Redshift users show that “slow queries” and “slow dashboards” come out as the #1 and #2 issues. “Slow queries” are clearly enough of an issue for the Redshift product team to launch a feature like SQA.
But enabling SQA without any other adjustments to your cluster is not a recipe for success. There are other levers to pull first. And then SQA becomes one part of your performance tuning strategy.
How SQA Works
Short Query Acceleration uses machine learning to predict the execution time of a query. Queries predicted to be short-running are moved to a dedicated ‘short query’ queue for faster processing.
The AWS docs describe in detail how you can enable SQA. There are two options available:
- via the Workload Management (WLM) settings in the Amazon Redshift console
- by using the AWS command line interface (CLI) or the Amazon Redshift API.
The key difference between the two options is that your slot count across all queues must be 15 or fewer when you use the WLM option.
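Under the hood, both paths come down to the cluster’s `wlm_json_configuration` parameter. As a sketch, the JSON with SQA turned on might look like the following, built here in Python (the queue values are illustrative; the field names follow the AWS docs for `wlm_json_configuration`):

```python
import json

# Sketch of a wlm_json_configuration value: one user queue, plus the
# entry that enables Short Query Acceleration. Values are illustrative.
wlm_config = [
    {
        "query_concurrency": 5,        # slot count for the default queue
        "memory_percent_to_use": 100,  # this queue gets all remaining memory
    },
    {
        "short_query_queue": True,     # this entry turns SQA on
    },
]

wlm_json = json.dumps(wlm_config)
print(wlm_json)
```

You would then pass the serialized string as the `ParameterValue` for `wlm_json_configuration` in a `modify-cluster-parameter-group` call via the CLI or API.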
Amazon Redshift – Key Concepts
There are three key concepts about Amazon Redshift that are helpful to understand before enabling SQA.
- Query queues
- User and query groups
- Workload management (WLM)
These concepts represent tools you can use to fine-tune your workloads. Using them in the right way will have a much bigger impact than flying solo with SQA. Adding SQA into the mix is “a” performance tuning setting, but not “the” setting. So let’s take a look at these concepts.
Query Queues
Amazon Redshift routes user queries to queues for processing. By default, there are two queues: one for superusers, and a default queue for all other users.
The main purpose of the superuser queue is troubleshooting. It can only run one query at a time, and can’t be configured in any other way.
The default queue has a concurrency of five, with memory equally split across those five slots. You can change the concurrency of the default queue, and also add more queues.
User and Query Groups
It’s impractical to assign each individual query to a queue. With user and query groups, you can group queries based on access rights and workload patterns. The grouping then allows you to “bulk-assign” a set of queries to a specific queue at runtime.
Workload Management (WLM)
WLM is where things come together. It’s where you:
- define your queues
- associate your user and query groups to a queue
- set concurrency/slot count and memory for a queue
The point of defining more than one queue is to separate and protect your workloads from each other. The three common workload types are loads, transforms and ad-hoc queries.
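As a sketch, a three-queue split along those workload types might look like this in `wlm_json_configuration` form (the group names, slot counts, and memory shares are made up for the example; the field names follow the AWS docs):

```python
import json

# Illustrative wlm_json_configuration: one queue per workload type.
# Group names, slot counts, and memory shares are invented for the example.
queues = [
    {"user_group": ["load_users"],      "query_concurrency": 3, "memory_percent_to_use": 40},
    {"user_group": ["transform_users"], "query_concurrency": 5, "memory_percent_to_use": 40},
    {"user_group": ["adhoc_users"],     "query_concurrency": 7, "memory_percent_to_use": 20},
]

# The memory shares across all queues should add up to at most 100 percent.
total_memory = sum(q["memory_percent_to_use"] for q in queues)
print(json.dumps(queues, indent=2))
print(total_memory)
```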
Amazon Redshift WLM Best Practices
I’ve written about anti-patterns in “3 things to avoid when setting up an Amazon Redshift cluster” and best practices in “4 simple steps to set up your WLM in Amazon Redshift”. The four recommended best practices for setting up your cluster are to:
- separate your users (one user, one login)
- group those users by workload patterns (load, transform, ad-hoc)
- create three WLM queues (one for each workload pattern)
- assign each group to its corresponding queue
And that’s when you can start fine-tuning your workloads. The two key choices to make are picking the right (1) slot count and (2) memory percentage for each queue. Slot count determines your concurrency. Giving queries enough memory means they don’t spill to disk, which degrades performance.
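To see what those two choices mean for an individual query, here is a back-of-the-envelope calculation with made-up numbers (actual memory depends on your node type and node count):

```python
# Rough per-slot memory math for a single queue. All numbers are illustrative.
cluster_memory_mb = 244_000   # total memory available to WLM (depends on nodes)
memory_percent = 40           # this queue's share of cluster memory
slot_count = 5                # this queue's concurrency

queue_memory_mb = cluster_memory_mb * memory_percent // 100
per_slot_mb = queue_memory_mb // slot_count

# A query whose working set exceeds its slot's memory spills to disk.
print(per_slot_mb)  # → 19520
```

Raising the slot count increases concurrency but shrinks each query’s memory share, so the two settings have to be balanced against each other.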
Getting your WLM settings right will allow you to get the most out of your cluster resources. Every query will have a slot to run, with enough memory. It’s the first step to workload scalability.
Workload Management and Short Query Acceleration
Queries experience a delay when there are not enough concurrency slots available. A short-running query might take 5 seconds to execute. But it takes 2 minutes to return if it’s stuck in the queue behind a query that takes 1 minute and 55 seconds to finish. With SQA, these short queries can hop to the express queue.
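The arithmetic in that example can be sketched as a toy single-slot queue:

```python
# Toy model of the example above: one slot, one long query ahead of a short one.
long_query_s = 115    # query already occupying the slot (1 minute 55 seconds)
short_query_s = 5     # our short query's actual execution time

# Without SQA, the short query waits for the slot to free up:
turnaround_without_sqa = long_query_s + short_query_s
# With SQA, it hops to the express queue and starts right away:
turnaround_with_sqa = short_query_s

print(turnaround_without_sqa, turnaround_with_sqa)  # → 120 5
```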
But what if you’ve already tuned your WLM for your workloads? That implies that each query will have a concurrency slot available at runtime. There will be no queries stuck in the queue. And that means there’s nothing to accelerate.
In other words, if you fine-tune your cluster by using the WLM, you probably will not need SQA. And the upside of tuning your WLM vs. only using SQA is much bigger. With WLM, every query will run fast. With SQA, some queries may run fast.
So why did Amazon develop SQA if using workload management means you may not need it?
It turns out that fine-tuning your WLM is hard. The standard way of doing it is a combination of running scripts and guessing. But that may cause more frustration than results. And so many users fall back to using the default queue coupled with SQA.
But there’s a huge payoff if you get WLM right. The data team at Remind wrote about their experience using WLM. The results of their optimization efforts delivered a 99.9% reduction of queue wait time. And they did so while reducing the number of nodes from 16 to 12.
Source: Remind Engineering Blog, Improving Redshift Performance with Intermix
When to use SQA and the trade-off you’re making
But even for a fine-tuned cluster, SQA can still come in useful. A good example is an unusual concurrency spike.
Let’s take an e-commerce company as an example. It’s the first day of a new month. Everybody in the company wants to know “how many new widgets did we sell last month?” And then break that query further down by channel, region, customer segment, etc., etc.
That behavior triggers lots of complex, long-running dashboard queries. They may push your query volume above your concurrency limits.
In that situation, having SQA enabled means your short running queries may still produce fast results.
But using SQA comes at a cost, in the form of wasted memory. Since using SQA means adding one more queue, any memory associated with that queue cannot be used by the other queues. That means lower concurrency for those queues. In short, with SQA you are trading off memory for concurrency. The only way to make up for that trade-off is to add more memory by adding more nodes. Which means higher cost.
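Following the post’s reasoning, the trade-off can be put in numbers (the express queue’s memory share here is invented for illustration):

```python
# Illustrative: reserving memory for an express queue shrinks every other slot.
total_memory_percent = 100
express_queue_percent = 10    # share set aside for the short query queue (made up)
main_queue_slots = 5

# Per-slot memory in the main queue, before and after carving out the express queue:
per_slot_before = total_memory_percent / main_queue_slots
per_slot_after = (total_memory_percent - express_queue_percent) / main_queue_slots

print(per_slot_before, per_slot_after)  # → 20.0 18.0
```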
SQA adds an “express queue” to your cluster for short-running queries. But you still need to configure your WLM for your workloads to get blazing-fast performance. Otherwise SQA will not produce the expected results, and may even lead to wasted memory.
If you’re ready to configure your workload management and get faster queries than ever, try intermix.io! It’s the performance analytics for Amazon Redshift that helps you stop guessing and start being more productive with your data.