One of the major propositions of Amazon Redshift is simplicity. It only takes minutes to spin up a cluster. The time-to-first-report, i.e. the time it takes to go from creating a cluster to seeing the results of your first query, can be less than 15 minutes. That’s true even for petabyte-scale workloads.
Because it’s so easy to set-up a cluster, however, it can also be easy to overlook a few housekeeping items when it comes to setting up Redshift. That can cause problems with scaling workloads down the road. A couple of general complaints we often hear are “slow queries in Redshift” or “slow Redshift dashboards”.
Enter Amazon Redshift workload management (WLM). Without using WLM, each query gets equal priority. As a result, some workloads may end up using excessive cluster resources and block your business-critical processes.
Here are three frequent issues we hear:
You can help address these challenges by using our top 15 performance tuning techniques for Amazon Redshift. However, odds are that you’ll also be able to get some quick performance gains by adjusting your WLM.
In this post, we’ll recommend a few simple best practices that will help you configure your WLM the right way and avoid these problems.
WLM is the single best way to achieve concurrency scaling for Amazon Redshift. Your users will be happy (thanks to fast queries). You can scale as your data volume grows. And you’ll spend less time putting out fires and more time on core business processes.
Table of Contents
Amazon Redshift operates in a queueing model. The first step in setting up WLM for Redshift is to define queues for your different workloads. Next, you need to assign a specific concurrency/memory configuration for each queue.
You can define up to 8 queues, with a total of up to 50 slots. In the Amazon Redshift documentation, you’ll read to not go above 15 slots. By using the techniques in this post, however, you’ll be able to use all 50 available slots. You will also have clear visibility to see when and how you need to fine-tune your settings.
The default configuration for Redshift is a single queue with a concurrency of 5. If you run more than 5 concurrent queries, then later queries will need to wait in the queue. That’s when the “Redshift queries taking too long” thing goes into effect.
In Redshift, the available amount of memory is distributed evenly across each concurrency slot. For example, if you have a total of 1 GB of memory, then with the default configuration, each of the 5 concurrency slots gets 200 MB.
If you run a Redshift query that needs more than 200 MB, then it falls back to disk, which means that it takes longer to execute. Disk-based queries also consume a lot of I/O operations. That slows down the entire cluster, not just queries in a specific queue.
Users then try to scale their way out of contention by adding more nodes, which can quickly become an expensive proposition. In addition, you may not see the results you want, since the performance increase is non-linear as you add more nodes.
Instead, you can achieve a much better return on your Amazon Redshift investment by fine-tuning your Redshift WLM. Configuring Redshift specifically for your workloads will help you fix slow and disk-based queries. It’s very likely that the default WLM configuration of 5 slots will not work for you, even if Short Query Acceleration is enabled (which is the Redshift default).
You can read how our customer, Udemy, managed to go all the way to 50 slots and squeeze every bit of memory and concurrency out of their 32-node cluster in this blog post.
Here is what they wrote:
When users run a query in Redshift, WLM assigns the query to the first matching queue and then executes rules based on the WLM configuration.
The key concept for using the WLM is to isolate your workload patterns from each other. You can create independent queues, with each queue supporting a different business process, e.g. data loads or dashboard queries. With separate queues, you can assign the right slot count and memory percentage.
The image below describes the four distinct steps to configure your WLM.
Let’s look at each of these four steps in detail.
The first step is to create individual logins for each Redshift user. A user can be a person, an app, or a process—anything that can run a query.
Separating users may seem obvious, but when logins get shared, you won’t be able to tell who is driving which workloads. Although this may not be too difficult with only a few users, the guesswork will increase quickly as your organization grows.
Most importantly: Never use the default Redshift user for queries. First, it has administrative privileges, which can be a serious security risk. Second, you should consider the default Redshift user as your lifeline when you run into serious contention issues— you’ll still be able to use it to run queries.
If your cluster is already up and running with a few users, we recommend doing a reset: delete the old users and assign everybody new logins.
The next step is to categorize all users by their workload type. There are three generic types of workloads:
Defining users by workload type will allow you to both group them together and separate them from each other. You’ll very likely find that workloads of the same type share similar usage patterns.
We can use these similarities in workload patterns to our advantage. By grouping them, we’ll have groups of queries that tend to require similar cluster resources. For example, loads are often low-memory and high-frequency. Ad-hoc queries, on the other hand, run less frequently, but can be memory-intensive.
Use the CREATE GROUP command to create the three groups ‘load’, ‘transform’ and ‘ad_hoc’, matching the workload types we defined for our users. Use ALTER GROUP to add the users we defined in step #2 to their corresponding group.
You can of course create more granular sub-groups, e.g. for departments such as sales, marketing, or finance. That way, you can give the users in each group the appropriate access to the data they require. However, you should still stay within the logic of workload patterns, without mixing different workload groups.
The final step determines what slot count to give each queue, and the memory allocated to each slot.
You should keep the default queue reserved for the default user, and set it to a concurrency of 1 with a memory percentage of 1%. The default queue is your insurance in case something goes wrong—just consider the 1% of memory as a cost of doing business.
For the other queues, slot count and memory will determine if each query has:
If both of these things are true, that’s when you get blazing fast Redshift queries and throughput. To apply the new settings, you need to create a new parameter group with the Redshift console.
Even with proper queue configuration, some queries within a queue take longer to execute, and may block other short-running queries during peak volume. By using Short Query Acceleration, Redshift will route the short queries to a special “SQA queue” for faster execution.
Concurrency Scaling for Amazon Redshift gives Redshift clusters additional capacity to handle bursts in query load. It works by off-loading queries to new, “parallel” clusters in the background. Queries are routed based on your WLM configuration and rules.
With your new WLM configuration, and SQA and Concurrency Scaling enabled, all that’s left now is to find the right slot count and memory percentage for your queues.
Unfortunately, that process can feel a little bit like trying to look into a black box.
AWS provides a repository of utilities and scripts for querying the system tables (STL tables and STV tables). The scripts help you to find out e.g. what the concurrency high-water mark is in a queue, or which queries fall back to disk.
There are three potential challenges, though, with using these AWS scripts:
That’s why we built intermix.io, making it easier to get valuable Redshift metrics and insights. With our Throughput and Memory Analysis, we make finding the right slot count and memory percentage simple. You can see all of the relevant metrics in an intuitive time-series dashboard.
Our Throughput Analysis shows you if your queues have the right slot count, or if queries are stuck in the queue. When queries get stuck, that’s when your users are waiting for their data.
With our Memory Analysis, you can see the volume of disk-based queries. Some queries will always fall back to disk, due to their size or type. But we recommend keeping the share of disk-based queries below 10% of total query volume per queue.
Ready to start implementing proper Redshift workload management? Start your free trial with intermix.io today, and we’ll work with you to find the right configuration for your queues.
Looker is a powerful tool for self-service data analytics. A lot of companies use Looker together with Amazon Redshift for powerful business intelligence and insights. By making it easy for users to create custom reports and dashboards, Looker helps companies derive more value from their data.
Unfortunately, “slow Looker dashboards” is one of the most frequent issues we hear with Amazon Redshift. Some of our customers who use Looker tell us that queries that should take seconds to execute instead take minutes, while dashboards seem to “hang”.
The good news is that we can probably help: the issue is likely a mismatch between your Looker workloads and your Amazon Redshift configuration. In this post, we’ll explain the causes of slow Looker dashboards, and how to fine-tune Amazon Redshift to get blazing-fast performance from Looker.
Analytics stacks often grow out of a simple experiment. Somebody spins up an Amazon Redshift cluster, builds a few data pipelines, and then connects a Looker dashboard to it. The data is popular, so you set more people up with dashboards—and at some point, the problems start.
Looker performance issues can range from slow dashboards to long execution times for persistent derived tables (PDTs). In some cases, these problems can even appear at the very start of the journey. Consider this post on the Looker forum, which complains that “first-run query performance is terrible”:
The key to solving bottlenecks lies in balancing your Looker workloads with your Redshift setup. First, let’s discuss how Amazon Redshift processes queries, and then we’ll look closer at how Looker generates workloads.
A key feature in Amazon Redshift is the workload management (WLM) console. Redshift operates in a queuing model. The WLM console allows you to set up different query queues, and then assign a specific group of queries to each queue.
For example, you can assign data loads to one queue, and your ad-hoc queries to another. By separating your workloads, you ensure that they don’t block each other. You can also assign the right amount of concurrency, a.k.a. “slot count,” to each queue.The default configuration for Redshift is one queue with a concurrency of 5.
It’s easy to overlook WLM and queuing when getting started with Redshift. But as your query volumes grow and you run more than 5 concurrent queries, your queries will start to get stuck in the queue as they wait for other queries to finish. When that happens, you’re experiencing the “slow Looker dashboards” phenomenon.
There are two components of the Looker platform, LookML and persistent derived tables (“PDTs”), that make it easy for a company to explore its data.
But we’ll see how they can also generate high query volumes with heavy workloads that can slow down your Redshift clusters.
LookML is a data modeling language that separates query structure from content. In other words, the query structure (e.g. how to join tables) is independent of the query content (e.g. what columns to access, or which functions to compute). A LookML project represents a specific collection of models, views and dashboards. The Looker app uses a LookML model to construct SQL queries and run them against Redshift.
The benefit of separating structure from content is that business users can run queries without having to write SQL. That abstraction makes a huge difference. Analysts with SQL skills only define the data structure once in a single place (a LookML project), and business users then leverage that data structure to focus on the content they need.Looker uses the LookML project to generate ad-hoc queries on the fly. The below image illustrates the process behind LookML:
Some Looks create complex queries that need to create temporary tables, e.g. to store an intermediate result of a query. These tables are ephemeral, and the queries to create the table run every time a user requests the data. It’s essential for these derived tables to perform well, so that they don’t put excessive strain on a cluster.
In some cases where a query takes a long time to run, creating a so-called PDT (“persistent derived table”) is the better option. Looker writes PDTs into a scratch Redshift schema, and refreshes the PDT on a set schedule. Compared to temporary tables, PDTs reduce query time and database load, because when a user requests the data from the PDT, it has already been created.
There’s a natural progression from single queries to PDTs when doing LookML modeling. When you’re starting out, you connect all tables into a LookML model to get basic analytics. To get new metrics or roll-ups and to iterate quickly, you start using derived tables. Finally, you leverage PDTs to manage the performance implications.
The separation of structure from content via LookML can have dramatic implications for query volume. The SQL structure of one productive analyst can be reused by countless other users.
Consider a simplified scenario with a single-node Amazon Redshift cluster, 5 business users, and a single LookML project. Each user has 10 dashboards with 20 Looks (i.e. a specific chart). Behind each Look is a single query. With each refresh, they will trigger a total of 5 (users) * 10 (dashboards) * 20 (looks) = 1,000 queries.
With a single-node Amazon Redshift cluster and a default WLM setup, you will process 5 queries at a time. You’ll need 1,000/5 = 200 cycles to process all of these queries. While 5 of these queries process, all of the other ones will have to wait in the queue. The below image shows a screenshot from the intermix.io dashboards that shows what your queue wait times can look like.
Let’s assume each query takes 15 seconds to run. For all queries to run, we’re looking at a total of 200 * 15 = 3,000 seconds (50 minutes). In other words, your last 15-second query will finish running after 50 minutes.
Even if you add a node now, i.e. you double the amount of queries you can process, you’re only cutting that total wait time in half—that’s still 25 minutes.
Now let’s also add PDTs into the mix. Our PDTs will generate more workloads, often with complex, memory-intensive and long-running queries. The PDTs then compete with the already slow ad-hoc queries for resources.
There are a few possible remedies: for example, throttling the number of per-user queries, reducing the row limit for queries, or allowing fewer data points. But the whole point of using Looker is to derive meaningful conclusions from huge amounts of data. Imposing query and row limits, or using fewer data points, doesn’t make sense.
The good news is that there are only 3 steps to getting faster Looker dashboards:
By following these 3 steps, you’ll also be able to optimize your query speeds, your node count, and your Redshift spend.
We’ve written before about “4 Simple Steps To Set-up Your WLM in Amazon Redshift For Better Workload Scalability.” The 4 steps, in summary, are:
The same logic applies for your Looker queries. Have your Looker queries run in a queue that’s separate from your loads and transforms. This will allow you to define the right concurrency and memory configuration for that queue. Having enough concurrency means each Looker query will run, while having enough memory means that you’ll minimize the volume of disk-based queries.
During peak times, Concurrency Scaling for Amazon Redshift gives your Redshift clusters additional capacity to handle any bursts in query load. Concurrency scaling works by off-loading queries to new, “parallel” clusters in the background. Queries are routed based on their WLM configuration and rules.
In your intermix.io dashboard, you can see the high watermark/peak concurrency for your Looker queries. You’ll also see how much memory they consume, telling you what memory percentage you should assign to each slot.
By using the right settings, you can balance your Redshift usage with your Looker workloads. Doing this step alone will give you much faster dashboards.
What is a redundant Looker workload? It’s a query that’s running but doesn’t need to be (for example, if users are refreshing their dashboards more frequently than they need). By reducing that refresh rate, your Redshift cluster will have to process less queries, which in turn drives down concurrency.
With intermix.io’s app tracing feature, you can see which of your Looker users are driving most of the query volume, down to the single Look. Below, you can see feedback from one of our customers during our private beta for app tracing:
Once you’ve squeezed all the juice out of your WLM, it’s time to adjust your Redshift node count. If you’re still encountering concurrency issues or disk-based queries, it may be time to add more nodes. In most cases, though, there’s an opportunity to reduce node count and save on your Redshift spend.
Consider the case of our customer Remind, a messaging app for schools. By configuring their workload management, they managed to reduce their Amazon Redshift spend by 25 percent.
That’s it! There are a few more tweaks you can do that will improve performance. Examples are setting your dist/sort keys for your PDTs, or moving some PDTs into your ELT process. But the 3 steps in this post will give you the biggest immediate return on your Looker investment.
Sign up today for a free trial of intermix.io. As your company grows, you can be confident that your Looker dashboards will always be lightning fast.
To win in today’s market, companies must invest in both their infrastructure and their people. Data-first companies like Netflix, Uber and Tinder are dominating their industries. Mentions of “AI” are often heard in advertisements, product launches, and earnings calls. Businesses are scrambling to re-imagine their tech infrastructure and provide their people with things like Amazon Redshift training.
Why is this happening now?
Data is the new differentiator:
Moving towards a data-driven organization can be daunting for decision-makers. Learning investments such as Amazon Redshift training can sometimes produce limited results, thanks to the complex nature of modern data platforms, which can intimidate even the most experienced IT professionals.
We’re all generating a lot more data than ever before. Storing it is a challenge; analyzing it is an even bigger challenge. Over the years, we’ve seen highly structured data warehouses give way to more anarchic data lakes, which in turn are giving way to multi-tiered structures like data lakehouses.
As a result of the shift, traditional concepts of “ETL” are being re-imagined for a new world where:
The response has been that teams are building complex data assembly lines which have common characteristics:
Data lake architectures pose new challenges. Since lakes store their data without any oversight of the contents, the data needs to have defined mechanisms for cataloging in order to make it usable. Without this, data isn’t reliable. Companies need data pipelines that offer governance consistency and access controls to the data lake.
New technologies like Amazon Redshift allow all of the storage and processing power required to run a data-first business, but you still need the tools to turn that data into actionable insights.
Companies use the data for the following purposes:
Building and managing complex data assembly line requires a new skill set.
The last time we saw a comparable shift was back when cloud computing was first developed. Running cloud apps required an operational (ie uptime, cost, and performance) mindset coupled with an ability to write code. The DevOps role was born out of a need to manage infrastructure as code. As a result, engineering teams had to establish new teams and hire new types of employees. These employees needed different tools to do their jobs.
Similarly, a new role has emerged to manage data pipelines: the data engineer. Data engineers manage complex data flows by writing code that manipulates data. But they are also accountable for the uptime, performance, and cost accounting for the data flows. This skillset is a combination of DevOps and data analyst, with a sprinkling of a database administrator. They also need platform knowledge, such as Amazon Redshift training.
Building a data assembly line involves:
Raw data often exists in application silos, stranded.93% of enterprises have a multi-cloud strategy, while 87% have a hybrid cloud strategy. So, for many organizations, data exists across multiple clouds, plus their own data centers. They will need to inspect and catalog this data before they realize any value.
Security is a fundamental part of any data pipeline. The data owners must audit all access, and also ensure that the right people and processes have the right permissions.
Before anyone can trust data, they have to transform and cleanse it. Consider how you identify a customer across the data – is it by email, name, or some unique ID? If you want to combine two data sources, which one of those do you use? This stage involves data validation, removal of duplication, and handling of null values.
Mobile and IoT generate huge volumes of data. Running fast queries on huge data volumes requires careful planning, tuning, and configuration of data analytics infrastructure.
Most companies start by hiring data engineers to implement reliable data pipelines, and they may also hire data scientists with analytics skills. For a data pipeline, you may also require staff with specialist knowledge, such as Hadoop, ETL or Amazon Redshift training.
Unliked end-user applications, data apps runs jobs on the data assembly line. There are three categories of data apps.
The fundamental problem solved by the data engineer is to ensure that the data assembly line is working.
Are data flows operating normally?
Do my data tables contain the correct results?
Are data apps able to access the data quickly?
This requires answering questions in real-time across multiple systems:
In order to accomplish this, you need new types of metrics. Traditional networking monitoring metrics like CPU and network utilization are irrelevant when monitoring data assembly lines, because data flows operate at a different layer.
A monitoring tool for data flows must consider:
We started intermix.io to solve these problems. Our mission is to provide data engineers with a single dashboard to help them monitor their mission critical data flows. And if there are problems, that they are the first to know and the reason why.
In November 2017, AWS introduced Short Query Acceleration (SQA) for Amazon Redshift. The promise of SQA is to speed up the execution of short running queries, with better predictability of query execution times. That, of course, sounds amazing. In this post, we’ll cover what SQA is, SQA best practices, and what trade-offs you’re making with SQA.
A common issue with Amazon Redshift is that it slows down as your workload volume increases. And surveying Redshift users shows that “slow queries” and “slow dashboards” come out as two of the most common issues. “Slow queries” are enough of an issue for the Redshift product team to launch a feature like SQA.
But using SQA without any other adjustments to your cluster is not a recipe for success. There are other levers to pull first. And then SQA becomes one part of your performance tuning strategy.
Short Query Acceleration uses machine learning to predict the execution time of a query. The algorithm moves short running queries to a ‘short query’ queue for faster processing.
Starting in 2018, SQA is enabled by default for all clusters. Users can still disable SQA via Workload Management on their Redshift console.
Whether you choose to disable SQA or not, it is important to understand 3 key Redshift concepts to make a more informed choice:
These concepts represent tools you can use to fine-tune your workloads. Using them in the right way will have a much bigger impact than flying solo with SQA. Adding SQA into the mix is “a” performance tuning setting, but not “the” setting. Let’s take a look at these concepts.
Amazon Redshift routes user queries to queues for processing. By default, there are two queues available. One for superusers, and a default queue for all other users.
The main purpose of the superuser queue is troubleshooting. It can only run one query at a time. You can’t configure it any other way.
The default queue has a concurrency of five, with memory equally split across those five slots. You can change the concurrency of the default queue, and also add more queues.
It’s impractical to assign each individual query to a queue. With user and query groups, you can group queries based on access rights and workload patterns. The grouping then allows you to “bulk-assign” a set of queries to a specific queue at runtime.
WLM is where things come together. It’s where you:
The point of defining more than one queue is to separate and protect your workloads from each other. The three common workload types are loads, transforms and ad-hoc queries.
To get the most out of SQA, it is important to set up your WLM correctly. Broadly speaking, the 4 recommended best practices for setting up your cluster are:
Related Reading: 3 things to avoid when setting up an Amazon Redshift cluster
That’s when you can start fine-tuning your workloads. The two key choices to make are picking the right (1) slot count and (2) memory percentage for each queue. Slot count determines your concurrency. Giving queries enough memory means they don’t fall back to disk, degrading performance.
Getting your WLM settings right will allow you to get the most out of your cluster resources. Every query will have a slot to run, with enough memory. It’s the first step to workload scalability.
Queries experience a delay when there are not enough concurrency slots available. A short-running query might take 5 seconds to execute. But it takes 2 minutes to return if it’s stuck in the queue waiting for a query that takes 1:55 minutes to finish. With the help of SQA, these short queries can hop to the express queue.
But what if you’ve already tuned your WLM for your workloads? That implies that each query will have a concurrency slot available at runtime. There will be no queries stuck in the queue. And that means there’s nothing to accelerate.
In other words, if you fine-tune your cluster by using the WLM, you probably will not need SQA. And the upside of tuning your WLM vs. only using SQA is much bigger. With WLM, every query will run fast. With SQA, only short-running queries may run fast.
So why did Amazon develop SQA if using workload management means you may not need it?
Turns out that fine-tuning your WLM is hard. The standard way of doing it is a combination of running scripts and guessing. But that may lead to more frustration than results. Therefore, a lot of users fall back to using the default queue coupled with SQA.
But there’s a huge payoff if you get WLM right. The data team at Remind used Intermix to improve their Redshift WLM performance. The results of their optimization efforts delivered a 99.9% reduction in queue wait times. And they did so while reducing the number of nodes from 16 to 12.
But even for a fine-tuned cluster, SQA can still come in useful. A good use case example are unusual concurrency spikes.
Let’s take an e-commerce company as an example. It’s the first day of a new month. Everybody in the company wants to know “how many new widgets did we sell last month?” And then break that query further by channel, region, customer segment etc.
To get those insights, a lot of complex, long-running dashboard queries need to be triggered. That may push your query volume above your concurrency limits.
In such a situation, having SQA enabled means your short running queries will produce fast results. That’s why, it is always a good idea to leave SQA enabled, by default.
However, using SQA comes at a cost, in the form of wasted memory. Using SQA means adding one more queue. No other queue can use the memory associated with the added queue. That translates to concurrency for those queues.
In short, with SQA you are trading off memory for concurrency. The only way to make up for that trade-off is to add more memory by adding more nodes, which translates to higher costs.
SQA adds an “express queue” to your cluster for short-running queries. But you still need to configure your WLM for your workloads to get blazing fast performance. Otherwise, SQA will not produce the expected results, and may even lead to wasted memory.
If you’re ready to configure your workload management and get faster queries than ever, try intermix.io. It’s the performance analytics for Amazon Redshift that takes the guesswork out of tuning your Redshift performance. Stop guessing and start being more productive with your data.
Amazon Redshift is a data warehouse that makes it fast, simple and cost-effective to analyze petabytes of data across your data warehouse and data lake. Amazon Redshift can deliver 10x the performance of other data warehouses by using a combination of machine learning, massively parallel processing (MPP), and columnar storage on SSD disks.
But even with all that power, it’s possible that you’ll see uneven query performance or challenges in scaling workloads. Performance optimization for Amazon Redshift is a matter of doing some thoughtful up-front planning and ongoing monitoring as your data volume, users and cluster grow.
In this article, we’re giving you our 15 best practices for performance tuning Redshift. With these practices, you’ll have a cluster that is faster, cheaper, and easier to scale than any other product on the market.
Here are the 15 performance techniques in summary:
At intermix.io, we use Amazon Redshift as part of our core platform. This blog post compiles our learnings from over three years of operating several large Redshift clusters at a high scale.
intermix.io is an analytics platform that provides a single monitoring dashboard for data engineers to keep an eye on their mission-critical data flows.
intermix.io uses Amazon Redshift for batch processing large volumes of data in near real-time. Our data pipeline processes over 20 billion rows per day. We serve data from Amazon Redshift to our application by moving it into RDS (via DBLINK) and Amazon Elasticsearch Service.
The Amazon Redshift Workload Manager (WLM) is critical to managing query performance. Amazon Redshift runs queries in a queueing model. The default WLM configuration has a single queue with five slots. Almost 99% of the time, this default configuration will not work for you and you will need to tweak it. Configuring the WLM for your workloads provides two main benefits:
You can have up to 8 queues with a total of up to 50 slots. A query will run in a single slot, by default. Queries can be routed into queues using certain rules. Setting up your WLM the right way will eliminate queue wait times and disk-based queries.
To set-up your WLM for your workloads, we recommend following a four-step process:
Our guide to setting up Redshift WLM to improve performance walks you through our four-step process to eliminate queue wait times and reduce disk-based queries. Both slow your cluster down, so let’s take a closer look at this Redshift performance tuning technique.
If you’ve used Redshift for any period of time, you may have come across a situation where a query that used to run for two seconds starts running much slower. The most common reason for this is queuing. The query was waiting in a queue because the number of slots in the cluster was too low for the number of concurrent queries that were executing.
The default configuration allows you to run five concurrent queries in one queue. That means if five queries are executing, the sixth one will queue until a slot becomes available.
The goal is to ensure that queries are not waiting in the queue. This can be done by matching the slot count of the queue with the actual concurrency of the queries running in that queue.
You can eliminate queue wait times by:
There is another benefit to this approach – you can use Short Query Acceleration for Amazon Redshift (“SQA”) the right way and avoid the downside of SQA. Activating SQA consumes memory within the cluster – which brings us to disk-based queries.
Increasing slot count to eliminate queuing can have an adverse side effect: disk-based queries. “Disk-based” means that the query runs out of RAM, and begins using the hard drive. Queries go disk-based because the query memory exceeds the ‘memory per slot’ in that queue. The memory per slot is calculated as:
memory assigned to that queue / # of slots
Since each queue is assigned a fixed percentage of a cluster’s memory (a value you’ll set when you configure your WLM queue), adding more slots will decrease the memory per slot.
Disk-based queries cause two major problems:
When the frequency of disk-based queries goes up, a chain reaction can occur. More I/O causes more CPU, which in turn make queries run slower, increasing overall concurrency.
As a rule of thumb, maintain your queues such that fewer than 10% of queries go disk-based.
With our Throughput and Memory Analysis dashboards in intermix.io, we make finding the right slot count and memory percentage easy. When you can see the relevant metrics in an intuitive, time-series dashboard, allocating the right slot count and memory percentage for each queue becomes simple.FIND THE RIGHT SLOT COUNT AND MEMORY PERCENTAGE FOR YOUR CLUSTER NOW
The Amazon Redshift COPY command takes advantage of the parallel architecture and is the recommended way of moving data into Redshift. The COPY command is optimized, but the COPY operation is still expensive. The best practice is to only copy rows that you need.
The goal is to minimize the number of rows ingested. The best way to do this is to ensure that your ETL tools are only COPYing in data that has changed since the last time. Otherwise, you will have two issues:
Here is an example of a CDC operation:
Adding compression to large, uncompressed columns has a big impact on cluster performance. Compression accomplishes two things:
We recommend using the Zstandard (ZSTD) encoding algorithm. This relatively new algorithm provides a high compression ratio and works across all Amazon Redshift data types. ZSTD is especially good with VARCHAR and CHAR fields that have a mixture of long and short strings. Unlike some of the other algorithms, ZSTD is unlikely to increase storage utilization.
Here is a real-world example of applying ZSTD to three Amazon Redshift logging tables. The average storage reduction is over 50%!
The Amazon Redshift COPY command loads data into a table. The default behavior of Redshift COPY command is to run two commands:
Amazon Redshift runs these commands to determine the correct encoding for the data being copied. This may be useful when a table is empty. But in the following cases, the extra queries are useless and should be eliminated:
In the example below, a single COPY command generates 18 ‘analyze compression’ commands and a single ‘copy analyze’ command.
Extra queries can create performance issues for other queries running on Amazon Redshift. They increase concurrency and hence, may saturate the number of slots in a WLM queue, causing other queries to have queue wait times.
The solution is to adjust the COPY command parameters to add “COMPUPDATE OFF” and “STATUPDATE OFF”. These parameters will disable these features during “UPSERT”s.
Here is an example of a “COPY” command carried out with those settings:
-- Load data into the staging table
COPY users_staging (id, name, city)
COMPUPDATE OFF STATUPDATE OFF;
It is common to connect an application framework like Django to Amazon Redshift. This is useful when using Redshift data in your application, i.e. in an OLTP scenario. However, since Amazon Redshift is an OLAP database, there is a chance it might not handle these queries well.
The challenge of using Redshift as an OLTP database is that queries can lack the low-latency that exists on a traditional RDBMS. Unlike OLTP databases, OLAP databases do not use an index. This is a result of the column-oriented data storage design of Amazon Redshift, which makes the trade-off to perform better for big data analytical workloads.
Consider this example from a live production cluster. The user ‘django_redshift’ is querying the table ‘search_word_level_course_vector”, a table with 443,744 rows. The query ran 374,372 times. Each query returned a single row.
Each query scans all 443,744 rows, takes about 0.02 seconds to run and returns a single row.
The impact on the cluster is quite dramatic:
There are two approaches to solve the problem:
Distribution style is a table property that decides how to distribute rows for a given table across the nodes in your Amazon Redshift cluster. Choosing the correct distribution style is important for query performance.
There are two major considerations to keep in mind when choosing a distribution style:
The default distribution style is ‘EVEN’. All nodes contain an equal number of rows for a given table. The benefits of the ‘EVEN’ distribution style are:
However, ‘EVEN’ distribution is not optimal when joining two tables. Consider what happens when two tables are JOINed:
and the query execution continues from here.
With EVEN distribution, it’s easy to see that step 3 requires the movement of data between nodes. This is not ideal because it requires network (broadcast) traffic and increases I/O utilization across the cluster. Both factors increase query latency.
To solve this problem and make JOINs faster, Amazon Redshift offers a KEY-based distribution style. With KEY-based distribution, Amazon Redshift will ensure that for a given column across two tables, step 3 (move data to a single node) will not be necessary. This is accomplished by applying an algorithm when writing data to nodes. The algorithm ensures that rows with the same value in the ‘DISTKEY’ column end up on the same node.
Consider an example where the name of the JOIN column is ‘customer_id’.
In this case, Query 1 will execute faster than the case when table 1 or table 2 uses an EVEN-based distribution.
Downsides of KEY-based distribution
But what happens when you run another type of query against table 1? For example, a query that does not join on “customer_id” but on another column? Or does not do a JOIN at all? Queries which do not JOIN on these columns may run much slower.
There are two main downsides of using KEY based distribution.
When to use KEY-based distribution
KEY-based distribution is great if and only if you have a major query that you want to optimize. In all other cases, use an EVEN-based distribution. Using EVEN distribution will:
Intermix.io makes it very easy to find tables with skew. The first thing you are going to notice by going into the “Storage Analysis” dashboard is that the utilization of Node 0 is always close to 100%.
This probably means that you have a problem with the distribution key. From there, going into the “Table Analysis” dashboard will show you the row skew for each table.Easily find and fix Row skew in Amazon Redshift
Amazon Redshift builds a custom query execution plan for every query. For a given query plan, an amount of memory is allocated. The memory allocation is determined by estimating the amount of memory needed to store intermediate query results (as in a JOIN or aggregation).
The query plan allocates a certain amount of memory to each query by estimating the amount of memory needed to store intermediate results (e.g. for a JOIN or aggregation).
It is important for a query to have sufficient memory to not spill to disk (go “disk-based”). Allocating too much memory is not desirable, either. Queries do not share memory. Allocating more memory than needed wastes memory since it is unavailable to other queries.
Here, it is important to note that the system is not adaptive. If the plan was wrong and the query needs more (or less) memory than was allocated – the execution engine will not go back and adjust the memory allocation after the query has already started executing.
What could cause the plan to be wrong? Very often, it is the number of rows in a table.
The ANALYZE command will ensure that the planner has an accurate, up-to-date view of the row counts for tables. Let’s look at an example of what happens if the statistics are wrong.
Let’s say the planner allocates too little memory to the query. Once the query starts running, it will encounter that it requires more memory than it was allocated. The query will go disk-based and thus will run slower than otherwise.
This could have been avoided by running the query in a slot with enough memory.
Now, let’s assume that the planner allocates too much memory to the query. Once the query starts running it will encounter that it requires less memory to store intermediate results than it was allocated. It will lead to either of these two results:
Amazon Redshift provides
a statistics called “stats off” to help determine when to run the ANALYZE command on a table. The “stats off” metric is the positive percentage difference between the actual number of rows and the number of rows seen by the planner.
As a best practice, we recommend running ANALYZE on any tables with a “stats off” percentage greater than 10%.
Amazon Redshift is a distributed, shared-nothing database that scales horizontally across multiple nodes. Query execution time is very tightly correlated with:
Below is an example of a poorly written query, and two optimizations to make it run faster.
select * from table1
select * from table2
from table1_cte as a
JOIN table2_cte as b
ON a.id = b.id
Queries can run faster by minimizing the amount of data moving between nodes. In practice, this means being careful when writing multi-stage queries where the results of one stage feeds into the next.
In the case of our example query, modifying your ‘WHERE’ clauses to only select rows needed will minimize the amount of data that needs to be moved around and speed up the query.
from table1_cte as a
JOIN table2_cte as b
ON a.id = b.id
Amazon Redshift is a column-oriented database. As a result, scanning a table doesn’t read each row in its entirety. Instead, individual columns can be scanned without needing to read other columns. You should be careful to only select columns that you will use for your query. Try to avoid using a
operation in all cases.
from table1_cte as a
JOIN table2_cte as b
ON a.id = b.id
The two optimizations can dramatically improve your query speeds.
Re-writing queries is easier said than done, though. That’s why, we’ve built “Query Insights” into our product, a dashboard that gives you recommendations on how to re-write queries to drastically improve speed (a few customers went from 30min execution time to 30 seconds).
You can also send an email to your users directly from intermix.io dashboard to let them know what can be changed about a certain query.Instantly Find and Fix Slow Queries
Row Skew happens when a table uses KEY based distribution, and the values in the DISTKEY column are not evenly distributed. The row skew metrics is a positive integer ranging from 1 to the number of rows in the table. Row skew is the ratio of:
High row skew results in uneven node disk utilization (cost) and slower queries (performance).
The chart below shows a real-world example. With uneven disk utilization, a single node(s) ends up having more rows for that table. This can be a major (cost) problem if you need to add more nodes in your cluster just because a single node is skewed.
With high row skew, doing a straight SELECT on that table will be slower than otherwise. This is because one node has more data than the next, and the query execution must wait for the “slowest” node to send up its data to the leader.
There are two options to eliminate row skew:
The exception to tolerate row skew is if – and only if – you make a conscious decision to optimize a single query. See the section “Use DISTKEYs Only When Necessary” in this article for more information.
Short Query Acceleration (SQA) will speed up the execution of short running queries. It does so by selecting certain queries to jump the queue. This can be useful when your cluster runs a mixture of big and small queries. In this case, a small query that would otherwise queue up behind a longer query will execute first.
SQA is enabled by default on Amazon Redshift clusters. But using SQA without any other adjustments to your cluster is not a recipe for success. There are other levers to pull first. See our quick guide to using Short Query Acceleration and WLM for Amazon Redshift for faster queries.
The Amazon Redshift COPY command is the recommended way of moving data into Amazon Redshift. The COPY command takes advantage of the parallel architecture in Amazon Redshift to move data. The COPY command can read files from various sources, including EMR, DynamoDB, and remote hosts via SSH.
Compressing files in S3 when loading large amounts of data will accomplish three goals:
Long-running COPY commands will see the most improvement with this performance tuning technique for Redshift.
Amazon Redshift is very good for aggregations on very long tables (e.g. tables with > 5 billion rows). Some use cases call for storing raw data in Amazon Redshift, reducing the table, and storing the results in subsequent, smaller tables later in the data pipeline.
This is a great use case in our opinion. However, managing very large tables presents two challenges:
This section discusses a few approaches to managing these issues for long tables more efficiently.
Pruning a long table requires running the DELETE operation. This needs to be done rather frequently to avoid the table filling up your disk.
After every DELETE operation, you need to run the following three maintenance steps on the table:
On a very long table, these operations can be very expensive.
To avoid the three steps, you can partition the very long table into smaller tables. Create multiple tables with the same schema, but with different table names. The rows in the table are then partitioned based on the chosen partition key. The job that INSERTs into these tables must be aware of the partitioning scheme.
To select from this table, create a view (with the original table name) and use the UNION directive to provide a consistent view to your application. This has the following benefits:
There is a downside to this approach, though. SELECTs on the table will go a bit slower since the UNION operation won’t be as fast as scanning a single table. But depending on your environment, it can be a small trade-off to avoid the pain of maintaining a very long table.
We’ve learned that sorting is an expensive operation. If you use an UPSERT method to COPY new data into a table, you will need to sort that table.
UPSERT is a method of de-duplicating data when copying into Amazon Redshift. The UPSERT operation merges new records with existing records using primary keys. While some RDBMSs support a single “UPSERT” statement, Amazon Redshift does not support it. Instead, you should use a staging table for merging records.
Since UPSERT performs a DELETE, it may leave the table in an unsorted state.
One approach to eliminate the need to sort the table is to COPY in sort order.
There are a few caveats when considering using this method:
In 2019, Amazon introduced RA3 nodes for Redshift. They use large SSDs for local caching coupled with automatic data eviction, data pre-fetching, and other data management techniques. With RA3 nodes, you can separate compute and storage. You can scale your clusters according to your compute needs, which brings down your overall cost of analytics.
The basic idea behind RA3 nodes is to use S3 for storing all permanent data and use the local disk for caching. You can fetch data from S3 on-demand. Additionally, Redshift identifies data that is used frequently – hot data – and keeps it local for fast compute times. You can create RA3 node clusters via the AWS management console.
Amazon Redshift launched with disruptive pricing. To compare the cost, we’re looking at the price for storing 1TB of data for one year ($ / TB / Year). With a 3-year commitment for the ds2.8xlarge nodes, the price comes down to $934 / TB / Year. That price point is unheard of in the data warehousing world.
The average Amazon Redshift customers double their data every year. In fact, that is one of the reasons why it’s important to focus on performance improvements – since managing performance becomes a bigger challenge as data volume grows.
At some point, the cost of storing all this data in Amazon Redshift becomes prohibitive. Keeping a multi-year history of data “forever” can become expensive. Deleting data may not be an option due to regulatory reasons.
Amazon Redshift prices are based on the size of your cluster. In other words, compute and storage are coupled. You’ll have to keep adding nodes for storage, even though you may not need the additional computing power of the vCPUs.
However, since storing data is cheap on Amazon Redshift, a common initial behavior is to store all historical raw data in Redshift. But data volume is growing. You may also want to use the faster but more expensive dense compute nodes. Many companies don’t want to make a capital commitment beyond a 1-year term.
Enter Amazon Redshift Spectrum. With Redshift Spectrum, you can leave data as-is in your S3 data lake, and query it via Amazon Redshift. In other words, you can de-couple compute from storage. This approach makes sense when you have data that doesn’t require frequent access. Leave your “hot” data in Amazon Redshift, and your “cold” data in S3.
The impact on cost can be substantial. The price for S3 Standard Storage is $281 / TB / Year. Thus, with Redshift Spectrum, you get the best of both worlds. We call it “data tiering”. You get to keep all your historical data, along with the performance of Amazon Redshift. With Redshift Spectrum you can benefit from the cost savings of using S3.
In “Amazon Redshift Spectrum: How Does It Enable a Data Lake?”, we’re taking an even closer look at using Redshift as part of a data lake architecture, including the use of Amazon Athena and AWS Glue. Talking of Redshift Spectrum, here is a bonus tip to fine-tune the performance of your Redshift cluster.
It is more efficient to store S3 data in columnar formats, such as Apache Parquet. Columnar formats deliver better performance when compared to row-based formats. There is a factor for cost-efficiency, too. With columnar formats, Redshift Spectrum scans only the columns that have the required data records. It does not read all the columns. Since you pay for the volume of data scanned, Apache Parquet helps cut down your data analysis costs.
You can even further optimize Parquet by using frequently filtered columns to sort data, thus enhancing the cost-efficiency of Redshift and Redshift Spectrum.
Amazon Redshift is a cloud-based data warehouse that offers high performance at low costs. But uneven query performance or challenges in scaling workloads are common issues with Amazon Redshift. Use the performance tuning techniques for Redshift mentioned here to lower the cost of your cluster, improve query performance, and make your data team more productive.
If you are a data engineer keen on enhancing their skills, subscribe to our weekly newsletter. We bring you the most newsworthy developments in data engineering, right to your inbox.