Update 09/10/2019: AWS released Priority Queuing this week. We’re in the process of testing this new feature and will update this post with our results soon.
Today’s post is a bit long, but for good reason: the Amazon Redshift team recently introduced a new feature related to one of the most important Redshift management tools, the WLM. So bear with us, there are some interesting WLM insights ahead!
One of the key things to get right when optimizing your Redshift cluster is its WLM (Workload Management) configuration. We’ve talked a lot about different aspects of WLM (e.g. in our WLM tuning post and our SQA post), since getting your WLM configuration right can mean the difference between queries running immediately and users waiting minutes or even hours before their queries even start executing.
As a reminder, Redshift’s Workload Manager allows you to define one or more queues for your clusters’ SQL queries, and to define the resources (e.g. memory) and rules (e.g. timeouts) that should apply to queries that run in those queues. The primary goals of the WLM are to allow you to maximize your query throughput and prioritize different types of workloads.
Because cluster resources are finite, configuring your WLM always results in a tradeoff between cluster resources and query concurrency: the more concurrent queries you let run in a queue (slots), the fewer resources (like memory and CPU) each query can be given. Memory is by far the most precious resource to consider when tuning WLM. Queries that need more memory than they are allocated spill over to disk, causing huge slowdowns in performance not only for the query that went disk-based, but for the cluster as a whole (since long-running queries take up memory and a concurrency slot, and disk-based queries consume disk IO). AWS recommends keeping your % of disk-based queries under 10%, but in practice most Redshift administrators can (and should) keep it much lower.
Fortunately, finding the optimal tuning for your WLM is pretty straightforward – if you’re using intermix.io you can use our Throughput Analysis and Memory Analysis tools to quickly view your clusters’ concurrency and memory usage in each WLM queue, and see at a glance which users and applications are experiencing unacceptable queuing:
You can then adjust concurrency and/or memory in the AWS console of your cluster to give more memory to queues that have a large number of disk-based queries, or increase the number of slots in queues that have significant queuing.
One of the limitations of Redshift’s WLM is that the total memory assigned to a queue is divided equally between all query slots (not queries) in the queue. That means that if you, say, allocate 1 GB of memory to a queue with 10 slots, each query that runs in the queue will get 1 GB / 10 = 100 MB of memory, even if it’s the only query running in that queue. So small queries that need less than 100 MB waste the extra memory in their slot, and large queries that need more than 100 MB spill to disk, even if 9 of the 10 slots (900 MB) are sitting idle waiting for a query. Clearly this isn’t optimal.
One workaround is to use the Redshift session parameter wlm_query_slot_count to temporarily increase the number of slots that should be given to a query. If you set this parameter to, say, 2 in your database session before executing your query, then your query will consume 2 WLM concurrency slots (reducing the number of concurrent queries that can run in that queue) and get twice the memory. In the example above, a query that needed 150 MB of memory would spill to disk when running in a single 100 MB slot but run fully in memory when run with 2 slots.
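To make this concrete, here’s a minimal sketch of how you might use the parameter in a session (the table and query are hypothetical):

```sql
-- Temporarily claim 2 slots (and 2 slots' worth of memory) for this session
set wlm_query_slot_count to 2;

-- This memory-hungry query now gets ~200 MB instead of ~100 MB in our example queue
select customer_id, sum(amount)
from big_sales_table   -- hypothetical table
group by customer_id;

-- Reset to the default of 1 slot for subsequent queries
set wlm_query_slot_count to 1;
```

Remember that while the setting is in effect, the session consumes extra slots in its queue, reducing how many other queries can run concurrently there.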
This is a great way to allocate more memory to an individual, memory-hungry query on an ad-hoc basis.
While wlm_query_slot_count can be a good solution for targeting individual memory-hungry queries on an ad-hoc basis, it is difficult to use it to reduce disk-based queries in a general, ongoing way cluster-wide, since each query requires a different setting and knowing in real time how many slots to assign to a particular query is difficult. Further, it is hard to predict in a general way what impact assigning more slots to a query will have on queue wait times.
So where does this leave us? Detailed cluster monitoring lets you tune your concurrency and memory WLM settings to minimize both queue wait time and the % of disk-based queries you have. But since every slot in a queue is given the same fixed fraction of queue memory, inevitably some memory-hungry queries will end up spilling to disk causing query and cluster slowdowns. Using wlm_query_slot_count lets you target some of those individual disk-based queries to try to prevent them from spilling to disk, but makes it difficult to optimize per-query memory allocation in a more general way cluster-wide.
This is exactly the problem that Automatic WLM was introduced to solve. The key innovation of Auto WLM is that it assigns memory to each query dynamically, based on its determination of how much memory the query will need. It’s a little bit like having wlm_query_slot_count tuned for you automatically for each query that runs on your cluster. As a result, memory-hungry queries can be given up to the total amount of memory available to avoid them going disk-based. But nothing is free: giving more memory to memory-hungry queries means that the cluster can run fewer queries concurrently, resulting in more queuing overall.
So if you take away one thing from this post, it’s this: enabling Auto WLM will speed up slow, memory-intensive queries by preventing them from going to disk, but slow down smaller queries by introducing more queue wait time. The degree to which this will impact your cluster performance will depend on your specific workloads and your priorities.
intermix.io not only helps our customers keep their Redshift clusters operating at peak efficiency and their costs down–it helps us do the same for our own internal Redshift clusters. So to see the impact of Automatic WLM, we first enabled Auto WLM on one of our non-production internal Redshift clusters and then used intermix.io to see how our cluster efficiency was impacted.
The first cluster we enabled it on was one of our development Redshift clusters. This cluster runs a batch ETL pipeline, and prior to enabling Auto WLM had a well-tuned WLM with minimal queue time but some large, slow, disk-based queries. Here is a chart of average execution time (light blue), average queue wait time (dark blue), and query count (green line) for a few days before we made the change:
So our average execution time is 5.57 seconds, and average queue time is 0.88 seconds. Looking at the same chart with Maximum selected, we see the queries that take the longest to run:
So while the average queue wait time and execution time is well below the data SLAs we need for this cluster, we have some queries running longer than 60 minutes–there is clearly room for improvement!
After enabling Automatic WLM on August 2nd, we saw a drop in average execution time by about half but a significant spike in average queue wait time, from under 1 second to over 10 seconds. The net result was a significant net increase in average query latency, even though there is a drop in average execution time:
The drop in average execution time is due to the big reduction in execution times for slow, disk-based queries, as shown in this chart of latencies for disk-based queries:
So Automatic WLM reduced our max query runtime from around 50 minutes to around 10 minutes–roughly a 5x improvement!
For this cluster, which runs a consistent set of batch-processing ETL jobs (or “ELT”) and few ad-hoc queries, this net increase in average latency is a good tradeoff to get a big improvement in query runtimes for our slowest disk-based queries.
So far so good. Or so we thought…
Emboldened by our initial test, we enabled Auto WLM on five additional Redshift clusters. As with our first cluster, these five clusters had manually tuned WLMs and were operating well within our data SLAs. These clusters were significantly larger than our first test cluster (both in terms of nodes, query volume, and data stored). All clusters ran batch ETL jobs similar to the first cluster and ran a small percentage of ad-hoc queries.
Four of the five clusters showed a similar trend to our initial test, though we observed more modest improvements (since their maximum query runtimes were smaller–10 minutes or less compared to 50 minutes in our initial test). However, the fifth cluster immediately started setting off alarms due to exceeding one of our data SLAs. The root cause was that one particular set of pipeline queries (a combination of four COPYs) was now exceeding its data SLA summed max runtime requirement of 5 minutes due to excessive queueing. With our manually tuned WLM, each of these queries was taking a max of 30 sec to execute, whereas with Auto WLM they were now taking as much as 4 minutes each due to excessive queueing:
Since there are no parameters to tune with Auto WLM, we had no choice but to revert the WLM mode back to Manual, which rapidly got the queries back under their SLA requirement and our pipeline running smoothly.
In summary, Auto WLM has the following advantages over Manual WLM:
Auto WLM has the following disadvantages over Manual WLM:
We’re still in the early days of Automatic WLM and it’s likely that the AWS Redshift team will continuously make improvements to their tuning algorithms. But for the moment we can make the following broad recommendations around enabling Auto WLM:
As always, the most important thing to do is to measure your Redshift cluster performance quantitatively. It’s the only way to know if Automatic WLM is helping or hurting, and whether just optimizing the most problematic queries or adjusting your Manual WLM is a better option.
In our case, we are disabling it for our initial test cluster, since that cluster is used by our developers for ad-hoc queries. Their feedback was that they could tolerate the long execution times of a small percentage of ETL jobs in exchange for faster interactive ad-hoc queries. We are, however, keeping it enabled for four of the five clusters discussed above for the time being. But since our workloads continuously evolve as more data is added and, most importantly, as we optimize and modify our SQL queries, we will periodically revert to manual WLM whenever we review our cluster costs (and before adding nodes) to see if optimal manual tuning will let us save money by running our clusters with fewer nodes.
Amazon Redshift makes it easy to scale. Need more computing power or disk space? Simply add a few nodes with a click or two in the AWS Console and after a couple of hours of Read-Only mode your data warehouse will be ready to crunch your valuable and growing dataset faster than ever.
This ability to scale with minimal effort is also great when it comes to overcoming expected (or unexpected) short-lived spikes in cluster load. Need to power through that re-processing of a particularly large dataset? Add a few nodes. Need more disk space so you can do a deep copy of your largest table without running out of disk space? Clickity-click, done.
But while it’s just as easy to scale a cluster down as it is to scale a cluster up, too often the decision to remove nodes is a harder one to make, since it can result in an overloaded cluster. Who needs frantic support requests when queries slow to a crawl or your cluster runs out of disk space? It’s too easy to throw money at the problem and take the attitude: don’t fix what ain’t broke.
And to make things more time-consuming, changing your cluster size requires re-tuning (or at least reviewing) your WLM configuration, since the total memory and concurrency available for your queries will change. (We’ve covered tuning WLMs in another blog post.)
As a result, many Redshift customers run with over-provisioned clusters simply because they worry that removing nodes will cause their cluster to go down in flames, resulting in late hours making data pipeline changes or waiting for the cluster to scale back up. Even with affordable Redshift pricing, an over-provisioned cluster can get expensive.
So too often we see Amazon Redshift users spend valuable dollars from their budgets on resources they may not need.
In this post, we’ll describe how we reduced the spend on one of our own Redshift clusters by 28%, without sacrificing query performance. We did so by applying some of the approaches we’ve shared in our top 14 performance tuning techniques for Amazon Redshift.
Before we look at how to evaluate whether nodes can be removed from your cluster, we should first look at how changing the number of nodes in a cluster affects its performance. Intuitively we know that adding nodes should do something like “increase cluster performance.” But what exactly does that mean, and what should you measure to see if the number of nodes you have in your cluster is the right number?
In the context of your queries, adding nodes to a Redshift cluster does multiple things: it adds disk space for storing your data, it adds memory and CPU for executing queries, and it adds slices, which increases the parallelism of query execution and of data transfers into and out of the cluster (COPY and UNLOAD).
So how do you optimize your Redshift spend while reducing the risk of bringing your pipeline to a screeching halt by accidentally under-powering your cluster? Measure, measure, measure. (Ok, that’s measuring three times, but measurement really is that important).
Since we now understand how adding a node impacts cluster performance, we can look at the specific metrics you should measure when considering whether you can remove nodes from your Redshift cluster. These are: disk utilization, queue wait times, the percentage of disk-based queries, and COPY/UNLOAD transfer times.
In the following section we’ll walk through an example analysis of these metrics for one of our own Redshift clusters to see if we can remove some nodes to save money.
The cluster we’re going to analyze has seven dc2.large nodes, and we want to see if we can remove some nodes from it.
Our first step is to see if we need the nodes simply to store our data. Since Amazon Redshift’s disk, memory, and CPU all scale together (in units of nodes), we can’t remove a node if we need that node just for data storage. However, in that case we should consider other solutions to reduce disk utilization so that we can remove a node. For example, changing compression encodings or pruning our tables – for more detail, see our top 14 performance tuning techniques.
We start by looking at the historical disk utilization of the cluster. You can view this in the AWS Redshift dashboard under PercentageDiskSpaceUsed, or in your Intermix dashboard under Storage Analysis. In this case, we’re looking at the last day and a half of disk utilization, which we know represents the current typical workload for this cluster (you may want to look at a longer period of time, depending on what you consider a representative time period is):
We can see that our cluster has used between 30% and 58% of its disk space during this period. Since our cluster has seven dc2.large nodes, the total cluster storage is 1.12 TB (160 GB * 7). So we’re using between 336 GB (30%) and 650 GB (58%) of storage at any given time.
This means that if we want to keep our disk utilization no higher than 70-80% (a good rule of thumb to leave headroom for query execution), we can potentially reduce our cluster size from seven to five, resulting in a total storage of 5 * 160 GB = 800 GB. Our peak disk utilization of 650 GB will be approximately 81%, which is acceptable since the peak is short-lived.
Note that these calculations are all approximate, since your cluster may use a different amount of disk for disk-based queries after the resize, so it is important to look at the characteristics of your disk-based queries too (see below). However, it gives a good starting point for estimating the minimum number of nodes for your cluster.
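If you’d rather check current disk usage with SQL than in the console, a query along these lines against the stv_partitions system table (where capacity and used are reported in 1 MB blocks) gives a point-in-time number; treat it as a sketch:

```sql
-- Approximate cluster-wide disk utilization (point in time)
select
  sum(used)::float / sum(capacity) * 100.0 as pct_disk_used
from stv_partitions;
```

Unlike the CloudWatch chart, this is a single snapshot, so run it at different times of day to catch your peaks.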
So now that we know we may only need five nodes to support our typical data storage needs for the cluster, let’s move on to our WLM concurrency.
The next step is to look at the query concurrency in your cluster to see if the cluster is operating without significant queue wait time, i.e. that queries aren’t sitting around waiting for cluster resources to become available. In your Intermix.io dashboard, you can find this under Throughput Analysis:
This example shows that our four WLM queues are operating with very little queue wait time (a peak queue wait time of around 0.15 sec, which is well below what is acceptable for the workload on this cluster). So this means that as long as we can keep sufficient memory in each queue after the resize, we should be ok. Which brings us to…
Next we evaluate whether the cluster has a large percentage of disk-based queries. To do that, we look at the Memory Analysis page for our cluster in the Intermix dashboard:
The left charts show the total amount of memory consumed by the queries running in each WLM queue at any given time, and the right charts show the number of disk-based queries. By looking at the breakdown of memory usage per queue, we can make the following observations:
The final metric to look at is the COPY and UNLOAD times, since reducing nodes will reduce the number of slices available for parallel data transfers into and out of the cluster. What we want to look for here is that our COPYs and UNLOADs are happening quickly enough such that reducing the available parallelism by ~30% isn’t likely to have an impact. In our case, we look at all COPYs and UNLOADs for the cluster, and compare the average time to our internal requirement for the workflows running on this cluster. For this cluster, we need COPYs and UNLOADs to finish in less than 5 minutes on average:
Since COPYs and UNLOADs are both finishing in around 6 seconds, removing 2 of the 7 nodes shouldn’t cause a performance problem for transfers.
After studying the metrics described above, we concluded that we can remove two nodes from our cluster without impacting our workloads. And lo and behold, we did the resize, retuned our WLMs, and re-reviewed our Storage, Throughput, and Memory metrics to verify that all is well–and all was.
We reduced the cost of operating this cluster by 28% with no negative impact on our pipeline performance!
When it comes to making infrastructure changes that could impact mission-critical data pipelines, the more information you have up-front the more secure you’ll feel. And the more secure you feel, the quicker you can iterate.
Amazon Redshift is a data warehouse that makes it fast, simple and cost-effective to analyze petabytes of data across your data warehouse and data lake. Amazon Redshift can deliver 10x the performance of other data warehouses by using a combination of machine learning, massively parallel processing (MPP), and columnar storage on SSD disks.
But even with all that power, it’s possible that you’ll see uneven query performance, or challenges in scaling workloads. Performance optimization for Amazon Redshift is a matter of doing some thoughtful up-front planning and ongoing monitoring as your data volume, users and cluster grow.
In this post, we’re describing 14 best practices for performance tuning for Amazon Redshift. If you follow these practices, you’ll have a cluster that is faster, cheaper and easier to scale than any other product on the market.
Here are the 14 top performance techniques in summary:
For data teams in charge of managing an Amazon Redshift cluster, using these best practices will help them be successful in building complex data pipelines.
We know this from experience. Here at intermix.io, we use Amazon Redshift as part of our core platform. This blog post compiles learnings we’ve made over three years of operating several, very large Redshift clusters at high scale.
intermix.io is an analytics platform that provides a single monitoring dashboard for data engineers to keep an eye on their mission-critical data flows.
intermix.io uses Amazon Redshift for batch processing large volumes of data in near real-time. Our data pipeline processes over 20 billion rows per day. We serve data from Amazon Redshift to our application by moving it into RDS (via DBLINK) and Amazon Elasticsearch Service. And it goes without saying that we’re drinking our own champagne – we use intermix.io to monitor intermix.io.
The Amazon Redshift Workload Manager (WLM) is critical to managing query performance. Amazon Redshift runs queries in a queueing model. The default WLM configuration has a single queue with five slots. 99% of the time this default configuration will not work for you, and you will need to tweak it. Configuring the WLM for your workloads provides two main benefits:
You can have up to 8 queues with a total of up to 50 slots. A query will run in a single slot, by default. Queries can be routed into queues using certain rules. Setting up your WLM the right way will eliminate queue wait times and disk-based queries.
To set up your WLM for your workloads, we recommend following a four-step process:
We’ve written a detailed guide on tuning WLM that describes the four steps. The approach will help you eliminate queue wait times and reduce disk-based queries. Both will slow your cluster down, so let’s take a closer look.
Eliminate queue wait times by matching queue slot count to peak concurrency
If you’ve used Redshift for any period of time, you may have come across a situation where a query that used to run for two seconds suddenly starts running much, much slower. The most common reason for this is queueing: the query was waiting in a queue because the number of slots in the cluster was too low for the number of concurrent queries executing.
The default configuration allows you to run five concurrent queries in one queue. That means if five queries are executing, the sixth one will queue until a slot becomes available.
The goal is to ensure that queries are not waiting in the queue. This can be done by matching the slot count of the queue with the actual concurrency of the queries running in that queue.
You can eliminate queue wait times by increasing the slot count of queues that show significant queuing, or by routing some of their queries to queues that have free slots.
Another benefit of this approach is that you can use Short Query Acceleration for Amazon Redshift (“SQA”) the right way, because SQA has a downside: activating it consumes memory within the cluster – which brings us to disk-based queries.
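If you want to measure queue wait times directly from system tables rather than a dashboard, a query like this against stl_wlm_query is one way to sketch it (times are stored in microseconds, and user-defined queues have service_class values of 6 and above):

```sql
-- Average queue wait and execution time per WLM queue, last 24 hours
select
  service_class,
  count(*) as queries,
  avg(total_queue_time) / 1000000.0 as avg_queue_seconds,
  avg(total_exec_time)  / 1000000.0 as avg_exec_seconds
from stl_wlm_query
where service_class >= 6
  and queue_start_time > getdate() - interval '1 day'
group by service_class
order by service_class;
```

If avg_queue_seconds is consistently non-trivial for a queue, that queue is a candidate for more slots.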
Reduce disk-based queries by assigning enough memory to your queues
Increasing slot count to eliminate queuing can have an adverse side effect: Disk-based queries. “Disk-based” means that the query runs out of RAM, and begins using the hard drive. Queries go disk-based because the query memory exceeds the ‘memory per slot’ in that queue. The memory per slot is calculated as:
memory assigned to that queue / # of slots
Since each queue is assigned a fixed percentage of a cluster’s memory (a value you set when you configure your WLM queue), adding more slots will decrease the memory per slot.
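You can see the result of this calculation for each queue in the stv_wlm_service_class_config system table, where query_working_mem reports the per-slot memory in MB; a sketch:

```sql
-- Slots and memory-per-slot for each user-defined WLM queue
select
  service_class,
  num_query_tasks   as slots,
  query_working_mem as mb_per_slot
from stv_wlm_service_class_config
where service_class >= 6;
```

Multiplying slots by mb_per_slot gives you the approximate total memory assigned to that queue.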
Disk-based queries cause two major problems:
When the frequency of disk-based queries goes up, a chain reaction can occur: more I/O causes higher CPU utilization, which in turn makes queries run slower, increasing overall concurrency.
As a rule of thumb, maintain your queues so that on average fewer than 10% of queries go disk-based.
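One way to estimate that percentage from system tables is to count queries that had at least one disk-based step in svl_query_summary; a sketch:

```sql
-- Percentage of queries with at least one disk-based step
select
  100.0 * count(distinct case when is_diskbased = 't' then query end)
        / count(distinct query) as pct_disk_based
from svl_query_summary;
```

Note that svl_query_summary holds one row per query step, which is why the query counts distinct query IDs rather than rows.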
With our Throughput and Memory Analysis dashboards in intermix.io, we make finding the right slot count and memory percentage easy.
Here is how:
When you can see the relevant metrics in an intuitive, time-series dashboard, allocating the right slot count and memory percentage for each queue is very easy.
The Amazon Redshift COPY command takes advantage of the parallel architecture and is the recommended way of moving data into Redshift. The COPY command is optimized, but the COPY operation is still expensive. The best practice is to only copy in rows that are needed.
The goal is to minimize the number of rows ingested. The best way to do this is to ensure that your ETL tools are only COPYing in data that has changed since the last time. Otherwise, you will have two issues:
Here is an example of a CDC operation:
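A minimal sketch of such a change-data-capture upsert in SQL, with hypothetical table names and S3 paths, looks like this:

```sql
-- 1. Load only the rows that changed since the last run into a staging table
COPY users_staging (id, name, city)
FROM 's3://your-bucket/users_changes/'              -- hypothetical path: contains only changed rows
IAM_ROLE 'arn:aws:iam::123456789012:role/copy-role' -- hypothetical role
COMPUPDATE OFF STATUPDATE OFF;

-- 2. Remove the old versions of the changed rows from the target table
DELETE FROM users
USING users_staging
WHERE users.id = users_staging.id;

-- 3. Insert the new versions
INSERT INTO users
SELECT * FROM users_staging;
```

The key point is step 1: the COPY only ingests the delta, not the full table.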
Adding compression to large, uncompressed columns will have a big impact on cluster performance. Compression accomplishes two things:
We recommend using the Zstandard (ZSTD) encoding algorithm. This relatively new algorithm provides a high compression ratio and works across all Amazon Redshift data types. ZSTD is especially good with VARCHAR and CHAR fields that have a mixture of long and short strings. Also, unlike some of the other algorithms, ZSTD is unlikely to increase storage utilization.
Below is a real-world example of applying ZSTD to three Amazon Redshift logging tables. The average storage reduction is over 50%!
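To find candidate columns and apply the encoding, you can ask Redshift for recommendations and then declare ZSTD in the table definition; a sketch with a hypothetical table:

```sql
-- Ask Redshift to recommend encodings for an existing table
-- (note: this samples rows and takes a table lock, so run it off-peak)
analyze compression events;

-- Declare ZSTD on new table definitions (hypothetical schema)
create table events_compressed (
  event_id   bigint        encode zstd,
  user_agent varchar(2000) encode zstd,
  created_at timestamp     encode zstd
);
```

For an existing table, the usual approach is to create the new encoded table, deep-copy the data into it, and swap names.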
The Amazon Redshift COPY command loads data into a table. The default behavior of the Redshift COPY command is to run two commands in addition to the load itself: ‘copy analyze’ and ‘analyze compression’.
Amazon Redshift runs these commands to determine the correct encoding for the data being copied. This may be useful when a table is empty. But in the following two cases the extra queries are useless and thus should be eliminated:
In the below example, a single COPY command generates 18 ‘analyze compression’ commands and a single ‘copy analyze’ command.
Extra queries can create performance issues for other queries running on Amazon Redshift. They increase concurrency and hence may saturate the number of slots in a WLM queue, causing other queries to have queue wait times.
The solution is to adjust the COPY command parameters to add “COMPUPDATE OFF” and “STATUPDATE OFF”. These parameters will disable these features during “UPSERT”s.
Here is an example of a “COPY” command with these options set.
-- Load data into the staging table
COPY users_staging (id, name, city)
FROM 's3://your-bucket/users.csv'                   -- hypothetical source path
IAM_ROLE 'arn:aws:iam::123456789012:role/copy-role' -- hypothetical role
COMPUPDATE OFF STATUPDATE OFF;
It is common to connect an application framework like Django to Amazon Redshift. This is useful when using Redshift data in your application, i.e. in an OLTP scenario. Since Amazon Redshift is an OLAP database, it may not handle these queries well.
The challenge of using Redshift as an OLTP database is that queries lack the low latency a traditional RDBMS provides for transactional workloads. Unlike OLTP databases, OLAP databases do not use an index. This is a result of the column-oriented data storage design of Amazon Redshift, which makes the trade-off to perform better for big-data analytical workloads.
Consider this example from a live production cluster. The user ‘django_redshift’ is querying the table ‘search_word_level_course_vector’, a table with 443,744 rows. The query ran 374,372 times. Each query returned a single row.
Each query scans all 443,744 rows, takes about 0.02 seconds to run and returns a single row.
The impact on the cluster is quite dramatic:
There are two approaches to resolving the problem.
Distribution style is a table property which decides how to distribute rows for a given table across the nodes in your Amazon Redshift cluster. Choosing the correct distribution style is important for query performance. The below diagram shows the three options.
There are two major considerations when choosing a distribution style:
The default distribution style is ‘EVEN’. All nodes contain an equal number of rows for a given table. The benefits of the ‘EVEN’ distribution style are:
However, ‘EVEN’ distribution is not optimal when joining two tables. Consider what happens when two tables are JOINed:
… and the query execution continues from here.
With EVEN distribution, it’s easy to see that step 3 requires the movement of data between nodes. This is not ideal because it requires network (broadcast) traffic and increases I/O utilization across the cluster. Both factors increase query latency.
To solve this problem and make JOINs faster, Amazon Redshift offers a KEY-based distribution style. With KEY-based distribution, Amazon Redshift will ensure that for a given column across two tables, step 3 (move data to a single node) will not be necessary. This is accomplished by applying an algorithm when writing data to nodes. The algorithm ensures that rows with the same value in the ‘DISTKEY’ column end up on the same node.
Consider an example where the name of the JOIN column is ‘customer_id’.
In this case, Query 1 will execute faster than the case when table 1 or table 2 use an EVEN-based distribution.
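As a sketch, declaring the distribution key on both tables (hypothetical schemas) looks like this; rows with the same customer_id then land on the same node, so the JOIN needs no data redistribution:

```sql
create table customers (
  customer_id bigint,
  name        varchar(256)
)
diststyle key
distkey (customer_id);

create table orders (
  order_id    bigint,
  customer_id bigint,
  amount      decimal(12,2)
)
diststyle key
distkey (customer_id);

-- This JOIN can now run without moving rows between nodes
select c.name, sum(o.amount)
from customers c
join orders o on o.customer_id = c.customer_id
group by c.name;
```

The benefit only applies to JOINs on the DISTKEY column itself, which leads to the downsides below.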
Downsides of KEY-based distribution
But what happens when I run another type of query against table 1? For example, a query that does not join on “customer_id” but on another column? Or does not do a JOIN at all? Queries which do not JOIN on these columns may run much slower.
There are two main downsides to using KEY-based distribution.
When to use KEY-based distribution
KEY-based distribution is great if and only if you have a major query that you want to optimize. In all other cases, use an EVEN-based distribution. Using EVEN distribution will:
Intermix.io makes it very easy to find tables with skew. The first thing you are going to notice by going into the “Storage Analysis” dashboard is that the utilization of Node 0 is always close to 100%.
This probably means that you have a problem with the distribution key. From there, going into the “Table Analysis” dashboard will show you the row skew for each table.
Amazon Redshift builds a custom query execution plan for every query. For a given query plan, an amount of memory is allocated. The memory allocation is determined by estimating the amount of memory needed to store intermediate query results (as in a JOIN or aggregation).
It is important for a query to have sufficient memory so it does not spill to disk (go “disk-based”). Allocating too much memory is not desirable, either: queries do not share memory, so any memory allocated beyond what a query needs is unavailable to other queries and wasted.
An important point is that the system is not adaptive. If the plan was wrong and the query needs more (or less) memory than was allocated – the execution engine will not go back and adjust the memory allocation after the query has already started executing.
What could cause the plan to be wrong? A major factor is the number of rows in a table.
The ANALYZE command will ensure that the planner has an accurate, up-to-date view of the row counts for tables. Let’s look at an example of what happens if the statistics are wrong.
EXAMPLE 1 – Table has more rows than the planner thinks it has
In this example, the planner will allocate too little memory to the query. Once the query starts running, it will encounter that it requires more memory than it was allocated. The query will go disk-based and thus will run slower than otherwise.
This could have been avoided by running the query in a slot with enough memory. It would not have gone disk-based.
EXAMPLE 2 – Table has fewer rows than the planner thinks it has
In this example, the planner will allocate too much memory to the query. Once the query starts running it will encounter that it requires less memory to store intermediate results than it was allocated. There will be one of two possible results:
Amazon Redshift provides a statistic called “stats off” to help determine when to run the ANALYZE command on a table. The “stats off” metric is the positive % difference between the actual number of rows and the number of rows seen by the planner.
As a best practice, we recommend running ANALYZE on any tables with a “stats off” percentage greater than 10%.
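You can find those tables via the stats_off column of svv_table_info; a sketch:

```sql
-- Tables whose planner statistics are more than 10% off
select "table", stats_off
from svv_table_info
where stats_off > 10
order by stats_off desc;

-- Then refresh statistics for each offender, e.g.:
analyze my_table;  -- hypothetical table name
```

Running this check on a schedule keeps the planner's row counts, and therefore its memory allocations, close to reality.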
Amazon Redshift databases require periodic maintenance known as vacuuming. Amazon Redshift is based on PostgreSQL, but unlike PostgreSQL, Redshift doesn’t offer autovacuum. So when a row is deleted from a table in Amazon Redshift, the disk space used by that row is not immediately recovered. A special command is necessary to recover the disk space for use by other tables.
In Amazon Redshift, the “VACUUM FULL” operation accomplishes two things: it reclaims the disk space of rows marked for deletion, and it sorts the table.
In most cases, it’s not desirable to do both things at the same time. The requirements for sorting a table are very different from reclaiming space. Sorting may use a lot of resources and time.
We recommend separating the VACUUM DELETE ONLY operation from the SORT operation: run VACUUM DELETE ONLY frequently to reclaim space, and schedule the more expensive VACUUM SORT ONLY separately, during off-peak hours.
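In practice the split looks like this (the table name is illustrative): a cheap space-reclaiming pass on a frequent schedule, and the expensive sort pass on its own schedule.

```sql
-- Frequently: reclaim space from deleted rows without sorting
vacuum delete only myschema.events;

-- Less frequently, e.g. off-peak: re-sort the table
vacuum sort only myschema.events;
```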
Amazon Redshift is a distributed, shared-nothing database that scales horizontally across multiple nodes. Query execution time is very tightly correlated with the number of rows a query processes and the amount of data that moves between nodes.
Below is an example of a poorly written query, and two optimizations to make it run faster.
with table1_cte as (
  select * from table1
),
table2_cte as (
  select * from table2
)
select *
from table1_cte as a
JOIN table2_cte as b
ON a.id = b.id
Optimization #1: Limit Rows Processed by using a WHERE clause
Queries run faster when they minimize the amount of data moving between nodes. In practice, this means being careful when writing multi-stage queries where the results of one stage feed into the next.
In the case of our example query, modifying your WHERE clauses to select only the rows you need minimizes the amount of data that has to move around, and speeds up the query.
with table1_cte as (
  select * from table1
  where created_at > '2019-01-01'  -- example filter; column and predicate are illustrative
),
table2_cte as (
  select * from table2
  where created_at > '2019-01-01'
)
select *
from table1_cte as a
JOIN table2_cte as b
ON a.id = b.id
Optimization #2: Limit Columns Scanned
Amazon Redshift is a columnar-oriented database. As a result, scanning a table doesn’t read each row in its entirety. Instead, individual columns can be scanned without needing to read other columns. This means you should select only the columns your query will actually use, and avoid SELECT * wherever possible.
with table1_cte as (
  select id, col_a from table1  -- select only the columns you need (col_a is illustrative)
),
table2_cte as (
  select id, col_b from table2
)
select a.col_a, b.col_b
from table1_cte as a
JOIN table2_cte as b
ON a.id = b.id
Using these two optimizations when writing queries will have dramatic positive effects on your query speeds.
Re-writing queries is easier said than done. That’s why we’ve built “Query Insights” into our product, a dashboard that gives you recommendations on how to re-write queries to drastically improve speed (a few customers went from 30min execution time to 30 seconds).
You can also send an email to your users directly from the intermix.io dashboard to let them know what can be changed about a certain query.
Row skew results when a table uses KEY-based distribution and the values in the DISTKEY column are not evenly distributed. The row skew metric is a positive number ranging from 1 to the number of rows in the table: the ratio of the number of rows on the node holding the most rows for the table to the number of rows on the node holding the fewest.
High row skew results in uneven node disk utilization (cost) and slower queries (performance).
The chart below shows a real-world example. With uneven disk utilization, a single node ends up holding far more rows for that table than the others. This can become a major cost problem if you need to add nodes to your cluster just because a single node is skewed.
With high row skew, even a straight SELECT on that table runs slower than it otherwise would. This is because one node has more data than the others, and the query execution must wait for the “slowest” node to send up its data to the leader.
There are two options to eliminate row skew: switch the table to EVEN or ALL distribution, or choose a different DISTKEY column whose values are spread evenly across the nodes.
The exception is to tolerate row skew if, and only if, you make a conscious decision to optimize a single query. See the section “Use DISTKEYs Only When Necessary” in this article for more information.
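Redshift exposes row skew per table in the svv_table_info system view as the skew_rows column (the ratio described above). A quick way to find your most skewed tables:

```sql
-- Tables with the worst row skew first
select "table", diststyle, tbl_rows, skew_rows
from svv_table_info
where skew_rows is not null
order by skew_rows desc
limit 10;
```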
Short Query Acceleration (SQA) will speed up the execution of short running queries. It does this by selecting certain queries to jump the queue. This can be useful when your cluster runs a mixture of big and small queries. In this case, a small query that would otherwise queue up behind a longer query will execute first.
SQA is enabled by default on Amazon Redshift clusters. But using SQA without any other adjustments to your cluster is not a recipe for success. There are other levers to pull first, and then SQA becomes one part of your performance tuning strategy. See our quick guide to using Short Query Acceleration and WLM for Amazon Redshift for faster queries.
The Amazon Redshift COPY command is the recommended way of moving data into Amazon Redshift. The COPY command takes advantage of the parallel architecture in Amazon Redshift to move data. The COPY command can read files from various sources, including Amazon S3, EMR, DynamoDB, and remote hosts via SSH.
Compressing files in S3 when loading large amounts of data accomplishes three goals: files upload to S3 faster, they cost less to store there, and the COPY itself is quicker because less data travels over the network.
Long-running COPY commands will see the most improvement.
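A sketch of a COPY from gzip-compressed, pipe-delimited files in S3 (the bucket, prefix, IAM role, and table name are placeholders):

```sql
copy myschema.events
from 's3://my-bucket/events/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
gzip
delimiter '|';
```

Splitting the input into multiple compressed files (ideally a multiple of the number of slices in the cluster) lets every slice load in parallel.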
Amazon Redshift is very good for aggregations on very long tables (e.g. tables with > 5 billion rows). Some use cases call for storing raw data in Amazon Redshift, and then reducing the table and storing the results in subsequent, smaller tables later in the data pipeline.
This is a great use case in our opinion. However, managing very large tables presents two challenges: pruning old rows requires expensive DELETE (and subsequent maintenance) operations, and keeping such a long table sorted is slow and resource-intensive.
This section discusses a few approaches to managing these issues for long tables.
Use UNION to make it easier to PRUNE very long tables
Pruning a long table requires running the DELETE operation. This needs to be done rather frequently to avoid the table filling up your disk.
After every DELETE operation, you need to run the following three maintenance steps on the table: VACUUM DELETE ONLY to reclaim the space, VACUUM SORT ONLY to re-sort the remaining rows, and ANALYZE to refresh the planner’s statistics.
On a very long table, these operations can be very expensive.
To avoid the three steps you can partition the very long table into smaller tables. Create multiple tables with the same schema, but with different table names. The rows in the table are then partitioned based on the chosen partition key. The job that INSERTs into these tables must be aware of the partitioning scheme.
To select from this table, create a view (with the original table name) and use the UNION ALL directive to provide a consistent view to your application. This has the following benefits: pruning becomes a simple DROP TABLE of the oldest partition, with no DELETE, VACUUM, or ANALYZE required on a huge table; each partition table stays small, so sorting and analyzing it is cheap; and your application keeps querying a single, stable name.
There is a downside to this approach. SELECTs on the table will go a bit slower, since the UNION ALL operation won’t be as fast as scanning a single table. But depending on your environment, it’s a small trade-off worth making, and a good solution to avoid the pain of maintaining a very long table.
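A sketch of the view, assuming yearly partition tables named events_2018 and events_2019 (names are illustrative). UNION ALL is used rather than UNION, since UNION would add a deduplication step and an expensive implicit sort:

```sql
create view events as
select * from events_2018
union all
select * from events_2019;

-- Pruning a year of history is now a cheap metadata operation:
-- recreate the view without events_2018, then: drop table events_2018;
```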
COPY in sort order
We’ve learned that sorting is an expensive operation. If you use an UPSERT method to COPY new data into a table, you will need to sort that table.
UPSERT is a method of de-duplicating data when copying into Amazon Redshift. The UPSERT operation merges new records with existing records using primary keys. While some RDBMSs support a single “UPSERT” statement, Amazon Redshift does not support it. Instead, you should use a staging table for merging records.
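A sketch of the staging-table merge pattern (table and column names, the S3 location, and the IAM role are illustrative):

```sql
begin;

-- Stage the incoming batch
create temp table stage (like target_table);

copy stage
from 's3://my-bucket/new-data/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
csv;

-- Delete existing rows that are about to be replaced
delete from target_table
using stage
where target_table.id = stage.id;

-- Insert the new and updated rows
insert into target_table
select * from stage;

drop table stage;
commit;
```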
Since UPSERT performs a DELETE, it may leave the table in an unsorted state.
One approach that eliminates the need to sort the table is to COPY in sort order: if data is always appended in sort key order, the table never needs sorting.
There are a few caveats when considering this method: the incoming data must already arrive in sort key order (for example, timestamped events loaded in chronological order), it only works for append-only tables, and any late-arriving rows will leave the table partially unsorted.
Amazon Redshift launched with disruptive pricing. To compare the cost, we’re looking at the price for storing 1TB of data for one year ($ / TB / Year). With a 3-year commitment for the ds2.8xlarge nodes, the price comes down to $934 / TB / Year. That price point is unheard of in the data warehousing world.
The average Amazon Redshift customer doubles their data every year. In fact, that is one of the reasons why it’s important to focus on performance improvements, since managing performance becomes a bigger challenge as data volume grows.
At some point, the cost of storing all this data in Amazon Redshift becomes prohibitive, so keeping a multi-year history of data “forever” can get expensive. Yet deleting data may not be an option, e.g. for regulatory reasons or multi-year comparisons.
The issue here is that Amazon Redshift prices based on the size of your cluster, i.e. compute and storage are coupled. You’ll have to keep adding nodes for storage, even though you may not need the additional computing power of the additional vCPUs.
Because pricing is so cheap, a common initial behavior is to store all historical raw data in Redshift. But data volume is growing. You may also want to use the faster but more expensive dense compute nodes. Many companies don’t want to make a capital commitment beyond a 1-year term.
Enter Amazon Redshift Spectrum. With Redshift Spectrum, you can leave data as-is in your S3 data lake, and query it via Amazon Redshift. You can de-couple compute from storage. This approach makes sense when you have data that doesn’t require frequent access. Leave your “hot” data in Amazon Redshift, and your “cold” data in S3.
The impact on cost can be substantial. The price for S3 Standard Storage is $281 / TB / Year. And so with Redshift Spectrum, you get the best of both worlds. We call it “data tiering”. You get to keep all your historical data, along with the performance of Amazon Redshift. With Redshift Spectrum you can benefit from the cost savings of using S3.
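A sketch of data tiering with Redshift Spectrum: cold data lives in S3 as an external table, hot data stays in a local table, and a query can span both. Names, the S3 location, and the IAM role are placeholders.

```sql
-- One-time setup: external schema backed by the AWS Glue Data Catalog
create external schema spectrum
from data catalog
database 'spectrum_db'
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
create external database if not exists;

-- Cold historical data, left as Parquet files in S3
create external table spectrum.events_archive (
  id      bigint,
  user_id bigint,
  ts      timestamp
)
stored as parquet
location 's3://my-bucket/events-archive/';

-- Query hot and cold data together
select * from myschema.events_recent
union all
select * from spectrum.events_archive;
```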
In “Amazon Redshift Spectrum: Diving into the Data Lake!”, we’re taking an even closer look at using Redshift as part of a data lake architecture, including the use of Amazon Athena and AWS Glue.
Amazon Redshift is a cloud-based data warehouse that offers high performance at lower cost. But uneven query performance or challenges in scaling workloads are possible. With the performance tuning techniques we describe in this blog post, you’ll get the best performance results possible, lower the cost of operating your cluster, and achieve higher productivity for your data team.
If you’d like a pdf version of this post, click here and we’ll send you a link for download.
Looker is a powerful tool for self-service analytics. A lot of companies use Looker on top of Amazon Redshift for business intelligence. It helps companies derive value from their data by making it easy to create custom reports and dashboards.
“Slow Looker dashboards” is one of the most frequent issues we hear about with Amazon Redshift. Looker users tell us that some queries that should take seconds to execute take minutes. Dashboards seem to “hang”.
The problem is probably a mismatch between your Looker workloads and your Amazon Redshift configuration. In this post, I’ll explain the causes of slow dashboards, and how to fine-tune Amazon Redshift to get blazing fast Looker dashboards.
A lot of times, analytics stacks grow out of an experiment. Somebody spins up an Amazon Redshift cluster, builds a few data pipelines, and then connects a Looker dashboard to it. The data is popular, so you set more people up with dashboards. And at some point, the problems start.
Complaints can range from slow dashboards to long execution times for persistent derived tables (PDTs). In some cases, these issues can even appear at the very start of the journey. Consider this post on the Looker Discourse, where the complaint is that “first-run query performance is terrible”.
The key to solving bottlenecks lies in balancing your Looker workloads with your Redshift set-up. So let’s first understand how Amazon Redshift processes queries. Then we’ll look closer at how Looker generates workloads.
A key feature in Amazon Redshift is the workload management (WLM) console. Redshift operates in a queuing model. In the WLM console you can set up different query queues, and then assign a specific group of queries to each queue.
For example, you can assign data loads to one queue and your ad-hoc queries to another. By separating your workloads, you ensure that they don’t block each other. You can also assign the right amount of concurrency, aka “slot count”, to each queue.
The default configuration for Redshift is a single queue with a concurrency of 5. It’s easy to not notice the WLM and the queues when getting started with Redshift. But as your query volume grows, and you run more than 5 concurrent queries, your queries get stuck in the queue, waiting for other queries to finish. When that happens, you’re experiencing slow dashboards.
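You can measure how long queries sit in the queue with the stl_wlm_query system table; total_queue_time is in microseconds. A sketch:

```sql
-- Average queue wait per WLM queue over recent history
select service_class,
       count(*) as queries,
       avg(total_queue_time) / 1000000.0 as avg_queue_seconds
from stl_wlm_query
where service_class > 5  -- user-defined queues start at service_class 6
group by service_class
order by service_class;
```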
There are two components of the Looker platform – LookML and persistent derived tables (“PDT”) – that make it easy for a company to explore its data.
But we’ll see how they can also generate high query volume with heavy workloads that can slow down a Redshift cluster.
LookML is a data modeling language that separates query structure from content. The query structure (e.g. how to join tables) is independent of the query content (e.g. what columns to access, which functions to compute). A LookML project represents a specific collection of models, views and dashboards. The Looker app uses a LookML model to construct SQL queries and run them against Redshift.
The benefit of separating structure from content is that business users can run queries without writing SQL. That abstraction makes a huge difference. Analysts with SQL skills define the data structure once, in one place (a LookML project). Business users then leverage that data structure to focus on the content they need. No need for them to write complex SQL statements. Looker uses the LookML project to generate ad-hoc queries on the fly. Image 2 describes the process behind LookML.
Some Looks create complex queries that need to create temporary tables, e.g. to store an intermediate result. These tables are ephemeral, and the queries to create them run every time a user requests the data. It’s key for derived tables to perform well so they don’t put excessive strain on a cluster.
In some cases, where a query takes a long time to run, creating a so-called PDT (“persistent derived table”) is the better option. Looker writes PDTs into a scratch schema in Redshift, and lets you refresh the PDT on a set schedule. When the user requests the data from the PDT, it has already been created. So compared to temporary tables, PDTs reduce query time and database load.
There’s a natural progression from single queries to PDTs when doing LookML modeling. When you start, you connect all tables into a LookML model to get basic analytics. To get new metrics or roll-ups, and to iterate quickly, you start using derived tables. Then you set it up to persist and leverage PDTs to manage the performance implications.
The separation of structure from content via LookML can have dramatic implications for query volume. The SQL structure of one productive analyst can be re-used by countless users.
A Simple Math Example
Consider a simplified scenario with a 1-node Amazon Redshift cluster, 5 business users and a single LookML project. Each user has 10 dashboards with 20 Looks (i.e. a specific chart). Behind each Look is a single query. With each refresh, they will trigger a total of 5 (users) * 10 (dashboards) * 20 (looks) = 1,000 queries.
With a 1-node Amazon Redshift cluster and a default WLM set-up, you will process 5 queries at a time. You’ll need 1,000 / 5 = 200 cycles to process all queries. While 5 queries process, all other queries have to wait in the queue. Image 3 is a screenshot from the intermix.io dashboard showing what queue wait time can look like.
Let’s assume each query takes 15 seconds to run. For all queries to run, we’re looking at a total of 200 * 15 = 3,000 seconds (50 minutes). In other words, your last 15-second query will finish running after 50 minutes.
Even if you add a node now, i.e. you double the number of queries you can process, you’re only cutting that total wait time in half. That’s still 25 minutes.
Now let’s also add PDTs into the mix. Our PDTs will generate more workloads, often with complex, memory-intensive and long-running queries. The PDTs then compete with the already-slow ad-hoc queries for resources.
A remedy can be to throttle the number of per-user queries. Or to reduce the row limit for queries or allow fewer data points. But the whole point of using Looker is to derive meaningful conclusions from huge amounts of data. Imposing query and row limits, or using fewer data points doesn’t make sense.
One part of the answer is, of course, to add more nodes to your cluster. But it’s easy to overspend, so the other part is to configure your WLM the right way, and identify workloads in Looker that may be unnecessary, such as high refresh rates. You need to configure your WLM in a way that Looker queries and PDTs have enough concurrency and memory.
The 3 steps: set up a dedicated WLM queue for your Looker queries with enough concurrency and memory, eliminate redundant Looker workloads such as unnecessarily frequent dashboard refreshes, and then adjust your node count.
By following these 3 steps, you’ll also be able to optimize your query speeds, your node count and along with it your Redshift spend.
I’ve written before about 4 Simple Steps To Set-up Your WLM in Amazon Redshift For Better Workload Scalability. The 4 steps in summary: set up individual logins for your users, define your workload types (loads, transforms, ad-hoc), create a dedicated queue for each workload, and assign each queue the right slot count and memory percentage.
The same logic applies for your Looker queries. Have your Looker queries run in a queue that’s separate from your loads and transforms. That will allow you to define the right concurrency and memory configuration for that queue. Enough concurrency means each Looker query will run. Enough memory means you minimize the volume of disk-based queries.
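As a sketch, a WLM configuration with a dedicated Looker queue might look like the JSON below. Queue names, slot counts, and memory percentages are illustrative; the right values depend on your workload.

```json
[
  { "query_group": ["loads"],      "query_concurrency": 3, "memory_percent_to_use": 30 },
  { "query_group": ["transforms"], "query_concurrency": 3, "memory_percent_to_use": 30 },
  { "query_group": ["looker"],     "query_concurrency": 8, "memory_percent_to_use": 35 },
  { "query_concurrency": 2,        "memory_percent_to_use": 5 }
]
```

The last entry is the default queue that catches everything else.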
During peak time, Concurrency Scaling for Amazon Redshift gives Redshift clusters additional capacity to handle bursts in query load. It works by off-loading queries to new, “parallel” clusters in the background. Queries are routed based on WLM configuration and rules.
In your intermix.io dashboard, you can see the high watermark / peak concurrency for your Looker queries. You’ll also see how much memory they consume, telling you what memory percentage you need to assign to each slot.
By using the right setting, you can balance your Redshift settings with your Looker workloads. This step alone will give you much faster dashboards.
What is a “redundant Looker workload”? It’s a query that’s running but doesn’t need to. For example, if users are refreshing their dashboards more frequently than they need to. By reducing that refresh rate, your Redshift cluster will have to process fewer queries. That drives down concurrency.
With our app tracing feature, you can see which of your Looker users are driving most of the query volume, down to the single Look. See Image 5 with customer feedback via Slack during our private beta for app tracing.
The final step then is to see if you need to add more nodes, or if there’s an opportunity to reduce your node count.
Once you’ve squeezed all the juice out of your WLM, it’s time to adjust your node count. If you’re still encountering concurrency issues or disk-based queries, it may be time to add more nodes. In most cases, though, we see an opportunity to reduce node count and save on Redshift spend.
Consider the case of our customer Remind, a messaging app for schools. By configuring their WLM they managed to reduce their Amazon Redshift spend by 25%.
That’s it! There are a few more tweaks that will improve performance, such as setting dist / sort keys for your PDTs, or moving some PDTs into your ELT process. But the 3 steps in this post will give you the biggest immediate return on your Looker investment.
So if you’re ready to scale, get fast dashboards, and handle more data and users with Looker, sign-up for a free trial. We’ll help you streamline your WLM queues. And as your company grows, you can be confident that your Looker dashboards will always be lightning fast.
To win in today’s market, companies must build core competencies in advancing their use of data. Data-first companies are dominating their industries, e.g. Netflix vs. network TV, Tinder vs. Match.com, Stitch Fix vs. the mall. Mentions of “AI” are often heard in advertisements, product launches, and earnings calls.
Why is this happening now?
Data is the new differentiator.
These trends have led to a shift in how companies build data platforms. The shift is away from single, monolithic data warehouses to data lake based architectures. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, and then decide at a later time how to use it.
As a result of the shift, traditional concepts of “ETL” are being re-imagined for a new world where data arrives continuously from many sources, storage is cheap enough to keep everything, and transformation happens after loading rather than before.
In response, teams are building complex data assembly lines that share common characteristics: many heterogeneous data sources, multiple processing and storage stages, and a growing number of downstream consumers.
Data lake architectures pose new challenges. Since data is stored without any oversight of its contents, it needs defined mechanisms for cataloging in order to be usable. Without this, data cannot be found or trusted. So meeting the needs of the company requires that data assembly lines assign governance, consistency, and access controls to the data lake.
Companies use the data for reporting and business intelligence, ad-hoc analysis, and training machine learning models.
Building and managing complex data assembly lines requires a new skill set.
The last time we saw a similar shift was 10 years ago, when cloud computing was first developed. Running cloud apps required an operational (i.e. uptime, cost, and performance) mindset coupled with an ability to write code. The DevOps role was born out of a need to manage infrastructure as code. As a result, engineering teams had to establish new teams and hire new types of employees. These employees needed different tools to do their jobs.
Similarly, a new role has emerged to manage data assembly lines: the data engineer. Data engineers manage complex data flows by writing code that manipulates data. But they are also accountable for the uptime, performance, and cost accounting for the data flows. This skillset is a combination of DevOps and data analyst, with a sprinkling of database administrator. No wonder they are in high demand!
Building a data assembly line involves extracting data from application silos, cataloging and securing it, transforming and cleaning it, and making it fast to query.
Raw data is stranded in application silos. 23% of enterprise application workloads live in the cloud, growing to 52% in 5 years. Over half of enterprises intend to have “multi-cloud” architectures. So within five years, over half of a company’s data will be in at least two clouds plus their own data centers.
This data must be inspected and catalogued so it can be understood by analytics databases for processing and analysis.
Data must be secured to ensure that data assets are protected. Access to data must be audited. Access rights to data must be assigned to the correct teams and tools.
3. Transformation and Cleaning.
Data must be reduced and cleaned so it can be trusted. Consider how a customer is identified across the data – is this done by email, name, or some unique ID? If I want to combine two data sources, which one of those do I use? Duplicate data should be omitted, and data should be validated to ensure it is complete, without any gaps.
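As a sketch of the deduplication step in SQL (table and column names are illustrative): keep only the most recent record per customer, using email as the chosen identifier.

```sql
select id, email, name, updated_at
from (
  select *,
         row_number() over (
           partition by email          -- the chosen customer identifier
           order by updated_at desc    -- most recent record wins
         ) as rn
  from raw_customers
) t
where rn = 1;
```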
Data volumes are enormous today because of mobile and IoT. Running fast queries on huge data volumes requires careful planning, tuning, and configuration of data analytics infrastructure.
The first hire for most data teams is usually a data scientist, but this is the wrong first hire to make. The data engineer should be the first hire of any data team. Without a data engineer, the company’s data is stranded and unusable.
Unlike end-user applications, data apps run jobs on the data assembly line. There are three categories of data apps: data integration (ELT) tools that move and transform data, business intelligence tools that query and visualize it, and data science applications that build models on top of it.
Monitoring the performance of these data apps is critical to building reliable data assembly lines.
The fundamental problem solved by the data engineer is to ensure that the data assembly line is working.
Are data flows operating normally?
Do my data tables contain the correct results?
Are data apps able to access the data quickly?
This requires answering these questions in real time, across multiple systems.
In order to accomplish this, you need new types of metrics. Traditional networking monitoring metrics like CPU and network utilization are irrelevant when monitoring data assembly lines, because data flows operate at a different layer.
A monitoring tool for data flows must instead track the flows themselves: data throughput, data latency, and query performance, attributed to the apps and users that generate them.
We started intermix.io to solve these problems. Our mission is to provide data engineers with a single dashboard to help them monitor their mission-critical data flows. And when there are problems, to make sure they are the first to know, and to help them find the reason why.
Today, Uncork Capital and S28 Capital, along with PAUA Ventures, Bastian Lehmann (CEO of Postmates), and Hasso Plattner (founder of SAP), are backing us to help reach this goal. If we are successful, then all companies will have the tools they need to win with data.
Photo by Hans-Peter Gauster