Are you paying too much for Amazon Redshift? 4 steps to reduce your costs.
Amazon Redshift’s pricing can be a bit confusing, since you can choose to pay hourly based on your node usage (on demand), by the number of bytes scanned (spectrum), by the time spent over your free daily credits (concurrency scaling), or by committing to an annual plan (reserved instance).
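To make the on-demand model concrete, here's a minimal sketch of the monthly math. The $0.25/hour dc2.large rate is an assumption (roughly the us-east-1 on-demand rate); check the current AWS pricing page for your region and node type:

```python
# Rough on-demand cost sketch. The $0.25/hour dc2.large rate is an
# assumption -- verify against the AWS pricing page for your region.
HOURS_PER_MONTH = 730  # average hours in a month

def monthly_on_demand_cost(node_count, hourly_rate_per_node=0.25):
    """Approximate monthly on-demand cost for a Redshift cluster."""
    return node_count * hourly_rate_per_node * HOURS_PER_MONTH

# A 7-node dc2.large cluster costs roughly:
print(monthly_on_demand_cost(7))  # -> 1277.5 (about $1,278/month)
```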
And as you scale, you may increase the cost of your Redshift account by creating inefficient processes. So if you find yourself in this position, try these tips to improve your performance AND reduce your Redshift costs:
Scaling Up Is Easy. Reducing Your Redshift Cost is Not.
Amazon Redshift makes it easy to scale. Need more computing power or disk space? Simply add a few nodes with a click or two in the AWS Console and your data warehouse will be ready to crunch your valuable and growing dataset faster than ever.
This ability to scale with minimal effort is also great when it comes to overcoming short-lived spikes in cluster load. Need to power through that re-processing of a particularly large dataset? Add a few nodes. Need more disk space so you can do a deep copy of your largest table without running out of disk space? Clickity-click, done.
While the act of scaling a cluster up or down is simple, the decision to remove nodes is often a difficult one to make, because folks fear overloading the cluster. Who needs frantic support requests when queries slow to a crawl or the cluster runs out of disk space? It’s too easy to throw money at the problem and adopt a “don’t fix what ain’t broke” attitude.
To complicate things further, changing your cluster size requires re-tuning (or at least reviewing) your WLM configuration, since the total memory and concurrency available for your queries will change. (We’ve covered tuning WLMs in another blog post.)
As a result, many Redshift customers run with over-provisioned clusters simply because they worry that removing nodes will cause their cluster to go down in flames. They fear late hours making data pipeline changes, or waiting for the cluster to scale back up. Even with affordable Redshift pricing, an over-provisioned cluster can get expensive.
Many of our customers start using Intermix to improve their performance and discover they’ve been spending money they didn’t need to spend.
In this post, we’ll describe how we reduced the spend on one of our own Redshift clusters by 28% without sacrificing query performance. We did so by applying some of the approaches we’ve shared in our top 14 performance tuning techniques for Amazon Redshift.
How Node Count Affects Cluster Performance and Redshift Cost
Before we decide to remove nodes from a cluster, we must first examine how the number of nodes affects the cluster’s performance. One might think that adding nodes should do something like “increase cluster performance.” But what exactly does that mean, and what should you measure to see if the number of nodes you have in your cluster is the right number?
In the context of your queries, adding nodes to a Redshift cluster does multiple things:
- It gives your cluster more storage space. For example, adding a dc2.large node will give you 160 GB of additional storage (how it is utilized depends on your tables’ DISTKEY setup). This extra disk space is used for storing your data and as temporary storage when running disk-based queries (e.g. large joins).
- It gives you more memory to allocate to your queries. This means you can either increase the amount of memory allocated per query or increase the concurrency of your queries (i.e. increase the WLM slot count). If you are seeing a large % of disk-based queries, you may be able to optimize your WLMs after adding nodes to try to move some of those queries into memory.
- It gives your queries more CPU. This means that you can increase your query concurrency in your WLM if your queries are CPU-bound.
- It gives you more IO for getting data into and out of your cluster. Redshift is very efficient at copying data to and from S3 because it parallelizes transfers among the slices (one virtual CPU, share of memory, and disk) in your cluster. So adding nodes may make your COPYs and UNLOADs faster.
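The list above boils down to resources scaling linearly with node count. Here's a small sketch that totals them up from per-node specs. The dc2.large figures (160 GB storage, about 15 GiB memory, 2 slices per node) are assumptions based on AWS's published specs; verify them for your node type:

```python
# Per-node specs for dc2.large -- assumed values, verify against AWS docs.
NODE_SPECS = {"storage_gb": 160, "memory_gib": 15, "slices": 2}

def cluster_totals(node_count, specs=NODE_SPECS):
    """Storage, memory, and slice count all scale linearly with nodes."""
    return {k: v * node_count for k, v in specs.items()}

print(cluster_totals(7))
# A 7-node cluster: 1120 GB storage, 105 GiB memory, 14 slices
```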
Measure Twice, Cut Once
So how do you optimize your Redshift spend (and decrease Redshift cost) while reducing the risk of bringing your pipeline to a screeching halt by under-powering your cluster? Measure, measure, measure. (Ok, that’s measuring three times, but measurement really is that important).
Since we now understand how adding a node impacts cluster performance, we can look at the specific metrics you should measure when considering whether you can remove nodes from your Redshift cluster. These are:
- Current and historic disk utilization
- WLM concurrency and queue wait-times
- Percentage of disk-based queries
- Copy and Unload times
In the following section we’ll walk through an example analysis of these metrics for one of our own Redshift clusters to see if we can remove some nodes to save money.
Case Study: How We Reduced Our Redshift Cost by Removing Nodes Without Impacting Performance
The cluster we’re going to analyze has seven dc2.large nodes, and we want to see if we can remove some nodes from it.
1. Disk Utilization
Our first step is to see if we need the nodes simply to store our data. Since Amazon Redshift’s disk, memory, and CPU all scale together (in units of nodes), we can’t remove a node if we need that node for data storage. In that case, we should consider other solutions to reduce disk usage so that we can remove a node. For example, changing compression encodings or pruning our tables – for more detail, see our top 15 performance tuning techniques.
We start by looking at the historical disk utilization of the cluster. You can view this in the AWS Redshift dashboard under PercentageDiskSpaceUsed, or in your Intermix dashboard under Storage Analysis. In this case, we’re looking at the last day and a half of disk utilization, which we know represents the current typical workload for this cluster (you may want to look at a longer period, depending on what you consider representative):
We can see that our cluster has used between 30% and 58% of its disk space during this period. Since our cluster has seven dc2.large nodes, the total cluster storage is 1.12 TB (160 GB * 7). So we’re using between 336 GB (30%) and 650 GB (58%) of storage at any given time.
This means that if we want to keep our disk utilization no higher than 70-80% (a good rule of thumb to leave headroom for query execution), we can reduce our cluster from seven nodes to five, for a total storage of 5 * 160 GB = 800 GB. Our peak disk usage of 650 GB would then be approximately 81%, which is acceptable since the peak is short-lived.
Note that these calculations are all approximate, since your cluster may use a different amount of disk for disk-based queries after the resize, so it is important to look at the characteristics of your disk-based queries too (see below). However, it gives a good starting point for estimating the smallest number of nodes for your cluster.
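The sizing rule above can be sketched as a small helper: find the smallest node count that keeps peak disk usage under a target utilization. The numbers below match this cluster's dc2.large nodes and our ~81% acceptable peak; adjust them for your own workload:

```python
# Smallest node count keeping peak disk usage under a utilization cap.
# 160 GB/node is the dc2.large figure; 0.82 allows our short-lived ~81% peak.
def min_nodes_for_storage(peak_usage_gb, per_node_gb=160, max_utilization=0.82):
    nodes = 1
    while peak_usage_gb / (nodes * per_node_gb) > max_utilization:
        nodes += 1
    return nodes

# Our peak of ~650 GB:
print(min_nodes_for_storage(650))  # -> 5 (650 GB / 800 GB ~= 81% utilization)
```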
So now that we know we may need only five nodes to support our typical data storage needs, let’s move on to our WLM concurrency.
2. WLM Concurrency
The next step is to look at the query concurrency in your cluster to see if the cluster is operating without significant queue wait time, i.e. that queries aren’t sitting around waiting for cluster resources to become available. In your Intermix.io dashboard, you can find this under Throughput Analysis:
This example shows that our four WLM queues are operating with very little queue wait time (a peak of around 0.15 sec, well below what is acceptable for the workload on this cluster). So as long as we can keep enough memory in each queue after the resize, we should be OK. Which brings us to…
3. Percentage of Disk-Based Queries
Next, we check whether the cluster has a large percentage of disk-based queries. To do that, we look at the Memory Analysis page for our cluster in the Intermix dashboard:
The left charts show the total amount of memory consumed by the queries running in each WLM queue at any given time. The charts on the right show the number of disk-based queries. By looking at the breakdown of memory usage per queue, we can make the following observations:
- No queues have more than 1-2% disk-based queries. We should typically try to keep % disk-based queries < 10%, so we have plenty of headroom.
- The non-disk-based queries in Queues 1 and 3 typically use 30-80 MB of memory each, so we have some headroom in the amount of memory allocated per slot (87 MB for Queue 1 and 137 MB for Queue 3). This means that while we expect our percentage of disk-based queries to increase after removing two nodes, reducing the total memory by 28% (2 out of 7 nodes) shouldn’t bring the cluster to its knees (i.e. we’re not running right at the limit of our memory per slot).
- Most of the disk-based queries in Queue 3 are due to a few very large (8 GB) queries. Changing the total memory of the cluster by removing nodes won’t affect this.
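One way to sanity-check that headroom: scale the current per-slot memory by the node reduction and compare it against typical query memory use. A minimal sketch, using the figures from the observations above:

```python
# Per-slot memory scales with total cluster memory, which scales with nodes
# (assuming the WLM slot counts are left unchanged after the resize).
def scaled_slot_memory(current_mb, old_nodes, new_nodes):
    """Approximate per-slot memory after a resize."""
    return current_mb * new_nodes / old_nodes

# Queue 1's 87 MB/slot after going from 7 nodes to 5:
new_mb = scaled_slot_memory(87, 7, 5)
print(round(new_mb, 1))  # -> 62.1 MB, still above much of the 30-80 MB range
```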
4. COPY and UNLOAD times
The final metric to look at is the COPY and UNLOAD times, since reducing nodes will reduce the number of slices available for parallel data transfers into and out of the cluster. What we want to look for here is that our COPYs and UNLOADs are happening fast enough such that reducing the available parallelism by ~30% isn’t likely to have an impact. In our case, we look at all COPYs and UNLOADs for the cluster, and compare the average time to our internal requirement for the workflows running on this cluster. For this cluster, we need COPYs and UNLOADs to finish in less than 5 minutes on average:
Since COPYs and UNLOADs are both finishing in around 6 seconds, removing 2 of the 7 nodes shouldn’t cause a performance problem for transfers.
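A rough way to estimate that impact: since S3 transfers parallelize across slices (2 per dc2.large node, an assumption to verify for your node type), transfer time scales approximately inversely with slice count:

```python
# Transfer time scales roughly inversely with slice count.
# 2 slices per dc2.large node is an assumed figure -- verify for your node type.
def estimated_transfer_secs(current_secs, old_nodes, new_nodes, slices_per_node=2):
    old_slices = old_nodes * slices_per_node
    new_slices = new_nodes * slices_per_node
    return current_secs * old_slices / new_slices

# 6-second transfers on 7 nodes, after dropping to 5:
print(estimated_transfer_secs(6, 7, 5))  # -> 8.4 sec, far below the 5-minute limit
```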
5. Pulling the Trigger
After studying the metrics described above, we concluded that we could remove two nodes from our cluster without impacting our workloads. And lo and behold: we did the resize, re-tuned our WLMs, and re-reviewed our Storage, Throughput, and Memory metrics to verify that all was well. And it was.
We reduced the cost of operating this cluster by 28% with no negative impact on our pipeline performance!
Reducing Your Operating Costs with intermix
When it comes to making infrastructure changes that could impact mission-critical data pipelines, the more information you have up-front the more secure you’ll feel. And the more secure you feel, the quicker you can iterate. If you want to learn more about your Redshift cluster and how you can improve it, sign up now for a free Intermix trial.