Background

Update 09/10/2019: AWS released Priority Queuing this week. We’re in the process of testing this new feature and will update this post with our results soon.

Today’s post is a bit long, but for good reason: the Amazon Redshift team recently introduced a new feature related to one of the most important Redshift management tools, the WLM. So bear with us; there are some interesting WLM insights ahead!

One of the key things to get right when optimizing your Redshift Cluster is its WLM (Workload Management) configuration.  We’ve talked a lot about different aspects of WLM (e.g. in our WLM tuning post or our SQA post) since getting your WLM configuration right can mean the difference between your users having their queries run immediately versus having your users wait minutes or even hours before their queries even start executing.

As a reminder, Redshift’s Workload Manager allows you to define one or more queues for your clusters’ SQL queries, and to define the resources (e.g. memory) and rules (e.g. timeouts) that should apply to queries that run in those queues.  The primary goals of the WLM are to allow you to maximize your query throughput and prioritize different types of workloads.

Manual WLM Tuning

Because cluster resources are finite, configuring your WLM always results in a tradeoff between cluster resources and query concurrency: the more concurrent queries you let run in a queue (slots), the fewer resources (like memory and CPU) each query can be given. Memory is by far the most precious resource to consider when tuning WLM. Queries that need more memory than they are allocated spill over to disk, causing huge slowdowns in performance not only for the query that went disk-based, but for the cluster as a whole (since long-running queries take up memory and a concurrency slot, and disk-based queries consume disk IO).  AWS recommends keeping your percentage of disk-based queries under 10%, but in practice most Redshift administrators can (and should) keep it much lower.

Fortunately, finding the optimal tuning for your WLM is pretty straightforward – if you’re using intermix.io you can use our Throughput Analysis and Memory Analysis tools to quickly view your clusters’ concurrency and memory usage in each WLM queue, and see at a glance which users and applications are experiencing unacceptable queuing:

You can then adjust concurrency and/or memory in the AWS console for your cluster: give more memory to queues that have a large number of disk-based queries, or increase the number of slots in queues that have significant queuing.

One more trick

One of the limitations of Redshift’s WLM is that the total memory assigned to a queue is divided equally between all query slots (not queries) in the queue.  That means that if you, say, allocate 1 GB of memory to a queue with 10 slots, each query that runs in the queue will get 1 GB / 10 = 100 MB of memory, even if it’s the only query running in that queue.  So small queries that need less than 100 MB waste the extra memory in their slot, and large queries that need more than 100 MB spill to disk, even if 9 of the 10 slots (900 MB) are sitting idle waiting for a query.  Clearly this isn’t optimal.

One workaround is to use the Redshift session parameter wlm_query_slot_count to temporarily increase the number of slots that should be given to a query.  If you set this parameter to, say, 2 in your database session before executing your query, then your query will consume 2 WLM concurrency slots (reducing the number of concurrent queries that can run in that queue) and get twice the memory. In the example above, a query that needed 150 MB of memory would spill to disk when running in a single 100 MB slot but run fully in memory when run with 2 slots.
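Here’s a sketch of what that looks like from a database session (the connection details and the query itself are placeholders; any Postgres-compatible driver such as psycopg2 works the same way):

```python
import psycopg2

# Placeholder connection details; point these at your own cluster.
conn = psycopg2.connect(
    host="my-cluster.abc123xyz789.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="********",
)

with conn.cursor() as cur:
    # Claim two slots (and two slots' worth of memory) for this session.
    cur.execute("SET wlm_query_slot_count TO 2;")
    # Run the memory-hungry query (a placeholder query here).
    cur.execute("SELECT channel, COUNT(*) FROM big_events_table GROUP BY channel;")
    rows = cur.fetchall()
    # Return the session to the default of one slot per query.
    cur.execute("SET wlm_query_slot_count TO 1;")

conn.close()
```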

This is a great way to allocate more memory to a big query when the following are true:

  1. The query is a repeated (not one-off) query, so you can look at past statistics to predict how much memory (i.e. how many slots) it will need to avoid going disk-based.
  2. The query runs in a queue with other queries that can afford an increase in queue wait time.
  3. The query uses much more memory than other queries in its queue, so increasing the memory for the whole queue would be too wasteful.
  4. You can’t (or don’t want to) spend time optimizing the query or your table definitions to reduce the amount of memory it needs.

While wlm_query_slot_count can be a good solution for targeting individual memory-hungry queries on an ad-hoc basis, it is difficult to use as a general, ongoing way to reduce disk-based queries cluster-wide, since each query requires a different setting and it is hard to know in real time how many slots a particular query needs.  Further, it is hard to know in a general way what impact assigning more slots to a query will have on queue wait times.

Auto WLM

So where does this leave us?  Detailed cluster monitoring lets you tune your concurrency and memory WLM settings to minimize both queue wait time and the % of disk-based queries you have. But since every slot in a queue is given the same fixed fraction of queue memory, inevitably some memory-hungry queries will end up spilling to disk causing query and cluster slowdowns. Using wlm_query_slot_count lets you target some of those individual disk-based queries to try to prevent them from spilling to disk, but makes it difficult to optimize per-query memory allocation in a more general way cluster-wide.

This is exactly the problem that Automatic WLM was introduced to solve. The key innovation of Auto WLM is that it assigns memory to each query dynamically, based on its determination of how much memory the query will need.  It’s a little bit like having wlm_query_slot_count tuned for you automatically for each query that runs on your cluster.  As a result, memory-hungry queries can be given up to the total amount of memory available to avoid them going disk-based.  But nothing is free: giving more memory to memory-hungry queries means that the cluster can run fewer queries concurrently, resulting in more queuing overall.

So if you take away one thing from this post, it’s this: enabling Auto WLM will speed up slow, memory-intensive queries by preventing them from going to disk, but slow down smaller queries by introducing more queue wait time.  The degree to which this will impact your cluster performance will depend on your specific workloads and your priorities.

A Real-world example

intermix.io not only helps our customers keep their Redshift clusters operating at peak efficiency and their costs down; it also helps us do the same for our own internal Redshift clusters. So to see the impact of Automatic WLM, we first enabled Auto WLM on one of our non-production internal Redshift clusters and then used intermix.io to see how our cluster efficiency was impacted.

Manual WLM

The first cluster we enabled it on was one of our development Redshift clusters.  This cluster runs a batch ETL pipeline, and prior to enabling Auto WLM had a well-tuned WLM with minimal queue time but some large, slow, disk-based queries.  Here is a chart of average execution time (light blue), average queue wait time (dark blue), and query count (green line) for a few days before we made the change:

So our average execution time is 5.57 seconds, and average queue time is 0.88 seconds.  Looking at the same chart with Maximum selected, we see the queries that take the longest to run:

So while the average queue wait time and execution time is well below the data SLAs we need for this cluster, we have some queries running longer than 60 minutes–there is clearly room for improvement!

Auto WLM

After enabling Automatic WLM on August 2nd, we saw a drop in average execution time by about half but a significant spike in average queue wait time, from under 1 second to over 10 seconds.  The net result was a significant net increase in average query latency, even though there is a drop in average execution time:

The drop in average execution time is due to the big reduction in execution times for slow, disk-based queries, as shown in this chart of latencies for disk-based queries:

So Automatic WLM reduced our max query runtime from around 50 minutes to around 10 minutes, a 5x improvement!

For this cluster, which runs a consistent set of batch-processing ETL jobs (or “ELT”) and few ad-hoc queries, this net increase in average latency is a good tradeoff to get a big improvement in query runtimes for our slowest disk-based queries.

So far so good.  Or so we thought…

Another Real-world example

Emboldened by our initial test, we enabled Auto WLM on five additional Redshift clusters. As with our first cluster, these five clusters had manually tuned WLMs and were operating well within our data SLAs.  These clusters were significantly larger than our first test cluster (in terms of nodes, query volume, and data stored). All of them ran batch ETL jobs similar to the first cluster, plus a small percentage of ad-hoc queries.

Four of the five clusters showed a similar trend to our initial test, though we observed more modest improvements (since their maximum query runtimes were smaller: 10 minutes or less, compared to 50 minutes in our initial test).  However, the fifth cluster immediately started setting off alarms because it was exceeding one of our data SLAs. The root cause was that one particular set of pipeline queries (a combination of four COPYs) was now exceeding its data SLA, a summed maximum runtime of 5 minutes, due to excessive queueing.  With our manually tuned WLM, each of these queries took at most 30 seconds to execute, whereas with Auto WLM they were now taking as much as 4 minutes each due to excessive queueing:

Since there are no parameters to tune with Auto WLM, we had no choice but to revert the WLM mode back to Manual, which rapidly got the queries back under their SLA requirement and our pipeline running smoothly.

Automatic WLM Advantages and Disadvantages

In summary, Auto WLM has the following advantages over Manual WLM:

  1. Long-running disk-based queries can be given more memory dynamically, preventing them from going to disk and improving both their performance and overall cluster performance.
  2. You can Set It and Forget It (though since cluster workloads tend to evolve gradually over time, a Manual WLM also doesn’t need to be changed very often once tuned).

Auto WLM has the following disadvantages over Manual WLM:

  1. You cannot prioritize workloads to ensure your data SLAs are met.
  2. Queries will experience longer latencies on average; in particular, the performance of short ad-hoc queries will likely be impacted.
  3. In times of increased load, or as your workloads evolve, the only way to improve cluster performance will be to add nodes to your cluster (via resizing or concurrency scaling clusters).
  4. It’s a bit of a black box: Redshift will decide in an opaque way which of your users’ queries and workloads to prioritize.

So should I enable Auto WLM?

We’re still in the early days of Automatic WLM and it’s likely that the AWS Redshift team will continuously improve its tuning algorithms. But for the moment we can make the following broad recommendations around enabling Auto WLM:

  1. Establish a baseline: First and foremost, before you enable Auto WLM you should optimally tune your WLM manually so you can compare apples-to-apples and know whether Automatic WLM is the best option to keep your users happy and cluster costs down (if you want help doing this, sign up for a free trial of intermix.io). Without this step, you won’t know if you could be running with fewer nodes or if your cluster performance problems are due to just one or two problematic queries.
  2. Evaluate your workloads and priorities: Enable Auto WLM if your workloads are such that you can tolerate additional queuing for all queries. For example, if your cluster is primarily processing long-running batch ETL jobs, then an additional minute of queuing per query may not matter. But if your cluster is primarily handling Looker dashboards or ad-hoc queries, your users may not appreciate a 30-60 second increase in query latency for queries that otherwise would execute in a few seconds. Since most clusters run a mix of ETL and BI queries, you’ll need to measure your query runtimes for each workload before and after enabling Auto WLM to know how your users are affected by the change.
  3. Monitor your data SLAs. With no knobs to tune other than the number of nodes in your cluster, your data SLAs will be up to Redshift. Be sure you monitor your data pipelines and ensure that Auto WLM is giving you the performance you need. If not (and you don’t want to scale your cluster), prioritizing your workloads using Manual WLM may be a better option.
  4. Try it out. As long as you can measure your cluster performance, you can try it out to see if it meets your users’ needs. However, if optimizing for cost is important, we also recommend that you periodically compare your cluster’s performance with Auto WLM against an optimized Manual WLM to see which lets you run with fewer nodes.

As always, the most important thing to do is to measure your Redshift cluster performance quantitatively. It’s the only way to know if Automatic WLM is helping or hurting, and whether just optimizing the most problematic queries or adjusting your Manual WLM is a better option.

In our case, we are disabling it for our initial test cluster since that cluster is used by our developers for ad-hoc queries. Their feedback was that they could tolerate the long execution times of a small percentage of ETL jobs in exchange for faster interactive ad-hoc queries. We are, however, keeping it enabled for four of the five clusters discussed above for the time being. But since our workloads continuously evolve as more data is added and, most importantly, as we optimize and modify our SQL queries, we will periodically revert to Manual WLM whenever we review our cluster costs (and before adding nodes) to see if optimal manual tuning will let us save money by running our clusters with fewer nodes.

Properly managing storage utilization is critical to performance and optimizing the cost of your Amazon Redshift cluster. We’ve talked before about how important it is to keep an eye on your disk-based queries, and in this post we’ll discuss in more detail the ways in which Amazon Redshift uses the disk when executing queries, and what this means for query performance.

Amazon Redshift uses storage in two ways during query execution:

  1. Disk-based Queries. When a query runs out of memory, the overflow “spills” to the disk and the query goes “disk-based”.
  2. Intermediate Storage. When a query needs to save the results of an intermediate operation to use as input for a future operation.

Excessive storage use impacts your cluster because:

  1. Query performance suffers, because disk is 100 times slower than memory.
  2. Other queries will be slower, because increased I/O impacts the commit queue which is a shared resource.
  3. The cluster might hit 100% disk utilization, causing queries to fail or requiring the provisioning of more nodes.

The worst case is (3), when the cluster fills up. It can happen for all nodes in a cluster at once, or start with just one node and then propagate through the entire cluster, as the image below shows.

[Chart: disk utilization by node]

So let’s look into what we can do to fix and prevent this type of situation. 

The WLM and Disk-Based Queries

If you’re not already familiar with how Redshift allocates memory for queries, you should first read through our article on configuring your WLM.

The gist is that Redshift allows you to set the amount of memory that every query should have available when it runs. This value is defined by allocating a percentage of memory to each WLM queue, which is then split evenly among the number of concurrency slots you define.  When a query executes, it is allocated the resulting amount of memory, regardless of whether it needs more (or less). Hence allocating too much memory is wasteful (since each node in the cluster obviously has finite memory), whereas allocating too little memory can cause queries to spill to disk.

Queries which overflow their allocated WLM memory are “disk-based”. These queries usually suffer from significantly degraded performance since disk I/O is orders of magnitude slower than memory I/O. 

There are six types of internal operations that Redshift can spill to disk when executing a query:

  1. Aggregations
  2. Hashing for joins
  3. Saving intermediate rows for future query steps
  4. Sorting
  5. Removing duplicates from intermediate or final results (unique)
  6. Window functions

If any of these operations are processing more rows (i.e. more bytes) than will fit into allocated memory, Redshift has to start swapping data out to disk, resulting in a significant slowdown of the query.

AWS recommends that you keep the percentage of disk-based queries to under 10%.  On our own fleet of clusters, we’re usually running well under one percent:

[Chart: percentage of disk-based queries across our fleet]
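If you want to spot-check this number on your own cluster, here is one rough way to do it (a sketch, not an official formula: it counts a query as disk-based if any of its steps spilled, using the is_diskbased flag in SVL_QUERY_SUMMARY; the connection details are placeholders):

```python
import psycopg2

PCT_DISK_BASED = """
    SELECT
        COUNT(DISTINCT CASE WHEN s.is_diskbased = 't' THEN s.query END)::float
          / COUNT(DISTINCT s.query) * 100 AS pct_disk_based
    FROM svl_query_summary s
    JOIN stl_query q ON q.query = s.query
    WHERE q.starttime > DATEADD(day, -7, GETDATE());
"""

conn = psycopg2.connect(
    host="my-cluster.abc123xyz789.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="analytics", user="admin_user", password="********",
)
with conn.cursor() as cur:
    cur.execute(PCT_DISK_BASED)
    print("disk-based queries over the last 7 days: %.2f%%" % cur.fetchone()[0])
conn.close()
```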

How to Prevent Queries From Going Disk-based

  1. Allocate more memory to the query. There are two approaches to this: give the queue it runs in more memory (or fewer slots, so each slot gets a larger share), or temporarily claim extra slots for the query with wlm_query_slot_count, as described earlier.

Within the intermix.io dashboard, viewing Recommendations for an individual query will surface the exact tables it touched and how to update them, as well as how much memory the query used and the memory capacity of its WLM queue.


Intermediate Storage

Redshift also uses the disks in each node for another type of temporary query data called “Intermediate Storage”, which is conceptually unrelated to the temporary storage used when disk-based queries spill over their memory allocation.  

Intermediate Storage is used when Redshift saves the results of an intermediate operation to disk to use as input for a future operation. Intermediate Storage can become important if your query stores a large amount of data between query operations, since that storage may cause your cluster to run out of disk space. It also introduces additional I/O, which can lead to slower execution times.

Ways to Limit the Amount of Intermediate Storage Used

Since intermediate storage is used to carry results from one part of the query execution to another, the best way to reduce intermediate storage is to use predicates (e.g. WHERE clauses, JOIN … ON clauses, etc) on intermediate steps of your query (subqueries, CTEs, etc) to ensure that you are not carrying unnecessary data through your query processing.

For example, consider this query which joins on the results of two CTEs:

This query could be re-written as follows to limit the amount of data brought forth into the JOIN.

We have a great new feature called Query Recommendations that proactively lets you know if your query is using a significant amount of intermediate storage.


Impact to Cluster Storage Utilization

The amount of intermediate storage a query uses is important when diagnosing spikes in cluster storage utilization.

For example, let’s assume you see your cluster storage spiking over some time period, but don’t see a corresponding increase in data transferred (via the COPY command) into your cluster. You could search for all queries which have a large Memory to Disk value to identify which queries contributed to your cluster running out of disk space.

Here’s a real-world example. The following chart shows the actual disk space used in a cluster, over a two-week period, broken down by schema. This particular chart shows consistent storage utilization over time, with small variation.

[Chart: disk space used by schema]

The chart of % disk utilization tells a different story. On the same cluster, over the same period, the disk utilization hits 100% quite frequently. This is caused by some queries using an extraordinary amount of intermediate storage.

[Chart: % disk utilization]

One of the cool features we recently released, Cluster Recommendations, will surface queries with high disk utilization immediately. For this cluster, it appears that queries were using close to two terabytes of disk for intermediate results!


Conclusion

Monitoring both “Disk-based Queries” and “Intermediate Storage” is crucial to keeping your cluster healthy.  Keeping on top of this temporary disk utilization prevents your Amazon Redshift disks from filling up due to misbehaved queries, which would otherwise result in queries being killed and your users being interrupted.

Summary

The Amazon Redshift architecture allows you to scale up by adding more nodes to a cluster, which can lead to over-provisioning nodes to cover peak query volume. Unlike adding nodes, Concurrency Scaling adds more query processing power on an as-needed basis.

Concurrency Scaling gives Amazon Redshift clusters additional capacity to handle bursts in query load. It works by off-loading queries to new, “parallel” clusters in the background. Queries are routed based on your WLM configuration and rules.

Pricing for Concurrency Scaling is based on a credit model, with a free usage tier and beyond that a charge based on the time that a Concurrency Scaling cluster runs queries.

We tested Concurrency Scaling for one of our internal clusters. In this post, we present the results of our usage of the feature and information on how to get started.


Cluster Requirements

To be eligible for concurrency scaling, an Amazon Redshift cluster must meet the following three requirements:

Eligible Query Types

Concurrency scaling does not work on all query types. For the first release, it handles read-only queries that meet three conditions:

To be routed to a Concurrency Scaling cluster, a query needs to encounter queueing. Also, queries eligible for the SQA (Short Query Acceleration) queue will not run on Concurrency Scaling clusters.

Queuing and SQA are a function of a properly set up Redshift workload management (WLM) configuration. We recommend first optimizing your WLM because it will reduce the need for Concurrency Scaling. And that matters because Concurrency Scaling comes free only up to a certain number of hours. In fact, AWS claims that Concurrency Scaling will be free for 97% of customers, which takes us to pricing.

We’ve also tested enabling Redshift’s automatic WLM and captured our experience with it in this blog post, “Should I Enable Amazon Redshift’s Automatic WLM?”


Concurrency Scaling Pricing

For Concurrency Scaling, AWS has introduced a credit model. Each active Amazon Redshift cluster earns credits on an hourly basis, up to one hour of free Concurrency Scaling credits per day.

You only pay when your use of the concurrency scaling clusters exceeds the credits you’ve accrued.

The fee equals the per-second on-demand rate for a transient cluster used in excess of the free credits – only while it’s serving your queries – with a one-minute minimum charge each time a Concurrency Scaling cluster is activated. The per-second on-demand rate is based on general Amazon Redshift pricing, i.e. the node type and number of nodes in your cluster.
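To make that billing model concrete, here is a rough back-of-the-envelope sketch (the node count and hourly rate are just examples, and it simplifies by applying the one-minute minimum once rather than per activation):

```python
def concurrency_scaling_cost(nodes, hourly_rate_per_node, busy_seconds, free_credit_seconds):
    """Rough estimate of the charge described above: per-second billing at the
    cluster's on-demand rate, only while the transient cluster serves queries,
    with a one-minute minimum and free credits applied first."""
    per_second_rate = nodes * hourly_rate_per_node / 3600.0
    billable = max(busy_seconds, 60)                   # one-minute minimum
    billable = max(billable - free_credit_seconds, 0)  # burn free credits first
    return billable * per_second_rate

# Example: a 4-node dc2.large cluster ($0.25/node/hour) that kept a Concurrency
# Scaling cluster busy for 90 minutes in a day, with 1 hour of free credits accrued.
print(concurrency_scaling_cost(4, 0.25, 90 * 60, 3600))  # -> 0.5
```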

Enabling Concurrency Scaling

Concurrency scaling is enabled on a per-WLM queue basis. Go to the AWS Redshift Console and click on “Workload Management” from the left-side navigation menu. Select your cluster’s WLM parameter group from the subsequent pull-down menu.

You should see a new column called “Concurrency Scaling Mode” next to each queue. The default is ‘off’. Click ‘Edit’ and you’ll be able to modify the settings for each queue.

Configuration

Concurrency scaling works by routing eligible queries to new, dedicated clusters. The new clusters have the same size (node type and number) as the main cluster.  

The number of clusters used for concurrency scaling defaults to one (1), with the option to configure up to ten (10) total clusters.

The total number of clusters that should be used for concurrency scaling can be set by the parameter max_concurrency_scaling_clusters.  Increasing the value of this parameter provisions additional standby clusters.
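You can change the value in the console, or with a short boto3 call along these lines (the parameter group name is a placeholder; the change applies to every cluster using that parameter group):

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# "my-wlm-parameter-group" is a placeholder; use the parameter group attached to your cluster.
redshift.modify_cluster_parameter_group(
    ParameterGroupName="my-wlm-parameter-group",
    Parameters=[
        {"ParameterName": "max_concurrency_scaling_clusters", "ParameterValue": "3"},
    ],
)
```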

Monitoring

There are a few additional charts in the AWS Redshift console. There is a chart called “Max Configured Concurrency Scaling Clusters” which plots the value of max_concurrency_scaling_clusters over time.

The number of Active Scaling clusters is also shown in the UI under Concurrency Scaling Activity:

The Queries tab in the UI also has a column to show if the query ran on the Main cluster or on the Concurrency Scaling cluster:

Whether a particular query ran on the main cluster or via a Concurrency Scaling cluster is stored in stl_query.concurrency_scaling_status.  

A value of 1 means the query ran on a Concurrency Scaling cluster, and other values mean it ran on the main cluster.

Example:
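Here is a sketch of that kind of check (connection details are placeholders and the 24-hour window is arbitrary):

```python
import psycopg2

RAN_ON = """
    SELECT
        CASE WHEN concurrency_scaling_status = 1
             THEN 'concurrency scaling cluster'
             ELSE 'main cluster' END AS ran_on,
        COUNT(*) AS queries,
        AVG(DATEDIFF(seconds, starttime, endtime)) AS avg_runtime_s
    FROM stl_query
    WHERE starttime > DATEADD(hour, -24, GETDATE())
    GROUP BY 1;
"""

conn = psycopg2.connect(
    host="my-cluster.abc123xyz789.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="analytics", user="admin_user", password="********",
)
with conn.cursor() as cur:
    cur.execute(RAN_ON)
    for ran_on, queries, avg_runtime_s in cur.fetchall():
        print(ran_on, queries, avg_runtime_s)
conn.close()
```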


Concurrency Scaling info is also stored in some other tables/views, e.g. SVCS_CONCURRENCY_SCALING_USAGE.  There’s a list of catalog tables that store concurrency scaling information.


Our Results

We enabled Concurrency Scaling for a single queue on an internal cluster at approximately 2019-03-29 18:30:00 GMT. We changed the max_concurrency_scaling_clusters parameter to 3 at approximately 2019-03-29 20:30:00.

To simulate query queuing, we lowered the number of slots for the queue from 15 to 5.

Below is a chart from the intermix.io dashboard, showing the running versus queuing queries for this queue, after cranking down the number of slots.

We observed that the queueing time for queries went up, maxing out at just over 5 minutes.

Here’s the corresponding summary in the AWS console of what happened during that time:


Redshift spun up three (3) concurrency scaling clusters as requested. It appears that these clusters were not fully utilized, even though our cluster had many queries that were queuing.

The usage chart correlates closely with the scaling activity chart:

After a few hours, we checked and it looked like 6 queries ran with concurrency scaling.  We also spot-checked two queries against the UI. We haven’t checked how this value may be used if multiple concurrency clusters are active.

Conclusions

Concurrency Scaling may mitigate queue times during bursts in queries.

From this basic test, it appears that a portion of our query load improved as a result. However, simply enabling Concurrency Scaling didn’t fix all of our concurrency problems.  The limited impact is likely due to the limitations on the types of queries that can use Concurrency Scaling. For example, we have a lot of tables with interleaved sort keys, and much of our workload is writes.

While concurrency scaling doesn’t appear to be a silver bullet solution for WLM tuning in all cases, using the feature is easy and transparent.

We do recommend enabling the feature on your WLM queues. Start with a single concurrency cluster, and monitor the peak load via the console to determine whether the new clusters are being fully utilized.

As AWS adds support for additional query/table types, Concurrency Scaling should become more and more effective.

At intermix.io, we run a fleet of over ten Amazon Redshift clusters. In this post, I’ll describe how we use AWS IAM authentication for our database users. By using IAM credentials, we can enforce security policies on users and passwords. It’s a scalable and secure way to manage access to your cluster(s).

The approach we describe is useful in environments where security is a top concern. Examples are industries with regulatory requirements such as Finance or Healthcare. Other use cases include enterprises with strict IT security compliance requirements.

————

Secure ways to manage access to Amazon Redshift

Amazon Redshift started as a data warehouse in the cloud. A frequent initial use case was business intelligence. Over the years though, the use cases for Amazon Redshift have evolved. Redshift is still used for reporting. But it’s now also part of mission-critical data services and applications. intermix.io is an example – we built our own application on top of Redshift. And so it’s critical to track and manage user access to a cluster.

The common way for users to log on to Amazon Redshift is by providing a database username and password. This is because Amazon Redshift is based on PostgreSQL.

But using a username / password combination has a few drawbacks.  The biggest one is that there is no way to enforce security best practices for password changes.

This may not be an issue when you are in a small company where some 2-3 people have access to your cluster. But that’s different for enterprises with a large pool of users. It’s also different when you’re a small company where headcount is growing fast. It’s easy to lose track of everybody who has access to your most valuable data sets.

Luckily, AWS has recently developed alternatives to using a username and password. There are three options for maintaining login credentials for your Amazon Redshift database:

  1. Permit users to create user credentials and log in with their IAM credentials.
  2. Permit users to log in with federated single sign-on (SSO) through a SAML 2.0-compliant identity provider.
  3. Generate temporary database credentials. Permissions are granted through an AWS Identity and Access Management (IAM) permissions policy. By default, these credentials expire after 15 minutes but you can configure them to expire up to an hour after creation.

This post will discuss #3: using IAM credentials to generate expiring database credentials.

For programmatic generation of temporary or auto-expiring user credentials, Amazon Redshift provides the get-cluster-credentials command for the AWS Command Line Interface (AWS CLI). The CLI uses the API endpoint GetClusterCredentials, which you can also call directly. GetClusterCredentials returns a database user name and a temporary password that grants access for a limited time.


How intermix.io uses IAM to Generate Temporary Passwords

One of my first tasks on joining intermix was to update the way we enforce credential rotation for our clusters.

The old way – a manual process

Handling credentials used to be a manual process, and that’s a pain in the neck. We’re already using the AWS Secrets Manager. It could have been a pretty trivial exercise to add auto-rotation to our secrets and then trigger ALTER <user> queries to update them with new credentials. That would have been my initial approach. But one of my new colleagues pointed out the option of using IAM. Redshift allows you to get time-scoped IAM credentials associated with a role within Redshift itself.

The new way – using IAM to generate expiring passwords

That turned into a somewhat larger undertaking. First I had to understand how we were using Redshift across our platform. With a reasonable idea of how the change would work, I had to change our Redshift connection logic to pull credentials before connecting. That turned out to be a pretty easy change. We’re a Python shop and Boto3 – the AWS SDK for Python – is exhaustive. Boto enables Python developers to create, configure, and manage AWS services.

We deployed the change into one of our testing environments. Everything went well until we hit the rate limit for the Redshift API: we were making too many requests to GetClusterCredentials. We have a lot of in-flight transformations that all connect to Redshift, and making those calls alongside each connect exhausted the rate limit. But that wasn’t insurmountable. It was pretty easy to add a caching mechanism to our connection logic so that we didn’t need to generate a new credential every time, as sketched below.
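A minimal sketch of that caching idea (the cluster identifier, database, and user are placeholders; real production code should also handle per-cluster cache keys and concurrent refreshes):

```python
import datetime

import boto3

redshift = boto3.client("redshift", region_name="us-east-1")
_cache = {}  # (cluster_id, db_user) -> GetClusterCredentials response

def get_credentials(cluster_id="my-cluster", db_user="etl_user", db_name="analytics"):
    """Return temporary credentials, refreshing only when close to expiry."""
    key = (cluster_id, db_user)
    creds = _cache.get(key)
    now = datetime.datetime.now(datetime.timezone.utc)
    # Only call GetClusterCredentials when we have no credential or it is about to expire.
    if creds is None or creds["Expiration"] - now < datetime.timedelta(minutes=1):
        creds = redshift.get_cluster_credentials(
            ClusterIdentifier=cluster_id,
            DbName=db_name,
            DbUser=db_user,
            DurationSeconds=3600,
            AutoCreate=False,
        )
        _cache[key] = creds
    return creds["DbUser"], creds["DbPassword"]
```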

Once we got the caching mechanism deployed, we were able to disable password logins. Access to the cluster was now only available through IAM credentials.

That left us with an awkward situation, though. Our own developers have to connect to our clusters to run queries or test new features. They needed a way to generate IAM credentials and connect to a remote cluster.

We already understood how to generate IAM credentials and had code that handled that. We then solved our need by creating a task in our execution and deployment framework. The task connects to a Redshift cluster in any of our environments using IAM credentials. You can see it in use below!

Generate IAM Credentials

The code we used to do this isn’t actually that important because it’s made possible by the Redshift API and some IAM policies. So you can do this yourself, even without seeing what I’ve done. You need an AWS client for your preferred programming language, and an IAM policy profile granting your users access to get IAM credentials on your clusters.


Getting started

The first thing you’re going to want to do is to create an IAM policy with the Redshift Actions needed to create an IAM credential on a Redshift cluster. You can set this to be as restricted as you’d like, but for the purposes of this blog post we’re not going to restrict it at all.
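For illustration, here is roughly what that could look like with boto3 (the policy name is a placeholder, and in line with the note above the Resource is deliberately left unrestricted; in a real deployment you would scope it to specific cluster, dbuser and dbname ARNs):

```python
import json

import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "redshift:GetClusterCredentials",  # generate temporary DB credentials
                "redshift:CreateClusterUser",      # only needed if you use AutoCreate
                "redshift:JoinGroup",              # only needed if you pass DbGroups
            ],
            "Resource": "*",  # unrestricted for this walkthrough only
        }
    ],
}

iam.create_policy(
    PolicyName="redshift-temporary-credentials",  # placeholder name
    PolicyDocument=json.dumps(policy_document),
)
```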

Once you have that policy created, go ahead and create a group to contain the users to whom you want to grant access (if you’ve already got that group, great). Add the users you’d like to the group, and attach the policy you defined in the previous step under the group’s Permissions tab.

At this point, all users in the group will be able to generate IAM credentials for existing users on your clusters. But using the AWS CLI for this is manual and can be error-prone. My recommendation is to create a utility that generates the credentials and connects to the cluster on behalf of the user. This utility could also generate ODBC or JDBC connection strings if that’s how your users connect to a cluster.

Here’s a quick Python-based example that outputs a JDBC URL. The examples assume that:
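  • an AWS client (here boto3) is installed, and your AWS credentials and region are already configured, e.g. via environment variables or an instance profile;
  • the cluster identifier, database name and database user below are placeholders you replace with your own.

With that in place, a minimal sketch (not the exact utility we run internally) looks like this:

```python
from urllib.parse import quote

import boto3

CLUSTER_ID = "my-cluster"   # placeholders
DB_NAME = "analytics"
DB_USER = "jane.doe"

redshift = boto3.client("redshift", region_name="us-east-1")

# Temporary, auto-expiring credentials for this user.
creds = redshift.get_cluster_credentials(
    ClusterIdentifier=CLUSTER_ID,
    DbName=DB_NAME,
    DbUser=DB_USER,
    DurationSeconds=3600,
    AutoCreate=False,
)

# Look up the cluster endpoint so we can build a connection string.
endpoint = redshift.describe_clusters(ClusterIdentifier=CLUSTER_ID)["Clusters"][0]["Endpoint"]

# The temporary user is prefixed with "IAM:", so URL-encode both fields.
jdbc_url = (
    f"jdbc:redshift://{endpoint['Address']}:{endpoint['Port']}/{DB_NAME}"
    f"?user={quote(creds['DbUser'])}&password={quote(creds['DbPassword'])}"
)
print(jdbc_url)
```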

You’re now in a situation where you have an IAM policy and a group containing your users. You’ve also configured your Redshift users with passwords DISABLED.

In short, you’ve secured your Redshift cluster. Your security team can enforce credential rotation using standard IAM behavior rather than enforcing it on the database itself.

I hope that helps!

For more on best practices when working with Amazon Redshift, read our post on 3 Things to Avoid When Setting Up an Amazon Redshift Cluster. Or download our best-practices guide for more tips on enterprise-grade deployments for Amazon Redshift.

In this post, we’ll lay out the 5 major components of Amazon Redshift’s architecture.

  1. Data applications
  2. Clusters
  3. Leader nodes
  4. Compute nodes
  5. Redshift Spectrum

Understanding the components and how they work is fundamental for building a data platform with Redshift. In the post, we’ll provide tips and references to best practices for each component.

————-


Since launch, Amazon Redshift has found rapid adoption among SMBs and the enterprise. In the early days, business intelligence was the major use case for Redshift.

That has changed.

Today, we still, of course, see companies using BI dashboards like Tableau, Looker and Periscope Data with Redshift. But with rapid adoption, the use cases for Redshift have evolved beyond reporting, and that has come with a major shift in end-user expectations.

The shift in expectations has implications for the work of the database administrator (“DBA”) or data engineer in charge of running an Amazon Redshift cluster. The static world is gone. Today, data sets have become so large and diverse that data teams have to innovate around how to collect, store, process, analyze and share data.

In the case of Amazon Redshift, much of that depends on understanding the underlying architecture and deployment model. It’s what drives the cost, throughput, and efficiency of using Amazon Redshift.

And so in this blog post, we’re taking a closer look at the Amazon Redshift architecture, its components, and how queries flow through those components. We’ll include a few pointers on best practices.


Amazon Redshift Architecture and The Life of a Query

Image 1 shows how Amazon Redshift processes queries across this architecture. We’re excluding Redshift Spectrum in this image as that layer is independent of your Amazon Redshift cluster.  

Image 1: Amazon Redshift Architecture

However, we do recommend using Spectrum from the start as an extension into your S3 data lake. We’ll go deeper into the Spectrum architecture further down in this post. Let’s first take a closer look at the role of each of the five components.

Data applications

In other reference architectures for Redshift, you will often hear the term “SQL client application”. And SQL is certainly the lingua franca of data warehousing.

But with the shift away from reporting to new types of use cases, we prefer to use the term “data apps”. Unlike writing plain SQL in an editor, they imply the use of data engineering techniques, i.e. the use of code/software to work with data.

Data apps run workloads or “jobs” on an Amazon Redshift cluster. There are three generic categories of data apps:

  1. Data integration (“Loads”): This category includes applications that move data from external data sources and systems into Redshift. Examples are Informatica, Stitch Data, Fivetran, Alooma, or ETLeap.
  2. Workflow orchestration (“Transforms”): These are systems that run batch jobs on a predetermined schedule. For example, once data is in a cluster you will still need to filter, clean, join or aggregate data across various sources. Open-source examples of these tools are Apache Airflow, Pinball and Luigi.
  3. Analysis (“Ad-hoc”): These are apps for data science, reporting, and visualization. Examples are Tableau, Jupyter notebooks, Mode Analytics, Looker, Chartio and Periscope Data. Ad-hoc users might also run queries to extract data for downstream consumption, e.g. for a machine learning application or a data API.


What does this mean for the DBA?

The Amazon Redshift architecture is designed to be “greedy”. A query will consume all the resources it can get. To protect workloads from each other, a best practice for Amazon Redshift is to set up workload management (“WLM”). WLM is a key architectural requirement. Setting up your WLM should be a top-level architecture component. We’ve also discussed the pros and cons of turning on automatic WLM in one of our posts.

Cluster

A “cluster” is the core infrastructure component for Redshift, which executes workloads coming from external data apps. There are two key components in a cluster:

  1. Compute Nodes: A cluster contains at least one “compute node”, to store and process data.
  2. Leader Node: Clusters with two or more compute nodes also have a “leader node”. The leader coordinates the distribution of workloads across the compute nodes. A cluster only has one leader node.

In our experience, most companies run multi-cluster environments, also called a “fleet” of clusters. For example, at intermix.io we run a fleet of ten clusters.

What does this mean for the DBA?

It’s easy to spin up a cluster, pump in data and begin performing advanced analytics in under an hour.  That makes it easy to skip some best practices when setting up a new Amazon Redshift cluster. Read more in 3 Things to Avoid When Setting Up an Amazon Redshift Cluster.

Leader Node

The leader node has four major roles:

  1. Communication with data apps: When running workloads on a cluster, data apps interact only with the leader node. The compute nodes are transparent to external data apps.
  2. Distribution of workloads: The leader node parses queries, develops an execution plan, compiles SQL into C++ code and then distributes the compiled code to the compute nodes.
  3. Caching of query results: When a query is executed in Amazon Redshift, both the query and the results are cached in the memory of the leader node, across different user sessions to the same database. When query or underlying data have not changed, the leader node skips distribution to the compute nodes and returns the cached result, for faster response times.
  4. Maintenance of catalog tables: The system catalogs store schema metadata, such as information about tables and columns. System catalog tables have a PG prefix. A query that references only catalog tables or that does not reference any tables, runs exclusively on the leader node.


What does this mean for the DBA?

In some cases, the leader node can become a bottleneck for the cluster. The telltale pattern is an increase in your COMMIT queue stats. For example, larger clusters have more metadata, which requires more processing by the leader node. You can query STL_COMMIT_STATS to determine what portion of a transaction was spent on commit and how much queuing is occurring.
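Here is a sketch of that kind of check (connection details are placeholders; node -1 is the leader-node row for each transaction, and the one-day window is arbitrary):

```python
import psycopg2

COMMIT_STATS = """
    SELECT
        xid,
        DATEDIFF(milliseconds, startqueue, startwork) AS commit_queue_ms,
        DATEDIFF(milliseconds, startwork, endtime)    AS commit_work_ms,
        queuelen
    FROM stl_commit_stats
    WHERE node = -1  -- leader-node row for the transaction
      AND startqueue > DATEADD(day, -1, GETDATE())
    ORDER BY commit_queue_ms DESC
    LIMIT 20;
"""

conn = psycopg2.connect(
    host="my-cluster.abc123xyz789.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="analytics", user="admin_user", password="********",
)
with conn.cursor() as cur:
    cur.execute(COMMIT_STATS)
    for row in cur.fetchall():
        print(row)
conn.close()
```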

Compute Nodes

The compute nodes handle all query processing, in parallel execution (“massively parallel processing”, short “MPP”).

Amazon Redshift provides two categories of nodes: dense compute (DC) nodes, which pair more compute with smaller SSD-backed storage, and dense storage (DS) nodes, which offer large HDD-backed storage capacity.

As your workloads grow, you can increase the compute and storage capacity of a cluster by increasing the number of nodes, upgrading the node type, or both.

Compute nodes are also the basis for Amazon Redshift pricing. You can start with hourly on-demand consumption. Prices for on-demand range from $0.25 (dense compute) to $6.80 per hour (dense storage), with discounts of up to 69% for 3-year commitments.

What does this mean for the DBA?

The execution speed of a query depends a lot on how fast Redshift can access and scan data that’s distributed across nodes. A best practice is to choose the right distribution style for your data by defining distribution keys.

Adding nodes is an easy way to add more processing power. It’s also an easy way to address performance issues – by resizing your cluster and adding more nodes. Many Redshift customers run with over-provisioned clusters. Because nodes are the basis for pricing, that can add up over time. And removing nodes is a much harder process. But it’s also the only way to reduce your Redshift cost.

Redshift Spectrum

In some cases, it may make sense to shift data into S3. The cost of S3 storage is roughly a tenth of Redshift compute nodes. With Amazon Redshift Spectrum you can query data in Amazon S3 without first loading it into Amazon Redshift.

Image 2 shows what an extended architecture with Spectrum and query caching looks like. The leader node decides which parts of a query can be answered from the result cache and which steps can be pushed down to the Spectrum layer.

The leader node includes the corresponding steps for Spectrum into the query plan. The compute nodes in the cluster issue multiple requests to the Amazon Redshift Spectrum layer.

Image 2: Extended Amazon Redshift Architecture with Query Caching and Redshift Spectrum

Spectrum scans S3 data, runs projections, filters and aggregates the results. Spectrum sends the final results back to the compute nodes. The compute nodes run any joins with data sitting in the cluster. That way, you can join data sets from S3 with data sets in Amazon Redshift.

Redshift Spectrum pricing is based on the data volume scanned, at a rate of $5 per terabyte.

What does this mean for the DBA?

Using Redshift Spectrum is a key component of a data lake architecture. Amazon Redshift is the access layer for your data applications. Spectrum is the query processing layer for data accessed from S3. We’ve written more about the detailed architecture in “Amazon Redshift Spectrum: Diving into the Data Lake.”

How to get the most out of your Amazon Redshift cluster

In this post, we described Amazon Redshift’s architecture. We explained how the architecture affects working with data and queries.

If you want to dive deeper into Amazon Redshift and Amazon Redshift Spectrum, register for one of our public training sessions. Each month, we host a free training with live Q&A to answer your most burning questions about Amazon Redshift and building data lakes on Amazon AWS. If you have a burning question about the architecture that you want answered right now, open the chat window; we’re around to answer your questions!