
At intermix.io, we run a fleet of over ten Amazon Redshift clusters. In this post, I’ll describe how we use AWS IAM authentication for our database users. 

AWS access is managed by configuring policies and connecting them to IAM identities (users, groups of users, or roles) or AWS resources. A policy is an object in AWS that, when associated with an identity or resource, defines its permissions. Evaluation of the policies occurs when an IAM principal (user or role) makes a request, and permissions within the policies determine whether the request is allowed or denied. Policies reside in AWS as JSON documents. There is support for six types of policies: identity-based policies, resource-based policies, permissions boundaries, organization SCPs, ACLs, and session policies.

By using IAM credentials, we can enforce security policies on users and passwords. It’s a scalable and secure way to manage access to your cluster(s).

The approach we describe is useful in environments where security is a top concern. Examples include industries with regulatory requirements, such as finance and healthcare, as well as enterprises with strict IT security compliance requirements.


Secure Ways to Manage Access to Amazon Redshift

Amazon Redshift started as a data warehouse in the cloud. A frequent initial use case was business intelligence. Over the years, though, the use cases for Amazon Redshift have evolved. Redshift is still used for reporting. But it’s now also part of mission-critical data services and applications. intermix.io is an example – we built our application on top of Redshift. And so it’s critical to track and manage user access to a cluster.

Because Amazon Redshift is based on PostgreSQL, the standard way for users to log onto Amazon Redshift is by providing a database username and password. But a username/password combination has a few drawbacks. The biggest one is that there is no way to enforce security best practices for password changes. That may not be an issue in a small company where only two or three people access the cluster, but it’s a different story for enterprises with a large pool of users, or for a small company whose headcount is growing fast. It’s easy to lose track of everybody who has access to your most valuable data sets.

Luckily, AWS has recently developed alternatives to using a username and password. There are three options for maintaining login credentials for your Amazon Redshift database:

  1. Permit users to create user credentials and log in with their IAM credentials.
  2. Permit users to log in with federated single sign-on (SSO) through a SAML 2.0-compliant identity provider.
  3. Generate temporary database credentials. Permissions are granted through an AWS Identity and Access Management (IAM) permissions policy. By default, these credentials expire after 15 minutes, but you can configure them to expire up to an hour after creation.

This post will discuss #3: using IAM credentials to generate expiring database credentials. Amazon Redshift provides the GetClusterCredentials API operation and the get-cluster-credentials command for the AWS Command Line Interface (AWS CLI). Both offer the ability to generate temporary database user credentials programmatically. You can also configure your SQL client with the Amazon Redshift JDBC or ODBC drivers, which manage the process of calling the GetClusterCredentials operation, retrieving the database user credentials, and establishing a connection between the SQL client and the Amazon Redshift database. Check out JDBC and ODBC Options for Creating Database User Credentials for more information.

How intermix.io Uses IAM to Generate Temporary Passwords

The Old Way: Manual

Handling credentials used to be a manual process, and that’s a pain in the neck. We were already using AWS Secrets Manager, so it would have been a pretty trivial exercise to add auto-rotation to our secrets and trigger ALTER <user> queries to update them with new credentials. That would have been my initial approach. But one of my new colleagues pointed out the option of using IAM: Redshift allows you to get time-scoped credentials, associated with an IAM role, from Redshift itself.

The New Way: Using IAM to Generate Expiring Passwords

That turned into a somewhat larger undertaking. First, I had to understand how we were using Redshift across our platform. With a reasonable idea of how the change would work, I changed our Redshift connection logic to pull credentials before connecting. That turned out to be a pretty easy change. We’re a Python shop and Boto3 – the AWS SDK for Python – is exhaustive. Boto enables Python developers to create, configure, and manage AWS services.

We deployed the change into one of our testing environments. Everything went well until we hit the rate limit for the Redshift API: we were making too many requests to GetClusterCredentials. We have a lot of in-flight transformations that all connect to Redshift, and making those calls alongside each connection exhausted the rate limit. But that wasn’t insurmountable. It was pretty easy to add a caching mechanism to our connection logic so that we didn’t need to generate a new credential every time.
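The caching logic can be quite small. Here’s a minimal sketch of the idea (not our production code), assuming boto3 and a simple in-process dictionary as the cache; the refresh margin is an arbitrary value for illustration:

import time

import boto3

_CACHE = {}
_REFRESH_MARGIN = 60  # seconds left before expiry at which we fetch fresh credentials


def cached_credentials(cluster_id, db_user, db_name, duration=900):
    """Return temporary Redshift credentials, reusing cached ones until they near expiry."""
    key = (cluster_id, db_user, db_name)
    entry = _CACHE.get(key)
    if entry and entry["expires_at"] - time.time() > _REFRESH_MARGIN:
        return entry["creds"]

    redshift = boto3.client("redshift")
    creds = redshift.get_cluster_credentials(
        ClusterIdentifier=cluster_id,
        DbUser=db_user,
        DbName=db_name,
        DurationSeconds=duration,
    )
    _CACHE[key] = {"creds": creds, "expires_at": time.time() + duration}
    return creds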

Once we got the caching mechanism deployed, we were able to disable logins. Access to the cluster was now only available through IAM credentials.

That left us with an awkward situation, though. Our developers have to connect to our clusters to run queries or test new features. They needed a way to generate IAM credentials and connect to a remote cluster.

We already understood how to generate IAM credentials and had code that handled that. We then solved our need by creating a task in our execution and deployment framework. The task connects to a Redshift cluster in any of our environments using IAM credentials. You can see it in use below!

Generate IAM Credentials

The code we used to do this isn’t that important, because it’s all made possible by the Redshift API and some IAM policies. So you can do this yourself, even without seeing what I’ve done. You need an AWS client for your preferred programming language and an IAM policy granting your users access to generate credentials on your clusters.


Getting Started

The first thing you’re going to want to do is create an IAM policy with the Redshift Actions needed to create an IAM credential on a Redshift cluster. 

The following is a policy that allows the IAM role to call the GetClusterCredentials operation, which automatically creates a new user and specifies groups the user joins at login. The “Resource”: “*” wildcard grants the role access to any resource, including clusters, database users, or user groups.
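A minimal version of such a policy, modeled on the example in the AWS documentation, looks like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "redshift:GetClusterCredentials",
        "redshift:CreateClusterUser",
        "redshift:JoinGroup"
      ],
      "Resource": "*"
    }
  ]
}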

Of course, this example is not secure and strictly for demonstration. Please see Resource policies for GetClusterCredentials for more information and examples to achieve more granular access control.

Once you have that policy created, go ahead and create a group to contain the users to whom you want to grant access; if you’ve already got that group, great! Add the users you’d like to the group, and attach the policy you defined in the previous step under the Permissions tab. At this point, everyone in the group will be able to generate IAM credentials for existing users on your clusters.
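If you’d rather script that step, a minimal boto3 sketch might look like the following; the group name, policy ARN, and username are placeholders:

import boto3

iam = boto3.client("iam")

GROUP_NAME = "redshift-iam-login"  # hypothetical group name
POLICY_ARN = "arn:aws:iam::123456789012:policy/RedshiftGetClusterCredentials"  # placeholder ARN

# Create the group, attach the credential-generation policy, and add a user to it.
iam.create_group(GroupName=GROUP_NAME)
iam.attach_group_policy(GroupName=GROUP_NAME, PolicyArn=POLICY_ARN)
iam.add_user_to_group(GroupName=GROUP_NAME, UserName="adam")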

The following shows how to use the Amazon Redshift CLI to generate temporary database credentials for an existing user named adam.

As a side note, if the user doesn’t exist in the database and AutoCreate is true, a new user is created with PASSWORD disabled. If the user doesn’t exist and AutoCreate is false, the request fails.

aws redshift get-cluster-credentials --cluster-identifier clusterA --db-user adam --db-name dbA --duration-seconds 7200

The result is the following: 

{
  "DbUser": "IAM:adam",
  "Expiration": "2020-10-08T21:10:53Z",
  "DbPassword": "EXAMPLEjArE3hcnQj8zt4XQj9Xtma8oxYEM8OyxpDHwXVPyJYBDm/gqX2Eeaq6P3DgTzgPg=="
}

Check out the official documentation for a more in-depth exploration of the CLI method.

However, using the AWS CLI is manual and can be error-prone. My recommendation is to create a utility to generate the credentials and connect to the cluster on behalf of the user. This utility could also generate ODBC or JDBC connection strings if that’s how your users connect to a cluster.

Here’s a quick Python-based example that outputs a JDBC URL. It assumes that boto3 is installed, that your environment is configured with credentials for an IAM identity covered by the policy and group above, and that you know your cluster identifier, database name, and database user.
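A sketch of that utility might look like this; the cluster, database, and user names are placeholders reused from the CLI example above:

import boto3


def jdbc_connection_info(cluster_id, db_user, db_name, duration=3600):
    """Return a JDBC URL plus temporary credentials for the given cluster and user."""
    redshift = boto3.client("redshift")

    creds = redshift.get_cluster_credentials(
        ClusterIdentifier=cluster_id,
        DbUser=db_user,
        DbName=db_name,
        DurationSeconds=duration,
        AutoCreate=False,
    )

    # Requires redshift:DescribeClusters in addition to the GetClusterCredentials policy above.
    endpoint = redshift.describe_clusters(ClusterIdentifier=cluster_id)["Clusters"][0]["Endpoint"]
    url = "jdbc:redshift://{}:{}/{}".format(endpoint["Address"], endpoint["Port"], db_name)

    # The generated DbUser comes back prefixed with "IAM:"; use it as-is in your SQL client.
    return url, creds["DbUser"], creds["DbPassword"]


if __name__ == "__main__":
    url, user, password = jdbc_connection_info("clusterA", "adam", "dbA")
    print(url)
    print(user)
    print(password)

Feed the URL and the temporary credentials into your SQL client, or extend the function to emit an ODBC connection string instead.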

You’re now in a situation where you have an IAM policy and a group containing your users. You’ve also configured your Redshift users with passwords DISABLED.

In short, you’ve secured your Redshift cluster. Your security team can enforce credential rotation on users using standard IAM behavior rather than implementing rotation directly in the database.

For more on best practices when working with Amazon Redshift, read our post on 3 Things to Avoid When Setting Up an Amazon Redshift Cluster. Or download our best-practices guide for more tips on enterprise-grade deployments for Amazon Redshift.

Here at intermix, we’re always working on new features that make your life as a data engineer easier when working with Amazon Redshift. For example, “disk full” errors in Amazon Redshift are an all-too-common performance issue—even when you thought you had plenty of space to spare in your Redshift deployment. That’s why we’ve made it easier than ever to use intermix to find the cause of Amazon Redshift disk usage spikes.

What Causes Disk Usage Spikes in Amazon Redshift?

The problem of Redshift disk usage spikes is so common that Amazon has released its own guide, “How can I troubleshoot high or full disk usage with Amazon Redshift?” This page features 9 different possible causes of Redshift disk usage spikes and “disk full” errors, including query processing issues, tables with VARCHAR(MAX) columns, copying large files, and more. But what about the particular situation when you encounter a disk usage spike when running a Redshift query?

If you encounter a Redshift “disk full” error when running a query, one thing’s for sure: you’ve run out of space in one or more of the nodes in your Redshift cluster. Disk usage spikes when running queries stem from two issues: either the query is using so much memory that it’s overflowing to disk, or your data is too large for the hard disks on the cluster.

But what if you don’t have the time or knowledge to troubleshoot the possible causes yourself? That’s exactly where intermix comes in.

How to Find the Cause

In your intermix dashboard, go to the Storage Analysis tab. (Don’t have an account yet? Sign up here for a free 2-week trial.) Next, click anywhere on the Schema chart to get the list of tables for that exact time. The tables are sorted by growth, so the ones which are causing the disk usage spike will show up at the top of the list.

And that’s it! Once you’ve isolated the problematic table(s), you can take steps to curb the issue, such as removing extraneous data or adding another node to the cluster.

Disk Usage Spikes in intermix.io

What’s Next for Solving Disk Usage Performance Issues?

Since we’ve introduced this feature to intermix, we’ve made a few changes in response to best practices and user feedback:

As always, we’d love to hear your feedback, so please get in touch over chat or email. If you aren’t yet an intermix customer, sign up for your free trial and start optimizing your Redshift deployment today. And for the latest in data engineering news, check out our weekly newsletter “SF Data Weekly” with over 6,000 subscribers.

Amazon Redshift offers an attractive feature that can help organizations manage their hosting bill. It’s called concurrency scaling, and according to Amazon, it “automatically and elastically scales query processing power to provide consistently fast performance for hundreds of concurrent queries.”

Introduction to Amazon Redshift Concurrency Scaling

Before concurrency scaling, Redshift users faced a familiar dilemma when dealing with peak demand. There were two options: over-provision the cluster to handle peak load and pay for capacity that sits idle the rest of the time, or size it for typical load and let queries queue up during bursts.

Concurrency scaling adds resources to your Redshift cluster on an on-demand basis, adding processing power during peak time and withdrawing it in quieter moments. 

In terms of pricing, concurrency scaling works on a credit system that should make it free for most users. Amazon allows you to earn one free hour of scaling for every 24 hours of main Redshift cluster usage, and these credits accrue over time. A cluster that runs around the clock, for example, accrues roughly 30 hours of free concurrency scaling per month. Any usage beyond your credits gets billed on a per-second basis according to your Redshift agreement.

Concurrency scaling makes financial sense, but can it offer consistent service? Let’s find out.

Cluster Requirements

There are three eligibility requirements for concurrency scaling. Your Redshift cluster must be:

This means that single-node clusters are not eligible. Also, note that the cluster must have had fewer than 32 nodes at creation. If your cluster originally had 50 nodes and you scale down to 32, you’re still not eligible for concurrency scaling.

Eligible Query Types

Concurrency scaling does not work on all query types. For the first release, it handles read-only queries that meet three conditions:

To be routed to a concurrency scaling cluster, a query needs to encounter queueing. Also, queries eligible for the SQA (Short Query Acceleration) queue will not run on concurrency scaling clusters.

Queuing and SQA are a function of a properly set up workload management (WLM) configuration. We recommend optimizing your WLM first, because it will reduce the need for concurrency scaling. And that matters because, while AWS claims that concurrency scaling will be free for 97% of customers, you could face an additional usage charge if you exceed your credits.

We’ve also tested enabling Redshift’s automatic WLM and captured our experience with it in this blog post: “Should I Enable Amazon Redshift’s Automatic WLM?”


Enabling Concurrency Scaling

Concurrency scaling is enabled on a per-WLM queue basis. Go to the AWS Redshift Console and click on “Workload Management” from the left-side navigation menu. Select your cluster’s WLM parameter group from the subsequent pull-down menu.

You should see a new column called “Concurrency Scaling Mode” next to each queue. The default is ‘off’. Click ‘Edit’ and you’ll be able to modify the settings for each queue.

How to enable concurrency scaling in the AWS Redshift Console “Workload Management” section

How We Configured Redshift Concurrency Scaling

Concurrency scaling works by routing eligible queries to new, dedicated clusters. The new clusters have the same size (node type and number) as the main cluster.  

The number of clusters used for concurrency scaling defaults to one (1), with the option to configure up to ten (10) total clusters.

The total number of clusters that should be used for concurrency scaling can be set by the parameter max_concurrency_scaling_clusters.  Increasing the value of this parameter provisions additional standby clusters.
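If you prefer to script these settings rather than click through the console, here’s a rough boto3 sketch; the parameter group name and queue layout are placeholders, and it sets max_concurrency_scaling_clusters to 3, the value we used in our test:

import json

import boto3

redshift = boto3.client("redshift")

# Hypothetical WLM layout: one manual queue with 5 slots and concurrency scaling enabled.
wlm_config = [
    {
        "query_concurrency": 5,
        "concurrency_scaling": "auto",
    }
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="my-wlm-parameter-group",  # placeholder name
    Parameters=[
        {
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
        },
        {
            "ParameterName": "max_concurrency_scaling_clusters",
            "ParameterValue": "3",
        },
    ],
)

Note that wlm_json_configuration is a static parameter, so the new queue settings take effect after the cluster reboots.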


Monitoring our Concurrency Scaling Test

There are a few additional charts in the AWS Redshift console. There is a chart called “Max Configured Concurrency Scaling Clusters” which plots the value of max_concurrency_scaling_clusters over time.

“Max Configured Concurrency Scaling Clusters” chart in the AWS Redshift Console

The number of Active Scaling clusters is also shown in the UI under Concurrency Scaling Activity:

The AWS Redshift Console displays the Active Scaling clusters in the Concurrency Scaling Activity report

The Queries tab in the UI also has a column to show if the query ran on the Main cluster or on the Concurrency Scaling cluster:

The AWS Redshift Console has a Queries tab to show which cluster the query ran on, either the Main Cluster or the Concurrency Scaling cluster

Whether a particular query ran on the main cluster or via a concurrency scaling cluster is stored in stl_query.concurrency_scaling_status.  

Wondering whether your query ran on the main cluster or Concurrency Scaling cluster? Here’s where to look in your AWS Redshift Console.

A value of 1 means the query ran on a Concurrency Scaling cluster, and other values mean it ran on the main cluster.

Example:
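Here’s a minimal SQL sketch along those lines; the timestamp filter is just a placeholder for the start of your own test window:

SELECT concurrency_scaling_status,
       COUNT(*) AS query_count
FROM stl_query
WHERE starttime > '2019-03-29 18:30:00'
GROUP BY concurrency_scaling_status
ORDER BY concurrency_scaling_status;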


Concurrency scaling info is also stored in some other tables and views, such as SVCS_CONCURRENCY_SCALING_USAGE.

Several SVCS views contain similar information to the corresponding STL and SVL views and work in the same way as their STL or SVL equivalents.

Results of our Concurrency Scaling Tests

We enabled concurrency scaling for a single queue on an internal cluster at approximately 2019-03-29 18:30:00 GMT. We changed the max_concurrency_scaling_clusters parameter to 3 at approximately 2019-03-29 20:30:00.

To simulate query queuing, we lowered the number of slots for the queue from 15 to 5.

Below is a chart from the intermix.io dashboard, showing the running versus queuing queries for this queue, after cranking down the number of slots.

The intermix.io “concurrency and queuing” dashboard shows the Redshift performance of time spent in running versus queuing queries

We observe that the queueing time for queries went up, maxing out at a little over 5 minutes.

Our intermix test showed that when Redshift Concurrency Scaling was turned on, the queuing time for queries increased

Here’s the corresponding summary in the AWS console of what happened during that time:

The intermix.io team tested the Concurrency Scaling function; this is what the AWS Redshift console showed during our test


Redshift spun up three (3) concurrency scaling clusters as requested. It appears that these clusters were not fully utilized, even though our cluster had many queries that were queuing.

The usage chart correlates closely with the scaling activity chart:

During our intermix.io test of Redshift Concurrency Scaling, we saw correlations between the Redshift Console scaling activity chart & usage chart

After a few hours, we checked and it looked like 6 queries ran with concurrency scaling.  We also spot-checked two queries against the UI. We haven’t checked how this value may be used if multiple concurrency clusters are active.

Conclusion: Is Redshift Concurrency Scaling Worth it?

Concurrency scaling may mitigate queue times during bursts in queries.

From this basic test, it appears that a portion of our query load improved as a result. However, simply enabling concurrency scaling didn’t fix all of our concurrency problems.  The limited impact is likely due to the limitations on the types of queries that can use concurrency scaling. For example, we have a lot of tables with interleaved sort keys, and much of our workload is writes.

While concurrency scaling doesn’t appear to be a silver bullet solution for WLM tuning in all cases, the feature is transparent and easy to use. You can start with a single concurrency scaling cluster, then monitor the peak load via the console to determine whether the new clusters are being fully utilized.

Though it may not live up to being the automatic solution advertised, concurrency scaling will become more and more effective over time as AWS adds more features and support. We strongly recommend enabling the feature on your WLM queues.

Apache Airflow has come a long way since it started as an internal project at Airbnb back in 2014, thanks to the core contributors’ fantastic work in creating a very engaged community while doing some superhero lifting of their own. In this post, we’ll go through some of the exciting things coming down the pipe as the project gears up for a hotly anticipated 2.0 release.

What is Apache Airflow?

Apache Airflow is an open-source workflow management system that allows you to programmatically author, schedule, and monitor data pipelines in Python. It is the most popular and effective open-source tool on the market for managing workflows, with over 18,000 stars and over 1,300 contributors on GitHub.
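As a quick illustration, here’s what a minimal DAG might look like (Airflow 1.10-style imports; the task names and commands are arbitrary placeholders):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="example_daily_pipeline",
    default_args=default_args,
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

# Run extract before load, once per day.
extract >> load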

Since being open-sourced in 2015, Airflow has proven to be the dominant tool in its class, beating out alternatives like Spotify’s Luigi, Pinterest’s Pinball, and even the ever-present Hadoop-centric Oozie. This is due to its core principles of configurability, extensibility, and scalability. As we’ll see, the new features being developed and the direction of the project still adhere to these principles.

Apache Airflow Short Term (v1.10)

These improvements are currently being worked on in active PRs or were recently merged, and they will be included in the ongoing Apache Airflow 1.10 release.

RBAC and the UI

Joy Gao took on the herculean task of converting the front end framework from Flask Admin to Flask AppBuilder, which was an incredible feat for one person to do largely on her own. One of the primary benefits realized in this update is ground-level support for role-based authentication controls (RBAC) that will open the door for various auth backends and allow admin users to dictate who has access to specific elements of their Airflow cluster.

Along with this RBAC capability will come an improved UI that will allow for better security around user access. Our team at Astronomer is hoping to fix the UI so that refreshing the dashboard is not needed to check on the status of DAGs – we’d like to be able to view the status of our DAGs in real-time without driving ourselves crazy pressing that Refresh button.

The Kubernetes Executor

One of the most exciting developments anticipated by the Astronomer team is the release of the Kubernetes Executor that Daniel Imberman of Bloomberg has been leading development on. This is long-awaited by the community and will allow users to auto-scale workers via Kubernetes, ensuring that resources are not wasted. This is especially important for expanding the viable use cases for Airflow: right now, you must either run Airflow on a low-powered EC2 instance and use it to schedule external jobs, or run it on expensive hardware that is massively underutilized when tasks aren’t running. While it is being included in the 1.10 release, the new executor is part of a long-term but active effort to make Airflow completely cloud-native, which we’ll discuss in the following section.

Move From Jira to Github

In a procedural change, the Airflow community moved their issue tracking from Jira to Github.  This signals their growth as a community and their ability to adapt to the changing environments of issue tracking.

Apache Airflow Long Term (v2.0+)

In addition to the short-term fixes outlined above, there are a few longer-term efforts that will have a huge bearing on the stability and usability of the project. Most of these items have been identified by the Airflow core maintainers as necessary for the v2.x era and subsequent graduation from “incubation” status within the Apache Foundation.

First Class API Support

A very desirable feature (at least among the Airflow users we work with) is first-class support for an API to control everything from connection creation to DAG pausing to requesting usage metrics. Right now, the API is strictly experimental, with functionality being added regularly, so in practice, if you want this behavior, you end up writing a plugin that manipulates the underlying MySQL or PostgreSQL database. Given that much of the current functionality in the UI is based on direct modification of the database and not on any API, the inclusion of a first-class API that handles all functionality would mean that everything done in the UI could also be done in the CLI, further expanding the use cases Airflow could facilitate.

Making Airflow Cloud Native

As mentioned above in relation to the Kubernetes Executor, the most significant long-term push in the project is to make Airflow cloud native. Today it is still up to the user to figure out how to operationalize Airflow for Kubernetes, AWS Batch, or other cloud-based infrastructure. At Astronomer, we have done this and provide it in a dockerized package for our customers.

We feel this is an important step for the project to keep up with the changing deployment landscape, and we plan to open-source what we can as we go, but knocking this out is easier said than done. One of the most fundamental problems blocking this initiative is the need for a highly available, massively distributed, and auto-rebalancing datastore, something that is hard to achieve with a simple PostgreSQL or MySQL setup.

A promising lead toward addressing this is added support for CockroachDB, a database that follows the Google Spanner whitepaper, was built by former Google File System engineers, and is designed precisely for the requirements listed above.

Improved Test Suite

A common complaint among Airflow contributors is the long time it can take for Travis, the CI of choice for the Airflow project, to run all the tests when cutting a new release. This has been brought up in the past, but now that Airflow’s codebase has hit a scale where a Travis run can take up to an hour, we see an improved test suite finally making it over the line (and we’re looking forward to helping!). One factor that will help is breaking out the plugins, which have been growing into a large codebase in and of themselves. Which brings us to…

Plugin Manager

Before we talk about this, it’s important to note that this is NOT on the official Airflow roadmap (at least not yet) but is rather something that the Astronomer team has been mulling around as we see the continued proliferation of plugins.

The brilliance of Airflow plugins (and why they have contributed in no small part to the success of the entire project) is how wide-ranging they can be. They can enable your workflows to connect with GCP, AWS, and Hadoop ecosystems, as well as any number of other APIs and databases, rather trivially.

Ironically, this is also their weakness. Importing Google Cloud plugins, and opening the door to extra processing and dependency conflicts, makes zero sense if your stack is on AWS. And because plugins have to interact with constantly changing APIs, they are inherently brittle and by their very nature require a different release schedule from the core project, which moves more slowly since any error could affect core functionality.

All plugins should be on their own release schedule with an independent testing suite to make sure that all updates take advantage of the latest changes in external projects. Getting this to be as easy as a pip install will be huge for making Airflow more available to other systems.

As we look toward the next few years of our roadmap, we’re doubling down on our community contributions to help Airflow retain its status as the most flexible, extensible, and reliable scheduler available, regardless of how it’s being run. In Astronomer speak, we’d recommend hanging onto your helmets – the ship is about to kick into hyperdrive.

Which future Airflow developments are you and your network excited about?  Join our Slack channel to get in on the conversation and keep up-to-date with the latest news and get support from the community. 

This is a guest blog post by Pete DeJoy. Pete is a Product Specialist at Astronomer, where he helps companies adopt Airflow. 

What are Query Groups in Amazon Redshift and intermix?

Amazon Redshift is a robust, enterprise-class cloud data warehouse—but that doesn’t mean it’s always the most user-friendly solution. Getting the most out of your Redshift deployment can be a challenge at the best of times (which is one reason why we wrote our article “15 Performance Tuning Techniques for Amazon Redshift”).

In particular, Redshift queries can cause performance issues and mysterious slowdowns if you don’t know how to optimize them. But without a dedicated way to analyze Redshift query performance, how can you hope to isolate the problem?

That’s why we’ve introduced the Query Groups feature to Intermix, our performance monitoring solution for Amazon Redshift. Query Groups is a powerful asset that intelligently classifies and ranks query workloads on your cluster. Using Query Groups, you can answer questions like:

How Do Query Groups Work in Amazon Redshift and intermix?

In Intermix, the queries in a Query Group are grouped together using a proprietary algorithm and ranked by volume, execution time, and queue time. More metrics might be added in the future depending on user demand and necessity. All of the queries in a Query Group share a SQL structure and operate on the same tables.

Example: Find the Queries Causing a Query Spike

At 8:17 a.m. on August 7, the below cluster experienced an eightfold spike in queries—wow! Typically, this type of event is caused by a handful of new queries that suddenly increased in volume. But how do you find which queries, and who ran them?

intermix.io’s Query Groups feature can quickly determine which of your queries are responsible.

First, click on the new Query Groups page in the left navigation panel. By default, Query Groups are sorted by Rank. In this case, we want to re-sort by “Rank Change,” which will order the list of Query Groups by the “fastest movers.” In other words, this will help us quickly see the groups which have been moving up the ranks in the past week.

Sure enough, we see a handful of Query Groups which suddenly started running. Clicking into the first one, we can isolate the exact queries that are causing the problem.

You could also use the same procedure to determine the queries that underwent a spike in latency or queue time.

What’s Next for Query Groups?

Query Groups in Redshift and Intermix offer the potential to make your Redshift performance tuning process radically more efficient—and we’re just getting started. We plan to expand the “grouping” concept in the future to add:

Sound like what you’ve been looking for? We’re here to help. Sign up today for a free trial of the Intermix platform. Query Groups are just one of the ways that we’ve helped Redshift customers get the most from their cloud data warehouse.