3 Things to Avoid When Setting Up an Amazon Redshift Cluster

Amazon Redshift is a petabyte-scale data warehouse. Since its launch in 2012, it has seen huge adoption. It’s easy to spin up a cluster, pump in data and begin performing advanced analytics in under an hour. Products like Fivetran are very powerful in that context. Fivetran replicates data from business applications and loads it into Redshift and other data warehouses.

Setting up your cluster for the long-term

Because it is so easy to get going, data engineers often skip important best practices when setting up a cluster. Skipping best practices may create performance issues and you will pay the price later as your data volume and pipeline complexity grows. Common pain points are slow queries and lack of workload scalability.

We’ve already written about the “Top 14 Performance Tuning Techniques for Amazon Redshift” but in this post, we’ll focus on the Top 3 things NOT to do when setting up a cluster:

  1. Don’t use the master Redshift user
  2. Don’t use a single schema
  3. Don’t use the default WLM queue

By avoiding these 3 things, you’re setting yourself up for success. As the complexity of your data pipeline grows, you’ll be in a better position to handle the performance implications. Let’s see how and why.

Download our Data Pipeline Resource Bundle

See 14 real-life examples of data pipelines built with Amazon Redshift

  • Full stack breakdown
  • Summary slides with links to resources
  • PDF containing detailed descriptions

#1 – Don’t use the master Redshift user

When launching a Redshift cluster, you create a master user. By default, that master user has access to the initial database. What happens next is that the login for the master user gets shared. ETL pipelines, scheduled jobs and dashboard tools all login with the same user.
The problem with that approach is that you lose granularity. It gets hard to see your different users, and who runs which queries. Troubleshooting bad queries starts to become harder as you add more users. It may work if you only have 3-5 users accessing Redshift, but it gets impossible once you have 10 or more.

Instead, use the CREATE USER command and create individual logins to isolate workloads. One user, one login, no exceptions. That way, you’ll have more control and better visibility into your single workloads. We also recommend grouping your users by type of workload (loads, transforms, ad-hoc). This will come in handy later when you set up your workload management.

Keep the master user idle and don’t run any queries with it. It’s your lifeline in case your cluster becomes unresponsive. When nothing else works, you’ll be able to run queries with the master user (see #3 on WLM below).

#2 – Don’t use a single schema

By default, a database in Redshift has a single schema, with the name PUBLIC. And by default, all users have CREATE and USAGE privileges on the PUBLIC schema of a database. With a single schema and read/write access for all users, you’re creating dependencies. When data, user and query volume grow, it becomes hard to untangle these dependencies. More users create more joins across more tables. You’re boxing yourself in. Don’t do it.

Instead, plan your data architecture. Start with two different schemas. A “raw schema” where you load all your data, and a “data schema” that’s available for running ad-hoc queries. And then you move data between the schemas with your transform jobs.

As a rule of thumb, users in each role should only have access to the schemas and tables that they need, and no more. That’s also another reason why you should put in place individual users and user groups.
See your data in intermix.io

#3 – Don’t use the default WLM queue

Amazon Redshift operates in a queueing model, with query queues and slots. You can use the workload management (WLM) console to define query queues. Redshift then routes queries to the appropriate queues at runtime.

You also have the option to use Amazon Redshift’s automatic WLM but we wouldn’t recommend that just yet.

User session and queries can overlap. Some queries can consume cluster resources for long periods of time. That impacts the performance of other queries. That’s why you should separate / isolate your workloads from each other.

By default, Amazon Redshift configures two query queues:

  1. A superuser queue reserved for superusers. You can’t configure this queue, and it does not appear in the Redshift console.
  2. A default user queue to run queries. The default configuration for the queue is to run 5 concurrent queries.

There’s a 99% chance that the default queue will not work for you. Every company’s workloads are different. But setting up WLM is something people like to ignore because it’s so easy to dump data into Redshift. And when things get slow you add another node.

However, that’s an expensive proposition over time. Consider that the performance increase of adding more nodes without WLM is degressive.

In 4 Simple Steps to Set-up your WLM in Amazon Redshift, we describe in detail how to set up your cluster for better workload scalability. Here are the main four steps:

  • Categorize your users into 3 groups (load, transform, ad-hoc)
  • Define 3 new WLM queues: one each for your loads, transforms and ad-hoc queries
  • Assign the user groups to their corresponding queues
  • Make the default queue a catch-all queue with a concurrency of 1 and memory of 1%. Do not assign any workloads to this queue

What’s more, any queries that are not routed to other queues should run in the default queue. That’s your insurance if something goes wrong. By not assigning your default user to any queue, it will run in the default queue. And there will always be that one slot available. That’s the lifeline mentioned above.

Download the Top 14 Performance Tuning Techniques for Amazon Redshift

Workflow Scalability

Avoid the three key mistakes when configuring your Amazon Redshift cluster. That’s your starting point for long term success. Your workloads become scalable.

There is still much more performance tuning to do at scale. As you load more data into your cluster and add more users, things can become slow. That’s when it makes sense to use a product like intermix.io alongside with an ETL tool. When it’s critical that analysts, apps, and dashboards are meeting SLA and having good experiences, we help you understand how your queries and data loads are performing.

But setting things up the right way from the start makes that a lot easier. And as your business grows, your data infrastructure can scale with it.

If you want to dive deeper into Amazon Redshift, register for one of our public training sessions. Each month, we host a free training with live Q&A to answer your most burning questions about Amazon Redshift and building data lakes on Amazon AWS.

Lars Kamp

Lars Kamp

Join 11,000 of your peers.
Subscribe to our newsletter SF Data.
People at Facebook, Amazon and Uber read it every week.

Every Monday morning we'll send you a roundup of the best content from intermix.io and around the web. Make sure you're ready for the week! See all issues.