Start Now Login
3 Things to Avoid When Setting Up an Amazon Redshift Cluster

3 Things to Avoid When Setting Up an Amazon Redshift Cluster

Amazon Redshift is a petabyte-scale data warehouse. Since its launch in 2012, it has seen huge adoption. It’s easy to spin up a cluster, pump in data and begin performing advanced analytics in under an hour. Products like Fivetran are very powerful in that context. Fivetran replicates data from business applications and loads it into Redshift and other data warehouses.

Setting up your cluster for the long-term

Because it is so easy to get going, data engineers often skip important best practices when setting up a cluster. Skipping best practices may create performance issues and you will pay the price later as your data volume and pipeline complexity grows. Common pain points are slow queries and lack of workload scalability.

We’ve already written about the “Top 14 Performance Tuning Techniques for Amazon Redshift” but in this post, we’ll focus on the Top 3 things NOT to do when setting up a cluster:

  1. Don’t use the master Redshift user
  2. Don’t use a single schema
  3. Don’t use the default WLM queue

By avoiding these 3 things, you’re setting yourself up for success. As the complexity of your data pipeline grows, you’ll be in a better position to handle the performance implications. Let’s see how and why.

#1 – Don’t use the master Redshift user

When launching a Redshift cluster, you create a master user. By default, that master user has access to the initial database. What happens next is that the login for the master user gets shared. ETL pipelines, scheduled jobs and dashboard tools all login with the same user.
The problem with that approach is that you lose granularity. It gets hard to see your different users, and who runs which queries. Troubleshooting bad queries starts to become harder as you add more users. It may work if you only have 3-5 users accessing Redshift, but it gets impossible once you have 10 or more.

Instead, use the CREATE USER command and create individual logins to isolate workloads. One user, one login, no exceptions. That way, you’ll have more control and better visibility into your single workloads. We also recommend grouping your users by type of workload (loads, transforms, ad-hoc). This will come in handy later when you set up your workload management.

Keep the master user idle and don’t run any queries with it. It’s your lifeline in case your cluster becomes unresponsive. When nothing else works, you’ll be able to run queries with the master user (see #3 on WLM below).

#2 – Don’t use a single schema

By default, a database in Redshift has a single schema, with the name PUBLIC. And by default, all users have CREATE and USAGE privileges on the PUBLIC schema of a database. With a single schema and read/write access for all users, you’re creating dependencies. When data, user and query volume grow, it becomes hard to untangle these dependencies. More users create more joins across more tables. You’re boxing yourself in. Don’t do it.

Instead, plan your data architecture. Start with two different schemas. A “raw schema” where you load all your data, and a “data schema” that’s available for running ad-hoc queries. And then you move data between the schemas with your transform jobs.

As a rule of thumb, users in each role should only have access to the schemas and tables that they need, and no more. That’s also another reason why you should put in place individual users and user groups.

#3 – Don’t use the default WLM queue

Amazon Redshift operates in a queueing model, with query queues and slots. You can use workload management (WLM) to define query queues. Redshift then routes queries to the appropriate queues at runtime.

User session and queries can overlap. Some queries can consume cluster resources for long periods of time. That impacts the performance of other queries. That’s why you should separate / isolate your workloads from each other.

By default, Amazon Redshift configures two query queues:

  1. A superuser queue reserved for superusers. You can’t configure this queue, and it does not appear in the Redshift console.
  2. A default user queue to run queries. The default configuration for the queue is to run 5 concurrent queries.

There’s a 99% chance that the default queue will not work for you. Every company’s workloads are different. But setting up WLM is something people like to ignore because it’s so easy to dump data into Redshift. And when things get slow you add another node.

However, that’s an expensive proposition over time. Consider that the performance increase of adding more nodes without WLM is degressive.

How to set up your WLM is fodder for a later blog post. Here are the main steps:

What’s more, any queries that are not routed to other queues should run in the default queue. That’s your insurance if something goes wrong. By not assigning your default user to any queue, it will run in the default queue. And there will always be that one slot available. That’s the lifeline mentioned above.

Workflow Scalability

Avoid the three key mistakes when configuring your Amazon Redshift cluster. That’s your starting point for long term success. Your workloads become scalable.

There is still much more performance tuning to do at scale. As you load more data into your cluster and add more users, things can become slow. That’s when it makes sense to use a product like intermix.io alongside with an ETL product like Fivetran. When it’s critical that analysts, apps and dashboards are meeting SLA and having good experiences, we help you understand how your queries and data loads are performing.

But setting things up the right way from the start makes that a lot easier. And as your business grows, your data infrastructure can scale with it. 

Related content
Apache Spark vs. Amazon Redshift: Which is better for big data? Amazon Redshift Spectrum: Diving into the Data Lake! What Causes "Serializable Isolation Violation Errors" in Amazon Redshift? A Quick Guide to Using Short Query Acceleration and WLM for Amazon Redshift for Faster Queries What is TensorFlow? An Intro to The Most Popular Machine Learning Framework Titans of Data with Mirko Novakovic - How Containers are Giving Rise to New Data Services Why We Built intermix.io - “APM for Data” 4 Simple Steps To Set-up Your WLM in Amazon Redshift For Better Workload Scalability World-class Data Engineering with Amazon Redshift - Training Announcing App Tracing - Monitoring Your Data Apps With intermix.io Have Your Postgres Cake with Amazon Redshift and eat it, too. 4 Real World Use Cases for Amazon Redshift 3 Steps for Fixing Slow Looker Dashboards with Amazon Redshift Zero Downtime Elasticsearch Migrations Titans of Data with Florian Leibert – CEO Mesosphere Improve Amazon Redshift COPY performance:  Don’t ANALYZE on every COPY Building a Better Data Pipeline - The Importance of Being Idempotent The Future of Machine Learning in the Browser with TensorFlow.js Gradient Boosting Libraries — A Comparison Crowdsourcing Weather Data With Amazon Redshift The Future of Apache Airflow Announcing Query Groups – Intelligent Query Classification Top 14 Performance Tuning Techniques for Amazon Redshift Product Update: An Easy Way To Find The Cause of Disk Usage Spikes in Amazon Redshift How We Reduced Our Amazon Redshift Cost by 28%
Ready to start seeing into your data infrastructure?
Get started with a 14-day free trial, with access to the full platform

No Credit Card Required