3 Things to Avoid When Setting Up an Amazon Redshift Cluster
Amazon Redshift is a petabyte-scale data warehouse. Since its launch in 2012, it has seen huge adoption. It’s easy to spin up a cluster, pump in data and begin performing advanced analytics in under an hour. Products like Fivetran are very powerful in that context. Fivetran replicates data from business applications and loads it into Redshift and other data warehouses.
Setting up your cluster for the long-term
Because it is so easy to get going, data engineers often skip important best practices when setting up a cluster. Skipping best practices may create performance issues and you will pay the price later as your data volume and pipeline complexity grows. Common pain points are slow queries and lack of workload scalability.
We’ve already written about the “Top 14 Performance Tuning Techniques for Amazon Redshift” but in this post, we’ll focus on the Top 3 things NOT to do when setting up a cluster:
- Don’t use the master Redshift user
- Don’t use a single schema
- Don’t use the default WLM queue
By avoiding these 3 things, you’re setting yourself up for success. As the complexity of your data pipeline grows, you’ll be in a better position to handle the performance implications. Let’s see how and why.
#1 – Don’t use the master Redshift user
When launching a Redshift cluster, you create a master user. By default, that master user has access to the initial database. What happens next is that the login for the master user gets shared. ETL pipelines, scheduled jobs and dashboard tools all login with the same user.
The problem with that approach is that you lose granularity. It gets hard to see your different users, and who runs which queries. Troubleshooting bad queries starts to become harder as you add more users. It may work if you only have 3-5 users accessing Redshift, but it gets impossible once you have 10 or more.
Instead, use the CREATE USER command and create individual logins to isolate workloads. One user, one login, no exceptions. That way, you’ll have more control and better visibility into your single workloads. We also recommend grouping your users by type of workload (loads, transforms, ad-hoc). This will come in handy later when you set up your workload management.
Keep the master user idle and don’t run any queries with it. It’s your lifeline in case your cluster becomes unresponsive. When nothing else works, you’ll be able to run queries with the master user (see #3 on WLM below).
#2 – Don’t use a single schema
By default, a database in Redshift has a single schema, with the name PUBLIC. And by default, all users have CREATE and USAGE privileges on the PUBLIC schema of a database. With a single schema and read/write access for all users, you’re creating dependencies. When data, user and query volume grow, it becomes hard to untangle these dependencies. More users create more joins across more tables. You’re boxing yourself in. Don’t do it.
Instead, plan your data architecture. Start with two different schemas. A “raw schema” where you load all your data, and a “data schema” that’s available for running ad-hoc queries. And then you move data between the schemas with your transform jobs.
As a rule of thumb, users in each role should only have access to the schemas and tables that they need, and no more. That’s also another reason why you should put in place individual users and user groups.
#3 – Don’t use the default WLM queue
Amazon Redshift operates in a queueing model, with query queues and slots. You can use workload management (WLM) to define query queues. Redshift then routes queries to the appropriate queues at runtime.
User session and queries can overlap. Some queries can consume cluster resources for long periods of time. That impacts the performance of other queries. That’s why you should separate / isolate your workloads from each other.
By default, Amazon Redshift configures two query queues:
- A superuser queue reserved for superusers. You can’t configure this queue, and it does not appear in the Redshift console.
- A default user queue to run queries. The default configuration for the queue is to run 5 concurrent queries.
There’s a 99% chance that the default queue will not work for you. Every company’s workloads are different. But setting up WLM is something people like to ignore because it’s so easy to dump data into Redshift. And when things get slow you add another node.
However, that’s an expensive proposition over time. Consider that the performance increase of adding more nodes without WLM is degressive.
How to set up your WLM is fodder for a later blog post. Here are the main steps:
- Categorize your users into 3 groups (load, transform, ad-hoc)
- Define 3 new WLM queues: one each for your loads, transforms and ad-hoc queries
- Assign the user groups to their corresponding queues
- Make the default queue a catch-all queue with a concurrency of 1 and memory of 1%. Do not assign any workloads to this queue
What’s more, any queries that are not routed to other queues should run in the default queue. That’s your insurance if something goes wrong. By not assigning your default user to any queue, it will run in the default queue. And there will always be that one slot available. That’s the lifeline mentioned above.
Avoid the three key mistakes when configuring your Amazon Redshift cluster. That’s your starting point for long term success. Your workloads become scalable.
There is still much more performance tuning to do at scale. As you load more data into your cluster and add more users, things can become slow. That’s when it makes sense to use a product like intermix.io alongside with an ETL product like Fivetran. When it’s critical that analysts, apps and dashboards are meeting SLA and having good experiences, we help you understand how your queries and data loads are performing.
But setting things up the right way from the start makes that a lot easier. And as your business grows, your data infrastructure can scale with it.