Redshift Performance Tuning Techniques: View, Insert, Join, and More

Amazon Redshift is billed as “the world’s fastest cloud data warehouse.” But even Ferraris need a tune-up every now and then. Whether you’re experiencing persistent sluggishness or mysterious crashes, Redshift has its share of frustrations and challenges.

The good news is that the vast majority of these issues can be resolved. In previous articles, we’ve written about general Redshift best practices, including the top 14 performance tuning techniques for Amazon Redshift. This time, we’ll focus on improving the efficiency of specific Redshift operations: views, inserts, joins, and updates.

Redshift View Performance Tuning

A database view is a virtual table defined by the result of a query against one or more tables in the database. Views serve a variety of purposes: designing database schemas, simplifying or summarizing data, combining information from multiple tables, and more.

However, there’s one big problem for Redshift view performance: every time you access the view, Redshift needs to evaluate the underlying database query that corresponds to that view. If the query itself is inefficient, then accessing the view will likewise be frustratingly slow.

To improve Redshift view performance, users have multiple options, including CREATE TABLE AS SELECT (CTAS) and materialized views.

CREATE TABLE AS SELECT (CTAS)

The CREATE TABLE AS SELECT (CTAS) statement in SQL creates a new table from the results of a SELECT query, inheriting its columns from the query output. The formal syntax of the command is as follows:
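A simplified sketch of the Redshift CTAS syntax, followed by a hypothetical example (the table and column names are illustrative, not from the original article):

```sql
-- Simplified CTAS syntax (optional clauses in brackets):
CREATE [ TEMP ] TABLE new_table
[ ( column_name [, ...] ) ]
[ BACKUP { YES | NO } ]
[ table_attributes ]
AS query;

-- Example: precompute a per-course enrollment summary once,
-- instead of re-running the aggregation on every view access.
CREATE TABLE course_enrollment_summary AS
SELECT course_id, COUNT(*) AS enrolled
FROM enrollments
GROUP BY course_id;
```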

CTAS is a very helpful tool to improve the performance of Redshift views, and the table generated by CTAS can be used like any other view or table. However, there are a few important caveats to using CTAS for Redshift view performance:

  • The CTAS table is not refreshed when the data in the underlying table changes. This means that you’ll have to refresh the CTAS table manually.
  • On a related note, performing manual CTAS refreshes will require a good deal of oversight from users. Because Redshift does not denote whether a table was created by a CTAS command or not, users will have to keep track of this information and decide when it’s time to perform a refresh.

For these reasons, many Redshift users have chosen to use the new materialized views feature to optimize Redshift view performance.

Materialized Views

Materialized views are an Amazon Redshift feature first introduced in March 2020, although the concept of a materialized view is a familiar one in database systems. A materialized view is a database object that contains the precomputed results of a database query, similar to a CTAS table. The major difference between materialized views and CTAS tables is that a materialized view can be brought up to date with a single REFRESH MATERIALIZED VIEW command (Redshift refreshes incrementally where possible, and can also automate refreshes), which improves efficiency and manageability.

The following example command demonstrates how to create a materialized view in Redshift:
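A sketch of the CREATE MATERIALIZED VIEW syntax plus a hypothetical example (the view and table names are illustrative):

```sql
-- Simplified syntax (optional clauses in brackets):
CREATE MATERIALIZED VIEW mv_name
[ BACKUP { YES | NO } ]
[ table_attributes ]
AS query;

-- Example: materialize the same enrollment summary as before.
CREATE MATERIALIZED VIEW course_enrollment_mv AS
SELECT course_id, COUNT(*) AS enrolled
FROM enrollments
GROUP BY course_id;

-- Later, bring the precomputed results up to date in one command:
REFRESH MATERIALIZED VIEW course_enrollment_mv;
```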

The BACKUP clause determines whether the data in the materialized view is backed up as part of your Redshift cluster snapshots. The table_attributes clause specifies the method by which the data in the materialized view is distributed.

Redshift Insert Performance Tuning

As the name suggests, the INSERT command in Redshift inserts a new row or rows into a table. However, many Redshift users have complained about slow Redshift insert speeds and performance issues. The Redshift insert performance tips in this section will help you get data into your Redshift data warehouse quicker.

The COPY Command

If you’re moving large quantities of information at once, Redshift advises you to use COPY instead of INSERT. The COPY command was created especially for bulk inserts of Redshift data.

According to Redshift’s official AWS documentation: “We strongly encourage you to use the COPY command to load large amounts of data. Using individual INSERT statements to populate a table might be prohibitively slow.”

The COPY command allows users to upload rows of data stored in Amazon S3, Amazon EMR, and Amazon DynamoDB, as well as via remote SSH connections. A single row moved using the COPY command has a maximum size of 4 megabytes.
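For illustration, a minimal COPY from S3 might look like the following (the bucket path and IAM role ARN are placeholders, not real resources):

```sql
-- Bulk-load CSV files from S3 into the students table.
COPY students
FROM 's3://example-bucket/students/'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
FORMAT AS CSV;
```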

Bulk Inserts

If for some reason the COPY command isn’t an option, you can still make your Redshift INSERT commands more efficient by using the bulk insert functionality.

The AWS documentation recommends that you use INSERT in conjunction with staging tables for temporarily storing the data that you’re working on.

For example, the following code creates a new staging table students_stage by copying all the rows from the existing students table:
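One way to do this is with a CTAS statement (the table name students_stage is illustrative):

```sql
-- Create a temporary staging table holding a copy of students.
CREATE TEMP TABLE students_stage AS
SELECT * FROM students;
```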

If the staging table already exists, you can also populate it with rows from another table. The code below takes all of the rows from the students table and copies them into the staging table students_stage:
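Assuming students_stage already exists with a matching schema, a single INSERT ... SELECT does the copy:

```sql
-- Bulk-copy every row from students into the existing staging table.
INSERT INTO students_stage
(SELECT * FROM students);
```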

Multi-Row Inserts

Performing a multi-row insert is another option if you need or prefer to use INSERT rather than COPY.

Multi-row inserts are faster than single-row inserts by the very nature of Redshift: because Redshift compresses data as it writes it, the fixed overhead of compression is amortized across many rows instead of being paid for every row individually. (Just like it makes no sense to drive your car a single block, due to the time it takes to start it up and find a parking space.)

Below is an example of a (very small) multi-row insert. The default value indicates that the field will be populated with the DEFAULT option for the table:
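A hypothetical multi-row insert (the students table and its columns are illustrative) might look like this:

```sql
-- Insert three rows in a single statement; DEFAULT tells Redshift
-- to use the column's DEFAULT option for that field.
INSERT INTO students (id, name, enrollment_year)
VALUES
  (101, 'Ada', 2023),
  (102, 'Grace', DEFAULT),
  (103, 'Edsger', 2024);
```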

Redshift Join Performance Tuning

SQL joins have a bad reputation of being slow, or at least slower than the alternative: using denormalization to avoid join operations entirely. We’re happy to report, however, that when it comes to Redshift join performance, this stereotype can be entirely avoided with the right tweaks and performance tunings.

KEY-Based Distribution Style

Redshift tables have four different options for distribution styles, i.e. how the rows in the table are distributed across the nodes in the cluster:

  • AUTO: Redshift automatically chooses the distribution style, based on the size of the table data.
  • EVEN: Rows are distributed across the nodes in a round-robin fashion, regardless of their values, so each node holds roughly the same number of rows per table.
  • KEY: Rows are distributed across nodes according to the values in a single column (the DISTKEY column).
  • ALL: Every node stores a copy of every row in the table.

The default option is AUTO, which often means an EVEN distribution style in practice. However, the EVEN distribution style isn’t optimal for Redshift join performance: when the rows being joined live on different nodes, they must be shuffled between those nodes, which requires slow network and I/O operations.

Instead, you can improve Redshift join performance by using the KEY-based distribution style for certain use cases. In the KEY-based distribution style, Redshift places rows with the same value in the DISTKEY column on the same node. This means that if you execute a Redshift join operation on the DISTKEY, it can take place within a single node, without needing to send data across the network.
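As a sketch, a hypothetical orders table could be distributed on the column it is most often joined on (for the join to stay node-local, the other table, here customers, should be distributed on the same column):

```sql
-- Distribute orders by customer_id so joins on that column
-- can be resolved within a single node.
CREATE TABLE orders (
  order_id    BIGINT,
  customer_id BIGINT,
  total       DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY (customer_id);

-- If customers shares DISTKEY (customer_id), this join avoids
-- shuffling rows across the network:
SELECT c.customer_id, SUM(o.total) AS lifetime_total
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
GROUP BY c.customer_id;
```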

Note that the KEY-based distribution style also has its limits: it pays off only for the tables and join columns involved in your most frequent, expensive queries. Using the KEY-based distribution style everywhere will result in a few unpleasant consequences:

  • If the values in the DISTKEY column are not evenly distributed, the rows will be unevenly distributed among the nodes in your Redshift cluster. This is a phenomenon known as “row skew.”
  • Because the rows are unevenly distributed, queries such as SELECT operations across all the nodes will be slower. The operation will complete more quickly on nodes with fewer rows, and these nodes will have to wait for the nodes with more rows. In other words, a cluster is only as strong as its weakest link.

Cross Joins

While they may appear innocent, cross joins can make your Redshift join performance horribly slow. A cross join is a join operation that takes the Cartesian product of two tables: each row in the first table is paired with each row in the second table. Although the cross join does have practical uses, in many cases, it occurs when joining two tables without applying any filters or join conditions.

Due to their extreme performance slowdown, cross joins should only be used when absolutely necessary. Cross joins often result in nested loops, which you can check for by monitoring Redshift’s STL_ALERT_EVENT_LOG for nested loop alert events.
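A quick way to spot nested loops is to query the system log table directly (the LIKE pattern below is an assumption about the alert text; adjust it to match what appears in your log):

```sql
-- Find recent queries that triggered a nested-loop-join alert.
SELECT query,
       TRIM(event)    AS event,
       TRIM(solution) AS solution,
       event_time
FROM stl_alert_event_log
WHERE event LIKE '%Nested Loop%'
ORDER BY event_time DESC
LIMIT 20;
```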

Redshift Update Performance Tuning

Last but not least, many users want to improve their Redshift update performance when updating the data in their tables. Under the hood, an update in Redshift is a two-step process: first, the original record is marked for deletion; second, a new record containing the updated values is written for every column of the row.

In many cases, you can perform Redshift updates faster by doing an “upsert” that combines the operations of inserting and updating data.

Upserts

The SQL standard defines a MERGE statement that inserts or updates records in a database table. If a matching record is not already present, the MERGE statement inserts it; if it is, the existing record is updated (if necessary) with the new information. This operation is also referred to as an UPSERT (update + insert).

However, even though MERGE is part of the official SQL standard, as of this writing it’s not yet implemented in Redshift. Instead, the Redshift AWS documentation encourages users to use a staging table to perform merge operations. This involves a multi-step process:

  1. Create a staging table that has the same schema as the original table.
  2. Upload the data that you want to “upsert” to the staging table.
  3. Find and delete rows in the original table that have the same primary key as any rows in the staging table.
  4. Insert the new rows from the staging table into the original table.

For best results with your Redshift update performance, follow the guidelines for upserts below:

  • The entire set of steps should be performed in an atomic transaction. This will prevent you from suffering data loss if the last step of the process fails.
  • If necessary, rebalance the data distribution among the nodes in your cluster after the upsert is complete. As mentioned above, uneven data distributions can slow down queries.
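Putting steps 3 and 4 together inside a single transaction, as recommended, might look like this (students and students_stage are hypothetical; assume the staging table matches the original schema and already holds the incoming rows):

```sql
BEGIN;

-- Step 3: delete rows in the original table that are about
-- to be replaced by rows from the staging table.
DELETE FROM students
USING students_stage
WHERE students.id = students_stage.id;

-- Step 4: insert the new and updated rows.
INSERT INTO students
SELECT * FROM students_stage;

END;
```

Because both statements sit inside one transaction, a failure at any point rolls back the delete as well, so no data is lost.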

In Conclusion

Struggling with how to optimize the performance of Redshift views, inserts, joins, and updates? As we’ve shown in this article, there’s no shortage of ways to do so:

  • Sluggish Redshift view performance can be fixed by using CTAS (CREATE TABLE AS SELECT) commands and materialized views.
  • Instead of moving rows one-by-one, move many of them at once using the COPY command, bulk inserts, or multi-row inserts.
  • Avoiding cross joins and switching to a KEY-based distribution style (as needed) can help improve Redshift join performance.
  • Perform “upserts” properly by wrapping the entire process in an atomic transaction and rebalancing the distribution of data once the operation is complete.

Here at Intermix.io, we know all about what it takes to get the most from your Redshift deployment. That’s why we’ve built an industry-leading analytics platform for Redshift cloud data warehouses. Intermix gives you crystal-clear insights into exactly what’s going on with Redshift: how your jobs are performing, who’s touching your data, the dependencies between queries and tables, and much more.

Sign up today for a free trial of Intermix, and discover why so many businesses are using Intermix to optimize their use of Amazon Redshift.

Mark Smallcombe
