Start Now Login
Improve Amazon Redshift COPY performance:  Don’t ANALYZE on every COPY

Improve Amazon Redshift COPY performance:  Don’t ANALYZE on every COPY

Introduction

One of the core challenges of using any data warehouse is the process of moving data to a place where the data can be queried. Amazon Redshift provides two methods to access data:

1- copy data into Redshift local storage by using the COPY command

2- use Amazon Redshift Spectrum to query S3 data directly (no need to copy it in)

This post highlights an optimization that can be made when copying data into Amazon Redshift.

The COPY Command

The Amazon Redshift COPY command loads data into a table. The files can be located in an S3 bucket, an Amazon EMR cluster, or a remote host that is accessed using SSH. The maximum size of a single input row from any source is 4 MB. Amazon Redshift Spectrum external tables are read-only. You can’t COPY to an external table.

The COPY command appends the new data to the table. In Amazon Redshift, primary keys are not enforced. This means that deduplication must be handled by your application.

The recommended way of de-duplicating records in Amazon Redshift is to use the UPSERT process.

Download the Top 14 Performance Tuning Techniques for Amazon Redshift


What is “UPSERT”?

UPSERT is a method of de-duplicating data when copying into Amazon Redshift. The UPSERT operation merges new records with existing records using primary keys. While some RDBMSs support a single “UPSERT” statement, Amazon Redshift does not support it. Instead, you should use a staging table for merging records.

Here is an example of an “UPSERT” statement for Amazon Redshift.

Automatic Encoding Detection

The default behavior of Redshift COPY command is to automatically run two commands as part of the COPY transaction:

  1. “COPY ANALYZE PHASE 1|2”  
  2. “COPY ANALYZE $temp_table_name”

Amazon Redshift runs these commands to determine the correct encoding for the data being copied. This may be useful when a table is empty. But in the following cases the extra queries are useless and thus should be eliminated:

  1. COPY into a temporary table (ie as part of an UPSERT)
  2. The table already has data in it . This is because encoding cannot be changed when a table already exists. So even if the COPY command determines that a better encoding style exists, it’s impossible to modify the encoding of the table without doing a deep copy operation.

Unnecessary Queries

In the below example, a single COPY command generates 18 ‘analyze compression’ commands and a single ‘copy analyze’ commands.

Extra queries can create performance issues for other queries running on Amazon Redshift. Extra queries may saturate the number of slots in a WLM queue, thus causing all other queries to have queue wait times.

The Solution

The solution is to adjust the COPY command parameters to add “COMPUPDATE OFF” and “STATUPDATE OFF”. These parameters will disable these features during “UPSERT”s. Here is an example of a “COPY” command with these options set.

 

Ready to start seeing into your data infrastructure?
Get started with a 14-day free trial, with access to the full platform

No Credit Card Required