Improve Amazon Redshift COPY performance: Don’t ANALYZE on every COPY

Introduction

One of the core challenges of using any data warehouse is moving data to a place where it can be queried. Amazon Redshift provides two methods to access data:

  1. Copy data into Redshift local storage by using the COPY command.
  2. Use Amazon Redshift Spectrum to query data in S3 directly, without copying it in.

This post highlights an optimization that can be made when copying data into Amazon Redshift.

The COPY Command

The Amazon Redshift COPY command loads data into a table. The files can be located in an S3 bucket, an Amazon EMR cluster, or a remote host that is accessed using SSH. The maximum size of a single input row from any source is 4 MB. Amazon Redshift Spectrum external tables are read-only. You can’t COPY to an external table.
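As a concrete starting point, here is a minimal COPY from S3 (the table name, bucket path, IAM role ARN, and file format are all illustrative):

    -- Load CSV files from an S3 prefix into an existing table.
    COPY users
    FROM 's3://my-bucket/users/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS CSV;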

The COPY command appends the new data to the table. In Amazon Redshift, primary keys are not enforced. This means that deduplication must be handled by your application.

The recommended way of de-duplicating records in Amazon Redshift is to use the UPSERT process.

What is “UPSERT”?

UPSERT is a method of de-duplicating data when copying into Amazon Redshift. The UPSERT operation merges new records with existing records using primary keys. While some RDBMSs support a single “UPSERT” statement, Amazon Redshift does not support it. Instead, you should use a staging table for merging records.

Here is a minimal sketch of an “UPSERT” for Amazon Redshift, assuming a target table users with primary key id and a matching staging table users_staging (all names and the S3 path are illustrative):
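    -- Load the new data into a temporary staging table that
    -- mirrors the target table's structure.
    CREATE TEMP TABLE users_staging (LIKE users);

    COPY users_staging
    FROM 's3://my-bucket/users/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS CSV;

    -- Merge inside one transaction: delete target rows that have a
    -- newer version in staging, then insert everything from staging.
    BEGIN TRANSACTION;

    DELETE FROM users
    USING users_staging
    WHERE users.id = users_staging.id;

    INSERT INTO users
    SELECT * FROM users_staging;

    END TRANSACTION;

    DROP TABLE users_staging;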

Automatic Encoding Detection

The default behavior of the Amazon Redshift COPY command is to automatically run two commands as part of the COPY transaction:

  1. “COPY ANALYZE PHASE 1|2”  
  2. “COPY ANALYZE $temp_table_name”

Amazon Redshift runs these commands to determine the correct encoding for the data being copied. This can be useful when a table is empty. But in the following cases, the extra queries are useless and should be eliminated:

  1. COPY into a temporary table (i.e., as part of an UPSERT).
  2. The table already has data in it. Encoding cannot be changed once a table exists, so even if the COPY command determines that a better encoding style exists, the table’s encoding cannot be modified without a deep copy operation (sketched after this list).
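If you do need to change encodings on a populated table, the deep copy looks roughly like this (the table, columns, and encoding choices are illustrative):

    -- Create a new table with the desired column encodings,
    -- copy all rows over, then swap the names.
    CREATE TABLE users_new (
        id    BIGINT       ENCODE az64,
        email VARCHAR(256) ENCODE lzo
    );

    INSERT INTO users_new SELECT * FROM users;

    DROP TABLE users;
    ALTER TABLE users_new RENAME TO users;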


Unnecessary Queries

In one example we observed, a single COPY command generated 18 ‘analyze compression’ commands and a single ‘copy analyze’ command.

These extra queries can create performance issues for other queries running on Amazon Redshift. They may saturate the slots in a WLM queue, causing all other queries to incur queue wait times.
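To check whether these hidden queries are running on your cluster, you can count them in the stl_query system table; here is a sketch (the one-day window is an arbitrary choice):

    -- Count automatically generated 'copy analyze' and
    -- 'analyze compression' queries from the last day.
    SELECT TRIM(querytxt) AS query_text,
           COUNT(*)       AS occurrences
    FROM stl_query
    WHERE starttime > DATEADD(day, -1, GETDATE())
      AND (querytxt ILIKE 'copy analyze%'
           OR querytxt ILIKE 'analyze compression%')
    GROUP BY TRIM(querytxt)
    ORDER BY occurrences DESC;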

The Solution

The solution is to add the “COMPUPDATE OFF” and “STATUPDATE OFF” parameters to the COPY command. These parameters disable the automatic compression analysis and the automatic statistics update during “UPSERT”s. Here is a sketch of a “COPY” command with these options set (the table, bucket path, and IAM role are illustrative):
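    -- COPY into a staging table without triggering the
    -- automatic encoding analysis or statistics update.
    COPY users_staging
    FROM 's3://my-bucket/users/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS CSV
    COMPUPDATE OFF
    STATUPDATE OFF;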

