Start Now Login
Improve Amazon Redshift COPY performance:  Don’t ANALYZE on every COPY

Improve Amazon Redshift COPY performance:  Don’t ANALYZE on every COPY

Introduction

One of the core challenges of using any data warehouse is the process of moving data to a place where the data can be queried. Amazon Redshift provides two methods to access data:

1- copy data into Redshift local storage by using the COPY command

2- use Amazon Redshift Spectrum to query S3 data directly (no need to copy it in)

This post highlights an optimization that can be made when copying data into Amazon Redshift.

The COPY Command

The Amazon Redshift COPY command loads data into a table. The files can be located in an S3 bucket, an Amazon EMR cluster, or a remote host that is accessed using SSH. The maximum size of a single input row from any source is 4 MB. Amazon Redshift Spectrum external tables are read-only. You can’t COPY to an external table.

The COPY command appends the new data to the table. In Amazon Redshift, primary keys are not enforced. This means that deduplication must be handled by your application.

The recommended way of de-duplicating records in Amazon Redshift is to use the UPSERT process.

What is “UPSERT”?

UPSERT is a method of de-duplicating data when copying into Amazon Redshift. The UPSERT operation merges new records with existing records using primary keys. While some RDBMSs support a single “UPSERT” statement, Amazon Redshift does not support it. Instead, you should use a staging table for merging records.

Here is an example of an “UPSERT” statement for Amazon Redshift.

Automatic Encoding Detection

The default behavior of Redshift COPY command is to automatically run two commands as part of the COPY transaction:

  1. “COPY ANALYZE PHASE 1|2”  
  2. “COPY ANALYZE $temp_table_name”

Amazon Redshift runs these commands to determine the correct encoding for the data being copied. This may be useful when a table is empty. But in the following cases the extra queries are useless and thus should be eliminated:

  1. COPY into a temporary table (ie as part of an UPSERT)
  2. The table already has data in it . This is because encoding cannot be changed when a table already exists. So even if the COPY command determines that a better encoding style exists, it’s impossible to modify the encoding of the table without doing a deep copy operation.

Unnecessary Queries

In the below example, a single COPY command generates 18 ‘analyze compression’ commands and a single ‘copy analyze’ commands.

Extra queries can create performance issues for other queries running on Amazon Redshift. Extra queries may saturate the number of slots in a WLM queue, thus causing all other queries to have queue wait times.

The Solution

The solution is to adjust the COPY command parameters to add “COMPUPDATE OFF” and “STATUPDATE OFF”. These parameters will disable these features during “UPSERT”s. Here is an example of a “COPY” command with these options set.

 

Related content
3 Things to Avoid When Setting Up an Amazon Redshift Cluster Apache Spark vs. Amazon Redshift: Which is better for big data? Amazon Redshift Spectrum: Diving into the Data Lake! What Causes "Serializable Isolation Violation Errors" in Amazon Redshift? A Quick Guide to Using Short Query Acceleration and WLM for Amazon Redshift for Faster Queries What is TensorFlow? An Intro to The Most Popular Machine Learning Framework Titans of Data with Mirko Novakovic - How Containers are Giving Rise to New Data Services Why We Built intermix.io - “APM for Data” 4 Simple Steps To Set-up Your WLM in Amazon Redshift For Better Workload Scalability World-class Data Engineering with Amazon Redshift - Training Announcing App Tracing - Monitoring Your Data Apps With intermix.io Have Your Postgres Cake with Amazon Redshift and eat it, too. 4 Real World Use Cases for Amazon Redshift 3 Steps for Fixing Slow Looker Dashboards with Amazon Redshift Zero Downtime Elasticsearch Migrations Titans of Data with Florian Leibert – CEO Mesosphere Building a Better Data Pipeline - The Importance of Being Idempotent The Future of Machine Learning in the Browser with TensorFlow.js Gradient Boosting Libraries — A Comparison Crowdsourcing Weather Data With Amazon Redshift The Future of Apache Airflow Announcing Query Groups – Intelligent Query Classification Top 14 Performance Tuning Techniques for Amazon Redshift Product Update: An Easy Way To Find The Cause of Disk Usage Spikes in Amazon Redshift How We Reduced Our Amazon Redshift Cost by 28%
Ready to start seeing into your data infrastructure?
Get started with a 14-day free trial, with access to the full platform

No Credit Card Required