Improve Amazon Redshift COPY performance: Don’t ANALYZE on every COPY
One of the core challenges of using any data warehouse is the process of moving data to a place where the data can be queried. Amazon Redshift provides two methods to access data:
1. Copy data into Redshift local storage by using the COPY command.
2. Use Amazon Redshift Spectrum to query S3 data directly (no need to copy it in).
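To illustrate the second method, here is a minimal Redshift Spectrum sketch. All of the names are placeholders: the IAM role, the external schema and database, the S3 location, and the table layout are assumptions for illustration.

```sql
-- Register an external schema backed by the AWS Glue data catalog
-- (role ARN and database name are placeholders).
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define an external table over Parquet files in S3
-- (bucket path and columns are placeholders).
CREATE EXTERNAL TABLE spectrum_schema.events (
    event_id   BIGINT,
    event_time TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/';

-- Queries read the files in S3 directly; nothing is copied into Redshift.
SELECT COUNT(*) FROM spectrum_schema.events;
```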
This post highlights an optimization that can be made when copying data into Amazon Redshift.
The COPY Command
The Amazon Redshift COPY command loads data into a table. The files can be located in an S3 bucket, an Amazon EMR cluster, or a remote host that is accessed using SSH. The maximum size of a single input row from any source is 4 MB. Amazon Redshift Spectrum external tables are read-only. You can’t COPY to an external table.
The COPY command appends the new data to the table. In Amazon Redshift, primary keys are not enforced. This means that deduplication must be handled by your application.
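A short sketch of what "primary keys are not enforced" means in practice; the table and values below are illustrative.

```sql
-- The PRIMARY KEY declaration is informational only in Redshift.
CREATE TABLE cities (
    id      INTEGER PRIMARY KEY,
    name    VARCHAR(64),
    zipcode VARCHAR(10)
);

-- Both inserts succeed, even though the rows share the same id.
INSERT INTO cities VALUES (1, 'Seattle', '98101');
INSERT INTO cities VALUES (1, 'Seattle', '98101');

-- Returns 2, not 1: the duplicate was not rejected.
SELECT COUNT(*) FROM cities;
```

The same applies to COPY: loading a file twice appends the rows twice, which is why deduplication must happen in your load process.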
The recommended way of deduplicating records in Amazon Redshift is to use the UPSERT process.
What is “UPSERT”?
UPSERT is a method of deduplicating data when copying into Amazon Redshift. The UPSERT operation merges new records with existing records by matching on primary keys. While some RDBMSs support a single "UPSERT" statement, Amazon Redshift does not. Instead, you should use a staging table to merge the records.
Here is an example of an “UPSERT” statement for Amazon Redshift.
-- Start transaction
BEGIN;
-- Create a staging table
CREATE TABLE my_staging (LIKE cities);
-- Load data into the staging table
-- (the S3 path and IAM role below are placeholders)
COPY my_staging (id, name, zipcode, state)
FROM 's3://my-bucket/cities/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole';
-- Update existing records
UPDATE cities
SET name = s.name, zipcode = s.zipcode
FROM my_staging s
WHERE cities.id = s.id;
-- Insert new records
INSERT INTO cities
SELECT s.* FROM my_staging s
LEFT JOIN cities ON s.id = cities.id
WHERE cities.id IS NULL;
-- Drop the staging table
DROP TABLE my_staging;
-- End transaction
COMMIT;
Automatic Encoding Detection
The default behavior of the Amazon Redshift COPY command is to automatically run two commands as part of the COPY transaction:
- “COPY ANALYZE PHASE 1|2”
- “COPY ANALYZE $temp_table_name”
Amazon Redshift runs these commands to determine the correct encoding for the data being copied. This is useful when the target table is empty. But in the following cases, the extra queries are useless and should be eliminated:
- COPY into a temporary table (i.e., as part of an UPSERT)
- COPY into a table that already contains data. Column encoding cannot be changed once a table exists, so even if the COPY command determines that a better encoding exists, the table's encoding cannot be modified without a deep copy operation.
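For context, the deep copy mentioned above looks like the following sketch. The table, columns, and encoding choices are illustrative; the pattern is: create a new table with the desired encodings, reload the data, and swap the tables.

```sql
-- Create a replacement table with explicitly chosen encodings
-- (encodings shown here are examples, not recommendations).
CREATE TABLE cities_new (
    id      INTEGER     ENCODE az64,
    name    VARCHAR(64) ENCODE lzo,
    zipcode VARCHAR(10) ENCODE bytedict
);

-- Reload the data into the newly encoded table.
INSERT INTO cities_new SELECT * FROM cities;

-- Swap the tables.
DROP TABLE cities;
ALTER TABLE cities_new RENAME TO cities;
```

Because this rewrites the entire table, it is far too expensive to do on every load, which is why the encoding analysis on an already-populated table is wasted work.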
In one example we observed, a single COPY command generated 18 'analyze compression' commands and a single 'copy analyze' command.
These extra queries can create performance issues for other queries running on Amazon Redshift. They may saturate the slots in a WLM queue, causing all other queries to incur queue wait times.
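One way to check whether queries are waiting in a WLM queue is to look at the STL_WLM_QUERY system table, which records queue and execution times in microseconds. The LIMIT and filter below are illustrative.

```sql
-- Queries with the longest WLM queue waits, converted to seconds.
SELECT query,
       service_class,
       total_queue_time / 1000000.0 AS queue_seconds,
       total_exec_time  / 1000000.0 AS exec_seconds
FROM stl_wlm_query
WHERE total_queue_time > 0
ORDER BY total_queue_time DESC
LIMIT 20;
```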
The solution is to add "COMPUPDATE OFF" and "STATUPDATE OFF" to the COPY command. COMPUPDATE OFF disables the automatic compression analysis, and STATUPDATE OFF skips the automatic statistics update, during "UPSERT"s. Here is an example of a "COPY" command with these options set.
-- Load data into the staging table
-- (the S3 path and IAM role below are placeholders)
COPY users_staging (id, name, city)
FROM 's3://my-bucket/users/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
COMPUPDATE OFF STATUPDATE OFF;