Troubleshooting Data Loading Errors in Amazon Redshift
Amazon Redshift is a feature-rich, enterprise-class data warehouse, but that power and complexity come at a price: when something breaks, troubleshooting can be painful. If you’ve ever encountered the dreaded “Error occurred while trying to execute a query” message in Redshift, you will understand.
Many Redshift administrators are all too familiar with data loading errors that can disrupt their ETL workflow—errors such as:
- Errors 6-7: The character is not a valid UTF-8 character.
- Error 1204: Input data exceeded the acceptable range for the data type.
- Error 1208: FLOAT data type format error.
- Error 1220: String contains invalid or unsupported UTF-8 codepoints.
Not only are these Redshift data loading errors mysterious and inscrutable, they can bring your ETL processes to a shuddering halt if you don’t know how to address them. So why do these Amazon Redshift data loading errors occur? What is their root cause, and how can you fix data loading errors in Amazon Redshift?
In this post, we’ll work through some of the most common data loading errors in Redshift, and help you troubleshoot.
Table of Contents
- Loading Data in Amazon Redshift
- Troubleshooting Data Loading Errors in Amazon Redshift
- Resolving Pesky Redshift Data Loading Errors
- Beyond Redshift Troubleshooting
Loading Data in Amazon Redshift
The “load” phase of ETL moves the information you’ve captured from a variety of source files and databases into your data warehouse. Amazon offers a list of best practices for loading data into Redshift. Restructuring your Redshift ETL architecture to follow these best practices may help resolve some performance issues and errors:
- Use the COPY command to load data in parallel, much more efficiently than using INSERT statements. COPY commands in Redshift can load data from Amazon S3, Amazon EMR, Amazon DynamoDB, or remote hosts over SSH. A single COPY command is capable of loading data from multiple files. If a COPY command is not possible for whatever reason, use techniques such as multi-row inserts and bulk inserts.
- Redshift does not include support for a single merge statement that updates and inserts information simultaneously. However, you can replicate this functionality using staging tables: first load the data into a staging table, then merge it into the target table by deleting the rows being replaced and inserting the new versions.
- Load data in sequential blocks according to sort key order. This will reduce the need to run the VACUUM command, which re-sorts rows and reclaims unused space in a table or database.
- Very large files and datasets should be compressed using tools such as gzip, lzop, or bzip2.
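The practices above can be sketched in SQL. This is a minimal illustration, not a drop-in script: the table names, S3 bucket, and IAM role below are hypothetical placeholders.

```sql
-- Load many files in parallel from an S3 prefix with a single COPY
-- (bucket, table, and IAM role are hypothetical placeholders).
COPY sales
FROM 's3://my-etl-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|'
GZIP;

-- If COPY is not an option, prefer one multi-row INSERT
-- over many single-row INSERTs.
INSERT INTO sales (sale_id, amount) VALUES
  (1, 19.99),
  (2, 5.49),
  (3, 42.00);

-- Emulate a merge (upsert) with a staging table:
BEGIN;
CREATE TEMP TABLE sales_staging (LIKE sales);

COPY sales_staging
FROM 's3://my-etl-bucket/sales_updates/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|'
GZIP;

-- Delete the rows being replaced, then insert the new versions.
DELETE FROM sales
USING sales_staging
WHERE sales.sale_id = sales_staging.sale_id;

INSERT INTO sales SELECT * FROM sales_staging;
END;
```

Wrapping the delete and insert in one transaction keeps readers from ever seeing the target table in a half-merged state.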
Troubleshooting Data Loading Errors in Amazon Redshift
The Redshift data loading best practices above may help with general performance issues, but not with localized errors that pertain to specific issues with your source data. So how can you perform root cause analysis to identify and resolve Redshift data loading errors?
Fortunately, Redshift includes a built-in troubleshooting system for dealing with exactly these problems. When you encounter a data loading error in Redshift, information about that error is logged in a system table called STL_LOAD_ERRORS. Below is an example query that you can run on the STL_LOAD_ERRORS table in order to retrieve information about your most recent Redshift data loading errors:
```sql
select d.query,
       substring(d.filename, 14, 20),
       d.line_number as line,
       substring(d.value, 1, 16) as value,
       substring(le.err_reason, 1, 48) as err_reason
from stl_loaderror_detail d,
     stl_load_errors le
where d.query = le.query
  and d.query = pg_last_copy_id();
```
Below is an example output of this query:
```
 query |     substring     | line |  value   |                err_reason
-------+-------------------+------+----------+--------------------------------------------------
   558 | allusers_pipe.txt |  251 | 251      | String contains invalid or unsupported UTF8 code
   558 | allusers_pipe.txt |  251 | ZRU29FGR | String contains invalid or unsupported UTF8 code
   558 | allusers_pipe.txt |  251 | Kaitlin  | String contains invalid or unsupported UTF8 code
   558 | allusers_pipe.txt |  251 | Walter   | String contains invalid or unsupported UTF8 code
```
In the above example, we pull from the following columns of STL_LOAD_ERRORS:
- query: The ID number of the query that caused the error.
- filename: The path to the input file that caused the error (represented in the output table as substring).
- line_number: The line in the input file that caused the error (represented in the output table as line).
- value: The contents of the line that caused the error.
- err_reason: The stated cause or explanation of the error.
Other useful columns from STL_LOAD_ERRORS include:
- userid: The ID number of the user who generated the error.
- tbl: The ID number of the table where the error occurred.
- err_code: The error code of the issue, represented as an integer.
- session: The session ID during which the error occurred.
- starttime: The timestamp at which the load began.
For more detailed information about an error, you can also query the STL_LOADERROR_DETAIL table. You may be interested in pulling information from the following columns in the STL_LOADERROR_DETAIL table:
- field: The specific field in the table where the error occurred.
- is_null: A Boolean value that describes whether or not the given value in the table is null.
Note that standard users can only use the STL_LOAD_ERRORS and STL_LOADERROR_DETAIL tables to view their own data. To view all of the table data, you will need to have superuser access.
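As a sketch, those field-level columns can be pulled on their own, scoped to the most recent COPY in the current session:

```sql
-- One row per failing field for the most recent COPY in this session.
select d.line_number,
       d.field,                       -- position of the field in the row
       d.colname,                     -- name of the target column
       d.is_null,                     -- whether the value was null
       substring(d.value, 1, 16) as value
from stl_loaderror_detail d
where d.query = pg_last_copy_id();
```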
In the example above, we’re presented with the error message “String contains invalid or unsupported UTF8 code.” However, nothing seems amiss on a cursory examination of the corresponding values. The likely culprit is an invisible, unsupported, or invalid UTF-8 character that has crept into the data. For example, older versions of Redshift only supported UTF-8 characters up to 3 bytes long; support has since been extended to the full 4-byte maximum for UTF-8 characters.
Resolving Pesky Redshift Data Loading Errors
Querying the STL_LOAD_ERRORS and STL_LOADERROR_DETAIL tables, and analyzing the results, is highly informative for a great many Redshift data loading errors. In some cases, however, you’ll need to dig a little deeper before you can successfully troubleshoot the problem.
What are the root causes that lead to Redshift data loading errors? Many such errors can be attributed to character encoding and data conversion issues. When troubleshooting data loading errors in Redshift, pay heed to the following concerns:
- The only character encoding supported by Redshift is UTF-8. If you have data in another encoding, it’s highly advisable to convert it before using it in your ETL workflow.
- The CHAR and VARCHAR data types are used to store fixed-length and variable-length strings, respectively. Make sure your columns are declared long enough to hold the input data, and remember that the declared length is specified in bytes, not characters. For example, “char(12)” creates a fixed-length string of 12 bytes, while “varchar(12)” creates a variable-length string with a maximum length of 12 bytes. The latter can hold at most 12 1-byte characters, 6 2-byte characters, 4 3-byte characters, or 3 4-byte characters.
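A quick way to see the bytes-versus-characters distinction, and to hunt for the multibyte values that often trigger these errors, is to compare LEN() (which counts characters) with OCTET_LENGTH() (which counts bytes). The staging table and column in the second query are hypothetical:

```sql
-- 'café' is 4 characters but 5 bytes in UTF-8 ('é' encodes as 2 bytes),
-- so it needs at least a varchar(5), even though LEN() reports 4.
select len('café') as characters, octet_length('café') as bytes;

-- Rows where byte length exceeds character length contain multibyte
-- UTF-8 data: a good place to start when hunting encoding errors.
select username
from users_staging
where octet_length(username) > len(username);
```

This won’t prove a character is invalid, but it narrows a large load down to the handful of rows worth inspecting byte by byte.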
In particular, making adjustments to the parameters of the COPY command can help you deal with unexpected data loading errors. Some useful COPY parameters are as follows:
- ACCEPTINVCHARS: This parameter enables the COPY command to load data into VARCHAR columns, even if the input string contains invalid UTF-8 characters. Each invalid character is replaced with a default replacement character, which you can specify using the 'replacement_char' argument (the default replacement is the question mark character '?').
- DATEFORMAT: This parameter can help resolve the error “Date format is invalid.” in Amazon Redshift. By default, Redshift interprets dates in the format 'YYYY-MM-DD'. However, you can tell Redshift to automatically recognize different date formats by using the 'auto' argument.
- TIMEFORMAT: Similarly, this parameter can help resolve the error “Timestamp format is invalid.” By default, Redshift interprets timestamps in two possible formats: 'YYYY-MM-DD HH:MI:SS' for TIMESTAMP database columns, or 'YYYY-MM-DD HH:MI:SSOF' for TIMESTAMPTZ database columns, where 'OF' represents the offset from Coordinated Universal Time (UTC). By setting the 'auto' argument for the TIMEFORMAT parameter, Redshift will automatically recognize timestamps in different formats.
- BLANKSASNULL, EMPTYASNULL: Redshift’s default behavior is to load fields that are blank or empty in CHAR and VARCHAR columns as is. Using these two parameters will instead load blank and empty fields as NULL in Redshift. This can make it easier to detect problematic whitespace characters. (Note that blank and empty fields for other data types, such as INT and FLOAT, are already loaded as NULL in Redshift.)
- IGNOREHEADER: Some data loading errors in Redshift occur when the system erroneously interprets the header of a file (e.g. a CSV file) as data. This parameter instructs the COPY command to ignore the file header. You can set the number of rows to ignore with the 'number_rows' argument.
- FILLRECORD: In some cases, your source files and databases in ETL may not be complete. The FILLRECORD parameter enables you to work with incomplete ETL sources while avoiding data loading errors. This parameter fills missing columns at the end of a row with either zero-length strings or NULL values, depending on the data type of the column.
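Several of these parameters can be combined in a single COPY. The sketch below assumes a hypothetical table, bucket, and IAM role; each option maps to a bullet above:

```sql
COPY users
FROM 's3://my-etl-bucket/users.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV
IGNOREHEADER 1        -- skip the header row of the CSV
ACCEPTINVCHARS '?'    -- replace invalid UTF-8 characters with '?'
DATEFORMAT 'auto'     -- recognize common date formats automatically
TIMEFORMAT 'auto'     -- recognize common timestamp formats automatically
BLANKSASNULL          -- load whitespace-only fields as NULL
EMPTYASNULL           -- load empty fields as NULL
FILLRECORD;           -- pad missing trailing columns per column type
```

Start permissive while debugging, then tighten: once STL_LOAD_ERRORS comes back clean, remove the options you no longer need so bad data fails loudly instead of slipping in as '?' or NULL.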
Beyond Redshift Troubleshooting
Efficiently isolating and resolving Redshift data loading errors is a crucial part of the “load” stage of ETL. However, it’s only one piece of the puzzle when it comes to making your Redshift ETL processes run smoothly.
That’s why we’ve built a powerful analytics and performance monitoring solution for your ETL processes and data warehouse: intermix.io. The intermix.io platform helps you track the flow of your data throughout Amazon Redshift and Amazon Web Services.
Tired of spending hours or days debugging frustrating issues? intermix.io lets you perform root cause analysis quickly and easily, fixing your problems faster so that you can keep enjoying the data-driven insights your business needs. We provide a single pane of glass for analysts and IT teams, giving you ultimate clarity and visibility into your ETL workflow.
Looking for more advice on how to structure your ETL architecture in Amazon Redshift? Check out our comprehensive guide “Top 14 Performance Tuning Techniques for Amazon Redshift.”