
Amazon Redshift is a feature-rich, enterprise-class data warehouse, but that power and complexity come at a price: troubleshooting Redshift issues can be a challenge. If you’ve ever encountered the dreaded “Error occurred while trying to execute a query” message in Redshift, you’ll know exactly what we mean.

Many Redshift administrators are all too familiar with data loading errors that can disrupt their ETL workflow—errors such as:

Not only are these Redshift data loading errors mysterious and inscrutable, they can bring your ETL processes to a shuddering halt if you don’t know how to address them. So why do these Amazon Redshift data loading errors occur? What is their root cause, and how can you fix data loading errors in Amazon Redshift?

In this post, we’ll work through some of the most common data loading errors in Redshift, and help you troubleshoot.  

Table of Contents

  1. Loading Data in Amazon Redshift
  2. Troubleshooting Data Loading Errors in Amazon Redshift
  3. Resolving Pesky Redshift Data Loading Errors
  4. Beyond Redshift Troubleshooting

Loading Data in Amazon Redshift

The “extract” phase of ETL requires you to capture information from a variety of source files and databases. Amazon offers a list of best practices for loading data in Redshift. Restructuring your Redshift ETL architecture to follow these best practices may help resolve some performance issues and errors:
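Among these best practices are compressing your source files, splitting data across multiple files so every slice can load in parallel, and issuing a single COPY command per table, optionally driven by a manifest. As a rough illustration of a couple of these practices, a load might look like the sketch below (the table, bucket, and IAM role names are placeholders, not values from this post):

    COPY my_target_table
    FROM 's3://my-bucket/loads/2019-07-01.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'
    MANIFEST          -- the manifest lists the (multiple, compressed) files to load
    GZIP              -- source files are gzip-compressed
    DELIMITER '|';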

Troubleshooting Data Loading Errors in Amazon Redshift

The Redshift data loading best practices above may help with general performance issues, but not with localized errors that pertain to specific issues with your source data. So how can you perform root cause analysis to identify and resolve Redshift data loading errors?

Fortunately, Redshift includes a built-in troubleshooting system for dealing with exactly these problems. When you encounter a data loading error in Redshift, information about that error is logged in a system table called STL_LOAD_ERRORS. Below is an example query that you can run on the STL_LOAD_ERRORS table in order to retrieve information about your most recent Redshift data loading errors:
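A minimal sketch of such a query, using standard STL_LOAD_ERRORS columns, might look like this:

    -- Show the most recent load errors, newest first
    SELECT starttime, filename, line_number, colname, type,
           err_code, err_reason, raw_field_value
    FROM stl_load_errors
    ORDER BY starttime DESC
    LIMIT 10;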

Below is an example output of this query:

In the above example, we pull from the following columns of STL_LOAD_ERRORS:

Other useful columns from STL_LOAD_ERRORS include:

For more detailed information about an error, you can also query the STL_LOADERROR_DETAIL table. You may be interested in pulling information from the following columns in the STL_LOADERROR_DETAIL table:

Note that standard users can only use the STL_LOAD_ERRORS and STL_LOADERROR_DETAIL tables to view their own data. To view all of the table data, you will need to have superuser access.
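For example, a minimal sketch that joins the two tables to pull the offending field values for recent load errors (the column choices here are illustrative):

    -- Pair each load error with its field-level diagnostics
    SELECT d.query, d.filename, d.line_number, d.colname, d.type, d.value
    FROM stl_loaderror_detail d
    JOIN stl_load_errors e ON d.query = e.query
    ORDER BY e.starttime DESC
    LIMIT 20;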

In the example above, we’re presented with the error message “String contains invalid or unsupported UTF8 code.” However, on a cursory examination, nothing seems to be amiss with the corresponding values. The source of the issue is likely an invisible, unsupported, or invalid UTF-8 character that has been introduced into the data. For example, older versions of Redshift only supported UTF-8 characters with a maximum length of 3 bytes; this has since been extended to the 4-byte maximum for UTF-8 characters.

Resolving Pesky Redshift Data Loading Errors

Querying the STL_LOAD_ERRORS and STL_LOADERROR_DETAIL tables, and analyzing the results, is highly informative for a great many Redshift data loading errors. In some cases, however, you’ll need to dig a little deeper before you can successfully troubleshoot the problem.

What are the root causes that lead to Redshift data loading errors? Many such errors can be attributed to character encoding and data conversion issues. When troubleshooting data loading errors in Redshift, pay heed to the following concerns:

In particular, making adjustments to the parameters of the COPY command can help you deal with unexpected data loading errors. Some useful COPY parameters are as follows:
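MAXERROR, IGNOREHEADER, ACCEPTINVCHARS, and TRUNCATECOLUMNS are common examples. A hypothetical load that relaxes error handling while you investigate bad records might look like the following sketch (table, bucket, and role names are placeholders):

    COPY my_target_table
    FROM 's3://my-bucket/loads/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'
    CSV
    IGNOREHEADER 1        -- skip a header row in each file
    ACCEPTINVCHARS '?'    -- replace invalid UTF-8 characters instead of failing
    TRUNCATECOLUMNS       -- truncate values that exceed the column length
    MAXERROR 10;          -- tolerate up to 10 bad rows before aborting the load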

Beyond Redshift Troubleshooting

Efficiently isolating and resolving Redshift data loading errors is a crucial part of the “extract” stage of ETL. However, it’s only one piece of the puzzle when it comes to making your Redshift ETL processes run smoothly.

That’s why we’ve built a powerful analytics and performance monitoring solution for your ETL processes and data warehouse: intermix.io. The intermix.io platform helps you track the flow of your data throughout Amazon Redshift and Amazon Web Services.

Tired of spending hours or days debugging frustrating issues? intermix.io lets you perform root cause analysis quickly and easily, fixing your problems faster so that you can keep enjoying the data-driven insights your business needs. We provide a single pane of glass for analysts and IT teams, giving you ultimate clarity and visibility into your ETL workflow.

Looking for more advice on how to structure your ETL architecture in Amazon Redshift? Check out our comprehensive guide “Top 14 Performance Tuning Techniques for Amazon Redshift.”

Amazon Redshift was introduced in 2012 as the first cloud data warehouse, and it quickly became the fastest-growing service among Amazon Web Services. Amazon Redshift and the ETL tools built around it forever changed how teams work with analytical data. Let’s examine in detail what’s behind the mysterious ETL and ELT abbreviations and why they are so important.

ETL stands for Extract, Transform, Load: you extract data from source systems, transform it into the shape your analytics require, and then load it into the data warehouse.

In recent years, there has been a rise in ELT platforms, which push the raw data directly into the data warehouse and transform it there. This approach is quick and simple to set up, but it comes with its limitations. There is plenty of content out there on ETL vs. ELT; here is a good overview.

ETL & ELT Tools for Amazon Redshift

With an ever-growing number of ETL platforms on the market, it can be difficult to choose one. Below is an overview of the top modern ETL and ELT tools for Amazon Redshift.

Fivetran

Fivetran architecture
Image source: https://fivetran.com/docs/getting-started/architecture 

Fivetran is an easy-to-use cloud service that can help you gather all your data into your warehouse in minutes without any coding. Fivetran supports a wide range of data sources.

Fivetran automatically cleans and normalizes your data, organizes it into standardized schemas, and then routes it to your warehouse. Once the data from a connected source lands in your warehouse, you can apply custom transformations to it via SQL.

Fivetran is a scalable and secure solution that meets PCI, HIPAA, GDPR and SOC 2 criteria with advanced security features like SSH tunnels.

SimilarWeb Category Rank: 15,200

Pricing: Pricing available on request.

Xplenty

Xplenty is a cloud-based ETL and ELT platform that easily unites multiple data sources. The Xplenty platform offers a simple, intuitive visual interface for building data pipelines between a large number of sources and destinations.

While Xplenty supports both ETL and ELT workflows, its powerful on-platform transformation layer makes Xplenty an ideal choice for customers looking to take the traditional ETL approach.

Xplenty is somewhat unique on this list in that it supports sending data to all major data warehouses as well as databases and Salesforce. This flexibility allows Xplenty to scale well across many different teams within a given company.

When it comes to security, the ETL approach is the more secure of the two, giving customers complete control over their data. Xplenty is SOC 2 compliant and offers a Field Level Encryption feature to ensure customers can adhere to compliance requirements and best practices.

SimilarWeb Category Rank: 8,219

Pricing: Pricing available on request.

Alooma

alooma dashboard
Image source: https://www.alooma.com/ 

Alooma is an enterprise data pipeline platform, recently acquired by Google. It is a scalable and secure cloud-based ETL tool that provides dozens of native integrations with various data sources.

One of the unique features of Alooma is Code Engine. This enables you to enrich and customize data with Python scripts.

Alooma Mapper provides automatic data mapping from any source to a specified destination. When an automatic mapping is not enough, you can make custom mappings in a simple and intuitive UI.

However, with the Google acquisition, Alooma has announced that they will end support for Amazon Redshift in 2020.

SimilarWeb Category Rank: 7,306

Pricing: Pricing available on request.

Stitch

stitch dashboard
Image source: https://www.stitchdata.com/docs/getting-started/set-up-stitch-data-pipeline 

Stitch is a developer-focused tool for rapidly moving data. It is built on top of the Singer open-source ETL framework and provides a rich collection of community-driven integrations. If you can’t find the integration you need, you can create it yourself or commission it from one of Stitch’s implementation partners. Stitch provides rich customization options for replication, enterprise-grade security, error handling, and advanced reports.

Stitch can be difficult to learn for non-technical users, but high-quality documentation and almost unlimited platform extensibility compensate for this drawback.

SimilarWeb Category Rank: 4,906

Pricing: From free plan (including 5 million rows per month and selected free integrations) to $1250 per month.

Talend 

talend dashboard

Talend Cloud Integration is a feature-rich solution for quickly synchronizing your data between SaaS and on-premises applications. It also provides cloud-based data preparation tools to access, clean, transform, and enrich data easily.

Talend allows easy design of cloud-to-cloud and hybrid integration workflows. It automates provisioning of your warehouse. Talend Cloud Integration provides powerful features for data integration, big data integration, and application integration. 

Talend was named a Leader in the 2019 Gartner Magic Quadrant for Data Integration Tools.

SimilarWeb Category Rank: 1,567

Pricing: Pricing available on request.

Blendo

blendo dashboard
Image source: https://app.blendo.co/ 

Blendo is a simple ETL and ELT data integration tool with a user-friendly web interface, built especially for non-technical users. Just choose one of dozens of ready-made data connectors and create ELT data integrations from almost any data source. Blendo will collect, detect, and optimize data according to your data warehouse.

Blendo lacks some database connectors, but the service is actively developing and constantly adding new data sources.

SimilarWeb Category Rank: 13,890

Pricing: From $125 per month for the standard package to $1,000 per month for the advanced package.

Hevo Data

hevodata dashboard
Image source: https://hevodata.com/blog/building-data-pipelines-hevo/ 

Hevo is a data integration platform that allows you to quickly set up your warehouse, bringing data from any source to any destination without writing any code. Hevo provides a user-friendly point-and-click interface to set up data integration logic in minutes.

Hevo supports hundreds of data sources and destinations. They can be integrated in real time with Hevo’s Real-time Streaming Architecture.

For advanced data processing and enrichment, Hevo allows writing custom data transformation scripts in Python.

Hevo provides automatic schema detection and mapping. With a robust Schema Registry Module, Hevo intelligently detects the schema changes of incoming data. It notifies you immediately with Real-time Alerts over email or Slack. 

For advanced users, Hevo also provides granular activity logs and version control, as well as extensive APIs to manage data pipelines programmatically.

SimilarWeb Category Rank: 14,248

Pricing: From $499 per month for the Starter plan to $999 per month for the Business plan. Enterprise plan price is available on request.

ETLeap

ETL Leap Dashboard
Image source: https://etleap.com/product/ 

ETLeap is an ETL tool for engineering, analytics, and data science teams. It helps build data pipelines and data warehouses without friction.

With ETLeap you can integrate with any of 50+ supported data sources. ETLeap will monitor and maintain your data pipelines for availability and completeness.

You can control your data pipelines over the intuitive dashboard. The interactive data wrangler automatically guesses how to parse incoming data and enables adjustments. Optionally, you can apply custom data transformations with scripts in any language.

ETLeap provides a secure and flexible ETL service. ETLeap is HIPAA-compliant and EU-US privacy shield certified. It provides end-to-end encryption and single sign-on support. You can run it as a hosted solution (SaaS) or in your VPC. 

SimilarWeb Category Rank: 71,974 

Pricing: Pricing available on request.

Data Virtuality

Data Virtuality dashboard
Image source: https://datavirtuality.com/connectors/ 

Data Virtuality is an advanced data integration solution with a user-friendly UI that centralizes data from multiple sources.

Data Virtuality supports 200+ data sources. It offers a virtual data layer for modeling and transforming data with SQL on the fly.

Data Virtuality enables configuring replication settings flexibly. It has pre-built templates for retrieving data and customizable data pipelines. It can be hosted either in the cloud or on-premise. You can ensure data is optimally integrated with distributed join optimizations, dynamic cost, and rule-based optimizations.

SimilarWeb Category Rank: 22,810

Pricing: From $249 per month for the Starter bundle to $599 per month for the Large bundle. Enterprise bundle pricing is available on request.

FlyData

flydata dashboard

FlyData allows you to replicate data changes in your relational database or CSV/JSON files directly to Amazon Redshift.

FlyData supports a limited list of data sources, but syncs them in real time right into Amazon Redshift. FlyData ensures 100% accuracy with each load, so your data is always up to date.

FlyData offers SecureTunnel VPN to securely access your database behind a firewall, and it provides an error-handling system optimized specifically for Amazon Redshift.

SimilarWeb Category Rank: 22,312

Pricing: From $199 per month to $4667 per month.

Informatica

Informatica Cloud provides codeless, optimized integration to hundreds of applications and data sources, both on-premises and in the cloud. Informatica was named a Leader by Gartner in several markets, including the Magic Quadrant for Enterprise Integration Platform as a Service, the Magic Quadrant for Data Integration Tools, and the Magic Quadrant for Data Quality Tools.

Informatica Cloud has everything you may need for data integration projects and initiatives. Informatica Cloud provides several hundred connectors to databases, cloud data lakes, on-premises and SaaS applications.

Informatica Cloud allows you to build advanced integrations quickly and run them at scale. You can use recommendations for automated parsing driven by the CLAIRE™ engine, or build complex data integrations using the mapping designer with out-of-the-box advanced data integration transformations.

Since this is a large product suite, its initial setup can be difficult, and it will take time to learn how to use all of its capabilities to their full potential.

SimilarWeb Category Rank: 1,165

Pricing: Informatica provides various pricing plans starting at $2,000/month.

Matillion

matillion dashboard
Image source: https://www.matillion.com/events/etl-vs-elt-whats-big-difference/ 

Matillion is a powerful and easy-to-use cloud ETL/ELT solution. It has a huge selection of pre-built connectors out-of-the-box. It provides an intuitive interface to visually orchestrate sophisticated data workflows.

A simple and intuitive UI allows you to get started quickly, but Matillion offers much more than that: dozens of Transformation and Orchestration Components help perform advanced data transformations, and you can orchestrate your ETL processes from start to finish in a graphical user interface.

Matillion provides a Generic API connector for integrating with almost any JSON- or XML-based API. The component converts the API into a pseudo-SQL dialect, which can then be queried in a similar way to any normal database.

Ready-made components are the most convenient option, but with Matillion you can also create your own Python or Bash scripts to transform and enrich data.

SimilarWeb Category Rank: 7,216

Pricing: Pricing on Matillion depends on instance size, from “Medium” at $1.37 per hour, to “XLarge” at $5.48 per hour.

Conclusion

Modern ETL tools have a lot to offer. They benefit users with completely different backgrounds and needs. You can find a simple and easy-to-use solution for quickly integrating data from several sources. You can also create a complex data pipeline. We hope this review helps you find a solution for your data pipelines.

In this blog post, we are going to cover the difference between Amazon Redshift vs Amazon Redshift Spectrum vs Amazon Athena vs Amazon Aurora (you probably guessed that one…) along with a practical example of when you would use each of these technologies.

We’ll also give a high-level overview of database concepts. Let’s get into it.


Database Concepts

When dealing with databases, you will often hear two terms: OLTP (online transaction processing), which describes the day-to-day transactional workload of an application, and OLAP (online analytical processing), which describes the analytical queries you run over that data.

Engineers love acronyms, and if you just keep “OLAP” and “OLTP” in mind, you can have a seat at the table and look smart.

A Practical Example: United Airlines

Let’s use United Airlines as an example to describe what the two acronyms mean.

United Airlines sells tickets for their flights. When a customer purchases a ticket on United’s website or mobile app, United needs to keep a record of that purchase in their production database. 

They will use an OLTP style database to store the record.

At the end of the month, the CFO at United wants to know, “How many tickets did we sell this month?” To get the answer, they could query the production database. However, those queries would put more load onto the production database and slow it down.

That’s not desirable, as a slow database will have a negative impact on selling tickets.

And so the answer is to use an analytics database or data warehouse that holds a replica of all transactions in the production database. The process of moving data from production to analytics is called “ETL”: you extract data from the production database, transform it, and load it into the warehouse.

Once the data is in the data warehouse, you can go crazy with even the most complex queries (e.g. “how many tickets did we sell in France with our iOS App in the 2nd week of July, and which ones came from net new customers?”), without affecting the production database.

Data Lakes and Serverless Query Engines

Back to our United Airlines example. Over time, the data volume adds up, and it gets expensive to store a copy of all historic ticket sales in the data warehouse. It’s also not very useful to have data from e.g. 3 or 5 years ago in your warehouse, when all you really care about is the last fiscal month, quarter and year.

And at that point, you move your data from your warehouse into your data lake, where storage is about 5x cheaper.

Now say there’s the occasional ask for historic data – “how many tickets did we sell 5 years ago in July, and how does that compare with July this year?”. And so you either move the data back into your warehouse to answer that question, or you use a query engine that can query the data directly in your data lake, and / or maybe join it with data in your warehouse. 

A query engine doesn’t store data – it just completes the job of running the query. The “serverless” part means you don’t need to spin up a server to run the engine.

And with that background, we can now answer the question.

Amazon Redshift vs. Redshift Spectrum vs. Amazon Athena vs Amazon Aurora

amazon redshift vs amazon redshift spectrum vs amazon aurora

All four are AWS products, and we’ll add Amazon S3 into the mix for a complete answer.

In our United Airlines example, United would use:

In short, you end up with a stack where Aurora is your production database, S3 your data lake with a long term history of your production data, and then you have a choice of three AWS products to run analytics (Redshift, Redshift Spectrum, Athena) on top of that production data. 

So that, of course, begs the question: when do you use which product for your analytics?

Putting it All Together – A Cloud-based Analytics Stack

Which product to use is a function of the cost, complexity and execution speed for your workloads and queries. 

Amazon Redshift

Amazon Redshift excels at running complex analytic queries, joins, and aggregations over large datasets because it leverages high-performance local disks, sophisticated query execution, and join-optimized data formats. Redshift is so powerful that you can run the “T” part of ETL within the warehouse itself, rather than in an external processing engine like Spark or Hadoop. In fact, it’s a shift from “ETL” to “ELT”, which is fodder for an entire blog post, but a good concept to keep in mind. If execution speed for queries and transformations is of the essence, then Amazon Redshift is the way to go.

But you do have to load data into your warehouse, with a data engineer or DBA to manage your Redshift cluster against the storage and CPU requirements you need. And at its cheapest pricing (dense storage nodes with 3-year RIs), Redshift runs at ~$1,000 / TB / Year, about 4x as expensive as S3, which runs at about $250 / TB / Year.

Redshift Spectrum

That’s where Redshift Spectrum comes into play. With Redshift Spectrum, Amazon Redshift users can take advantage of inexpensive S3 storage and still scale out to pull, filter, aggregate, group and sort data. 

Because Spectrum is serverless, there’s nothing to provision or manage. Pricing for Spectrum is $5 per terabyte of data scanned; combined with S3 storage being cheaper than Redshift storage, it can be a more cost-efficient proposition than keeping all of your data in a Redshift cluster. You pay only for the queries you run, based on the amount of data they actually scan.
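To give a flavor of how this looks in practice, here is a minimal sketch of exposing S3 data to Redshift through Spectrum and querying it (the schema, database, table, bucket, and IAM role names are hypothetical):

    -- Register an external schema backed by the AWS Glue Data Catalog
    CREATE EXTERNAL SCHEMA spectrum_schema
    FROM DATA CATALOG
    DATABASE 'spectrum_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- Define an external table over Parquet files sitting in S3
    CREATE EXTERNAL TABLE spectrum_schema.ticket_sales (
      sale_id     BIGINT,
      customer_id BIGINT,
      sale_date   DATE,
      amount      DECIMAL(10,2)
    )
    STORED AS PARQUET
    LOCATION 's3://my-bucket/ticket-sales/';

    -- Query the S3 data directly; you are billed for the bytes scanned
    SELECT COUNT(*), SUM(amount)
    FROM spectrum_schema.ticket_sales
    WHERE sale_date >= '2019-07-01';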

The trade-off is that Redshift Spectrum queries do run slower than queries in an Amazon Redshift cluster, mainly because of data movement between S3 and the cluster. But if you’re fine with that trade-off, and the priority is cost, then a combination of S3 / Spectrum is a great choice. 

But you do have to watch how frequently your Spectrum queries run, and how much they cost you. If scanning your data with Spectrum costs more than just storing it in Redshift, it’s time to move that data back into the cluster.
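As a rough back-of-the-envelope example using the numbers above: a query that scans a 1 TB table once a day costs about $150 per month at $5 per terabyte scanned, which is already more than the roughly $83 per month (~$1,000 / TB / year) it would cost to keep that terabyte on dense storage nodes inside the cluster.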

Amazon Athena

Amazon Athena follows the same logic as Spectrum, except that you go all-in on serverless and skip the warehouse entirely. That scenario usually only works for simple, interactive queries that don’t require large-scale aggregations.

But none of these options are either/or choices. We see all three analytics engines deployed with customers, depending on the use case.

Properly managing storage utilization is critical to performance and optimizing the cost of your Amazon Redshift cluster. We’ve talked before about how important it is to keep an eye on your disk-based queries, and in this post we’ll discuss in more detail the ways in which Amazon Redshift uses the disk when executing queries, and what this means for query performance.

Amazon Redshift uses storage in two ways during query execution:

  1. Disk-based Queries. When a query runs out of memory, the overflow “spills” to the disk and the query goes “disk-based”.
  2. Intermediate Storage. When a query needs to save the results of an intermediate operation to use as input for a future operation.

Excessive storage use impacts your cluster because:

  1. Query performance suffers, because disk is 100 times slower than memory.
  2. Other queries will be slower, because increased I/O impacts the commit queue which is a shared resource.
  3. The cluster might hit 100% disk utilization, causing queries to fail or requiring the provisioning of more nodes.

The worst case is (3), when the cluster fills up. It can happen to all nodes in a cluster at once, or it can start with just one node and then propagate through the entire cluster, as the image below shows.

disk utilization by node, Disk-based Queries

So let’s look into what we can do to fix and prevent this type of situation. 

The WLM and Disk-Based Queries

If you’re not already familiar with how Redshift allocates memory for queries, you should first read through our article on configuring your WLM.

The gist is that Redshift allows you to set the amount of memory that every query should have available when it runs. This value is defined by allocating a percentage of memory to each WLM queue, which is then split evenly among the number of concurrency slots you define.  When a query executes, it is allocated the resulting amount of memory, regardless of whether it needs more (or less). Hence allocating too much memory is wasteful (since each node in the cluster obviously has finite memory), whereas allocating too little memory can cause queries to spill to disk.
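As a rough sketch of how you might inspect and adjust that allocation (the slot count below is illustrative; user-defined queues start at service class 6):

    -- Concurrency slots and working memory (MB per slot) for each user-defined queue
    SELECT service_class, num_query_tasks AS concurrency_slots,
           query_working_mem AS working_mem_mb_per_slot
    FROM stv_wlm_service_class_config
    WHERE service_class >= 6;

    -- Temporarily claim three slots' worth of memory for a heavy query in this session
    SET wlm_query_slot_count TO 3;
    -- ... run the memory-hungry query here ...
    SET wlm_query_slot_count TO 1;   -- back to the default of one slot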

Queries which overflow their allocated WLM memory are “disk-based”. These queries usually suffer from significantly degraded performance since disk I/O is orders of magnitude slower than memory I/O. 
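One way to spot them is the SVL_QUERY_SUMMARY system view, which flags query steps that went disk-based; a minimal sketch:

    -- Recent query steps that spilled to disk, and the working memory they had
    SELECT query, step, label, rows, workmem, is_diskbased
    FROM svl_query_summary
    WHERE is_diskbased = 't'
    ORDER BY workmem DESC
    LIMIT 20;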

There are six types of internal operations that Redshift can spill to disk when executing a query:

  1. Aggregations
  2. Hashing for joins
  3. Saving intermediate rows for future query steps
  4. Sorting
  5. Removing duplicates from intermediate or final results (unique)
  6. Window functions

If any of these operations are processing more rows (i.e. more bytes) than will fit into allocated memory, Redshift has to start swapping data out to disk, resulting in a significant slowdown of the query.

AWS recommends that you keep the percentage of disk-based queries to under 10%.  On our own fleet of clusters, we’re usually running well under one percent:

disk-based queries, Disk-based Queries

How to Prevent Queries From Going Disk-based

  1. Allocate more memory to the query. There are two approaches to this: increase the memory allocated to the WLM queue the query runs in, or have the query temporarily claim more concurrency slots (and therefore more memory) in that queue.

Within the intermix.io dashboard, viewing Recommendations for an individual query will surface the exact touched tables and how to update them, as well as how much memory the query used and the amount of memory capacity in the WLM queue.

Disk-based Queries

Intermediate Storage

Redshift also uses the disks in each node for another type of temporary query data called “Intermediate Storage”, which is conceptually unrelated to the temporary storage used when disk-based queries spill over their memory allocation.  

Intermediate Storage is used when Redshift saves the results of an intermediate operation to disk to use as input for a future operation. Intermediate Storage can become important if your query stores a large amount of data between query operations, since that storage may cause your cluster to run out of disk space. It also introduces additional I/O, which can lead to slower execution times.

Ways to Limit the Amount of Intermediate Storage Used

Since intermediate storage is used to carry results from one part of the query execution to another, the best way to reduce intermediate storage is to use predicates (e.g. WHERE clauses, JOIN … ON clauses, etc) on intermediate steps of your query (subqueries, CTEs, etc) to ensure that you are not carrying unnecessary data through your query processing.

For example, consider this query which joins on the results of two CTEs:
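As a sketch (the ticket_sales and customers tables are hypothetical), imagine two CTEs that read their tables in full and apply filters only after the join:

    WITH all_sales AS (
      SELECT customer_id, sale_date, amount
      FROM ticket_sales
    ),
    all_customers AS (
      SELECT customer_id, country
      FROM customers
    )
    SELECT c.country, SUM(s.amount) AS revenue
    FROM all_sales s
    JOIN all_customers c ON s.customer_id = c.customer_id
    WHERE s.sale_date >= '2019-07-01'
      AND c.country = 'France'
    GROUP BY c.country;

Both CTEs carry every row of their source tables into the join, so all of that data has to be held as intermediate storage.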

This query could be re-written as follows to limit the amount of data carried into the JOIN.
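Continuing the same hypothetical sketch, the predicates move inside the CTEs so that only the relevant rows are carried forward:

    WITH july_sales AS (
      SELECT customer_id, sale_date, amount
      FROM ticket_sales
      WHERE sale_date >= '2019-07-01'   -- filter inside the CTE
    ),
    french_customers AS (
      SELECT customer_id, country
      FROM customers
      WHERE country = 'France'          -- filter inside the CTE
    )
    SELECT c.country, SUM(s.amount) AS revenue
    FROM july_sales s
    JOIN french_customers c ON s.customer_id = c.customer_id
    GROUP BY c.country;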

We have a great new feature called Query Recommendations that proactively lets you know if your query is using a significant amount of intermediate storage.

query recommendations

Impact to Cluster Storage Utilization

The amount of intermediate storage a query uses is important when diagnosing spikes in cluster storage utilization.

For example, let’s assume you see your cluster storage spiking over some time period, but don’t see a corresponding increase in data transferred (via the COPY command) into your cluster. You could search for all queries which have a large Memory to Disk value to identify which queries contributed to your cluster running out of disk space.

Here’s a real-world example. The following chart shows the actual disk space used in a cluster over a two-week period, broken down by schema. This particular chart shows consistent storage utilization over time, with small variations.

schema size, Disk-based Queries

The chart of % disk utilization tells a different story. On the same cluster, over the same period, the disk utilization hits 100% quite frequently. This is caused by some queries using an extraordinary amount of intermediate storage.

disk utilization

One of the cool features we recently released, Cluster Recommendations, will surface queries with high disk utilization immediately. For this cluster, it appears that queries were using close to two terabytes of disk for intermediate results!

cluster recommendations

Conclusion

Monitoring both “Disk-based Queries” and “Intermediate Storage” is crucial to keeping your cluster healthy. Keeping on top of this temporary disk utilization prevents your Amazon Redshift disks from filling up due to misbehaving queries, which would otherwise result in queries being killed and your users being interrupted.