Amazon Redshift Spectrum: Diving into the Data Lake!

Amazon Redshift Spectrum: Diving into the Data Lake!

Amazon’s Simple Storage Service S3 has been around since 2006. Enterprises have been pumping their data into this data lake at a furious rate. Within 10 years of its birth, S3 stored over 2 trillion objects, each up to 5 terabytes in size. Enterprises know their data is valuable and worth preserving. But much of this data lies inert, in “cold” data lakes, unavailable for analysis, so-called “dark data”.

The Dark Data Problem. Source: Amazon AWS.

Analyzing “Dark Data”

So what lies below the surface of data lakes? The first thing for enterprises is to find out what dark data they have accumulated. Then analyze it in search of valuable insights. That means analysts need solutions that allow them to access petabytes of dark data.   

With Amazon Redshift Spectrum you can query data in Amazon S3 without first loading it into Amazon Redshift. For nomenclature purposes, we’ll use “Redshift” for “Amazon Redshift”, and “Spectrum” for “Amazon Redshift Spectrum”.

Today, there are three major existing ways to access and analyze data in S3. 

So why not use these existing options? For example, companies already use Amazon Redshift to analyze their “hot” data. So why not load that cold data from S3 into Redshift and call it a day?

Two reasons:

Redshift Spectrum offers the best of both worlds. With Spectrum, you can:

Data Stack with Amazon Redshift, Amazon Redshift Spectrum, Amazon Athena, AWS Glue and S3.

Spectrum is the “glue” or “bridge” layer that provides Redshift an interface to S3 data. Redshift becomes the access layer for your business applications. Spectrum is the query processing layer for data accessed from S3. The following picture illustrates the relationship between these services.A closer look at Redshift Spectrum

From a deployment perspective, Spectrum is “under the hood”. It’s a group of managed nodes in your private VPC, available to any of your Redshift clusters that are Spectrum-enabled. It pushes compute-intensive tasks down to the Redshift Spectrum layer. That layer is independent of your Amazon Redshift cluster.  

There are three key concepts to understand how to run queries with Redshift Spectrum:

  1. External data catalog
  2. External schemas
  3. External tables

The external data catalog contains the schema definitions for the data you wish to access in S3. It’s a central metadata repository for your data assets.  Potential options for your data catalog:

The external schema contains your tables. External tables allow you to query data in S3 using the same SELECT syntax as with other Amazon Redshift tables. External tables are read-only, i.e. you can’t write to an external table.

You can keep writing your usual Redshift queries. The main change with Spectrum is that the queries now also contain a reference to data stored in S3.                  

Joining internal and external tables

The Redshift query engine treats internal and external tables the same way. You can do the typical operations like queries and joins on either type of table. Or a combination of both: Query an external table and join its data with that from an internal one.

An example: You are using Redshift to analyze data of your e-commerce site visitors. What pages they visit, how long they stay, what they buy (or not), etc. You keep a year’s worth of data in your Redshift clusters. Older data you move to S3.

Then you notice an odd seasonal variation. You want to see if this was also true for past years, vs. an aberration for this year. Luckily you have saved historic clickstream data in S3, going back many years. You can now access that historic data via an external table with Spectrum, and run the same queries you’re running in Amazon Redshift. Or create new insights by joining other past data with this year’s.

Redshift parses, compiles and distributes a SQL query to the nodes in a cluster the normal way. The part of the query that references an external gets sent to Spectrum. Spectrum processes the relevant data in S3, and sends the result back to Redshift. Redshift collects the partial results from its nodes and Spectrum, concatenates, joins, etc. and returns the complete result.


A few points to keep in mind when working with Spectrum:

You should also do the homework to ensure that processing of data in S3 is economical and efficient. You can save on costs and get better performance if you partition the data, compress data, or convert it to columnar formats such as Apache Parquet.

In summary, Spectrum adds one more tool to your Redshift-based data warehouse investment. You can now use its power to probe and analyze your data lake on an as-needed basis for a very low per query price.

Lars Kamp
Lars Kamp
Join over 1,000 subscribers to our content.
Data engineers from Netflix, Amazon AWS, Coinbase and Instacart are reading our articles.
Performance Analytics for Amazon Redshift
Try 14 days for free. Deal with issues before they take down your cluster.