Start Now Login
Amazon Redshift Spectrum: Diving into the Data Lake!

Amazon Redshift Spectrum: Diving into the Data Lake!

Amazon’s Simple Storage Service S3 has been around since 2006. Enterprises have been pumping their data into this data lake at a furious rate. Within 10 years of its birth, S3 stored over 2 trillion objects, each up to 5 terabytes in size. Enterprises know their data is valuable and worth preserving. But much of this data lies inert, in “cold” data lakes, unavailable for analysis, so-called “dark data”.

The Dark Data Problem. Source: Amazon AWS.

Analyzing “Dark Data”

So what lies below the surface of data lakes? The first thing for enterprises is to find out what dark data they have accumulated. Then analyze it in search of valuable insights. That means analysts need solutions that allow them to access petabytes of dark data.   

With Amazon Redshift Spectrum you can query data in Amazon S3 without first loading it into Amazon Redshift. For nomenclature purposes, we’ll use “Redshift” for “Amazon Redshift”, and “Spectrum” for “Amazon Redshift Spectrum”.

Today, there are three major existing ways to access and analyze data in S3. 

So why not use these existing options? For example, companies already use Amazon Redshift to analyze their “hot” data. So why not load that cold data from S3 into Redshift and call it a day?

Two reasons:

Redshift Spectrum offers the best of both worlds. With Spectrum, you can:

Data Stack with Amazon Redshift, Amazon Redshift Spectrum, Amazon Athena, AWS Glue and S3.

Spectrum is the “glue” or “bridge” layer that provides Redshift an interface to S3 data. Redshift becomes the access layer for your business applications. Spectrum is the query processing layer for data accessed from S3. The following picture illustrates the relationship between these services.

A closer look at Redshift Spectrum

From a deployment perspective, Spectrum is “under the hood”. It’s a group of managed nodes in your private VPC, available to any of your Redshift clusters that are Spectrum-enabled. It pushes compute-intensive tasks down to the Redshift Spectrum layer. That layer is independent of your Amazon Redshift cluster.  

There are three key concepts to understand how to run queries with Redshift Spectrum:

  1. External data catalog
  2. External schemas
  3. External tables

The external data catalog contains the schema definitions for the data you wish to access in S3. It’s a central metadata repository for your data assets.  Potential options for your data catalog:

The external schema contains your tables. External tables allow you to query data in S3 using the same SELECT syntax as with other Amazon Redshift tables. External tables are read-only, i.e. you can’t write to an external table.

You can keep writing your usual Redshift queries. The main change with Spectrum is that the queries now also contain a reference to data stored in S3.                 

Joining internal and external tables

The Redshift query engine treats internal and external tables the same way. You can do the typical operations like queries and joins on either type of table. Or a combination of both: Query an external table and join its data with that from an internal one.

An example: You are using Redshift to analyze data of your e-commerce site visitors. What pages they visit, how long they stay, what they buy (or not), etc. You keep a year’s worth of data in your Redshift clusters. Older data you move to S3.

Then you notice an odd seasonal variation. You want to see if this was also true for past years, vs. an aberration for this year. Luckily you have saved historic clickstream data in S3, going back many years. You can now access that historic data via an external table with Spectrum, and run the same queries you’re running in Amazon Redshift. Or create new insights by joining other past data with this year’s.

Redshift parses, compiles and distributes a SQL query to the nodes in a cluster the normal way. The part of the query that references an external gets sent to Spectrum. Spectrum processes the relevant data in S3, and sends the result back to Redshift. Redshift collects the partial results from its nodes and Spectrum, concatenates, joins, etc. and returns the complete result.

Using Redshift Spectrum is part of our top 14 performance tuning techniques for Amazon Redshift. The post is a great resource for obtaining the best overall performance of your Amazon Redshift cluster, e.g. for moving data between Amazon Redshift and S3.

Summary

A few points to keep in mind when working with Spectrum:

You should also do the homework to ensure that processing of data in S3 is economical and efficient. You can save on costs and get better performance if you partition the data, compress data, or convert it to columnar formats such as Apache Parquet.

In summary, Spectrum adds one more tool to your Redshift-based data warehouse investment. You can now use its power to probe and analyze your data lake on an as-needed basis for a very low per query price.

Related content
3 Things to Avoid When Setting Up an Amazon Redshift Cluster Apache Spark vs. Amazon Redshift: Which is better for big data? What Causes "Serializable Isolation Violation Errors" in Amazon Redshift? A Quick Guide to Using Short Query Acceleration and WLM for Amazon Redshift for Faster Queries What is TensorFlow? An Intro to The Most Popular Machine Learning Framework Titans of Data with Mirko Novakovic - How Containers are Giving Rise to New Data Services Why We Built intermix.io - “APM for Data” 4 Simple Steps To Set-up Your WLM in Amazon Redshift For Better Workload Scalability World-class Data Engineering with Amazon Redshift - Training Announcing App Tracing - Monitoring Your Data Apps With intermix.io Have Your Postgres Cake with Amazon Redshift and eat it, too. 4 Real World Use Cases for Amazon Redshift 3 Steps for Fixing Slow Looker Dashboards with Amazon Redshift Zero Downtime Elasticsearch Migrations Titans of Data with Florian Leibert – CEO Mesosphere Improve Amazon Redshift COPY performance:  Don’t ANALYZE on every COPY Building a Better Data Pipeline - The Importance of Being Idempotent The Future of Machine Learning in the Browser with TensorFlow.js Gradient Boosting Libraries — A Comparison Crowdsourcing Weather Data With Amazon Redshift The Future of Apache Airflow Announcing Query Groups – Intelligent Query Classification Top 14 Performance Tuning Techniques for Amazon Redshift Product Update: An Easy Way To Find The Cause of Disk Usage Spikes in Amazon Redshift How We Reduced Our Amazon Redshift Cost by 28%
Ready to start seeing into your data infrastructure?
Get started with a 14-day free trial, with access to the full platform

No Credit Card Required