Amazon Redshift vs Redshift Spectrum vs Amazon Aurora
In this blog post, we are going to cover the difference between Amazon Redshift vs Amazon Redshift Spectrum vs Amazon Athena vs Amazon Aurora (you probably guessed that one…) along with a practical example of when you would use each of these technologies.
We’ll also give a high-level overview of database concepts. Let’s get into it.
Table of Contents
When dealing with databases, you will often hear two terms:
- “online transaction processing” or short “OLTP”. OLTP systems are optimized for doing lots of writes, i.e. getting data “in”. You will also hear the term “production workloads” when talking about OLTP systems.
- “online analytical processing” or short “OLAP”. OLAP systems are optimized for doing lots of reads, i.e. getting data “out”. You will also hear the term “analytical workloads”, “data warehouse”, or “analytics” when talking about OLAP systems.
Engineers love acronyms, and if you just keep “OLAP” and “OLTP” in mind, you can have a seat at the table and look smart.
A Practical Example: United Airlines
Let’s use United Airlines as an example to describe what the two acronyms mean.
United Airlines sells tickets for their flights. When a customer purchases a ticket on United’s website or mobile app, United needs to keep a record of that purchase in their production database.
They will use an OLTP style database to store the record.
At the end of the month, the CFO at United wants to know “how many tickets did we sell this month?”. To get the answer, they could query the production database. However, those queries would put more load onto the production database, which will slow it down.
That’s not desirable, as as a slow database will have a negative impact on selling tickets.
And so the answer is to use an analytics database or data warehouse that has a replica of all transactions in the production database. The process to move data from production to analytics is called “ETL”. You
- Extract the data from the production database
- Transform the data suitable for analysis
- Load the data into your analytics database.
Once the data is in the data warehouse, you can go crazy with even the most complex queries (e.g. “how many tickets did we sell in France with our iOS App in the 2nd week of July, and which ones came from net new customers?”), without affecting the production database.
Data Lakes and Serverless Query Engines
Back to our United Airlines example. Over time, the data volume adds up, and it gets expensive to store a copy of all historic ticket sales in the data warehouse. It’s also not very useful to have data from e.g. 3 or 5 years ago in your warehouse, when all you really care about is the last fiscal month, quarter and year.
And at that point, you move your data from your warehouse into your data lake, where storage is about 5x cheaper.
Now say there’s the occasional ask for historic data – “how many tickets did we sell 5 years ago in July, and how does that compare with July this year?”. And so you either move the data back into your warehouse to answer that question, or you use a query engine that can query the data directly in your data lake, and / or maybe join it with data in your warehouse.
A query engine doesn’t store data – it just completes the job of running the query. The “serverless” part means you don’t need to spin up a server to run the engine.
And with that background, we can now answer the question.
Amazon Redshift vs. Redshift Spectrum vs. Amazon Athena vs Amazon Aurora
All four are Amazon AWS products, and I add Amazon S3 into the mix for a complete answer.
- Amazon S3 is a flat object store and commonly referred to these days as a “data lake”.
- Amazon Aurora is an OLTP-style database. It’s a MySQL and PostgreSQL-compatible relational database built for the cloud. You’ll use Aurora for your production workloads.
- Amazon Redshift is a relational, OLAP-style database. It’s a data warehouse built for the cloud, to run the most complex analytical workloads in standard SQL.
- Amazon Redshift Spectrum is a feature of Amazon Redshift. Spectrum is a serverless query processing engine that allows to join data that sits in Amazon S3 with data in Amazon Redshift.
- Amazon Athena is a serverless query processing engine based on open source Presto. Athena allows writing interactive queries to analyze data in S3 with standard SQL.
In our United Airlines example, United would use:
- Amazon Aurora to sell tickets
- Amazon Redshift to store short-term historical data to analyze how many tickets they’ve sold
- Amazon S3 for cheaper storage of all long-term historical ticket data
- Amazon Redshift Spectrum to join long-term historical data in S3 with short-term historical data in Amazon Redshift, e.g. for multi-year comparisons from ticket sales in a current year vs. ticket sales from 10 years ago.
- Amazon Athena for quick ad-hoc querying of data in S3, e.g. to answer a single-year question about ticket sales that requires data that only sits in S3, e.g. “how many tickets did we sell in July 10 years ago?”
In short, you end up with a stack where Aurora is your production database, S3 your data lake with a long term history of your production data, and then you have a choice of three AWS products to run analytics (Redshift, Redshift Spectrum, Athena) on top of that production data.
So that, of course, begs the question – when you do use which product for your analytics?
Putting it All Together – A Cloud-based Analytics Stack
Which product to use is a function of the cost, complexity and execution speed for your workloads and queries.
Amazon Redshift excels at running complex analytic queries, joins and aggregrations over large datasets because it leverages high-performance local disks, sophisticated query execution, and join-optimized data formats. Redshift is so powerful that you can run the “T” part of ETL within the warehouse, and not some external processing engine like Spark or Hadoop. In fact, it’s a shift from “ETL” to “ELT”, which is fodder for an entire blog post, but a good concept to keep in mind. If execution speed and for queries and transformations is of essence, then using Amazon Redshift is the way to go.
But you do have to load data into your warehouse, with a data engineer or DBA to manage your Redshift cluster against the storage and CPU requirements you need. And at its cheapest pricing (dense storage nodes with 3-year RIs), Redshift runs at ~$1,000 / TB / Year, about 4x as expensive as S3, which runs at about $250 / TB / Year.
That’s where Redshift Spectrum comes into play. With Redshift Spectrum, Amazon Redshift users can take advantage of inexpensive S3 storage and still scale out to pull, filter, aggregate, group and sort data.
Because Spectrum is serverless, there’s nothing to provision or manage. Pricing for Spectrum is $5 per terabyte of data scanned, and combined with cheaper S3 storage than Redshift, it can be a more cost-efficient proposition than storing data in a Redshift cluster only. You pay only for the queries you run against the data that you actually scan.
The trade-off is that Redshift Spectrum queries do run slower than queries in an Amazon Redshift cluster, mainly because of data movement between S3 and the cluster. But if you’re fine with that trade-off, and the priority is cost, then a combination of S3 / Spectrum is a great choice.
But you do have to watch the frequency at which your Spectrum queries run, and how much they cost you. If scanning your data with Spectrum costs more than just storing it in Redshift, it’s to move data back into the cluster.
Athena Athena follows the same logic as Spectrum, except that you’re going full-in on serverless and skip the warehouse. That scenario usually only works for simple, interactive queries that don’t require large-scale aggregations.
But none of these options are “either / or” choices. We see all three analytics engine deployed with customers, depending on the use case.
Join 11,000 of your peers.
Subscribe to our newsletter SF Data.
People at Facebook, Amazon and Uber read it every week.
Every Monday morning we'll send you a roundup of the best content from intermix.io and around the web. Make sure you're ready for the week! See all issues.