4 Real-World Use Cases for Amazon Redshift
Since Amazon Redshift launched in 2013, customers have kept asking us, “What are some of the use cases for Amazon Redshift?” It’s 2017, and the four years since launch are a long time in technology. Key things that have changed since then:
- new Amazon Redshift features: more to do with your data
- new node types: process data faster
- major adoption: more data use cases
Each bullet feeds the others, in something of a positive feedback loop. When you can process more data faster and do more with it, your use cases evolve.
Redshift started out as a simpler, cheaper and faster alternative to legacy on-premises warehouses. If you read Jeff Barr’s blog post announcing Redshift (“Amazon Redshift – The New AWS Data Warehouse”), the pitch was all about simplicity and price.
Fast forward to the end of 2017, and the use cases are way more sophisticated than just running a data warehouse in the cloud. I recommend looking at the slides from re:Invent 2016, “What’s New With Redshift” (also see the full video of the talk on YouTube).
You will find 4 use cases:
- Traditional Data Warehousing
- Log Analysis
- Business Applications
- Mission-critical Workloads
Traditional Data Warehousing
Data warehousing has been around since Kimball and Inmon. What changed with Amazon Redshift was the price at which you can get it: about 20x less than what you had to pay legacy vendors like Oracle and Teradata.
The use case for data warehousing is to unify disparate data sources in a single place and run custom analytics for your business.
Let’s say you’re the head of business intelligence for a web property that also has a mobile app. The typical categories of data sources are:
- your core production database with all customer data (“who are my customers?”)
- event data from your website and your mobile app (“how are they behaving when they use our products?”)
- data from your SaaS systems that you need to support your business (ads, payments, support, etc.) (“How did I acquire those customers, how much are they paying me, and what support costs do they cause me?”)
With a rich ecosystem of data integration vendors, it’s easy to build pipelines to those sources and feed data into Redshift. Put a powerful BI / dashboard tool on top, and you have a full-blown BI stack.
A key advantage of Redshift that I think a lot of people are not aware of is simplicity. It used to take months if not quarters to get a data warehouse up and running. And you’d need the help of an Accenture or IBM. None of that anymore. You can spin up a Redshift cluster in less than 15 minutes, and build a whole business intelligence stack in a weekend.
Some examples by teams who built an early analytics stack on Redshift:
- Analytics at Clearbit: Clear, simple, scaleable
- Building a Data-Informed Culture: An Introduction to Data at Gusto
- Building Analytics at Simple
The combination of price / speed / simplicity expanded the addressable market for data warehousing from large corporations to SMBs. However, because it is so easy to get going, data engineers must make sure to follow best practices when setting up their cluster, to avoid the performance issues that can appear as data volume and pipeline complexity grow.
Log Analysis
Because of the cost of storage, in previous generations of data warehouses you had to aggregate data. It was too expensive to store raw data. That changed with Amazon Redshift. Because Redshift is cheap, it’s possible to store raw, event-level data without getting killed on storage cost.
Event-level data comes with three key benefits.
- You get to keep the maximum amount of fidelity – no information gets lost in aggregation.
- Your level of analytical insight goes up. Armed with granular information, you can “slice ’n dice” your data in any possible way along any dimension.
- You can run historic replays of data, see what happened in the build-up to a specific event you’re tracking, and build “what-if” scenarios by changing the parameters of your algorithms / models.
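To make the difference between aggregated and event-level data concrete, here is a minimal sketch in plain Python. The events, field names, and the `count_by` helper are all hypothetical and exist only for illustration; in practice this grouping would happen in SQL inside Redshift.

```python
from collections import Counter

# Hypothetical raw clickstream events; every field name here is illustrative.
events = [
    {"user": "u1", "platform": "web", "country": "US", "action": "view"},
    {"user": "u1", "platform": "ios", "country": "US", "action": "purchase"},
    {"user": "u2", "platform": "web", "country": "DE", "action": "view"},
    {"user": "u3", "platform": "ios", "country": "DE", "action": "view"},
    {"user": "u3", "platform": "ios", "country": "DE", "action": "purchase"},
]


def count_by(rows, *dims):
    """Count rows grouped by any combination of dimensions."""
    return Counter(tuple(row[d] for d in dims) for row in rows)


# A pre-aggregated rollup fixes its dimensions at load time:
# once only this table is stored, platform and country are gone.
rollup_by_action = count_by(events, "action")

# With the raw events kept, the same data can be regrouped later
# along dimensions nobody planned for up front.
by_platform_action = count_by(events, "platform", "action")
by_country = count_by(events, "country")

print(rollup_by_action)
print(by_platform_action)
print(by_country)
```

If only `rollup_by_action` had been stored, the questions answered by the other two groupings could no longer be asked. That is the fidelity argument from the list above.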
And so that made Amazon Redshift a perfect fit for analyzing machine-generated data like web logs or clickstream data. Massive amounts of data that come in at high velocity.
Because Redshift is fast and cheap, processing machine data is cost-effective, and you can drive the time required for “ingest-to-insight” (i.e. the time between pushing data into Redshift and the final analysis output) below the 5-minute mark. Not just for basic aggregations, but for complex 10-way joins, across billions (billions with a “b”) of rows. That’s remarkable.
The business value comes from exposing that data back to the business, for decision making, data-driven services, etc. Here are a few links to tech talks I recommend reading up on. They describe in detail how Yelp, Lyft and Pinterest use Amazon Redshift to process data and then expose it to the services that need it.
- Open-Sourcing Yelp’s Data-Pipeline
- Lyft Enables Massive Growth of Ridesharing Platform
- Powering Interactive Data Analysis at Pinterest by Amazon Redshift
The business value here goes beyond mere cost savings by migrating your warehouse to the cloud. Rather, you’re enabling new services, informed by data. These “data-driven services” are the foundation for better / faster decision making, and also new revenue-generating products.
That distinction is key. In previous days, with the use of basic reporting, companies would look at a data warehouse as a “cost center”. Yes, data is important, but also expensive. And so the goal was to keep that cost down as much as possible. Limited exposure of data to a limited set of people, etc.
Now, it makes sense to increase spend on your data infrastructure. That’s because an incremental $ spent on analyzing data can lead to a larger incremental increase in $ revenue generated.
That leads us to the next use case, where Redshift drives new revenue as the core engine behind an analytics product, i.e. business applications.
Business Applications
Not all companies have the technical abilities and budget to build and run a custom streaming pipeline with near real-time analytics.
But analytical use cases can be pretty similar across a single industry or vertical. That has given rise to “analytics-as-a-service” vendors. They use Redshift under the covers, and then offer analytics in a SaaS model to their customers.
These vendors either run a single cluster in a multi-tenant model, or offer each customer a dedicated cluster in a premium model. Take Acquia Lift as an example. The charging model then is a subscription fee for the analytics service.
To give you some back-of-the-envelope math: In a multi-tenant model, you can cram data from tens of customers onto a single-node cluster, which costs you ~$200 / month. Price out the actual analytics service at $500 / month / subscriber, and you have a business with some pretty good gross margins.
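Spelled out, that math looks roughly like the sketch below. It assumes exactly 10 subscribers (the low end of "tens of customers") and counts only the cluster as cost of goods sold; support, data integration, and engineering time are deliberately left out, so treat the result as an upper bound. The `gross_margin` function is a hypothetical helper, not part of any library.

```python
def gross_margin(subscribers, price_per_subscriber, cluster_cost):
    """Gross margin as a fraction of monthly revenue.

    Assumption: the Redshift cluster is the only cost counted.
    """
    revenue = subscribers * price_per_subscriber
    return (revenue - cluster_cost) / revenue


# Figures from the text: $500 / month / subscriber, ~$200 / month cluster.
# The subscriber count of 10 is an assumption for illustration.
margin = gross_margin(subscribers=10, price_per_subscriber=500, cluster_cost=200)
print(f"{margin:.0%}")  # prints "96%"
```

Even with a single subscriber the cluster cost is covered, which is why the multi-tenant model is attractive for these vendors.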
Mission-critical Workloads
Using Redshift for mission-critical workloads has emerged in the past few years. Here, data sitting in Redshift feeds into time-sensitive apps. It’s key that the database stays up, because otherwise the business goes down (quite literally).
In some cases, e.g. the NASDAQ, that is daily reporting. That reporting can’t be late or wrong, otherwise somebody might quite literally go to jail.
Other cases include building predictive models on top of Redshift, and then embedding the results programmatically into another app via a data API. An example is automated ad-bidding, where bids across certain ad networks are adjusted on a near real-time basis. The adjustments are calculated based on ROI and on which ad types performed best in the last week, day and even hour.
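As a rough illustration of the ad-bidding idea, the sketch below blends ROI over the last week, day and hour into a single bid adjustment. Everything in it is an assumption made for the example: the weights, the cap on each adjustment, the sample numbers, and the function names. It is not how any particular ad network or vendor does it, and in a real setup the per-window revenue and spend figures would come from queries against Redshift.

```python
def roi(revenue, spend):
    """Return on investment as a ratio; 0.0 when nothing was spent."""
    return (revenue - spend) / spend if spend else 0.0


def adjust_bid(current_bid, windows, weights, max_step=0.2):
    """Nudge a bid by the weighted ROI, capped at +/- max_step per update.

    windows maps a window name to (revenue, spend) for that window.
    weights maps the same names to their share in the blend (hypothetical).
    """
    blended = sum(weights[name] * roi(*windows[name]) for name in weights)
    step = max(-max_step, min(max_step, blended))
    return round(current_bid * (1 + step), 2)


# Hypothetical (revenue, spend) per window for one ad type.
windows = {"week": (1200.0, 1000.0), "day": (150.0, 100.0), "hour": (9.0, 10.0)}
# Assumed weighting that favors recent performance.
weights = {"week": 0.2, "day": 0.3, "hour": 0.5}

new_bid = adjust_bid(1.00, windows, weights)
print(new_bid)
```

The cap keeps a single noisy hour from swinging the bid too far, which matters when the output feeds another system automatically.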
Redshift has driven down the cost of running a data warehouse, and as a result expanded the addressable market. Because Redshift is cheap, it lets you store event-level data, which opens up a whole new world of use cases. Some of these use cases include data-driven services that create new revenue streams for companies.
And when data in Amazon Redshift becomes critical for business success, it’s important to make sure your cluster is not a black box. If you’re part of a data team that’s building mission-critical data pipelines, sign up for a free trial. We’re sure you’ll be surprised by the amount of visibility you’ll get, and how much faster that allows you to move.