4 Real World Use Cases for Amazon Redshift

Since Amazon Redshift launched in 2013, customers have kept asking us: “What are some of the use cases for Amazon Redshift?” It’s now the end of 2017, and four years is a long time in technology. A lot has changed since launch, and the changes feed into each other in a positive feedback loop: when you can process more data faster, and do more with it, your use cases evolve.

Redshift started out as a simpler, cheaper, and faster alternative to legacy on-premises warehouses. If you read Jeff Barr’s blog post announcing Redshift (“Amazon Redshift – The New AWS Data Warehouse”), the pitch was all about simplicity and price.

Fast forward to the end of 2017, and the use cases are far more sophisticated than just running a data warehouse in the cloud. I recommend looking at the slides from the re:Invent 2016 talk “What’s New With Redshift” (the full video of the talk is also on YouTube).

In those slides you will find four use cases:

  1. Traditional Data Warehousing
  2. Log Analysis
  3. Business Applications
  4. Mission-critical Workloads

Traditional Data Warehousing

Data warehousing has been around since Kimball and Inmon. What changed with Amazon Redshift was the price at which you can get it: roughly 20x less than what you had to carve out for legacy vendors like Oracle and Teradata.

The use case for data warehousing is to unify disparate data sources in a single place and run custom analytics for your business.

Let’s say you’re the head of business intelligence for a web property that also has a mobile app. The typical categories of data sources are:

  1. your core production database with all customer data (“who are my customers?”)
  2. event data from your website and your mobile app (“how are they behaving when they use our products?”)
  3. data from your SaaS systems that you need to support your business (ads, payments, support, etc.) (“How did I acquire those customers, how much are they paying me, and what support costs do they cause me?”)

With a rich ecosystem of data integration vendors, it’s easy to build pipelines to those sources and feed data into Redshift. Put a powerful BI / dashboard tool on top, and you have a full-blown BI stack.
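
To make the load step concrete, here is a minimal sketch of a load job, assuming event files already land in S3. The cluster endpoint, table name, bucket path, and IAM role are hypothetical placeholders; Redshift’s COPY command does the heavy lifting. In practice, a data integration vendor runs jobs like this for you on a schedule.

```python
# Minimal sketch: load JSON event files from S3 into a Redshift table.
# All identifiers below are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    dbname="analytics",
    user="admin",
    password="...",
)

copy_sql = """
    COPY events
    FROM 's3://my-event-bucket/2017/11/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS JSON 'auto';
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # Redshift loads the files in parallel across slices
conn.close()
```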

A key advantage of Redshift that I think a lot of people are not aware of is simplicity. It used to take months, if not quarters, to get a data warehouse up and running, and you’d need the help of an Accenture or an IBM. None of that anymore: you can spin up a Redshift cluster in less than 15 minutes, and build a whole business intelligence stack in a weekend.
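
To illustrate how little ceremony is involved, here is a hedged sketch using the boto3 SDK to launch a small cluster; every identifier and credential below is a placeholder, not a recommendation.

```python
# Sketch: programmatically launch a two-node Redshift cluster with boto3.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

redshift.create_cluster(
    ClusterIdentifier="bi-stack-demo",     # hypothetical name
    NodeType="dc2.large",
    NumberOfNodes=2,
    DBName="analytics",
    MasterUsername="admin",
    MasterUserPassword="ChangeMe1234",     # use a real secret in practice
)

# Block until the cluster is available (typically well under 15 minutes).
waiter = redshift.get_waiter("cluster_available")
waiter.wait(ClusterIdentifier="bi-stack-demo")
```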

Plenty of teams have built early analytics stacks on Redshift in exactly this way.

The combination of price / speed / simplicity expanded the addressable market for data warehousing from large corporations to SMBs. However, because it is so easy to get going, data engineers must make sure to follow best practices when setting up their cluster, to avoid performance issues as data volume and pipeline complexity grow.

Log Analysis

Because of the cost of storage, previous generations of data warehouses forced you to aggregate data; storing raw data was simply too expensive. That changed with Amazon Redshift. Because Redshift is cheap, it’s possible to store raw, event-level data without getting killed on storage cost.

Event-level data comes with three key benefits.

  1. You get to keep the maximum amount of fidelity – no information gets lost in aggregation.
  2. Your level of analytical insight goes up. Armed with granular information, you can “slice ’n dice” your data in any possible way, along any dimension (as sketched below).
  3. You can run historic replays of data, see what happened in the lead-up to a specific event you’re tracking, and build “what-if” scenarios by changing the parameters of your algorithms / models.
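
As a sketch of point 2, here is a hypothetical “slice ’n dice” query against a raw events table; the table and column names (events, event_time, platform, country, user_id) are assumptions for illustration.

```python
# Sketch: group raw event-level data along arbitrary dimensions.
# Table and column names are hypothetical.
import psycopg2

conn = psycopg2.connect(host="...", port=5439, dbname="analytics",
                        user="admin", password="...")

sql = """
    SELECT date_trunc('hour', event_time) AS hour,
           platform,                         -- e.g. web vs. mobile
           country,
           COUNT(*)                AS events,
           COUNT(DISTINCT user_id) AS users
    FROM events
    WHERE event_time >= dateadd(day, -7, getdate())
    GROUP BY 1, 2, 3
    ORDER BY 1, 4 DESC;
"""

with conn, conn.cursor() as cur:
    cur.execute(sql)
    for row in cur.fetchall():
        print(row)
conn.close()
```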

That made Amazon Redshift a perfect fit for analyzing machine-generated data like web logs or clickstream data: massive amounts of data coming in at high velocity.

Because Redshift is fast and cheap, processing machine data is cost-effective, and you can drive the “ingest-to-insight” time (i.e. the time between pushing data into Redshift and the final analysis output) below the five-minute mark. Not just for basic aggregations, but for complex 10-way joins across billions (billions, with a “b”) of rows. That’s remarkable.

The business value comes from exposing that data back into the business, for decision making, data-driven services, and so on. Tech talks from Yelp, Lyft, and Pinterest describe in detail how those companies use Amazon Redshift to process data and then expose it to the services that need it.

The business value here goes beyond the mere cost savings of migrating your warehouse to the cloud. Rather, you’re enabling new services, informed by data. These “data-driven services” are the foundation for better and faster decision making, and also for new revenue-generating products.

That distinction is key. In the past, with only basic reporting in use, companies would look at a data warehouse as a “cost center”. Yes, data was important, but it was also expensive, so the goal was to keep that cost down as much as possible: limited exposure of data, to a limited set of people.

Now it makes sense to increase spend on your data infrastructure, because an incremental dollar spent on analyzing data can generate more than an incremental dollar of revenue.

That leads us to the next use case, where Redshift drives new revenue as the core engine behind an analytics product, i.e. business applications.

Business Applications

Not all companies have the technical abilities and budget to build and run a custom streaming pipeline with near real-time analytics.

But analytical use cases can be pretty similar across a single industry or vertical. That has given rise to “analytics-as-a-service” vendors. They use Redshift under the covers, and then offer analytics in a SaaS model to their customers.

These vendors either run a single cluster in a multi-tenant model, or offer a dedicated cluster per customer in a premium model. Take Acquia Lift as an example. The charging model is then a subscription fee for the analytics service.

To give you some back-of-the-envelope math: in a multi-tenant model, you can cram data from tens of customers onto a single-node cluster, which costs you about $200 / month. Price the analytics service at $500 / month per subscriber, and you have a business with some pretty good gross margins.
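
Spelling out that math (the per-cluster customer count is my assumption; the $200 cluster cost and $500 subscription price come from the numbers above):

```python
# Back-of-the-envelope unit economics for a multi-tenant analytics service.
customers_per_cluster = 20        # assumption: "tens of customers" per cluster
cluster_cost = 200.0              # $/month, single-node cluster
price_per_subscriber = 500.0      # $/month subscription fee

revenue = customers_per_cluster * price_per_subscriber   # $10,000 / month
gross_margin = (revenue - cluster_cost) / revenue
print(f"Gross margin: {gross_margin:.0%}")               # -> Gross margin: 98%
```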

Mission-critical Workloads

Using Redshift for mission-critical workloads has emerged in the past few years. Here, data sitting in Redshift feeds into time-sensitive apps. It’s key that the database stays up, because otherwise the business goes down (quite literally).

In some cases, e.g. at NASDAQ, that means daily reporting. That reporting can’t be late or wrong; otherwise somebody might quite literally go to jail.

Other cases include building predictive models on top of Redshift and then embedding the results programmatically into another app via a data API. An example is automated ad-bidding, where bids across certain ad networks are adjusted on a near real-time basis. The adjustments are based on ROI and on which ad types performed best in the last week, day, or even hour.
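
A minimal sketch of that pattern, assuming a hypothetical ad_performance table in Redshift and a placeholder bidding API; a real system would use a proper model rather than this toy rule.

```python
# Sketch: read recent ROI per ad network from Redshift, derive bid adjustments.
# Table, columns, and the bidding API are hypothetical.
import psycopg2

conn = psycopg2.connect(host="...", port=5439, dbname="analytics",
                        user="admin", password="...")

sql = """
    SELECT ad_network,
           SUM(revenue) / NULLIF(SUM(spend), 0) AS roi
    FROM ad_performance
    WHERE event_time >= dateadd(hour, -1, getdate())
    GROUP BY ad_network;
"""

with conn, conn.cursor() as cur:
    cur.execute(sql)
    for ad_network, roi in cur.fetchall():
        roi = float(roi or 0)
        adjustment = 1.1 if roi > 1.5 else 0.9   # toy rule: raise bids when ROI beats target
        print(f"{ad_network}: ROI={roi:.2f}, bid multiplier={adjustment}")
        # bidding_api.set_bid_multiplier(ad_network, adjustment)  # placeholder call
conn.close()
```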

Conclusion

Redshift has driven down the cost of running a data warehouse and, as a result, expanded the addressable market. Because Redshift is cheap, it allows you to store event-level data, which opens up a whole new world of use cases. Some of those use cases are data-driven services that create new revenue streams for companies.

And when data in Amazon Redshift becomes critical for business success, it’s important to make sure your cluster is not a black box. If you’re part of a data team that’s building mission-critical data pipelines, sign up for a free trial. We’re sure you’ll be surprised by the amount of visibility you’ll get, and how much faster that allows you to move.
