Table of Contents
Data is valuable resource powering up analytics, predictive models and decision making. For a company to make data-driven decisions, it first must go through building its data infrastructure. And a data warehouse plays a central role in such an infrastructure.
Data warehouses are data storage and processing systems that aggregate data from different sources into a single place. With data in one place, you can combine and query it across the different sources. That includes data coming from users interacting through web and mobile, background system logs, or third party data.
Amazon Redshift is a cloud warehouse solution by Amazon Web Services (AWS). Since AWS first introduced Redshift in 2012, it got everyone’s attention for its amazing performance and low price point. In the following years, there were massive improvements from operational and data pipeline perspective. Today, it’s the market leader for cloud warehouses.
So how does Redshift work, and what’s been driving its adoption? There are two basic components you need to understand about Amazon Redshift:
The technology is nothing new. Other warehouses use it, and there are even open source data warehouses that are free to use. It’s the combination of the two, and the simplicity that Redshift offers to start with a data warehouse.
In this post, I’ll explain these two components. Before that, it’s helpful to understand basic nomenclature and key concepts. We’ll start with the key concepts.
If you’ve ever googled “Redshift” you must have read the following. Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud.
Let’s break down what this means, and explain a few other key concepts that are helpful for context on how Redshift operates.
Fully managed. AWS takes care of things like warehouse setup, operation and redundancy, as well as scaling and security. All this is automated in the background, so the client has a smooth experience.
Warehouse service. Redshift is a cloud service, the customer does not own the physical hardware of the warehouse, but can use it through a subscription as a service. This has dramatic impact on the procurement model for customers.
Getting started with Redshift is easy. Rather than buying and installing hardware, they can spin up a Redshift warehouse, upload data and run queries in less than 15 minutes. And since AWS manages a fleet of tens of thousands of Redshift clusters, customers benefit from automating capabilities that would not make economic sense for any individual on-premise DBA.
Petabyte. There are fifteen zeros in a petabyte, 1,000x bigger than a terabyte. To illustrate, this is equivalent to 13.3 years of HD video. Compare that with traditional on-premise data warehouses that operate in the terabyte range. Going to petabytes has dramatic impact on the analytical capabilities.
Many companies today already generate a terabyte of data per day. In the old on-premise world, storing and analyzing that much data would have been cost-prohibitive, leading to the famous “analysis gap” or “dark data”. They may collect data but don’t analyze it. Redshift solves that problem.
Cluster. A cluster is a collection of nodes which perform the actual storing and processing of data. Each cluster runs an Amazon Redshift engine distributed and replicated across all nodes.
Nodes. AWS offers four different node types for Redshift. Each node type comes with a combination of computing resources (CPU, memory, storage and I/O)
Databases. A cluster can have one or more databases. A database contains one or more named schemas, which in turn contain a set of tables and other objects.
Workloads. You’ll often hear the term “workloads”, which implies “running queries” on your Redshift cluster. There are three generic types of workloads.
Understanding and distinguishing the three generic workloads from each other is an important concept for any data warehouse. Different Workloads use different amounts of cluster resources, and it makes total sense to be able to keep disparate workloads separate, to avoid resource contention. Redshift manages them with the “Workload Manager”. We explain workloads in detail in our post “4 Simple Steps to Workload Management”.
With the key concept out of the way, we can dive into the two components, technology and pricing. We’ll see each one of them factors in, but it’s the combination that makes the magic happen and allows Redshift to reach state-of-the-art performance at a low price.
One of the key concepts behind Redshift’s exceptional performance is its columnar data storage.
Columnar storage is a type of Relational Database Management System optimized for heavy-reading tasks. Most of the queries you run on your warehouse are scans, powering up analytics (?), dashboards, or machine learning models.
A common term for these types of queries is “OLAP” – “OnLine Analytical Processing”. As the name suggests, OLAP databases are good for analytics. OLAP queries tend to be complex and process a lot of data, touching many different tables. In fact, Redshift is an “OLAP” database engine.
Compare this with row-oriented systems which are a better option for write-heavy tasks. It’s a different data modeling approach used for business applications which process a company’s transactions. You’ll hear the term “production database”, or “online transaction processing” (“OLTP”). Most OLTP databases tend to have large amounts of small queries and a high percent of write activity. In general, OLTP systems load small amounts of data with high frequency.
A Simple Example
For a simple example, let’s take a website like Glassdoor, with information on companies, their salary levels and reviews by their employees.
You create an account on Glassdoor. In your profile you submit the company you work for and your salary. That transaction creates a record in Glassdoor’s production database. That same production database also holds records for all other user profiles.
Let’s say Glassdoor creates a new marketing report with the average salary across different companies. They may even break that down by level or geography and show a historical trend. For that information, they need to create a query that selects, filters and aggregates the records for a corresponding company. It’s a much larger query compared to the initial write query for each individual user in Glassdoor’s system.
Comparing row-oriented vs. column-oriented formats
Now let’s see how that query plays out in row-based vs. a columnar-oriented format. The following table illustrates the difference between the two approaches.
You can see that in the column-oriented the data is flipped around. Each row now represents a column. So why would a columnar storage work better for a warehouse?
Analytical queries execute fast on tables in columnar format. Let’s take for example a query “give me the average salary per company”. This query only touches the “salary” and “company” columns, groups by company and then calculates the averages.
Compare that to running the same query on a row-oriented database. The query needs to sift through the entire production database. To return a result, it has to read every single row. It runs much longer and is inefficient because it’s scanning the entire database.
There’s an additional catch here. If a query scans less rows, it consumes less data. As a result, the data fits into the RAM and can process in-memory. Memory / RAM is much faster than disk. Even more, since the data is stored column-wise, Redshift applies different data encodings per text and numerical columns, thus more efficiently compressing the data.
Columnar format is nothing new. On-premise warehouses like Oracle, Teradata and Netezza use it as well. The difference here is that they require hardware, installation and maintenance. It can take weeks and months to get running. Modern cloud warehouses like Redshift deliver the database as a service, with a cluster up and running in less than 15 minutes.
Redshift also pairs columnar storage with scaling out the number of nodes, which get us to “MPP” – massively parallel processing.
MPP is the process of coordinated, simultaneous computation of data across multiple nodes in the cluster. Each node is using its own operating system and memory, also known as a “loosely coupled” system.
An alternative is “tightly coupled”, or Symmetric Parallel Systems (SMP). Separate processors use a single operating system and memory. For warehouse solutions, MPP has shown to deliver better results.
MPP Distributes Compute Loads
The Redshift architecture uses MPP and consists of a leader node and compute nodes. The leader node distributes rows of a table across the different nodes which independently store and process data and queries. How you distribute data across the nodes depends on your distribution key.
The leader node coordinates the execution of a query across the different nodes. This applies for all types of workloads. Once done, the leader node combines the results from each node, to return the final result of the query.
This process has many advantages. By adding nodes, you add more storage, memory and CPU. With each node, a cluster’s processing capacity goes up and scales in a linear fashion. That means that a 10-node cluster processes the same query about twice as fast as a 5-node cluster.
This concept applies to all three types of generic workloads (loads, transforms and ad-hoc queries). For example, loading flat files into Redshift is a very efficient process and also takes advantage of parallel processing. The leader node spreads the workload across the nodes while reading from multiple files. Loading also scales linearly as you add more nodes to your cluster.
Elasticity – How Redshift Scales
Because of MPP’s simultaneous processing, as you add more nodes to your cluster, you load data faster, execute transform faster, and return the results of ad-hoc queries faster. And Redshift takes advantage of the elasticity of the cloud. Adding nodes is a simple process and takes a few clicks in the AWS console.
We describe Redshift’s linear scalability in our use case story “Intuit’s Journey to Cloud Analytics”. The key to linear scaling the correct configuration of your cluster. Just like with any other database, if you don’t’ configure it right, you won’t enjoy the performance benefits. You can read more in our post “How we configure Amazon Redshift for Performance”.
With traditional MPP databases, storage and compute are coupled together. That means if you add more nodes, you add storage and compute at the same rate. It also means you’re allocating resources for peak-consumption. That’s inefficient, as you end up in a situation where you either have too much compute or too much storage.
There’s a solution to that, by using a data lake architecture.
Let’s look at how pricing fits into the equation. We’re starting off with data lakes and Redshift Spectrum. There’s a technology component here as well, but we can more about the economics and the cost of running queries.
Companies keep their most recent and “hot” data in a warehouse, like Redshift. It’s the data that’s closet to the current business.
What happens with the “rest” of it, like historical or “unused” data, e.g. columns and tables that nobody queries? You don’t want to delete that data, but also not pay the premium for keeping it in your cluster.
The answer is to store in your data lake. Data Lakes are a cheap long-term storage for structured and unstructured data. In AWS, that product is Amazon S3. The cost for storing data in S3 is about one-tenth of storing it in a Redshift cluster.
Storing data in S3 implies frequent communication between S3 and Redshift if you still want to query it. One approach is to build up an Extract-Transform-Load (ETL) pipeline and load the data into the cluster, and then unload it again once analysis is complete. That’s complex and not very efficient.
Decoupling Storage from Compute
The other approach is to query data in S3 and join it with data in the cluster, via a feature called “Redshift Spectrum”. Spectrum is a “serverless query engine”. “Serverless” means there isn’t any infrastructure to set up or manage.
Point to your data in S3, define the schema, and you can start querying using standard SQL queries. Redshift Spectrum runs queries directly on S3, as if they were normal Redshift tables. By using S3 and Redshift Spectrum you’re separating storage from compute for your cluster. To store more data and process, there’s no need to add more nodes. Instead, you store in S3 and use Redshift Spectrum to join and query it.
This is a huge benefit, as you store your data at a lower cost, but are still able to query in your data warehouse. Separating compute and storage enables customers to scale resources on an as-needed basis, rather than pre-allocating resources for peak consumption.
Redshift Spectrum queries do run slower, but the cost trade-off is worth it, increasing the ROI on your data investment.
And that takes us to pricing.
Amazon Redshift is priced by node type and the number of nodes you’re running. There are two categories of nodes, in parentheses the price per hour for on-demand.
AWS offers two billing models for Redshift. On-demand, or reserved instances. Users can choose their desired model depending on their annual usage plans. The more upfront you can plan your usage, the more you save.
The lowest pricing in the market
Your lowest starting price with Redshift is $1,380 per year for one node of dc2 with a 1-year commitment. Looking at it from a storage perspective, the lowest price comes down to $934.20/TB/Year, with a 3-year commitment for any Dense Storage cluster.
If you want a quick rule of thumb, for a 3-year commitment:
4-5 TB is sufficient for most medium-sized companies, enterprises may go up to about 80-100TB. Very large enterprises that collect massive amounts of data every day go indeed to a Petabyte. A good example is the NASDAQ, which stores trading data in Redshift. But you can get started with as little as a few 100s of GBs, and not pay through the nose.
Overall – these price point are much lower by a factor of about 20x than for on-premise warehouses, which makes it so attractive for adoption.
This post introduced Amazon Redshift to help you get started. We covered the basic concepts behind the cluster and described its most important features.
You could summarize it that Redshift is simple, fast and cheap. It’s no wonder that it has found broad adoption in the SMB market and the enterprise alike.
We, at intermix.io, assist organizations to overcome your analytics problems by giving you a unified view to track your data sources, queries and user activity for your cluster, along with a historical timeline.
Unlike simple monitoring tools, we give you query optimization recommendations and a place to collaborate to improve cluster and query performance.
If you’re suffering from slow queries, dashboards or ETL pipelines, schedule a demo with us or start a free trial. Coupled with our Expert Services for Redshift, you’ll spend less time fighting fires and more time building.