AWS Redshift Architecture: Clusters & Nodes & Data Apps, oh my!

In this post, we’ll lay out the 5 major components of Amazon Redshift’s architecture.

  1. Data applications
  2. Clusters
  3. Leader nodes
  4. Compute nodes
  5. Redshift Spectrum

Understanding the components and how they work is fundamental for building a data platform with Redshift. In this post, we’ll provide tips and references to best practices for each component.


Since launch, Amazon Redshift has found rapid adoption among SMBs and the enterprise. In the early days, business intelligence was the major use case for Redshift.

That has changed.

Today, we still, of course, see companies using BI dashboards like Tableau, Looker and Periscope Data with Redshift. But with rapid adoption, the use cases for Redshift have evolved beyond reporting. And that has come with a major shift in end-user expectations:

  • Mission-critical: Redshift is now at the core of data lake architectures, feeding data into business-critical applications and data services the business depends on. End users expect service level agreements (SLAs) for their data sets.
  • Growing data volume: On average, data volume grows 10x every 5 years. The average intermix.io customer doubles their data volume each year. End-users expect data platforms to handle that growth.
  • More choice: We see a constant flux of new data sources and new tools to work with data. End-users expect to operate in a self-service model, to spin up new data sources and explore data with the tools of their choice.

The shift in expectations has implications for the work of the database administrator (“DBA”) or data engineer in charge of running an Amazon Redshift cluster. The static world is gone. Today, data sets have become so large and diverse that data teams have to innovate around how to collect, store, process, analyze and share data.

In the case of Amazon Redshift, much of that depends on understanding the underlying architecture and deployment model. It’s what drives the cost, throughput and efficiency of using Amazon Redshift.

And so in this blog post, we’re taking a closer look at the Amazon Redshift architecture, its components, and how queries flow through those components. We’ll include a few pointers on best practices.


Amazon Redshift Architecture and The Life of a Query

The architecture diagram below shows how Amazon Redshift processes queries across its components. We’re excluding Redshift Spectrum in this image, as that layer is independent of your Amazon Redshift cluster. (We’ll explain that part in a bit.)

Image 1: Amazon Redshift Architecture

However, we do recommend using Spectrum from the start as an extension into your S3 data lake. We’ll go deeper into the Spectrum architecture further down in this post. Let’s first take a closer look at the role of each of the five components.

Data apps: More than SQL client applications

In other reference architectures for Redshift, you will often hear the term “SQL client application”. And SQL is certainly the lingua franca of data warehousing.

But with the shift away from reporting to new types of use cases, we prefer the term “data apps”. Unlike writing plain SQL in an editor, data apps imply the use of data engineering techniques, i.e. the use of code/software to work with data.

Data apps run workloads or “jobs” on an Amazon Redshift cluster. There are three generic categories of data apps:

  1. Data integration (“Loads”): This category includes applications that move data from external data sources and systems into Redshift. Examples are Informatica, Stitch Data, Fivetran, Alooma, or ETLeap.
  2. Workflow orchestration (“Transforms”): These are systems that run batch jobs on a predetermined schedule. For example, once data is in a cluster, you will still need to filter, clean, join or aggregate data across various sources. Open-source examples of these tools are Apache Airflow, Pinball and Luigi.
  3. Analysis (“Ad-hoc”): These are apps for data science, reporting, and visualization. Examples are Tableau, Jupyter notebooks, Mode Analytics, Looker, Chartio and Periscope Data. Analysts might also run ad-hoc queries to extract data for downstream consumption, e.g. for a machine learning application or a data API.


What does this mean for the DBA?

The Amazon Redshift architecture is designed to be “greedy”: a query will consume all the resources it can get. To protect workloads from each other, a best practice for Amazon Redshift is to set up workload management (“WLM”). Setting up your WLM should be a top-level component of your architecture. We’ve also discussed the pros and cons of turning on automatic WLM.
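
As a sketch of what that looks like in practice, manual WLM is configured as a JSON array of queues on the cluster’s parameter group. The queue names, concurrency slots and memory split below are illustrative assumptions, not recommendations:

```python
import json

# Illustrative manual WLM configuration with two queues -- one for
# data integration ("loads") and one for analysis ("ad_hoc").
wlm_config = [
    {
        "query_group": ["loads"],      # route data integration jobs here
        "query_concurrency": 3,
        "memory_percent_to_use": 40,
    },
    {
        "query_group": ["ad_hoc"],     # route analysis / dashboard queries here
        "query_concurrency": 10,
        "memory_percent_to_use": 60,
    },
]

# This JSON string becomes the value of the parameter group's
# "wlm_json_configuration" parameter (set via console, CLI or boto3).
wlm_json = json.dumps(wlm_config)
print(wlm_json)
```

A data app then opts into a queue per session, e.g. by running `SET query_group TO 'loads';` before its load statements.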

Two Types of Nodes that make a Cluster

A “cluster” is the core infrastructure component for Redshift, which executes workloads coming from external data apps. There are two key components in a cluster:

  1. Compute Nodes: A cluster contains at least one “compute node”, to store and process data.
  2. Leader Node: Clusters with two or more compute nodes also have a “leader node”. The leader coordinates the distribution of workloads across the compute nodes. A cluster only has one leader node.

In our experience, most companies run multi-cluster environments, also called a “fleet” of clusters. For example, at intermix.io we run a fleet of ten clusters.

What does this mean for the DBA?

It’s easy to spin up a cluster, pump in data and begin performing advanced analytics in under an hour. That makes it easy to skip some best practices when setting up a new Amazon Redshift cluster. Read more at 3 Things to Avoid When Setting Up an Amazon Redshift Cluster.

Leader Node

The leader node has four major roles:

  1. Communication with data apps: When running workloads on a cluster, data apps interact only with the leader node. The compute nodes are transparent to external data apps.
  2. Distribution of workloads: The leader node parses queries, develops an execution plan, compiles SQL into C++ code and then distributes the compiled code to the compute nodes.
  3. Caching of query results: When a query is executed in Amazon Redshift, both the query and its results are cached in the memory of the leader node, across different user sessions to the same database. When neither the query nor the underlying data has changed, the leader node skips distribution to the compute nodes and returns the cached result, for faster response times.
  4. Maintenance of catalog tables: The system catalogs store schema metadata, such as information about tables and columns. System catalog tables have a PG prefix. A query that references only catalog tables or that does not reference any tables, runs exclusively on the leader node.
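
To make the last two roles concrete, here is a hedged sketch of queries that stay on the leader node, plus one way to spot result cache hits. All names except the system tables are illustrative; SVL_QLOG’s source_query column is non-null when a result was served from the cache:

```python
# Catalog-only query (PG-prefixed system tables never touch the
# compute nodes):
catalog_only = """
SELECT schemaname, tablename
FROM pg_table_def
WHERE schemaname = 'public';
"""

# A query that references no tables at all also runs exclusively
# on the leader node:
no_table = "SELECT current_user, version();"

# SVL_QLOG records a non-null source_query when a result was served
# from the leader node's result cache instead of being re-executed:
cache_check = """
SELECT query, source_query, substring
FROM svl_qlog
WHERE source_query IS NOT NULL
ORDER BY query DESC
LIMIT 20;
"""
```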


What does this mean for the DBA?

In some cases, the leader node can become a bottleneck for the cluster. The telltale pattern is an increase in your COMMIT queue stats. For example, larger clusters have more metadata, which requires more processing by the leader node. You can query STL_COMMIT_STATS to determine what portion of a transaction was spent on commit and how much queuing is occurring.
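
A hedged sketch of what that check can look like. The SQL mirrors the STL_COMMIT_STATS columns (node = -1 is the leader node); the Python below just repeats the queue/commit arithmetic on an illustrative row:

```python
from datetime import datetime, timedelta

# The leader node logs one row per commit per node in STL_COMMIT_STATS.
# This query surfaces how long commits queued vs. how long they ran:
commit_stats_sql = """
SELECT startqueue,
       node,
       DATEDIFF(ms, startqueue, startwork) AS queue_time_ms,
       DATEDIFF(ms, startwork, endtime)    AS commit_time_ms,
       queuelen
FROM stl_commit_stats
WHERE node = -1
ORDER BY startqueue DESC
LIMIT 20;
"""

# The same arithmetic on an illustrative row: a commit that queued
# for 350 ms and then took 120 ms to complete.
startqueue = datetime(2024, 1, 1, 12, 0, 0)
startwork = startqueue + timedelta(milliseconds=350)
endtime = startwork + timedelta(milliseconds=120)

queue_time_ms = (startwork - startqueue) / timedelta(milliseconds=1)
commit_time_ms = (endtime - startwork) / timedelta(milliseconds=1)
print(queue_time_ms, commit_time_ms)  # 350.0 120.0
```

If queue_time_ms consistently dwarfs commit_time_ms, transactions are waiting on the leader node rather than doing work, and batching commits is usually the first fix.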

Compute Nodes

The compute nodes handle all query processing, in parallel execution (“massively parallel processing”, or “MPP” for short).

Amazon Redshift provides two categories of nodes:

  • Dense compute nodes come with solid-state drives (“SSD”) and are best for performance-intensive workloads.
  • Dense storage nodes come with hard disk drives (“HDD”) and are best for large data workloads.

As your workloads grow, you can increase the compute and storage capacity of a cluster by increasing the number of nodes, upgrading the node type, or both.

Compute nodes are also the basis for Amazon Redshift pricing. You can start with hourly on-demand consumption. Prices for on-demand range from $0.25 (dense compute) to $6.80 per hour (dense storage), with discounts of up to 69% for 3-year commitments.
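
As a back-of-the-envelope sketch with the hourly rates above (rates vary by region and node type, so treat the numbers as illustrative):

```python
# Monthly on-demand cost for a cluster, priced per node-hour.
HOURS_PER_MONTH = 730  # average hours in a month

def monthly_on_demand_cost(nodes: int, price_per_node_hour: float) -> float:
    """Cost of running `nodes` nodes around the clock for a month."""
    return nodes * price_per_node_hour * HOURS_PER_MONTH

dc_cost = monthly_on_demand_cost(4, 0.25)  # 4 dense compute nodes: $730/month
ds_cost = monthly_on_demand_cost(4, 6.80)  # 4 dense storage nodes: ~$19,856/month

# A 3-year commitment at the maximum 69% discount:
dc_reserved = dc_cost * (1 - 0.69)
print(dc_cost, ds_cost, round(dc_reserved, 2))
```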

What does this mean for the DBA?

The execution speed of a query depends a lot on how fast Redshift can access and scan data that’s distributed across nodes. A best practice is to choose the right distribution style for your data by defining distribution keys.
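
For illustration, here is what choosing distribution styles can look like in DDL. The table and column names are made up: DISTKEY co-locates rows that join on the same key, while DISTSTYLE ALL replicates a small table to every node.

```python
# Illustrative DDL for a large fact table distributed on its join key.
fact_table_ddl = """
CREATE TABLE order_items (
    order_id   BIGINT,
    product_id BIGINT,
    quantity   INT
)
DISTSTYLE KEY
DISTKEY (order_id)   -- co-locate rows that join on order_id
SORTKEY (order_id);
"""

# Illustrative DDL for a small dimension table replicated to all nodes.
small_dim_ddl = """
CREATE TABLE date_dim (
    date_key   DATE,
    is_weekend BOOLEAN
)
DISTSTYLE ALL;       -- replicate small dimension tables to every node
"""
```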

Adding nodes is an easy way to add more processing power, and an easy way to address performance issues: resize your cluster and add more nodes. Many Redshift customers run with over-provisioned clusters. Because nodes are the basis for pricing, that can add up over time. Removing nodes is a much harder process, but it’s also the most direct way to reduce your Redshift cost.

Redshift Spectrum Architecture

In some cases, it may make sense to shift data into S3. The cost of S3 storage is roughly a tenth of the cost of Redshift compute nodes. With Amazon Redshift Spectrum, you can query data in Amazon S3 without first loading it into Amazon Redshift.

Image 2 shows what an extended architecture with Spectrum and query caching looks like. The leader node decides:

  • if a cached result is returned
  • what part of the query
    • runs locally in the cluster
    • gets sent to Amazon Redshift Spectrum

The leader node includes the corresponding steps for Spectrum into the query plan. The compute nodes in the cluster issue multiple requests to the Amazon Redshift Spectrum layer.

Image 2: Extended Amazon Redshift Architecture with Query Caching and Redshift Spectrum

Spectrum scans S3 data, runs projections, filters and aggregates the results. Spectrum sends the final results back to the compute nodes. The compute nodes run any joins with data sitting in the cluster. That way, you can join data sets from S3 with data sets in Amazon Redshift.
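
A hedged sketch of such a join. The schema and table names are assumptions; `spectrum_schema` stands for an external schema you would create over your S3 data with CREATE EXTERNAL SCHEMA:

```python
# Illustrative query joining an external (S3) table with a local table.
# Spectrum scans and aggregates the S3 side; the compute nodes run
# the join against data stored in the cluster.
join_sql = """
SELECT c.customer_id,
       SUM(h.amount) AS lifetime_spend
FROM spectrum_schema.historical_orders h  -- lives in S3, scanned by Spectrum
JOIN customers c                          -- lives on the cluster's compute nodes
  ON h.customer_id = c.customer_id
GROUP BY c.customer_id;
"""
```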

Redshift Spectrum pricing is based on the data volume scanned, at a rate of $5 per terabyte.
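
Since you pay for bytes scanned rather than bytes stored, columnar formats such as Parquet and partition pruning directly cut the bill. A quick sketch with illustrative numbers:

```python
# Spectrum charges for data scanned in S3, at $5 per terabyte.
PRICE_PER_TB_SCANNED = 5.0

def spectrum_scan_cost(tb_scanned: float) -> float:
    return tb_scanned * PRICE_PER_TB_SCANNED

# Scanning 10 TB of raw, uncompressed CSV:
full_scan = spectrum_scan_cost(10.0)           # $50.00
# Columnar Parquet plus partition pruning might touch only ~5% of
# the bytes (an illustrative assumption, not a guarantee):
pruned_scan = spectrum_scan_cost(10.0 * 0.05)  # $2.50
print(full_scan, pruned_scan)
```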

What does this mean for the DBA?

Using Redshift Spectrum is a key component of a data lake architecture. Amazon Redshift is the access layer for your data applications. Spectrum is the query processing layer for data accessed from S3. We’ve written more about the detailed architecture in “Amazon Redshift Spectrum: Diving into the Data Lake”.

How to get the most out of your Amazon Redshift cluster

In this post, we described Amazon Redshift’s architecture and explained how the architecture affects working with data and queries.

If you want to dive deeper into Amazon Redshift and Amazon Redshift Spectrum, register for one of our public training sessions. Each month, we host a free training with live Q&A to answer your most burning questions about Amazon Redshift and building data lakes on Amazon AWS. If you have a burning question about the architecture that you want to answer right now – open this chat window, we’re around to answer your questions!

Lars Kamp

