Start Now Login
A DBA’s Guide to the Amazon Redshift Architecture

A DBA’s Guide to the Amazon Redshift Architecture

In this post, we’ll lay out the 5 major components of Amazon Redshift’s architecture.

  1. Data applications
  2. Clusters
  3. Leader nodes
  4. Compute nodes
  5. Redshift Spectrum

Understanding the components and how they work is fundamental for building a data platform with Redshift. In the post, we’ll provide tips and references to best practices for each component.

————-

Since launch, Amazon Redshift has found rapid adoption among SMBs and the enterprise. In the early days, business intelligence was the major use case for Redshift.

That has changed.

Today, we still, of course, see companies using BI dashboards like Tableau, Looker and Periscope Data with Redshift. But with rapid adoption, the uses cases for Redshift have evolved dbeyond reporting. And that has come with a major shift in end-user expectations:

The shift in expectations has implications for the work of the database administrator (“DBA”) or data engineer in charge of running an Amazon Redshift cluster. The static world is gone. Today, data sets have become so large and diverse that data teams have to innovate around how to collect, store, process, analyze and share data.

In the case of Amazon Redshift, much of that depends on understanding the underlying architecture and deployment model. It’s what drives the cost, throughput volume and the efficiency of using Amazon Redshift.

And so in this blog post, we’re taking a closer look at the Amazon Redshift architecture, its components, and how queries flow through those components. We’ll include a few pointers on best practices.

Amazon Redshift Architecture and The Life of a Query

Image 1 shows how Amazon Redshift processes queries across this architecture. We’re excluding Redshift Spectrum in this image as that layer is independent of your Amazon Redshift cluster.  

Image 1: Amazon Redshift Architecture

However, we do recommend using Spectrum from the start as an extension into your S3 data lake. We’ll go deeper into the Spectrum architecture further down in this post. Let’s first take a closer look at role of each one of the five components

Data applications

In other reference architectures for Redshift, you will often hear the term “SQL client application”. And SQL is certainly the lingua franca of data warehousing.

But with the shift away from reporting to new types of use cases, we prefer to use the term “data apps”. Unlike writing plain SQL in an editor, they imply the use of data engineering techniques, i.e. the use of code/software to work with data.

Data apps run workloads or “jobs” on an Amazon Redshift cluster. There are three generic categories of data apps:

  1. Data integration (“Loads”): This category includes applications that move data from external data sources and systems into Redshift. Examples are Informatica, Stitch Data, Fivetran, Alooma, or ETLeap.
  2. Workflow orchestration (“Transforms”): These are systems that run batch jobs on a predetermined schedule. For example, once data is in a cluster you will still need to filter, clean, join or aggregate data across various sources. Examples for these tools in the open source are Apache Airflow, Pinball or Luigi.
  3. Analysis (“Ad-hoc”): These are apps for data science, reporting, and visualization. Examples are Tableau, Jupyter notebooks, Mode Analytics, Looker, Chartio, Periscope Data. Ad-hoc queries might run queries to extract data for downstream consumption, e.g. for a machine learning application or a data API.

What does this mean for the DBA?

The Amazon Redshift architecture is designed to be “greedy”. A query will consume all the resources it can get. To protect workloads from each other, a best practice for Amazon Redshift is to set up workload management (“WLM”). WLM is a key architectural requirement. Setting up your lWLM should be a top-level architecture component.

Cluster

A “cluster” is the core infrastructure component for Redshift, which executes workloads coming from external data apps. There are two key components in a cluster:

  1. Compute Nodes: A cluster contains at least one “compute node”, to store and process data.
  2. Leader Node: Clusters with two or more compute nodes also have a “leader node”. The leader coordinates the distribution of workloads across the compute nodes. A cluster only has one leader node.

In our experience, most companies run multi-cluster environments, also called a “fleet” of clusters. For example, at intermix.io we run a fleet of ten clusters.

What does this mean for the DBA?

It’s easy to spin up a cluster, pump in data and begin performing advanced analytics in under an hour.  That makes it easy to skip some best practices when setting up a new Amazon Redshift cluster. Read more at 3 Things to Avoid When Setting Up an Amazon Redshift Cluster

Leader Node

The leader node has four major roles:

  1. Communication with data apps: When running workloads on a cluster, data apps interact only with the leader node. The compute nodes are transparent to external data apps.
  2. Distribution of workloads: The leader node parses queries, develops an execution plan, compiles SQL into C++ code and then distributes the compiled code to the compute nodes.
  3. Caching of query results: When a query is executed in Amazon Redshift, both the query and the results are cached in the memory of the leader node, across different user sessions to the same database. When query or underlying data have not changed, the leader node skips distribution to the compute nodes and returns the cached result, for faster response times.
  4. Maintenance of catalog tables: The system catalogs store schema metadata, such as information about tables and columns. System catalog tables have a PG prefix. A query that references only catalog tables or that does not reference any tables, runs exclusively on the leader node.

What does this mean for the DBA?

In some cases, the leader node can become a bottleneck for the cluster. The pattern is an increase in your COMMIT queue stats. For example, larger nodes have more metadata, which requires more processing by the leader node. You can Query STL_COMMIT_STATS to determine what portion of a transaction was spent on commit and how much queuing is occurring.

Compute Nodes

The compute nodes handle all query processing, in parallel execution (“massively parallel processing”, short “MPP”).

Amazon Redshift provides two categories of nodes:

As your workloads grow, you can increase the compute and storage capacity of a cluster by increasing the number of nodes, upgrading the node type, or both.

Compute nodes are also the basis for Amazon Redshift pricing. You can start with hourly on-demand consumption. Prices for on-demand range from $0.25 (dense compute) to $6.80 per hour (dense storage), with discounts of up to 69% for 3-year commitments.

What does this mean for the DBA?

The execution speed of a query depends a lot on how fast Redshift can access and scan data that’s distributed across nodes. A best practice is to choose the right distribution style for your data by defining distribution keys.

Adding nodes is an easy way to add more processing power. It’s also an easy way to address performance issues – by resizing your cluster and adding more nodes. Many Redshift customers run with over-provisioned clusters. Because nodes are the basis for pricing, that can add up over time. And removing nodes is a much harder process. But it’s also the only way to reduce your Redshift cost.

Redshift Spectrum

In some cases, it may make sense to shift data into S3. The cost of S3 storage is roughly a tenth of Redshift compute nodes. With Amazon Redshift Spectrum you can query data in Amazon S3 without first loading it into Amazon Redshift.

Image 2 shows what an extended Architecture with Spectrum and query caching looks like. The leader nodes decides:

The leader node includes the corresponding steps for Spectrum into the query plan. The compute nodes in the cluster issue multiple requests to the Amazon Redshift Spectrum layer.

Image 2: Extended Amazon Redshift Architecture with Query Caching and Redshift Spectrum

Spectrum scans S3 data, runs projections, filters and aggregates the results. Spectrum sends the final results back to the compute nodes. The compute nodes run any joins with data sitting in the cluster. That way, you can join data sets from S3 with data sets in Amazon Redshift.

Redshift pricing is based on the data volume scanned, at a rate or $5 per terabyte.

What does this mean for the DBA?

Using Redshift Spectrum is a key component for a data lake architecture. Amazon Redshift is the access layer for your data applications. Spectrum is the query processing layer for data accessed from S3. We’ve written more about the detailed architecture in “Amazon Redshift Spectrum: Diving into the Data Lake

How to get the most out of your Amazon Redshift cluster

In this post, we described the  Amazon Redshift’s architecture. We explained how the architecture affects working with data and queries.

If you want to dive deeper into Amazon Redshift and Amazon Redshift Spectrum, register for one of our public training sessions. Each month, we host a free training with live Q&A to answer your most burning questions about Amazon Redshift and building data lakes on Amazon AWS. If you have a burning question about the architecture that you want to answer right now – open this chat window, we’re around to answer your questions!

Ready to start seeing into your data infrastructure?
Get started with a 14-day free trial, with access to the full platform

No Credit Card Required