Table of Contents
This is a guest blog post by our visiting expert Dieter Matzion. Dieter is a Business Systems Analyst at Intuit. In the post, Dieter explains how his team has built a KPI dashboard to track cloud spend on top of Amazon Redshift. With the dashboard, over 5,000 Intuit employees gain self-service visibility into their utilization of cloud resources. You can follow & connect with Dieter on Linkedin.
The cloud computing market is forecasted by Forbes and Gartner to grow about 17% in 2020, to an aggregated annual revenue of over a quarter trillion dollars. Almost 70% of enterprises are moving business-critical applications to the cloud says Tech Republic. Companies using the cloud can focus on their core business vs. the undifferentiated heavy lifting as Jeff Bezos calls it.
Cloud has become a competitive advantage that drives innovation with unprecedented speed. Economies of scale allow Intuit to run a 100,000 core data analysis job at any time we choose by simply writing a check. Elasticity allows builders to scale vertically and horizontally at the click of a mouse button. The global presence of the cloud allows enterprises to provide services close to their customers. And shifting from a data center to the cloud allows CFOs to move capital expenditures to operational ones. You can read our cloud migration story in “Intuit’s Journey to Cloud Analytics“.
However, the cloud is very different from a data center as is evident by the many authors talking about its disadvantages. The procurement model for the cloud is comparable to a cell phone bill that you are expected to pay in full at the end of the month.
Cloud governance is often de-prioritized due to the complexity and favoring innovation, opening the door for runaway spending. Renting managed services has a higher cost compared to bare metal machines in the data center. Cost items like data transfer and storage covered by someone else in the data center now show up on your business unit’s monthly bill. Prepaid services need to be amortized, and your organization may have different financial reporting needs than provided out-of-the-box.
Lifting and shifting from the data center to the cloud inherits all inefficiencies and right-sizing is new skill engineers need to learn. Enterprises relinquish central control of security, privacy, and compliance by assuming a shared responsibility model with the cloud provider.
But all is not doom and gloom, and ignoring the cloud gives an advantage to your competitors who are willing to master the new environment. In this article, I am going to share best practices and learnings of how large enterprises overcome these obstacles.
Let’s start with visibility. Managing cost in a data center is slow-paced. Procurement of hardware, racking, and installing software takes months. Budgets are stable and forecasted years in advance. Using the same method in the cloud is like driving a car at top speed – blindfolded – and expecting not to crash.
Unpredictable cost is the top pain point with lack of visibility a close second of 300 C-level and IT decision-makers surveyed by SoftwareONE in 2018.
Fortunately, major cloud providers provide detailed records of how you spent your money. Unfortunately, this insight may come too late, requires training before it becomes usable, and may not align with the financial reporting requirements of your organization.
At Intuit we process 4 billion rows of AWS billing data (the “AWS Cost and Usage Report”) every day into Amazon Redshift, which feeds financial reports tailored to different audiences. We layer in amortization of prepaid services and chargebacks to obtain financial numbers that are directly compared to cloud budgets. You can read about the detailed data architecture and processes in the post “Intuit’s to Cloud Analytics“.
Latencies of billing data are filled with data from inventory changes to get near-real-time cost estimates. This allows us to alert business owners of budget risks before runaway spend occurs.
The billing data is further enhanced with performance and efficiency telemetry like compute, storage, database, container, and serverless utilization data. This allows us to surface actionable insights for right-sizing and cost avoidance to executives and engineers alike. Our CTO and CFO review the top opportunities from the largest vertical down to a single engineer on a quarterly basis. This executive commitment cascades down the organization and is key to efficient operation in the cloud.
Surface actionable insights for right-sizing and cost avoidance to executives and engineers.Share this Amazon Redshift Transformational Use Case
The lowest hanging fruit for saving or avoiding the cost in the cloud are discounts. This is because a small team can implement these without having to engage a large number of engineers. Enterprise discount agreements typically give you a flat percentage rate depending on a specific commitment. These will vary for each customer as they are based on the volume of cloud services consumed.
Additionally, most cloud providers offer so-called sustained usage discounts, a percentage discount for a specific service in exchange for a longer-term commitment. Intuit actively manages a $100M portfolio of AWS Reserved Instances. These are prepaid leases for EC2, RDS, Redshift, Elasticache, ElasticSearch, and DynamoDB. Discounts can range up to 70% depending on the type of commitment. The portfolio is managed centrally using 3rd party tools like CloudHealth. Unit pricing is calculated hourly and exposed via an API. This allows automated cost decisions where software chooses most economic deployment patterns.
Cloud governance has a heavier lift compared to the previous areas as policies will affect how engineers can use the cloud. You want to establish guardrails instead of gatekeepers to promote the speed of innovation. Each policy is customized based on its purpose and maintained as code to allow real-time enforcement. Management and engineers need to be aware of the policies to innovate efficiently.
Intuit uses a customized fork of Capital One’s Cloud Custodian for automated policy enforcement. We maintain several hundred policies in an internal Github, 40 of them are focused on cloud cost. Here are a few examples of policies that will result in the most savings for the least effort.
Take advantage of the cloud’s elasticity by stopping to incur cost when turning off resources that are not in use. So-called cloud parking is automation that is executed to decommission and reprovision cloud resources on a schedule. Development or testing environments that no one is using after business hours are good candidates. Assuming a 12 hour on time on weekdays, this will save over 60% where the policy can be enabled.
Setup red versus green policies to restrict certain usages to maximize discounts. For example, if your company is only purchasing Reserved Instances in 2-3 regions, you can opt to terminate new workload in all other regions to avoid paying an undiscounted price. Other examples are to disallow expensive services or services for which you don’t have discounts. You want to automate exceptions, where business owners can override the policy at any time, and get a report of how this affected their budget.
Cloud providers generally have some data lifecycle management capabilities, and you may need to supplement these. A good example is unattached Elastic Block Store (EBS) on AWS. EBS is a like virtual disk drive, you can think of it as a thumb drive. When it is not connected to a computer, it still incurs a cost, as the storage cannot be reused. Intuit uses policy automation to backup unattached EBS and deletes the backups after a month if no one claims them. Engineers with a valid business reason can tag unattached EBS volumes, which the policy then ignores.
We’d love to learn from you! Is there anything you can share about your own experience building in the cloud? We are always excited to share transformational use cases of Amazon Redshift. And if you want faster queries for your cloud analytics, and spend less time on Ops and more time on Dev like Intuit, then go ahead and schedule a demo or start a trial with intermix.io.
Data is valuable resource powering up analytics, predictive models and decision making. For a company to make data-driven decisions, it first must go through building its data infrastructure. And a data warehouse plays a central role in such an infrastructure.
Data warehouses are data storage and processing systems that aggregate data from different sources into a single place. With data in one place, you can combine and query it across the different sources. That includes data coming from users interacting through web and mobile, background system logs, or third party data.
Amazon Redshift is a cloud warehouse solution by Amazon Web Services (AWS). Since AWS first introduced Redshift in 2012, it got everyone’s attention for its amazing performance and low price point. In the following years, there were massive improvements from operational and data pipeline perspective. Today, it’s the market leader for cloud warehouses.
So how does Redshift work, and what’s been driving its adoption? There are two basic components you need to understand about Amazon Redshift:
The technology is nothing new. Other warehouses use it, and there are even open source data warehouses that are free to use. It’s the combination of the two, and the simplicity that Redshift offers to start with a data warehouse.
In this post, I’ll explain these two components. Before that, it’s helpful to understand basic nomenclature and key concepts. We’ll start with the key concepts.
If you’ve ever googled “Redshift” you must have read the following. Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud.
Let’s break down what this means, and explain a few other key concepts that are helpful for context on how Redshift operates.
Fully managed. AWS takes care of things like warehouse setup, operation and redundancy, as well as scaling and security. All this is automated in the background, so the client has a smooth experience.
Warehouse service. Redshift is a cloud service, the customer does not own the physical hardware of the warehouse, but can use it through a subscription as a service. This has dramatic impact on the procurement model for customers.
Getting started with Redshift is easy. Rather than buying and installing hardware, they can spin up a Redshift warehouse, upload data and run queries in less than 15 minutes. And since AWS manages a fleet of tens of thousands of Redshift clusters, customers benefit from automating capabilities that would not make economic sense for any individual on-premise DBA.
Petabyte. There are fifteen zeros in a petabyte, 1,000x bigger than a terabyte. To illustrate, this is equivalent to 13.3 years of HD video. Compare that with traditional on-premise data warehouses that operate in the terabyte range. Going to petabytes has dramatic impact on the analytical capabilities.
Many companies today already generate a terabyte of data per day. In the old on-premise world, storing and analyzing that much data would have been cost-prohibitive, leading to the famous “analysis gap” or “dark data”. They may collect data but don’t analyze it. Redshift solves that problem.
Cluster. A cluster is a collection of nodes which perform the actual storing and processing of data. Each cluster runs an Amazon Redshift engine distributed and replicated across all nodes.
Nodes. AWS offers four different node types for Redshift. Each node type comes with a combination of computing resources (CPU, memory, storage and I/O)
Databases. A cluster can have one or more databases. A database contains one or more named schemas, which in turn contain a set of tables and other objects.
Workloads. You’ll often hear the term “workloads”, which implies “running queries” on your Redshift cluster. There are three generic types of workloads.
Understanding and distinguishing the three generic workloads from each other is an important concept for any data warehouse. Different Workloads use different amounts of cluster resources, and it makes total sense to be able to keep disparate workloads separate, to avoid resource contention. Redshift manages them with the “Workload Manager”. We explain workloads in detail in our post “4 Simple Steps to Workload Management”.
With the key concept out of the way, we can dive into the two components, technology and pricing. We’ll see each one of them factors in, but it’s the combination that makes the magic happen and allows Redshift to reach state-of-the-art performance at a low price.
One of the key concepts behind Redshift’s exceptional performance is its columnar data storage.
Columnar storage is a type of Relational Database Management System optimized for heavy-reading tasks. Most of the queries you run on your warehouse are scans, powering up analytics (?), dashboards, or machine learning models.
A common term for these types of queries is “OLAP” – “OnLine Analytical Processing”. As the name suggests, OLAP databases are good for analytics. OLAP queries tend to be complex and process a lot of data, touching many different tables. In fact, Redshift is an “OLAP” database engine.
Compare this with row-oriented systems which are a better option for write-heavy tasks. It’s a different data modeling approach used for business applications which process a company’s transactions. You’ll hear the term “production database”, or “online transaction processing” (“OLTP”). Most OLTP databases tend to have large amounts of small queries and a high percent of write activity. In general, OLTP systems load small amounts of data with high frequency.
A Simple Example
For a simple example, let’s take a website like Glassdoor, with information on companies, their salary levels and reviews by their employees.
You create an account on Glassdoor. In your profile you submit the company you work for and your salary. That transaction creates a record in Glassdoor’s production database. That same production database also holds records for all other user profiles.
Let’s say Glassdoor creates a new marketing report with the average salary across different companies. They may even break that down by level or geography and show a historical trend. For that information, they need to create a query that selects, filters and aggregates the records for a corresponding company. It’s a much larger query compared to the initial write query for each individual user in Glassdoor’s system.
Comparing row-oriented vs. column-oriented formats
Now let’s see how that query plays out in row-based vs. a columnar-oriented format. The following table illustrates the difference between the two approaches.
You can see that in the column-oriented the data is flipped around. Each row now represents a column. So why would a columnar storage work better for a warehouse?
Analytical queries execute fast on tables in columnar format. Let’s take for example a query “give me the average salary per company”. This query only touches the “salary” and “company” columns, groups by company and then calculates the averages.
Compare that to running the same query on a row-oriented database. The query needs to sift through the entire production database. To return a result, it has to read every single row. It runs much longer and is inefficient because it’s scanning the entire database.
There’s an additional catch here. If a query scans less rows, it consumes less data. As a result, the data fits into the RAM and can process in-memory. Memory / RAM is much faster than disk. Even more, since the data is stored column-wise, Redshift applies different data encodings per text and numerical columns, thus more efficiently compressing the data.
Columnar format is nothing new. On-premise warehouses like Oracle, Teradata and Netezza use it as well. The difference here is that they require hardware, installation and maintenance. It can take weeks and months to get running. Modern cloud warehouses like Redshift deliver the database as a service, with a cluster up and running in less than 15 minutes.
Redshift also pairs columnar storage with scaling out the number of nodes, which get us to “MPP” – massively parallel processing.
MPP is the process of coordinated, simultaneous computation of data across multiple nodes in the cluster. Each node is using its own operating system and memory, also known as a “loosely coupled” system.
An alternative is “tightly coupled”, or Symmetric Parallel Systems (SMP). Separate processors use a single operating system and memory. For warehouse solutions, MPP has shown to deliver better results.
MPP Distributes Compute Loads
The Redshift architecture uses MPP and consists of a leader node and compute nodes. The leader node distributes rows of a table across the different nodes which independently store and process data and queries. How you distribute data across the nodes depends on your distribution key.
The leader node coordinates the execution of a query across the different nodes. This applies for all types of workloads. Once done, the leader node combines the results from each node, to return the final result of the query.
This process has many advantages. By adding nodes, you add more storage, memory and CPU. With each node, a cluster’s processing capacity goes up and scales in a linear fashion. That means that a 10-node cluster processes the same query about twice as fast as a 5-node cluster.
This concept applies to all three types of generic workloads (loads, transforms and ad-hoc queries). For example, loading flat files into Redshift is a very efficient process and also takes advantage of parallel processing. The leader node spreads the workload across the nodes while reading from multiple files. Loading also scales linearly as you add more nodes to your cluster.
Elasticity – How Redshift Scales
Because of MPP’s simultaneous processing, as you add more nodes to your cluster, you load data faster, execute transform faster, and return the results of ad-hoc queries faster. And Redshift takes advantage of the elasticity of the cloud. Adding nodes is a simple process and takes a few clicks in the AWS console.
We describe Redshift’s linear scalability in our use case story “Intuit’s Journey to Cloud Analytics”. The key to linear scaling the correct configuration of your cluster. Just like with any other database, if you don’t’ configure it right, you won’t enjoy the performance benefits. You can read more in our post “How we configure Amazon Redshift for Performance”.
With traditional MPP databases, storage and compute are coupled together. That means if you add more nodes, you add storage and compute at the same rate. It also means you’re allocating resources for peak-consumption. That’s inefficient, as you end up in a situation where you either have too much compute or too much storage.
There’s a solution to that, by using a data lake architecture.
Let’s look at how pricing fits into the equation. We’re starting off with data lakes and Redshift Spectrum. There’s a technology component here as well, but we can more about the economics and the cost of running queries.
Companies keep their most recent and “hot” data in a warehouse, like Redshift. It’s the data that’s closet to the current business.
What happens with the “rest” of it, like historical or “unused” data, e.g. columns and tables that nobody queries? You don’t want to delete that data, but also not pay the premium for keeping it in your cluster.
The answer is to store in your data lake. Data Lakes are a cheap long-term storage for structured and unstructured data. In AWS, that product is Amazon S3. The cost for storing data in S3 is about one-tenth of storing it in a Redshift cluster.
Storing data in S3 implies frequent communication between S3 and Redshift if you still want to query it. One approach is to build up an Extract-Transform-Load (ETL) pipeline and load the data into the cluster, and then unload it again once analysis is complete. That’s complex and not very efficient.
Decoupling Storage from Compute
The other approach is to query data in S3 and join it with data in the cluster, via a feature called “Redshift Spectrum”. Spectrum is a “serverless query engine”. “Serverless” means there isn’t any infrastructure to set up or manage.
Point to your data in S3, define the schema, and you can start querying using standard SQL queries. Redshift Spectrum runs queries directly on S3, as if they were normal Redshift tables. By using S3 and Redshift Spectrum you’re separating storage from compute for your cluster. To store more data and process, there’s no need to add more nodes. Instead, you store in S3 and use Redshift Spectrum to join and query it.
This is a huge benefit, as you store your data at a lower cost, but are still able to query in your data warehouse. Separating compute and storage enables customers to scale resources on an as-needed basis, rather than pre-allocating resources for peak consumption.
Redshift Spectrum queries do run slower, but the cost trade-off is worth it, increasing the ROI on your data investment.
And that takes us to pricing.
Amazon Redshift is priced by node type and the number of nodes you’re running. There are two categories of nodes, in parentheses the price per hour for on-demand.
AWS offers two billing models for Redshift. On-demand, or reserved instances. Users can choose their desired model depending on their annual usage plans. The more upfront you can plan your usage, the more you save.
The lowest pricing in the market
Your lowest starting price with Redshift is $1,380 per year for one node of dc2 with a 1-year commitment. Looking at it from a storage perspective, the lowest price comes down to $934.20/TB/Year, with a 3-year commitment for any Dense Storage cluster.
If you want a quick rule of thumb, for a 3-year commitment:
4-5 TB is sufficient for most medium-sized companies, enterprises may go up to about 80-100TB. Very large enterprises that collect massive amounts of data every day go indeed to a Petabyte. A good example is the NASDAQ, which stores trading data in Redshift. But you can get started with as little as a few 100s of GBs, and not pay through the nose.
Overall – these price point are much lower by a factor of about 20x than for on-premise warehouses, which makes it so attractive for adoption.
This post introduced Amazon Redshift to help you get started. We covered the basic concepts behind the cluster and described its most important features.
You could summarize it that Redshift is simple, fast and cheap. It’s no wonder that it has found broad adoption in the SMB market and the enterprise alike.
We, at intermix.io, assist organizations to overcome your analytics problems by giving you a unified view to track your data sources, queries and user activity for your cluster, along with a historical timeline.
Unlike simple monitoring tools, we give you query optimization recommendations and a place to collaborate to improve cluster and query performance.
If you’re suffering from slow queries, dashboards or ETL pipelines, schedule a demo with us or start a free trial. Coupled with our Expert Services for Redshift, you’ll spend less time fighting fires and more time building.
The Amazon Redshift architecture allows to scale up by adding more nodes to a cluster. That can cause over-provisioning of nodes to address peak query volume. Unlike adding nodes, Concurrency Scaling adds more query processing power on an as-needed basis.
The typical Concurrency Scaling for Amazon Redshift gives Redshift clusters additional capacity to handle bursts in query load. It works by off-loading queries to new, “parallel” clusters in the background. Queries are routed based on WLM configuration and rules.
Pricing for Concurrency Scaling is based on a credit model, with a free usage tier and beyond that a charge based on the time that a Concurrency Scaling cluster runs queries.
We tested Concurrency Scaling for one of our internal clusters. In this post, we present the results of our usage of the feature and information on how to get started.
To be eligible for concurrency scaling, an Amazon Redshift cluster must meet the following three requirements:
Concurrency scaling does not work on all query types. For the first release, it handles read-only queries that meet three conditions:
For routing to a Concurrency Scaling cluster, a query needs to encounter queueing. Also, queries eligible for SQA (Short Query Acceleration) queue will not run on the Concurrency Scaling clusters.
Queuing and SQA are a function of a proper set-up of Redshift’s workload management (WLM). We recommend first optimizing your WLM because it will reduce the need for Concurrency Scaling. And that matters because Concurrency Scaling comes free up to a certain amount of hours. In fact, AWS claims that Concurrency Scaling will be free for 97% of customers, which takes us to pricing.
We’ve also tested enabling Redshift’s automatic WLM and captured our experience with it in this blog post “Should I Enable Amazon Redshift’s Automatic WLM?“
For Concurrency Scaling, AWS has introduced a credit model. Each active Amazon Redshift cluster earns credits on an hourly basis, up to one hour of free Concurrency Scaling credits per day.
You only pay when your use of the concurrency scaling clusters exceeds the amount of credits that you’ve incurred.
The fee equals the per-second on-demand rate for a transient cluster used in excess of the free credits – only when it’s serving your queries – with a one-minute minimum charge each time a Concurrency Scaling cluster is activated. Calculating the per-second on-demand rate is based on general Amazon Redshift pricing, i.e. by node type and number of nodes in your cluster.
Concurrency scaling is enabled on a per-WLM queue basis. Go to the AWS Redshift Console and click on “Workload Management” from the left-side navigation menu. Select your cluster’s WLM parameter group from the subsequent pull-down menu.
You should see a new column called “Concurrency Scaling Mode” next to each queue. The default is ‘off’. Click ‘Edit’ and you’ll be able to modify the settings for each queue.
Concurrency scaling works by routing eligible queries to new, dedicated clusters. The new clusters have the same size (node type and number) as the main cluster.
The number of clusters used for concurrency scaling defaults to one (1), with the option to configure up to ten (10) total clusters.
The total number of clusters that should be used for concurrency scaling can be set by the parameter max_concurrency_scaling_clusters. Increasing the value of this parameter provisions additional standby clusters.
There are a few additional charts in the AWS Redshift console. There is a chart called “Max Configured Concurrency Scaling Clusters” which plots the value of max_concurrency_scaling_clusters over time.
The number of Active Scaling clusters is also shown in the UI under Concurrency Scaling Activity:
The Queries tab in the UI also has a column to show if the query ran on the Main cluster or on the Concurrency Scaling cluster:
Whether a particular query ran on the main cluster or via a Concurrency Scaling cluster is stored in stl_query.concurrency_scaling_status.
A value of 1 means the query ran on a Concurrency Scaling cluster, and other values mean it ran on the main cluster.
redshiftcluster_2=# select distinct
concurrency_scaling_status,count(*) from stl_query where endtime <
'2019-03-29 15:00:00' group by concurrency_scaling_status;
concurrency_scaling_status | count
2 | 21
0 | 310790
4 | 19818
6 | 69082
11 | 7
3 | 853546
8 | 228977
Concurrency Scaling info is also stored in some other tables/views, e.g. SVCS_CONCURRENCY_SCALING_USAGE. There’s a list of catalog tables that store concurrency scaling information.
We enabled Concurrency Scaling for a single queue on an internal cluster at approximately 2019-03-29 18:30:00 GMT. We changed the max_concurrency_scaling_clusters parameter to 3 at approximately 2019-03-29 20:30:00.
To simulate query queuing, we lowered the # of slots for the queue from 15 slots to 5 slots.
Below is a chart from the intermix.io dashboard, showing the running versus queuing queries for this queue, after cranking down the number of slots.
We observe that the queueing time for queries went up, maxing out at about > 5 minutes.
Here’s the corresponding summary in the AWS console of what happened during that time:
Redshift spun up three (3) concurrency scaling clusters as requested. It appears that these clusters were not fully utilized, even though our cluster had many queries that were queuing.
The usage chart correlates closely with the scaling activity chart:
After a few hours, we checked and it looked like 6 queries ran with concurrency scaling. We also spot-checked two queries against the UI. We haven’t checked how this value may be used if multiple concurrency clusters are active.
redshiftcluster_2=# select distinct
concurrency_scaling_status,count(*) from stl_query where endtime >
'2019-03-29 18:30:00' group by concurrency_scaling_status;
concurrency_scaling_status | count
4 | 108
6 | 333
1 | 6
0 | 913
3 | 4495
8 | 304
Concurrency Scaling may mitigate queue times during bursts in queries.
From this basic test, it appears that a portion of our query load improved as a result. However, simply enabling Concurrency Scaling didn’t fix all of our concurrency problems. The limited impact is likely due to the limitations on the types of queries that can use Concurrency Scaling. For example, we have a lot of tables with interleaved sort keys, and much of our workload is writes.
While concurrency scaling doesn’t appear to be a silver bullet solution for WLM tuning in all cases, using the feature is easy and transparent.
We do recommend enabling the feature on your WLM queues. Start with a single concurrency cluster, and monitor the peak load via the console to determine whether the new clusters are being fully utilized.
As AWS adds support for additional query/table types, Concurrency Scaling should become more and more effective.
In this post, we’ll lay out the 5 major components of Amazon Redshift’s architecture.
Understanding the components and how they work is fundamental for building a data platform with Redshift. In the post, we’ll provide tips and references to best practices for each component.
See 14 real-life examples of data pipelines built with Amazon Redshift
Since launch, Amazon Redshift has found rapid adoption among SMBs and the enterprise. In the early days, business intelligence was the major use case for Redshift.
That has changed.
Today, we still, of course, see companies using BI dashboards like Tableau, Looker and Periscope Data with Redshift. But with rapid adoption, the uses cases for Redshift have evolved beyond reporting. And that has come with a major shift in end-user expectations:
The shift in expectations has implications for the work of the database administrator (“DBA”) or data engineer in charge of running an Amazon Redshift cluster. The static world is gone. Today, data sets have become so large and diverse that data teams have to innovate around how to collect, store, process, analyze and share data.
In the case of Amazon Redshift, much of that depends on understanding the underlying architecture and deployment model. It’s what drives the cost, throughput volume and the efficiency of using Amazon Redshift.
And so in this blog post, we’re taking a closer look at the Amazon Redshift architecture, its components, and how queries flow through those components. We’ll include a few pointers on best practices.
This architecture diagram shows how Amazon Redshift processes queries across this architecture. We’re excluding Redshift Spectrum in this image as that layer is independent of your Amazon Redshift cluster. (We’ll explain that part in a bit.)
However, we do recommend using Spectrum from the start as an extension into your S3 data lake. We’ll go deeper into the Spectrum architecture further down in this post. Let’s first take a closer look at role of each one of the five components
In other reference architectures for Redshift, you will often hear the term “SQL client application”. And SQL is certainly the lingua franca of data warehousing.
But with the shift away from reporting to new types of use cases, we prefer to use the term “data apps”. Unlike writing plain SQL in an editor, they imply the use of data engineering techniques, i.e. the use of code/software to work with data.
Data apps run workloads or “jobs” on an Amazon Redshift cluster. There are three generic categories of data apps:
What does this mean for the DBA?
The Amazon Redshift architecture is designed to be “greedy”. A query will consume all the resources it can get. To protect workloads from each other, a best practice for Amazon Redshift is to set up workload management (“WLM”). WLM is a key architectural requirement. Setting up your WLM should be a top-level architecture component. We’ve also discussed the pros and cons of turning on automatic WLM.
A “cluster” is the core infrastructure component for Redshift, which executes workloads coming from external data apps. There are two key components in a cluster:
In our experience, most companies run multi-cluster environments, also called a “fleet” of clusters. For example, at intermix.io we run a fleet of ten clusters.
What does this mean for the DBA?
It’s easy to spin up a cluster, pump in data and begin performing advanced analytics in under an hour. That makes it easy to skip some best practices when setting up a new Amazon Redshift cluster. Read more at 3 Things to Avoid When Setting Up an Amazon Redshift Cluster
The leader node has four major roles:
What does this mean for the DBA?
In some cases, the leader node can become a bottleneck for the cluster. The pattern is an increase in your COMMIT queue stats. For example, larger nodes have more metadata, which requires more processing by the leader node. You can Query STL_COMMIT_STATS to determine what portion of a transaction was spent on commit and how much queuing is occurring.
The compute nodes handle all query processing, in parallel execution (“massively parallel processing”, short “MPP”).
Amazon Redshift provides two categories of nodes:
As your workloads grow, you can increase the compute and storage capacity of a cluster by increasing the number of nodes, upgrading the node type, or both.
Compute nodes are also the basis for Amazon Redshift pricing. You can start with hourly on-demand consumption. Prices for on-demand range from $0.25 (dense compute) to $6.80 per hour (dense storage), with discounts of up to 69% for 3-year commitments.
What does this mean for the DBA?
The execution speed of a query depends a lot on how fast Redshift can access and scan data that’s distributed across nodes. A best practice is to choose the right distribution style for your data by defining distribution keys.
Adding nodes is an easy way to add more processing power. It’s also an easy way to address performance issues – by resizing your cluster and adding more nodes. Many Redshift customers run with over-provisioned clusters. Because nodes are the basis for pricing, that can add up over time. And removing nodes is a much harder process. But it’s also the only way to reduce your Redshift cost.
In some cases, it may make sense to shift data into S3. The cost of S3 storage is roughly a tenth of Redshift compute nodes. With Amazon Redshift Spectrum you can query data in Amazon S3 without first loading it into Amazon Redshift.
Image 2 shows what an extended Architecture with Spectrum and query caching looks like. The leader nodes decides:
The leader node includes the corresponding steps for Spectrum into the query plan. The compute nodes in the cluster issue multiple requests to the Amazon Redshift Spectrum layer.
Spectrum scans S3 data, runs projections, filters and aggregates the results. Spectrum sends the final results back to the compute nodes. The compute nodes run any joins with data sitting in the cluster. That way, you can join data sets from S3 with data sets in Amazon Redshift.
Redshift pricing is based on the data volume scanned, at a rate or $5 per terabyte.
What does this mean for the DBA?
Using Redshift Spectrum is a key component for a data lake architecture. Amazon Redshift is the access layer for your data applications. Spectrum is the query processing layer for data accessed from S3. We’ve written more about the detailed architecture in “Amazon Redshift Spectrum: Diving into the Data Lake”
In this post, we described the Amazon Redshift’s architecture. We explained how the architecture affects working with data and queries.
If you want to dive deeper into Amazon Redshift and Amazon Redshift Spectrum, register for one of our public training sessions. Each month, we host a free training with live Q&A to answer your most burning questions about Amazon Redshift and building data lakes on Amazon AWS. If you have a burning question about the architecture that you want to answer right now – open this chat window, we’re around to answer your questions!
At intermix.io, we work with companies that build data pipelines and data lakes in the cloud. Some start “cloud-native”, others migrate from on-premise solutions like Oracle or Teradata. What they all have in common though is the one question they ask us at the very beginning:
“How do other companies build their data pipelines?”
And so that’s why we decided to compile and publish a list of publicly available blog posts about how companies build their data pipelines. In those posts, the companies talk in detail about how they’re using data in their business, and how they’ve become “data-centric”.
Here’s the short list of the 14 companies:
11. Dollar Shave Club
And we’re sorry if we missed your post – we’re happy to include it, just fill out this form (take less than a minute). And with that – please meet the 14 examples of data pipelines from the world’s most data-centric companies
Getting data-driven is the main goal for Simple. It’s important for the entire company to have access to data internally. Instead of the analytics and engineering teams to jump from one problem to another, a unified data architecture spreading across all departments in the company allows building a unified way of doing analytics. The main problem then is how to ingest data from multiple sources, process it, store it in a central data warehouse, and present it to staff across the company. Similar to many solutions nowadays, data is ingested from multiple sources into Kafka, before passing it to compute and storage systems. The warehouse of choice is Redshift, selected because of its SQL interfaces, and the ease to process petabytes of data. Finally reports, analytics and visualizations are powered using Periscope Data. In such a way, the data is easily spread across different teams allowing them to make decisions based on data.
Clearbit was a rapidly growing, early-stage startup when it started thinking of expanding its data infrastructure and analytics. After trying out a few out-of-the-box analytics tools (and each of them failed to satisfy the company’s demands) they took building the infrastructure in their own hands. Their efforts converged into a trio of providers: Segment, Redshift, and Mode. Segment is responsible for ingesting all kind of data, combining it, and syncing it daily into a Redshift instance. The main data storage is obviously left to Redshift, with backups into AWS S3. Finally, since Redshift supports SQL, Mode is a perfectly fitted tool which is used to running queries (while using Redshift’s powerful data processing abilities) and creating data insights.
Mode make it easy to explore, visualize and share that data across your organization.
But as data volume grows, that’s when data warehouse performance goes down. With ever increasing calls to your data from analysts, your cloud warehouse becomes the bottleneck. Uncertainty why queries take longer and longer to complete frustrates analysts and engineers alike.
That’s why we’ve built intermix.io, so that Mode users get all the tools they need to optimize their queries running on Amazon Redshift. Here one of our dashboards that shows you how you can track queries from Mode down to the single user:
The whole data architecture at 500px is mainly based on two tools: Redshift – for data storage, and Periscope – for analytics, reporting, and visualization. From a customer-facing side, the company’s web and mobile apps run on top of a few API servers, backed by several databases – mostly MySQL. Data in these DBs is then processed through a Luigi ETL, before storing it to S3 and Redshift. Splunk here does a great job in querying and summarizing text-based logs. Periscope Data is responsible for building data insights and sharing them across different teams in the company. All in all, this infrastructure supports around 60 people distributed across a couple of teams within the company, as of 2015.
The data infrastructure at Netflix is certainly one of the most complex ones, having in mind that they serve over 550 billion events per day, equaling roughly to 1.3 petabytes of data. In general, Netflix’s architecture is broken down into smaller systems, such as systems for data ingestion, analytics, predictive modeling etc. The data stack employed in the core of Netflix is mainly based on Apache Kafka for real-time (sub-minute) processing of events and data. Data needed in the long-term is sent from Kafka to AWS’s S3 and EMR for persistent storage, but also to Redshift, Hive, Snowflake, RDS and other services for storage regarding different sub-systems. Metacat is built to make sure the data platform can interoperate across these data sets as a one “single” data warehouse. Its task is to actually connect different data sources (RDS, Redshift, Hive, Snowflake, Druid) with different compute engines (Spark, Hive, Presto, Pig). Other Kafka outputs lead to a secondary Kafka sub-system, predictive modeling with Apache Spark, and Elasticsearch. Operational metrics don’t flow through the data pipeline but through a separate telemetry system named Atlas.
The tech world has changed dramatically since Yelp was launched back in 2004. Eight years later and Yelp started its change. It transformed from running a huge monolithic application on-premises to one built on microservices running in the AWS cloud. By the end of 2014, there were more than 150 production services running, with over 100 of them owning data. Its main part of the cloud stack is better known as PaaSTA, based on Mesos and Docker, offloading data to warehouses such as Redshift, Salesforce and Marketo. Data enters the pipeline through Kafka, which in turn receives it from multiple different “producer” sources.
Gusto, founded in 2011, is a company that provides a cloud-based payroll, benefits and workers’ compensation solution for businesses. Their business has grown steadily over the years, currently topping to around 60 thousand customers. By early 2015, there was a growing demand within the company for access to data. Up until then, the engineering team and product managers were running their own ad-hoc SQL scripts on production databases. There was obviously a need to build a data-informed culture, both internally and for their customers. When coming to the crossroad to either build a data science or data engineering team, Gusto seems to have done the right choice: first build a data infrastructure which then would support analysts in generating insights and drawing prediction models.
The first step for Gusto was to replicate and pipe all of their major data sources into a single warehouse. The warehouse choice landed on an AWS Redshift cluster, with S3 as underlying data lake. Moving data from production app databases into Redshift was then facilitated with Amazon’s Database Migration Service. On the other side of the pipeline, Looker is used as a BI front-end that teams throughout the company can use to explore data and build core dashboards. Aleph is a shared web-based tool for writing ad-hoc SQL queries. Finally, monitoring (in the form of event tracking) is done by Snowplow, which can easily integrate with Redshift, and as usual, Airflow is used to orchestrate the work through the pipeline.
Building such pipeline massively simplified data access and manipulation across departments. For instance, analysts can simply build their own datasets as part of an Airflow task, and expose it to Looker to use in dashboards and further analyses.
Cloudflare is a web performance and security company that provides online services to protect and accelerate websites online. Online content distribution, web optimization, web security, and analytics are a few examples of the company’s business range.
While different services may require different data stacks to work on, they are all built on top of Cloudflare’s core infrastructure. In the core of their data stack there are Kafka clusters as a streaming platform, and CitusDB as a data warehouse – a scaled up version of PostgreSQL. Data is ingested through Cloudflare’s edge services using HTTP requests, then passed on to Kafka clusters, before getting stored in CitusDB warehouse. A nice example of a service working on top of this infrastructure is the DNS Analysis – a service which processes around 1 million DNS queries per second! The DNS edge service pre-processes and aggregates data, before sending it encrypted to one of Cloudflare’s data centers. Within the data center, data is de-multiplexed and pushed into several Apache Kafka clusters, which in turn pushes data to consumers grouped by Kafka topic. Consumers can store processed information into corresponding DBs which are later queried by the company’s API services and information delivered to customers.
Cloudflare gives their services to millions of websites around the world, processing and storing hundreds of terabytes of data daily. Interestingly, Cloudflare is not a fan of commercial cloud technologies, but they implement their own data centers across the world, in total 152 as of this moment.
Teads is a video advertising marketplace, often ranked as the number 1 video platform in the world. Working with data-heavy videos must be supported by a powerful data infrastructure, but that’s not the end of the story. Teads’ business needs to log user interactions with their videos through the browser (like play, pause, resume, complete…), which count up to 10 million events per day. Another source of data is video auctions (real-time bidding processes) which generate another 60 million events per day. To build their complex data infrastructure, Teads has turned to both Google and Amazon for help.
Originally the data stack at Teads was based on a lambda architecture, using Storm, Spark and Cassandra. This architecture couldn’t scale well, so the company turned toward Google’s BigQuery in 2016. They already had their Kafka clusters on AWS, which was also running some of their ad delivery components, so the company chose a multi-cloud infrastructure. Transferring data between different cloud providers can get expensive and slow. To address the second part of this issue, Teads placed their AWS and GCP clouds as close as possible and connected them with managed VPNs.
So how does their complex multi-cloud data stack look like? Well, first of all, data coming from users’ browsers and data coming from ad auctions is enqueued in Kafka topics in AWS. Then using an inter-cloud link, data is passed over to GCP’s Dataflow which is then well paired with BigQuery in the next step. Having all data in a single warehouse means half of the work is done. The next step would be to deliver data to consumers, and Analytics is one of them. The Analytics service at Teads is a Scala-based app that queries data from the warehouse and stores it to tailored data marts. Interestingly, the data marts are actually AWS Redshift servers. In the final step, data is presented into intra-company dashboards, and the user’s web apps.
You'll get additional resources like:
Remind’s data engineering team strives to provide the whole company with access to the data they need, as big as 10 million daily events, and empower them to directly make decisions. They initially started with Redshift as its source of truth resource for data, and AWS S3 to optimize for cost. While S3 is used for long-term storage of historical data in JSON format, Redshift only stores the most valuable data, not older than 3 months. The company uses Interana to run custom queries on their JSON files on S3, but they’ve also recently started using AWS Athena, as a fully managed Presto system – to query both S3 and Redshift databases. The move for Athena also triggered a change in the data format – from JSON to Parquet, which they say was the hardest step in building up their data platform. An EMR/Hive system is responsible for doing the needed data transformations between S3 and Athena. In the data ingestion part of the story, Remind gathers data through their APIs from both mobile devices and personal computers, as the company business targets schools, parents and students. This data is then passed to a streaming Kinesis Firehose system, before streaming it out to S3 and Redshift.
Remind’s future plans are probably focused on facilitating data format conversions using AWS Glue. This step would allow them to replace EMR/Hive from their architecture and use Spark SQL instead of Athena for diverse ETL tasks.
Robinhood is a stock brokerage application that democratizes access to the financial markets, which enables its customers to buy and sell U.S. listed stocks and ETFs with zero commission. The company debuted with a waiting list of nearly 1 million people, which means they had to pay attention to scale from the very beginning.
Robinhood’s data stack is hosted on AWS, and the core technology they use is ELK (Elasticsearch, Logstash, and Kibana) – a tool for powering search and analytics. Logstash is responsible for collecting, parsing and transforming logs, before passing them on to Elasticsearch, while data is visualized through Kibana. They grew up from a single ELK cluster with a few GBs of data to three clusters with over 15 TBs. Before data goes to ELK clusters, it is buffered in Kafka, as the rates of which documents enter vary significantly between different data sources. Kafka also shields the system from failures and communicates its state with data producers and consumers. As with many other companies, Robinhood uses Airflow to schedule various jobs across the stack, beating competition such as Pinball, Azkaban and Luigi. Robinhood data science team uses Amazon Redshift to help identify possible instances of fraud and money laundering.
Dollar Shave Club (DSC) is a lifestyle brand and e-commerce company that’s revolutionizing the bathroom by inventing smart, affordable products. Don’t be fooled by their name, they have a pretty cool data architecture, for a company in the shaving business. Their business model works with online sales through a subscription service. Currently, they serve around 3 million subscribed customers.
DSC’s web applications, internal services, and data infrastructure are 100% hosted on AWS. A Redshift cluster serves as the central data warehouse, receiving data from various systems. Data movement is facilitated with Apache Kafka and can move in different directions – from production DBs into the warehouse, in between different apps and in between internal pipeline components. There’s also Snowplow which collects data from the web and mobile clients. Once data reaches Redshift, it is accessed through various analytics platforms for monitoring, visualization, and insights. The main tool for the job is, of course, Apache Spark, which is mainly used to build predictive models, such as recommender systems for future sales.
Coursera is an education company that partners with the top universities and organizations in the world to offer online courses. They started building their data architecture somewhere around 2013, as both numbers of users and available courses increased. As of late 2017, Coursera provides courses to 27 million worldwide users.
Coursera collects data from their users through API calls coming from mobile and web apps, their production DBs, and logs gathered from monitoring. A backend service called “eventing” periodically uploads all received events to S3 and continuously publishes events to Kafka. The engineering team has selected Redshift a central warehouse, offering much lower operational cost when compared with Spark or Hadoop at the time.
On the analytics end, the engineering team created an internal web-based query page where people across the company can write SQL queries to the warehouse and get the needed information. Of course, there are analytics dashboard across the company which are refreshed on a daily basis. Finally, many decisions made in Coursera are based on machine learning algorithms, such as A/B testing, course recommendations, understanding student dropouts and others.
Wish is a mobile commerce platform. It provides online services that include media sharing and communication tools, personalized and other content, as well as e-commerce. During the last few years, it grew up to 500 million users, making their data architecture out of date.
The data architecture at Wish, before scaling up, had two different production databases: a MongoDB one storing user data, and a Hive/Presto cluster for logging data. Data engineers had to manually query both to respond to ad-hoc data requests, and this took weeks at some points. Another small pipeline orchestrated by Python crons, also queried both DBs and generated Email reports.
After rethinking their data architecture, Wish decided to build a single warehouse using Redshift. Data from both production DBs flowed through the data pipeline into Redshift. BigQuery is also used for some types of data. It feeds data into secondary tables needed for analytics. Finally, analytics and dashboards are created with Looker.
You'll get additional resources like:
Blinkist transforms the big ideas from the world’s best nonfiction books into powerful little packs users can read or listen to in 15 minutes. At first, they started selling their services through a pretty basic website, and monitored statistics through Google Analytics. Unfortunately, visitor statistics gathered from Google Analytics didn’t match the ones engineers computed. This is one of the reasons why Blinkist decided to move to the AWS cloud.
They choose a central Redshift warehouse where data flow in from user apps, backend and web frontend (for visitors tracking). To get data to Redshift, data is streamed with Kinesis Firehose, also using Amazon Cloudfront, Lambda and Pinpoint. The engineering team at Blinkist is working on a newer pipeline where ingested data comes to Alchemist, before passing it to a central Kinesis system, and onwards to the warehouse.
We hope this post along with its 14 examples gives you the inspiration to build your own data pipelines in the cloud.
If you don’t have any data pipelines yet, it’s time to start building them. Begin with baby steps and focus on spinning up an Amazon Redshift cluster, ingest your first data set and run your first SQL queries.
After that, you can look at expanding by adding a dashboard for data visualization, and schedule a workflow, to build your first true data pipeline. And once data is flowing, it’s time to understand what’s happening in your data pipelines.
That’s why we built intermix.io. We give you a single dashboard to understand when & why data is slow, stuck, or unavailable.
With intermix.io you can:
Our customers have the confidence to handle all the raw data their companies need to be successful. What you get is a real-time analytics platform that collects metrics from your data infrastructure and transforms them into actionable insights about your data pipelines, apps, and users who touch your data.
Setting up intermix.io takes less than 10 minutes, and because you can leverage our intermix.io experts, you can say goodbye to paying for a team of experts with expensive and time-consuming consulting projects. We can help you plan your architecture, build your data lake and cloud warehouse, and verify that you’re doing the right things.
It’s easy – start now by scheduling a call with one our of experts.