
At intermix.io, we work with companies that build data pipelines. Some start cloud-native on platforms like Amazon Redshift, while others migrate from on-premise or hybrid solutions. What they all have in common is the one question they ask us at the very beginning:

“How do other companies build their data pipelines?”

That’s why we decided to compile and publish a list of publicly available blog posts about how companies build their data pipelines. In those posts, the companies describe in detail how they’re using data in their business and how they’ve become data-centric.

The 15 companies we’ve looked at are covered in the sections below.


If we missed your post, we’re happy to include it. Just fill out this form, which will take you less than a minute. And with that – please meet the 15 examples of data pipelines from the world’s most data-centric companies.

1. Simple

Becoming data-driven is the main goal for Simple. It’s important for the entire company to have access to data internally. Instead of having the analytics and engineering teams jump from one problem to another, a unified data architecture spanning all departments lets the company build a consistent way of doing analytics.

The main problem then is how to ingest data from multiple sources, process it, store it in a central data warehouse, and present it to staff across the company. Similar to many solutions nowadays, data is ingested from multiple sources into Kafka before being passed to compute and storage systems.

The warehouse of choice is Redshift, selected because of its SQL interfaces and the ease with which it processes petabytes of data. Reports, analytics, and visualizations are powered using Periscope Data. This way, data is easily shared across different teams, allowing them to make decisions based on data.

Sources: https://www.simple.com/engineering/building-analytics-at-simple

2. Clearbit

Clearbit was a rapidly growing, early-stage startup when it started thinking of expanding its data infrastructure and analytics. They tried out a few out-of-the-box analytics tools, each of which failed to satisfy the company’s demands. 

After that, Clearbit took building the infrastructure into their own hands. Their efforts converged on a trio of providers: Segment, Redshift, and Mode. Segment is responsible for ingesting all kinds of data, combining it, and syncing it daily into a Redshift instance. The main data storage is left to Redshift, with backups to AWS S3.

Finally, since Redshift supports SQL, Mode is perfectly suited for running queries (while using Redshift’s powerful data processing abilities) and creating data insights.

Source: https://blog.clearbit.com/enterprise-grade-analytics-for-startups-2/

Mode makes it easy to explore, visualize, and share that data across your organization.

But as data volume grows, that’s when data warehouse performance goes down. With ever-increasing calls to your data from analysts, your cloud warehouse becomes the bottleneck. 

That’s why we’ve built intermix.io to provide Mode users with all the tools they need to optimize their queries running on Amazon Redshift. Here’s one of our dashboards, which shows how you can track queries from Mode down to the individual user:

See your Mode queries in intermix.io

3. 500px

The whole data architecture at 500px is mainly based on two tools: Redshift for data storage, and Periscope for analytics, reporting, and visualization. From a customer-facing side, the company’s web and mobile apps run on top of a few API servers, backed by several databases – mostly MySQL. Data from these DBs passes through a Luigi ETL pipeline before moving to storage on S3 and Redshift.

Splunk does a great job of querying and summarizing text-based logs. Periscope Data is responsible for building data insights and sharing them across different teams in the company. All in all, this infrastructure supported around 60 people distributed across a couple of teams within the company prior to its acquisition by Visual China Group.

Fig: Some of the data-related technologies used in 500px.
Source: https://medium.com/@samson_hu/building-analytics-at-500px-92e9a7005c83

4. Netflix

The data infrastructure at Netflix is one of the most sophisticated in the world. The video streaming company serves over 550 billion events per day, equal to roughly 1.3 petabytes of data. In general, Netflix’s architecture is broken down into smaller systems, such as systems for data ingestion, analytics, and predictive modeling.

The data stack at the core of Netflix is mainly based on Apache Kafka for real-time (sub-minute) processing of events and data. Data needed in the long term is sent from Kafka to AWS’s S3 and EMR for persistent storage, but also to Redshift, Hive, Snowflake, RDS, and other services used by different sub-systems. Metacat is built to make sure the data platform can interoperate across these data sets as one “single” data warehouse. Its task is to connect different data sources (RDS, Redshift, Hive, Snowflake, Druid) with different compute engines (Spark, Hive, Presto, Pig).

Other Kafka outputs lead to a secondary Kafka sub-system, predictive modeling with Apache Spark, and Elasticsearch. Operational metrics don’t flow through the data pipeline but through a separate telemetry system named Atlas.

Sources:

5. Yelp

The tech world has seen dramatic changes since Yelp was launched back in 2004. By 2012, Yelp found itself playing catch-up. It transformed from running a huge monolithic application on-premises to one built on microservices running in the AWS cloud. By the end of 2014, there were more than 150 production services running, with over 100 of them owning data. The main part of its cloud stack is better known as PaaSTA, based on Mesos and Docker, offloading data to a Redshift data warehouse, Salesforce CRM, and Marketo marketing automation. Data enters the pipeline through Kafka, which in turn receives it from multiple different “producer” sources.

Sources:

6. Gusto

Gusto, founded in 2011, is a company that provides a cloud-based payroll, benefits, and workers’ compensation solution for businesses. Their business has grown steadily over the years, currently topping out at around 60,000 customers. By early 2015, there was a growing demand within the company for access to data. Up until then, the engineering team and product managers had been running their own ad-hoc SQL scripts on production databases. There was obviously a need to build a data-informed culture, both internally and for their customers. When it came to the crossroads of building either a data science or a data engineering team, Gusto seems to have made the right choice: first, build a data infrastructure that can support analysts in generating insights and building prediction models.

The first step for Gusto was to replicate and pipe all of their major data sources into a single warehouse. The warehouse choice landed on an AWS Redshift cluster, with S3 as the underlying data lake. Moving data from production app databases into Redshift was then facilitated with Amazon’s Database Migration Service. On the other side of the pipeline, Looker is used as a BI front-end that teams throughout the company can use to explore data and build core dashboards. Aleph is a shared web-based tool for writing ad-hoc SQL queries. Finally, monitoring (in the form of event tracking) is done by Snowplow, which can easily integrate with Redshift. And, as usual, Airflow orchestrates the work through the pipeline.

Building this pipeline helped to simplify data access and manipulation across departments. For instance, analysts can simply build their own datasets as part of an Airflow task and expose them to Looker for use in dashboards and further analyses.
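To make that pattern concrete, here is a minimal, hypothetical Airflow DAG in the spirit of Gusto’s setup (not their actual code): an analyst-owned task rebuilds a dataset in Redshift that Looker dashboards can then query. The connection ID, schema, and table names are placeholders, and it assumes the Airflow Postgres provider is installed, since Redshift speaks the Postgres wire protocol.

from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

# Hypothetical DAG: rebuild an analyst-owned dataset nightly so that
# Looker dashboards can query the resulting table in Redshift.
with DAG(
    dag_id="analyst_datasets",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    build_daily_signups = PostgresOperator(
        task_id="build_daily_signups",
        postgres_conn_id="redshift",  # assumed Airflow connection to the cluster
        sql="""
            DROP TABLE IF EXISTS analytics.daily_signups;
            CREATE TABLE analytics.daily_signups AS
            SELECT signup_date, COUNT(*) AS signups
            FROM app.users
            GROUP BY signup_date;
        """,
    )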

Source: https://engineering.gusto.com/building-a-data-informed-culture/

7. Teads

Teads is a video advertising marketplace, often ranked as the number one video platform in the world. Working with data-heavy video requires a powerful data infrastructure, but that’s not the end of the story. Teads’ business needs to log user interactions with their videos through the browser (functions like play, pause, resume, and complete), which add up to 10 million events per day. Another source of data is video auctions with a real-time bidding process. These generate another 60 million events per day. To build their complex data infrastructure, Teads has turned to both Google and Amazon for help.

Originally the data stack at Teads was based on a lambda architecture, using Storm, Spark and Cassandra. This architecture couldn’t scale well, so the company turned toward Google’s BigQuery in 2016. They already had their Kafka clusters on AWS, which was also running some of their ad delivery components, so the company chose a multi-cloud infrastructure. Transferring data between different cloud providers can get expensive and slow. To address the second part of this issue, Teads placed their AWS and GCP clouds as close as possible and connected them with managed VPNs.

So how does their complex multi-cloud data stack look? Well, first of all, data coming from users’ browsers and data coming from ad auctions is enqueued in Kafka topics in AWS. Then, using an inter-cloud link, data is passed over to GCP’s Dataflow, which pairs well with BigQuery in the next step. Having all data in a single warehouse means half of the work is done. The next step is to deliver data to consumers, and Analytics is one of them. The Analytics service at Teads is a Scala-based app that queries data from the warehouse and stores it in tailored data marts. Interestingly, the data marts are actually AWS Redshift servers. In the final step, data is presented in intra-company dashboards and in users’ web apps.

Fig: An abstract view of Teads’ data flow.  
Fig: Teads’ analytics part of the data stack.

Sources:
https://medium.com/teads-engineering/give-meaning-to-100-billion-analytics-events-a-day-d6ba09aa8f44
https://medium.com/teads-engineering/give-meaning-to-100-billion-events-a-day-part-ii-how-we-use-and-abuse-redshift-to-serve-our-data-bc23d2ed3e0

8. Remind

Remind’s data engineering team provides the whole company with access to the data they need (as much as 10 million events per day) and empowers them to make decisions directly. They initially started with Redshift as their source of truth for data, and AWS S3 to optimize for cost.

While S3 is used for long-term storage of historical data in JSON format, Redshift only stores the most valuable data, not older than three months. The company uses Interana to run custom queries on their JSON files on S3, but they’ve also recently started using AWS Athena as a fully managed Presto system to query both S3 and Redshift databases. 

The move to Athena also triggered a change in the data format from JSON to Parquet, which they say was the hardest step in building up their data platform. An EMR/Hive system is responsible for doing the needed data transformations between S3 and Athena. In the data ingestion part of the story, Remind gathers data through their APIs from both mobile devices and personal computers, as the company’s business targets schools, parents, and students. This data is then passed to a Kinesis Firehose stream before flowing out to S3 and Redshift.

Remind’s future plans are probably focused on facilitating data format conversions using AWS Glue. This step would allow them to remove EMR/Hive from their architecture and use Spark SQL instead of Athena for diverse ETL tasks.

Sources:


9. Robinhood

Robinhood is a stock brokerage application that democratizes access to the financial markets, enabling customers to buy and sell stocks and ETFs with zero commission. The company debuted with a waiting list of nearly 1 million people, which means they had to pay attention to scale from the very beginning.

Robinhood’s data stack is hosted on AWS, and the core technology they use is ELK (Elasticsearch, Logstash, and Kibana), a tool for powering search and analytics. Logstash is responsible for collecting, parsing, and transforming logs before passing them on to Elasticsearch, while data is visualized through Kibana. 

They grew from a single ELK cluster with a few GBs of data to three clusters with over 15 TBs. Before data goes to ELK clusters, it is buffered in Kafka, as the various data sources generate documents at differing rates. 

Kafka also shields the system from failures and communicates its state with data producers and consumers. As with many other companies, Robinhood uses Airflow to schedule various jobs across the stack, beating out alternatives such as Pinball, Azkaban, and Luigi. Robinhood’s data science team uses Amazon Redshift to help identify possible instances of fraud and money laundering.

Sources:

10. Dollar Shave Club

Dollar Shave Club (DSC) is a lifestyle brand and e-commerce company that’s revolutionizing the bathroom by inventing smart, affordable products. Don’t be fooled by their name. They have a pretty cool data architecture for a company in the shaving business. Their business model works with online sales through a subscription service. Currently, they serve around 3 million subscribed customers.

DSC’s web applications, internal services, and data infrastructure are 100% hosted on AWS. A Redshift cluster serves as the central data warehouse, receiving data from various systems. Data movement is facilitated with Apache Kafka and can move in different directions – from production DBs into the warehouse, between different apps, and between internal pipeline components. 

There’s also Snowplow, which collects data from the web and mobile clients. Once data reaches Redshift, it is accessed through various analytics platforms for monitoring, visualization, and insights. The main tool for the job is, of course, Apache Spark, which is mainly used to build predictive models, such as recommender systems for future sales.

Sources:

11. Coursera

Coursera is an education company that partners with the top universities and organizations in the world to offer online courses. They started building their data architecture somewhere around 2013, as both the number of users and the number of available courses increased. As of late 2017, Coursera provided courses to 27 million users worldwide.

Coursera collects data from its users through API calls coming from mobile and web apps, their production DBs, and logs gathered from monitoring. A backend service called “eventing” periodically uploads all received events to S3 and continuously publishes events to Kafka. The engineering team selected Redshift as its central warehouse because it offered much lower operational cost than Spark or Hadoop at the time.

On the analytics end, the engineering team created an internal web-based query page where people across the company can write SQL queries to the warehouse and get the information they need. Of course, there are company-wide analytics dashboards that are refreshed on a daily basis. Finally, many decisions made in Coursera are based on machine learning algorithms, such as A/B testing, course recommendations, and understanding student dropouts.

Sources:


12. Wish

Wish is a mobile commerce platform. It provides online services that include media sharing and communication tools, personalized and other content, as well as e-commerce. Over the last few years, it grew to 500 million users, outgrowing its original data architecture.

Before they scaled up, Wish’s data architecture had two different production databases: a MongoDB NoSQL database storing user data; and a Hive/Presto cluster for logging data. Data engineers had to manually query both to respond to ad-hoc data requests, and this took weeks at some points. Another small pipeline, orchestrated by Python Cron jobs, also queried both DBs and generated email reports.

After rethinking their data architecture, Wish decided to build a single warehouse using Redshift. Data from both production DBs flowed through the data pipeline into Redshift. BigQuery is also used for some types of data. It feeds data into secondary tables needed for analytics. Finally, analytics and dashboards are created with Looker.

Sources:

13. Blinkist

Blinkist transforms the big ideas from the world’s best nonfiction books into powerful little packs users can read or listen to in 15 minutes. At first, they started selling their services through a pretty basic website, and they monitored statistics through Google Analytics. Unfortunately, visitor statistics gathered from Google Analytics didn’t match the figures the engineers were computing. This is one of the reasons why Blinkist decided to move to the AWS cloud.

They chose a central Redshift warehouse into which data flows from user apps, the backend, and the web front-end (for visitor tracking). To get data to Redshift, they stream it with Kinesis Firehose, also using Amazon CloudFront, Lambda, and Pinpoint. The engineering team at Blinkist is working on a newer pipeline where ingested data first comes to Alchemist, before being passed to a central Kinesis system and onward to the warehouse.

Source: https://www.slideshare.net/SebastianSchleicher/tracking-and-business-intelligence

14. Halodoc

Healthcare platform Halodoc found themselves with a common startup problem: scalability. Their existing data pipeline worked on a batch processing model, with regularly scheduled extractions for each source. They performed extractions with various standard tools, including Pentaho, AWS Database Migration Service, and AWS Glue. 

They would load each export to S3 as a CSV or JSON, and then replicate it on Redshift. At this point, they used a regular Pentaho job to transform and integrate data, which they would then load back into Redshift. 

As Halodoc’s business grew, they found that they were handling massive volumes of sensitive patient data that had to reach healthcare providers securely and quickly. The Pentaho transformation job, installed on a single EC2 instance, was a worrying single point of failure. 

Halodoc looked at a number of solutions and eventually settled on Apache Airflow as a single tool for every stage of their data migration process. They chose Airflow because it’s highly responsive and customizable, with excellent error control. It also supports machine learning use cases, which Halodoc requires for future phases. 

The new data pipeline is much more streamlined. Halodoc uses Airflow to deliver both ELT and ETL. In their ETL model, Airflow extracts data from sources. It then passes through a transformation layer that converts everything into pandas data frames. The data frames are loaded to S3 and then copied to Redshift. Airflow can then move data back to S3 as required. 

For ELT, the Airflow job loads data directly to S3. Halodoc then uses Redshift’s processing power to perform transformations as required.
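Halodoc’s post doesn’t include code, but the ETL pattern described above (transform into pandas data frames, stage on S3, then COPY into Redshift) can be sketched in a few lines of Python. The bucket, IAM role, table, and column names below are purely hypothetical.

import io

import boto3
import pandas as pd
import psycopg2

# Hypothetical names for illustration only.
BUCKET = "example-data-lake"
KEY = "staging/appointments.csv"
COPY_ROLE_ARN = "arn:aws:iam::123456789012:role/example-redshift-copy-role"

def transform(raw_rows):
    """Turn extracted rows into a cleaned pandas data frame."""
    df = pd.DataFrame(raw_rows)
    df["appointment_date"] = pd.to_datetime(df["appointment_date"]).dt.date
    return df

def load_to_redshift(df, conn):
    """Stage the data frame on S3 as CSV, then COPY it into Redshift."""
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)
    boto3.client("s3").put_object(Bucket=BUCKET, Key=KEY, Body=buf.getvalue())

    with conn.cursor() as cur:
        cur.execute(f"""
            COPY analytics.appointments
            FROM 's3://{BUCKET}/{KEY}'
            IAM_ROLE '{COPY_ROLE_ARN}'
            FORMAT AS CSV;
        """)
    conn.commit()

# Usage (connection details are placeholders):
# conn = psycopg2.connect(host="example.redshift.amazonaws.com", port=5439,
#                         dbname="analytics", user="etl", password="...")
# load_to_redshift(transform(extracted_rows), conn)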

Source:

15. iHeartRadio

iHeartRadio is a global streaming platform for music and podcasts. It runs on a sophisticated data structure, with over 130 data flows, all managed by Apache Airflow. These data pipelines were all running on a traditional ETL model: extracted from the source, transformed by Hive or Spark, and then loaded to multiple destinations, including Redshift and RDBMSs.

On reviewing this approach, the engineering team decided that ETL wasn’t the right approach for all data pipelines. Where possible, they moved some data flows to an ELT model. Data flows directly from source to destination – in this instance, Redshift – and the team applies any necessary transformations afterward. Redshift Spectrum is an invaluable tool here, as it allows you to use Redshift to query data directly on S3 via an external metastore, such as Hive.

However, this model still didn’t suit all use cases. The iHeartRadio team began experimenting with an ETLT (Extract, Transform, Load, Transform) model, which combines aspects of ETL and ELT. In this approach, the team extracts data as normal, then uses Hive for munging and processing. They then load the data to the destination, where Redshift can aggregate the new data.

Now, the team uses a dynamic structure for each data pipeline, so data flows might pass through ETL, ELT, or ETLT, depending on requirements. This new approach has improved performance by up to 300% in some cases, while also simplifying and streamlining the entire data structure.

Source: https://tech.iheart.com/how-we-leveraged-redshift-spectrum-for-elt-in-our-land-of-etl-cf01edb485c0

How Will You Build Your Data Pipeline?

We hope the 15 examples in this post offer you the inspiration to build your own data pipelines in the cloud.

If you don’t have any data pipelines yet, it’s time to start building them. Begin with baby steps: focus on spinning up an Amazon Redshift cluster, ingesting your first data set, and running your first SQL queries.

After that, you can look at expanding by acquiring an ETL tool, adding a dashboard for data visualization, and scheduling a workflow, resulting in your first true data pipeline. And once data is flowing, it’s time to understand what’s happening in your data pipelines.

That’s why we built intermix.io. We give you a single dashboard to understand when & why data is slow, stuck, or unavailable.

With intermix.io you can:

Our customers have the confidence to handle all the raw data their companies need to be successful. What you get is a real-time analytics platform that collects metrics from your data infrastructure and transforms them into actionable insights about your data pipelines, apps, and users who touch your data.

Setting up intermix.io takes less than 10 minutes, and because you can leverage our intermix.io experts, you can say goodbye to paying for a team of experts with expensive and time-consuming consulting projects. We can help you plan your architecture, build your data lake and cloud warehouse, and verify that you’re doing the right things.

It’s easy – start now by scheduling a call with one of our experts or join our Redshift community on Slack.

At intermix.io, we run a fleet of over ten Amazon Redshift clusters. In this post, I’ll describe how we use AWS IAM authentication for our database users. 

AWS access is managed by configuring policies and attaching them to IAM identities (users, groups of users, or roles) or AWS resources. A policy is essentially an object in AWS that defines permissions when associated with an identity or resource. Policies are evaluated when an IAM principal (user or role) makes a request, and the permissions within them determine whether the request is allowed or denied. Policies reside in AWS as JSON documents. There is support for six types of policies: identity-based policies, resource-based policies, permissions boundaries, organization SCPs, ACLs, and session policies.

By using IAM credentials, we can enforce security policies on users and passwords. It’s a scalable and secure way to manage access to your cluster(s).

The approach we describe is useful in environments where security is a top concern. Examples are industries with regulatory requirements such as Finance or Healthcare. Other use cases include enterprises with strict IT security compliance requirements.

Secure Ways to Manage Access to Amazon Redshift

Amazon Redshift started as a data warehouse in the cloud. A frequent initial use case was business intelligence. Over the years, though, the use cases for Amazon Redshift have evolved. Redshift is still used for reporting. But it’s now also part of mission-critical data services and applications. intermix.io is an example – we built our application on top of Redshift. And so it’s critical to track and manage user access to a cluster.

The standard way for users to log onto Amazon Redshift is by providing a database username and password, since Amazon Redshift is based on PostgreSQL. But using a username/password combination has a few drawbacks. The biggest one is that there is no way to enforce security best practices for password changes. This may not be an issue in a small company where only two or three people can access your cluster, but that’s different for enterprises with a large pool of users. It’s also different when you’re a small company whose headcount is growing fast. It’s easy to lose track of everybody who has access to your most valuable data sets.

Luckily, AWS has recently developed alternatives to using a username and password. There are three options for maintaining login credentials for your Amazon Redshift database:

  1. Permit users to create user credentials and log in with their IAM credentials.
  2. Permit users to log in with federated single sign-on (SSO) through a SAML 2.0-compliant identity provider.
  3. Generate temporary database credentials. Permissions are granted through an AWS Identity and Access Management (IAM) permissions policy. By default, these credentials expire after 15 minutes, but you can configure them to expire up to an hour after creation.

This post will discuss option #3: using IAM credentials to generate expiring database credentials. Amazon Redshift provides the GetClusterCredentials API operation and the get-cluster-credentials command for the AWS Command Line Interface (AWS CLI). Both offer the ability to generate temporary database user credentials programmatically. It is also possible to configure your SQL client with the Amazon Redshift JDBC or ODBC drivers, which manage the process of calling the GetClusterCredentials operation; this retrieves database user credentials and establishes a connection between the SQL client and the Amazon Redshift database. Check out JDBC and ODBC Options for Creating Database User Credentials for more information.

How intermix.io Uses IAM to Generate Temporary Passwords

The Old Way: Manual

Handling credentials used to be a manual process, and that’s a pain in the neck. We’re already using the AWS Secrets Manager. It could have been a pretty trivial exercise to add auto-rotation to our secrets, and trigger ALTER <user> queries to update them with new credentials. That would have been my initial approach. But one of my new colleagues pointed out the option of using IAM. Redshift allows you to get time-scoped IAM credentials associated with a role within Redshift itself.

The New Way: Using IAM to Generate Expiring Passwords

That turned into a somewhat larger undertaking. First, I had to understand how we were using Redshift across our platform. With a reasonable idea of how the change would work, I changed our Redshift connection logic to pull credentials before connecting. That turned out to be a pretty easy change. We’re a Python shop and Boto3 – the AWS SDK for Python – is exhaustive. Boto enables Python developers to create, configure, and manage AWS services.

We deployed the change into one of our testing environments. Everything went well until we hit the rate limit for the Redshift API. We were making too many requests to GetClusterCredentials. We have a lot of in-flight transformations that all connect to Redshift, and making those calls alongside each connect exhausted the rate limit. But that wasn’t insurmountable. It was pretty easy to add a caching mechanism to our connection logic so that we didn’t need to generate a new credential every time.
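Here’s a minimal sketch of that caching idea with Boto3 (illustrative only, not our production code). It reuses a recently generated credential instead of calling GetClusterCredentials on every connection.

import time

import boto3

_cache = {}  # (cluster_id, db_user, db_name) -> (credentials, fetched_at)
CACHE_TTL_SECONDS = 600  # refresh well before the 15-minute default expiry

def get_cached_credentials(cluster_id, db_user, db_name):
    """Return temporary Redshift credentials, reusing a recent result
    so we don't hammer the GetClusterCredentials rate limit."""
    key = (cluster_id, db_user, db_name)
    cached = _cache.get(key)
    if cached and time.time() - cached[1] < CACHE_TTL_SECONDS:
        return cached[0]

    redshift = boto3.client("redshift")
    creds = redshift.get_cluster_credentials(
        DbUser=db_user,
        DbName=db_name,
        ClusterIdentifier=cluster_id,
        DurationSeconds=900,
    )
    _cache[key] = (creds, time.time())
    return creds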

Once we got the caching mechanism deployed, we were able to disable logins. Access to the cluster was now only available through IAM credentials.

That left us with an awkward situation, though. Our developers have to connect to our clusters to run queries or test new features. They needed a way to generate IAM credentials and connect to a remote cluster.

We already understood how to generate IAM credentials and had code that handled that. We then solved our need by creating a task in our execution and deployment framework. The task connects to a Redshift cluster in any of our environments using IAM credentials. You can see it in use below!

Generate IAM Credentials

The code we used to do this isn’t that important because it’s made possible by the Redshift API and some IAM policies. So you can do this yourself, even without seeing what I’ve done. You need an AWS client for your preferred programming language and an IAM policy profile granting your users access to get IAM credentials on your clusters.


Getting Started

The first thing you’re going to want to do is create an IAM policy with the Redshift Actions needed to create an IAM credential on a Redshift cluster. 

The following is a policy that allows the IAM role to call the GetClusterCredentials operation, which automatically creates a new user and specifies groups the user joins at login. The “Resource”: “*” wildcard grants the role access to any resource, including clusters, database users, or user groups.
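A minimal version of such a policy, along the lines of the example in the AWS documentation, looks roughly like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "redshift:GetClusterCredentials",
        "redshift:CreateClusterUser",
        "redshift:JoinGroup"
      ],
      "Resource": "*"
    }
  ]
}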

Of course, this example is not secure and strictly for demonstration. Please see Resource policies for GetClusterCredentials for more information and examples to achieve more granular access control.

Once you have that policy created, go ahead and create a group to contain the users to whom you want to grant access; if you’ve already got that group created, great. Add the users you’d like to the group and attach the policy you defined in the previous step under the Permissions tab. At this point, all users in the group will be able to generate IAM credentials for existing users on your clusters. 

The following shows how to use Amazon Redshift CLI to generate temporary database credentials for an existing user named adam.

As a side note, if the user doesn’t exist in the database and AutoCreate is true, a new user is created with PASSWORD disabled. If the user doesn’t exist and AutoCreate is false, the request fails.

aws redshift get-cluster-credentials --cluster-identifier clusterA --db-user adam --db-name dbA --duration-seconds 7200

The result is the following: 

{
  "DbUser": "IAM:adam",
  "Expiration": "2020-10-08T21:10:53Z",
  "DbPassword": "EXAMPLEjArE3hcnQj8zt4XQj9Xtma8oxYEM8OyxpDHwXVPyJYBDm/gqX2Eeaq6P3DgTzgPg=="
}

Check out the official documentation for a more in-depth exploration of the CLI method.

However, using the AWS CLI is manual and can be error-prone. My recommendation is to create a utility to generate the credentials and connect to the cluster on behalf of the user. This utility could also generate ODBC or JDBC connection strings if that’s how your users connect to a cluster.

Here’s a quick Python-based example that outputs a JDBC URL. The example assumes that your AWS identity is allowed to call DescribeClusters and GetClusterCredentials on the cluster, and that the target database user already exists.
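This is a sketch rather than a hardened utility; the cluster, database, and user names are placeholders.

from urllib.parse import quote

import boto3

# Hypothetical values for illustration -- replace with your own.
CLUSTER_ID = "clusterA"
DB_NAME = "dbA"
DB_USER = "adam"
REGION = "us-east-1"

def build_jdbc_url(cluster_id=CLUSTER_ID, db_name=DB_NAME, db_user=DB_USER):
    """Generate temporary credentials via IAM and return a JDBC URL."""
    redshift = boto3.client("redshift", region_name=REGION)

    # Look up the cluster endpoint (host and port).
    cluster = redshift.describe_clusters(ClusterIdentifier=cluster_id)["Clusters"][0]
    endpoint = cluster["Endpoint"]

    # Ask Redshift for short-lived credentials tied to our IAM identity.
    creds = redshift.get_cluster_credentials(
        DbUser=db_user,
        DbName=db_name,
        ClusterIdentifier=cluster_id,
        DurationSeconds=3600,
        AutoCreate=False,
    )

    # Embed the temporary user and password in a standard JDBC connection string.
    return "jdbc:redshift://{host}:{port}/{db}?UID={user}&PWD={password}".format(
        host=endpoint["Address"],
        port=endpoint["Port"],
        db=db_name,
        user=quote(creds["DbUser"], safe=""),
        password=quote(creds["DbPassword"], safe=""),
    )

if __name__ == "__main__":
    print(build_jdbc_url())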

You’re now in a situation where you have an IAM policy and a group containing your users. You’ve also configured your Redshift users with passwords DISABLED.

In short, you’ve secured your Redshift cluster. Your security team can enforce credential rotation on users using standard IAM behavior vs. direct database implementation.

For more on best practices when working with Amazon Redshift, read our post on 3 Things to Avoid When Setting Up an Amazon Redshift Cluster. Or download our best-practices guide for more tips on enterprise-grade deployments for Amazon Redshift.

Here at intermix, we’re always working on new features that make your life as a data engineer easier when working with Amazon Redshift. For example, “disk full” errors in Amazon Redshift are an all-too-common performance issue—even when you thought you had plenty of space to spare in your Redshift deployment. That’s why we’ve made it easier than ever to use intermix to find the cause of Amazon Redshift disk usage spikes.

What Causes Disk Usage Spikes in Amazon Redshift?

The problem of Redshift disk usage spikes is so common that Amazon has released its own guide, “How can I troubleshoot high or full disk usage with Amazon Redshift?” This page features 9 different possible causes of Redshift disk usage spikes and “disk full” errors, including query processing issues, tables with VARCHAR(MAX) columns, copying large files, and more. But what about the particular situation when you encounter a disk usage spike when running a Redshift query?

If you encounter a Redshift “disk full” error when running a query, one thing’s for sure: you’ve run out of space in one or more of the nodes in your Redshift cluster. Disk usage spikes when running queries stem from two issues: either the query is using so much memory that it’s overflowing to disk, or your data is too large for the hard disks on the cluster.

But what if you don’t have the time or knowledge to troubleshoot the possible causes yourself? That’s exactly where intermix comes in.

How to Find the Cause

In your intermix dashboard, go to the Storage Analysis tab. (Don’t have an account yet? Sign up here for a free 2-week trial.) Next, click anywhere on the Schema chart to get the list of tables for that exact time. The tables are sorted by growth, so the ones which are causing the disk usage spike will show up at the top of the list.

And that’s it! Once you’ve isolated the problematic table(s), you can take steps to curb the issue, such as removing extraneous data or adding another node to the cluster.

Disk Usage Spikes in intermix.io

What’s Next for Solving Disk Usage Performance Issues?

Since we’ve introduced this feature to intermix, we’ve made a few changes in response to best practices and user feedback:

As always, we’d love to hear your feedback, so please get in touch over chat or email. If you aren’t yet an intermix customer, sign up for your free trial and start optimizing your Redshift deployment today. And for the latest in data engineering news, check out our weekly newsletter “SF Data Weekly” with over 6,000 subscribers.

Amazon Redshift offers an attractive feature that can help organizations manage their hosting bill. It’s called concurrency scaling, and according to Amazon, it “automatically and elastically scales query processing power to provide consistently fast performance for hundreds of concurrent queries.”

Introduction to Amazon Redshift Concurrency Scaling

Before concurrency scaling, Redshift users faced a familiar dilemma – dealing with peak demand. There were two options:

Concurrency scaling adds resources to your Redshift cluster on an on-demand basis, adding processing power during peak time and withdrawing it in quieter moments. 

In terms of pricing, concurrency scaling works on a credit system that should make it free for most users. Amazon allows you to earn one free hour of scaling for every 24 hours of main Redshift cluster usage, and these credits accrue over time. Any usage outside of your credits gets billed on a per-second basis according to your Redshift agreement.

Concurrency scaling makes financial sense, but can it offer consistent service? Let’s find out.

Cluster Requirements

There are three eligibility requirements for concurrency scaling. Your Redshift cluster must be:

This means that single-node clusters are not eligible. Also, note that the cluster must have had fewer than 32 nodes at creation. If your cluster originally had 50 nodes and you scale down to 32, you’re still not eligible for concurrency scaling.

Eligible Query Types

Concurrency scaling does not work on all query types. For the first release, it handles read-only queries that meet three conditions:

To be routed to a concurrency scaling cluster, a query needs to encounter queueing. Also, queries eligible for the SQA (Short Query Acceleration) queue will not run on the concurrency scaling clusters.

Queuing and SQA are a function of a proper set-up of Redshift’s workload management (WLM). We recommend first optimizing your WLM because it will reduce the need for concurrency scaling. And that matters because, while AWS claims that concurrency scaling will be free for 97% of customers, you could face an additional usage charge if you exceed your credits.

We’ve also tested enabling Redshift’s automatic WLM and captured our experience with it in this blog post, “Should I Enable Amazon Redshift’s Automatic WLM?”


Enabling Concurrency Scaling

Concurrency scaling is enabled on a per-WLM queue basis. Go to the AWS Redshift Console and click on “Workload Management” from the left-side navigation menu. Select your cluster’s WLM parameter group from the subsequent pull-down menu.

You should see a new column called “Concurrency Scaling Mode” next to each queue. The default is ‘off’. Click ‘Edit’ and you’ll be able to modify the settings for each queue.

How to enable concurrency scaling in the AWS Redshift Console “Workload Management” section
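If you manage your WLM configuration as code rather than through the console, the same setting lives in the parameter group’s wlm_json_configuration parameter. A rough single-queue sketch (the queue settings here are illustrative) sets "concurrency_scaling" to "auto" for that queue:

[
  {
    "user_group": [],
    "query_group": [],
    "query_concurrency": 5,
    "concurrency_scaling": "auto"
  }
]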

How We Configured Redshift Concurrency Scaling

Concurrency scaling works by routing eligible queries to new, dedicated clusters. The new clusters have the same size (node type and number) as the main cluster.  

The number of clusters used for concurrency scaling defaults to one (1), with the option to configure up to ten (10) total clusters.

The total number of clusters that should be used for concurrency scaling can be set by the parameter max_concurrency_scaling_clusters.  Increasing the value of this parameter provisions additional standby clusters.

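For example, here is a sketch of setting the parameter to three with the AWS CLI (the parameter group name is a placeholder):

aws redshift modify-cluster-parameter-group \
  --parameter-group-name my-redshift-params \
  --parameters ParameterName=max_concurrency_scaling_clusters,ParameterValue=3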

Monitoring our Concurrency Scaling Test

There are a few additional charts in the AWS Redshift console. One, called “Max Configured Concurrency Scaling Clusters”, plots the value of max_concurrency_scaling_clusters over time.

The “Max Configured Concurrency Scaling Clusters” chart in the AWS Redshift Console

The number of Active Scaling clusters is also shown in the UI under Concurrency Scaling Activity:

The AWS Redshift Console displays the Active Scaling clusters in the Concurrency Scaling Activity report

The Queries tab in the UI also has a column to show if the query ran on the Main cluster or on the Concurrency Scaling cluster:

The AWS Redshift Console has a Queries tab to show which cluster the query ran on, either the Main Cluster or the Concurrency Scaling cluster

Whether a particular query ran on the main cluster or via a concurrency scaling cluster is stored in stl_query.concurrency_scaling_status.  

Wondering whether your query ran on the main cluster or Concurrency Scaling cluster? Here’s where to look in your AWS Redshift Console.

A value of 1 means the query ran on a Concurrency Scaling cluster, and other values mean it ran on the main cluster.

Example:
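One way to check is a query along these lines (an illustrative sketch):

SELECT CASE WHEN concurrency_scaling_status = 1
            THEN 'concurrency scaling cluster'
            ELSE 'main cluster'
       END AS ran_on,
       COUNT(*) AS query_count
FROM stl_query
WHERE starttime > GETDATE() - INTERVAL '1 day'
GROUP BY 1;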


Concurrency scaling info is also stored in some other tables and views, such as SVCS_CONCURRENCY_SCALING_USAGE. 

The following views have similar information as the corresponding STL views or SVL views:

These views work in the same way as their STL or SVL equivalents.

Results of our Concurrency Scaling Tests

We enabled concurrency scaling for a single queue on an internal cluster at approximately 2019-03-29 18:30:00 GMT. We changed the max_concurrency_scaling_clusters parameter to 3 at approximately 2019-03-29 20:30:00.

To simulate query queuing, we lowered the number of slots for the queue from 15 to 5.

Below is a chart from the intermix.io dashboard, showing the running versus queuing queries for this queue, after cranking down the number of slots.

The intermix.io “concurrency and queuing” dashboard shows the Redshift performance of time spent in running versus queuing queries

We observe that the queueing time for queries went up, maxing out at more than 5 minutes.

Our intermix test showed that when Redshift Concurrency Scaling was turned on, the queuing time for queries increased

Here’s the corresponding summary in the AWS console of what happened during that time:

The intermix.io team tested the Concurrency Scaling function; this is what the AWS Redshift console showed during our test


Redshift spun up three (3) concurrency scaling clusters as requested. It appears that these clusters were not fully utilized, even though our cluster had many queries that were queuing.

The usage chart correlates closely with the scaling activity chart:

During our intermix.io test of Redshift Concurrency Scaling, we saw correlations between the Redshift Console scaling activity chart & usage chart

After a few hours, we checked and it looked like 6 queries ran with concurrency scaling.  We also spot-checked two queries against the UI. We haven’t checked how this value may be used if multiple concurrency clusters are active.

Conclusion: Is Redshift Concurrency Scaling Worth it?

Concurrency scaling may mitigate queue times during bursts in queries.

From this basic test, it appears that a portion of our query load improved as a result. However, simply enabling concurrency scaling didn’t fix all of our concurrency problems.  The limited impact is likely due to the limitations on the types of queries that can use concurrency scaling. For example, we have a lot of tables with interleaved sort keys, and much of our workload is writes.

While concurrency scaling doesn’t appear to be a silver bullet solution for WLM tuning in all cases, the feature is transparent and easy to use. You can start with a single concurrency cluster, then monitor the peak load via the console to determine whether the new clusters are being fully utilized.

Though it may not have lived up to being the automatic solution advertised, concurrency scaling will become more and more effective over time as AWS adds more features and support. We strongly recommend enabling the feature on your WLM queues.

Here at Intermix, we’re constantly building products and innovating for Amazon Redshift users, and we’d like to think we have our finger on the pulse of what Redshift customers are saying. The number one driver of change that we’ve seen for our clients is that they’re experiencing huge growth in data and query volumes.

A 2016 report by IDG found that the average company now manages 163 terabytes (163,000 gigabytes) of information—and the figures have undoubtedly only increased since the survey was released. Of course, having more data at hand also means that you need higher query volumes to extract all of the insights they contain. 

We’ve observed three main drivers behind the growth in data and query volumes among our customers:

  1. Growth in the number of data sources connected to a Redshift cluster, and the volume of data coming from those sources. Among many of our clients, we’ve observed exponential data growth rates—even doubling every year.
  2. Growth in the number of workflows that are running once data resides inside a cluster. The reason is simple: the more data and sources you have, the more ways you have to join and combine that data.
  3. Growth in the number of data consumers who want more data, in more combinations, at a higher frequency. Much of this growth is driven by how easy dashboard tools make it to explore data.  

When building data pipelines with Amazon Redshift, all of these growth factors make it quite challenging to monitor and detect anomalies across all of the queries running on the system. For example, a lot of our clients use Looker as their tool for data exploration, and one of the most common complaints with the platform is slow Looker dashboards.

One way of approaching this problem is to follow best practices for setting up your workload management in Amazon Redshift the right way. You get a more granular view of your users and the resources they use, and the WLM gives you the power to isolate them from each other.

But even after optimizing your WLM configuration, you’re still lacking critical insights. For example:

Until now, it’s been quite hard to surface all of this information in a single place. That’s why we built a way to intuitively and powerfully visualize these insights, helping to make your life as a data engineer easier when working with Amazon Redshift.

The new feature is called “Query Insights,” and it’s available immediately in your Intermix dashboard. Below, we’ll check out some examples of using the Query Insights feature in Intermix.

Detect a Huge Increase in Query Volumes in Amazon Redshift

In this example, we’ll show how you can use Query Insights to quickly identify that a huge spike in query volume happened, as well as the specific queries that caused it.

The Intermix dashboard clearly shows that there has been a query count spike at 8:59 p.m. on January 24. In the left panel in the Intermix user interface, click on “query group” to segment the chart by query group. Click on the chart to reveal a slide-in list of query groups, sorted in descending order of Count (i.e. the number of queries). Finally, click on the “View Queries” icon to jump to the Query Details list, where you can see exactly which queries are causing the problem.

intermix.io dashboard on queries


Identifying Which Query Group Is Causing a Batch Pipeline Slowdown in Amazon Redshift

In this example, we’ll show you how to use Query Insights to find the cause of an overall increase in latency.

Click on the graph icon in the main Query Insights dashboard to group the data by app. This reveals that the “intermix_collector” app is experiencing a large increase in latency. Next, click on the “intermix_collector” app so that you can group by a custom attribute.

This app has two custom attributes, “dag” and “task”, that appear on the left navigation panel. This is because these queries are tagged using the Intermix annotation format. First group by “dag”, and then by “task”, to isolate the specific task that is seeing a spike in latency. Finally, group by “query group” to see the specific queries that are causing a problem.

query insights in intermix.io


Find the Query in Amazon Redshift causing a Looker PDT Latency Spike

In this example, let’s monitor our Looker PDT queries to find the queries which are the slowest.

First, click on the “Looker PDT” tag in the dashboard in order to only view data for Looker PDT queries. As you can see in the left navigation panel, the Looker PDT app has a few custom attributes. Click on “model” to group the queries by model name. We immediately see that the “supply chain” model is the slowest. We can click on it and then group by “query group” to find the individual queries causing a problem.

query optimization

What’s Next

Query Insights is a tremendously valuable tool in your Redshift toolkit, but we’re only getting started. Keep your eyes open for a new feature, “Transfer Insights”, coming soon, which will allow you to monitor the users and apps that are loading data and rows into your Amazon Redshift cluster.

Want to try Query Insights out for yourself? Take the opportunity to get a personalized tour of Query Insights, as well as see what’s next on our product roadmap and provide your feature requests and feedback. Please find a time on my calendar at this link, or sign up today for a free trial of Intermix.