At intermix.io, we work with companies that build data pipelines. Some start cloud-native on platforms like Amazon Redshift, while others migrate from on-premise or hybrid solutions. What they all have in common is the one question they ask us at the very beginning:
“How do other companies build their data pipelines?”
And so that’s why we decided to compile and publish a list of publicly available blog posts about how companies build their data pipelines. In those posts, the companies talk in detail about how they’re using data in their business and how they’ve become data-centric.
The 15 Companies we’ve looked at are:
Table of Contents
If we missed your post, we’re happy to include it. Just fill out this form, which will take you less than a minute. And with that – please meet the 15 examples of data pipelines from the world’s most data-centric companies.
Getting data-driven is the main goal for Simple. It’s important for the entire company to have access to data internally. Instead of the analytics and engineering teams to jump from one problem to another, a unified data architecture spreading across all departments in the company allows building a unified way of doing analytics.
The main problem then is how to ingest data from multiple sources, process it, store it in a central data warehouse, and present it to staff across the company. Similar to many solutions nowadays, data is ingested from multiple sources into Kafka before passing it to compute and storage systems.
The warehouse of choice is Redshift, selected because of its SQL interfaces and the ease with which it processes petabytes of data. Reports, analytics, and visualizations are powered using Periscope Data. In such a way, the data is easily spread across different teams, allowing them to make decisions based on data.
Clearbit was a rapidly growing, early-stage startup when it started thinking of expanding its data infrastructure and analytics. They tried out a few out-of-the-box analytics tools, each of which failed to satisfy the company’s demands.
After that, Clearbit took building the infrastructure in their own hands. Their efforts converged into a trio of providers: Segment, Redshift, and Mode. Segment is responsible for ingesting all kinds of data, combining it, and syncing it daily into a Redshift instance. The main data storage is obviously left to Redshift, with backups into AWS S3.
Finally, since Redshift supports SQL, Mode is perfectly suited for running queries (while using Redshift’s powerful data processing abilities) and creating data insights.
Mode makes it easy to explore, visualize, and share that data across your organization.
But as data volume grows, that’s when data warehouse performance goes down. With ever-increasing calls to your data from analysts, your cloud warehouse becomes the bottleneck.
That’s why we’ve built intermix.io to provide Mode users with all the tools they need to optimize their queries running on Amazon Redshift. Here one of our dashboards that shows you how you can track queries from Mode down to the single user:
The whole data architecture at 500px is mainly based on two tools: Redshift for data storage; and Periscope for analytics, reporting, and visualization. From a customer-facing side, the company’s web and mobile apps run on top of a few API servers, backed by several databases – mostly MySQL. Data from these DBs passes through a Luigi ETL, before moving to storage on S3 and Redshift.
Splunk here does a great job of querying and summarizing text-based logs. Periscope Data is responsible for building data insights and sharing them across different teams in the company. All in all, this infrastructure supports around 60 people distributed across a couple of teams within the company, prior to their acquisition by Visual China Group.
The data infrastructure at Netflix is one of the most sophisticated in the world. The video streaming company serves over 550 billion events per day, equaling roughly to 1.3 petabytes of data. In general, Netflix’s architecture is broken down into smaller systems, such as systems for data ingestion, analytics, and predictive modeling. The data stack employed in the core of Netflix is mainly based on Apache Kafka for real-time (sub-minute) processing of events and data. Data needed in the long-term is sent from Kafka to AWS’s S3 and EMR for persistent storage, but also to Redshift, Hive, Snowflake, RDS, and other services for storage regarding different sub-systems. Metacat is built to make sure the data platform can interoperate across these data sets as a one “single” data warehouse. Its task is to actually connect different data sources (RDS, Redshift, Hive, Snowflake, Druid) with different compute engines (Spark, Hive, Presto, Pig). Other Kafka outputs lead to a secondary Kafka sub-system, predictive modeling with Apache Spark, and Elasticsearch. Operational metrics don’t flow through the data pipeline but through a separate telemetry system named Atlas.
The tech world has seen dramatic changes since Yelp was launched back in 2004. By 2012, Yelp found themselves playing catch-up. It transformed from running a huge monolithic application on-premises to one built on microservices running in the AWS cloud. By the end of 2014, there were more than 150 production services running, with over 100 of them owning data. Its main part of the cloud stack is better known as PaSTA, based on Mesos and Docker, offloading data to a Redshift data warehouse, Salesforce CRM, and Marketo marketing automation. Data enters the pipeline through Kafka, which in turn receives it from multiple different “producer” sources.
Gusto, founded in 2011, is a company that provides a cloud-based payroll, benefits, and workers’ compensation solution for businesses. Their business has grown steadily over the years, currently topping to around 60 thousand customers. By early 2015, there was a growing demand within the company for access to data. Up until then, the engineering team and product managers were running their own ad-hoc SQL scripts on production databases. There was obviously a need to build a data-informed culture, both internally and for their customers. When coming to the crossroad to either build a data science or data engineering team, Gusto seems to have made the right choice: first, build a data infrastructure that can support analysts in generating insights and drawing prediction models.
The first step for Gusto was to replicate and pipe all of their major data sources into a single warehouse. The warehouse choice landed on an AWS Redshift cluster, with S3 as the underlying data lake. Moving data from production app databases into Redshift was then facilitated with Amazon’s Database Migration Service. On the other side of the pipeline, Looker is used as a BI front-end that teams throughout the company can use to explore data and build core dashboards. Aleph is a shared web-based tool for writing ad-hoc SQL queries. Finally, monitoring (in the form of event tracking) is done by Snowplow, which can easily integrate with Redshift. And, as usual, Airflow orchestrates the work through the pipeline.
Building this pipeline helped to simplify data access and manipulation across departments. For instance, analysts can simply build their own datasets as part of an Airflow task and expose it to Looker to use in dashboards and further analyses.
Teads is a video advertising marketplace, often ranked as the number one video platform in the world. Working with data-heavy videos must be supported by a powerful data infrastructure, but that’s not the end of the story. Teads’ business needs to log user interactions with their videos through the browser – functions like play, pause, resume, complete – which count up to 10 million events per day. Another source of data is video auctions with a real-time bidding process. These generate another 60 million events per day. To build their complex data infrastructure, Teads has turned to both Google and Amazon for help.
Originally the data stack at Teads was based on a lambda architecture, using Storm, Spark and Cassandra. This architecture couldn’t scale well, so the company turned toward Google’s BigQuery in 2016. They already had their Kafka clusters on AWS, which was also running some of their ad delivery components, so the company chose a multi-cloud infrastructure. Transferring data between different cloud providers can get expensive and slow. To address the second part of this issue, Teads placed their AWS and GCP clouds as close as possible and connected them with managed VPNs.
So how does their complex multi-cloud data stack look? Well, first of all, data coming from users’ browsers and data coming from ad auctions is enqueued in Kafka topics in AWS. Then using an inter-cloud link, data is passed over to GCP’s Dataflow, which is then well paired with BigQuery in the next step. Having all data in a single warehouse means half of the work is done. The next step would be to deliver data to consumers, and Analytics is one of them. The Analytics service at Teads is a Scala-based app that queries data from the warehouse and stores it to tailored data marts. Interestingly, the data marts are actually AWS Redshift servers. In the final step, data is presented into intra-company dashboards and on the user’s web apps.
Remind’s data engineering team provides the whole company with access to the data they need, as big as 10 million daily events, and empower them to make decisions directly. They initially started with Redshift as its source of truth resource for data, and AWS S3 to optimize for cost.
While S3 is used for long-term storage of historical data in JSON format, Redshift only stores the most valuable data, not older than three months. The company uses Interana to run custom queries on their JSON files on S3, but they’ve also recently started using AWS Athena as a fully managed Presto system to query both S3 and Redshift databases.
The move for Athena also triggered a change in the data format from JSON to Parquet, which they say was the hardest step in building up their data platform. An EMR/Hive system is responsible for doing the needed data transformations between S3 and Athena. In the data ingestion part of the story, Remind gathers data through their APIs from both mobile devices and personal computers, as the company business targets schools, parents, and students. This data is then passed to a streaming Kinesis Firehose system before streaming it out to S3 and Redshift.
Remind’s future plans are probably focused on facilitating data format conversions using AWS Glue. This step would allow them to replace EMR/Hive from their architecture and use Spark SQL instead of Athena for diverse ETL tasks.
Robinhood is a stock brokerage application that democratizes access to the financial markets, enabling customers to buy and sell stocks and ETFs with zero commission. The company debuted with a waiting list of nearly 1 million people, which means they had to pay attention to scale from the very beginning.
Robinhood’s data stack is hosted on AWS, and the core technology they use is ELK (Elasticsearch, Logstash, and Kibana), a tool for powering search and analytics. Logstash is responsible for collecting, parsing, and transforming logs before passing them on to Elasticsearch, while data is visualized through Kibana.
They grew from a single ELK cluster with a few GBs of data to three clusters with over 15 TBs. Before data goes to ELK clusters, it is buffered in Kafka, as the various data sources generate documents at differing rates.
Kafka also shields the system from failures and communicates its state with data producers and consumers. As with many other companies, Robinhood uses Airflow to schedule various jobs across the stack, beating competition such as Pinball, Azkaban and Luigi. Robinhood data science team uses Amazon Redshift to help identify possible instances of fraud and money laundering.
Dollar Shave Club (DSC) is a lifestyle brand and e-commerce company that’s revolutionizing the bathroom by inventing smart, affordable products. Don’t be fooled by their name. They have a pretty cool data architecture for a company in the shaving business. Their business model works with online sales through a subscription service. Currently, they serve around 3 million subscribed customers.
DSC’s web applications, internal services, and data infrastructure are 100% hosted on AWS. A Redshift cluster serves as the central data warehouse, receiving data from various systems. Data movement is facilitated with Apache Kafka and can move in different directions – from production DBs into the warehouse, between different apps, and between internal pipeline components.
There’s also Snowplow, which collects data from the web and mobile clients. Once data reaches Redshift, it is accessed through various analytics platforms for monitoring, visualization, and insights. The main tool for the job is, of course, Apache Spark, which is mainly used to build predictive models, such as recommender systems for future sales.
Coursera is an education company that partners with the top universities and organizations in the world to offer online courses. They started building their data architecture somewhere around 2013, as both numbers of users and available courses increased. As of late 2017, Coursera provides courses to 27 million worldwide users.
Coursera collects data from its users through API calls coming from mobile and web apps, their production DBs, and logs gathered from monitoring. A backend service called “eventing” periodically uploads all received events to S3 and continuously publishes events to Kafka. The engineering team has selected Redshift as its central warehouse, offering much lower operational cost when compared with Spark or Hadoop at the time.
On the analytics end, the engineering team created an internal web-based query page where people across the company can write SQL queries to the warehouse and get the information they need. Of course, there are company-wide analytics dashboards that are refreshed on a daily basis. Finally, many decisions made in Coursera are based on machine learning algorithms, such as A/B testing, course recommendations, and understanding student dropouts.
Wish is a mobile commerce platform. It provides online services that include media sharing and communication tools, personalized and other content, as well as e-commerce. During the last few years, it grew up to 500 million users, making their data architecture out of date.
Before they scaled up, Wish’s data architecture had two different production databases: a MongoDB NoSQL database storing user data; and a Hive/Presto cluster for logging data. Data engineers had to manually query both to respond to ad-hoc data requests, and this took weeks at some points. Another small pipeline, orchestrated by Python Cron jobs, also queried both DBs and generated email reports.
After rethinking their data architecture, Wish decided to build a single warehouse using Redshift. Data from both production DBs flowed through the data pipeline into Redshift. BigQuery is also used for some types of data. It feeds data into secondary tables needed for analytics. Finally, analytics and dashboards are created with Looker.
Blinkist transforms the big ideas from the world’s best nonfiction books into powerful little packs users can read or listen to in 15 minutes. At first, they started selling their services through a pretty basic website, and they monitored statistics through Google Analytics. Unfortunately, visitor statistics gathered from Google Analytics didn’t match the figures the engineers were computing. This is one of the reasons why Blinkist decided to move to the AWS cloud.
They choose a central Redshift warehouse where data flows in from user apps, backend, and web front-end (for visitors tracking). To get data to Redshift, they stream data with Kinesis Firehose, also using Amazon Cloudfront, Lambda, and Pinpoint. The engineering team at Blinkist is working on a newer pipeline where ingested data comes to Alchemist, before passing it to a central Kinesis system and onwards to the warehouse.
Healthcare platform Halodoc found themselves with a common startup problem: scalability. Their existing data pipeline worked on a batch processing model, with regularly scheduled extractions for each source. They performed extractions with various standard tools, including Pentaho, AWS Database Migration Service, and AWS Glue.
They would load each export to S3 as a CSV or JSON, and then replicate it on Redshift. At this point, they used a regular Pentaho job to transform and integrate data, which they would then load back into Redshift.
As Halodoc’s business grew, they found that they were handling massive volumes of sensitive patient data that had to get securely and quickly to healthcare providers. The Pentaho transformation job, installed on a single EC2 instance, was a worrying single point of failure.
Halodoc looked at a number of solutions and eventually settled on Apache Airflow as a single tool for every stage of their data migration process. They chose Airflow because it’s highly responsive and customizable, with excellent error control. It also supports machine learning use cases, which Halodoc requires for future phases.
The new data pipeline is much more streamlined. Halodoc uses Airflow to deliver both ELT and ETL. In their ETL model, Airflow extracts data from sources. It then passes through a transformation layer that converts everything into pandas data frames. The data frames are loaded to S3 and then copied to Redshift. Airflow can then move data back to S3 as required.
For ELT, the Airflow job loads data directly to S3. Halodoc then uses Redshift’s processing power to perform transformations as required.
iHeartRadio is a global streaming platform for music and podcasts. It runs on a sophisticated data structure, with over 130 data flows, all managed by Apache Airflow. These data pipelines were all running on a traditional ETL model: extracted from the source, transformed by Hive or Spark, and then loaded to multiple destinations, including Redshift and RDBMSs.
On reviewing this approach, the engineering team decided that ETL wasn’t the right approach for all data pipelines. Where possible, they moved some data flows to an ETL model. Data flows directly from source to destination – in this instance, Redshift – and the team applies any necessary transformations afterward. Redshift Spectrum is an invaluable tool here, as it allows you to use Redshift to query data directly on S3 via an external meta store, such as Hive.
However, this model still didn’t suit all use cases. The iHeartRadio team began experimenting with the ETLT model (Extract, Transform, Load, Transform) model, which combines aspects of ETL and ELT. In this approach, the team extracts data as normal, then uses Hive for munging and processing. They then load the data to the destination, where Redshift can aggregate the new data.
Now, the team uses a dynamic structure for each data pipeline, so data flows might pass through ETL, ELT, or ETLT, depending on requirements. This new approach has improved performance by up to 300% in some cases, while also simplifying and streamlining the entire data structure.
We hope the 15 examples in this post offer you the inspiration to build your own data pipelines in the cloud.
If you don’t have any data pipelines yet, it’s time to start building them. Begin with baby steps and focus on spinning up an Amazon Redshift cluster, ingest your first data set and run your first SQL queries.
After that, you can look at expanding by acquiring an ETL tool, adding a dashboard for data visualization, and scheduling a workflow, resulting in your first true data pipeline. And once data is flowing, it’s time to understand what’s happening in your data pipelines.
That’s why we built intermix.io. We give you a single dashboard to understand when & why data is slow, stuck, or unavailable.
With intermix.io you can:
Our customers have the confidence to handle all the raw data their companies need to be successful. What you get is a real-time analytics platform that collects metrics from your data infrastructure and transforms them into actionable insights about your data pipelines, apps, and users who touch your data.
Setting up intermix.io takes less than 10 minutes, and because you can leverage our intermix.io experts, you can say goodbye to paying for a team of experts with expensive and time-consuming consulting projects. We can help you plan your architecture, build your data lake and cloud warehouse, and verify that you’re doing the right things.
Amazon Redshift offers an attractive feature that can help organizations manage their hosting bill. It’s called concurrency scaling, and according to Amazon, it “automatically and elastically scales query processing power to provide consistently fast performance for hundreds of concurrent queries.”
Before concurrency scaling, Redshift users faced a familiar dilemma – dealing with peak demand. There were two options:
Concurrency scaling adds resources to your Redshift cluster on an on-demand basis, adding processing power during peak time and withdrawing it in quieter moments.
In terms of pricing, concurrency scaling works on a credit system that should make it free for most users. Amazon allows you to earn one free hour of scaling for every 24 hours of main Redshift cluster usage, and these credits accrue over time. Any usage outside of your credits gets billed on a per-second basis according to your Redshift agreement.
Concurrency scaling makes financial sense, but can it offer consistent service? Let’s find out.
There are three eligibility requirements for concurrency scaling. Your Redshift cluster must be:
This means that single-node clusters are not eligible. Also, note that the cluster must have had fewer than 32 nodes at creation. If your cluster originally had 50 nodes and you scale down to 32, you’re still not eligible for concurrency scaling.
Concurrency scaling does not work on all query types. For the first release, it handles read-only queries that meet three conditions:
For routing to a concurrency scaling cluster, a query needs to encounter queueing. Also, queries eligible for SQA (Short Query Acceleration) queue will not run on the concurrency scaling clusters.
Queuing and SQA are a function of a proper set-up of Redshift’s workload management (WLM). We recommend first optimizing your WLM because it will reduce the need for concurrency scaling. And that matters because, while AWS claims that concurrency scaling will be free for 97% of customers, you could face an additional usage charge if you exceed your credits.
We’ve also tested enabling Redshift’s automatic WLM and captured our experience with it in this blog post, “Should I Enable Amazon Redshift’s Automatic WLM?“
Concurrency scaling is enabled on a per-WLM queue basis. Go to the AWS Redshift Console and click on “Workload Management” from the left-side navigation menu. Select your cluster’s WLM parameter group from the subsequent pull-down menu.
You should see a new column called “Concurrency Scaling Mode” next to each queue. The default is ‘off’. Click ‘Edit’ and you’ll be able to modify the settings for each queue.
Concurrency scaling works by routing eligible queries to new, dedicated clusters. The new clusters have the same size (node type and number) as the main cluster.
The number of clusters used for concurrency scaling defaults to one (1), with the option to configure up to ten (10) total clusters.
The total number of clusters that should be used for concurrency scaling can be set by the parameter max_concurrency_scaling_clusters. Increasing the value of this parameter provisions additional standby clusters.
There are a few additional charts in the AWS Redshift console. There is a chart called “Max Configured Concurrency Scaling Clusters” which plots the value of max_concurrency_scaling_clusters over time.
The number of Active Scaling clusters is also shown in the UI under Concurrency Scaling Activity:
The Queries tab in the UI also has a column to show if the query ran on the Main cluster or on the Concurrency Scaling cluster:
Whether a particular query ran on the main cluster or via a concurrency scaling cluster is stored in stl_query.concurrency_scaling_status.
A value of 1 means the query ran on a Concurrency Scaling cluster, and other values mean it ran on the main cluster.
redshiftcluster_2=# select distinct
concurrency_scaling_status,count(*) from stl_query where endtime <
'2019-03-29 15:00:00' group by concurrency_scaling_status;
concurrency_scaling_status | count
2 | 21
0 | 310790
4 | 19818
6 | 69082
11 | 7
3 | 853546
8 | 228977
Concurrency Scaling info is also stored in some other tables/views, e.g. SVCS_CONCURRENCY_SCALING_USAGE. TherConcurrency scaling info is also stored in some other tables/views, such asSVCS_CONCURRENCY_SCALING_USAGE.
The following views have similar information as the corresponding STL views or SVL views:
These views work in the same way as their STL or SVL equivalents.
We enabled concurrency scaling for a single queue on an internal cluster at approximately 2019-03-29 18:30:00 GMT. We changed the max_concurrency_scaling_clusters parameter to 3 at approximately 2019-03-29 20:30:00.
To simulate query queuing, we lowered the # of slots for the queue from 15 slots to 5 slots.
Below is a chart from the intermix.io dashboard, showing the running versus queuing queries for this queue, after cranking down the number of slots.
We observe that the queueing time for queries went up, maxing out at about > 5 minutes.
Here’s the corresponding summary in the AWS console of what happened during that time:
Redshift spun up three (3) concurrency scaling clusters as requested. It appears that these clusters were not fully utilized, even though our cluster had many queries that were queuing.
The usage chart correlates closely with the scaling activity chart:
After a few hours, we checked and it looked like 6 queries ran with concurrency scaling. We also spot-checked two queries against the UI. We haven’t checked how this value may be used if multiple concurrency clusters are active.
redshiftcluster_2=# select distinct
concurrency_scaling_status,count(*) from stl_query where endtime >
'2019-03-29 18:30:00' group by concurrency_scaling_status;
concurrency_scaling_status | count
4 | 108
6 | 333
1 | 6
0 | 913
3 | 4495
8 | 304
Concurrency scaling may mitigate queue times during bursts in queries.
From this basic test, it appears that a portion of our query load improved as a result. However, simply enabling concurrency scaling didn’t fix all of our concurrency problems. The limited impact is likely due to the limitations on the types of queries that can use concurrency scaling. For example, we have a lot of tables with interleaved sort keys, and much of our workload is writes.
While concurrency scaling doesn’t appear to be a silver bullet solution for WLM tuning in all cases, using the feature is transparent and easy to use. You can start with a single concurrency cluster, then monitor the peak load via the console to determine whether the new clusters are being fully utilized.
Though it may not have lived up to be the automatic solution advertized, concurrency scaling will become more and more effective over time as AWS adds more features and support. We strongly recommend enabling the feature on your WLM queues.
To win in today’s market, companies must invest in both their infrastructure and their people. Data-first companies like Netflix, Uber and Tinder are dominating their industries. Mentions of “AI” are often heard in advertisements, product launches, and earnings calls. Businesses are scrambling to re-imagine their tech infrastructure and provide their people with things like Amazon Redshift training.
Why is this happening now?
Data is the new differentiator:
Moving towards a data-driven organization can be daunting for decision-makers. Learning investments such as Amazon Redshift training can sometimes produce limited results, thanks to the complex nature of modern data platforms, which can intimidate even the most experienced IT professionals.
We’re all generating a lot more data than ever before. Storing it is a challenge; analyzing it is an even bigger challenge. Over the years, we’ve seen highly structured data warehouses give way to more anarchic data lakes, which in turn are giving way to multi-tiered structures like data lakehouses.
As a result of the shift, traditional concepts of “ETL” are being re-imagined for a new world where:
The response has been that teams are building complex data assembly lines which have common characteristics:
Data lake architectures pose new challenges. Since lakes store their data without any oversight of the contents, the data needs to have defined mechanisms for cataloging in order to make it usable. Without this, data isn’t reliable. Companies need data pipelines that offer governance consistency and access controls to the data lake.
New technologies like Amazon Redshift allow all of the storage and processing power required to run a data-first business, but you still need the tools to turn that data into actionable insights.
Companies use the data for the following purposes:
Building and managing complex data assembly line requires a new skill set.
The last time we saw a comparable shift was back when cloud computing was first developed. Running cloud apps required an operational (ie uptime, cost, and performance) mindset coupled with an ability to write code. The DevOps role was born out of a need to manage infrastructure as code. As a result, engineering teams had to establish new teams and hire new types of employees. These employees needed different tools to do their jobs.
Similarly, a new role has emerged to manage data pipelines: the data engineer. Data engineers manage complex data flows by writing code that manipulates data. But they are also accountable for the uptime, performance, and cost accounting for the data flows. This skillset is a combination of DevOps and data analyst, with a sprinkling of a database administrator. They also need platform knowledge, such as Amazon Redshift training.
Building a data assembly line involves:
Raw data often exists in application silos, stranded.93% of enterprises have a multi-cloud strategy, while 87% have a hybrid cloud strategy. So, for many organizations, data exists across multiple clouds, plus their own data centers. They will need to inspect and catalog this data before they realize any value.
Security is a fundamental part of any data pipeline. The data owners must audit all access, and also ensure that the right people and processes have the right permissions.
Before anyone can trust data, they have to transform and cleanse it. Consider how you identify a customer across the data – is it by email, name, or some unique ID? If you want to combine two data sources, which one of those do you use? This stage involves data validation, removal of duplication, and handling of null values.
Mobile and IoT generate huge volumes of data. Running fast queries on huge data volumes requires careful planning, tuning, and configuration of data analytics infrastructure.
Most companies start by hiring data engineers to implement reliable data pipelines, and they may also hire data scientists with analytics skills. For a data pipeline, you may also require staff with specialist knowledge, such as Hadoop, ETL or Amazon Redshift training.
Unliked end-user applications, data apps runs jobs on the data assembly line. There are three categories of data apps.
The fundamental problem solved by the data engineer is to ensure that the data assembly line is working.
Are data flows operating normally?
Do my data tables contain the correct results?
Are data apps able to access the data quickly?
This requires answering questions in real-time across multiple systems:
In order to accomplish this, you need new types of metrics. Traditional networking monitoring metrics like CPU and network utilization are irrelevant when monitoring data assembly lines, because data flows operate at a different layer.
A monitoring tool for data flows must consider:
We started intermix.io to solve these problems. Our mission is to provide data engineers with a single dashboard to help them monitor their mission critical data flows. And if there are problems, that they are the first to know and the reason why.
Amazon Redshift is a petabyte-scale data warehouse that has been widely adopted since its release in October 2012. With Redshift, it’s easy to spin up a cluster, pump in data, and begin performing advanced analytics in under an hour.
Because it’s so easy to start using Redshift, however, data engineers often skip Redshift best practices when setting up a cluster. Cutting corners when setting up Redshift may create performance issues down the line, and you’ll pay the price later as your data volume and pipeline complexity grows. Some of the common Redshift pain points are slow queries and lack of workload scalability.
You may have already seen our article on the top performance tuning techniques for Amazon Redshift. In this post, we’ll focus on exactly the opposite topic: the top 3 things not to do when setting up an Amazon Redshift cluster. By scrupulously avoiding these issues, you’ll be paving the way for success as the complexity of your data pipeline grows.
For many people, the process of setting up Amazon Redshift looks like this: when launching a Redshift cluster, you create a masteruser, which by default has access to the initial database. Next, the masteruser’s login gets shared, such that ETL pipelines, scheduled jobs, and dashboard tools all log in with the same user.
The problem with this approach is that you lose granularity: it gets much more difficult to understand which people are doing what and running which queries. As you add more users, troubleshooting bad queries starts to become harder and harder. A single masteruser may work if you only have 3 to 5 users accessing Redshift, but it becomes simply intractable once you have 10 or more.
Instead, use Redshift’s CREATE USER command, which creates a new database user account, and create individual logins to isolate your workloads—one user, one login, no exceptions. This way, you’ll have more control and better visibility into your workloads. We also recommend grouping your users by type of workload (e.g. loads, transforms, ad hoc). This will come in handy later when you set up workload management (WLM).
Meanwhile, you should keep the masteruser idle—don’t run any queries with it. The masteruser can serve as your lifeline in case your cluster becomes unresponsive. When nothing else works, you’ll be able use it to run queries (see below).
Yet with a single schema and read/write access for all users, you’re creating dependencies that become hard to untangle as data, user and query volume grow. More users create more joins across more tables, which means that you’ll be gradually boxing yourself in. Don’t do it.
Instead, plan your data architecture well in advance. Start with two different schemas: a “raw schema” where you load all your data, and a “data schema” that’s available for running ad hoc queries. You can move data between the schemas with your transform jobs.
As a rule of thumb, users in each role should only have access to the schemas and tables that they need, and no more. That’s also another reason why you should put in place individual users and user groups.
See your data in intermix.io
Amazon Redshift operates in a queueing model, with query queues and slots. You can use the Amazon Redshift Management Console to define query queues. Redshift then routes queries to the appropriate queues at runtime.
You also have the option to use Amazon Redshift’s automatic workflow management (WLM)—but we wouldn’t recommend it. User sessions and queries can overlap. Some queries can consume cluster resources for long periods of time, which in turn impacts the performance of other queries. That’s why you should isolate your workloads from each other.
By default, Amazon Redshift configures two query queues:
There’s a 99 percent chance that the default queue will not work for you (although every company’s workloads are different). But setting up WLM is something people like to ignore because it’s so dang easy to dump data into Redshift—and when things get slow, you can simply add another node. However, this becomes a very expensive proposition over time. Consider that the performance increase of adding more nodes without WLM is degressive.
In our blog post “4 Simple Steps to Set Up your WLM in Amazon Redshift,” we describe in detail how to set up your cluster for better workload scalability. We recommend checking out the whole article, but here are the main four steps:
What’s more, any queries that are not routed to other queues should run in the default queue, which acts as your insurance if something goes wrong. By not assigning your default user to any queue, it will run in the default queue, which will always have that one slot available.
To sum up, the three biggest mistakes to avoid when setting up an Amazon Redshift cluster are:
By avoiding these three critical issues when setting up your Amazon Redshift cluster, you’ll be poised for more scalable workloads and long-term success. However, this is just the start of Redshift performance tuning at scale. As you load more data into your cluster and add more users, things can become slow.
When it’s critical for you to have excellent Amazon Redshift performance, it makes sense to use a product like intermix.io alongside an ETL tool. intermix.io helps you understand how your queries and data loads are performing, giving you complete visibility into your Redshift clusters.
This is a guest blog post by our visiting expert Dieter Matzion. Dieter is a Business Systems Analyst at Intuit. In the post, Dieter explains how his team has built a KPI dashboard to track cloud spend on top of Amazon Redshift. With the dashboard, over 5,000 Intuit employees gain self-service visibility into their utilization of cloud resources. You can follow & connect with Dieter on Linkedin.
The cloud computing market is forecasted by Forbes and Gartner to grow about 17% in 2020, to an aggregated annual revenue of over a quarter trillion dollars. Almost 70% of enterprises are moving business-critical applications to the cloud says Tech Republic. Companies using the cloud can focus on their core business vs. the undifferentiated heavy lifting as Jeff Bezos calls it.
Cloud has become a competitive advantage that drives innovation with unprecedented speed. Economies of scale allow Intuit to run a 100,000 core data analysis job at any time we choose by simply writing a check. Elasticity allows builders to scale vertically and horizontally at the click of a mouse button. The global presence of the cloud allows enterprises to provide services close to their customers. And shifting from a data center to the cloud allows CFOs to move capital expenditures to operational ones. You can read our cloud migration story in “Intuit’s Journey to Cloud Analytics“.
However, the cloud is very different from a data center as is evident by the many authors talking about its disadvantages. The procurement model for the cloud is comparable to a cell phone bill that you are expected to pay in full at the end of the month.
Cloud governance is often de-prioritized due to the complexity and favoring innovation, opening the door for runaway spending. Renting managed services has a higher cost compared to bare metal machines in the data center. Cost items like data transfer and storage covered by someone else in the data center now show up on your business unit’s monthly bill. Prepaid services need to be amortized, and your organization may have different financial reporting needs than provided out-of-the-box.
Lifting and shifting from the data center to the cloud inherits all inefficiencies and right-sizing is new skill engineers need to learn. Enterprises relinquish central control of security, privacy, and compliance by assuming a shared responsibility model with the cloud provider.
But all is not doom and gloom, and ignoring the cloud gives an advantage to your competitors who are willing to master the new environment. In this article, I am going to share best practices and learnings of how large enterprises overcome these obstacles.
Let’s start with visibility. Managing cost in a data center is slow-paced. Procurement of hardware, racking, and installing software takes months. Budgets are stable and forecasted years in advance. Using the same method in the cloud is like driving a car at top speed – blindfolded – and expecting not to crash.
Unpredictable cost is the top pain point with lack of visibility a close second of 300 C-level and IT decision-makers surveyed by SoftwareONE in 2018.
Fortunately, major cloud providers provide detailed records of how you spent your money. Unfortunately, this insight may come too late, requires training before it becomes usable, and may not align with the financial reporting requirements of your organization.
At Intuit we process 4 billion rows of AWS billing data (the “AWS Cost and Usage Report”) every day into Amazon Redshift, which feeds financial reports tailored to different audiences. We layer in amortization of prepaid services and chargebacks to obtain financial numbers that are directly compared to cloud budgets. You can read about the detailed data architecture and processes in the post “Intuit’s to Cloud Analytics“.
Latencies of billing data are filled with data from inventory changes to get near-real-time cost estimates. This allows us to alert business owners of budget risks before runaway spend occurs.
The billing data is further enhanced with performance and efficiency telemetry like compute, storage, database, container, and serverless utilization data. This allows us to surface actionable insights for right-sizing and cost avoidance to executives and engineers alike. Our CTO and CFO review the top opportunities from the largest vertical down to a single engineer on a quarterly basis. This executive commitment cascades down the organization and is key to efficient operation in the cloud.
Surface actionable insights for right-sizing and cost avoidance to executives and engineers.Share this Amazon Redshift Transformational Use Case
The lowest hanging fruit for saving or avoiding the cost in the cloud are discounts. This is because a small team can implement these without having to engage a large number of engineers. Enterprise discount agreements typically give you a flat percentage rate depending on a specific commitment. These will vary for each customer as they are based on the volume of cloud services consumed.
Additionally, most cloud providers offer so-called sustained usage discounts, a percentage discount for a specific service in exchange for a longer-term commitment. Intuit actively manages a $100M portfolio of AWS Reserved Instances. These are prepaid leases for EC2, RDS, Redshift, Elasticache, ElasticSearch, and DynamoDB. Discounts can range up to 70% depending on the type of commitment. The portfolio is managed centrally using 3rd party tools like CloudHealth. Unit pricing is calculated hourly and exposed via an API. This allows automated cost decisions where software chooses most economic deployment patterns.
Cloud governance has a heavier lift compared to the previous areas as policies will affect how engineers can use the cloud. You want to establish guardrails instead of gatekeepers to promote the speed of innovation. Each policy is customized based on its purpose and maintained as code to allow real-time enforcement. Management and engineers need to be aware of the policies to innovate efficiently.
Intuit uses a customized fork of Capital One’s Cloud Custodian for automated policy enforcement. We maintain several hundred policies in an internal Github, 40 of them are focused on cloud cost. Here are a few examples of policies that will result in the most savings for the least effort.
Take advantage of the cloud’s elasticity by stopping to incur cost when turning off resources that are not in use. So-called cloud parking is automation that is executed to decommission and reprovision cloud resources on a schedule. Development or testing environments that no one is using after business hours are good candidates. Assuming a 12 hour on time on weekdays, this will save over 60% where the policy can be enabled.
Setup red versus green policies to restrict certain usages to maximize discounts. For example, if your company is only purchasing Reserved Instances in 2-3 regions, you can opt to terminate new workload in all other regions to avoid paying an undiscounted price. Other examples are to disallow expensive services or services for which you don’t have discounts. You want to automate exceptions, where business owners can override the policy at any time, and get a report of how this affected their budget.
Cloud providers generally have some data lifecycle management capabilities, and you may need to supplement these. A good example is unattached Elastic Block Store (EBS) on AWS. EBS is a like virtual disk drive, you can think of it as a thumb drive. When it is not connected to a computer, it still incurs a cost, as the storage cannot be reused. Intuit uses policy automation to backup unattached EBS and deletes the backups after a month if no one claims them. Engineers with a valid business reason can tag unattached EBS volumes, which the policy then ignores.
We’d love to learn from you! Is there anything you can share about your own experience building in the cloud? We are always excited to share transformational use cases of Amazon Redshift. And if you want faster queries for your cloud analytics, and spend less time on Ops and more time on Dev like Intuit, then go ahead and schedule a demo or start a trial with intermix.io.