Table of Contents
In 2014 Intuit’s then-CTO Tayloe Sainsbury went all in on the cloud and started migrating Intuit’s legacy on-premise IT infrastructure to Amazon AWS. By February 2018, Intuit had sold its largest data center and processed 100% of 2018 tax filings in the cloud.
But now Intuit had a different challenge – optimizing cloud spend and allocating that spend to products and users. AWS bills customers via a “Cost & Usage Report” (“CUR”). Because of the size of its cloud spend, the Intuit CUR comprises billions of rows, and it keeps growing by the day. Intuit switched from an on-premise data warehouse and now uses Amazon Redshift to process 4-5 billion rows of raw CUR data – each day.
In this post, I’m walking you through the approach that Jason Rhoades took to build Intuit’s data pipeline with Redshift. Jason is an Architect at Intuit, and with a small data team, they provide business-critical data to more than 8,000 Intuit employees.
Heads up – this is a long post with lots of detail!
Three major blocks:
Let’s start with an overview of the business.
Intuit builds financial management and compliance products and services for consumers and small businesses. Intuit also provides tax products to accounting professionals.
Products include QuickBooks, TurboTax, Mint and Turbo. These products help customers run their businesses, pay employees, send invoices, manage expenses, track their money, and file income taxes. Across these products, Intuit serves more than 50 million customers.
Intuit started in the 1980s and built the original version of its first product Quicken for the desktop, first MS-DOS and then Windows. With the Internet, that usage shifted to web and mobile. The impact of that change became clear as early as 2010.
Fast forward to today, and over 90% of Intuits customers file their taxes and manage their accounting online and via mobile apps.
Consumption of Intuit products follows seasonal patterns.
Tax seasonality has the biggest impact on Intuit’s business. Each fiscal year, Intuit generates half of its annual revenue in the quarter ending on April 30th, with the US tax filing deadline on April 15th.
Seasonality also has a huge impact on the cost side of the equation. The shift to digital and online usage of Intuit’s products causes a dramatic usage spike for the IT infrastructure. Most users file their taxes online during the last two days of the tax season.
In the old world of on-premise infrastructure, and to handle the concurrent usage, Intuit had to size their data center for peak capacity. After tax season, demand drops back down to average usage. The gap between peak demand and average usage is so large, that 95% of Intuit’s infrastructure would sit idle for 95% of the year.
That’s why Intuit decided in 2014 to go all in with the cloud. With the cloud’s elasticity, Intuit is in a better position to accommodate spikes in customer usage during the tax season.
By shifting to the cloud, Intuit reduced cost by a factor of six because it no longer maintained idle servers for an application only active during tax season. After the first success, Intuit moved more applications, services and enabling tools to the cloud. Today, over 80% of Intuit’s workloads are running in the cloud.
With now growing usage of AWS, the priorities of the program shifted from migration speed to efficient operations and growth.
Intuit now spends $100s of Millions on prepaid AWS services (aka “reserved instances” or short “RIs”) alone, plus fees for on-demand usage during the peaks. Interest grew in understanding the use of different AWS services and spend by different business units and teams within Intuit.
The source for that information sits in the “Cost & Usage Report” (“CUR”), a bill that Amazon AWS delivers to every customer. The CUR includes line items for each unique combination of AWS product, usage type, and operation and of course pricing. The CUR also contains information about credits, refunds and support fees.
Analyzing CUR data supports Intuit’s cloud program with two major use cases:
To build these two use cases, Jason’s team needs to transform the raw CUR data into a format consumable by the business. The raw CUR data comes in a different format from what Intuit uses to charge internal parties, distribute shared costs, amortize RIs and record spend on the general ledger.
The traditional way of Jason’s team to run the analytics on the CUR data was with an on-premise data warehouse.
Unlike other companies, the size of Intuit’s CUR is very large. In 2017, it was around 500M rows at the end of a month.
Amazon delivers the report 3-4x per day to Intuit, and restates the rows with each report over the course of a month, meaning it gets longer with each delivery. Coupled with a growing business, the amount of data the cluster has to process each time grows by the hour – literally.
You can see that trend play out in chart above, with data from 2017. The grey area indicates the batch size for the CUR data. Each day, the batch size gets bigger as the number of rows in the CUR grow. At the end of the month, the CUR reaches about 500 million rows and resets on day one of the new month.
The amount of rows the warehouse processes per minute stays constant at around 1 million rows per minute. Therefore, the time it takes the warehouse to process each batch (“batch duration”) goes up in linear fashion. With 500M rows at the end of the month, it takes the warehouse 500 minutes to process the full report, or 8 hours and 20 minutes.
Now extrapolate forward and calculate what that looks like in the future. With rising cloud spend, the data team realized that the CUR would start to blow up in size. In fact, today the CUR is larger by a factor of 10x with ~5 billion rows. Now we’re talking over 80 hours, almost four days.
Intuit’s situation is a common scenario we see our customers run into: “More data, more workflows, more people”.
For Intuit, it was clear that “keep on doing what we’re doing” was not an option. In a world where data is an asset, data and DevOps teams should focus on the value-creation part of pipelines.
With cloud usage and data volume going up, the old on-prem warehouse was already running into bottlenecks, and so the analytics team followed the business team into the cloud.
The Intuit team followed their product team into the AWS cloud. The major goals included handling the explosion in data volume and adding value to the business.
With on-demand access to computing resources, access to usage data in near real-time is fundamental for Intuit’s business teams. Unlike in the old world, waiting for a report at the end of the month doesn’t work anymore. With the scale of Intuit’s cloud operations, a few hours of freshness have a substantial impact on the company.
Jason migrated the entire stack from Oracle to Redshift, and deployed the same SQL and ETL processes.
Redshift handled the growth in data volume. Three major data points:
You can also see that the size of the grey area has a step change in April – tax season! The change is due to new capabilities Intuit introduced, which tripled the number of rows of the bill (“more data”).
Despite tripling the number of rows, the batch duration stays within a narrow band and doesn’t spike. That’s because batch size and number of rows processed per minute grow at the same rate. In other words, the cluster processes more data faster, i.e. performance goes up as workloads grow.
Let’s dive into how Jason’s team achieved that result.
The cluster architecture and the data pipeline follow the best practices we recommend for setting up your Amazon Redshift cluster. In particular, pay attention to setting up your WLM to separate your different workloads from each other.
You can see the three major workloads in the architecture chart – stage, process and consume.
Among our customers, “ELT” is a standard pattern, i.e. the transformation of data happens in the cluster with SQL. Cloud warehouses like Redshift are both performant and scalable, to the point that data transformation uses cases can be handled much better in-database vs an external processing layer. SQL is concise, declarative, and you can optimize it.
Intuit follows the “ELT” vs. “ETL” approach. With a lot of SQL knowledge on the team, they can build transformations in SQL and run them within the cluster. AWS drops the CUR into an S3 bucket where Intuit extracts the raw data from (the “E”) into the staging area. Intuit leaves the raw data untouched and loads it into the cluster (the “L”), to then transform it (the “T”).
Underneath the processes is an orchestration layer that coordinates workflows and manages dependencies. Some workflows need to execute on an hourly or daily basis, others on arrival of fresh data. Understanding the workflows and their execution is a crucial component for data integrity and meeting your SLAs.
When workflows and data pipelines fail – and they will – you have to a) know about it as it happens and b) understand the root cause of the failure. Otherwise you will run into data integrity issues and miss your SLAs. In Intuit’s case, the key SLA is the near real-time character of the data.
In intermix.io, you can see these workflows via our “Query Insights”.
You can double-click into each user to see the underlying query groups and dependencies. As the engineer in charge, that means you can track your worfklows and understand which user, query and table are the cause of any issues.
Let’s go through the single steps of the data flow and the technologies involved to orchestrate the workflows.
S3 is the demarcation point. AWS delivers the CUR into S3. With various data sources next to the CUR, it’s easy for people to put data into an S3 bucket. Loading data into Redshift from S3 is easy and efficient with the COPY command.
Amazon Redshift is the data platform. The workers for ingestion and post-ingestion processing include Lambda and EC2. Intuit uses Lambda wherever possible, as they prefer to not have any persistent compute they need to monitor or care for (patching, restacking, etc.).
Lambda functions can now run for 15 minutes, and for any job that runs under five minutes, the stack uses a lambda function. For larger jobs, they can deploy the same code stack on EC2, e.g. for staging the big CUR.
AWS Step Functions coordinate the Lambda jobs. SNS triggers new worfklows as new data arrives, vs. CloudWatch for scheduling batch jobs. For example, when a new CUR arrives in an S3 bucket processing needs to start right away vs. waiting for a specific time slot. RDS helps to maintain state.
Data consumption happens across three major categories.
Intuit supports new data use cases with Redshift, such as data APIs. Some of the uses cases have a transactional character that may require many small writes.
Instead of trying to turn Redshift into an OLTP, Intuit combines Redshift with PostgreSQL via Amazon RDS. By using dblink you can have your PostgreSQL cake and eat it too. By linking AmazonRedshift with RDS PostgreSQL, the combined feature set can power a broader array of use cases and provide the best solution for each task.
Unlike with “all in one” data warehouses like Oracle or SQL Server, Redshift doesn’t offer system-native workflows. This may be at first intimidating
Instead, AWS takes the approach of providing a broad collection of primitives for low-overhead compute, storage, and development services. Coupled with a rich tool ecosystem for Redshift, you can build a data platform that allows for higher performing, more scalable and lower cost solutions than previously possible.
Overall, the migration rushed Intuit into a new era of data productivity. The platform:
Meanwhile, the new data platform saves Intuit millions of spend on cloud infrastructure, and transforms the decision making process for 8,000+ employees.
With the new platform in place, Intuit is architecting a number of new use cases.
Data Lake Architecture
Long term trends for the CUR data are interesting, but for cost optimization analysts are interested in the most recent data. It makes sense to unload data from the largest tables in Redshift into S3 in Parquet format. That saves cost and increases flexibility by separating storage and compute.
Data Lifecycle Management
Once data is in S3, other (serverless) query engines like Athena or Redshift Spectrum can access it. The main fact tables in the Intuit cluster are based on date – the CUR is a bill. The date serves as the criteria when to unload data. For example, you may only want to keep one-quarter of data within the cluster. By keeping historic data in S3 and using Spectrum to query it, you scale data outside of Redshift but keep retrieval seamless and performant.
In intermix.io, you can filter for Spectrum queries by row count and scan size. You can also track their execution time and queue wait time. In the screenshot below you see those metrics, incl. the uptick Spectrum queries beginning of June .
The cost optimization program has delivered massive benefits. Teams know and predict computing costs in near real time. Deploying ML/AI capabilities against the CUR will allow making even smarter decisions – even 1% of improvement pays huge dividends.
Intuit expects the number of data scientists to go up several-fold, along with it the query volume. These queries patterns are more complex and less predictable. Concurrency Scaling offers an option to add more slots to a cluster to accommodate that incremental query volume, without adding nodes.
It’s a new way of working with data compared with the old, on-premise warehouse. Intuit is now in a position to embed a data services layer into all of Intuit’s products and services.
That was a long post, and I hope it gave you a good peek behind the curtain on how Intuit is building their platform. I hope the post gives you enough information to get started with your data platform.
Now, I’d love to learn from you! Is there anything you can share about your own experience building a data platform? And if you want faster queries for your cloud analytics, and spend less time on Ops and more time on Dev like Intuit, then go ahead and schedule a demo or start a trial for intermix.io.
In broad terms, we see five major categories of data sources:
Dow Jones & Company is a publishing and financial information firm with products and services that help businesses participate in the market better. Examples are Barron’s, Factiva and also the Wall Street Journal. They serve enterprises and consumers alike.
Colleen Camuccio is a VP of Program Management at Dow Jones. In her presentation at AWS re:Invent she talks about Dow Jones’ use of the AWS and Amazon Redshift. Amazon Redshift is at the centre of their stack to turn their data system from a cost center to a revenue generating center.
In this post, we’re providing a summary of how Dow Jones implemented their new data platform with Amazon Redshift at the center of it.
Large companies don’t often start brand new projects from scratch, so why did Dow Jones decide to create a brand new data platform from the ground up?
At Dow Jones, data users faced five problems when working with data.
Users couldn’t get their hands on the data they wanted to get to. With these problems in mind, Colleen and her team saw an opportunity. That opportunity was to use the cloud, and turn data from a cost center into a revenue generating center by creating a brand new, world class data platform.
To plan the architecture, and choose all of the tools involved in creating their data platform, the team created a council of cloud technologists. The council includes experts from inside Dow Jones, industry specialists, and members from AWS to help design the architecture of the new platform.
There are five core AWS technologies that serve as the foundation for the architecture:
These five technologies form the backbone of the Dow Jones data pipeline.
S3 as the data lake
S3 is the staging area to source, standardize and catalog data. The goal is to collect, clean and key every relevant customer event for downstream usage. Data in S3 is transformed into Parquet and normalized for consumption by self-service tools and analytics use cases.
EC2 to pull data into S3
Not all systems Dow Jones works with are able to drop data directly into the platform via e.g. off-the-shelf ETL tools. To solve this issue of data sourcing, EC2 instances pull data from servers, APIs, and 3rd party sources.
EMR and Spark to process data
Amazon EMR is an AWS framework for processing big data workloads. EMR allows you to store data in S3 and run computation in a separate process. EMR provides native support for Apache Spark. The decision when to use Spark vs. Redshift to process data depends on the use case.
Dow Jones uses EMR to process, massage, and transform data, with different S3 buckets for the individual steps and stages.
AWS Glue to organize and partition data
End users access “data marts”, i.e. aggregated data with business rules applied. An example is a “demographic data mart”, where Dow Jones summarizes and exposes single-user profiles (e.g. cleaned for different job titles for the same customer).
To label, organize and partition data for intuitive downstream access from S3, Dow Jones uses AWS Glue.
Amazon Redshift as the analytics platform
At the start of planning the architecture, the decision came down to choosing between using Amazon Athena and Amazon Redshift for the analytics layer. They chose Amazon Redshift, for three reasons.
Amazon Redshift is amazing at aggregating large datasets. But with free-range access to a Redshift cluster, e.g. for dashboards or custom reports, you still need to consider that users end up writing poor SQL.
Consider that Redshift is an analytical database (“OLAP”), and unilke transactional databases (“OLTP”) it doesn’t use indexes. And so SQL statements that include a “SELECT *” can impact query and overall cluster performance. Rather, users should select a specific column.
The data team has approached this problem by recommending best practices to their users when querying smaller data sets.
But users don’t always pay attention to these best practices, which is where our automated, individual query recommendations come to the rescue. With individual query optimization recommendations, you can empower your users to tune their SQL queries.
With the new platform up and running, Dow Jones is enabling the business with new uses cases for data. Three examples of new use cases.
Consumer Publication Dashboard
A custom dashboard that joins clickstream, subscription, membership and demographic data for the Dow Jones consumer publications. With this dashboard, users are able to segment, filter, sort and view who is reading what.
Advertising Performance Dashboard
This dashboard provides analytics and insights into how ads are performing and how users are interacting with those ads. The dashboard joins data sets across eleven different sources doing eleven different things, in one standard format.
Data Visualization with B2B Data
A 360 view of Dow Jones clients in the B2B space, combining clickstream behavioral data with individual customer data.
To power those dashboards, the Redshift cluster hosts over 118TB of data. 100+ users access and query data in Redshift in a self-service model.
With different competing workloads, and 100s of users who write queries, it’s crucial to set up workload management in Redshift.
All the work Dow Jones had put in to create their new data platform was done with the future in mind.
Beyond reporting, Artificial intelligence and predictive analytics are the future of business. Dow Jones, as an industry leader, has to be at the forefront of this change. That is a major reason they’ve prepared this data platform.
When designing the architecture, a key goal was to make their data “AI ready”. Data cleansing and preparation is one of the most challenging and time consuming aspects of data science.
By creating a system that has data cleansing and preparation as a part of the process, they’ve allowed their data scientists to focus on the work that generates results. The work of model building, model training, and model evaluation is where data scientists earn their living, and that is where Dow Jones wants their data scientists to spend their efforts. A key factor here are fast and efficient queries, as that reduces cycle times and increases the volume of iterations for the training models.
AI, machine learning, and predictive analytics is what Dow Jones wants their data platform to do. With Redshift as the aggregatin layer, they’re using Amazon SageMaker to build and train models for predictive analytics.
With the new data platform system in place, Dow Jones is now prepared for the future of data. By using AWS and Redshift, Dow Jones has successfully turned the overflow of data from many different sources from a cost center to a revenue generating one.
Their mass of data from many different sources provides value for their business and customers in the present. For the future, they’re prepared by having a system for organizing and preparing their data for predictive analytics and machine learning.
As Dow Jones began the first steps to creating this data platform, they chose Amazon Redshift as their technological foundation. Some of the key benefits of using a cloud warehouse like Redshift include:
Building your new data platform in the cloud is an obvious choice. The benefits of Amazon Redshift make it an easy pick for teams building new data platforms in the cloud. And geared with our query recommendations, you make sure that your SQL is always tuned to perfection for your data architecture.
If you’re embarking on a similar journey like Dow Jones and have questions about your Redshift deployment or query optimization, chat with us live and we’re happy to help!
In this post, we’ll lay out the 5 major components of Amazon Redshift’s architecture.
Understanding the components and how they work is fundamental for building a data platform with Redshift. In the post, we’ll provide tips and references to best practices for each component.
See 14 real-life examples of data pipelines built with Amazon Redshift
Since launch, Amazon Redshift has found rapid adoption among SMBs and the enterprise. In the early days, business intelligence was the major use case for Redshift.
That has changed.
Today, we still, of course, see companies using BI dashboards like Tableau, Looker and Periscope Data with Redshift. But with rapid adoption, the uses cases for Redshift have evolved beyond reporting. And that has come with a major shift in end-user expectations:
The shift in expectations has implications for the work of the database administrator (“DBA”) or data engineer in charge of running an Amazon Redshift cluster. The static world is gone. Today, data sets have become so large and diverse that data teams have to innovate around how to collect, store, process, analyze and share data.
In the case of Amazon Redshift, much of that depends on understanding the underlying architecture and deployment model. It’s what drives the cost, throughput volume and the efficiency of using Amazon Redshift.
And so in this blog post, we’re taking a closer look at the Amazon Redshift architecture, its components, and how queries flow through those components. We’ll include a few pointers on best practices.
Image 1 shows how Amazon Redshift processes queries across this architecture. We’re excluding Redshift Spectrum in this image as that layer is independent of your Amazon Redshift cluster.
However, we do recommend using Spectrum from the start as an extension into your S3 data lake. We’ll go deeper into the Spectrum architecture further down in this post. Let’s first take a closer look at role of each one of the five components
In other reference architectures for Redshift, you will often hear the term “SQL client application”. And SQL is certainly the lingua franca of data warehousing.
But with the shift away from reporting to new types of use cases, we prefer to use the term “data apps”. Unlike writing plain SQL in an editor, they imply the use of data engineering techniques, i.e. the use of code/software to work with data.
Data apps run workloads or “jobs” on an Amazon Redshift cluster. There are three generic categories of data apps:
What does this mean for the DBA?
The Amazon Redshift architecture is designed to be “greedy”. A query will consume all the resources it can get. To protect workloads from each other, a best practice for Amazon Redshift is to set up workload management (“WLM”). WLM is a key architectural requirement. Setting up your WLM should be a top-level architecture component. We’ve also discussed the pros and cons of turning on automatic WLM in one of our posts.
A “cluster” is the core infrastructure component for Redshift, which executes workloads coming from external data apps. There are two key components in a cluster:
In our experience, most companies run multi-cluster environments, also called a “fleet” of clusters. For example, at intermix.io we run a fleet of ten clusters.
What does this mean for the DBA?
It’s easy to spin up a cluster, pump in data and begin performing advanced analytics in under an hour. That makes it easy to skip some best practices when setting up a new Amazon Redshift cluster. Read more at 3 Things to Avoid When Setting Up an Amazon Redshift Cluster
The leader node has four major roles:
What does this mean for the DBA?
In some cases, the leader node can become a bottleneck for the cluster. The pattern is an increase in your COMMIT queue stats. For example, larger nodes have more metadata, which requires more processing by the leader node. You can Query STL_COMMIT_STATS to determine what portion of a transaction was spent on commit and how much queuing is occurring.
The compute nodes handle all query processing, in parallel execution (“massively parallel processing”, short “MPP”).
Amazon Redshift provides two categories of nodes:
As your workloads grow, you can increase the compute and storage capacity of a cluster by increasing the number of nodes, upgrading the node type, or both.
Compute nodes are also the basis for Amazon Redshift pricing. You can start with hourly on-demand consumption. Prices for on-demand range from $0.25 (dense compute) to $6.80 per hour (dense storage), with discounts of up to 69% for 3-year commitments.
What does this mean for the DBA?
The execution speed of a query depends a lot on how fast Redshift can access and scan data that’s distributed across nodes. A best practice is to choose the right distribution style for your data by defining distribution keys.
Adding nodes is an easy way to add more processing power. It’s also an easy way to address performance issues – by resizing your cluster and adding more nodes. Many Redshift customers run with over-provisioned clusters. Because nodes are the basis for pricing, that can add up over time. And removing nodes is a much harder process. But it’s also the only way to reduce your Redshift cost.
In some cases, it may make sense to shift data into S3. The cost of S3 storage is roughly a tenth of Redshift compute nodes. With Amazon Redshift Spectrum you can query data in Amazon S3 without first loading it into Amazon Redshift.
Image 2 shows what an extended Architecture with Spectrum and query caching looks like. The leader nodes decides:
The leader node includes the corresponding steps for Spectrum into the query plan. The compute nodes in the cluster issue multiple requests to the Amazon Redshift Spectrum layer.
Spectrum scans S3 data, runs projections, filters and aggregates the results. Spectrum sends the final results back to the compute nodes. The compute nodes run any joins with data sitting in the cluster. That way, you can join data sets from S3 with data sets in Amazon Redshift.
Redshift pricing is based on the data volume scanned, at a rate or $5 per terabyte.
What does this mean for the DBA?
Using Redshift Spectrum is a key component for a data lake architecture. Amazon Redshift is the access layer for your data applications. Spectrum is the query processing layer for data accessed from S3. We’ve written more about the detailed architecture in “Amazon Redshift Spectrum: Diving into the Data Lake”
How to get the most out of your Amazon Redshift cluster
In this post, we described the Amazon Redshift’s architecture. We explained how the architecture affects working with data and queries.
If you want to dive deeper into Amazon Redshift and Amazon Redshift Spectrum, register for one of our public training sessions. Each month, we host a free training with live Q&A to answer your most burning questions about Amazon Redshift and building data lakes on Amazon AWS. If you have a burning question about the architecture that you want to answer right now – open this chat window, we’re around to answer your questions!
At intermix.io, we work with companies that build data pipelines and data lakes in the cloud. Some start “cloud-native”, others migrate from on-premise solutions like Oracle or Teradata. What they all have in common though is the one question they ask us at the very beginning:
“How do other companies build their data pipelines?”
And so that’s why we decided to compile and publish a list of publicly available blog posts about how companies build their data pipelines. In those posts, the companies talk in detail about how they’re using data in their business, and how they’ve become “data-centric”.
Here’s the short list of the 14 companies:
11. Dollar Shave Club
And we’re sorry if we missed your post – we’re happy to include it, just fill out this form (take less than a minute). And with that – please meet the 14 examples of data pipelines from the world’s most data-centric companies
Getting data-driven is the main goal for Simple. It’s important for the entire company to have access to data internally. Instead of the analytics and engineering teams to jump from one problem to another, a unified data architecture spreading across all departments in the company allows building a unified way of doing analytics. The main problem then is how to ingest data from multiple sources, process it, store it in a central data warehouse, and present it to staff across the company. Similar to many solutions nowadays, data is ingested from multiple sources into Kafka, before passing it to compute and storage systems. The warehouse of choice is Redshift, selected because of its SQL interfaces, and the ease to process petabytes of data. Finally reports, analytics and visualizations are powered using Periscope Data. In such a way, the data is easily spread across different teams allowing them to make decisions based on data.
Clearbit was a rapidly growing, early-stage startup when it started thinking of expanding its data infrastructure and analytics. After trying out a few out-of-the-box analytics tools (and each of them failed to satisfy the company’s demands) they took building the infrastructure in their own hands. Their efforts converged into a trio of providers: Segment, Redshift, and Mode. Segment is responsible for ingesting all kind of data, combining it, and syncing it daily into a Redshift instance. The main data storage is obviously left to Redshift, with backups into AWS S3. Finally, since Redshift supports SQL, Mode is a perfectly fitted tool which is used to running queries (while using Redshift’s powerful data processing abilities) and creating data insights.
Mode make it easy to explore, visualize and share that data across your organization.
But as data volume grows, that’s when data warehouse performance goes down. With ever increasing calls to your data from analysts, your cloud warehouse becomes the bottleneck. Uncertainty why queries take longer and longer to complete frustrates analysts and engineers alike.
That’s why we’ve built intermix.io, so that Mode users get all the tools they need to optimize their queries running on Amazon Redshift. Here one of our dashboards that shows you how you can track queries from Mode down to the single user:
The whole data architecture at 500px is mainly based on two tools: Redshift – for data storage, and Periscope – for analytics, reporting, and visualization. From a customer-facing side, the company’s web and mobile apps run on top of a few API servers, backed by several databases – mostly MySQL. Data in these DBs is then processed through a Luigi ETL, before storing it to S3 and Redshift. Splunk here does a great job in querying and summarizing text-based logs. Periscope Data is responsible for building data insights and sharing them across different teams in the company. All in all, this infrastructure supports around 60 people distributed across a couple of teams within the company, as of 2015.
The data infrastructure at Netflix is certainly one of the most complex ones, having in mind that they serve over 550 billion events per day, equaling roughly to 1.3 petabytes of data. In general, Netflix’s architecture is broken down into smaller systems, such as systems for data ingestion, analytics, predictive modeling etc. The data stack employed in the core of Netflix is mainly based on Apache Kafka for real-time (sub-minute) processing of events and data. Data needed in the long-term is sent from Kafka to AWS’s S3 and EMR for persistent storage, but also to Redshift, Hive, Snowflake, RDS and other services for storage regarding different sub-systems. Metacat is built to make sure the data platform can interoperate across these data sets as a one “single” data warehouse. Its task is to actually connect different data sources (RDS, Redshift, Hive, Snowflake, Druid) with different compute engines (Spark, Hive, Presto, Pig). Other Kafka outputs lead to a secondary Kafka sub-system, predictive modeling with Apache Spark, and Elasticsearch. Operational metrics don’t flow through the data pipeline but through a separate telemetry system named Atlas.
The tech world has changed dramatically since Yelp was launched back in 2004. Eight years later and Yelp started its change. It transformed from running a huge monolithic application on-premises to one built on microservices running in the AWS cloud. By the end of 2014, there were more than 150 production services running, with over 100 of them owning data. Its main part of the cloud stack is better known as PaaSTA, based on Mesos and Docker, offloading data to warehouses such as Redshift, Salesforce and Marketo. Data enters the pipeline through Kafka, which in turn receives it from multiple different “producer” sources.
Gusto, founded in 2011, is a company that provides a cloud-based payroll, benefits and workers’ compensation solution for businesses. Their business has grown steadily over the years, currently topping to around 60 thousand customers. By early 2015, there was a growing demand within the company for access to data. Up until then, the engineering team and product managers were running their own ad-hoc SQL scripts on production databases. There was obviously a need to build a data-informed culture, both internally and for their customers. When coming to the crossroad to either build a data science or data engineering team, Gusto seems to have done the right choice: first build a data infrastructure which then would support analysts in generating insights and drawing prediction models.
The first step for Gusto was to replicate and pipe all of their major data sources into a single warehouse. The warehouse choice landed on an AWS Redshift cluster, with S3 as underlying data lake. Moving data from production app databases into Redshift was then facilitated with Amazon’s Database Migration Service. On the other side of the pipeline, Looker is used as a BI front-end that teams throughout the company can use to explore data and build core dashboards. Aleph is a shared web-based tool for writing ad-hoc SQL queries. Finally, monitoring (in the form of event tracking) is done by Snowplow, which can easily integrate with Redshift, and as usual, Airflow is used to orchestrate the work through the pipeline.
Building such pipeline massively simplified data access and manipulation across departments. For instance, analysts can simply build their own datasets as part of an Airflow task, and expose it to Looker to use in dashboards and further analyses.
Cloudflare is a web performance and security company that provides online services to protect and accelerate websites online. Online content distribution, web optimization, web security, and analytics are a few examples of the company’s business range.
While different services may require different data stacks to work on, they are all built on top of Cloudflare’s core infrastructure. In the core of their data stack there are Kafka clusters as a streaming platform, and CitusDB as a data warehouse – a scaled up version of PostgreSQL. Data is ingested through Cloudflare’s edge services using HTTP requests, then passed on to Kafka clusters, before getting stored in CitusDB warehouse. A nice example of a service working on top of this infrastructure is the DNS Analysis – a service which processes around 1 million DNS queries per second! The DNS edge service pre-processes and aggregates data, before sending it encrypted to one of Cloudflare’s data centers. Within the data center, data is de-multiplexed and pushed into several Apache Kafka clusters, which in turn pushes data to consumers grouped by Kafka topic. Consumers can store processed information into corresponding DBs which are later queried by the company’s API services and information delivered to customers.
Cloudflare gives their services to millions of websites around the world, processing and storing hundreds of terabytes of data daily. Interestingly, Cloudflare is not a fan of commercial cloud technologies, but they implement their own data centers across the world, in total 152 as of this moment.
Teads is a video advertising marketplace, often ranked as the number 1 video platform in the world. Working with data-heavy videos must be supported by a powerful data infrastructure, but that’s not the end of the story. Teads’ business needs to log user interactions with their videos through the browser (like play, pause, resume, complete…), which count up to 10 million events per day. Another source of data is video auctions (real-time bidding processes) which generate another 60 million events per day. To build their complex data infrastructure, Teads has turned to both Google and Amazon for help.
Originally the data stack at Teads was based on a lambda architecture, using Storm, Spark and Cassandra. This architecture couldn’t scale well, so the company turned toward Google’s BigQuery in 2016. They already had their Kafka clusters on AWS, which was also running some of their ad delivery components, so the company chose a multi-cloud infrastructure. Transferring data between different cloud providers can get expensive and slow. To address the second part of this issue, Teads placed their AWS and GCP clouds as close as possible and connected them with managed VPNs.
So how does their complex multi-cloud data stack look like? Well, first of all, data coming from users’ browsers and data coming from ad auctions is enqueued in Kafka topics in AWS. Then using an inter-cloud link, data is passed over to GCP’s Dataflow which is then well paired with BigQuery in the next step. Having all data in a single warehouse means half of the work is done. The next step would be to deliver data to consumers, and Analytics is one of them. The Analytics service at Teads is a Scala-based app that queries data from the warehouse and stores it to tailored data marts. Interestingly, the data marts are actually AWS Redshift servers. In the final step, data is presented into intra-company dashboards, and the user’s web apps.
You'll get additional resources like:
Remind’s data engineering team strives to provide the whole company with access to the data they need, as big as 10 million daily events, and empower them to directly make decisions. They initially started with Redshift as its source of truth resource for data, and AWS S3 to optimize for cost. While S3 is used for long-term storage of historical data in JSON format, Redshift only stores the most valuable data, not older than 3 months. The company uses Interana to run custom queries on their JSON files on S3, but they’ve also recently started using AWS Athena, as a fully managed Presto system – to query both S3 and Redshift databases. The move for Athena also triggered a change in the data format – from JSON to Parquet, which they say was the hardest step in building up their data platform. An EMR/Hive system is responsible for doing the needed data transformations between S3 and Athena. In the data ingestion part of the story, Remind gathers data through their APIs from both mobile devices and personal computers, as the company business targets schools, parents and students. This data is then passed to a streaming Kinesis Firehose system, before streaming it out to S3 and Redshift.
Remind’s future plans are probably focused on facilitating data format conversions using AWS Glue. This step would allow them to replace EMR/Hive from their architecture and use Spark SQL instead of Athena for diverse ETL tasks.
Robinhood is a stock brokerage application that democratizes access to the financial markets, which enables its customers to buy and sell U.S. listed stocks and ETFs with zero commission. The company debuted with a waiting list of nearly 1 million people, which means they had to pay attention to scale from the very beginning.
Robinhood’s data stack is hosted on AWS, and the core technology they use is ELK (Elasticsearch, Logstash, and Kibana) – a tool for powering search and analytics. Logstash is responsible for collecting, parsing and transforming logs, before passing them on to Elasticsearch, while data is visualized through Kibana. They grew up from a single ELK cluster with a few GBs of data to three clusters with over 15 TBs. Before data goes to ELK clusters, it is buffered in Kafka, as the rates of which documents enter vary significantly between different data sources. Kafka also shields the system from failures and communicates its state with data producers and consumers. As with many other companies, Robinhood uses Airflow to schedule various jobs across the stack, beating competition such as Pinball, Azkaban and Luigi. Robinhood data science team uses Amazon Redshift to help identify possible instances of fraud and money laundering.
Dollar Shave Club (DSC) is a lifestyle brand and e-commerce company that’s revolutionizing the bathroom by inventing smart, affordable products. Don’t be fooled by their name, they have a pretty cool data architecture, for a company in the shaving business. Their business model works with online sales through a subscription service. Currently, they serve around 3 million subscribed customers.
DSC’s web applications, internal services, and data infrastructure are 100% hosted on AWS. A Redshift cluster serves as the central data warehouse, receiving data from various systems. Data movement is facilitated with Apache Kafka and can move in different directions – from production DBs into the warehouse, in between different apps and in between internal pipeline components. There’s also Snowplow which collects data from the web and mobile clients. Once data reaches Redshift, it is accessed through various analytics platforms for monitoring, visualization, and insights. The main tool for the job is, of course, Apache Spark, which is mainly used to build predictive models, such as recommender systems for future sales.
Coursera is an education company that partners with the top universities and organizations in the world to offer online courses. They started building their data architecture somewhere around 2013, as both numbers of users and available courses increased. As of late 2017, Coursera provides courses to 27 million worldwide users.
Coursera collects data from their users through API calls coming from mobile and web apps, their production DBs, and logs gathered from monitoring. A backend service called “eventing” periodically uploads all received events to S3 and continuously publishes events to Kafka. The engineering team has selected Redshift a central warehouse, offering much lower operational cost when compared with Spark or Hadoop at the time.
On the analytics end, the engineering team created an internal web-based query page where people across the company can write SQL queries to the warehouse and get the needed information. Of course, there are analytics dashboard across the company which are refreshed on a daily basis. Finally, many decisions made in Coursera are based on machine learning algorithms, such as A/B testing, course recommendations, understanding student dropouts and others.
Wish is a mobile commerce platform. It provides online services that include media sharing and communication tools, personalized and other content, as well as e-commerce. During the last few years, it grew up to 500 million users, making their data architecture out of date.
The data architecture at Wish, before scaling up, had two different production databases: a MongoDB one storing user data, and a Hive/Presto cluster for logging data. Data engineers had to manually query both to respond to ad-hoc data requests, and this took weeks at some points. Another small pipeline orchestrated by Python crons, also queried both DBs and generated Email reports.
After rethinking their data architecture, Wish decided to build a single warehouse using Redshift. Data from both production DBs flowed through the data pipeline into Redshift. BigQuery is also used for some types of data. It feeds data into secondary tables needed for analytics. Finally, analytics and dashboards are created with Looker.
You'll get additional resources like:
Blinkist transforms the big ideas from the world’s best nonfiction books into powerful little packs users can read or listen to in 15 minutes. At first, they started selling their services through a pretty basic website, and monitored statistics through Google Analytics. Unfortunately, visitor statistics gathered from Google Analytics didn’t match the ones engineers computed. This is one of the reasons why Blinkist decided to move to the AWS cloud.
They choose a central Redshift warehouse where data flow in from user apps, backend and web frontend (for visitors tracking). To get data to Redshift, data is streamed with Kinesis Firehose, also using Amazon Cloudfront, Lambda and Pinpoint. The engineering team at Blinkist is working on a newer pipeline where ingested data comes to Alchemist, before passing it to a central Kinesis system, and onwards to the warehouse.
We hope this post along with its 14 examples gives you the inspiration to build your own data pipelines in the cloud.
If you don’t have any data pipelines yet, it’s time to start building them. Begin with baby steps and focus on spinning up an Amazon Redshift cluster, ingest your first data set and run your first SQL queries.
After that, you can look at expanding by adding a dashboard for data visualization, and schedule a workflow, to build your first true data pipeline. And once data is flowing, it’s time to understand what’s happening in your data pipelines.
That’s why we built intermix.io. We give you a single dashboard to understand when & why data is slow, stuck, or unavailable.
With intermix.io you can:
Our customers have the confidence to handle all the raw data their companies need to be successful. What you get is a real-time analytics platform that collects metrics from your data infrastructure and transforms them into actionable insights about your data pipelines, apps, and users who touch your data.
Setting up intermix.io takes less than 10 minutes, and because you can leverage our intermix.io experts, you can say goodbye to paying for a team of experts with expensive and time-consuming consulting projects. We can help you plan your architecture, build your data lake and cloud warehouse, and verify that you’re doing the right things.
It’s easy – start now by scheduling a call with one our of experts.