Table of Contents
In 2014 Intuit’s then-CTO Tayloe Sainsbury went all in on the cloud and started migrating Intuit’s legacy on-premise IT infrastructure to Amazon AWS. By February 2018, Intuit had sold its largest data center and processed 100% of 2018 tax filings in the cloud.
But now Intuit had a different challenge – optimizing cloud spend and allocating that spend to products and users. AWS bills customers via a “Cost & Usage Report” (“CUR”). Because of the size of its cloud spend, the Intuit CUR comprises billions of rows, and it keeps growing by the day. Intuit switched from an on-premise data warehouse and now uses Amazon Redshift to process 4-5 billion rows of raw CUR data – each day.
In this post, I’m walking you through the approach that Jason Rhoades took to build Intuit’s data pipeline with Redshift. Jason is an Architect at Intuit, and with a small data team, they provide business-critical data to more than 8,000 Intuit employees.
Heads up – this is a long post with lots of detail!
Three major blocks:
Let’s start with an overview of the business.
Intuit builds financial management and compliance products and services for consumers and small businesses. Intuit also provides tax products to accounting professionals.
Products include QuickBooks, TurboTax, Mint and Turbo. These products help customers run their businesses, pay employees, send invoices, manage expenses, track their money, and file income taxes. Across these products, Intuit serves more than 50 million customers.
Intuit started in the 1980s and built the original version of its first product Quicken for the desktop, first MS-DOS and then Windows. With the Internet, that usage shifted to web and mobile. The impact of that change became clear as early as 2010.
Fast forward to today, and over 90% of Intuits customers file their taxes and manage their accounting online and via mobile apps.
Consumption of Intuit products follows seasonal patterns.
Tax seasonality has the biggest impact on Intuit’s business. Each fiscal year, Intuit generates half of its annual revenue in the quarter ending on April 30th, with the US tax filing deadline on April 15th.
Seasonality also has a huge impact on the cost side of the equation. The shift to digital and online usage of Intuit’s products causes a dramatic usage spike for the IT infrastructure. Most users file their taxes online during the last two days of the tax season.
In the old world of on-premise infrastructure, and to handle the concurrent usage, Intuit had to size their data center for peak capacity. After tax season, demand drops back down to average usage. The gap between peak demand and average usage is so large, that 95% of Intuit’s infrastructure would sit idle for 95% of the year.
That’s why Intuit decided in 2014 to go all in with the cloud. With the cloud’s elasticity, Intuit is in a better position to accommodate spikes in customer usage during the tax season.
By shifting to the cloud, Intuit reduced cost by a factor of six because it no longer maintained idle servers for an application only active during tax season. After the first success, Intuit moved more applications, services and enabling tools to the cloud. Today, over 80% of Intuit’s workloads are running in the cloud.
With now growing usage of AWS, the priorities of the program shifted from migration speed to efficient operations and growth.
Intuit now spends $100s of Millions on prepaid AWS services (aka “reserved instances” or short “RIs”) alone, plus fees for on-demand usage during the peaks. Interest grew in understanding the use of different AWS services and spend by different business units and teams within Intuit.
The source for that information sits in the “Cost & Usage Report” (“CUR”), a bill that Amazon AWS delivers to every customer. The CUR includes line items for each unique combination of AWS product, usage type, and operation and of course pricing. The CUR also contains information about credits, refunds and support fees.
Analyzing CUR data supports Intuit’s cloud program with two major use cases:
To build these two use cases, Jason’s team needs to transform the raw CUR data into a format consumable by the business. The raw CUR data comes in a different format from what Intuit uses to charge internal parties, distribute shared costs, amortize RIs and record spend on the general ledger.
The traditional way of Jason’s team to run the analytics on the CUR data was with an on-premise data warehouse.
Unlike other companies, the size of Intuit’s CUR is very large. In 2017, it was around 500M rows at the end of a month.
Amazon delivers the report 3-4x per day to Intuit, and restates the rows with each report over the course of a month, meaning it gets longer with each delivery. Coupled with a growing business, the amount of data the cluster has to process each time grows by the hour – literally.
You can see that trend play out in chart above, with data from 2017. The grey area indicates the batch size for the CUR data. Each day, the batch size gets bigger as the number of rows in the CUR grow. At the end of the month, the CUR reaches about 500 million rows and resets on day one of the new month.
The amount of rows the warehouse processes per minute stays constant at around 1 million rows per minute. Therefore, the time it takes the warehouse to process each batch (“batch duration”) goes up in linear fashion. With 500M rows at the end of the month, it takes the warehouse 500 minutes to process the full report, or 8 hours and 20 minutes.
Now extrapolate forward and calculate what that looks like in the future. With rising cloud spend, the data team realized that the CUR would start to blow up in size. In fact, today the CUR is larger by a factor of 10x with ~5 billion rows. Now we’re talking over 80 hours, almost four days.
Intuit’s situation is a common scenario we see our customers run into: “More data, more workflows, more people”.
For Intuit, it was clear that “keep on doing what we’re doing” was not an option. In a world where data is an asset, data and DevOps teams should focus on the value-creation part of pipelines.
With cloud usage and data volume going up, the old on-prem warehouse was already running into bottlenecks, and so the analytics team followed the business team into the cloud.
The Intuit team followed their product team into the AWS cloud. The major goals included handling the explosion in data volume and adding value to the business.
With on-demand access to computing resources, access to usage data in near real-time is fundamental for Intuit’s business teams. Unlike in the old world, waiting for a report at the end of the month doesn’t work anymore. With the scale of Intuit’s cloud operations, a few hours of freshness have a substantial impact on the company.
Jason migrated the entire stack from Oracle to Redshift, and deployed the same SQL and ETL processes.
Redshift handled the growth in data volume. Three major data points:
You can also see that the size of the grey area has a step change in April – tax season! The change is due to new capabilities Intuit introduced, which tripled the number of rows of the bill (“more data”).
Despite tripling the number of rows, the batch duration stays within a narrow band and doesn’t spike. That’s because batch size and number of rows processed per minute grow at the same rate. In other words, the cluster processes more data faster, i.e. performance goes up as workloads grow.
Let’s dive into how Jason’s team achieved that result.
The cluster architecture and the data pipeline follow the best practices we recommend for setting up your Amazon Redshift cluster. In particular, pay attention to setting up your WLM to separate your different workloads from each other.
You can see the three major workloads in the architecture chart – stage, process and consume.
Among our customers, “ELT” is a standard pattern, i.e. the transformation of data happens in the cluster with SQL. Cloud warehouses like Redshift are both performant and scalable, to the point that data transformation uses cases can be handled much better in-database vs an external processing layer. SQL is concise, declarative, and you can optimize it.
Intuit follows the “ELT” vs. “ETL” approach. With a lot of SQL knowledge on the team, they can build transformations in SQL and run them within the cluster. AWS drops the CUR into an S3 bucket where Intuit extracts the raw data from (the “E”) into the staging area. Intuit leaves the raw data untouched and loads it into the cluster (the “L”), to then transform it (the “T”).
Underneath the processes is an orchestration layer that coordinates workflows and manages dependencies. Some workflows need to execute on an hourly or daily basis, others on arrival of fresh data. Understanding the workflows and their execution is a crucial component for data integrity and meeting your SLAs.
When workflows and data pipelines fail – and they will – you have to a) know about it as it happens and b) understand the root cause of the failure. Otherwise you will run into data integrity issues and miss your SLAs. In Intuit’s case, the key SLA is the near real-time character of the data.
In intermix.io, you can see these workflows via our “Query Insights”.
You can double-click into each user to see the underlying query groups and dependencies. As the engineer in charge, that means you can track your worfklows and understand which user, query and table are the cause of any issues.
Let’s go through the single steps of the data flow and the technologies involved to orchestrate the workflows.
S3 is the demarcation point. AWS delivers the CUR into S3. With various data sources next to the CUR, it’s easy for people to put data into an S3 bucket. Loading data into Redshift from S3 is easy and efficient with the COPY command.
Amazon Redshift is the data platform. The workers for ingestion and post-ingestion processing include Lambda and EC2. Intuit uses Lambda wherever possible, as they prefer to not have any persistent compute they need to monitor or care for (patching, restacking, etc.).
Lambda functions can now run for 15 minutes, and for any job that runs under five minutes, the stack uses a lambda function. For larger jobs, they can deploy the same code stack on EC2, e.g. for staging the big CUR.
AWS Step Functions coordinate the Lambda jobs. SNS triggers new worfklows as new data arrives, vs. CloudWatch for scheduling batch jobs. For example, when a new CUR arrives in an S3 bucket processing needs to start right away vs. waiting for a specific time slot. RDS helps to maintain state.
Data consumption happens across three major categories.
Intuit supports new data use cases with Redshift, such as data APIs. Some of the uses cases have a transactional character that may require many small writes.
Instead of trying to turn Redshift into an OLTP, Intuit combines Redshift with PostgreSQL via Amazon RDS. By using dblink you can have your PostgreSQL cake and eat it too. By linking AmazonRedshift with RDS PostgreSQL, the combined feature set can power a broader array of use cases and provide the best solution for each task.
Unlike with “all in one” data warehouses like Oracle or SQL Server, Redshift doesn’t offer system-native workflows. This may be at first intimidating
Instead, AWS takes the approach of providing a broad collection of primitives for low-overhead compute, storage, and development services. Coupled with a rich tool ecosystem for Redshift, you can build a data platform that allows for higher performing, more scalable and lower cost solutions than previously possible.
Overall, the migration rushed Intuit into a new era of data productivity. The platform:
Meanwhile, the new data platform saves Intuit millions of spend on cloud infrastructure, and transforms the decision making process for 8,000+ employees.
With the new platform in place, Intuit is architecting a number of new use cases.
Data Lake Architecture
Long term trends for the CUR data are interesting, but for cost optimization analysts are interested in the most recent data. It makes sense to unload data from the largest tables in Redshift into S3 in Parquet format. That saves cost and increases flexibility by separating storage and compute.
Data Lifecycle Management
Once data is in S3, other (serverless) query engines like Athena or Redshift Spectrum can access it. The main fact tables in the Intuit cluster are based on date – the CUR is a bill. The date serves as the criteria when to unload data. For example, you may only want to keep one-quarter of data within the cluster. By keeping historic data in S3 and using Spectrum to query it, you scale data outside of Redshift but keep retrieval seamless and performant.
In intermix.io, you can filter for Spectrum queries by row count and scan size. You can also track their execution time and queue wait time. In the screenshot below you see those metrics, incl. the uptick Spectrum queries beginning of June .
The cost optimization program has delivered massive benefits. Teams know and predict computing costs in near real time. Deploying ML/AI capabilities against the CUR will allow making even smarter decisions – even 1% of improvement pays huge dividends.
Intuit expects the number of data scientists to go up several-fold, along with it the query volume. These queries patterns are more complex and less predictable. Concurrency Scaling offers an option to add more slots to a cluster to accommodate that incremental query volume, without adding nodes.
It’s a new way of working with data compared with the old, on-premise warehouse. Intuit is now in a position to embed a data services layer into all of Intuit’s products and services.
That was a long post, and I hope it gave you a good peek behind the curtain on how Intuit is building their platform. I hope the post gives you enough information to get started with your data platform.
Now, I’d love to learn from you! Is there anything you can share about your own experience building a data platform? And if you want faster queries for your cloud analytics, and spend less time on Ops and more time on Dev like Intuit, then go ahead and schedule a demo or start a trial for intermix.io.
In broad terms, we see five major categories of data sources:
Dow Jones & Company is a publishing and financial information firm with products and services that help businesses participate in the market better. Examples are Barron’s, Factiva and also the Wall Street Journal. They serve enterprises and consumers alike.
Colleen Camuccio is a VP of Program Management at Dow Jones. In her presentation at AWS re:Invent she talks about Dow Jones’ use of the AWS and Amazon Redshift. Amazon Redshift is at the centre of their stack to turn their data system from a cost center to a revenue generating center.
In this post, we’re providing a summary of how Dow Jones implemented their new data platform with Amazon Redshift at the center of it.
Large companies don’t often start brand new projects from scratch, so why did Dow Jones decide to create a brand new data platform from the ground up?
At Dow Jones, data users faced five problems when working with data.
Users couldn’t get their hands on the data they wanted to get to. With these problems in mind, Colleen and her team saw an opportunity. That opportunity was to use the cloud, and turn data from a cost center into a revenue generating center by creating a brand new, world class data platform.
To plan the architecture, and choose all of the tools involved in creating their data platform, the team created a council of cloud technologists. The council includes experts from inside Dow Jones, industry specialists, and members from AWS to help design the architecture of the new platform.
There are five core AWS technologies that serve as the foundation for the architecture:
These five technologies form the backbone of the Dow Jones data pipeline.
S3 as the data lake
S3 is the staging area to source, standardize and catalog data. The goal is to collect, clean and key every relevant customer event for downstream usage. Data in S3 is transformed into Parquet and normalized for consumption by self-service tools and analytics use cases.
EC2 to pull data into S3
Not all systems Dow Jones works with are able to drop data directly into the platform via e.g. off-the-shelf ETL tools. To solve this issue of data sourcing, EC2 instances pull data from servers, APIs, and 3rd party sources.
EMR and Spark to process data
Amazon EMR is an AWS framework for processing big data workloads. EMR allows you to store data in S3 and run computation in a separate process. EMR provides native support for Apache Spark. The decision when to use Spark vs. Redshift to process data depends on the use case.
Dow Jones uses EMR to process, massage, and transform data, with different S3 buckets for the individual steps and stages.
AWS Glue to organize and partition data
End users access “data marts”, i.e. aggregated data with business rules applied. An example is a “demographic data mart”, where Dow Jones summarizes and exposes single-user profiles (e.g. cleaned for different job titles for the same customer).
To label, organize and partition data for intuitive downstream access from S3, Dow Jones uses AWS Glue.
Amazon Redshift as the analytics platform
At the start of planning the architecture, the decision came down to choosing between using Amazon Athena and Amazon Redshift for the analytics layer. They chose Amazon Redshift, for three reasons.
Amazon Redshift is amazing at aggregating large datasets. But with free-range access to a Redshift cluster, e.g. for dashboards or custom reports, you still need to consider that users end up writing poor SQL.
Consider that Redshift is an analytical database (“OLAP”), and unilke transactional databases (“OLTP”) it doesn’t use indexes. And so SQL statements that include a “SELECT *” can impact query and overall cluster performance. Rather, users should select a specific column.
The data team has approached this problem by recommending best practices to their users when querying smaller data sets.
But users don’t always pay attention to these best practices, which is where our automated, individual query recommendations come to the rescue. With individual query optimization recommendations, you can empower your users to tune their SQL queries.
With the new platform up and running, Dow Jones is enabling the business with new uses cases for data. Three examples of new use cases.
Consumer Publication Dashboard
A custom dashboard that joins clickstream, subscription, membership and demographic data for the Dow Jones consumer publications. With this dashboard, users are able to segment, filter, sort and view who is reading what.
Advertising Performance Dashboard
This dashboard provides analytics and insights into how ads are performing and how users are interacting with those ads. The dashboard joins data sets across eleven different sources doing eleven different things, in one standard format.
Data Visualization with B2B Data
A 360 view of Dow Jones clients in the B2B space, combining clickstream behavioral data with individual customer data.
To power those dashboards, the Redshift cluster hosts over 118TB of data. 100+ users access and query data in Redshift in a self-service model.
With different competing workloads, and 100s of users who write queries, it’s crucial to set up workload management in Redshift.
All the work Dow Jones had put in to create their new data platform was done with the future in mind.
Beyond reporting, Artificial intelligence and predictive analytics are the future of business. Dow Jones, as an industry leader, has to be at the forefront of this change. That is a major reason they’ve prepared this data platform.
When designing the architecture, a key goal was to make their data “AI ready”. Data cleansing and preparation is one of the most challenging and time consuming aspects of data science.
By creating a system that has data cleansing and preparation as a part of the process, they’ve allowed their data scientists to focus on the work that generates results. The work of model building, model training, and model evaluation is where data scientists earn their living, and that is where Dow Jones wants their data scientists to spend their efforts. A key factor here are fast and efficient queries, as that reduces cycle times and increases the volume of iterations for the training models.
AI, machine learning, and predictive analytics is what Dow Jones wants their data platform to do. With Redshift as the aggregatin layer, they’re using Amazon SageMaker to build and train models for predictive analytics.
With the new data platform system in place, Dow Jones is now prepared for the future of data. By using AWS and Redshift, Dow Jones has successfully turned the overflow of data from many different sources from a cost center to a revenue generating one.
Their mass of data from many different sources provides value for their business and customers in the present. For the future, they’re prepared by having a system for organizing and preparing their data for predictive analytics and machine learning.
As Dow Jones began the first steps to creating this data platform, they chose Amazon Redshift as their technological foundation. Some of the key benefits of using a cloud warehouse like Redshift include:
Building your new data platform in the cloud is an obvious choice. The benefits of Amazon Redshift make it an easy pick for teams building new data platforms in the cloud. And geared with our query recommendations, you make sure that your SQL is always tuned to perfection for your data architecture.
If you’re embarking on a similar journey like Dow Jones and have questions about your Redshift deployment or query optimization, chat with us live and we’re happy to help!