This is a guest blog post by our visiting expert Dieter Matzion. Dieter is a Business Systems Analyst at Intuit. In the post, Dieter explains how his team has built a KPI dashboard to track cloud spend on top of Amazon Redshift. With the dashboard, over 5,000 Intuit employees gain self-service visibility into their utilization of cloud resources. You can follow and connect with Dieter on LinkedIn.
Forbes and Gartner forecast the cloud computing market to grow about 17% in 2020, to aggregate annual revenue of over a quarter trillion dollars. Almost 70% of enterprises are moving business-critical applications to the cloud, according to TechRepublic. Companies using the cloud can focus on their core business instead of the "undifferentiated heavy lifting," as Jeff Bezos calls it.
Cloud has become a competitive advantage that drives innovation with unprecedented speed. Economies of scale allow Intuit to run a 100,000 core data analysis job at any time we choose by simply writing a check. Elasticity allows builders to scale vertically and horizontally at the click of a mouse button. The global presence of the cloud allows enterprises to provide services close to their customers. And shifting from a data center to the cloud allows CFOs to move capital expenditures to operational ones. You can read our cloud migration story in “Intuit’s Journey to Cloud Analytics“.
However, the cloud is very different from a data center, as is evident from the many authors writing about its disadvantages. The procurement model for the cloud is comparable to a cell phone bill that you are expected to pay in full at the end of the month.
Cloud governance is often de-prioritized because of its complexity and a preference for innovation, opening the door to runaway spending. Renting managed services costs more than bare metal machines in the data center. Cost items like data transfer and storage, covered by someone else in the data center, now show up on your business unit's monthly bill. Prepaid services need to be amortized, and your organization may have different financial reporting needs than provided out-of-the-box.
Lifting and shifting from the data center to the cloud inherits all inefficiencies, and right-sizing is a new skill engineers need to learn. Enterprises relinquish central control of security, privacy, and compliance by assuming a shared responsibility model with the cloud provider.
But all is not doom and gloom, and ignoring the cloud gives an advantage to your competitors who are willing to master the new environment. In this article, I am going to share best practices and learnings of how large enterprises overcome these obstacles.
Let’s start with visibility. Managing cost in a data center is slow-paced. Procurement of hardware, racking, and installing software takes months. Budgets are stable and forecasted years in advance. Using the same method in the cloud is like driving a car at top speed – blindfolded – and expecting not to crash.
Unpredictable cost is the top pain point, with lack of visibility a close second, among the 300 C-level and IT decision-makers surveyed by SoftwareONE in 2018.
Fortunately, major cloud providers provide detailed records of how you spent your money. Unfortunately, this insight may come too late, requires training before it becomes usable, and may not align with the financial reporting requirements of your organization.
At Intuit we process 4 billion rows of AWS billing data (the "AWS Cost and Usage Report") every day into Amazon Redshift, which feeds financial reports tailored to different audiences. We layer in amortization of prepaid services and chargebacks to obtain financial numbers that are directly compared to cloud budgets. You can read about the detailed data architecture and processes in the post "Intuit's Journey to Cloud Analytics".
We fill the latency gaps in billing data with data from inventory changes to get near-real-time cost estimates. This allows us to alert business owners of budget risks before runaway spend occurs.
The billing data is further enhanced with performance and efficiency telemetry like compute, storage, database, container, and serverless utilization data. This allows us to surface actionable insights for right-sizing and cost avoidance to executives and engineers alike. Our CTO and CFO review the top opportunities from the largest vertical down to a single engineer on a quarterly basis. This executive commitment cascades down the organization and is key to efficient operation in the cloud.
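A simple way to picture the near-real-time budget alerting described above is a linear run-rate projection. The sketch below is a hypothetical illustration; the function names and the threshold are invented, not Intuit's actual implementation:

```python
from datetime import date
import calendar

def projected_month_end_spend(spend_to_date: float, today: date) -> float:
    """Extrapolate month-to-date spend linearly to the end of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month

def budget_risk(spend_to_date: float, budget: float, today: date) -> bool:
    """Flag a business owner when projected month-end spend exceeds budget."""
    return projected_month_end_spend(spend_to_date, today) > budget

# Halfway through a 30-day month, $60k spent against a $100k budget:
# the projection is $120k, so the owner gets alerted before month-end.
print(budget_risk(60_000, 100_000, date(2020, 4, 15)))  # True
```

In practice the projection would be fed by the near-real-time cost estimates rather than a single month-to-date number, but the trigger logic is the same.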
The lowest-hanging fruit for saving or avoiding cost in the cloud is discounts. This is because a small team can implement them without having to engage a large number of engineers. Enterprise discount agreements typically give you a flat percentage rate depending on a specific commitment. These will vary for each customer, as they are based on the volume of cloud services consumed.
Additionally, most cloud providers offer so-called sustained usage discounts, a percentage discount for a specific service in exchange for a longer-term commitment. Intuit actively manages a $100M portfolio of AWS Reserved Instances. These are prepaid leases for EC2, RDS, Redshift, ElastiCache, Elasticsearch, and DynamoDB. Discounts can range up to 70% depending on the type of commitment. The portfolio is managed centrally using third-party tools like CloudHealth. Unit pricing is calculated hourly and exposed via an API. This allows automated cost decisions, where software chooses the most economic deployment patterns.
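The last point can be sketched in a few lines. The option names and price figures below are hypothetical, but they illustrate how software can pick the most economic deployment once effective unit prices are exposed via an API:

```python
def cheapest_option(unit_prices: dict) -> str:
    """Pick the deployment option with the lowest effective hourly unit price."""
    return min(unit_prices, key=unit_prices.get)

# Hypothetical effective hourly prices after discounts, as a pricing API
# might return them (region/instance-type/commitment):
prices = {
    "us-west-2/m5.xlarge/reserved": 0.087,
    "us-west-2/m5.xlarge/on-demand": 0.192,
    "us-east-1/m5.xlarge/reserved": 0.082,
}
print(cheapest_option(prices))  # us-east-1/m5.xlarge/reserved
```

Deployment tooling can then place new workloads where Reserved Instance coverage makes them cheapest, instead of leaving the choice to each engineer.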
Cloud governance has a heavier lift compared to the previous areas as policies will affect how engineers can use the cloud. You want to establish guardrails instead of gatekeepers to promote the speed of innovation. Each policy is customized based on its purpose and maintained as code to allow real-time enforcement. Management and engineers need to be aware of the policies to innovate efficiently.
Intuit uses a customized fork of Capital One's Cloud Custodian for automated policy enforcement. We maintain several hundred policies in an internal GitHub repository; 40 of them focus on cloud cost. Here are a few examples of policies that deliver the most savings for the least effort.
Take advantage of the cloud's elasticity: resources that are turned off stop incurring cost. So-called cloud parking is automation that decommissions and reprovisions cloud resources on a schedule. Development or testing environments that no one uses after business hours are good candidates. Assuming 12 hours of on-time on weekdays, this saves over 60% wherever the policy can be enabled.
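The 60% figure follows directly from the arithmetic; a quick sketch:

```python
HOURS_PER_WEEK = 7 * 24  # 168

def parking_savings(on_hours_per_weekday: int) -> float:
    """Fraction of compute-hours avoided by parking a resource outside
    business hours on weekdays and for the whole weekend."""
    on_hours = 5 * on_hours_per_weekday
    return 1 - on_hours / HOURS_PER_WEEK

# 12 hours on, weekdays only: the resource runs 60 of 168 weekly hours.
print(f"{parking_savings(12):.0%}")  # 64%
```

In other words, a parked environment runs 60 of 168 weekly hours, saving roughly 64% of its cost, consistent with the "over 60%" figure above.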
Set up red-versus-green policies to restrict certain usages and maximize discounts. For example, if your company is only purchasing Reserved Instances in 2-3 regions, you can opt to terminate new workloads in all other regions to avoid paying an undiscounted price. Other examples are disallowing expensive services, or services for which you don't have discounts. You want to automate exceptions, where business owners can override the policy at any time and get a report of how this affected their budget.
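A minimal sketch of such a red-versus-green policy with an exception path; the region list, the instance shape, and the action names are all hypothetical:

```python
# Regions where Reserved Instance discounts apply (hypothetical list).
APPROVED_REGIONS = {"us-east-1", "us-west-2"}

def evaluate(instance: dict, exceptions: set) -> str:
    """Terminate workloads launched outside approved regions, unless the
    business owner has registered an exception for that instance."""
    if instance["region"] in APPROVED_REGIONS:
        return "allow"
    if instance["id"] in exceptions:
        return "allow-and-report"  # surfaced on the owner's budget report
    return "terminate"

print(evaluate({"id": "i-1", "region": "eu-west-1"}, exceptions=set()))    # terminate
print(evaluate({"id": "i-2", "region": "eu-west-1"}, exceptions={"i-2"}))  # allow-and-report
```

The "allow-and-report" branch is what keeps the policy a guardrail rather than a gatekeeper: the override happens immediately, and the budget impact shows up in reporting afterwards.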
Cloud providers generally have some data lifecycle management capabilities, and you may need to supplement these. A good example is unattached Elastic Block Store (EBS) on AWS. EBS is like a virtual disk drive; you can think of it as a thumb drive. When it is not connected to a computer, it still incurs a cost, as the storage cannot be reused. Intuit uses policy automation to back up unattached EBS volumes and deletes the backups after a month if no one claims them. Engineers with a valid business reason can tag unattached EBS volumes, which the policy then ignores.
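The lifecycle policy described above boils down to a small decision function. This is a hedged sketch: the `keep-reason` tag name and the 30-day retention constant are illustrative, not Intuit's exact policy:

```python
from datetime import date, timedelta

RETENTION_DAYS = 30  # backups of unattached volumes expire after a month

def volume_action(volume: dict) -> str:
    """Back up unattached EBS volumes, but skip volumes tagged with a
    business reason (the hypothetical 'keep-reason' tag)."""
    if volume["attached"]:
        return "ignore"
    if "keep-reason" in volume.get("tags", {}):
        return "ignore"
    return "snapshot-and-delete"

def backup_expired(created: date, today: date) -> bool:
    """True once a backup has outlived the retention window unclaimed."""
    return today - created > timedelta(days=RETENTION_DAYS)

print(volume_action({"attached": False, "tags": {}}))       # snapshot-and-delete
print(backup_expired(date(2020, 1, 1), date(2020, 2, 15)))  # True
```

A real Cloud Custodian policy expresses the same logic declaratively in YAML, with the tag filter and retention period as policy parameters.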
We’d love to learn from you! Is there anything you can share about your own experience building in the cloud? We are always excited to share transformational use cases of Amazon Redshift. And if you want faster queries for your cloud analytics, and spend less time on Ops and more time on Dev like Intuit, then go ahead and schedule a demo or start a trial with intermix.io.
Query optimization that dramatically reduces runtime for queries which use window functions.
A common problem we see our customers solving in SQL is how to compare two rows per group in a dataset. For example, say you have the following data in the product_prices table:
How do you efficiently SELECT the price change at each updated_on time for each Product? I.e. we’d like to see a result set of:
| Product | Most recent updated time | Price change |
This type of problem arises in a lot of domains. Another common example of this problem occurs when you want to compare the time difference between subsequent events. For example, you may have a table that logs timestamps for each page visit of a user to your website and you want to do an analysis on the time difference between visits.
A common but sub-optimal way we see customers solve this problem is by using the ROW_NUMBER() window function together with a self join. The algorithm is straightforward: first select all your product prices and order them within each product by updated_on using the ROW_NUMBER() window function. Then self join on the table, mapping each row to the row preceding it. In SQL:
with ordered_prices as (
  select product, updated_on, price,
         row_number() over (partition by product order by updated_on desc) as row_number
  from product_prices)
select op2.product, op2.updated_on, op2.price - op1.price as price_change
from ordered_prices op1 join ordered_prices op2
  on op1.product = op2.product and op1.row_number = op2.row_number + 1
This query produces the desired result set, but at the cost of a join. For real-world (i.e. less contrived) examples, solutions like this also often need to incorporate additional sub-selects or other expensive operations which compound the effect of the join and can make the query run excruciatingly slow.
This class of problem can manifest itself in other ways as well: sometimes instead of a self join, the query may just select from ordered_prices where row_number=1 to get the most recent updated_on and price. But the general pattern is the same: using ROW_NUMBER() to label rows for further processing/joining.
If you’re using ROW_NUMBER() in a query just to apply filters or joins later on in your transformation, there may be a better way. Rather than co-opting ROW_NUMBER() to compare neighboring rows, a better solution is to apply a different window function that was designed to solve this problem: LAG(). The LAG window function returns the value from a row offset from the current row. The resulting SQL becomes:
select product, updated_on,
       lag(price) over (partition by product order by updated_on desc) - price as price_change
from product_prices
This SQL says that we should order the rows within each product by updated_on, and take the difference of the price from the current row with the price of the following row within that group. This eliminates the need for the join and can greatly simplify the query plan. One note: this result set will include NULL rows at the start of each product (i.e. where there is no previous row), but these can be trivially eliminated with a WHERE clause of price_change is not null when the results are used (or this query can be wrapped in a subselect to remove those rows).
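You can try the LAG approach end to end on a tiny invented product_prices table. The sketch below uses SQLite (which supports window functions since version 3.25) in place of Redshift, with made-up sample data:

```python
import sqlite3  # SQLite 3.25+ supports window functions

conn = sqlite3.connect(":memory:")
conn.execute("create table product_prices (product text, updated_on text, price real)")
conn.executemany("insert into product_prices values (?, ?, ?)", [
    ("widget", "2020-01-01", 10.0),
    ("widget", "2020-01-02", 12.0),
    ("widget", "2020-01-03", 11.0),
])

# LAG version: one price change per update, no self join needed.
rows = conn.execute("""
    select product, updated_on,
           lag(price) over (partition by product order by updated_on desc) - price
             as price_change
    from product_prices
    order by updated_on desc
""").fetchall()
print(rows)
# The most recent row has no more-recent neighbor, so its change is NULL (None).
```

Swapping the query body for the ROW_NUMBER self-join variant returns the same non-NULL changes, which is an easy way to convince yourself the two forms are equivalent before rewriting a production query.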
The seed for this blog post was planted when a potential customer came to us with the same question we hear all the time: “Why are our queries slow?” After instrumenting their cluster with intermix.io they were able to quickly identify their worst queries and find out why they are slow and what to do about it. In their case, one of the worst offending queries used this ROW_NUMBER() pattern and triggered a large number of Query Recommendations in their intermix.io Dashboard, including:
Since the query exhibited the ROW_NUMBER() pattern above, it caused multiple unnecessary table scans, an expensive join, significant IO skew, and a large amount of intermediate data to be written to disk during execution. The query operated over a single table which had 876 million rows–not a huge table, but large enough to bring Amazon Redshift to its knees if not optimized properly–and was taking over 5.5 hours to execute!
After reviewing our Recommendations, they reimplemented their query using the LAG function to eliminate the join (which greatly simplified their query by removing two additional sub-selects), and the query execution time dropped to 30 seconds. From 5.5 hours to 30 seconds simply by reviewing targeted Query Recommendations and implementing a straightforward solution–talk about a good ROI on time spent optimizing SQL. Their exact feedback after seeing the new query runtimes? “OMFG. takes 30s. Wow”
Amazon Redshift is a fantastic general-purpose data warehouse. But since it is so easy to add data and new workloads to your cluster, it can be very hard to know where and how to start optimizing SQL queries.
Analysts and data engineers often wonder which queries are the worst offenders and where limited SQL tuning effort should go to give the biggest bang-for-the-buck. Knowing where to put query tuning effort is crucial since seemingly simple-looking queries can sometimes have inefficiencies lurking which can compound to slow a query by several orders of magnitude.
The good news: this is the problem that intermix.io Query Recommendations was designed to solve. We help you pinpoint the root cause of query slowness so that you can spend your limited time optimizing the right queries in the right way.
Data is a valuable resource powering analytics, predictive models, and decision-making. For a company to make data-driven decisions, it first must build its data infrastructure. A data warehouse plays a central role in such an infrastructure.
Data warehouses are data storage and processing systems that aggregate data from different sources into a single place. With data in one place, you can combine and query it across the different sources. That includes data coming from users interacting through web and mobile, background system logs, or third party data.
Amazon Redshift is a cloud warehouse solution by Amazon Web Services (AWS). Since AWS first introduced Redshift in 2012, it has gotten everyone's attention for its amazing performance and low price point. In the following years, there were massive improvements from an operational and data pipeline perspective. Today, it's the market leader among cloud warehouses.
So how does Redshift work, and what's been driving its adoption? There are two basic components you need to understand about Amazon Redshift: its technology and its pricing.
Neither component is new on its own. Other warehouses use the same technology, and there are even open-source data warehouses that are free to use. What sets Redshift apart is the combination of the two, and the simplicity it offers for getting started with a data warehouse.
In this post, I’ll explain these two components. Before that, it’s helpful to understand basic nomenclature and key concepts. We’ll start with the key concepts.
If you've ever googled "Redshift," you must have read the following: "Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud."
Let’s break down what this means, and explain a few other key concepts that are helpful for context on how Redshift operates.
Fully managed. AWS takes care of things like warehouse setup, operation and redundancy, as well as scaling and security. All this is automated in the background, so the client has a smooth experience.
Warehouse service. Redshift is a cloud service, the customer does not own the physical hardware of the warehouse, but can use it through a subscription as a service. This has dramatic impact on the procurement model for customers.
Getting started with Redshift is easy. Rather than buying and installing hardware, customers can spin up a Redshift warehouse, upload data, and run queries in less than 15 minutes. And since AWS manages a fleet of tens of thousands of Redshift clusters, customers benefit from automation capabilities that would not make economic sense for any individual on-premise DBA.
Petabyte. There are fifteen zeros in a petabyte, 1,000x bigger than a terabyte. To illustrate, this is equivalent to 13.3 years of HD video. Compare that with traditional on-premise data warehouses that operate in the terabyte range. Going to petabytes has dramatic impact on the analytical capabilities.
Many companies today already generate a terabyte of data per day. In the old on-premise world, storing and analyzing that much data would have been cost-prohibitive, leading to the famous “analysis gap” or “dark data”. They may collect data but don’t analyze it. Redshift solves that problem.
Cluster. A cluster is a collection of nodes which perform the actual storing and processing of data. Each cluster runs an Amazon Redshift engine distributed and replicated across all nodes.
Nodes. AWS offers four different node types for Redshift. Each node type comes with a combination of computing resources (CPU, memory, storage, and I/O).
Databases. A cluster can have one or more databases. A database contains one or more named schemas, which in turn contain a set of tables and other objects.
Workloads. You’ll often hear the term “workloads”, which implies “running queries” on your Redshift cluster. There are three generic types of workloads.
Understanding and distinguishing the three generic workloads from each other is an important concept for any data warehouse. Different Workloads use different amounts of cluster resources, and it makes total sense to be able to keep disparate workloads separate, to avoid resource contention. Redshift manages them with the “Workload Manager”. We explain workloads in detail in our post “4 Simple Steps to Workload Management”.
With the key concept out of the way, we can dive into the two components, technology and pricing. We’ll see each one of them factors in, but it’s the combination that makes the magic happen and allows Redshift to reach state-of-the-art performance at a low price.
One of the key concepts behind Redshift’s exceptional performance is its columnar data storage.
Columnar storage is a storage layout for relational database management systems optimized for read-heavy workloads. Most of the queries you run on your warehouse are scans powering analytics, dashboards, or machine learning models.
A common term for these types of queries is “OLAP” – “OnLine Analytical Processing”. As the name suggests, OLAP databases are good for analytics. OLAP queries tend to be complex and process a lot of data, touching many different tables. In fact, Redshift is an “OLAP” database engine.
Compare this with row-oriented systems, which are a better option for write-heavy tasks. It's a different data modeling approach used for business applications that process a company's transactions. You'll hear the term "production database," or "online transaction processing" ("OLTP"). Most OLTP databases handle large numbers of small queries and a high percentage of write activity. In general, OLTP systems load small amounts of data with high frequency.
A Simple Example
For a simple example, let’s take a website like Glassdoor, with information on companies, their salary levels and reviews by their employees.
You create an account on Glassdoor. In your profile you submit the company you work for and your salary. That transaction creates a record in Glassdoor’s production database. That same production database also holds records for all other user profiles.
Let’s say Glassdoor creates a new marketing report with the average salary across different companies. They may even break that down by level or geography and show a historical trend. For that information, they need to create a query that selects, filters and aggregates the records for a corresponding company. It’s a much larger query compared to the initial write query for each individual user in Glassdoor’s system.
Comparing row-oriented vs. column-oriented formats
Now let’s see how that query plays out in row-based vs. a columnar-oriented format. The following table illustrates the difference between the two approaches.
You can see that in the column-oriented layout the data is flipped around: each row now represents a column. So why does columnar storage work better for a warehouse?
Analytical queries execute fast on tables in columnar format. Let’s take for example a query “give me the average salary per company”. This query only touches the “salary” and “company” columns, groups by company and then calculates the averages.
Compare that to running the same query on a row-oriented database. The query needs to sift through the entire production database. To return a result, it has to read every single row. It runs much longer and is inefficient because it’s scanning the entire database.
There's an additional catch here. If a query scans fewer rows, it reads less data. As a result, the data may fit into RAM and can be processed in memory. Memory is much faster than disk. Even better, since the data is stored column-wise, Redshift applies different encodings to text and numeric columns, compressing the data more efficiently.
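A toy illustration of why the layout matters: with columnar data, the average-salary query touches only the two lists it needs and never reads the other columns. The data below is invented for the Glassdoor-style example above:

```python
# Row-oriented: answering any query means reading whole rows, all columns.
rows = [
    {"user": "ann", "company": "Acme", "salary": 90_000, "review": "..."},
    {"user": "bob", "company": "Acme", "salary": 110_000, "review": "..."},
    {"user": "cat", "company": "Globex", "salary": 80_000, "review": "..."},
]

# Column-oriented: the same data "flipped around". The salary query only
# needs these two columns; user names and reviews are never read.
columns = {
    "company": ["Acme", "Acme", "Globex"],
    "salary": [90_000, 110_000, 80_000],
}

def avg_salary_by_company(company_col, salary_col):
    totals, counts = {}, {}
    for company, salary in zip(company_col, salary_col):
        totals[company] = totals.get(company, 0) + salary
        counts[company] = counts.get(company, 0) + 1
    return {c: totals[c] / counts[c] for c in totals}

print(avg_salary_by_company(columns["company"], columns["salary"]))
# {'Acme': 100000.0, 'Globex': 80000.0}
```

On three rows the difference is invisible, but on billions of rows skipping every column except two is what keeps analytical scans fast and compressible.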
Columnar format is nothing new. On-premise warehouses like Oracle, Teradata and Netezza use it as well. The difference here is that they require hardware, installation and maintenance. It can take weeks and months to get running. Modern cloud warehouses like Redshift deliver the database as a service, with a cluster up and running in less than 15 minutes.
Redshift also pairs columnar storage with scaling out the number of nodes, which get us to “MPP” – massively parallel processing.
MPP is the process of coordinated, simultaneous computation of data across multiple nodes in the cluster. Each node uses its own operating system and memory, which is also known as a "loosely coupled" system.
An alternative is "tightly coupled," or symmetric multiprocessing (SMP), where separate processors share a single operating system and memory. For warehouse workloads, MPP has been shown to deliver better results.
MPP Distributes Compute Loads
The Redshift architecture uses MPP and consists of a leader node and compute nodes. The leader node distributes rows of a table across the different nodes which independently store and process data and queries. How you distribute data across the nodes depends on your distribution key.
The leader node coordinates the execution of a query across the different nodes. This applies for all types of workloads. Once done, the leader node combines the results from each node, to return the final result of the query.
This process has many advantages. By adding nodes, you add more storage, memory and CPU. With each node, a cluster’s processing capacity goes up and scales in a linear fashion. That means that a 10-node cluster processes the same query about twice as fast as a 5-node cluster.
This concept applies to all three types of generic workloads (loads, transforms and ad-hoc queries). For example, loading flat files into Redshift is a very efficient process and also takes advantage of parallel processing. The leader node spreads the workload across the nodes while reading from multiple files. Loading also scales linearly as you add more nodes to your cluster.
Elasticity – How Redshift Scales
Because of MPP's simultaneous processing, as you add more nodes to your cluster, you load data faster, execute transforms faster, and return the results of ad-hoc queries faster. And Redshift takes advantage of the elasticity of the cloud: adding nodes is a simple process that takes a few clicks in the AWS console.
We describe Redshift's linear scalability in our use case story "Intuit's Journey to Cloud Analytics". The key to linear scaling is the correct configuration of your cluster. Just like with any other database, if you don't configure it right, you won't enjoy the performance benefits. You can read more in our post "How we configure Amazon Redshift for Performance".
With traditional MPP databases, storage and compute are coupled together. That means if you add more nodes, you add storage and compute at the same rate. It also means you’re allocating resources for peak-consumption. That’s inefficient, as you end up in a situation where you either have too much compute or too much storage.
There’s a solution to that, by using a data lake architecture.
Let's look at how pricing fits into the equation. We're starting off with data lakes and Redshift Spectrum. There's a technology component here as well, but it's more about the economics and the cost of running queries.
Companies keep their most recent and "hot" data in a warehouse, like Redshift. It's the data that's closest to the current business.
What happens with the “rest” of it, like historical or “unused” data, e.g. columns and tables that nobody queries? You don’t want to delete that data, but also not pay the premium for keeping it in your cluster.
The answer is to store it in your data lake. Data lakes are cheap long-term storage for structured and unstructured data. In AWS, that product is Amazon S3. The cost for storing data in S3 is about one-tenth of storing it in a Redshift cluster.
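As a back-of-the-envelope comparison, take the 3-year-commitment Dense Storage price quoted in the pricing section later in this post ($934.20/TB/year) and apply the rough one-tenth ratio:

```python
REDSHIFT_USD_PER_TB_YEAR = 934.20  # 3-year-commitment Dense Storage price
S3_TO_CLUSTER_RATIO = 0.1          # "about one-tenth", a rough figure

def yearly_storage_cost(tb: float, in_s3: bool) -> float:
    """Estimated yearly storage cost in USD, in the cluster or in S3."""
    rate = REDSHIFT_USD_PER_TB_YEAR * (S3_TO_CLUSTER_RATIO if in_s3 else 1.0)
    return round(tb * rate, 2)

# Moving 100 TB of cold data out of the cluster and into S3:
print(yearly_storage_cost(100, in_s3=False))  # 93420.0
print(yearly_storage_cost(100, in_s3=True))   # 9342.0
```

The exact S3 rate depends on storage class and region, so treat this as an order-of-magnitude estimate rather than a quote.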
Storing data in S3 implies frequent communication between S3 and Redshift if you still want to query it. One approach is to build up an Extract-Transform-Load (ETL) pipeline and load the data into the cluster, and then unload it again once analysis is complete. That’s complex and not very efficient.
Decoupling Storage from Compute
The other approach is to query data in S3 and join it with data in the cluster, via a feature called “Redshift Spectrum”. Spectrum is a “serverless query engine”. “Serverless” means there isn’t any infrastructure to set up or manage.
Point to your data in S3, define the schema, and you can start querying using standard SQL queries. Redshift Spectrum runs queries directly on S3, as if they were normal Redshift tables. By using S3 and Redshift Spectrum you're separating storage from compute for your cluster. To store and process more data, there's no need to add more nodes. Instead, you store it in S3 and use Redshift Spectrum to join and query it.
This is a huge benefit, as you store your data at a lower cost, but are still able to query in your data warehouse. Separating compute and storage enables customers to scale resources on an as-needed basis, rather than pre-allocating resources for peak consumption.
Redshift Spectrum queries do run slower, but the cost trade-off is worth it, increasing the ROI on your data investment.
And that takes us to pricing.
Amazon Redshift is priced by node type and the number of nodes you're running. There are two categories of nodes: Dense Compute (dc2) and Dense Storage (ds2).
AWS offers two billing models for Redshift. On-demand, or reserved instances. Users can choose their desired model depending on their annual usage plans. The more upfront you can plan your usage, the more you save.
The lowest pricing in the market
Your lowest starting price with Redshift is $1,380 per year for one node of dc2 with a 1-year commitment. Looking at it from a storage perspective, the lowest price comes down to $934.20/TB/Year, with a 3-year commitment for any Dense Storage cluster.
For a quick rule of thumb on a 3-year commitment: 4-5 TB is sufficient for most medium-sized companies, while enterprises may go up to about 80-100 TB. Very large enterprises that collect massive amounts of data every day do indeed reach a petabyte. A good example is Nasdaq, which stores trading data in Redshift. But you can get started with as little as a few hundred gigabytes, and not pay through the nose.
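Those list prices are easier to compare once reduced to hourly and monthly figures; a quick sketch using the numbers quoted above:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760

def effective_hourly_rate(annual_usd: float) -> float:
    """Annual commitment price expressed as an effective hourly rate."""
    return round(annual_usd / HOURS_PER_YEAR, 3)

def monthly_per_tb(annual_per_tb_usd: float) -> float:
    """Per-TB annual price expressed per month."""
    return round(annual_per_tb_usd / 12, 2)

print(effective_hourly_rate(1380))  # 0.158  -- ~16 cents/hour for the entry node
print(monthly_per_tb(934.20))       # 77.85  -- USD per TB per month
```

In other words, the entry-level commitment works out to roughly 16 cents per hour, which is what makes Redshift approachable for small teams as well as enterprises.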
Overall, these price points are lower than on-premise warehouses by a factor of about 20x, which makes Redshift so attractive for adoption.
This post introduced Amazon Redshift to help you get started. We covered the basic concepts behind the cluster and described its most important features.
You could summarize it by saying that Redshift is simple, fast, and cheap. It's no wonder that it has found broad adoption in the SMB market and the enterprise alike.
We at intermix.io help organizations overcome their analytics problems by giving them a unified view to track data sources, queries, and user activity for their cluster, along with a historical timeline.
Unlike simple monitoring tools, we give you query optimization recommendations and a place to collaborate to improve cluster and query performance.
If you’re suffering from slow queries, dashboards or ETL pipelines, schedule a demo with us or start a free trial. Coupled with our Expert Services for Redshift, you’ll spend less time fighting fires and more time building.
In 2014 Intuit’s then-CTO Tayloe Sainsbury went all in on the cloud and started migrating Intuit’s legacy on-premise IT infrastructure to Amazon AWS. By February 2018, Intuit had sold its largest data center and processed 100% of 2018 tax filings in the cloud.
But now Intuit had a different challenge – optimizing cloud spend and allocating that spend to products and users. AWS bills customers via a “Cost & Usage Report” (“CUR”). Because of the size of its cloud spend, the Intuit CUR comprises billions of rows, and it keeps growing by the day. Intuit switched from an on-premise data warehouse and now uses Amazon Redshift to process 4-5 billion rows of raw CUR data – each day.
In this post, I’m walking you through the approach that Jason Rhoades took to build Intuit’s data pipeline with Redshift. Jason is an Architect at Intuit, and with a small data team, they provide business-critical data to more than 8,000 Intuit employees.
Heads up – this is a long post with lots of detail!
Three major blocks:
Let’s start with an overview of the business.
Intuit builds financial management and compliance products and services for consumers and small businesses. Intuit also provides tax products to accounting professionals.
Products include QuickBooks, TurboTax, Mint and Turbo. These products help customers run their businesses, pay employees, send invoices, manage expenses, track their money, and file income taxes. Across these products, Intuit serves more than 50 million customers.
Intuit started in the 1980s and built the original version of its first product Quicken for the desktop, first MS-DOS and then Windows. With the Internet, that usage shifted to web and mobile. The impact of that change became clear as early as 2010.
Fast forward to today, and over 90% of Intuit's customers file their taxes and manage their accounting online and via mobile apps.
Consumption of Intuit products follows seasonal patterns.
Tax seasonality has the biggest impact on Intuit’s business. Each fiscal year, Intuit generates half of its annual revenue in the quarter ending on April 30th, with the US tax filing deadline on April 15th.
Seasonality also has a huge impact on the cost side of the equation. The shift to digital and online usage of Intuit’s products causes a dramatic usage spike for the IT infrastructure. Most users file their taxes online during the last two days of the tax season.
In the old world of on-premise infrastructure, and to handle the concurrent usage, Intuit had to size their data center for peak capacity. After tax season, demand drops back down to average usage. The gap between peak demand and average usage is so large, that 95% of Intuit’s infrastructure would sit idle for 95% of the year.
That’s why Intuit decided in 2014 to go all in with the cloud. With the cloud’s elasticity, Intuit is in a better position to accommodate spikes in customer usage during the tax season.
By shifting to the cloud, Intuit reduced cost by a factor of six because it no longer maintained idle servers for an application only active during tax season. After the first success, Intuit moved more applications, services and enabling tools to the cloud. Today, over 80% of Intuit’s workloads are running in the cloud.
With AWS usage now growing, the priorities of the program shifted from migration speed to efficient operations and growth.
Intuit now spends hundreds of millions of dollars on prepaid AWS services (known as "reserved instances", or "RIs" for short) alone, plus fees for on-demand usage during the peaks. Interest grew in understanding how different business units and teams within Intuit use and spend on the various AWS services.
The source for that information sits in the "Cost & Usage Report" ("CUR"), a bill that Amazon AWS delivers to every customer. The CUR includes line items for each unique combination of AWS product, usage type, and operation, along with pricing. The CUR also contains information about credits, refunds and support fees.
Analyzing CUR data supports Intuit’s cloud program with two major use cases:
To build these two use cases, Jason’s team needs to transform the raw CUR data into a format consumable by the business. The raw CUR data comes in a different format from what Intuit uses to charge internal parties, distribute shared costs, amortize RIs and record spend on the general ledger.
Jason's team traditionally ran the analytics on the CUR data in an on-premise data warehouse.
Unlike most companies', Intuit's CUR is very large. In 2017, it reached around 500M rows by the end of a month.
Amazon delivers the report 3-4 times per day to Intuit and restates the rows with each delivery over the course of a month, meaning the report gets longer each time. Coupled with a growing business, the amount of data the cluster has to process grows by the hour – literally.
You can see that trend play out in the chart above, with data from 2017. The grey area indicates the batch size for the CUR data. Each day, the batch size gets bigger as the number of rows in the CUR grows. At the end of the month, the CUR reaches about 500 million rows, then resets on day one of the new month.
The number of rows the warehouse processes per minute stays constant at around 1 million. Therefore, the time it takes the warehouse to process each batch (the "batch duration") grows linearly. With 500M rows at the end of the month, it takes the warehouse 500 minutes to process the full report, or 8 hours and 20 minutes.
Now extrapolate forward and calculate what that looks like in the future. With rising cloud spend, the data team realized that the CUR would keep ballooning in size. In fact, today the CUR is larger by a factor of 10x, at ~5 billion rows. At the same throughput, we're talking over 80 hours, almost four days.
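The arithmetic behind those numbers is a simple back-of-the-envelope calculation: rows in the batch divided by a fixed throughput. Here it is as a small sketch, using the figures quoted above:

```python
# Back-of-the-envelope batch duration at a fixed per-minute throughput.
# The ~1M rows/minute figure and the row counts come from the text above.

def batch_duration_minutes(rows: int, rows_per_minute: int = 1_000_000) -> float:
    """Time (in minutes) to process one CUR batch at a constant throughput."""
    return rows / rows_per_minute

# End-of-month CUR in 2017: ~500M rows -> 500 minutes, i.e. 8h20m.
print(batch_duration_minutes(500_000_000))    # 500.0
# Today's CUR: ~5B rows -> 5,000 minutes, i.e. over 83 hours.
print(batch_duration_minutes(5_000_000_000))  # 5000.0
```

The model is linear by design, which is exactly why a 10x larger bill turns an overnight batch into a multi-day one.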
Intuit’s situation is a common scenario we see our customers run into: “More data, more workflows, more people”.
For Intuit, it was clear that “keep on doing what we’re doing” was not an option. In a world where data is an asset, data and DevOps teams should focus on the value-creation part of pipelines.
With cloud usage and data volume going up, the old on-prem warehouse was already running into bottlenecks, and so the analytics team followed the product team into the AWS cloud. The major goals included handling the explosion in data volume and adding value to the business.
With on-demand access to computing resources, access to usage data in near real-time is fundamental for Intuit’s business teams. Unlike in the old world, waiting for a report at the end of the month doesn’t work anymore. With the scale of Intuit’s cloud operations, a few hours of freshness have a substantial impact on the company.
Jason migrated the entire stack from Oracle to Redshift, and deployed the same SQL and ETL processes.
Redshift handled the growth in data volume. Three major data points:
You can also see that the size of the grey area has a step change in April – tax season! The change is due to new capabilities Intuit introduced, which tripled the number of rows of the bill (“more data”).
Despite tripling the number of rows, the batch duration stays within a narrow band and doesn’t spike. That’s because batch size and number of rows processed per minute grow at the same rate. In other words, the cluster processes more data faster, i.e. performance goes up as workloads grow.
Let’s dive into how Jason’s team achieved that result.
The cluster architecture and the data pipeline follow the best practices we recommend for setting up your Amazon Redshift cluster. In particular, pay attention to setting up your WLM to separate your different workloads from each other.
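To make workload separation concrete, here is a minimal sketch of a manual WLM configuration with one queue per workload. The queue names mirror the three workloads named below (stage, process, consume), but the slot counts and memory splits are illustrative assumptions, not Intuit's actual settings:

```python
import json

# A minimal manual-WLM sketch: one queue per workload so that heavy ETL
# batches can't starve interactive consumers. Numbers are illustrative.
wlm_config = [
    {"query_group": ["stage"],   "query_concurrency": 2,  "memory_percent_to_use": 20},
    {"query_group": ["process"], "query_concurrency": 3,  "memory_percent_to_use": 50},
    {"query_group": ["consume"], "query_concurrency": 10, "memory_percent_to_use": 25},
    {"query_concurrency": 5},  # default queue for everything else
]

# This JSON string is what you would set as the cluster parameter
# `wlm_json_configuration` (via the console, CLI, or an API call).
print(json.dumps(wlm_config))
```

Queries then opt into a queue with `SET query_group TO 'process';` before running, so each workload lands in its own slots.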
You can see the three major workloads in the architecture chart – stage, process and consume.
Among our customers, "ELT" is a standard pattern, i.e. the transformation of data happens in the cluster with SQL. Cloud warehouses like Redshift are both performant and scalable, to the point that data transformation use cases can be handled much better in-database than in an external processing layer. SQL is concise, declarative, and easy to optimize.
Intuit follows the “ELT” vs. “ETL” approach. With a lot of SQL knowledge on the team, they can build transformations in SQL and run them within the cluster. AWS drops the CUR into an S3 bucket where Intuit extracts the raw data from (the “E”) into the staging area. Intuit leaves the raw data untouched and loads it into the cluster (the “L”), to then transform it (the “T”).
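The "T" step can be sketched as set-based SQL that runs entirely inside the cluster. The table and column names below are hypothetical stand-ins, not Intuit's schema; the pattern is an idempotent delete-and-insert from the untouched raw staging table into a business-facing table:

```python
# A sketch of an in-cluster transformation ("T" of ELT): aggregate raw CUR
# line items into a business-facing spend table. All names are hypothetical.
transform_sql = """
BEGIN;

-- Restate the current month, since AWS restates CUR rows with each delivery.
DELETE FROM cur_monthly_spend
 WHERE bill_month = DATE_TRUNC('month', CURRENT_DATE);

INSERT INTO cur_monthly_spend (bill_month, business_unit, product, cost)
SELECT DATE_TRUNC('month', usage_start_date),
       resource_tag_business_unit,
       product_name,
       SUM(unblended_cost)
  FROM stage_cur_raw
 WHERE usage_start_date >= DATE_TRUNC('month', CURRENT_DATE)
 GROUP BY 1, 2, 3;

COMMIT;
"""
print(transform_sql.strip())
```

Because the CUR is restated with every delivery, deleting and rebuilding the current month keeps the transform safe to re-run on each batch.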
Underneath the processes is an orchestration layer that coordinates workflows and manages dependencies. Some workflows need to execute on an hourly or daily basis, others on arrival of fresh data. Understanding the workflows and their execution is a crucial component for data integrity and meeting your SLAs.
When workflows and data pipelines fail – and they will – you have to a) know about it as it happens and b) understand the root cause of the failure. Otherwise you will run into data integrity issues and miss your SLAs. In Intuit’s case, the key SLA is the near real-time character of the data.
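A near-real-time SLA like this boils down to a freshness check: how old is the newest loaded data? Here is a minimal sketch; in practice the "latest load" timestamp would come from the warehouse itself (e.g. an ETL audit table), and the six-hour threshold is an assumed example, not Intuit's actual SLA:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def is_within_sla(latest_load: datetime, max_staleness: timedelta,
                  now: Optional[datetime] = None) -> bool:
    """True if the most recent data load is fresh enough to meet the SLA."""
    now = now or datetime.now(timezone.utc)
    return now - latest_load <= max_staleness

now = datetime(2020, 4, 15, 12, 0, tzinfo=timezone.utc)
fresh = datetime(2020, 4, 15, 10, 0, tzinfo=timezone.utc)  # 2 hours old
stale = datetime(2020, 4, 14, 12, 0, tzinfo=timezone.utc)  # 24 hours old
print(is_within_sla(fresh, timedelta(hours=6), now))  # True
print(is_within_sla(stale, timedelta(hours=6), now))  # False
```

Wiring a check like this into alerting is what turns "the pipeline failed" from a discovery during a business review into a page as it happens.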
In intermix.io, you can see these workflows via our “Query Insights”.
You can double-click into each user to see the underlying query groups and dependencies. As the engineer in charge, that means you can track your workflows and understand which user, query and table are the cause of any issues.
Let’s go through the single steps of the data flow and the technologies involved to orchestrate the workflows.
S3 is the demarcation point. AWS delivers the CUR into S3. With various data sources next to the CUR, it’s easy for people to put data into an S3 bucket. Loading data into Redshift from S3 is easy and efficient with the COPY command.
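A load from S3 is a single COPY statement. The sketch below builds one; the bucket, table, and IAM role are placeholders, and the CSV/GZIP clauses assume the compressed-CSV form in which AWS delivers the report:

```python
# A hedged sketch of loading a CUR delivery from S3 into a staging table
# with COPY. All names and the ARN are placeholders, not real resources.

def build_copy_statement(table: str, s3_prefix: str, iam_role: str) -> str:
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "CSV GZIP IGNOREHEADER 1;"
    )

sql = build_copy_statement(
    "stage_cur_raw",
    "s3://example-cur-bucket/cur/20200401-20200430/",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
print(sql)
```

Because COPY reads all objects under the prefix in parallel across the cluster's slices, it stays efficient even as the report grows.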
Amazon Redshift is the data platform. The workers for ingestion and post-ingestion processing include Lambda and EC2. Intuit uses Lambda wherever possible, as they prefer to not have any persistent compute they need to monitor or care for (patching, restacking, etc.).
Lambda functions can now run for up to 15 minutes, and for any job that runs under five minutes, the stack uses a Lambda function. For larger jobs, the team deploys the same code stack on EC2, e.g. for staging the big CUR.
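That routing rule is easy to express in code. A minimal sketch, assuming a five-minute cutoff as described above (the estimator itself is hypothetical):

```python
# Route jobs by expected runtime: short jobs go to Lambda, long ones to the
# same code stack deployed on EC2. The cutoff mirrors the rule in the text.

LAMBDA_CUTOFF_MINUTES = 5  # stay well under Lambda's 15-minute ceiling

def choose_runtime(estimated_minutes: float) -> str:
    """Pick the execution environment for a job based on its estimated duration."""
    return "lambda" if estimated_minutes < LAMBDA_CUTOFF_MINUTES else "ec2"

print(choose_runtime(2))   # lambda
print(choose_runtime(45))  # ec2  (e.g. staging the big end-of-month CUR)
```

Keeping one code stack and only switching where it runs avoids maintaining two implementations of the same job.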
AWS Step Functions coordinate the Lambda jobs. SNS triggers new workflows as new data arrives, while CloudWatch schedules the batch jobs. For example, when a new CUR arrives in an S3 bucket, processing needs to start right away rather than wait for a specific time slot. RDS helps maintain state.
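The event-driven trigger can be sketched as a Lambda handler receiving an SNS-wrapped S3 notification: S3 notifies SNS when a new CUR object lands, SNS invokes the function, and processing starts immediately. The handler and key names below are illustrative:

```python
import json

def handler(event, context=None):
    """Extract the S3 object keys of freshly delivered CUR files from an
    SNS-wrapped S3 notification, so downstream processing can start."""
    keys = []
    for record in event["Records"]:
        # SNS delivers the original S3 event as a JSON string in "Message".
        s3_event = json.loads(record["Sns"]["Message"])
        for s3_record in s3_event["Records"]:
            keys.append(s3_record["s3"]["object"]["key"])
    return keys

# A trimmed-down example event of the shape SNS delivers:
event = {"Records": [{"Sns": {"Message": json.dumps(
    {"Records": [{"s3": {"object": {"key": "cur/20200401/report-1.csv.gz"}}}]}
)}}]}
print(handler(event))  # ['cur/20200401/report-1.csv.gz']
```

From here, the handler would kick off the staging workflow (e.g. start a Step Functions execution) rather than waiting for a scheduled time slot.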
Data consumption happens across three major categories.
Intuit supports new data use cases with Redshift, such as data APIs. Some of the use cases have a transactional character that may require many small writes.
Instead of trying to turn Redshift into an OLTP database, Intuit combines Redshift with PostgreSQL via Amazon RDS. By using dblink, you can have your PostgreSQL cake and eat it too. By linking Amazon Redshift with RDS PostgreSQL, the combined feature set can power a broader array of use cases and provide the best solution for each task.
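The dblink bridge runs on the PostgreSQL side: the extension opens a connection to the Redshift endpoint, and the result set is typed in the AS clause. A minimal sketch, with placeholder connection parameters and a hypothetical table name:

```python
# A sketch of querying Redshift from RDS PostgreSQL via dblink. Host,
# credentials, and table names are placeholders, not real resources.
dblink_sql = """
CREATE EXTENSION IF NOT EXISTS dblink;

SELECT product, cost
FROM dblink(
       'host=my-cluster.example.redshift.amazonaws.com port=5439 dbname=cur user=reader password=secret',
       'SELECT product, SUM(cost) FROM cur_monthly_spend GROUP BY 1'
     ) AS t(product varchar, cost numeric);
"""
print(dblink_sql.strip())
```

With this in place, small transactional writes live in PostgreSQL while heavy analytical scans stay in Redshift, and each side can read the other when needed.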
Unlike "all in one" data warehouses like Oracle or SQL Server, Redshift doesn't offer system-native workflows. This may seem intimidating at first.
Instead, AWS takes the approach of providing a broad collection of primitives for low-overhead compute, storage, and development services. Coupled with a rich tool ecosystem for Redshift, you can build a data platform that allows for higher performing, more scalable and lower cost solutions than previously possible.
Overall, the migration ushered Intuit into a new era of data productivity. The platform:
Meanwhile, the new data platform saves Intuit millions of spend on cloud infrastructure, and transforms the decision making process for 8,000+ employees.
With the new platform in place, Intuit is architecting a number of new use cases.
Data Lake Architecture
Long-term trends for the CUR data are interesting, but for cost optimization, analysts care most about the most recent data. It makes sense to unload data from the largest tables in Redshift into S3 in Parquet format. That saves cost and increases flexibility by separating storage and compute.
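The unload step is a single UNLOAD statement per table. The sketch below builds one; table, bucket, and role names are placeholders, and the date column is a hypothetical stand-in for whatever the fact table is keyed on:

```python
# A hedged sketch of archiving older CUR rows from Redshift to S3 as
# Parquet, partitioned by billing month. All names are placeholders.

def build_unload_statement(table: str, cutoff: str,
                           s3_prefix: str, iam_role: str) -> str:
    # Single quotes inside the UNLOAD'd query must be doubled.
    return (
        f"UNLOAD ('SELECT * FROM {table} WHERE usage_start_date < ''{cutoff}''')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS PARQUET\n"
        "PARTITION BY (bill_month);"
    )

stmt = build_unload_statement(
    "cur_line_items", "2020-01-01",
    "s3://example-cur-archive/cur_line_items/",
    "arn:aws:iam::123456789012:role/RedshiftUnloadRole",
)
print(stmt)
```

Partitioning by month keeps the S3 layout aligned with how the CUR is queried, so Spectrum and Athena can prune partitions instead of scanning everything.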
Data Lifecycle Management
Once data is in S3, other (serverless) query engines like Athena or Redshift Spectrum can access it. The main fact tables in the Intuit cluster are based on date – the CUR is a bill. The date serves as the criterion for when to unload data. For example, you may want to keep only one quarter of data within the cluster. By keeping historic data in S3 and using Spectrum to query it, you scale data outside of Redshift but keep retrieval seamless and performant.
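The "keep roughly one quarter in the cluster" rule reduces to a cutoff-date calculation: anything older than the window is a candidate to unload to S3 and query via Spectrum. A tiny sketch, with the ~90-day window as the article's example rather than a fixed policy:

```python
from datetime import date, timedelta

def unload_cutoff(today: date, keep_days: int = 90) -> date:
    """Rows with a billing date before this cutoff get unloaded to S3."""
    return today - timedelta(days=keep_days)

print(unload_cutoff(date(2020, 4, 15)))  # 2020-01-16
```

A scheduled job can compute this cutoff and feed it into the UNLOAD step, so the cluster's footprint stays roughly constant as the bill grows.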
In intermix.io, you can filter Spectrum queries by row count and scan size. You can also track their execution time and queue wait time. In the screenshot below you see those metrics, including the uptick in Spectrum queries at the beginning of June.
The cost optimization program has delivered massive benefits. Teams know and can predict computing costs in near real time. Deploying ML/AI capabilities against the CUR will enable even smarter decisions; even a 1% improvement pays huge dividends.
Intuit expects the number of data scientists to go up several-fold, and with it the query volume. These query patterns are more complex and less predictable. Concurrency Scaling offers an option to add more slots to a cluster to accommodate that incremental query volume, without adding nodes.
It’s a new way of working with data compared with the old, on-premise warehouse. Intuit is now in a position to embed a data services layer into all of Intuit’s products and services.
That was a long post, and I hope it gave you a good peek behind the curtain on how Intuit is building their platform. I hope the post gives you enough information to get started with your data platform.
Now, I’d love to learn from you! Is there anything you can share about your own experience building a data platform? And if you want faster queries for your cloud analytics, and spend less time on Ops and more time on Dev like Intuit, then go ahead and schedule a demo or start a trial for intermix.io.