This is part of a series of interviews on how companies are building data products. In these interviews, we’re sharing how data teams use data, with a deep dive into a data product at the company. We also cover tech stacks, best practices and other lessons learned.
Table of Contents
Aaron Biller is a lead data engineer at Postmates. Postmates is an on-demand delivery platform with operations in 3,500 cities in the US and Mexico. With over 5 Million deliveries each month, Postmates is transforming the way food and merchandise is moved around cities.
A successful, on-time delivery is the single most important event for Postmates’ business.
“In the very, very early days, we would query our production database to understand how many deliveries we had for the past day. Reporting on our metrics would happen with spreadsheets. Clearly that wouldn’t scale, and so we shifted our analytics to a data warehouse and used Amazon Redshift” says Aaron Biller, Engineering Lead for Data at Postmates.
“Data is ubiquitous at Postmates. It’s much more than reporting – we’re delivering data as a product and support data microservices”, says Biller.
Consider fraud prevention. On-demand platforms like Postmates have a unique exposure to payments and fraud because they have to assess risk in real-time.
While the warehouse does not operate in real-time, it ingests and transforms event data from all transactions for downstream consumption by predictive models and real-time services.
The Postmates risk team has engineered an internal risk detection microservice called “Pegasus”. Event data passes through a series of transformations in Redshift and feeds into “business rules”, which take the transformed data as input and produce decisions as output, with live decisions for every individual transaction on the Postmates platform.
In addition to Fraud, the data team has built an infrastructure that drives four major use cases for data:
As things have progressed at Postmates, they added more developers, more microservices and more data sources. “The amount of data we have, period, and the amount of new data we generate every day has expanded exponentially,” describes Biller the growth at Postmates.
Consider the amount of data collected during “peak delivery time” on Sunday nights, when people order their dinner to eat at home.
“Three years ago, we captured data from a certain number of ongoing deliveries on a Sunday night at peak. We’re now at about 30x the number of deliveries in flight. And we’re also monitoring and tracking so many more events per delivery. In short, we’re doing 30x the deliveries, and a single delivery includes 10x the data, and it just keeps growing,” explains Biller.
Amazon Redshift and Google BigQuery are the primary data warehouses.
The vast majority of raw data comes from the Postmates app itself. In addition to the app, the data team has built integrations with 3rd party services. Examples include:
You can write a query that combines 13 data sources in one single query and just run it and get data. That’s extraordinarily useful and powerful from an analytics and reporting perspective.”
This is part of a series of interviews on how companies are building data products. In these interviews, we’re sharing how data teams use data, with a deep dive into a data product at the company. We also cover tech stacks, best practices and other lessons learned.
Stephen Bronstein leads the Data Team at Fuze. It’s a “skinny team” of 3 people that supports all of Fuze’s data needs.
Over the course of the past three years, Stephen has led the team through warehouse transitions and performance tuning, adoption of new data sources, regular surges of new data and use cases, and the on-boarding of hundreds of new data-users. “People care a lot about having the right data at the right time. It’s crucial to drive their work forward,” says Bronstein.
Fuze is a cloud-based communications and collaboration platform provider. The Fuze platform unifies voice, video, messaging, and conferencing services on a single, award-winning cloud platform,and delivers intelligent, mobile-ready apps to customers.
As Fuze has grown its customer base and employee count, data has become a mission-critical component. More than 300 people query the data warehouse on a constant basis across Fuze’s 19 global office locations.
Departments include Finance, Sales, Product, and Customer Support.
Each day critical business functions query the data warehouse:
A central Amazon Redshift data warehouse combines data from a growing number of sources and events. The toolchain around Amazon Redshift includes
Data modeling is bespoke, and Fuze runs large-scale data transformations within Redshift in SQL.
Arvind Ramesh is the manager of the data team at Envoy. In this post, we’re sharing how Arvind’s team has built a data platform at Envoy, with a deep dive into a data product for the Customer Success Team. Arvind believes that the most important skill in data is storytelling and that data teams should operate more like software engineering teams. You can find Arvind on LinkedIn
For more detail, you can also watch the full video of our conversation.
Envoy helps companies manage their office visitors and deliveries. Chances are you’ve already used Envoy somewhere in a lobby or a reception desk, where an iPad kiosk runs the Envoy app. The app checks you in, prints your name tag and sends your host a message that you’ve arrived. Guests can also use the “Envoy Passport” app for automatic check-in, collecting different “stamps” for each office visit, similar to visa stamps at customs when entering a country. Hosts can manage visitors and other things like mail and packages via a mobile app.
The apps generate a new, growing amount of data on a per-minute basis. Envoy has welcomed over 60M visitors, adding 130,000 new visitors every day, in over 13,000 locations around the globe. Envoy’s business model consists of a subscription where pricing scales with the number of locations and deliveries. Providing analysis and reporting for decision making across that location and visitor footprint is of course valuable.
But basic reporting is not a data product. “The way I look at it, a data product, it’s not really delivering an insight per se, but it’s delivering ‘a thing’ that can be used by other people within or outside of the company to improve their workflow’, says Arvind.
At Envoy, data products are about enabling people to do more with the data Envoy generates. “The advantages of data products is that they enable other people to more easily access data and use it in their workflow without us having to go back and forth for every piece of insight”, says Arvind.
Part of Envoy’s secret to success is using a data product to drive account penetration and adoption of the Envoy app.
The Customer Success team uses Gainsight, a suite of SaaS products for monitoring and managing product adoption. “Gainsight is basically a UI that allows you to better manage your customer base, see how people are using your product, where there are problems, where you might need to focus your attention. But, like most tools, it is only valuable if it has a stream of comprehensive, reliable, and accurate customer data underpinning it.”
Gainsight offers a full customer data platform, but in Envoy’s case the Gainsight app is a “skinny” UI, which sits on top of the Envoy data platform. In this specific case, it’s a table called “Gainsight company facts”. A daily batch process combines data from many different sources into the final table.
“The way I think of it Gainsight, it’s the UI that sits on top of our data platform because it would not be the best use of our time or anyone’s time to build a UI for that kind of thing. The way we think of data products – it’s a backend data service or data infrastructure. The actual last mile is a web dashboard or something similar. Usually we’ll use a tool to accomplish that.”
Data delivered in Gainsight helps the Customer Success team prioritize the accounts where product adoption is below par, and locations are at risk of churning.
Compared to the raw source data that’s scattered across various sources and places, the new “Gainsight company facts” table has reliable, useful information in a single place, such as:
“We have a few customer success managers and each of them will have several hundred accounts” says Arvind. “They can have a conversation with the customer, ‘ok, you have 50 offices, but look, these five are actually having some issues and maybe you should focus on these’.” Arvind’s team helps make those conversations more effective with product usage data.
The data for the “Gainsight company facts” table is the result of a daily batch process. The process cleans, transforms and combines raw data from different sources within the Envoy data warehouse.
The image below shows a summary of the graph, or “DAG” involved in building the model. For every company, for every day, the output table contains a variety of key data points.
To run the batch process, Arvind’s team has built a platform that consists of five key components.
The model for Gaingisht takes about 40 minutes to run, including completing all upstream dependencies for the freshest possible data.
For other data products, the SLA can be a bit more real-time, and models run every three hours or even shorter. But “typically it’s pretty rare for people to truly need data real-time or even close to real time, unless it’s something operational or you’re looking at server metrics.”
Next to customer success, the platform supports other data products, and building the underlying models requires domain knowledge.
“That’s typically where we do a lot of work on our end. You really have to understand how our business works, how our activation metrics are defined, or how our products interact with each other”, says Arvind. Usually one person is an expert for a specific domain, but with peer review, documentation and a QA rotation, domain knowledge starts to make its way across the team.
“Our team is eight people and that’s split across data engineering, product analytics and go-to-market analytics. We’re about 170 people right now total at Envoy, which translates to about 5% of the company working in data-focused roles. “If we build effective data products and continue to iterate on operational efficiencies, our team should not have to scale linearly with the company.”
Building the platform has been an iterative process. “It’s probably a bit cliché, but I think data teams should work more like software engineering teams”.
First was testing. “After building out the first set of several dozen models, we quickly realized that ensuring everything stayed accurate was a challenge. People were finding issues in the data before we were, which was embarrassing for us since we felt accountable for the accuracy of anything we put out. The upstream data was at times incorrect, and could change in ways we hadn’t anticipated. So we implemented tests, which are basically data sanity checks running on a schedule that checked the intermediate and final transformations in our pipeline. But then, when some of these tests failed, it’s not evident who on the team was responsible. They went unfixed for a long time, as each member of the team thought someone else would fix the issue”.
So next up was QA rotation. “Now, every week, there’s someone else who’s responsible for fixing any data test issues that pop up, and the ‘oncall’ schedule is managed through PagerDuty”. As the company and the data team keeps growing, the person who is responsible for fixing a failed test may not understand the full logic that’s going on.
That meant better documentation on how things are built and how to debug tests. “And again, I keep drawing this parallel, but if you look at any software engineering team, and open up our knowledge base, you will likely see a postmortem for every outage and issue that happens, and runbooks to fix them.”
Arvind’s team has started to use the data platform for advanced data products, like churn prediction and machine learning. Envoy is trying Amazon SageMaker for that.
Another project is integrating the data platform with the customer-facing analytics. “Obviously, when people log into Envoy, there are already analytics. How many visitors are coming to their office, how many packages they’re scanning. That’s done through a white-labeled solution with a BI vendor”, says Arvind.
“But eventually the data team is going to build an internal microservice where the Envoy app will just request what it wants and then we will return a bunch of things that we think are valuable”.
In 2014 Intuit’s then-CTO Tayloe Sainsbury went all in on the cloud and started migrating Intuit’s legacy on-premise IT infrastructure to Amazon AWS. By February 2018, Intuit had sold its largest data center and processed 100% of 2018 tax filings in the cloud.
But now Intuit had a different challenge – optimizing cloud spend and allocating that spend to products and users. AWS bills customers via a “Cost & Usage Report” (“CUR”). Because of the size of its cloud spend, the Intuit CUR comprises billions of rows, and it keeps growing by the day. Intuit switched from an on-premise data warehouse and now uses Amazon Redshift to process 4-5 billion rows of raw CUR data – each day.
In this post, I’m walking you through the approach that Jason Rhoades took to build Intuit’s data pipeline with Redshift. Jason is an Architect at Intuit, and with a small data team, they provide business-critical data to more than 8,000 Intuit employees.
Heads up – this is a long post with lots of detail!
Three major blocks:
Let’s start with an overview of the business.
Intuit builds financial management and compliance products and services for consumers and small businesses. Intuit also provides tax products to accounting professionals.
Products include QuickBooks, TurboTax, Mint and Turbo. These products help customers run their businesses, pay employees, send invoices, manage expenses, track their money, and file income taxes. Across these products, Intuit serves more than 50 million customers.
Intuit started in the 1980s and built the original version of its first product Quicken for the desktop, first MS-DOS and then Windows. With the Internet, that usage shifted to web and mobile. The impact of that change became clear as early as 2010.
Fast forward to today, and over 90% of Intuits customers file their taxes and manage their accounting online and via mobile apps.
Consumption of Intuit products follows seasonal patterns.
Tax seasonality has the biggest impact on Intuit’s business. Each fiscal year, Intuit generates half of its annual revenue in the quarter ending on April 30th, with the US tax filing deadline on April 15th.
Seasonality also has a huge impact on the cost side of the equation. The shift to digital and online usage of Intuit’s products causes a dramatic usage spike for the IT infrastructure. Most users file their taxes online during the last two days of the tax season.
In the old world of on-premise infrastructure, and to handle the concurrent usage, Intuit had to size their data center for peak capacity. After tax season, demand drops back down to average usage. The gap between peak demand and average usage is so large, that 95% of Intuit’s infrastructure would sit idle for 95% of the year.
That’s why Intuit decided in 2014 to go all in with the cloud. With the cloud’s elasticity, Intuit is in a better position to accommodate spikes in customer usage during the tax season.
By shifting to the cloud, Intuit reduced cost by a factor of six because it no longer maintained idle servers for an application only active during tax season. After the first success, Intuit moved more applications, services and enabling tools to the cloud. Today, over 80% of Intuit’s workloads are running in the cloud.
With now growing usage of AWS, the priorities of the program shifted from migration speed to efficient operations and growth.
Intuit now spends $100s of Millions on prepaid AWS services (aka “reserved instances” or short “RIs”) alone, plus fees for on-demand usage during the peaks. Interest grew in understanding the use of different AWS services and spend by different business units and teams within Intuit.
The source for that information sits in the “Cost & Usage Report” (“CUR”), a bill that Amazon AWS delivers to every customer. The CUR includes line items for each unique combination of AWS product, usage type, and operation and of course pricing. The CUR also contains information about credits, refunds and support fees.
Analyzing CUR data supports Intuit’s cloud program with two major use cases:
To build these two use cases, Jason’s team needs to transform the raw CUR data into a format consumable by the business. The raw CUR data comes in a different format from what Intuit uses to charge internal parties, distribute shared costs, amortize RIs and record spend on the general ledger.
The traditional way of Jason’s team to run the analytics on the CUR data was with an on-premise data warehouse.
Unlike other companies, the size of Intuit’s CUR is very large. In 2017, it was around 500M rows at the end of a month.
Amazon delivers the report 3-4x per day to Intuit, and restates the rows with each report over the course of a month, meaning it gets longer with each delivery. Coupled with a growing business, the amount of data the cluster has to process each time grows by the hour – literally.
You can see that trend play out in chart above, with data from 2017. The grey area indicates the batch size for the CUR data. Each day, the batch size gets bigger as the number of rows in the CUR grow. At the end of the month, the CUR reaches about 500 million rows and resets on day one of the new month.
The amount of rows the warehouse processes per minute stays constant at around 1 million rows per minute. Therefore, the time it takes the warehouse to process each batch (“batch duration”) goes up in linear fashion. With 500M rows at the end of the month, it takes the warehouse 500 minutes to process the full report, or 8 hours and 20 minutes.
Now extrapolate forward and calculate what that looks like in the future. With rising cloud spend, the data team realized that the CUR would start to blow up in size. In fact, today the CUR is larger by a factor of 10x with ~5 billion rows. Now we’re talking over 80 hours, almost four days.
Intuit’s situation is a common scenario we see our customers run into: “More data, more workflows, more people”.
For Intuit, it was clear that “keep on doing what we’re doing” was not an option. In a world where data is an asset, data and DevOps teams should focus on the value-creation part of pipelines.
With cloud usage and data volume going up, the old on-prem warehouse was already running into bottlenecks, and so the analytics team followed the business team into the cloud.
The Intuit team followed their product team into the AWS cloud. The major goals included handling the explosion in data volume and adding value to the business.
With on-demand access to computing resources, access to usage data in near real-time is fundamental for Intuit’s business teams. Unlike in the old world, waiting for a report at the end of the month doesn’t work anymore. With the scale of Intuit’s cloud operations, a few hours of freshness have a substantial impact on the company.
Jason migrated the entire stack from Oracle to Redshift, and deployed the same SQL and ETL processes.
Redshift handled the growth in data volume. Three major data points:
You can also see that the size of the grey area has a step change in April – tax season! The change is due to new capabilities Intuit introduced, which tripled the number of rows of the bill (“more data”).
Despite tripling the number of rows, the batch duration stays within a narrow band and doesn’t spike. That’s because batch size and number of rows processed per minute grow at the same rate. In other words, the cluster processes more data faster, i.e. performance goes up as workloads grow.
Let’s dive into how Jason’s team achieved that result.
The cluster architecture and the data pipeline follow the best practices we recommend for setting up your Amazon Redshift cluster. In particular, pay attention to setting up your WLM to separate your different workloads from each other.
You can see the three major workloads in the architecture chart – stage, process and consume.
Among our customers, “ELT” is a standard pattern, i.e. the transformation of data happens in the cluster with SQL. Cloud warehouses like Redshift are both performant and scalable, to the point that data transformation uses cases can be handled much better in-database vs an external processing layer. SQL is concise, declarative, and you can optimize it.
Intuit follows the “ELT” vs. “ETL” approach. With a lot of SQL knowledge on the team, they can build transformations in SQL and run them within the cluster. AWS drops the CUR into an S3 bucket where Intuit extracts the raw data from (the “E”) into the staging area. Intuit leaves the raw data untouched and loads it into the cluster (the “L”), to then transform it (the “T”).
Underneath the processes is an orchestration layer that coordinates workflows and manages dependencies. Some workflows need to execute on an hourly or daily basis, others on arrival of fresh data. Understanding the workflows and their execution is a crucial component for data integrity and meeting your SLAs.
When workflows and data pipelines fail – and they will – you have to a) know about it as it happens and b) understand the root cause of the failure. Otherwise you will run into data integrity issues and miss your SLAs. In Intuit’s case, the key SLA is the near real-time character of the data.
In intermix.io, you can see these workflows via our “Query Insights”.
You can double-click into each user to see the underlying query groups and dependencies. As the engineer in charge, that means you can track your worfklows and understand which user, query and table are the cause of any issues.
Let’s go through the single steps of the data flow and the technologies involved to orchestrate the workflows.
S3 is the demarcation point. AWS delivers the CUR into S3. With various data sources next to the CUR, it’s easy for people to put data into an S3 bucket. Loading data into Redshift from S3 is easy and efficient with the COPY command.
Amazon Redshift is the data platform. The workers for ingestion and post-ingestion processing include Lambda and EC2. Intuit uses Lambda wherever possible, as they prefer to not have any persistent compute they need to monitor or care for (patching, restacking, etc.).
Lambda functions can now run for 15 minutes, and for any job that runs under five minutes, the stack uses a lambda function. For larger jobs, they can deploy the same code stack on EC2, e.g. for staging the big CUR.
AWS Step Functions coordinate the Lambda jobs. SNS triggers new worfklows as new data arrives, vs. CloudWatch for scheduling batch jobs. For example, when a new CUR arrives in an S3 bucket processing needs to start right away vs. waiting for a specific time slot. RDS helps to maintain state.
Data consumption happens across three major categories.
Intuit supports new data use cases with Redshift, such as data APIs. Some of the uses cases have a transactional character that may require many small writes.
Instead of trying to turn Redshift into an OLTP, Intuit combines Redshift with PostgreSQL via Amazon RDS. By using dblink you can have your PostgreSQL cake and eat it too. By linking AmazonRedshift with RDS PostgreSQL, the combined feature set can power a broader array of use cases and provide the best solution for each task.
Unlike with “all in one” data warehouses like Oracle or SQL Server, Redshift doesn’t offer system-native workflows. This may be at first intimidating
Instead, AWS takes the approach of providing a broad collection of primitives for low-overhead compute, storage, and development services. Coupled with a rich tool ecosystem for Redshift, you can build a data platform that allows for higher performing, more scalable and lower cost solutions than previously possible.
Overall, the migration rushed Intuit into a new era of data productivity. The platform:
Meanwhile, the new data platform saves Intuit millions of spend on cloud infrastructure, and transforms the decision making process for 8,000+ employees.
With the new platform in place, Intuit is architecting a number of new use cases.
Data Lake Architecture
Long term trends for the CUR data are interesting, but for cost optimization analysts are interested in the most recent data. It makes sense to unload data from the largest tables in Redshift into S3 in Parquet format. That saves cost and increases flexibility by separating storage and compute.
Data Lifecycle Management
Once data is in S3, other (serverless) query engines like Athena or Redshift Spectrum can access it. The main fact tables in the Intuit cluster are based on date – the CUR is a bill. The date serves as the criteria when to unload data. For example, you may only want to keep one-quarter of data within the cluster. By keeping historic data in S3 and using Spectrum to query it, you scale data outside of Redshift but keep retrieval seamless and performant.
In intermix.io, you can filter for Spectrum queries by row count and scan size. You can also track their execution time and queue wait time. In the screenshot below you see those metrics, incl. the uptick Spectrum queries beginning of June .
The cost optimization program has delivered massive benefits. Teams know and predict computing costs in near real time. Deploying ML/AI capabilities against the CUR will allow making even smarter decisions – even 1% of improvement pays huge dividends.
Intuit expects the number of data scientists to go up several-fold, along with it the query volume. These queries patterns are more complex and less predictable. Concurrency Scaling offers an option to add more slots to a cluster to accommodate that incremental query volume, without adding nodes.
It’s a new way of working with data compared with the old, on-premise warehouse. Intuit is now in a position to embed a data services layer into all of Intuit’s products and services.
That was a long post, and I hope it gave you a good peek behind the curtain on how Intuit is building their platform. I hope the post gives you enough information to get started with your data platform.
Now, I’d love to learn from you! Is there anything you can share about your own experience building a data platform? And if you want faster queries for your cloud analytics, and spend less time on Ops and more time on Dev like Intuit, then go ahead and schedule a demo or start a trial for intermix.io.
In broad terms, we see five major categories of data sources: