“Real-time Fraud Detection”

This is part of a series of interviews on how companies are building data products. In these interviews, we’re sharing how data teams use data, with a deep dive into a data product at the company. We also cover tech stacks, best practices and other lessons learned.

About

Aaron Biller is a lead data engineer at Postmates. Postmates is an on-demand delivery platform with operations in 3,500 cities in the US and Mexico. With over 5 million deliveries each month, Postmates is transforming the way food and merchandise are moved around cities.

A successful, on-time delivery is the single most important event for Postmates’ business.

“In the very, very early days, we would query our production database to understand how many deliveries we had for the past day. Reporting on our metrics would happen with spreadsheets. Clearly that wouldn’t scale, and so we shifted our analytics to a data warehouse and used Amazon Redshift,” says Aaron Biller, Engineering Lead for Data at Postmates.

What business problem does this data product solve? 

“Data is ubiquitous at Postmates. It’s much more than reporting – we’re delivering data as a product and support data microservices”, says Biller.

Consider fraud prevention. On-demand platforms like Postmates have a unique exposure to payments and fraud because they have to assess risk in real-time. 

While the warehouse does not operate in real-time, it ingests and transforms event data from all transactions for downstream consumption by predictive models and real-time services.

The Postmates risk team has engineered an internal risk detection microservice called “Pegasus”. Event data passes through a series of transformations in Redshift and feeds into “business rules”, which take the transformed data as input and produce decisions as output, with live decisions for every individual transaction on the Postmates platform.
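
To make the pattern concrete, here is a minimal sketch of what such a business rule can look like as SQL over transformed event data. Everything below is hypothetical for illustration (table, columns, and thresholds are invented); Postmates’ actual Pegasus rules are not public.

```sql
-- Illustrative only: a rule that scores recent buyer activity.
-- Table names, columns and thresholds are hypothetical.
WITH recent_activity AS (
    SELECT
        buyer_id,
        COUNT(*)                         AS orders_last_hour,
        COUNT(DISTINCT card_fingerprint) AS cards_last_hour
    FROM transformed_delivery_events
    WHERE event_time > DATEADD(hour, -1, GETDATE())
    GROUP BY buyer_id
)
SELECT
    buyer_id,
    CASE
        WHEN cards_last_hour >= 3  THEN 'decline'
        WHEN orders_last_hour > 10 THEN 'review'
        ELSE 'approve'
    END AS decision
FROM recent_activity;
```

In production, a service like Pegasus would evaluate rules of this kind against the freshest transformed data and return a decision for each transaction.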

In addition to fraud, the data team has built an infrastructure that drives four major use cases for data:

What is the tech-stack used?

As Postmates has grown, the team has added more developers, more microservices and more data sources. “The amount of data we have, period, and the amount of new data we generate every day has expanded exponentially,” says Biller, describing the growth at Postmates.

Consider the amount of data collected during “peak delivery time” on Sunday nights, when people order their dinner to eat at home.

“Three years ago, we captured data from a certain number of ongoing deliveries on a Sunday night at peak. We’re now at about 30x the number of deliveries in flight. And we’re also monitoring and tracking so many more events per delivery. In short, we’re doing 30x the deliveries, and a single delivery includes 10x the data, and it just keeps growing,” explains Biller.

Amazon Redshift and Google BigQuery are the primary data warehouses.  

What are the sources of data?

The vast majority of raw data comes from the Postmates app itself. In addition to the app, the data team has built integrations with 3rd party services. Examples include:

“You can write a query that combines 13 data sources in one single query and just run it and get data. That’s extraordinarily useful and powerful from an analytics and reporting perspective.”
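
As a rough illustration of what that looks like in practice (the schema and table names below are made up, not Postmates’ actual warehouse layout), combining sources is simply a matter of joining tables that different pipelines have landed in the same warehouse:

```sql
-- Hypothetical example: one query across data landed from the app database,
-- a payments provider, and a support tool.
SELECT
    d.delivery_id,
    d.market,
    p.charge_amount,
    s.ticket_count
FROM app.deliveries           d
LEFT JOIN payments.charges    p ON p.delivery_id = d.delivery_id
LEFT JOIN support.tickets_agg s ON s.delivery_id = d.delivery_id
WHERE d.delivered_at >= DATEADD(day, -1, GETDATE());
```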

“Used by over 300 people”

This is part of a series of interviews on how companies are building data products. In these interviews, we’re sharing how data teams use data, with a deep dive into a data product at the company. We also cover tech stacks, best practices and other lessons learned.

About

Stephen Bronstein leads the Data Team at Fuze. It’s a “skinny team” of 3 people that supports all of Fuze’s data needs.

Over the course of the past three years, Stephen has led the team through warehouse transitions and performance tuning, adoption of new data sources, regular surges of new data and use cases, and the on-boarding of hundreds of new data-users. “People care a lot about having the right data at the right time. It’s crucial to drive their work forward,” says Bronstein. 

Who is the end-user of this data product?

Fuze is a cloud-based communications and collaboration platform provider. The Fuze platform unifies voice, video, messaging, and conferencing services on a single, award-winning cloud platform, and delivers intelligent, mobile-ready apps to customers.

As Fuze has grown its customer base and employee count, data has become a mission-critical component. More than 300 people query the data warehouse on a constant basis across Fuze’s 19 global office locations.  

Departments include Finance, Sales, Product, and Customer Support.

What business problem does this data product solve? 

Each day critical business functions query the data warehouse: 

What is the tech-stack used?

A central Amazon Redshift data warehouse combines data from a growing number of sources and events. The toolchain around Amazon Redshift includes 

Data modeling is bespoke, and Fuze runs large-scale data transformations within Redshift in SQL. 
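
As a sketch of what such an in-warehouse transformation can look like (the source and target tables below are assumptions, not Fuze’s actual models):

```sql
-- Illustrative Redshift SQL transformation: roll platform events up into a
-- daily usage table per customer. Table and column names are hypothetical.
CREATE TABLE analytics.daily_usage_by_customer AS
SELECT
    customer_id,
    DATE_TRUNC('day', event_time)                           AS usage_date,
    SUM(CASE WHEN event_type = 'call'    THEN 1 ELSE 0 END) AS calls,
    SUM(CASE WHEN event_type = 'message' THEN 1 ELSE 0 END) AS messages,
    SUM(CASE WHEN event_type = 'meeting' THEN 1 ELSE 0 END) AS meetings
FROM raw.platform_events
GROUP BY customer_id, DATE_TRUNC('day', event_time);
```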

Watch the full video and download a complete transcript of our conversation.

Summary

Arvind Ramesh is the manager of the data team at Envoy. In this post, we’re sharing how Arvind’s team has built a data platform at Envoy, with a deep dive into a data product for the Customer Success Team. Arvind believes that the most important skill in data is storytelling and that data teams should operate more like software engineering teams. You can find Arvind on LinkedIn

For more detail, you can also watch the full video of our conversation. 

Data at Envoy: Background & Business Context

Envoy helps companies manage their office visitors and deliveries. Chances are you’ve already used Envoy somewhere in a lobby or a reception desk, where an iPad kiosk runs the Envoy app. The app checks you in, prints your name tag and sends your host a message that you’ve arrived. Guests can also use the “Envoy Passport” app for automatic check-in, collecting different “stamps” for each office visit, similar to visa stamps at customs when entering a country. Hosts can manage visitors and other things like mail and packages via a mobile app. 

The apps generate a growing amount of new data every minute. Envoy has welcomed over 60M visitors, adding 130,000 new visitors every day, in over 13,000 locations around the globe. Envoy’s business model consists of a subscription where pricing scales with the number of locations and deliveries. Providing analysis and reporting for decision making across that location and visitor footprint is of course valuable. 

But basic reporting is not a data product. “The way I look at it, a data product, it’s not really delivering an insight per se, but it’s delivering ‘a thing’ that can be used by other people within or outside of the company to improve their workflow”, says Arvind.

At Envoy, data products are about enabling people to do more with the data Envoy generates. “The advantages of data products is that they enable other people to more easily access data and use it in their workflow without us having to go back and forth for every piece of insight”, says Arvind.

Part of Envoy’s secret to success is using a data product to drive account penetration and adoption of the Envoy app. 

Who is the end-user of this data product?

The Customer Success team uses Gainsight, a suite of SaaS products for monitoring and managing product adoption. “Gainsight is basically a UI that allows you to better manage your customer base, see how people are using your product, where there are problems, where you might need to focus your attention. But, like most tools, it is only valuable if it has a stream of comprehensive, reliable, and accurate customer data underpinning it.”

Gainsight offers a full customer data platform, but in Envoy’s case the Gainsight app is a “skinny” UI, which sits on top of the Envoy data platform. In this specific case, it’s a table called “Gainsight company facts”. A daily batch process combines data from many different sources into the final table.

“The way I think of it Gainsight, it’s the UI that sits on top of our data platform because it would not be the best use of our time or anyone’s time to build a UI for that kind of thing. The way we think of data products – it’s a backend data service or data infrastructure. The actual last mile is a web dashboard or something similar. Usually we’ll use a tool to accomplish that.”

What Business Problem Does this Data Product Solve?

Data delivered in Gainsight helps the Customer Success team prioritize the accounts where product adoption is below par, and locations are at risk of churning. 

Compared to the raw source data that’s scattered across various sources and places, the new “Gainsight company facts” table has reliable, useful information in a single place, such as:

“We have a few customer success managers and each of them will have several hundred accounts” says Arvind. “They can have a conversation with the customer, ‘ok, you have 50 offices, but look, these five are actually having some issues and maybe you should focus on these’.” Arvind’s team helps make those conversations more effective with product usage data.

What Are the Data Sources & Tech Stack?

The data for the “Gainsight company facts” table is the result of a daily batch process. The process cleans, transforms and combines raw data from different sources within the Envoy data warehouse.

The image below shows a summary of the graph, or “DAG” involved in building the model. For every company, for every day, the output table contains a variety of key data points. 

To run the batch process, Arvind’s team has built a platform that consists of five key components.

The model for Gainsight takes about 40 minutes to run, including completing all upstream dependencies for the freshest possible data. 

For other data products, the SLA is closer to real-time, and models run every three hours or even more frequently. But “typically it’s pretty rare for people to truly need data real-time or even close to real time, unless it’s something operational or you’re looking at server metrics.”

Best practices and lessons learned

Beyond customer success, the platform supports other data products, and building the underlying models requires domain knowledge. 

“That’s typically where we do a lot of work on our end. You really have to understand how our business works, how our activation metrics are defined, or how our products interact with each other”, says Arvind. Usually one person is an expert for a specific domain, but with peer review, documentation and a QA rotation, domain knowledge starts to make its way across the team. 

“Our team is eight people and that’s split across data engineering, product analytics and go-to-market analytics. We’re about 170 people right now total at Envoy, which translates to about 5% of the company working in data-focused roles. If we build effective data products and continue to iterate on operational efficiencies, our team should not have to scale linearly with the company.”

Arvind Ramesh is the manager of the data team at Envoy. He believes that the most important skill in data is storytelling and that data teams should operate more like software engineering teams. When he’s not walking his dog Casper or playing board games with friends, you can find Arvind on LinkedIn

Envoy helps companies manage their office visitors and deliveries. Chances are you’ve already used Envoy somewhere in a lobby or a reception desk, where an iPad kiosk runs the Envoy app. The app checks you in, prints your name tag and sends your host a message that you’ve arrived.

Who is the end-user of this data product?

The Customer Success team.

Part of Envoy’s secret to success is using data to drive account penetration and adoption of the Envoy app. Envoy’s business model consists of a subscription where pricing scales with the number of locations and deliveries. Data helps the customer success team prioritize the accounts where product adoption is below par, and locations are at risk of churning. 

What business problem does this data product solve? 

“We have a few customer success managers and each of them will have several hundred accounts” says Arvind. “They can have a conversation with the customer, ‘ok, you have 50 offices, but look, these five are actually having some issues and maybe you should focus on these’.” Arvind’s team facilitates that conversation with product usage data.

The dashboard is “a UI that allows you to better manage your customer base, see how people are using your product, where there are problems, where you might need to focus your attention. But the whole premise of this is that you need to feed it a lot of product data, a lot of customer data and it all needs to be reliable, accurate, and pretty comprehensive.” 

The app “…is a UI that sits on top of our data platform because it would not be the best use of our time or anyone’s time to build a UI for that kind of thing. The way we think of data products – it’s a backend data service or data infrastructure. The actual last mile is a web dashboard or something similar. Usually, we’ll use a tool to accomplish that.”

What is the tech-stack used?

The primary interface for the data product is a tool called Gainsight. Gainsight offers a full customer data platform, but in Envoy’s case the Gainsight app is a “skinny” UI.

The data is the result of a daily batch process. The batch process cleans and transforms raw data within the Envoy data warehouse, and combines data from many different sources into a prepared table called “Gainsight company facts”. The image below shows a summary of the graph, or “DAG” involved in building the model. For every company, for every day, the output table contains a variety of key data points. 

To run the batch process, Arvind’s team has built a platform that consists of five key components.

ETL tools

To get raw data from the source into the warehouse, Envoy uses off-the-shelf tools wherever an integration with a data source is available, including AWS Glue, Fivetran and Stitch Data. “For loading data, our general philosophy is that it’s a solved problem and we use a tool. For custom ingestion, we build a script using Singer”, an open-source ETL framework sponsored by Stitch Data, with orchestration via Airflow. Data loads happen up to every 30 minutes.

Data Warehouse

Envoy loads raw data into Amazon Redshift, with dense compute nodes (dc2.8xlarge). 

Data Modeling

Once data has been loaded into Redshift, transformation happens in SQL with dbt, an open-source analytics engineering framework, to create the final output tables, e.g. “Gainsight company facts”. 
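
A dbt model for a table like this is just a SQL SELECT plus dbt’s templating. The sketch below is a guess at the general shape, with invented model and column names rather than Envoy’s actual project:

```sql
-- models/gainsight_company_facts.sql (illustrative; names are assumptions)
{{ config(materialized='table') }}

SELECT
    c.company_id,
    c.company_name,
    v.visitors_last_30d,
    s.open_support_tickets,
    b.mrr
FROM {{ ref('companies') }}           c
LEFT JOIN {{ ref('visitor_rollup') }} v ON v.company_id = c.company_id
LEFT JOIN {{ ref('support_rollup') }} s ON s.company_id = c.company_id
LEFT JOIN {{ ref('billing_rollup') }} b ON b.company_id = c.company_id
```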

The final table is unloaded from the warehouse into a designated S3 bucket from where Gainsight picks it up. This architecture supports other front-end visualization tools in the future that may require different formats. The key part is that all of the logic is encapsulated in a model, and the final table(s) can have different destinations, e.g. an S3 bucket, a query within Redshift, a data API, or a data microservice. 
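
The unload step itself is a single Redshift statement along these lines; the bucket, IAM role, and file options below are placeholders:

```sql
-- Sketch of exporting the final table for Gainsight to pick up from S3.
UNLOAD ('SELECT * FROM analytics.gainsight_company_facts')
TO 's3://example-bucket/gainsight/company_facts_'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-unload-role'
DELIMITER ','
HEADER
GZIP
ALLOWOVERWRITE;
```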

What are the sources of data?

Envoy has between 20 and 25 data sources, across internal and 3rd party systems. For Gainsight, the relevant sources include Envoy’s production databases, Segment event data, the customer support tool (e.g. ticket volume), the billing system, email systems and Salesforce.

What best practices did you use?

Building the platform has been an iterative process. “It’s probably a bit cliché, but I think data teams should work more like software engineering teams”. 

First was testing. “After building out the first set of several dozen models, we quickly realized that ensuring everything stayed accurate was a challenge. People were finding issues in the data before we were, which was embarrassing for us since we felt accountable for the accuracy of anything we put out. The upstream data was at times incorrect, and could change in ways we hadn’t anticipated. So we implemented tests, which are basically data sanity checks running on a schedule that checked the intermediate and final transformations in our pipeline. But then, when some of these tests failed, it’s not evident who on the team was responsible. They went unfixed for a long time, as each member of the team thought someone else would fix the issue”. 

So next up was QA rotation. “Now, every week, there’s someone else who’s responsible for fixing any data test issues that pop up, and the ‘oncall’ schedule is managed through PagerDuty”. As the company and the data team keeps growing, the person who is responsible for fixing a failed test may not understand the full logic that’s going on. 

That meant better documentation on how things are built and how to debug tests. “And again, I keep drawing this parallel, but if you look at any software engineering team, and open up our knowledge base, you will likely see a  postmortem for every outage and issue that happens, and runbooks to fix them.”

What does the future hold for Envoy’s data product?

There are plans to integrate the Data Platform with the Envoy Product Experience. Arvind’s team has started to use the data platform for advanced data products, like churn prediction and machine learning. Envoy is trying Amazon SageMaker for that. 

Another project is integrating the data platform with the customer-facing analytics. “Obviously, when people log into Envoy, there are already analytics. How many visitors are coming to their office, how many packages they’re scanning. That’s done through a white-labeled solution with a BI vendor”, says Arvind.

“But eventually the data team is going to build an internal microservice where the Envoy app will just request what it wants and then we will return a bunch of things that we think are valuable”. 

Watch the full video and download a complete transcript of our conversation.

How to Use Amazon Redshift For a New Generation of Data Services

In 2014 Intuit’s then-CTO Tayloe Sainsbury went all in on the cloud and started migrating Intuit’s legacy on-premise IT infrastructure to Amazon AWS. By February 2018, Intuit had sold its largest data center and processed 100% of 2018 tax filings in the cloud.

But now Intuit had a different challenge – optimizing cloud spend and allocating that spend to products and users. AWS bills customers via a “Cost & Usage Report” (“CUR”). Because of the size of its cloud spend, the Intuit CUR comprises billions of rows, and it keeps growing by the day. Intuit switched from an on-premise data warehouse and now uses Amazon Redshift to process 4-5 billion rows of raw CUR data – each day.

In this post, I’m walking you through the approach that Jason Rhoades took to build Intuit’s data pipeline with Redshift. Jason is an Architect at Intuit; he and a small data team provide business-critical data to more than 8,000 Intuit employees.

Heads up – this is a long post with lots of detail!

Three major blocks:

Let’s start with an overview of the business.

Digital Has Changed the Way Intuit Operates Its Business

Intuit builds financial management and compliance products and services for consumers and small businesses. Intuit also provides tax products to accounting professionals.

Products include QuickBooks, TurboTax, Mint and Turbo. These products help customers run their businesses, pay employees, send invoices, manage expenses, track their money, and file income taxes. Across these products, Intuit serves more than 50 million customers.

Intuit started in the 1980s and built the original version of its first product Quicken for the desktop, first MS-DOS and then Windows. With the Internet, that usage shifted to web and mobile. The impact of that change became clear as early as 2010.

Intuit Product Usage Forecast in 2014

Fast forward to today, and over 90% of Intuit’s customers file their taxes and manage their accounting online and via mobile apps.

The Cloud as a Catalyst to Handle Seasonality and Peak Demand

Consumption of Intuit products follows seasonal patterns.

Tax seasonality has the biggest impact on Intuit’s business. Each fiscal year, Intuit generates half of its annual revenue in the quarter ending on April 30th, with the US tax filing deadline on April 15th.

Seasonality also has a huge impact on the cost side of the equation. The shift to digital and online usage of Intuit’s products causes a dramatic usage spike for the IT infrastructure. Most users file their taxes online during the last two days of the tax season.

In the old world of on-premise infrastructure, and to handle the concurrent usage, Intuit had to size their data center for peak capacity. After tax season, demand drops back down to average usage. The gap between peak demand and average usage is so large, that 95% of Intuit’s infrastructure would sit idle for 95% of the year.

That’s why Intuit decided in 2014 to go all in with the cloud. With the cloud’s elasticity, Intuit is in a better position to accommodate spikes in customer usage during the tax season.

Shifting Priorities: From Migration Speed to Efficient Operations & Growth

By shifting to the cloud, Intuit reduced cost by a factor of six because it no longer maintained idle servers for an application only active during tax season. After the first success, Intuit moved more applications, services and enabling tools to the cloud. Today, over 80% of Intuit’s workloads are running in the cloud.

With AWS usage now growing, the priorities of the program shifted from migration speed to efficient operations and growth.

Intuit now spends hundreds of millions of dollars on prepaid AWS services (“reserved instances”, or “RIs” for short) alone, plus fees for on-demand usage during the peaks. Interest grew in understanding the use of different AWS services and spend by different business units and teams within Intuit.

The source for that information sits in the “Cost & Usage Report” (“CUR”), a bill that Amazon AWS delivers to every customer. The CUR includes line items for each unique combination of AWS product, usage type, and operation, along with pricing. The CUR also contains information about credits, refunds and support fees.

Analyzing CUR data supports Intuit’s cloud program with two major use cases:

  1. Cost optimization. The goal is to understand opportunities to lower Intuit’s cloud spend. With hundreds of millions of dollars of spend on cloud infrastructure, the difference between on-demand usage vs. purchasing RIs can mean six-figure savings per day. While humans look at cost data to make purchase and modification decisions, Intuit also has automated routines that take action based on the data.
  2. Cost allocation. The goal is to forecast and distribute the cost of using cloud resources. Unlike in the old on-premise world, things are dynamic, and engineers run load tests and spin up new services all the time. They are trying to understand “how much is this costing me?”

To build these two use cases, Jason’s team needs to transform the raw CUR data into a format consumable by the business. The raw CUR data comes in a different format from what Intuit uses to charge internal parties, distribute shared costs, amortize RIs and record spend on the general ledger.
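
The transformations themselves are, at their core, aggregations over CUR line items. As a simplified sketch (the column names follow the general flattened CUR naming convention, but the tag key and output table are assumptions, not Intuit’s actual model):

```sql
-- Illustrative roll-up of raw CUR line items into daily cost per team.
CREATE TABLE finance.daily_cost_by_team AS
SELECT
    line_item_usage_start_date::date AS usage_date,
    resource_tags_user_team          AS team,
    line_item_product_code           AS product_code,
    SUM(line_item_unblended_cost)    AS unblended_cost
FROM raw.cur_line_items
GROUP BY 1, 2, 3;
```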

Jason’s team traditionally ran the analytics on the CUR data in an on-premise data warehouse.

The Next Bottleneck – the On-premise Data Warehouse

Unlike at most companies, Intuit’s CUR is very large. In 2017, it was around 500M rows at the end of a month.

Amazon delivers the report to Intuit 3-4x per day, and restates the rows with each report over the course of a month, meaning the report gets longer with each delivery. Coupled with a growing business, the amount of data the cluster has to process each time grows by the hour – literally.

You can see that trend play out in the chart above, with data from 2017. The grey area indicates the batch size for the CUR data. Each day, the batch size gets bigger as the number of rows in the CUR grows. At the end of the month, the CUR reaches about 500 million rows and resets on day one of the new month.

The number of rows the warehouse processes per minute stays constant at around 1 million rows per minute. Therefore, the time it takes the warehouse to process each batch (“batch duration”) goes up in linear fashion. With 500M rows at the end of the month, it takes the warehouse 500 minutes to process the full report, or 8 hours and 20 minutes.

Now extrapolate forward and calculate what that looks like in the future. With rising cloud spend, the data team realized that the CUR would start to blow up in size. In fact, today the CUR is larger by a factor of 10x with ~5 billion rows. Now we’re talking over 80 hours, almost four days.

3 Challenges for Data Teams: More Data, More Workflows, More People

Intuit’s situation is a common scenario we see our customers run into: “More data, more workflows, more people”.

For Intuit, it was clear that “keep on doing what we’re doing” was not an option. In a world where data is an asset, data and DevOps teams should focus on the value-creation part of pipelines.

With cloud usage and data volume going up, the old on-prem warehouse was already running into bottlenecks, and so the analytics team followed the business team into the cloud.

Building a Data Services Layer in the Cloud

The Intuit team followed their product team into the AWS cloud. The major goals included handling the explosion in data volume and adding value to the business.

With on-demand access to computing resources, access to usage data in near real-time is fundamental for Intuit’s business teams. Unlike in the old world, waiting for a report at the end of the month doesn’t work anymore. With the scale of Intuit’s cloud operations, a few hours of freshness have a substantial impact on the company.

Cloud analytics with Amazon Redshift

Jason migrated the entire stack from Oracle to Redshift, and deployed the same SQL and ETL processes.

Redshift handled the growth in data volume. Three major data points:

  1. The volume of total rows processed (grey area) goes up for each day of a given month as the size of the CUR grows, to about 4 billion rows per batch.
  2. The number of rows that Redshift processes every minute (yellow line) goes up as the size of the CUR grows, to about 100 million rows per minute.
  3. The batch duration (red line) to process a full CUR stays within 30-40 minutes.   

You can also see that the size of the grey area has a step change in April – tax season!  The change is due to new capabilities Intuit introduced, which tripled the number of rows of the bill (“more data”).

Despite tripling the number of rows, the batch duration stays within a narrow band and doesn’t spike. That’s because batch size and number of rows processed per minute grow at the same rate.  In other words, the cluster processes more data faster, i.e. performance goes up as workloads grow.

Let’s dive into how Jason’s team achieved that result.

Building A Data Architecture That Supports the Business

The cluster architecture and the data pipeline follow the best practices we recommend for setting up your Amazon Redshift cluster. In particular, pay attention to setting up your WLM to separate your different workloads from each other.
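
One simple way to keep workloads separated, assuming the WLM configuration defines queues matched on query groups (the group name below is just an example), is to have each workload tag its session with a query group:

```sql
-- Route this session's queries to the WLM queue matching the 'etl' query
-- group; dashboards and ad-hoc users would use a different group.
SET query_group TO 'etl';

-- ...staging and transformation statements run here...

RESET query_group;
```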

You can see the three major workloads in the architecture chart – stage, process and consume.

Among our customers, “ELT” is a standard pattern, i.e. the transformation of data happens in the cluster with SQL. Cloud warehouses like Redshift are both performant and scalable, to the point that data transformation use cases can be handled much better in-database than in an external processing layer. SQL is concise, declarative, and you can optimize it.

Intuit follows the “ELT” vs. “ETL” approach. With a lot of SQL knowledge on the team, they can build transformations in SQL and run them within the cluster. AWS drops the CUR into an S3 bucket, from which Intuit extracts the raw data (the “E”) into the staging area. Intuit leaves the raw data untouched and loads it into the cluster (the “L”), to then transform it (the “T”).

Underneath the processes is an orchestration layer that coordinates workflows and manages dependencies. Some workflows need to execute on an hourly or daily basis, others on arrival of fresh data. Understanding the workflows and their execution is a crucial component for data integrity and meeting your SLAs.

When workflows and data pipelines fail –  and they will – you have to a) know about it as it happens and b) understand the root cause of the failure. Otherwise you will run into data integrity issues and miss your SLAs. In Intuit’s case, the key SLA is the near real-time character of the data.

In intermix.io, you can see these workflows via our “Query Insights”.

You can double-click into each user to see the underlying query groups and dependencies. As the engineer in charge, that means you can track your workflows and understand which user, query and table are the cause of any issues.

End-to-end Data Flow, Toolchain and Business Services

Let’s go through the individual steps of the data flow and the technologies involved in orchestrating the workflows.

Stage

S3 is the demarcation point. AWS delivers the CUR into S3. With various data sources next to the CUR, it’s easy for people to put data into an S3 bucket. Loading data into Redshift from S3 is easy and efficient with the COPY command.
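
A staging load of a CUR delivery might look like the following; the bucket, prefix, IAM role, and table name are placeholders:

```sql
-- Sketch: load one CUR delivery from S3 into a staging table.
COPY stage.cur_line_items
FROM 's3://example-billing-bucket/cur/2019/04/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-copy-role'
CSV
GZIP
IGNOREHEADER 1;
```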

Process

Amazon Redshift is the data platform. The workers for ingestion and post-ingestion processing include Lambda and EC2. Intuit uses Lambda wherever possible, as they prefer to not have any persistent compute they need to monitor or care for (patching, restacking, etc.).

Lambda functions can now run for 15 minutes, and for any job that runs under five minutes, the stack uses a Lambda function. For larger jobs, they can deploy the same code stack on EC2, e.g. for staging the big CUR.

Orchestrate

AWS Step Functions coordinate the Lambda jobs. SNS triggers new workflows as new data arrives, while CloudWatch schedules batch jobs. For example, when a new CUR arrives in an S3 bucket, processing needs to start right away rather than waiting for a specific time slot. RDS helps to maintain state.

Consume

Data consumption happens across three major categories.

  1. Generic downstream consumers, where the landing zone for the transformed data is Intuit’s data lake in S3. Moving data from Redshift into S3 is fast and efficient with the UNLOAD command.
  2. A growing contingent of data scientists who run machine learning and artificial intelligence algorithms, with SageMaker as their platform of choice. They can query data in Redshift, or call a growing set of APIs that run on Lambda with programmatic access to data.
  3. Business intelligence tools and dashboards to run the cost allocation programs, such as Tableau, Qlik, and QuickSight. This layer sees most of the consumption. Product managers have near real-time insights into the true allocated cost to make business choices.

Intuit supports new data use cases with Redshift, such as data APIs. Some of the use cases have a transactional character that may require many small writes.

Instead of trying to turn Redshift into an OLTP database, Intuit combines Redshift with PostgreSQL via Amazon RDS. By using dblink you can have your PostgreSQL cake and eat it too. By linking Amazon Redshift with RDS PostgreSQL, the combined feature set can power a broader array of use cases and provide the best solution for each task.
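
A minimal sketch of the dblink pattern, run on the RDS PostgreSQL side (connection details and table names are placeholders):

```sql
-- Enable dblink on the PostgreSQL instance, then query Redshift remotely.
CREATE EXTENSION IF NOT EXISTS dblink;

SELECT team, total_cost
FROM dblink(
    'host=example-cluster.abc123.us-west-2.redshift.amazonaws.com port=5439 dbname=analytics user=example_user password=example_password',
    'SELECT team, SUM(unblended_cost) FROM finance.daily_cost_by_team GROUP BY team'
) AS remote(team varchar, total_cost numeric);
```

The result set behaves like a local table, so it can be joined against the small transactional tables that live in PostgreSQL.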

Comparing “Old” vs “New” – Benefits & Lessons Learned

Unlike with “all in one” data warehouses like Oracle or SQL Server, Redshift doesn’t offer system-native workflows. This can be intimidating at first.

Instead, AWS takes the approach of providing a broad collection of primitives for low-overhead compute, storage, and development services. Coupled with a rich tool ecosystem for Redshift, you can build a data platform that allows for higher performing, more scalable and lower cost solutions than previously possible.

Overall, the migration ushered Intuit into a new era of data productivity. The platform:

Meanwhile, the new data platform saves Intuit millions in cloud infrastructure spend, and transforms the decision-making process for 8,000+ employees.

A New Way of Working with Data

With the new platform in place, Intuit is architecting a number of new use cases.

Data Lake Architecture

Long-term trends in the CUR data are interesting, but for cost optimization, analysts are mostly interested in the most recent data. It makes sense to unload data from the largest tables in Redshift into S3 in Parquet format. That saves cost and increases flexibility by separating storage and compute.
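
The unload itself can be a scheduled statement along these lines (the table, bucket, role, and retention window are assumptions):

```sql
-- Sketch: move CUR line items older than three months to the data lake as Parquet.
UNLOAD ('SELECT * FROM raw.cur_line_items
          WHERE line_item_usage_start_date < DATEADD(month, -3, GETDATE())')
TO 's3://example-data-lake/cur/history/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-unload-role'
FORMAT AS PARQUET;
```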

Data Lifecycle Management

Once data is in S3, other (serverless) query engines like Athena or Redshift Spectrum can access it. The main fact tables in the Intuit cluster are based on date – the CUR is a bill. The date serves as the criterion for when to unload data. For example, you may only want to keep the most recent quarter of data within the cluster. By keeping historic data in S3 and using Spectrum to query it, you scale data outside of Redshift but keep retrieval seamless and performant.
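
A hedged sketch of that setup: an external schema pointing at the data lake catalog, and one query spanning hot data in the cluster and history in S3. All names are placeholders, and the external history table is assumed to already be defined in the catalog.

```sql
-- Map a Glue/Athena catalog database into Redshift as an external schema.
CREATE EXTERNAL SCHEMA spectrum_cur
FROM DATA CATALOG
DATABASE 'cur_history'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- One query over recent data in the cluster plus historical data in S3.
SELECT usage_date, SUM(unblended_cost) AS total_cost
FROM (
    SELECT usage_date, unblended_cost FROM finance.daily_cost_by_team
    UNION ALL
    SELECT usage_date, unblended_cost FROM spectrum_cur.daily_cost_by_team_history
) combined
GROUP BY usage_date;
```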

In intermix.io, you can filter for Spectrum queries by row count and scan size. You can also track their execution time and queue wait time. In the screenshot below you see those metrics, including the uptick in Spectrum queries at the beginning of June.

Data Science

The cost optimization program has delivered massive benefits. Teams know and predict computing costs in near real time. Deploying ML/AI capabilities against the CUR will allow making even smarter decisions – even a 1% improvement pays huge dividends.

Intuit expects the number of data scientists to go up several-fold, and along with it the query volume. These query patterns are more complex and less predictable. Concurrency Scaling offers an option to add more slots to a cluster to accommodate that incremental query volume, without adding nodes.

It’s a new way of working with data compared with the old, on-premise warehouse. Intuit is now in a position to embed a data services layer into all of Intuit’s products and services.

That’s all, folks!

That was a long post, and I hope it gave you a good peek behind the curtain at how Intuit built its platform, along with enough information to get started with your own data platform.

Now, I’d love to learn from you! Is there anything you can share about your own experience building a data platform? And if you want faster queries for your cloud analytics, and spend less time on Ops and more time on Dev like Intuit, then go ahead and schedule a demo or start a trial for intermix.io.

The 5 Types of Data Sources that Every Enterprise Should Think About

In broad terms, we see five major categories of data sources:

  1. Production data: Data coming from core web and mobile apps and / or the line of business apps, and their underlying production databases that contain user data and profiles. Examples are relational databases like Amazon Aurora or NoSQL databases like DynamoDB.
  2. Sensor data: Data from connected devices / IoT devices like cell phones, vehicles, appliances, buildings, meters and machinery. The sensors collect a constant stream of environmental and usage data.
  3. Event data: Event data (also “behavioral data”) describes actions by users or entities, and contains three pieces of information: an action, a timestamp and a state. Event data is very rich and can have hundreds of properties. Examples are clickstream data for a web application, or log data from connected devices.
  4. SaaS data: Data from SaaS systems to support the customer lifecycle and the lines of business. Examples are data from systems for marketing automation (customer acquisition), in-app engagement (analytics), payments (monetization), or account management / support (CRM).
  5. 3rd party data: Data that’s coming from private data brokers or government agencies, to enrich and provide additional context to existing in-house data sources. Examples are weather data, census data, or credit card transactions.