One of the things we hear most often from our customers is that data is becoming more and more critical to their business.
Take for example one customer who builds hardware devices for consumers. They use their Amazon Redshift cluster to monitor and provide support for customers on the health of those devices (billions of events per day) and also to manage the complex supply chain for their overseas production lines.
This company relies on the data warehouse to run their business, and if the data pipelines don’t complete in time, or queries are slow or fail – the business loses revenue and customer satisfaction plummets.
We see this trend across all of our customers. In response – businesses are beginning to demand service level agreements (SLAs) for queries and reports built on top of the data warehouse. These SLAs have a number of different service level indicators – the most prominent are:
In fact, we have one customer that demands a 10-second SLA for any query executed through their BI reporting tool. Analysts get fed up with the “spinning wheel of death” while waiting for a dashboard to load or a query to complete.
Measuring service levels for query latency is trickier than it seems. To understand why, consider the sheer scale of workloads that a typical data warehouse handles. Our average customer runs over one million queries per week on their Amazon Redshift cluster. These queries span three categories of applications:
At the same time, each category of workload is growing:
So a single SLA across all the queries is meaningless. What you really need is to measure the SLA on a very granular basis – for each data application, down to the individual user level. For a real-world example, consider a company using Looker as their BI application. How do you easily filter out all the noise to see only the Looker queries? How do you monitor the performance of a single model?
Measuring SLAs is hard, but it is only the beginning of the journey. Once you are able to measure the SLA for each of your data applications, the next step is to improve it by finding and fixing the issues impacting your workloads and users.
The Cluster Health Dashboard helps data teams measure and improve SLAs. It does this by surfacing:
To see latency distribution for any connected app (see here for a list of supported apps), simply modify the filter at the top-right of the screen.
Query latency follows a power-law distribution – and most queries in the warehouse run rather quickly, with a long tail of slow queries.
Measuring query latency is best done by looking at different points across that distribution – the most common being the P95 and P99 latencies. These are shown prominently at the top of the page.
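As a hedged illustration (not intermix.io's actual implementation), here is how P95 and P99 can be computed with the simple nearest-rank method over a long-tailed sample of query runtimes:

```python
# Illustrative nearest-rank percentile calculation over query latencies
# (in seconds). Monitoring tools may use interpolating variants instead.
def percentile(latencies, p):
    """Smallest value that is >= p percent of the data (nearest rank)."""
    ranked = sorted(latencies)
    k = -(-p * len(ranked) // 100) - 1  # ceil(p*n/100), as a 0-based index
    return ranked[max(0, k)]

# Long-tailed sample: most queries are fast, a few are very slow.
sample = [0.2] * 90 + [1.5] * 8 + [30.0, 120.0]
p95 = percentile(sample, 95)
p99 = percentile(sample, 99)
print(p95, p99)  # 1.5 30.0
```

On a sample like this, an average or median looks deceptively healthy, while P95 and P99 expose the slow tail that users actually feel.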
Further, we show you a distribution of latencies across the entire cluster, grouped by user (by default). Click on the chart to quickly go to the queries on the Discover page.
Example of SLA metrics for “Looker PDT” queries.
Our team has significant experience running Amazon Redshift at scale. In fact, we run a very large fleet of Amazon Redshift clusters as part of our product architecture. Based on this experience, we surface observations and recommendations on how to optimize the cost and performance of the cluster. The Recommendations cover things like:
The “Top 5 Longest Running Queries” list defaults to showing you the slowest queries in your cluster over the past day.
Click on any of these queries to go to a detail page showing recommendations for optimizing that query.
In recent months, there’s been a wave of acquisitions and fundings in the business intelligence (“BI tools”) and analytics space:
That’s a lot of movement in a short time for a segment of enterprise software that’s been somewhat static over the past 25 years.
A major driver of the uptick in M&A and venture activity is the rise of cloud warehouses like Amazon Redshift, Google BigQuery and Snowflake. They have brought down the cost and complexity of building a data platform, in a shift away from Hadoop, with BI tools as the catalyst that makes data exploration and visualization available to a much wider audience.
We believe that visualizations and charts are a commodity. What seems to matter to buyers of analytics is the “eye-candy” of the dashboards – yet much of the hard work to build a data platform happens underneath the visualization layer, e.g. data acquisition and ingestion, ETL pipelines and data discovery layers.
There’s a bit of confusion about how these tools work and their use cases. So in this post, we’re helping you find the answers by looking at the current BI tool landscape and the tools’ differentiators.
BI tools facilitate exploration and visualization of data, and for the purpose of this post, we’re only including tools that support the major data warehouses (Amazon Redshift, Snowflake, Google BigQuery, Azure SQL Data Warehouse).
Let’s start by dividing BI tools into three simple categories:
We believe the distinction between “legacy” and “modern” is key.
In the old world of on-premise warehouses like Teradata and Oracle, running queries and experiments on your warehouse was expensive. Extracting the data into the BI tool made sense, to drive down cost and give users the flexibility to transform data and explore new types of queries.
But cloud warehouses have become so performant and scalable that running transformations and queries in-database is much more efficient. With the separation of storage and compute, there are few reasons left to run queries elsewhere. And the benefit is that all employees explore one common, current dataset.
Let’s go through each of these categories and explore their key representatives in more detail.
Metabase is an open-source business intelligence tool designed for non-technical users to provide data insights and visualizations. One of the main advantages of Metabase is its excellent visualizations and the fact that you can create dashboards almost without any prior experience.
You can create dashboards and visualizations using a simple question creator with no programming. If the question creator is not enough for you, you can use basic SQL to create more complex queries.
Another useful feature is “Pulses”. With Pulses, you can send updates on specific queries (“questions”) to a Slack channel or an email address.
Metabase supports multiple relational and non-relational databases as data sources, among them Amazon Redshift, Snowflake, BigQuery, PostgreSQL, SQLite, MySQL, Druid, MongoDB, and more. You can find the detailed list in the documentation.
Superset is a modern BI tool with a simple interface that allows you to create and share dashboards. Superset comes with a lot of different, rich visualization types included. To create queries and dashboards, you need to know basic SQL.
Superset started as a project at Airbnb, as a fully customizable application to visualize and explore massive amounts of data in a fast and intuitive way. Today, Superset is one of the leading open-source BI solutions.
Superset provides an advanced security management system, with flexible adjustment of access levels for products, features and data sets. Superset also has integrations with the major authentication backends. Another attractive Superset feature is the SQL IDE editor with interactive querying.
Superset supports multiple data source types through SQLAlchemy, a Python ORM that is compatible with the most common databases (PostgreSQL, Amazon Redshift, MySQL, Oracle, SQLite, Microsoft SQL Server).
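As a hedged sketch of what connecting through SQLAlchemy means in practice: any database reachable via a SQLAlchemy connection URI can serve as a data source. SQLite stands in here; the Redshift-style URI in the comment is an assumption about your own host and credentials.

```python
# Illustrative sketch: SQLAlchemy connects to any supported database via a
# connection URI. SQLite is used as a stand-in for a real warehouse.
from sqlalchemy import create_engine, text

# e.g. "postgresql://user:password@my-cluster.example.com:5439/mydb" for Redshift
engine = create_engine("sqlite:///:memory:")

with engine.connect() as conn:
    # Any SQL the backing database understands can be issued through the engine.
    answer = conn.execute(text("select 1 + 1")).scalar()
print(answer)
```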
Redash is a simple and powerful BI tool, suitable for advanced users since you need to know SQL to work with it. Redash provides an interactive query editor for sharing both the dataset and the query that generated it.
Redash allows you to use query results as data sources to join different databases and build advanced reports and dashboards. With Redash you can set up reminders and automatic notifications (Alerts) that inform a group of people or an individual user when a specific query reaches a preset value. The biggest advantage of Redash is the vast array of different visualizations that users can create.
Redash provides the most impressive list of data sources of all the open-source BI tools in our review. You can use almost any data source you can imagine, from CSV to Elasticsearch, and of course the major data warehouses like Amazon Redshift, Snowflake and BigQuery. You can explore the full list here.
Tableau is one of the leading BI solutions, offering a wide range of visualization capabilities, extensive documentation, and a simple interface. Tableau is friendly for beginners: most functions are no more than two clicks away, filters are easy to find, and all operations are clearly documented.
A key differentiator for Tableau is data blending, a method for combining data from numerous databases and sources. Tableau also allows multiple users to work on a report in real-time simultaneously.
Tableau supports a wide range of data sources. These can be files (CSV, JSON, MS Excel, etc.), relational and non-relational databases (PostgreSQL, MySQL, SQL Server, MongoDB, etc.) and cloud warehouses (Amazon Redshift, Oracle Cloud, Google BigQuery, Microsoft Azure).
Tableau provides several pricing plans for different needs:
A free trial is also available.
Domo is a powerful cloud-based BI tool with solid data visualization capabilities that integrates with multiple data sources, including spreadsheets, databases, social networks, and almost any other existing cloud or local software solutions.
Domo provides a wide range of connectors (over 100 connectors right out of the box) and impressive sharing features. But its user interface is not intuitive, and this tool is much less friendly to newcomers than some other competitors.
There are several pricing options:
Standard – $83/user per month (billed annually).
Professional – $160/user per month (billed annually).
Enterprise – $190/user per month (billed annually).
Qlik Sense is a powerful BI solution. Its Associative Engine automatically finds relationships inside the data. This simplifies the study of data and allows easy combination of any data sources, no matter how large or complex they are, into a single view.
Qlik Sense lets you explore data at any level of detail you need, examine the shape of the data, and pinpoint outliers. AI-powered insight suggestions make sure no meaningful pattern in your data escapes your attention. Qlik Sense also lets you dive into advanced analytics calculations from R and Python with simple clicks.
Qlik Insight Bot adds AI-powered conversational analytics capabilities to Qlik Sense. With it, you can receive comprehensive reports, ask questions and discover insights using natural language.
Cloud Basic – Free
Cloud Business – $15 per user/month (billed annually).
Power BI is a business intelligence tool from Microsoft, with all the benefits that the Microsoft services ecosystem provides. Power BI connects with the company’s main products, such as MS Excel, Azure Cloud Service, and SQL Server. Power BI is named a Leader in Gartner’s Magic Quadrant for Analytics and Business Intelligence Platforms.
The Power BI interface is simple and clear, and it will be familiar to Windows users. Visualizations are created by drag-and-drop: to create any graphic, click on the desired item and drag it to an empty space in the report.
Power BI supports plenty of ways to incorporate or import your data (streaming data, cloud services, excel spreadsheets and third-party connections) and provides a simple API for integration with your applications.
Looker is a new generation of BI tool, fundamentally different from other solutions. It offers LookML, its own data modeling language, which simplifies the description of dimensions, calculations (measures), and data relationships in a database. With LookML, database queries become more descriptive and less error-prone.
Once you have created a data model with LookML, describing the dimensions you want to group the data by and how you want to measure it, you can open access to this model to other users, who can use it to create their own queries, reports, and dashboards through a simple interface. Thus the work is divided and optimized: data analysts get a convenient modeling tool, and business users are freed from writing complex SQL queries.
You do need to learn LookML and data modeling before building visualizations, but Looker provides educational materials that make learning the language easy. (Learn more about Looker and Amazon Redshift Dashboards.)
Looker innovates in how teams collaborate on BI projects. Each project or data analysis can be saved as a Git project, so you can use the full power of modern version control when collaborating on data.
Looker supports about 45 different data sources. You can see the full list on this page.
Looker’s price ranges from $3,000 to $5,000 per month for 10 users. Additional users are $50 per month.
With intermix.io, Looker users get all the tools they need to optimize their queries running on Amazon Redshift. For more info on how this feature works, visit intermix.io/looker.
Chartio is a great BI tool for advanced users with excellent data blending, data transformation, data sharing capabilities and robust security options.
Chartio provides two modes of work: a drag-and-drop interactive mode and a SQL mode. While its UI is quite powerful, it is not as easy for inexperienced users as the UIs of some competitors. Chartio has built a visual version of SQL, which enables you to explore, transform and visualize data on the fly through a drag-and-drop interface.
Chartio’s visual SQL allows you to quickly combine and compare data from different sources. For example, you can pull data from Google Analytics and compare it with campaign history from Mailchimp. All this can be done in a few clicks, aided by extensive documentation.
Chartio supports about a hundred different data sources, including relational and non-relational databases and a bunch of SaaS partners. You can see the full list on this page.
With intermix.io, Chartio users get all the tools they need to optimize their queries running on Amazon Redshift. For more info on how this feature works, visit intermix.io/chartio.
Mode is a lightweight framework for business intelligence that runs directly in the browser. The tool allows running complex queries using SQL, Python, or R. The query editor is very intuitive and simple, even for novices. For Python and R, you can perform data analysis in Mode Notebooks. More than 60 Python and R libraries for data wrangling and analysis are available out of the box, and you can install other public packages if needed. Mode also supports the creation of reports that you can customize and share with different stakeholders.
Mode Analytics is at its best for fast, lightweight data analysis. In-browser operation, drag-and-drop reporting, a simple interface, a large number of built-in features, and integration with SQL and popular data analytics languages like Python and R make Mode Analytics a perfect choice for fast prototyping and sharing ideas.
Mode Analytics has two pricing options:
With intermix.io, Mode users get all the tools they need to optimize their queries running on Amazon Redshift. For more info on how this feature works, visit intermix.io/mode.
Periscope Data is one of the newer BI tools. It supports connections to different data sources and data preparation using SQL, Python, and R. It also lets you perform predictive analytics and unstructured data analysis (for example, natural language processing).
Periscope Data has a lot of built-in visualizations. The provided templates even include exotic charts such as radar charts, cohort grids, and gauge charts, and charting with maps is also available. In addition, you can use Python and R libraries to build your own graphs, so there is plenty of room for customization.
As many users say, there is virtually no learning curve for Periscope Data. It is extremely simple, with an easy-to-understand drag-and-drop interface; whatever you need is usually no more than a few clicks from your current location in the app. Sharing and collaboration capabilities are also great.
Periscope Data does not publish enterprise pricing on its official website. However, a free trial is available to get started with the software, and customers can then consult the sales team about pricing packages.
With intermix.io, Periscope Data users get all the tools they need to optimize their queries running on Amazon Redshift. For more info on how this feature works, visit intermix.io/periscope.
On the chart below, you can see the search trends for the three open-source BI tools described in this article: Metabase, Apache Superset, and Redash. At the time of writing, the most popular tool is Metabase, roughly twice as popular as its closest competitor, Redash. You can also see that while the popularity of Apache Superset and Redash remains relatively stable, the popularity of Metabase has started to grow in the last couple of months.
As mentioned earlier, “legacy” commercial BI tools include Tableau, Domo, Qlik, and Power BI. Their search popularity on Google Trends is depicted in the graph below. Tableau is the undoubted leader here; its popularity is substantially higher than that of all the other tools. Power BI ranks second. Domo is a little more popular than Qlik, but they sit at roughly the same level. There is no clear trend for any platform, apart from occasional peaks driven by one-time events (e.g. the Tableau acquisition by Salesforce). The flat, sideways trend for all of these tools underscores once more their maturity in the BI world.
The leader among the “modern” BI tools is even clearer than among the “legacy” tools: Looker towers over the rest. To appreciate the gap, remember that Google Trends displays relative popularity. At the time of writing, Looker scores 37 points, while Chartio, Mode, and Periscope Data score 2, 3, and 1 point respectively. Looker saw a huge spike in popularity due to its acquisition by Google. Otherwise, no clear trend emerges for any of these tools; they all remain at roughly the same level.
In this article, we highlighted some of the most popular business intelligence tools. We split them into open-source and commercial offerings; in turn, the commercial tools were loosely split into “legacy” and “modern”.
“Legacy” BI tools are probably the most well-known. They are regarded as classic platforms, and every BI analyst should know at least some of them.
“Modern” BI tools have emerged relatively recently. These tools provide new value for data users, and in many aspects they are worthy competitors to the “legacy giants”.
We briefly described the features of each tool and tried to gauge their popularity and current trends. Despite the number of BI tools available today, there is no “Swiss Army knife”. In each case, a person or organization should consider their use case, weigh what they need and which resources are available, and then decide which tool is the optimal choice.
Query optimization that dramatically reduces runtime for queries that use window functions.
A common problem we see our customers solving in SQL is how to compare two rows per group in a dataset. For example, say you have the following data in the product_prices table:
How do you efficiently SELECT the price change at each updated_on time for each Product? I.e. we’d like to see a result set of:
| Product | Most recent updated time | Price change |
| --- | --- | --- |
This type of problem arises in a lot of domains. Another common example of this problem occurs when you want to compare the time difference between subsequent events. For example, you may have a table that logs timestamps for each page visit of a user to your website and you want to do an analysis on the time difference between visits.
A common but sub-optimal way we see customers solve this problem is by using the ROW_NUMBER() window function together with a self join. The algorithm is straightforward: first select all your product prices and order them within each product by updated_on using the ROW_NUMBER() window function. Then self join on the table, mapping each row to the row preceding it. In SQL:
with ordered_prices as (
  select product, updated_on, price,
    row_number() over (partition by product order by updated_on desc) as row_number
  from product_prices
)
select op2.product, op2.updated_on, op2.price - op1.price as price_change
from ordered_prices op1 join ordered_prices op2
  on op1.product = op2.product and op1.row_number = op2.row_number + 1
This query produces the desired result set, but at the cost of a join. For real-world (i.e. less contrived) examples, solutions like this also often need to incorporate additional sub-selects or other expensive operations which compound the effect of the join and can make the query run excruciatingly slow.
This class of problem can manifest itself in other ways as well: sometimes instead of a self join, the query may just select from ordered_prices where row_number=1 to get the most recent updated_on and price. But the general pattern is the same: using ROW_NUMBER() to label rows for further processing/joining.
If you’re using ROW_NUMBER() in a query just to apply filters or joins later on in your transformation, there may be a better way. Rather than co-opting ROW_NUMBER() to compare neighboring rows, a better solution is to apply a different window function that was designed to solve this problem: LAG(). The LAG window function returns the value from a row offset from the current row. The resulting SQL becomes:
select product, updated_on,
  lag(price) over (partition by product order by updated_on desc) - price as price_change
from product_prices
This SQL says that we should order the rows within each product by updated_on, and take the difference of the price from the current row with the price of the following row within that group. This eliminates the need for the join and can greatly simplify the query plan. One note: this result set will include NULL rows at the start of each product (i.e. where there is no previous row), but these can be trivially eliminated with a WHERE clause of price_change is not null when the results are used (or this query can be wrapped in a subselect to remove those rows).
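To sanity-check the rewrite, here is a small self-contained sketch using SQLite (3.25+ supports window functions) as a stand-in for Amazon Redshift; the sample rows are invented for illustration:

```python
# Demonstrating the LAG() rewrite on a toy product_prices table.
# SQLite stands in for Amazon Redshift; sample data is made up.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table product_prices (product text, updated_on text, price real)")
con.executemany(
    "insert into product_prices values (?, ?, ?)",
    [
        ("widget", "2020-01-01", 10.0),
        ("widget", "2020-01-02", 12.0),
        ("widget", "2020-01-03", 11.0),
        ("gadget", "2020-01-01", 5.0),
        ("gadget", "2020-01-02", 7.0),
    ],
)

rows = con.execute("""
    select * from (
        select product, updated_on,
               lag(price) over (partition by product order by updated_on desc) - price
                 as price_change
        from product_prices
    )
    where price_change is not null   -- drop the rows with no following price
    order by product, updated_on
""").fetchall()
for row in rows:
    print(row)
```

No self join is needed: each row's change is computed in a single pass over the table.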
The seed for this blog post was planted when a potential customer came to us with the same question we hear all the time: “Why are our queries slow?” After instrumenting their cluster with intermix.io they were able to quickly identify their worst queries and find out why they are slow and what to do about it. In their case, one of the worst offending queries used this ROW_NUMBER() pattern and triggered a large number of Query Recommendations in their intermix.io Dashboard, including:
Since the query exhibited the ROW_NUMBER() pattern above, it caused multiple unnecessary table scans, an expensive join, significant IO skew, and a large amount of intermediate data to be written to disk during execution. The query operated over a single table which had 876 million rows–not a huge table, but large enough to bring Amazon Redshift to its knees if not optimized properly–and was taking over 5.5 hours to execute!
After reviewing our Recommendations, they reimplemented their query using the LAG function to eliminate the join (which greatly simplified their query by removing two additional sub-selects), and the query execution time dropped to 30 seconds. From 5.5 hours to 30 seconds simply by reviewing targeted Query Recommendations and implementing a straightforward solution–talk about a good ROI on time spent optimizing SQL. Their exact feedback after seeing the new query runtimes? “OMFG. takes 30s. Wow”
Amazon Redshift is a fantastic general-purpose data warehouse. But since it is so easy to add data and new workloads to your cluster, it can be very hard to know where and how to start optimizing SQL queries.
Analysts and data engineers often wonder which queries are the worst offenders and where limited SQL tuning effort should go to give the biggest bang-for-the-buck. Knowing where to put query tuning effort is crucial since seemingly simple-looking queries can sometimes have inefficiencies lurking which can compound to slow a query by several orders of magnitude.
The good news: this is the problem that intermix.io Query Recommendations was designed to solve. We help you pinpoint the root cause of query slowness so that you can spend your limited time optimizing the right queries in the right way.
In 2014 Intuit’s then-CTO Tayloe Sainsbury went all in on the cloud and started migrating Intuit’s legacy on-premise IT infrastructure to Amazon AWS. By February 2018, Intuit had sold its largest data center and processed 100% of 2018 tax filings in the cloud.
But now Intuit had a different challenge – optimizing cloud spend and allocating that spend to products and users. AWS bills customers via a “Cost & Usage Report” (“CUR”). Because of the size of its cloud spend, the Intuit CUR comprises billions of rows, and it keeps growing by the day. Intuit switched from an on-premise data warehouse and now uses Amazon Redshift to process 4-5 billion rows of raw CUR data – each day.
In this post, I’m walking you through the approach that Jason Rhoades took to build Intuit’s data pipeline with Redshift. Jason is an Architect at Intuit, and with a small data team, they provide business-critical data to more than 8,000 Intuit employees.
Heads up – this is a long post with lots of detail!
Three major blocks:
Let’s start with an overview of the business.
Intuit builds financial management and compliance products and services for consumers and small businesses. Intuit also provides tax products to accounting professionals.
Products include QuickBooks, TurboTax, Mint and Turbo. These products help customers run their businesses, pay employees, send invoices, manage expenses, track their money, and file income taxes. Across these products, Intuit serves more than 50 million customers.
Intuit started in the 1980s and built the original version of its first product Quicken for the desktop, first MS-DOS and then Windows. With the Internet, that usage shifted to web and mobile. The impact of that change became clear as early as 2010.
Fast forward to today, and over 90% of Intuit’s customers file their taxes and manage their accounting online and via mobile apps.
Consumption of Intuit products follows seasonal patterns.
Tax seasonality has the biggest impact on Intuit’s business. Each fiscal year, Intuit generates half of its annual revenue in the quarter ending on April 30th, with the US tax filing deadline on April 15th.
Seasonality also has a huge impact on the cost side of the equation. The shift to digital and online usage of Intuit’s products causes a dramatic usage spike for the IT infrastructure. Most users file their taxes online during the last two days of the tax season.
In the old world of on-premise infrastructure, and to handle the concurrent usage, Intuit had to size their data center for peak capacity. After tax season, demand drops back down to average usage. The gap between peak demand and average usage is so large, that 95% of Intuit’s infrastructure would sit idle for 95% of the year.
That’s why Intuit decided in 2014 to go all in with the cloud. With the cloud’s elasticity, Intuit is in a better position to accommodate spikes in customer usage during the tax season.
By shifting to the cloud, Intuit reduced cost by a factor of six because it no longer maintained idle servers for an application only active during tax season. After the first success, Intuit moved more applications, services and enabling tools to the cloud. Today, over 80% of Intuit’s workloads are running in the cloud.
With now growing usage of AWS, the priorities of the program shifted from migration speed to efficient operations and growth.
Intuit now spends hundreds of millions of dollars on prepaid AWS services (aka “reserved instances”, or “RIs” for short) alone, plus fees for on-demand usage during the peaks. Interest grew in understanding the use of different AWS services and the spend by different business units and teams within Intuit.
The source for that information sits in the “Cost & Usage Report” (“CUR”), a bill that Amazon AWS delivers to every customer. The CUR includes line items for each unique combination of AWS product, usage type, and operation and of course pricing. The CUR also contains information about credits, refunds and support fees.
Analyzing CUR data supports Intuit’s cloud program with two major use cases:
To build these two use cases, Jason’s team needs to transform the raw CUR data into a format consumable by the business. The raw CUR data comes in a different format from what Intuit uses to charge internal parties, distribute shared costs, amortize RIs and record spend on the general ledger.
Jason’s team traditionally ran analytics on the CUR data with an on-premise data warehouse.
Unlike at most companies, Intuit’s CUR is very large. In 2017, it reached around 500M rows at the end of a month.
Amazon delivers the report 3-4x per day to Intuit, and restates the rows with each report over the course of a month, meaning it gets longer with each delivery. Coupled with a growing business, the amount of data the cluster has to process each time grows by the hour – literally.
You can see that trend play out in the chart above, with data from 2017. The grey area indicates the batch size for the CUR data. Each day, the batch size gets bigger as the number of rows in the CUR grows. At the end of the month, the CUR reaches about 500 million rows and resets on day one of the new month.
The number of rows the warehouse processes per minute stays constant at around 1 million rows per minute. Therefore, the time it takes the warehouse to process each batch (“batch duration”) goes up in linear fashion. With 500M rows at the end of the month, it takes the warehouse 500 minutes to process the full report, or 8 hours and 20 minutes.
Now extrapolate forward and calculate what that looks like in the future. With rising cloud spend, the data team realized that the CUR would start to blow up in size. In fact, today the CUR is larger by a factor of 10x with ~5 billion rows. Now we’re talking over 80 hours, almost four days.
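The extrapolation is simple arithmetic; this sketch uses the throughput figure from the text (~1 million rows per minute) to reproduce both numbers:

```python
# Back-of-the-envelope extrapolation of batch duration at a constant
# throughput of ~1 million rows per minute (figure from the text).
ROWS_PER_MINUTE = 1_000_000

def batch_duration_hours(cur_rows):
    """Hours needed to process a CUR of the given size."""
    return cur_rows / ROWS_PER_MINUTE / 60

h_2017 = batch_duration_hours(500_000_000)     # 2017 month-end CUR
h_today = batch_duration_hours(5_000_000_000)  # 10x larger CUR today
print(round(h_2017, 1), round(h_today, 1))     # 8.3 83.3
```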
Intuit’s situation is a common scenario we see our customers run into: “More data, more workflows, more people”.
For Intuit, it was clear that “keep on doing what we’re doing” was not an option. In a world where data is an asset, data and DevOps teams should focus on the value-creation part of pipelines.
With cloud usage and data volume going up, the old on-prem warehouse was already running into bottlenecks, and so the analytics team followed the product team into the AWS cloud. The major goals: handle the explosion in data volume and add more value to the business.
With on-demand access to computing resources, access to usage data in near real-time is fundamental for Intuit’s business teams. Unlike in the old world, waiting for a report at the end of the month doesn’t work anymore. With the scale of Intuit’s cloud operations, a few hours of freshness have a substantial impact on the company.
Jason migrated the entire stack from Oracle to Redshift, and deployed the same SQL and ETL processes.
Redshift handled the growth in data volume. Three major data points:
You can also see that the size of the grey area has a step change in April – tax season! The change is due to new capabilities Intuit introduced, which tripled the number of rows of the bill (“more data”).
Despite tripling the number of rows, the batch duration stays within a narrow band and doesn’t spike. That’s because batch size and number of rows processed per minute grow at the same rate. In other words, the cluster processes more data faster, i.e. performance goes up as workloads grow.
Let’s dive into how Jason’s team achieved that result.
The cluster architecture and the data pipeline follow the best practices we recommend for setting up your Amazon Redshift cluster. In particular, pay attention to setting up your WLM to separate your different workloads from each other.
You can see the three major workloads in the architecture chart – stage, process and consume.
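Workload separation in Redshift is configured through the WLM JSON (`wlm_json_configuration`). The sketch below is illustrative only: the three queues mirror the stage/process/consume split from the architecture chart, but the concurrency and memory values are placeholders, not Intuit’s actual settings.

```python
import json

# Illustrative WLM configuration separating the three workloads.
# Queue sizes and memory splits are placeholder values.
wlm_config = [
    {"query_group": ["stage"],   "query_concurrency": 2, "memory_percent_to_use": 30},
    {"query_group": ["process"], "query_concurrency": 3, "memory_percent_to_use": 40},
    # Default queue catches everything else, e.g. consumption / BI queries
    {"query_concurrency": 5, "memory_percent_to_use": 30},
]

print(json.dumps(wlm_config, indent=2))
```

The point of the split is that a runaway BI query can’t starve the ingestion pipeline of slots or memory, because each workload is confined to its own queue.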
Among our customers, “ELT” is a standard pattern, i.e. the transformation of data happens inside the cluster, with SQL. Cloud warehouses like Redshift are performant and scalable enough that data transformation use cases are often handled better in-database than in an external processing layer. SQL is concise, declarative, and easy to optimize.
Intuit follows the “ELT” rather than the “ETL” approach. With a lot of SQL knowledge on the team, they can build transformations in SQL and run them within the cluster. AWS drops the CUR into an S3 bucket, from which Intuit extracts the raw data (the “E”) into the staging area. Intuit leaves the raw data untouched and loads it into the cluster (the “L”), then transforms it (the “T”).
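The “L” step typically boils down to a Redshift COPY statement pointed at the S3 prefix. A minimal sketch of assembling one is below; the table name, bucket path, IAM role, and the assumption that the files are gzipped CSV are all illustrative, not Intuit’s actual setup.

```python
def build_copy_statement(table: str, s3_path: str, iam_role: str) -> str:
    """Assemble a Redshift COPY statement for loading raw report files from S3.
    Table, path, role, and file-format options here are placeholders."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "GZIP CSV IGNOREHEADER 1;"
    )

sql = build_copy_statement(
    "stage.cur_raw",
    "s3://example-billing-bucket/cur/2019/05/",
    "arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(sql)
```

COPY loads in parallel across the files under the prefix, which is what makes S3-to-Redshift ingestion efficient at this scale.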
Underneath the processes is an orchestration layer that coordinates workflows and manages dependencies. Some workflows need to execute on an hourly or daily basis, others on arrival of fresh data. Understanding the workflows and their execution is a crucial component for data integrity and meeting your SLAs.
When workflows and data pipelines fail – and they will – you have to a) know about it as it happens and b) understand the root cause of the failure. Otherwise you will run into data integrity issues and miss your SLAs. In Intuit’s case, the key SLA is the near real-time character of the data.
In intermix.io, you can see these workflows via our “Query Insights”.
You can double-click into each user to see the underlying query groups and dependencies. As the engineer in charge, that means you can track your workflows and understand which user, query and table are the cause of any issues.
Let’s go through the single steps of the data flow and the technologies involved to orchestrate the workflows.
S3 is the demarcation point. AWS delivers the CUR into S3. With various data sources alongside the CUR, it’s easy for people to put data into an S3 bucket. Loading data into Redshift from S3 is easy and efficient with the COPY command.
Amazon Redshift is the data platform. The workers for ingestion and post-ingestion processing include Lambda and EC2. Intuit uses Lambda wherever possible, as they prefer to not have any persistent compute they need to monitor or care for (patching, restacking, etc.).
Lambda functions can now run for up to 15 minutes, and for any job that runs under five minutes, the stack uses a Lambda function. For larger jobs, the team deploys the same code stack on EC2, e.g. for staging the big CUR.
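That routing rule – Lambda for short jobs, EC2 for anything that risks the limit – can be sketched as a simple dispatch. The five-minute threshold comes from the text; the function name and the idea of estimating runtime up front are our illustration.

```python
LAMBDA_THRESHOLD_MINUTES = 5  # jobs expected to finish under this run on Lambda

def choose_worker(estimated_minutes: float) -> str:
    """Pick a compute target for a batch job based on its expected runtime.
    Anything at or over the threshold goes to EC2 to stay clear of the
    15-minute Lambda hard limit."""
    return "lambda" if estimated_minutes < LAMBDA_THRESHOLD_MINUTES else "ec2"

print(choose_worker(2))   # small transformation
print(choose_worker(45))  # staging the full CUR
```

Keeping the threshold well under the 15-minute hard limit leaves headroom for jobs that run longer than estimated.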
AWS Step Functions coordinate the Lambda jobs. SNS triggers new workflows as new data arrives, while CloudWatch schedules the batch jobs. For example, when a new CUR arrives in an S3 bucket, processing needs to start right away rather than wait for a specific time slot. RDS helps to maintain state.
Data consumption happens across three major categories.
Intuit supports new data use cases with Redshift, such as data APIs. Some of these use cases have a transactional character that may require many small writes.
Instead of trying to turn Redshift into an OLTP database, Intuit combines Redshift with PostgreSQL via Amazon RDS. By using dblink, you can have your PostgreSQL cake and eat it too. Linking Amazon Redshift with RDS PostgreSQL combines the feature sets of both, powering a broader array of use cases with the best tool for each task.
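The dblink pattern looks roughly like the statement below, run from the RDS PostgreSQL side against the Redshift endpoint. This is a minimal sketch: the host, credentials, table, and columns are placeholders, not Intuit’s schema.

```python
# Illustrative dblink query run from RDS PostgreSQL against Redshift.
# Connection details and table/column names are placeholders.
dblink_sql = """
SELECT *
FROM dblink(
  'host=example-cluster.redshift.amazonaws.com port=5439 dbname=prod user=app password=placeholder',
  'SELECT account_id, SUM(cost) FROM cur_summary GROUP BY account_id'
) AS cur(account_id varchar, total_cost numeric);
"""
print(dblink_sql.strip())
```

The heavy aggregation runs on Redshift, while the small, frequent transactional reads and writes stay on PostgreSQL – each engine doing what it’s built for.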
Unlike “all in one” data warehouses like Oracle or SQL Server, Redshift doesn’t offer system-native workflows. This may be intimidating at first.
Instead, AWS takes the approach of providing a broad collection of primitives for low-overhead compute, storage, and development services. Coupled with a rich tool ecosystem for Redshift, you can build a data platform that allows for higher performing, more scalable and lower cost solutions than previously possible.
Overall, the migration ushered Intuit into a new era of data productivity. The platform:
Meanwhile, the new data platform saves Intuit millions of spend on cloud infrastructure, and transforms the decision making process for 8,000+ employees.
With the new platform in place, Intuit is architecting a number of new use cases.
Data Lake Architecture
Long-term trends for the CUR data are interesting, but for cost optimization, analysts care most about the most recent data. It makes sense to unload data from the largest tables in Redshift into S3 in Parquet format. That saves cost and increases flexibility by separating storage and compute.
Data Lifecycle Management
Once data is in S3, other (serverless) query engines like Athena or Redshift Spectrum can access it. The main fact tables in the Intuit cluster are date-based – the CUR is a bill. The date serves as the criterion for when to unload data. For example, you may only want to keep one quarter of data within the cluster. By keeping historic data in S3 and using Spectrum to query it, you scale data outside of Redshift but keep retrieval seamless and performant.
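The date-based retention rule can be sketched as a cutoff calculation: anything older than the window gets unloaded to S3 in Parquet. The 90-day window is an example standing in for “one quarter of data”; the function name is ours.

```python
from datetime import date, timedelta

RETENTION_DAYS = 90  # keep roughly one quarter of data in the cluster

def unload_cutoff(today: date, retention_days: int = RETENTION_DAYS) -> date:
    """Rows with a bill date before this cutoff are candidates for
    unloading to S3 (Parquet), to be queried via Spectrum instead."""
    return today - timedelta(days=retention_days)

cutoff = unload_cutoff(date(2019, 7, 1))
print(cutoff)  # 2019-04-02
```

Because Spectrum queries the unloaded Parquet files in place, moving a partition past the cutoff changes where the data lives, not whether analysts can reach it.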
In intermix.io, you can filter for Spectrum queries by row count and scan size. You can also track their execution time and queue wait time. In the screenshot below you see those metrics, including the uptick in Spectrum queries at the beginning of June.
The cost optimization program has delivered massive benefits. Teams know and can predict computing costs in near real time. Deploying ML/AI capabilities against the CUR will allow even smarter decisions – even a 1% improvement pays huge dividends.
Intuit expects the number of data scientists to grow several-fold, and with it the query volume. These query patterns are more complex and less predictable. Concurrency Scaling offers an option to add more slots to a cluster to accommodate that incremental query volume without adding nodes.
It’s a new way of working with data compared with the old, on-premise warehouse. Intuit is now in a position to embed a data services layer into all of Intuit’s products and services.
That was a long post – I hope it gave you a good peek behind the curtain at how Intuit is building their platform, and enough information to get started with your own.
Now, I’d love to learn from you! Is there anything you can share about your own experience building a data platform? And if you want faster queries for your cloud analytics, and spend less time on Ops and more time on Dev like Intuit, then go ahead and schedule a demo or start a trial for intermix.io.