
“Real-time Fraud Detection”

This is part of a series of interviews on how companies are building data products. In these interviews, we’re sharing how data teams use data, with a deep dive into a data product at the company. We also cover tech stacks, best practices and other lessons learned.


About

Aaron Biller is a lead data engineer at Postmates. Postmates is an on-demand delivery platform with operations in 3,500 cities in the US and Mexico. With over 5 million deliveries each month, Postmates is transforming the way food and merchandise is moved around cities.

A successful, on-time delivery is the single most important event for Postmates’ business.

“In the very, very early days, we would query our production database to understand how many deliveries we had for the past day. Reporting on our metrics would happen with spreadsheets. Clearly that wouldn’t scale, and so we shifted our analytics to a data warehouse and used Amazon Redshift” says Aaron Biller, Engineering Lead for Data at Postmates.

What business problem does this data product solve? 

“Data is ubiquitous at Postmates. It’s much more than reporting – we’re delivering data as a product and supporting data microservices”, says Biller.

Consider fraud prevention. On-demand platforms like Postmates have a unique exposure to payments and fraud because they have to assess risk in real-time. 

While the warehouse does not operate in real-time, it ingests and transforms event data from all transactions for downstream consumption by predictive models and real-time services.

The Postmates risk team has engineered an internal risk detection microservice called “Pegasus”. Event data passes through a series of transformations in Redshift and feeds into “business rules”, which take the transformed data as input and produce a live decision for every individual transaction on the Postmates platform.
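As a rough sketch of what such a rule can look like in SQL (the tables, columns and thresholds below are hypothetical, not Postmates’ actual logic), a rule might score recent transactions against pre-computed risk features:

  -- Hypothetical business rule over transformed event data; all names and
  -- thresholds are illustrative, not Postmates' production schema.
  SELECT
      t.transaction_id,
      CASE
          WHEN f.chargebacks_90d >= 2
            OR f.failed_payments_24h >= 3 THEN 'review'   -- rule fires: hold for manual review
          ELSE 'approve'                                  -- default decision
      END AS risk_decision
  FROM staging.transactions t
  JOIN analytics.customer_risk_features f
    ON f.customer_id = t.customer_id
  WHERE t.created_at >= DATEADD(hour, -1, GETDATE());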

In addition to fraud, the data team has built an infrastructure that drives four major use cases for data:

What is the tech-stack used?

As things have progressed at Postmates, they added more developers, more microservices and more data sources. “The amount of data we have, period, and the amount of new data we generate every day has expanded exponentially,” says Biller, describing the growth at Postmates.

Consider the amount of data collected during “peak delivery time” on Sunday nights, when people order their dinner to eat at home.

“Three years ago, we captured data from a certain number of ongoing deliveries on a Sunday night at peak. We’re now at about 30x the number of deliveries in flight. And we’re also monitoring and tracking so many more events per delivery. In short, we’re doing 30x the deliveries, and a single delivery includes 10x the data, and it just keeps growing,” explains Biller.

Amazon Redshift and Google BigQuery are the primary data warehouses.  

What are the sources of data?

The vast majority of raw data comes from the Postmates app itself. In addition to the app, the data team has built integrations with 3rd party services. Examples include:

“You can write a query that combines 13 data sources in one single query and just run it and get data. That’s extraordinarily useful and powerful from an analytics and reporting perspective.”
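As an illustration of that kind of cross-source query (the schema and table names below are assumptions for the example, not Postmates’ actual warehouse layout), a single SELECT can join app data with third-party sources that land in the same warehouse:

  -- Illustrative only: joining several source systems inside one warehouse query.
  SELECT
      d.delivery_id,
      d.delivered_at,
      z.ticket_count            AS support_tickets,   -- from a support-tool source
      s.charge_amount           AS payment_amount,    -- from a billing source
      p.push_notifications_sent                       -- from a messaging source
  FROM app.deliveries d
  LEFT JOIN support.tickets_by_delivery  z ON z.delivery_id = d.delivery_id
  LEFT JOIN billing.charges_by_delivery  s ON s.delivery_id = d.delivery_id
  LEFT JOIN marketing.pushes_by_delivery p ON p.delivery_id = d.delivery_id
  WHERE d.delivered_at >= CURRENT_DATE - 7;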

“Used by over 300 people”


About

Stephen Bronstein leads the Data Team at Fuze. It’s a “skinny team” of 3 people that supports all of Fuze’s data needs.

Over the course of the past three years, Stephen has led the team through warehouse transitions and performance tuning, adoption of new data sources, regular surges of new data and use cases, and the on-boarding of hundreds of new data-users. “People care a lot about having the right data at the right time. It’s crucial to drive their work forward,” says Bronstein. 

Who is the end-user of this data product?

Fuze is a cloud-based communications and collaboration platform provider. The Fuze platform unifies voice, video, messaging, and conferencing services on a single, award-winning cloud platform, and delivers intelligent, mobile-ready apps to customers.

As Fuze has grown its customer base and employee count, data has become a mission-critical component. More than 300 people query the data warehouse on a constant basis across Fuze’s 19 global office locations.  

Departments include Finance, Sales, Product, and Customer Support.

What business problem does this data product solve? 

Each day critical business functions query the data warehouse: 

What is the tech-stack used?

A central Amazon Redshift data warehouse combines data from a growing number of sources and events. The toolchain around Amazon Redshift includes 

Data modeling is bespoke, and Fuze runs large-scale data transformations within Redshift in SQL. 
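A minimal sketch of what such an in-warehouse transformation can look like, assuming a raw table of call events; Fuze’s actual models are bespoke and not shown here:

  -- Rebuild a prepared table from raw events inside Redshift; names are illustrative.
  BEGIN;
  DROP TABLE IF EXISTS prepared.daily_call_volume;
  CREATE TABLE prepared.daily_call_volume AS
  SELECT
      account_id,
      DATE_TRUNC('day', started_at) AS call_date,
      COUNT(*)                      AS calls,
      SUM(duration_seconds) / 60.0  AS call_minutes
  FROM raw.call_events
  GROUP BY 1, 2;
  COMMIT;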

Watch the full video and download a complete transcript of our conversation.

Summary

Arvind Ramesh is the manager of the data team at Envoy. In this post, we’re sharing how Arvind’s team has built a data platform at Envoy, with a deep dive into a data product for the Customer Success Team. Arvind believes that the most important skill in data is storytelling and that data teams should operate more like software engineering teams. You can find Arvind on LinkedIn.

For more detail, you can also watch the full video of our conversation. 

Data at Envoy: Background & Business Context

Envoy helps companies manage their office visitors and deliveries. Chances are you’ve already used Envoy somewhere in a lobby or a reception desk, where an iPad kiosk runs the Envoy app. The app checks you in, prints your name tag and sends your host a message that you’ve arrived. Guests can also use the “Envoy Passport” app for automatic check-in, collecting different “stamps” for each office visit, similar to visa stamps at customs when entering a country. Hosts can manage visitors and other things like mail and packages via a mobile app. 


The apps generate a new, growing amount of data on a per-minute basis. Envoy has welcomed over 60M visitors, adding 130,000 new visitors every day, in over 13,000 locations around the globe. Envoy’s business model consists of a subscription where pricing scales with the number of locations and deliveries. Providing analysis and reporting for decision making across that location and visitor footprint is of course valuable. 

But basic reporting is not a data product. “The way I look at it, a data product, it’s not really delivering an insight per se, but it’s delivering ‘a thing’ that can be used by other people within or outside of the company to improve their workflow”, says Arvind.


At Envoy, data products are about enabling people to do more with the data Envoy generates. “The advantage of data products is that they enable other people to more easily access data and use it in their workflow without us having to go back and forth for every piece of insight”, says Arvind.

Part of Envoy’s secret to success is using a data product to drive account penetration and adoption of the Envoy app. 

Who is the end-user of this data product?

The Customer Success team uses Gainsight, a suite of SaaS products for monitoring and managing product adoption. “Gainsight is basically a UI that allows you to better manage your customer base, see how people are using your product, where there are problems, where you might need to focus your attention. But, like most tools, it is only valuable if it has a stream of comprehensive, reliable, and accurate customer data underpinning it.”


Gainsight offers a full customer data platform, but in Envoy’s case the Gainsight app is a “skinny” UI, which sits on top of the Envoy data platform. In this specific case, it’s a table called “Gainsight company facts”. A daily batch process combines data from many different sources into the final table.

“The way I think of it, Gainsight is the UI that sits on top of our data platform, because it would not be the best use of our time or anyone’s time to build a UI for that kind of thing. The way we think of data products – it’s a backend data service or data infrastructure. The actual last mile is a web dashboard or something similar. Usually we’ll use a tool to accomplish that.”

What Business Problem Does this Data Product Solve?

Data delivered in Gainsight helps the Customer Success team prioritize the accounts where product adoption is below par, and locations are at risk of churning. 

Compared to the raw source data that’s scattered across various sources and places, the new “Gainsight company facts” table has reliable, useful information in a single place, such as:

“We have a few customer success managers and each of them will have several hundred accounts” says Arvind. “They can have a conversation with the customer, ‘ok, you have 50 offices, but look, these five are actually having some issues and maybe you should focus on these’.” Arvind’s team helps make those conversations more effective with product usage data.

What Are the Data Sources & Tech Stack?

The data for the “Gainsight company facts” table is the result of a daily batch process. The process cleans, transforms and combines raw data from different sources within the Envoy data warehouse.

The image below shows a summary of the graph, or “DAG” involved in building the model. For every company, for every day, the output table contains a variety of key data points. 


The model for Gainsight takes about 40 minutes to run, including completing all upstream dependencies for the freshest possible data.

For other data products, the SLA can be a bit more real-time, and models run every three hours or even more frequently. But “typically it’s pretty rare for people to truly need data real-time or even close to real time, unless it’s something operational or you’re looking at server metrics.”

Best practices and lessons learned

Next to customer success, the platform supports other data products, and building the underlying models requires domain knowledge. 

“That’s typically where we do a lot of work on our end. You really have to understand how our business works, how our activation metrics are defined, or how our products interact with each other”, says Arvind. Usually one person is an expert for a specific domain, but with peer review, documentation and a QA rotation, domain knowledge starts to make its way across the team. 

“Our team is eight people and that’s split across data engineering, product analytics and go-to-market analytics. We’re about 170 people right now total at Envoy,” which translates to about 5% of the company working in data-focused roles. “If we build effective data products and continue to iterate on operational efficiencies, our team should not have to scale linearly with the company.”


Arvind Ramesh is the manager of the data team at Envoy. He believes that the most important skill in data is storytelling and that data teams should operate more like software engineering teams. When he’s not walking his dog Casper or playing board games with friends, you can find Arvind on LinkedIn.


Who is the end-user of this data product?

The Customer Success team.

Part of Envoy’s secret to success is using data to drive account penetration and adoption of the Envoy app. Envoy’s business model consists of a subscription where pricing scales with the number of locations and deliveries. Data helps the customer success team prioritize the accounts where product adoption is below par, and locations are at risk of churning. 

What business problem does this data product solve? 

“We have a few customer success managers and each of them will have several hundred accounts” says Arvind. “They can have a conversation with the customer, ‘ok, you have 50 offices, but look, these five are actually having some issues and maybe you should focus on these’.” Arvind’s team facilitates that conversation with product usage data.

The dashboard is “a UI that allows you to better manage your customer base, see how people are using your product, where there are problems, where you might need to focus your attention. But the whole premise of this is that you need to feed it a lot of product data, a lot of customer data and it all needs to be reliable, accurate, and pretty comprehensive.” 

The app “is a UI that sits on top of our data platform because it would not be the best use of our time or anyone’s time to build a UI for that kind of thing. The way we think of data products – it’s a backend data service or data infrastructure. The actual last mile is a web dashboard or something similar. Usually, we’ll use a tool to accomplish that.”

What is the tech-stack used?

The primary interface for the data product is a tool called Gainsight. Gainsight offers a full customer data platform, but in Envoy’s case the Gainsight app is a “skinny” UI.

The data is the result of a daily batch process. The batch process cleans and transforms raw data within the Envoy data warehouse, and combines data from many different sources into a prepared table called “Gainsight company facts”. The image below shows a summary of the graph, or “DAG” involved in building the model. For every company, for every day, the output table contains a variety of key data points. 

a summary of the graph, or “DAG” involved in building the model.

To run the batch process, Arvind’s team has built a platform that consists of five key components.

ETL tools

To get raw data from the source into the warehouse, Envoy uses off-the-shelf tools wherever an integration with a data source is available, including AWS Glue, Fivetran and Stitch Data. “For loading data, our general philosophy is that it’s a solved problem and we use a tool. For custom ingestion, we build scripts using Singer,” an open-source ETL framework sponsored by Stitch Data, with orchestration via Airflow. Data loads happen up to every 30 minutes.

Data Warehouse

Envoy loads raw data into Amazon Redshift, with dense compute nodes (dc2.8xlarge). 

Data Modeling

Once data has been loaded into Redshift, transformation happens in SQL with dbt, an open-source analytics engineering framework, to create the final output tables, e.g. “Gainsight company facts”. 
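A dbt model is essentially a SELECT statement saved as a file; dbt materializes it as a table or view and resolves dependencies via ref(). A hedged sketch of what a model like this could look like (the column names and upstream models are assumptions, not Envoy’s actual definitions):

  -- models/gainsight_company_facts.sql (hypothetical dbt model)
  SELECT
      c.company_id,
      CURRENT_DATE                 AS as_of_date,
      COUNT(DISTINCT v.visitor_id) AS visitors_30d,
      COUNT(DISTINCT t.ticket_id)  AS support_tickets_30d,
      SUM(i.invoice_amount)        AS billed_amount
  FROM {{ ref('companies') }} c
  LEFT JOIN {{ ref('visits_last_30d') }}  v ON v.company_id = c.company_id
  LEFT JOIN {{ ref('tickets_last_30d') }} t ON t.company_id = c.company_id
  LEFT JOIN {{ ref('invoices') }}         i ON i.company_id = c.company_id
  GROUP BY 1, 2;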

The final table is unloaded from the warehouse into a designated S3 bucket from where Gainsight picks it up. This architecture supports other front-end visualization tools in the future that may require different formats. The key part is that all of the logic is encapsulated in a model, and the final table(s) can have different destinations, e.g. an S3 bucket, a query within Redshift, a data API, or a data microservice. 
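The unload step itself is a single Redshift command; a sketch with the bucket path and IAM role as placeholders:

  -- Export the prepared table to S3 for Gainsight (or any other consumer) to pick up.
  UNLOAD ('SELECT * FROM analytics.gainsight_company_facts')
  TO 's3://example-bucket/gainsight/company_facts_'
  IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload-role'
  FORMAT AS CSV
  HEADER
  ALLOWOVERWRITE
  PARALLEL OFF;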

What are the sources of data?

Envoy has between 20 and 25 data sources, across internal and 3rd party systems. For Gainsight, the relevant sources include Envoy’s production databases, Segment event data, the customer support tool (e.g. ticket volume), the billing system, email systems and Salesforce.

What best practices did you use?

Building the platform has been an iterative process. “It’s probably a bit cliché, but I think data teams should work more like software engineering teams”. 

First was testing. “After building out the first set of several dozen models, we quickly realized that ensuring everything stayed accurate was a challenge. People were finding issues in the data before we were, which was embarrassing for us since we felt accountable for the accuracy of anything we put out. The upstream data was at times incorrect, and could change in ways we hadn’t anticipated. So we implemented tests, which are basically data sanity checks running on a schedule that checked the intermediate and final transformations in our pipeline. But then, when some of these tests failed, it’s not evident who on the team was responsible. They went unfixed for a long time, as each member of the team thought someone else would fix the issue”. 

So next up was QA rotation. “Now, every week, there’s someone else who’s responsible for fixing any data test issues that pop up, and the ‘oncall’ schedule is managed through PagerDuty”. As the company and the data team keeps growing, the person who is responsible for fixing a failed test may not understand the full logic that’s going on. 

That meant better documentation on how things are built and how to debug tests. “And again, I keep drawing this parallel, but if you look at any software engineering team, and open up our knowledge base, you will likely see a  postmortem for every outage and issue that happens, and runbooks to fix them.”

What does the future hold for Envoy’s data product?

There are plans to integrate the Data Platform with the Envoy Product Experience. Arvind’s team has started to use the data platform for advanced data products, like churn prediction and machine learning. Envoy is trying Amazon SageMaker for that. 

Another project is integrating the data platform with the customer-facing analytics. “Obviously, when people log into Envoy, there are already analytics. How many visitors are coming to their office, how many packages they’re scanning. That’s done through a white-labeled solution with a BI vendor”, says Arvind.

“But eventually the data team is going to build an internal microservice where the Envoy app will just request what it wants and then we will return a bunch of things that we think are valuable”. 

Watch the full video and download a complete transcript of our conversation.

Amazon Redshift was introduced in 2012 as one of the first cloud data warehouses, and it quickly became the fastest-growing service among Amazon Web Services. Amazon Redshift and the new generation of ETL tools around it changed the way teams work with analytical data: the focus shifted from the classical ETL approach to ELT. Let’s examine in detail what’s behind these abbreviations and why the shift is so important.

ETL stands for Extract, Transform, Load: data is extracted from source systems, transformed on a dedicated processing server, and only then loaded into the warehouse.

The standard analytics stack in the “old” on-premise world would look like this:

In that model, once data is in the data warehouse, it doesn’t change at all anymore. It’s absolutely static.

ELT Tools for Amazon Redshift

“ETL” is an “old world” paradigm. With Amazon Redshift, cloud warehouses became very powerful and cheap. So there was no more reason to process the data outside of the warehouse. Just load your raw data into the warehouse and then run aggregations/transformations within your warehouse. This explains the shift from “ETL” to “ELT”. 
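In practice the “EL” is a bulk load and the “T” is SQL that runs inside the warehouse. A minimal sketch with placeholder object names:

  -- Load raw data first (no transformation on the way in)...
  COPY raw.orders
  FROM 's3://example-bucket/orders/2019-10-01/'
  IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
  FORMAT AS JSON 'auto';

  -- ...then transform inside the warehouse instead of on an external ETL server.
  CREATE TABLE analytics.orders_by_day AS
  SELECT
      DATE_TRUNC('day', ordered_at) AS order_date,
      COUNT(*)                      AS orders,
      SUM(total_amount)             AS revenue
  FROM raw.orders
  GROUP BY 1;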


The new generation of tools follows the “ELT” paradigm. They integrate many different data sources and then push the raw data into the warehouse. Let’s review the top modern ETL tools for Amazon Redshift in more detail.

Fivetran

Fivetran architecture
Image source: https://fivetran.com/docs/getting-started/architecture 

Fivetran is an easy-to-use cloud service. It can help you gather all your data into your warehouse in minutes without any coding. Fivetran supports a wide range of data sources.

Fivetran automatically cleans and normalizes your data. It organizes it into standardized schemas and then routes it to your warehouse. Once the data from the connected data source lands in your warehouse, you can apply custom transformations to the data via SQL.

Fivetran is a scalable and secure solution that meets PCI, HIPAA, GDPR and SOC 2 criteria with advanced security features like SSH tunnels.


SimilarWeb Category Rank: 15,200

Pricing: Pricing available on request.

Alooma

alooma dashboard
Image source: https://www.alooma.com/ 

Alooma is an enterprise data pipeline platform, recently acquired by Google. Alooma is a scalable and secure cloud-based ETL tool. It provides dozens of native integrations with various data sources.

One of the unique features of Alooma is Code Engine. This enables you to enrich and customize data with Python scripts.

Alooma Mapper provides automatic data mapping from any source to a specified destination. When an automatic mapping is not enough, you can make custom mappings in a simple and intuitive UI.

However, with the Google acquisition, Alooma has announced that they will end support for Amazon Redshift in 2020.


SimilarWeb Category Rank: 7,306

Pricing: Pricing available on request.

Stitch

stitch dashboard
Image source: https://www.stitchdata.com/docs/getting-started/set-up-stitch-data-pipeline 

Stitch is a developer-focused tool for rapidly moving data. Stitch is built on top of the Singer open-source ETL framework and provides a rich collection of community-driven integrations. If you can’t find the necessary integration, you can create it yourself or have it developed by one of the implementation partners. Stitch provides rich customization options for replication, enterprise-grade security, error handling, and advanced reports.

Stitch can be difficult to learn for non-technical users, but high-quality documentation and almost unlimited platform extensibility compensate for this drawback.

SimilarWeb Category Rank: 4,906

Pricing: From a free plan (including 5 million rows per month and selected free integrations) to $1,250 per month.

Talend 

talend dashboard

Talend Cloud Integration is a feature-rich solution for quickly synchronizing your data between SaaS and on-premises applications. It also provides cloud-based data preparation tools to access, clean, transform, and enrich data easily.

Talend allows easy design of cloud-to-cloud and hybrid integration workflows. It automates provisioning of your warehouse. Talend Cloud Integration provides powerful features for data integration, big data integration, and application integration. 

Talend was named a Leader in the 2019 Gartner Magic Quadrant for Data Integration Tools.


SimilarWeb Category Rank: 1,567

Pricing: Pricing available on request.

Blendo

blendo dashboard
Image source: https://app.blendo.co/ 

Blendo is a simple ETL and ELT data integration tool with a user-friendly web interface, built especially for non-technical users. Just choose one of dozens of ready-made data connectors and create ELT data integrations from almost any data source. Blendo will collect the data, detect its structure, and optimize it for your data warehouse.

Blendo lacks some database connectors, but the service is actively developed and new data sources are added constantly.


SimilarWeb Category Rank: 13,890

Pricing: From $125 per month for the standard package to $1,000 per month for the advanced package.

Hevo Data

hevodata dashboard
Image source: https://hevodata.com/blog/building-data-pipelines-hevo/ 

Hevo is a data integration platform that allows you to quickly set up your warehouse. It brings data from any source to any destination without writing any code. Hevo provides a user-friendly point-and-click interface to set up data integration logic in minutes.

Hevo supports hundreds of data sources and destinations. They can be integrated in real time with Hevo’s Real-time Streaming Architecture.

For advanced data processing and enrichment, Hevo allows writing custom data transformation scripts in Python.

Hevo provides automatic schema detection and mapping. With a robust Schema Registry Module, Hevo intelligently detects the schema changes of incoming data. It notifies you immediately with Real-time Alerts over email or Slack. 

For advanced users, Hevo also provides granular activity logs, version control, and extensive APIs to manage data pipelines programmatically.

SimilarWeb Category Rank: 14,248


Pricing: From $499 per month for the Starter plan to $999 per month for the Business plan. Enterprise plan price is available on request.

ETLeap

ETL Leap Dashboard
Image source: https://etleap.com/product/ 

ETLeap is an ETL tool for engineering, analytics, and data science teams. It helps build data pipelines and data warehouses without friction.

With ETLeap you can integrate with any of 50+ supported data sources. ETLeap will monitor and maintain your data pipelines for availability and completeness.

You can control your data pipelines over the intuitive dashboard. The interactive data wrangler automatically guesses how to parse incoming data and enables adjustments. Optionally, you can apply custom data transformations with scripts in any language.

ETLeap provides a secure and flexible ETL service. ETLeap is HIPAA-compliant and EU-US privacy shield certified. It provides end-to-end encryption and single sign-on support. You can run it as a hosted solution (SaaS) or in your VPC. 

SimilarWeb Category Rank: 71,974 


Pricing: Pricing available on request.

Data Virtuality

Data Virtuality dashboard
Image source: https://datavirtuality.com/connectors/ 

Data Virtuality is an advanced data integration solution with a user-friendly UI. It centralizes data from multiple sources.

Data Virtuality supports 200+ data sources. It offers a virtual data layer for modeling and transforming data with SQL on the fly.

Data Virtuality enables flexible configuration of replication settings. It has pre-built templates for retrieving data and customizable data pipelines, and it can be hosted either in the cloud or on-premise. Distributed join optimizations and dynamic cost- and rule-based optimizations ensure data is integrated optimally.


SimilarWeb Category Rank: 22,810

Pricing: From $249 per month for the Starter bundle to $599 per month for the Large bundle. Enterprise bundle pricing is available on request.

FlyData

flydata dashboard

FlyData allows you to replicate data changes in your relational database or CSV/JSON files directly to Amazon Redshift.

FlyData provides a limited list of data sources, but syncs them in real time right into Amazon Redshift. FlyData ensures 100% accuracy with each load, so your data is always up-to-date.

FlyData offers SecureTunnel VPN to securely access your database behind a firewall. FlyData also provides an error handling system optimized specifically for Amazon Redshift.


SimilarWeb Category Rank: 22,312

Pricing: From $199 per month to $4667 per month.

Informatica

Informatica Cloud provides codeless, optimized integration to hundreds of applications and data sources, both on-premises and in the cloud. Informatica was named a Leader by Gartner in several markets, including the Magic Quadrant for Enterprise Integration Platform as a Service, the Magic Quadrant for Data Integration Tools, and the Magic Quadrant for Data Quality Tools.

Informatica Cloud has everything you may need for data integration projects and initiatives. Informatica Cloud provides several hundred connectors to databases, cloud data lakes, on-premises and SaaS applications.

Informatica Cloud allows you to build advanced integrations quickly and run them at scale. You can use recommendations for automated parsing driven by the CLAIRE™ engine, and build complex data integrations using the mapping designer with out-of-the-box advanced data integration transformations.

Since this is a large product suite, its initial setup can be difficult, and it will take time to learn how to use all its capabilities to their full potential.


SimilarWeb Category Rank: 1,165

Pricing: Informatica provides various pricing plans starting at $2,000/month.

Matillion

matillion dashboard
Image source: https://www.matillion.com/events/etl-vs-elt-whats-big-difference/ 

Matillion is a powerful and easy-to-use cloud ETL/ELT solution. It has a huge selection of pre-built connectors out-of-the-box. It provides an intuitive interface to visually orchestrate sophisticated data workflows.

A simple and intuitive UI allows you to get started quickly, but Matillion offers much more than this. Dozens of Transformation and Orchestration Components help perform advanced data transformations, and you can orchestrate your ETL processes from start to finish in a graphical user interface.

Matillion provides a Generic API connector for integrating with almost any JSON- or XML-based API. The component exposes the API in a pseudo-SQL dialect, which can then be queried in a similar way to any normal database.

Ready-made components cover most needs, but with Matillion you can also create your own Python or Bash scripts to transform and enrich data.


SimilarWeb Category Rank: 7,216

Pricing: Pricing on Matillion depends on instance size, from “Medium” at $1.37 per hour, to “XLarge” at $5.48 per hour.

Conclusion

Modern ETL tools have a lot to offer, and they benefit users with completely different backgrounds and needs. You can find a simple and easy-to-use solution for quickly integrating data from several sources, or build a complex data pipeline. For some teams, the ability to enrich data using Python scripts is essential. One of the services in this review will likely offer a solution for your data pipelines.

Cloud Data Warehouse Trends

In the past few years, we’ve seen rapid adoption of a new data analytics stack, based on cloud warehouses like Amazon Redshift, Google BigQuery and Snowflake. This stack has three layers:

  1. Data Integration. Ingests raw data from the sources into a staging area in the warehouse.
  2. Modeling Layer. Runs algorithms on the raw data (usually with batch processing) and materializes the outputs into prepared tables and views.
  3. Application Layer. Data discovery and visualization tools, plus tools that serve data to other systems (e.g. for machine learning applications).

On AWS, the stack looks like this:

intermix AWS stack

Companies using this new stack deal with three growth challenges:

Demand for data analysts explodes (source: Google).

Trusted and Timely Data Becomes Mission Critical

Companies aim to make every department data driven. Data use cases typically fall into one of three categories:

For each of the above use cases, a company will need different applications, tools and people to interface with the data.

It’s critical that the raw data is consistent across this growing number of people and tools. If the data is not consistent, it would be impossible to determine whether an error in the data is caused by a modeling problem, or a raw data problem.

In response, companies invest in doing modeling in the data warehouse and delivering prepared data to analysts and data science teams. These prepared data sets are built by continuously running algorithms on the data, and ensuring that the outputs are correct. It’s all about getting all your data into a single place where you can run fast queries and store large volumes of data.

Modeling in the data warehouse requires building fast and robust data pipelines.

Building Fast Data Pipelines on Amazon Redshift

Amazon Redshift is a great place to build data pipelines because of its fast performance and low cost of storage.

Performance management is critical, because slow queries impact data pipelines in the following ways:

  1. When data pipeline jobs take longer, overall data recency (“freshness”) goes down.
  2. Query authors spend more time waiting for queries to finish, which leads to slower velocity for data science projects.
  3. Expensive, wasteful queries require more CPU and storage, which leads to higher costs.

Building data pipelines on Amazon Redshift is a matter of doing some thoughtful up-front planning and ongoing monitoring as your data volume, users and cluster grow.

Here are some tips we’ve learned from our experience helping customers with Amazon Redshift. For a more comprehensive list, please check out our “Top 14” blog post at this link.

Amazon Redshift Tip 1: Use the Workload Manager (WLM)

The Amazon Redshift Workload Manager (WLM) is critical to managing query performance. Amazon Redshift runs queries in a queueing model. The default WLM configuration has a single queue with five slots. 99% of the time, this default configuration will not work for you and you will need to tweak it.
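As a starting point, you can check what your cluster is currently configured with by querying the WLM system tables (a read-only check; the configuration itself is changed through the cluster’s parameter group):

  -- List the configured WLM queues. With manual WLM, user-defined queues
  -- typically occupy service classes 6 and above; lower classes are reserved.
  SELECT service_class,
         name,
         num_query_tasks   AS slots,
         query_working_mem AS working_mem
  FROM stv_wlm_service_class_config
  WHERE service_class >= 6
  ORDER BY service_class;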

Configuring the WLM for your workloads is critical to:

For more information on setting up Amazon Redshift WLM, see our recent blog post on the Automatic WLM feature.

Amazon Redshift Tip 2: Use Amazon Spectrum for infrequently used data

Amazon Redshift prices are based on the size of your cluster, i.e. it couples compute and storage. You’ll have to keep adding nodes for storage, even though you may not need the additional computing power of the additional vCPUs.

With Redshift Spectrum, you can leave data as-is in your S3 data lake, and query it via Amazon Redshift. This approach makes sense when you have data that doesn’t require frequent access. Leave your “hot” data in Amazon Redshift, and your “cold” data in S3.

With Redshift Spectrum, you get the best of both worlds: a) you keep all your historical data, along with the performance of Amazon Redshift, and b) you benefit from the cost savings of using S3. See a more detailed write-up on Redshift Spectrum at this link.
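A minimal sketch of the pattern, assuming the “cold” data is already registered in an AWS Glue data catalog; the schema, database, table and role names are placeholders:

  -- Expose S3 data through an external schema...
  CREATE EXTERNAL SCHEMA spectrum_events
  FROM DATA CATALOG
  DATABASE 'events_lake'
  IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
  CREATE EXTERNAL DATABASE IF NOT EXISTS;

  -- ...and join "cold" S3 data with "hot" tables stored in Redshift.
  SELECT c.customer_id, COUNT(e.event_id) AS historical_events
  FROM analytics.customers c
  JOIN spectrum_events.clickstream_2017 e ON e.customer_id = c.customer_id
  GROUP BY 1;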


Amazon Redshift Tip 3: Security Best Practices

Security and access controls should be first-class considerations for any data project. A common pattern is to separate the ‘prepared’ data sets from the ‘raw’ data sets. This can be done easily by creating one schema for each.

Here are some rules of thumb to implement separation of concerns for the various actors in the data warehouse (a minimal SQL sketch follows the list):

  1. Analysts and Data Scientists have read-only access to ‘prepared’ data schemas.
  2. Modeling tools have read-only access to ‘raw’ data sets, and write access to the prepared data sets.
  3. ETL tools have ‘write-only’ access to the ‘raw’ data schemas.
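A minimal sketch of those rules in Redshift SQL, assuming two schemas (“raw” and “prepared”) and three user groups; all names are illustrative:

  -- Analysts and data scientists: read-only on prepared data.
  GRANT USAGE ON SCHEMA prepared TO GROUP analysts;
  GRANT SELECT ON ALL TABLES IN SCHEMA prepared TO GROUP analysts;

  -- Modeling tools: read raw data, write prepared data.
  GRANT USAGE ON SCHEMA raw TO GROUP modeling;
  GRANT SELECT ON ALL TABLES IN SCHEMA raw TO GROUP modeling;
  GRANT USAGE, CREATE ON SCHEMA prepared TO GROUP modeling;

  -- ETL tools: load into the raw schemas only.
  GRANT USAGE, CREATE ON SCHEMA raw TO GROUP etl;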

 
Setting up your access controls in this way will prevent problems down the road.

Talend and intermix.io: Better Together

intermix.io helps data teams spot problems in data pipelines before they turn into fire drills. It’s a SaaS product that provides a single view of every tool, user, and query that touches the data in your Amazon Redshift cloud warehouse.

intermix.io works by ingesting metadata and other logs from Amazon Redshift. It can see all the SQL queries executed on the data warehouse, as well as metadata about storage and tables.

The intermix.io Talend integration works by embedding a comment block in the Talend SQL query before the SQL query is submitted to Amazon Redshift. The comment block contains metadata about the Talend job like job name, user name, and other attributes. intermix.io then introspects the SQL query to parse the comment block, and stores the query attributes.

The result is the ability to group query latency metrics by Talend job name and user – providing an easy way to spot issues in individual Talend jobs.
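As a rough illustration (the exact comment format and keys are defined by the intermix.io integration and are invented here), an annotated query arriving at Redshift might look something like this:

  /* hypothetical annotation: {"app": "talend", "job": "load_orders_daily", "user": "etl_service"} */
  SELECT order_date, SUM(total_amount) AS revenue
  FROM analytics.orders_by_day
  GROUP BY order_date;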

Troubleshoot Talend Pipelines

With intermix.io Query Insights, find the individual queries that cause your Talend data pipelines to suffer. See latency and query throughput for all Talend jobs, and drill down into the individual jobs and SQL queries.

query insights dashboard in intermix.io

Once you see your problematic queries, easily drill into them to get detailed recommendations on how to optimize the queries to run faster.

Query Optimization Recommendations

Once you find a query that is taking a long time, drill into the query to get actionable advice on how to re-write it to make it run faster.

query recommendations dashboard in intermix.io

Measure & Improve SLAs

The intermix.io dashboard makes it easy to see the P95 SLA you are providing to users, and quickly drill into the users who are running expensive queries.

cluster health dashboard in intermix.io

Improving SLA metrics requires a two-pronged approach:

  1. Optimize the Amazon Redshift cluster resources
  2. Find and optimize the slowest & most resource-intensive queries

Getting Started

By following the advice in this article, you will be able to respond to and fix problems fast. The result will be fewer support requests, and peace of mind that when you return to your desk on Monday morning, you won’t have to deal with any surprises and complaints.

We’ve assembled a number of useful resources for Amazon Redshift practitioners:

  1. Join hundreds of Amazon Redshift practitioners on our free Slack channel. Sign up here.
  2. Download technical content for Amazon Redshift on our Blog.