13 Examples of Data Pipelines Built with Amazon Redshift
Table of Contents
At intermix.io, we work with companies that build data pipelines and data lakes in the cloud. Some start “cloud-native”, others migrate from on-premise solutions like Oracle or Teradata. What they all have in common though is the one question they ask us at the very beginning:
“How do other companies build their data pipelines?”
And so that’s why we decided to compile and publish a list of publicly available blog posts about how companies build their data pipelines. In those posts, the companies talk in detail about how they’re using data in their business, and how they’ve become “data-centric”.
Here’s the short list of the 14 companies:
10. Dollar Shave Club
And we’re sorry if we missed your post – we’re happy to include it, just fill out this form (take less than a minute). And with that – please meet the 14 examples of data pipelines from the world’s most data-centric companies
Getting data-driven is the main goal for Simple. It’s important for the entire company to have access to data internally. Instead of the analytics and engineering teams to jump from one problem to another, a unified data architecture spreading across all departments in the company allows building a unified way of doing analytics. The main problem then is how to ingest data from multiple sources, process it, store it in a central data warehouse, and present it to staff across the company. Similar to many solutions nowadays, data is ingested from multiple sources into Kafka, before passing it to compute and storage systems. The warehouse of choice is Redshift, selected because of its SQL interfaces, and the ease to process petabytes of data. Finally reports, analytics and visualizations are powered using Periscope Data. In such a way, the data is easily spread across different teams allowing them to make decisions based on data.
Clearbit was a rapidly growing, early-stage startup when it started thinking of expanding its data infrastructure and analytics. After trying out a few out-of-the-box analytics tools (and each of them failed to satisfy the company’s demands) they took building the infrastructure in their own hands. Their efforts converged into a trio of providers: Segment, Redshift, and Mode. Segment is responsible for ingesting all kind of data, combining it, and syncing it daily into a Redshift instance. The main data storage is obviously left to Redshift, with backups into AWS S3. Finally, since Redshift supports SQL, Mode is a perfectly fitted tool which is used to running queries (while using Redshift’s powerful data processing abilities) and creating data insights.
Mode make it easy to explore, visualize and share that data across your organization.
But as data volume grows, that’s when data warehouse performance goes down. With ever increasing calls to your data from analysts, your cloud warehouse becomes the bottleneck. Uncertainty why queries take longer and longer to complete frustrates analysts and engineers alike.
That’s why we’ve built intermix.io, so that Mode users get all the tools they need to optimize their queries running on Amazon Redshift. Here one of our dashboards that shows you how you can track queries from Mode down to the single user:
The whole data architecture at 500px is mainly based on two tools: Redshift – for data storage, and Periscope – for analytics, reporting, and visualization. From a customer-facing side, the company’s web and mobile apps run on top of a few API servers, backed by several databases – mostly MySQL. Data in these DBs is then processed through a Luigi ETL, before storing it to S3 and Redshift. Splunk here does a great job in querying and summarizing text-based logs. Periscope Data is responsible for building data insights and sharing them across different teams in the company. All in all, this infrastructure supports around 60 people distributed across a couple of teams within the company, as of 2015.
The data infrastructure at Netflix is certainly one of the most complex ones, having in mind that they serve over 550 billion events per day, equaling roughly to 1.3 petabytes of data. In general, Netflix’s architecture is broken down into smaller systems, such as systems for data ingestion, analytics, predictive modeling etc. The data stack employed in the core of Netflix is mainly based on Apache Kafka for real-time (sub-minute) processing of events and data. Data needed in the long-term is sent from Kafka to AWS’s S3 and EMR for persistent storage, but also to Redshift, Hive, Snowflake, RDS and other services for storage regarding different sub-systems. Metacat is built to make sure the data platform can interoperate across these data sets as a one “single” data warehouse. Its task is to actually connect different data sources (RDS, Redshift, Hive, Snowflake, Druid) with different compute engines (Spark, Hive, Presto, Pig). Other Kafka outputs lead to a secondary Kafka sub-system, predictive modeling with Apache Spark, and Elasticsearch. Operational metrics don’t flow through the data pipeline but through a separate telemetry system named Atlas.
The tech world has changed dramatically since Yelp was launched back in 2004. Eight years later and Yelp started its change. It transformed from running a huge monolithic application on-premises to one built on microservices running in the AWS cloud. By the end of 2014, there were more than 150 production services running, with over 100 of them owning data. Its main part of the cloud stack is better known as PaaSTA, based on Mesos and Docker, offloading data to warehouses such as Redshift, Salesforce and Marketo. Data enters the pipeline through Kafka, which in turn receives it from multiple different “producer” sources.
Gusto, founded in 2011, is a company that provides a cloud-based payroll, benefits and workers’ compensation solution for businesses. Their business has grown steadily over the years, currently topping to around 60 thousand customers. By early 2015, there was a growing demand within the company for access to data. Up until then, the engineering team and product managers were running their own ad-hoc SQL scripts on production databases. There was obviously a need to build a data-informed culture, both internally and for their customers. When coming to the crossroad to either build a data science or data engineering team, Gusto seems to have done the right choice: first build a data infrastructure which then would support analysts in generating insights and drawing prediction models.
The first step for Gusto was to replicate and pipe all of their major data sources into a single warehouse. The warehouse choice landed on an AWS Redshift cluster, with S3 as underlying data lake. Moving data from production app databases into Redshift was then facilitated with Amazon’s Database Migration Service. On the other side of the pipeline, Looker is used as a BI front-end that teams throughout the company can use to explore data and build core dashboards. Aleph is a shared web-based tool for writing ad-hoc SQL queries. Finally, monitoring (in the form of event tracking) is done by Snowplow, which can easily integrate with Redshift, and as usual, Airflow is used to orchestrate the work through the pipeline.
Building such pipeline massively simplified data access and manipulation across departments. For instance, analysts can simply build their own datasets as part of an Airflow task, and expose it to Looker to use in dashboards and further analyses.
Teads is a video advertising marketplace, often ranked as the number 1 video platform in the world. Working with data-heavy videos must be supported by a powerful data infrastructure, but that’s not the end of the story. Teads’ business needs to log user interactions with their videos through the browser (like play, pause, resume, complete…), which count up to 10 million events per day. Another source of data is video auctions (real-time bidding processes) which generate another 60 million events per day. To build their complex data infrastructure, Teads has turned to both Google and Amazon for help.
Originally the data stack at Teads was based on a lambda architecture, using Storm, Spark and Cassandra. This architecture couldn’t scale well, so the company turned toward Google’s BigQuery in 2016. They already had their Kafka clusters on AWS, which was also running some of their ad delivery components, so the company chose a multi-cloud infrastructure. Transferring data between different cloud providers can get expensive and slow. To address the second part of this issue, Teads placed their AWS and GCP clouds as close as possible and connected them with managed VPNs.
So how does their complex multi-cloud data stack look like? Well, first of all, data coming from users’ browsers and data coming from ad auctions is enqueued in Kafka topics in AWS. Then using an inter-cloud link, data is passed over to GCP’s Dataflow which is then well paired with BigQuery in the next step. Having all data in a single warehouse means half of the work is done. The next step would be to deliver data to consumers, and Analytics is one of them. The Analytics service at Teads is a Scala-based app that queries data from the warehouse and stores it to tailored data marts. Interestingly, the data marts are actually AWS Redshift servers. In the final step, data is presented into intra-company dashboards, and the user’s web apps.
[cta heading=”Download the Data Pipeline Resource Bundle” description=”You’ll get additional resources like:” checklist=”Full stack breakdown and tech checklist,Summary slides with links to resources,PDF version of the blog post” image=”https://intermix-media.intermix.io/wp-content/uploads/20190117201559/mauro-licul-388509-unsplash.jpg” form=”9″]
Remind’s data engineering team strives to provide the whole company with access to the data they need, as big as 10 million daily events, and empower them to directly make decisions. They initially started with Redshift as its source of truth resource for data, and AWS S3 to optimize for cost. While S3 is used for long-term storage of historical data in JSON format, Redshift only stores the most valuable data, not older than 3 months. The company uses Interana to run custom queries on their JSON files on S3, but they’ve also recently started using AWS Athena, as a fully managed Presto system – to query both S3 and Redshift databases. The move for Athena also triggered a change in the data format – from JSON to Parquet, which they say was the hardest step in building up their data platform. An EMR/Hive system is responsible for doing the needed data transformations between S3 and Athena. In the data ingestion part of the story, Remind gathers data through their APIs from both mobile devices and personal computers, as the company business targets schools, parents and students. This data is then passed to a streaming Kinesis Firehose system, before streaming it out to S3 and Redshift.
Remind’s future plans are probably focused on facilitating data format conversions using AWS Glue. This step would allow them to replace EMR/Hive from their architecture and use Spark SQL instead of Athena for diverse ETL tasks.
Robinhood is a stock brokerage application that democratizes access to the financial markets, which enables its customers to buy and sell U.S. listed stocks and ETFs with zero commission. The company debuted with a waiting list of nearly 1 million people, which means they had to pay attention to scale from the very beginning.
Robinhood’s data stack is hosted on AWS, and the core technology they use is ELK (Elasticsearch, Logstash, and Kibana) – a tool for powering search and analytics. Logstash is responsible for collecting, parsing and transforming logs, before passing them on to Elasticsearch, while data is visualized through Kibana. They grew up from a single ELK cluster with a few GBs of data to three clusters with over 15 TBs. Before data goes to ELK clusters, it is buffered in Kafka, as the rates of which documents enter vary significantly between different data sources. Kafka also shields the system from failures and communicates its state with data producers and consumers. As with many other companies, Robinhood uses Airflow to schedule various jobs across the stack, beating competition such as Pinball, Azkaban and Luigi. Robinhood data science team uses Amazon Redshift to help identify possible instances of fraud and money laundering.
10. Dollar Shave Club
Dollar Shave Club (DSC) is a lifestyle brand and e-commerce company that’s revolutionizing the bathroom by inventing smart, affordable products. Don’t be fooled by their name, they have a pretty cool data architecture, for a company in the shaving business. Their business model works with online sales through a subscription service. Currently, they serve around 3 million subscribed customers.
DSC’s web applications, internal services, and data infrastructure are 100% hosted on AWS. A Redshift cluster serves as the central data warehouse, receiving data from various systems. Data movement is facilitated with Apache Kafka and can move in different directions – from production DBs into the warehouse, in between different apps and in between internal pipeline components. There’s also Snowplow which collects data from the web and mobile clients. Once data reaches Redshift, it is accessed through various analytics platforms for monitoring, visualization, and insights. The main tool for the job is, of course, Apache Spark, which is mainly used to build predictive models, such as recommender systems for future sales.
Coursera is an education company that partners with the top universities and organizations in the world to offer online courses. They started building their data architecture somewhere around 2013, as both numbers of users and available courses increased. As of late 2017, Coursera provides courses to 27 million worldwide users.
Coursera collects data from their users through API calls coming from mobile and web apps, their production DBs, and logs gathered from monitoring. A backend service called “eventing” periodically uploads all received events to S3 and continuously publishes events to Kafka. The engineering team has selected Redshift a central warehouse, offering much lower operational cost when compared with Spark or Hadoop at the time.
On the analytics end, the engineering team created an internal web-based query page where people across the company can write SQL queries to the warehouse and get the needed information. Of course, there are analytics dashboard across the company which are refreshed on a daily basis. Finally, many decisions made in Coursera are based on machine learning algorithms, such as A/B testing, course recommendations, understanding student dropouts and others.
Wish is a mobile commerce platform. It provides online services that include media sharing and communication tools, personalized and other content, as well as e-commerce. During the last few years, it grew up to 500 million users, making their data architecture out of date.
The data architecture at Wish, before scaling up, had two different production databases: a MongoDB one storing user data, and a Hive/Presto cluster for logging data. Data engineers had to manually query both to respond to ad-hoc data requests, and this took weeks at some points. Another small pipeline orchestrated by Python crons, also queried both DBs and generated Email reports.
After rethinking their data architecture, Wish decided to build a single warehouse using Redshift. Data from both production DBs flowed through the data pipeline into Redshift. BigQuery is also used for some types of data. It feeds data into secondary tables needed for analytics. Finally, analytics and dashboards are created with Looker.
[cta heading=”Download the Data Pipeline Resource Bundle” description=”You’ll get additional resources like:” checklist=”Full stack breakdown and tech checklist,Summary slides with links to resources,PDF version of the blog post” image=”https://intermix-media.intermix.io/wp-content/uploads/20190117201559/mauro-licul-388509-unsplash.jpg” form=”10″]
Blinkist transforms the big ideas from the world’s best nonfiction books into powerful little packs users can read or listen to in 15 minutes. At first, they started selling their services through a pretty basic website, and monitored statistics through Google Analytics. Unfortunately, visitor statistics gathered from Google Analytics didn’t match the ones engineers computed. This is one of the reasons why Blinkist decided to move to the AWS cloud.
They choose a central Redshift warehouse where data flow in from user apps, backend and web frontend (for visitors tracking). To get data to Redshift, data is streamed with Kinesis Firehose, also using Amazon Cloudfront, Lambda and Pinpoint. The engineering team at Blinkist is working on a newer pipeline where ingested data comes to Alchemist, before passing it to a central Kinesis system, and onwards to the warehouse.
How will you build your data pipeline?
We hope this post along with its 14 examples gives you the inspiration to build your own data pipelines in the cloud.
If you don’t have any data pipelines yet, it’s time to start building them. Begin with baby steps and focus on spinning up an Amazon Redshift cluster, ingest your first data set and run your first SQL queries.
After that, you can look at expanding by adding a dashboard for data visualization, and schedule a workflow, to build your first true data pipeline. And once data is flowing, it’s time to understand what’s happening in your data pipelines.
That’s why we built intermix.io. We give you a single dashboard to understand when & why data is slow, stuck, or unavailable.
With intermix.io you can:
- Spot problems before users do: You can measure and trend the performance of all queries and apps, detect any rogue and resource wasting queries, and monitor the behavior of individual users.
- Have the confidence to scale your data volume: Our dashboards help you understand how to optimize concurrency and memory configurations for your Redshift cluster, with 4 simple steps to configure your workload management (WLM). You can get more out of storage by finding “cold” tables and reduce your Amazon Redshift cost, and detect bottlenecks that cause queries to be slow for your dashboards, such as for slow Looker queries.
- Eliminate the fire fighting: Rather than guessing, we give you the root cause analysis of performance issues at your fingertips. Most dashboards and ETL tools mask the single user(s) behind a query – but with our app tracing capabilities, you can look behind the proverbial curtain to understand the cost of user queries and their resource impact.
Our customers have the confidence to handle all the raw data their companies need to be successful. What you get is a real-time analytics platform that collects metrics from your data infrastructure and transforms them into actionable insights about your data pipelines, apps, and users who touch your data.
Setting up intermix.io takes less than 10 minutes, and because you can leverage our intermix.io experts, you can say goodbye to paying for a team of experts with expensive and time-consuming consulting projects. We can help you plan your architecture, build your data lake and cloud warehouse, and verify that you’re doing the right things.
Join 11,000 of your peers.
Subscribe to our newsletter SF Data.
People at Facebook, Amazon and Uber read it every week.
Every Monday morning we'll send you a roundup of the best content from intermix.io and around the web. Make sure you're ready for the week! See all issues.