At intermix.io, we work with companies that build data pipelines. Some start cloud-native on platforms like Amazon Redshift, while others migrate from on-premise or hybrid solutions. What they all have in common is the one question they ask us at the very beginning:
“How do other companies build their data pipelines?”
And so that’s why we decided to compile and publish a list of publicly available blog posts about how companies build their data pipelines. In those posts, the companies talk in detail about how they’re using data in their business and how they’ve become data-centric.
The 15 Companies we’ve looked at are:
Table of Contents
If we missed your post, we’re happy to include it. Just fill out this form, which will take you less than a minute. And with that – please meet the 15 examples of data pipelines from the world’s most data-centric companies.
Getting data-driven is the main goal for Simple. It’s important for the entire company to have access to data internally. Instead of the analytics and engineering teams to jump from one problem to another, a unified data architecture spreading across all departments in the company allows building a unified way of doing analytics.
The main problem then is how to ingest data from multiple sources, process it, store it in a central data warehouse, and present it to staff across the company. Similar to many solutions nowadays, data is ingested from multiple sources into Kafka before passing it to compute and storage systems.
The warehouse of choice is Redshift, selected because of its SQL interfaces and the ease with which it processes petabytes of data. Reports, analytics, and visualizations are powered using Periscope Data. In such a way, the data is easily spread across different teams, allowing them to make decisions based on data.
Clearbit was a rapidly growing, early-stage startup when it started thinking of expanding its data infrastructure and analytics. They tried out a few out-of-the-box analytics tools, each of which failed to satisfy the company’s demands.
After that, Clearbit took building the infrastructure in their own hands. Their efforts converged into a trio of providers: Segment, Redshift, and Mode. Segment is responsible for ingesting all kinds of data, combining it, and syncing it daily into a Redshift instance. The main data storage is obviously left to Redshift, with backups into AWS S3.
Finally, since Redshift supports SQL, Mode is perfectly suited for running queries (while using Redshift’s powerful data processing abilities) and creating data insights.
Mode makes it easy to explore, visualize, and share that data across your organization.
But as data volume grows, that’s when data warehouse performance goes down. With ever-increasing calls to your data from analysts, your cloud warehouse becomes the bottleneck.
That’s why we’ve built intermix.io to provide Mode users with all the tools they need to optimize their queries running on Amazon Redshift. Here one of our dashboards that shows you how you can track queries from Mode down to the single user:
The whole data architecture at 500px is mainly based on two tools: Redshift for data storage; and Periscope for analytics, reporting, and visualization. From a customer-facing side, the company’s web and mobile apps run on top of a few API servers, backed by several databases – mostly MySQL. Data from these DBs passes through a Luigi ETL, before moving to storage on S3 and Redshift.
Splunk here does a great job of querying and summarizing text-based logs. Periscope Data is responsible for building data insights and sharing them across different teams in the company. All in all, this infrastructure supports around 60 people distributed across a couple of teams within the company, prior to their acquisition by Visual China Group.
The data infrastructure at Netflix is one of the most sophisticated in the world. The video streaming company serves over 550 billion events per day, equaling roughly to 1.3 petabytes of data. In general, Netflix’s architecture is broken down into smaller systems, such as systems for data ingestion, analytics, and predictive modeling. The data stack employed in the core of Netflix is mainly based on Apache Kafka for real-time (sub-minute) processing of events and data. Data needed in the long-term is sent from Kafka to AWS’s S3 and EMR for persistent storage, but also to Redshift, Hive, Snowflake, RDS, and other services for storage regarding different sub-systems. Metacat is built to make sure the data platform can interoperate across these data sets as a one “single” data warehouse. Its task is to actually connect different data sources (RDS, Redshift, Hive, Snowflake, Druid) with different compute engines (Spark, Hive, Presto, Pig). Other Kafka outputs lead to a secondary Kafka sub-system, predictive modeling with Apache Spark, and Elasticsearch. Operational metrics don’t flow through the data pipeline but through a separate telemetry system named Atlas.
The tech world has seen dramatic changes since Yelp was launched back in 2004. By 2012, Yelp found themselves playing catch-up. It transformed from running a huge monolithic application on-premises to one built on microservices running in the AWS cloud. By the end of 2014, there were more than 150 production services running, with over 100 of them owning data. Its main part of the cloud stack is better known as PaSTA, based on Mesos and Docker, offloading data to a Redshift data warehouse, Salesforce CRM, and Marketo marketing automation. Data enters the pipeline through Kafka, which in turn receives it from multiple different “producer” sources.
Gusto, founded in 2011, is a company that provides a cloud-based payroll, benefits, and workers’ compensation solution for businesses. Their business has grown steadily over the years, currently topping to around 60 thousand customers. By early 2015, there was a growing demand within the company for access to data. Up until then, the engineering team and product managers were running their own ad-hoc SQL scripts on production databases. There was obviously a need to build a data-informed culture, both internally and for their customers. When coming to the crossroad to either build a data science or data engineering team, Gusto seems to have made the right choice: first, build a data infrastructure that can support analysts in generating insights and drawing prediction models.
The first step for Gusto was to replicate and pipe all of their major data sources into a single warehouse. The warehouse choice landed on an AWS Redshift cluster, with S3 as the underlying data lake. Moving data from production app databases into Redshift was then facilitated with Amazon’s Database Migration Service. On the other side of the pipeline, Looker is used as a BI front-end that teams throughout the company can use to explore data and build core dashboards. Aleph is a shared web-based tool for writing ad-hoc SQL queries. Finally, monitoring (in the form of event tracking) is done by Snowplow, which can easily integrate with Redshift. And, as usual, Airflow orchestrates the work through the pipeline.
Building this pipeline helped to simplify data access and manipulation across departments. For instance, analysts can simply build their own datasets as part of an Airflow task and expose it to Looker to use in dashboards and further analyses.
Teads is a video advertising marketplace, often ranked as the number one video platform in the world. Working with data-heavy videos must be supported by a powerful data infrastructure, but that’s not the end of the story. Teads’ business needs to log user interactions with their videos through the browser – functions like play, pause, resume, complete – which count up to 10 million events per day. Another source of data is video auctions with a real-time bidding process. These generate another 60 million events per day. To build their complex data infrastructure, Teads has turned to both Google and Amazon for help.
Originally the data stack at Teads was based on a lambda architecture, using Storm, Spark and Cassandra. This architecture couldn’t scale well, so the company turned toward Google’s BigQuery in 2016. They already had their Kafka clusters on AWS, which was also running some of their ad delivery components, so the company chose a multi-cloud infrastructure. Transferring data between different cloud providers can get expensive and slow. To address the second part of this issue, Teads placed their AWS and GCP clouds as close as possible and connected them with managed VPNs.
So how does their complex multi-cloud data stack look? Well, first of all, data coming from users’ browsers and data coming from ad auctions is enqueued in Kafka topics in AWS. Then using an inter-cloud link, data is passed over to GCP’s Dataflow, which is then well paired with BigQuery in the next step. Having all data in a single warehouse means half of the work is done. The next step would be to deliver data to consumers, and Analytics is one of them. The Analytics service at Teads is a Scala-based app that queries data from the warehouse and stores it to tailored data marts. Interestingly, the data marts are actually AWS Redshift servers. In the final step, data is presented into intra-company dashboards and on the user’s web apps.
Remind’s data engineering team provides the whole company with access to the data they need, as big as 10 million daily events, and empower them to make decisions directly. They initially started with Redshift as its source of truth resource for data, and AWS S3 to optimize for cost.
While S3 is used for long-term storage of historical data in JSON format, Redshift only stores the most valuable data, not older than three months. The company uses Interana to run custom queries on their JSON files on S3, but they’ve also recently started using AWS Athena as a fully managed Presto system to query both S3 and Redshift databases.
The move for Athena also triggered a change in the data format from JSON to Parquet, which they say was the hardest step in building up their data platform. An EMR/Hive system is responsible for doing the needed data transformations between S3 and Athena. In the data ingestion part of the story, Remind gathers data through their APIs from both mobile devices and personal computers, as the company business targets schools, parents, and students. This data is then passed to a streaming Kinesis Firehose system before streaming it out to S3 and Redshift.
Remind’s future plans are probably focused on facilitating data format conversions using AWS Glue. This step would allow them to replace EMR/Hive from their architecture and use Spark SQL instead of Athena for diverse ETL tasks.
Robinhood is a stock brokerage application that democratizes access to the financial markets, enabling customers to buy and sell stocks and ETFs with zero commission. The company debuted with a waiting list of nearly 1 million people, which means they had to pay attention to scale from the very beginning.
Robinhood’s data stack is hosted on AWS, and the core technology they use is ELK (Elasticsearch, Logstash, and Kibana), a tool for powering search and analytics. Logstash is responsible for collecting, parsing, and transforming logs before passing them on to Elasticsearch, while data is visualized through Kibana.
They grew from a single ELK cluster with a few GBs of data to three clusters with over 15 TBs. Before data goes to ELK clusters, it is buffered in Kafka, as the various data sources generate documents at differing rates.
Kafka also shields the system from failures and communicates its state with data producers and consumers. As with many other companies, Robinhood uses Airflow to schedule various jobs across the stack, beating competition such as Pinball, Azkaban and Luigi. Robinhood data science team uses Amazon Redshift to help identify possible instances of fraud and money laundering.
Dollar Shave Club (DSC) is a lifestyle brand and e-commerce company that’s revolutionizing the bathroom by inventing smart, affordable products. Don’t be fooled by their name. They have a pretty cool data architecture for a company in the shaving business. Their business model works with online sales through a subscription service. Currently, they serve around 3 million subscribed customers.
DSC’s web applications, internal services, and data infrastructure are 100% hosted on AWS. A Redshift cluster serves as the central data warehouse, receiving data from various systems. Data movement is facilitated with Apache Kafka and can move in different directions – from production DBs into the warehouse, between different apps, and between internal pipeline components.
There’s also Snowplow, which collects data from the web and mobile clients. Once data reaches Redshift, it is accessed through various analytics platforms for monitoring, visualization, and insights. The main tool for the job is, of course, Apache Spark, which is mainly used to build predictive models, such as recommender systems for future sales.
Coursera is an education company that partners with the top universities and organizations in the world to offer online courses. They started building their data architecture somewhere around 2013, as both numbers of users and available courses increased. As of late 2017, Coursera provides courses to 27 million worldwide users.
Coursera collects data from its users through API calls coming from mobile and web apps, their production DBs, and logs gathered from monitoring. A backend service called “eventing” periodically uploads all received events to S3 and continuously publishes events to Kafka. The engineering team has selected Redshift as its central warehouse, offering much lower operational cost when compared with Spark or Hadoop at the time.
On the analytics end, the engineering team created an internal web-based query page where people across the company can write SQL queries to the warehouse and get the information they need. Of course, there are company-wide analytics dashboards that are refreshed on a daily basis. Finally, many decisions made in Coursera are based on machine learning algorithms, such as A/B testing, course recommendations, and understanding student dropouts.
Wish is a mobile commerce platform. It provides online services that include media sharing and communication tools, personalized and other content, as well as e-commerce. During the last few years, it grew up to 500 million users, making their data architecture out of date.
Before they scaled up, Wish’s data architecture had two different production databases: a MongoDB NoSQL database storing user data; and a Hive/Presto cluster for logging data. Data engineers had to manually query both to respond to ad-hoc data requests, and this took weeks at some points. Another small pipeline, orchestrated by Python Cron jobs, also queried both DBs and generated email reports.
After rethinking their data architecture, Wish decided to build a single warehouse using Redshift. Data from both production DBs flowed through the data pipeline into Redshift. BigQuery is also used for some types of data. It feeds data into secondary tables needed for analytics. Finally, analytics and dashboards are created with Looker.
Blinkist transforms the big ideas from the world’s best nonfiction books into powerful little packs users can read or listen to in 15 minutes. At first, they started selling their services through a pretty basic website, and they monitored statistics through Google Analytics. Unfortunately, visitor statistics gathered from Google Analytics didn’t match the figures the engineers were computing. This is one of the reasons why Blinkist decided to move to the AWS cloud.
They choose a central Redshift warehouse where data flows in from user apps, backend, and web front-end (for visitors tracking). To get data to Redshift, they stream data with Kinesis Firehose, also using Amazon Cloudfront, Lambda, and Pinpoint. The engineering team at Blinkist is working on a newer pipeline where ingested data comes to Alchemist, before passing it to a central Kinesis system and onwards to the warehouse.
Healthcare platform Halodoc found themselves with a common startup problem: scalability. Their existing data pipeline worked on a batch processing model, with regularly scheduled extractions for each source. They performed extractions with various standard tools, including Pentaho, AWS Database Migration Service, and AWS Glue.
They would load each export to S3 as a CSV or JSON, and then replicate it on Redshift. At this point, they used a regular Pentaho job to transform and integrate data, which they would then load back into Redshift.
As Halodoc’s business grew, they found that they were handling massive volumes of sensitive patient data that had to get securely and quickly to healthcare providers. The Pentaho transformation job, installed on a single EC2 instance, was a worrying single point of failure.
Halodoc looked at a number of solutions and eventually settled on Apache Airflow as a single tool for every stage of their data migration process. They chose Airflow because it’s highly responsive and customizable, with excellent error control. It also supports machine learning use cases, which Halodoc requires for future phases.
The new data pipeline is much more streamlined. Halodoc uses Airflow to deliver both ELT and ETL. In their ETL model, Airflow extracts data from sources. It then passes through a transformation layer that converts everything into pandas data frames. The data frames are loaded to S3 and then copied to Redshift. Airflow can then move data back to S3 as required.
For ELT, the Airflow job loads data directly to S3. Halodoc then uses Redshift’s processing power to perform transformations as required.
iHeartRadio is a global streaming platform for music and podcasts. It runs on a sophisticated data structure, with over 130 data flows, all managed by Apache Airflow. These data pipelines were all running on a traditional ETL model: extracted from the source, transformed by Hive or Spark, and then loaded to multiple destinations, including Redshift and RDBMSs.
On reviewing this approach, the engineering team decided that ETL wasn’t the right approach for all data pipelines. Where possible, they moved some data flows to an ETL model. Data flows directly from source to destination – in this instance, Redshift – and the team applies any necessary transformations afterward. Redshift Spectrum is an invaluable tool here, as it allows you to use Redshift to query data directly on S3 via an external meta store, such as Hive.
However, this model still didn’t suit all use cases. The iHeartRadio team began experimenting with the ETLT model (Extract, Transform, Load, Transform) model, which combines aspects of ETL and ELT. In this approach, the team extracts data as normal, then uses Hive for munging and processing. They then load the data to the destination, where Redshift can aggregate the new data.
Now, the team uses a dynamic structure for each data pipeline, so data flows might pass through ETL, ELT, or ETLT, depending on requirements. This new approach has improved performance by up to 300% in some cases, while also simplifying and streamlining the entire data structure.
We hope the 15 examples in this post offer you the inspiration to build your own data pipelines in the cloud.
If you don’t have any data pipelines yet, it’s time to start building them. Begin with baby steps and focus on spinning up an Amazon Redshift cluster, ingest your first data set and run your first SQL queries.
After that, you can look at expanding by acquiring an ETL tool, adding a dashboard for data visualization, and scheduling a workflow, resulting in your first true data pipeline. And once data is flowing, it’s time to understand what’s happening in your data pipelines.
That’s why we built intermix.io. We give you a single dashboard to understand when & why data is slow, stuck, or unavailable.
With intermix.io you can:
Our customers have the confidence to handle all the raw data their companies need to be successful. What you get is a real-time analytics platform that collects metrics from your data infrastructure and transforms them into actionable insights about your data pipelines, apps, and users who touch your data.
Setting up intermix.io takes less than 10 minutes, and because you can leverage our intermix.io experts, you can say goodbye to paying for a team of experts with expensive and time-consuming consulting projects. We can help you plan your architecture, build your data lake and cloud warehouse, and verify that you’re doing the right things.
At intermix.io, we use Amazon Redshift as part of our stack. Amazon Redshift is an OLAP database, and a valuable tool for data teams due to its low cost and speed for analytical queries. We have a particular use case though. We’re using Amazon Redshift in an OLTP scenario, i.e. we’ve built an analytical application on top of Redshift.
The challenge of using Redshift as an OLTP is that queries can lack the low-latency that would exist on a traditional RDBMS and transactional queries. This is a result of the column-oriented data storage design of Amazon Redshift which makes the tradeoff to perform better for big data analytical workloads. This comes from eliminating the need to traverse full records (all columns) for each query. Also, there are features of a traditional RDBMS that would often be useful when querying Redshift (e.g. indices).
In this post we detail a method of using Amazon Redshift as an OLTP. We will tell the story of how we off-load OLTP workloads from Amazon Redshift to Amazon RDS. We were able to do this without writing any complex ETL or changing our data processing strategy.
intermix.io is an analytics platform that provides a single monitoring dashboard for data engineers to keep an eye on their critical data flows. We surface information for data engineers using Amazon Redshift like how to optimize warehouse resources with actionable recommendations.
For example, our “Cluster Health” dashboard gives you a very quick overview of SLA measures for any connected app, down to the individual users as well as actionable recommendations on how to improve performance and optimize costs.
If you’d like to instantly see personalized recommendations for your cluster, you can start a 14-day free trial on this link.
There are three steps for acquiring the data we need to provide our analytics to our customers:
We use Amazon Redshift for:
serving data to our product dashboard (this is the piece that was giving us trouble)
In the early stage of intermix.io, our dashboard pages would pull data from Redshift directly. But as we added more and more customers, we realized the dashboard pages were not performing well. This was due to two factors:
We did use the best practices to manage our Redshift cluster, including the proper set-up of our workload management (WLM) and all 15 performance tuning techniques. But we also knew that using Redshift for this OLTP use case — serving data to the dashboard – was not going to work long term. So we started looking for solutions.
Our ideal solution would:
Amazon Redshift is a columnar-oriented petabyte-scale data warehouse. It is an OLAP database suitable for analytical queries and applications. It uses a modified version of PostgreSQL 8.
The solution was to use Amazon RDS as a ‘cache’ by leveraging the Dblink feature of RDS. This allows us to make any Amazon Redshift table available as a relation in RDS, without the need to explicitly copy that data over. Dblink handles moving the data at the block level.
The benefits of this solution included:
Before we go over the installation instructions, we’ll cover some of the features that using Dblink with Redshift provides.
The main features presented in this post require a Postgres instance with Postgres version >= 9.4 (some features listed in this section require more recent versions of Postgres). We use an RDS-hosted version of Postgres but other types of instances will work as long as they meet the minimum requirements.
For installing on Redshift, perform the following commands:
GRANT SELECT ON all TABLES IN SCHEMA data to <amazon_redshift_username>;
Enable dblink and postgres_fdw in the Postgres RDS instance (only needs to be done once per database).
CREATE EXTENSION postgres_fdw;
CREATE EXTENSION dblink;
Create the server link and user mapping.
CREATE SERVER redshift_server FOREIGN DATA WRAPPER redshift_fdw
OPTIONS (host '<amazon_redshift_ip>', port '<port>', dbname '<database_name>', sslmode 'require');
CREATE USER MAPPING FOR <rds_postgresql_username> SERVER redshift_server
OPTIONS (user '<amazon_redshift_username>', password '<password>');
You are now ready to query your Redshift data in RDS.
This approach allows you to query Redshift while inside of a Postgres instance. It bridges data but doesn’t confer any performance benefits. For example, to get a list of users that were updated this year.
FROM dblink('redshift_server', $REDSHIFT$
WHERE updated_on >= '2018-01-01';
$REDSHIFT$) AS t1 (id int);
To create a queryable cache of the Redshift data, we create a materialized view. You can also select a subset of rows and a subset of columns if you don’t require the full table.
CREATE MATERIALIZED VIEW mv_users AS
FROM dblink('redshift_server', $REDSHIFT$
SELECT id, updated_on
$REDSHIFT$) AS t1 (id int, updated_on timestamp);
A materialized view will not self-update. So we need an approach to refresh the data. If you only wish to execute live queries (no materialized views) then you can ignore this section.
Easier and faster but with concurrent reads blocked.
REFRESH MATERIALIZED VIEW mv_users;
This will block concurrent reads, so if you need to be able to always read from the view follow the next approach.
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_users;
This requires a unique index to work. For example, after creating the view:
CREATE UNIQUE INDEX mv_user_id_updated_on_idx ON data.mv_users (id, updated_on);
If the data you retrieve takes more than 5 minutes to transfer or you don’t mind refreshing synchronously, you can issue the REFRESH commands above inside your code or in a periodic task as needed.
If retrieving the data from Redshift takes fewer than 5 minutes, AWS’s Lambda is a good approach to use here. Be mindful of concurrency, for example it’s bad practice to issue multiple concurrent refreshes on a single table while another refresh is currently happening on the same table. We’ll have more about how we approach this in a subsequent blog post.
The above refresh methods query the full set of data each time (even if nothing has changed), if you don’t have updates on your data you may want to use the following approach instead:
Dblink ingestion incurs regular data transfer costs, hence if you don’t want to be charged for the network traffic of synchronizing the data please have your Postgres RDS instance in the same AZ as your Redshift leader node.
You need to ensure that the user permissions for the users querying data via Dblink matches that of the same users querying directly on Redshift, so that additional permissions aren’t granted by querying Dblink.
Amazon Redshift supports identity-based policies (IAM policies). We recommend using IAM identities (e.g. a user, a group, a role) to manage cluster access for users vs. creating direct logins in your cluster. That’s because nobody ever keeps track of those logins. For example, at intermix.io we use IAM to generate temporary passwords for our clusters.
In the AWS console, keep an eye on the following Postgres RDS metrics to ensure that the instance isn’t overwhelmed:
And then, of course, we use our own intermix.io dashboard to monitor the performance of our Amazon Redshift clusters and our data pipelines. A few select key metrics we pay attention to are:
For example, in image 2 you can see a distribution of all queries for a certain time interval. We can see how much memory each query consumes, if it’s disk-based or not, and what the total query return time (execution time + queue time) is.
Not only that but if you click on a single query we surface every single detail about it as well as actionable insights based on your cluster about how to improve performance instantly.
The total on-disk size limit for Postgres on RDS is 16TB, so the data you link cannot be greater than that (such a transfer would take at least 3.6 hours on a 10Gbit connection). The total table size on Redshift can be greater than 16TB but your materialized view query cannot select a quantity of rows and columns combined with index overhead that would exceed 16TB.
A current limitation of the DNS resolving on RDS is such that lookups don’t resolve to private IPs for DNS queries performed inside of Dblink. Hence if you’re running in a VPC without publicly routable Redshift then you’ll need to setup your own DNS to resolve to the private IP or hardcode Redshift Leader node private IPs when provisioning Dblink.
With the approach in this post, we showed you how it’s possible to use Amazon Redshift in an OLTP scenario. By using Amazon RDS as a ‘cache’ and leveraging its Dblink feature, we can make any Amazon Redshift table available as a relation in RDS, without the need to explicitly copy that data over.
So if your using Amazon Redshift in an OLTP scenario for building data apps, and you want to monitor the performance of those data apps – schedule a call with us. We’d love to look at your set-up, and help you with a free trial of intermix.io to tune your Amazon Redshift cluster.
To win in today’s market, companies must invest in both their infrastructure and their people. Data-first companies like Netflix, Uber and Tinder are dominating their industries. Mentions of “AI” are often heard in advertisements, product launches, and earnings calls. Businesses are scrambling to re-imagine their tech infrastructure and provide their people with things like Amazon Redshift training.
Why is this happening now?
Data is the new differentiator:
Moving towards a data-driven organization can be daunting for decision-makers. Learning investments such as Amazon Redshift training can sometimes produce limited results, thanks to the complex nature of modern data platforms, which can intimidate even the most experienced IT professionals.
We’re all generating a lot more data than ever before. Storing it is a challenge; analyzing it is an even bigger challenge. Over the years, we’ve seen highly structured data warehouses give way to more anarchic data lakes, which in turn are giving way to multi-tiered structures like data lakehouses.
As a result of the shift, traditional concepts of “ETL” are being re-imagined for a new world where:
The response has been that teams are building complex data assembly lines which have common characteristics:
Data lake architectures pose new challenges. Since lakes store their data without any oversight of the contents, the data needs to have defined mechanisms for cataloging in order to make it usable. Without this, data isn’t reliable. Companies need data pipelines that offer governance consistency and access controls to the data lake.
New technologies like Amazon Redshift allow all of the storage and processing power required to run a data-first business, but you still need the tools to turn that data into actionable insights.
Companies use the data for the following purposes:
Building and managing complex data assembly line requires a new skill set.
The last time we saw a comparable shift was back when cloud computing was first developed. Running cloud apps required an operational (ie uptime, cost, and performance) mindset coupled with an ability to write code. The DevOps role was born out of a need to manage infrastructure as code. As a result, engineering teams had to establish new teams and hire new types of employees. These employees needed different tools to do their jobs.
Similarly, a new role has emerged to manage data pipelines: the data engineer. Data engineers manage complex data flows by writing code that manipulates data. But they are also accountable for the uptime, performance, and cost accounting for the data flows. This skillset is a combination of DevOps and data analyst, with a sprinkling of a database administrator. They also need platform knowledge, such as Amazon Redshift training.
Building a data assembly line involves:
Raw data often exists in application silos, stranded.93% of enterprises have a multi-cloud strategy, while 87% have a hybrid cloud strategy. So, for many organizations, data exists across multiple clouds, plus their own data centers. They will need to inspect and catalog this data before they realize any value.
Security is a fundamental part of any data pipeline. The data owners must audit all access, and also ensure that the right people and processes have the right permissions.
Before anyone can trust data, they have to transform and cleanse it. Consider how you identify a customer across the data – is it by email, name, or some unique ID? If you want to combine two data sources, which one of those do you use? This stage involves data validation, removal of duplication, and handling of null values.
Mobile and IoT generate huge volumes of data. Running fast queries on huge data volumes requires careful planning, tuning, and configuration of data analytics infrastructure.
Most companies start by hiring data engineers to implement reliable data pipelines, and they may also hire data scientists with analytics skills. For a data pipeline, you may also require staff with specialist knowledge, such as Hadoop, ETL or Amazon Redshift training.
Unliked end-user applications, data apps runs jobs on the data assembly line. There are three categories of data apps.
The fundamental problem solved by the data engineer is to ensure that the data assembly line is working.
Are data flows operating normally?
Do my data tables contain the correct results?
Are data apps able to access the data quickly?
This requires answering questions in real-time across multiple systems:
In order to accomplish this, you need new types of metrics. Traditional networking monitoring metrics like CPU and network utilization are irrelevant when monitoring data assembly lines, because data flows operate at a different layer.
A monitoring tool for data flows must consider:
We started intermix.io to solve these problems. Our mission is to provide data engineers with a single dashboard to help them monitor their mission critical data flows. And if there are problems, that they are the first to know and the reason why.
From Intermix’s Amazon Redshift Training series, learn how to build a stable data pipeline using these three simple steps.
When you’re building or improving your data pipeline, batch data processing seems simple. Pull data from a source, apply some business logic to it, and load it for later use. When done well, automating these jobs is a huge win. It saves time and empowers decision-makers with fresh and accurate data.
But this kind of ETL (Extract, Transform, and Load) process gets very tricky, very quickly. A failure or bug in a single step of an ETL process has cascading effects that can require hours of manual intervention and cleanup. These kinds of issues can be a huge time sink for the analysts or engineers that have to clean up the mess. They negatively impact data quality. They erode end-user confidence. And they can fuel anxiety when pushing out even the most innocuous changes to a data pipeline.
To avoid falling into this kind of ETL black hole, analysts and engineers need to make data pipelines that:
In short, they need to make data pipelines that are idempotent.
Idempotent is a mathematical term describing an operation that can be applied arbitrarily many times without changing the result. Such data pipelines are fault-tolerant, reliable, and easy to troubleshoot. And they’re much simpler to build than you might expect.
Here are three simple steps you can take to make your data pipeline idempotent.
Amazon recommends several best practices for loading data. The easiest way to be sure that a buggy or failed ETL run doesn’t dirty up your data is to isolate the data that you deal with in a given run.
If you’re working with time-series data, this is reasonably straightforward. When pulling data from your initial source, filter it by time. Operate on an hour, a day, or a month, depending on the volume of your data. By operating on data in time-windowed batches like this, you can be confident that any bugs or failures aren’t impacting your data quality beyond a single window.
If you’re working with cross-sectional data, try to deal with some explicitly-defined subset of that data in a given run. Knowing exactly what data has been impacted by a bug or failure makes clean up and remediation of data pipeline issues relatively straightforward. Just delete the affected data from your final destination and then run again.
If you isolate your data correctly, you never have to worry about duplicated or double-counted data points.
It’s important to build your pipeline so that it either loads everything or nothing at all. This will help to avoid time-consuming cleanup of failed or buggy pipeline runs. That is to say, your runs should be atomic.
If you’re loading into a SQL database, achieving atomicity for your data pipeline is as simple as wrapping all of your loading statements–inserts, copies, creates, or updates–in a transaction with a BEGIN statement at the beginning and an END or COMMIT statement at the end.
If the final destination for your data is not a SQL database–say an Elasticsearch cluster or an Amazon S3 bucket–building atomicity into your pipeline is a little more tricky. But it’s not impossible. You’ll simply need to implement your own “rollback” logic within the load step.
For instance, if you’re writing CSV files to an Amazon S3 bucket from a python script, maintain a list of the resource identifiers for the files you’ve written. Wrap your loading logic in a try-except-finally statement which deletes all of the objects you’ve written to if the script errors out.
By making your data pipeline atomic, you prevent the need for time-consuming cleanup after failed runs.
Related Reading: Troubleshooting Data Loading Errors in Amazon Redshift
Now that you’ve isolated the data your pipeline deals within a given run and implemented transactions, having your pipeline clean up after itself is a breeze. You just need to delete before you insert. This is the most efficient way to avoid loading duplicate data.
Say your data pipeline is loading all of your company’s sales data from the previous month into a SQL database. Before you load it, delete any data from last month that’s already in the database. Be sure to do this within the same transaction as your load statement. This may seem scary. But if you’ve implemented your transaction correctly, your pipeline will never delete data without also replacing it.
Many SQL dialects have native support for this type of operation, typically referred to as an upsert. For instance in PostgreSQL (which is also the SQL dialect used by Amazon Redshift) you could prevent loading duplicate data with the following:
INSERT INTO sales (employee, customer, contract_number, annual_revenue, date)
VALUES ('Janet', 5439, 50000, '2018-07-20')
ON CONFLICT (contract_number)
DO UPDATE SET employee = excluded.employee;
This SQL statement will attempt to insert a new row into the database. If a row already exists with contract number 5439, the existing row will be updated with the new employee name. In this way, duplicate rows for the same contract will never load into the database.
By having your pipeline clean up after itself and prevent the loading of duplicate data, backfilling data becomes as simple as rerunning. If you discover a bug in your business logic or you want to add a new dimension to your data, you can push out a change and backfill your data without any manual intervention whatsoever.
ETL pipelines can be fragile, difficult to troubleshoot, and labor-intensive to maintain. They’re going to have bugs. They’re going to fail once in a while.
The key to building a stable, reliable, fault-tolerant data pipeline is not to build it so that it always runs perfectly. The key is to build it in such a way that recovering from bugs and failures is quick and easy.
The best data pipelines can recover from bugs and failures without ever requiring manual intervention. By applying the three simple steps outlined above, you’re well on your way to achieving a more fault-tolerant, idempotent data pipeline.
We’ve built an ETL performance monitoring solution with end-to-end visibility that’s used by top analytics teams. Discover more at intermix.io.
At intermix.io, we spend all day helping our customers optimize their performance on Amazon Redshift. We have a front-row view of all the ways that Redshift can be used to help businesses manage their data. Redshift is a versatile product that can help businesses aggregate, store, analyze, and share their data. We’ve collected 4 Amazon Redshift use cases that help businesses get more out of their data
Redshift started out as a simpler, cheaper, and faster alternative to legacy on-premise warehouses. Fast forward to now, and the use cases are way more sophisticated than just running a data warehouse in the cloud.
In this article, you will find 4 examples of Amazon Redshift use cases:
Data warehouse technology has been around since Kimball and Inmon. What changed with Amazon Redshift was the price at which you can get it – about 20x less than what you had to carve out for legacy vendors like Oracle and Teradata.
The use case for data warehousing is to unify disparate data sources in a single place and run custom analytics for your business.
Let’s say you‘re the head of business intelligence for a web property that also has a mobile app. The typical categories of data sources are:
With a rich ecosystem of data integration vendors, it’s easy to build pipelines to those sources and feed data into Redshift. Put a powerful BI / dashboard tool on top, and you have a full-blown BI stack.
A key advantage of Redshift is simplicity. It used to take months to get a data warehouse up and running. And you’d need the help of an Accenture or IBM. None of that anymore. You can spin up a Redshift cluster in less than 15 minutes, and build a whole business intelligence stack in a weekend.
Take the case of NTT Docomo, for instance. The company built a scalable, secure data warehouse on Amazon Redshift.
The combination of price / speed / simplicity have expanded the addressable market for data warehousing from large corporations to SMBs. However, because it is so easy to get going, data engineers must make sure to follow the best practices when setting up their cluster and avoid any performance issues. As the volume of data increases, it can increase the complexity of data pipelines. Any data solution should be easily scalable.
Previous generations of data warehouses had to aggregate data it was too expensive to store raw data. That changed with Amazon Redshift. Since Redshift is cheap, it’s possible to store raw, event-level data without exorbitant storage costs.
There are three key benefits to using event-level data:
Since Redshift is fast and cheap, processing machine data data is cost-effective. You can also drive the time required for “ingest-to-insight” (i.e. the time between pushing data into Redshift and the final analysis output) below the 5 minute mark. And not just for basic aggregations, but complex 10-way joins, across billions (billions with a “b”) of rows. That’s remarkable computational power.
Storing and processing raw data is one of the best Amazon Redshift use cases for businesses that deal with machine-generated data. This is high-velocity data, such as data from weblogs and clickstreams.
The business value of processing raw data at high speeds comes with exposing that data back into the business for enabling better business insights and new data-driven services. Here are two examples of companies that leveraged Amazon Redshift for real-time streaming analytics:
The business value here goes beyond mere cost savings by migrating your warehouse to the cloud. Rather, you’re enabling new services, informed by data. These “data-driven services” are the foundation for better / faster decision making. They can also be new revenue-generating products.
That distinction is key. Previously, companies would look at a data warehouse as a “cost center”. So, the goal was to keep that cost down as much as possible by limiting the exposure of data to a limited set of people, among other tactics.
With the advent of the cloud and Amazon Redshift, it makes sense to increase spend on your data infrastructure. An increase in spending on data analysis can lead to large increments in revenues.
That brings us to the next Amazon Redshift use case:
Not all companies have the technical abilities and budget to build and run a custom streaming pipeline with near real-time analytics.
But analytical use cases can be pretty similar across a single industry or vertical. That has given rise to “analytics-as-a-service” vendors. They use Redshift under the covers, to offer analytics in a SaaS model to their customers.
These vendors either run a single cluster in a multi-tenant model, or offer a single cluster to customers in a premium model. Take Acquia Lift as an example. The pricing model is a subscription fee to the analytics service.
To give you a rough estimate of how much an analytics service can cost in a SaaS model: in a multi-tenant model, you can cram data from 10s of customers onto a single node cluster, which costs you ~$200 / month. The average price of analytics services is about $500 / month / subscriber. Thus, for about $700 per month you get the ability to quickly analyze important data and unlock business insights.
One of the common use cases for Amazon Redshift is to use it for mission-critical workloads. Here, data sitting in Redshift feeds into time-sensitive apps. It’s key that the database stays up; otherwise, the business goes down (quite literally).
A common example of time-sensitive data reporting is stock exchanges, such as the London Stock Exchange. There is daily reporting involved and the reporting can’t be late or wrong
Other use cases include building predictive models on top of Redshift, and then embed the results programmatically into another app, via a data API. An example is automated ad-bidding, where bids across certain ad networks are adjusted on a near real-time basis. The adjustments are calculated on ROI and performance of ad types over a certain time period.
Redshift has driven down the cost of running a data warehouse. It has also opened up opportunities for small and medium enterprises to deliver data-driven services and create new revenue streams
However, Redshift can be a difficult beast to tame. To get the maximum out of your data, it is important that your cluster is running efficiently at all times. If you’re part of a data team that’s building mission-critical data pipelines, sign-up for a free trial of intermix.io to improve your Redshift performance.