This is a guest blog post by Pete DeJoy. Pete is a Product Specialist at Astronomer, where he helps companies adopt Airflow.
Apache Airflow has come a long way since it started as an internal project at Airbnb back in 2014, thanks to the core contributors’ fantastic work in building a very engaged community while doing some superhero lifting of their own. In this post, we’re going to go through some of the exciting things coming down the pipe as the project gears up for a hotly anticipated 2.0 release.
Apache Airflow is an open-source workflow management system that allows you to programmatically author, schedule, and monitor data pipelines in Python. It is the most popular open-source tool on the market for managing workflows, with over 8,500 stars and nearly 500 contributors on GitHub.
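To make "data pipelines in Python" a bit more concrete, here is a minimal sketch of the core idea behind any workflow scheduler: tasks declare what they depend on, and the scheduler derives a valid run order. This is not Airflow's API. The task names and the `run_order` helper are assumptions made up for illustration, and the sketch only uses Python's standard library (`graphlib`, Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_order(dependencies):
    """Return one valid execution order for the given task graph."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(pipeline)
print(order)
```

A real scheduler adds a lot on top of this (retries, schedules, monitoring), but the dependency graph is the part you author in code.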
Since being open-sourced in 2015, Airflow has proven to be the dominant tool in its class (beating out alternatives like Spotify’s Luigi, Pinterest’s Pinball, and even the ever-present Hadoop-centric Oozie) because of its core principles of configurability, extensibility, and scalability. As we’ll see, the new features being developed and direction of the project still adhere to these principles.
The improvements below are either in active PRs or recently merged, and will be included in the upcoming Airflow 1.10 release.
RBAC and the UI
Joy Gao (https://twitter.com/joygao) took on the herculean task of converting the front end framework from Flask Admin to Flask AppBuilder (https://issues.apache.org/jira/browse/AIRFLOW-1433), which was an incredible feat for one person to accomplish largely on her own. One of the primary benefits of this update is ground-level support for role-based access control (RBAC), which will open the door to various auth backends and allow admin users to dictate who has access to specific elements of their Airflow cluster.
Along with this RBAC capability will come an improved UI that will allow for better security around user access. Our team at Astronomer is hoping to fix the UI so that refreshing the dashboard is not needed to check on the status of DAGs – we’d like to be able to view the status of our DAGs in real time without driving ourselves crazy pressing that refresh button.
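The role-based access control idea described above can be sketched in a few lines. This is an illustration of the concept only, not the Flask AppBuilder or Airflow implementation. The role names, permission names, users, and the `is_allowed` helper are all hypothetical.

```python
# Hypothetical roles and permissions, for illustration only.
ROLE_PERMISSIONS = {
    "admin": {"view_dag", "trigger_dag", "edit_connection"},
    "operator": {"view_dag", "trigger_dag"},
    "viewer": {"view_dag"},
}

# Hypothetical assignment of users to roles.
USER_ROLES = {
    "alice": {"admin"},
    "bob": {"viewer"},
}


def is_allowed(user, permission):
    """True if any role assigned to the user grants the permission."""
    roles = USER_ROLES.get(user, set())
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)


print(is_allowed("alice", "edit_connection"))
print(is_allowed("bob", "trigger_dag"))
```

The point of the indirection through roles is that an admin changes what a role may do in one place, rather than editing permissions user by user.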
The Kubernetes Executor
One of the most exciting developments the Astronomer team is anticipating is the release of the Kubernetes Executor that Daniel Imberman (https://github.com/dimberman) of Bloomberg has been leading development on. This is long-awaited by the community and will allow users to auto-scale workers via Kubernetes, ensuring that resources are not wasted. This is especially important for expanding the viable use cases for Airflow, as right now many are forced to either run Airflow on a low-powered EC2 instance and use it to schedule external jobs, or run it on expensive hardware that is massively underutilized when tasks aren’t actively running. While it is being included in the 1.10 release, the new executor is part of a long-term but active effort to make Airflow completely cloud-native, which we’ll discuss in the following section.
In addition to the short-term fixes outlined above, there are a few longer-term efforts being worked on that will have a huge bearing on the stability and usability of the project. Most of these items have been identified by the Airflow core maintainers as necessary for the v2.x era and subsequent graduation from “incubation” status within the Apache Foundation.
First Class API Support
A widely requested feature (at least from users of Airflow that we work with) is first-class support for an API to control everything from connection creation to DAG pausing to requesting usage metrics. Right now, the API is strictly experimental with limited functionality, so, in practice, if you want to create this behavior, you end up writing a plugin that directly manipulates the underlying MySQL or PostgreSQL database. Ideally, given that much of the current functionality in the UI is based on direct modification of the database and not via any API, the inclusion of a first-class API that handles all functionality would mean that everything done in the UI could also be done in the CLI, further expanding the use cases Airflow could facilitate.
Making Airflow Cloud Native
As mentioned above in relation to the Kubernetes Executor, perhaps the most significant long-term push in the project is to make Airflow cloud native. Today it is still up to the user to figure out how to operationalize Airflow for Kubernetes, although at Astronomer we have done this and provide it in a dockerized package for our customers.
We feel this is an important step for the project to keep up with the changing deployment landscape and we plan to open-source what we can as we go, but knocking this out is easier said than done. One of the most fundamental problems blocking this initiative is the need for a high-availability, massively distributed, and auto-rebalancing datastore, something that is hard to achieve with a simple PostgreSQL or MySQL setup.
A promising lead towards addressing this is added support for CockroachDB, a database following the Google Spanner whitepaper (and founded by former Google File System engineers) and designed precisely for the features listed above.
Improved Test Suite
A common complaint among contributors to Airflow is how long it can take Travis, the CI of choice for the Airflow project, to run all the tests when cutting a new release. This has been brought up in the past, but now that Airflow’s code base has hit a scale where a Travis run can take up to an hour, we see an improved test suite finally making it over the line (and are looking forward to helping!). One factor that should help is the proposed breakout of plugins, which have been growing and are a large code base in and of themselves. Which brings us to…
Before we talk about this, it’s important to note that this is NOT on the official Airflow roadmap (at least not yet) but is rather something that the Astronomer team has been mulling around as we see the continued proliferation of plugins.
The brilliance of Airflow plugins (and why they have contributed in no small part to the success of the entire project) is how wide-ranging they can be, enabling your workflows to connect with GCP, AWS, and Hadoop ecosystems as well as any number of other APIs and databases rather trivially.
Ironically, this is also their weakness. Importing Google Cloud plugins and opening the door to additional processing and dependency conflicts makes zero sense if your stack is on AWS. Plugins have to interact with constantly changing APIs, which makes them inherently brittle and means they need a different release schedule from the core project, whose releases are slower because any error could affect core functionality.
All plugins should be on their own release schedule with an independent testing suite to make sure that all updates take advantage of the latest changes in external projects. Getting this to be as easy as a pip install will be huge for making Airflow more available to other systems.
As we look toward the next year of our roadmap, we’re doubling down on our community contributions to help Airflow retain its status as the most flexible, extensible, and reliable scheduler available, regardless of how it’s being run. In Astronomer speak, we’d recommend hanging onto your helmets – the ship is about to kick into hyperdrive.
Every day, we talk to companies who are in the early phases of building out their data infrastructure. A lot of times these conversations circle around which technology to pick for which job. For example, we often get the question “what’s better – Spark or Amazon Redshift?”, or “which one should we be using?”. Spark and Redshift are two very different technologies. It’s not an either / or, it’s more of a “when do I use what?”. In this post, I’ll lay out some of the differences and when to use which technology.
At the time of this post, if you look under the hood of the most advanced tech start-ups in Silicon Valley, you will likely find both Spark and Redshift. Spark is getting a little bit more attention these days because it’s a new shiny toy. But they cover different use cases (“dish washer vs. fridge”, per Ricardo Vladimiro).
Let’s give you a decision-making framework that can guide you through your thinking:
Apache Spark is a data processing engine. With Spark you can:
There is a general execution engine (Spark Core), and all other functionality is built on top of it.
People are excited about Spark for three reasons:
Spark is fast because it distributes data across a cluster and processes that data in parallel. It tries to process data in memory, rather than shuffling things in and out of disk (as MapReduce does, for example).
Spark is easy because it has a high level of abstraction, allowing you to write applications with fewer lines of code. Plus, Scala and R are attractive for data manipulation.
Spark is extensible via the pre-built libraries, e.g. for machine learning, streaming apps or data ingestion. These libraries are either part of Spark or 3rd party projects.
In short, the promise of Spark is to speed up development, make applications more portable and extensible, and make the actual application run faster.
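The "fast" point above (split the data, process the pieces in parallel, combine the partial results in memory) can be sketched with plain Python. This is not Spark code and says nothing about Spark's actual internals. The function names and the choice of a thread pool are assumptions for illustration. Python threads will not give real CPU parallelism here, so read this for the shape of the computation rather than for speed.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n):
    """Split data into n roughly equal chunks."""
    return [data[i::n] for i in range(n)]


def partial_sum_of_squares(chunk):
    """Work done independently on one chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=4):
    """Process each chunk in a worker, then combine the partial results."""
    chunks = partition(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)


total = parallel_sum_of_squares(list(range(1, 101)))
print(total)  # 338350
```

In a cluster setting the chunks would live on different machines, and keeping the intermediate results in memory avoids a round trip to disk between steps.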
A few more noteworthy points on Spark:
You need to know how to write code to use Spark (the “write applications” part). So the people who use Spark are typically developers.
Amazon Redshift is an analytical database. With Redshift you can:
Redshift is a managed service provided by Amazon. Raw data flows into Redshift (called “ETL”), where it’s processed and transformed at a regular cadence (“transformation” or “aggregations”), or on an ad-hoc basis (“ad-hoc queries”). Another term for loading and transforming data is “data pipelines”.
People are excited about Redshift for three reasons:
Redshift is fast because its massively parallel processing (MPP) architecture distributes and parallelizes queries. Redshift allows a high query concurrency and processes queries in memory.
Redshift is easy because it can ingest structured, semi-structured and unstructured datasets (via S3 or DynamoDB) up to a petabyte or more, to then slice ‘n dice that data any way you can imagine with SQL.
Redshift is cheap because you can store data for a $935/TB annual fee (if you use the pricing for a 3-year reserved instance). That price-point is unheard of in the world of data warehousing.
In short, the promise of Redshift is to make data warehousing cheaper, faster and easier. You can analyze much bigger and complex datasets than ever before, and there’s a rich ecosystem of tools that work with Redshift.
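To put the storage price quoted above in perspective, here is a small worked calculation. The $935 per TB per year figure comes from the text; the 10 TB warehouse size is a made-up example, and real bills depend on node types and usage that this sketch ignores.

```python
PRICE_PER_TB_YEAR = 935  # USD, the 3-year reserved instance figure quoted above


def annual_cost(terabytes):
    """Yearly storage cost in USD for the given number of terabytes."""
    return terabytes * PRICE_PER_TB_YEAR


def monthly_cost(terabytes):
    """Same cost spread evenly over twelve months."""
    return annual_cost(terabytes) / 12


# Hypothetical 10 TB warehouse.
print(annual_cost(10))             # 9350
print(round(monthly_cost(10), 2))  # 779.17
```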
A few more noteworthy points about Redshift:
With intermix.io we make it very easy to figure out what knobs to turn when using Amazon Redshift. For example, below is a screenshot from our “Cluster Health” dashboard.
The Cluster Health Dashboard helps data teams measure and improve SLAs. It does this by surfacing:
You need to know how to write SQL queries to use Redshift (the “run big, complex queries” part). So the people who use Redshift are typically analysts or data scientists.
In summary, one way to think about Spark and Redshift is to distinguish them by what they are, what you do with them, how you interact with them, and who the typical user is.
I’ve hinted at how you see both Spark and Redshift deployed. That gets us to data architecture.
In very simple terms, you can build an application with Spark, and then use Redshift both as a source and a destination for data.
Why would you do that? A key reason is the difference between Spark and Redshift in the way they process data, and how much time it takes to produce a result.
A highly simplified example: fraud detection. You could build an app with Spark that detects fraud in real time from, e.g., a stream of bitcoin transactions. Given its near-real-time character, Redshift would not be a great fit in this case.
But let’s say you wanted more signals for your fraud detection, for better predictions. You could load data from Spark into Redshift. There, you join it with historic data on fraud patterns. But you can’t do that in real time; the result would come too late for you to block the transaction. So you use Spark to, e.g., block a transaction in real time, and then wait for the result from Redshift to decide whether you keep blocking it, send it to a human for verification, or approve it.
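The two-stage decision described above can be sketched as a fast check followed by a slower review. Everything here is hypothetical: the amount threshold, the score cut-offs, and the function names are invented for illustration, and neither function talks to Spark or Redshift. In practice the fast check would run in the streaming app and the score would come back later from the warehouse.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    tx_id: str
    amount: float


# Hypothetical threshold for the fast, streaming-side rule.
FAST_PATH_AMOUNT_LIMIT = 10_000.0


def fast_check(tx):
    """Real-time decision: block immediately if the simple rule fires."""
    return "blocked" if tx.amount > FAST_PATH_AMOUNT_LIMIT else "approved"


def slow_review(tx, historic_fraud_score):
    """Later decision, once a warehouse-side score is available.

    historic_fraud_score is a hypothetical value between 0 and 1.
    """
    if historic_fraud_score >= 0.8:
        return "keep_blocked"
    if historic_fraud_score >= 0.4:
        return "human_review"
    return "approved"


tx = Transaction("tx-1", 25_000.0)
first = fast_check(tx)
# Only transactions blocked on the fast path wait for the slower review.
final = slow_review(tx, historic_fraud_score=0.5) if first == "blocked" else first
print(first, final)
```

The design choice is that the fast path errs on the side of blocking, and the slower path, with more signals, gets the final say.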
In December 2017, the Amazon Big Data Blog had another example of using both Spark and Redshift: “Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning”. The post covers how to build a predictive app that tells you how likely a flight will be delayed. The prediction happens based on the time of day or the airline carrier, by using multiple data sources and processing them across Spark and Redshift.
You can see how the separation of “apps” and “data warehousing” we created at the start of this post is, in reality, an area that’s shifting or even merging.
To help data engineers stay on top of both the apps and the data warehouse we’ve built a feature in intermix.io called “App Tracing”.
What we do is correlate information about applications (dashboards, orchestration tools) with cluster performance data.
App Tracing can answer questions like:
The border between developers and business intelligence analysts / data scientists is fading. That has given rise to a new occupation: data engineering. I’ll use a definition for data engineering by Maxime Beauchemin:
“In relation to previously existing roles, the data engineering field [is] a superset of business intelligence and data warehousing that brings more elements from software engineering, [and it] integrates the operation of ‘big data’ distributed systems”.
Spark is such a “big data” distributed system. Redshift is the data warehousing part. Data engineering is the discipline that brings both together. That’s because you see “code” moving its way into data warehousing. Code allows you to author, schedule and monitor data pipelines that feed into Redshift, incl. the transformations on the data once it sits inside your cluster. And you’ll very likely have to ingest data from Spark. And so the trend to “code” in warehousing implies that knowing SQL is not sufficient any more. You need to know how to write code. Hence the “data engineer”.
This post is already way too long, but I hope it provides a useful summary on how to think about your data stack. For your big data architecture, you will likely end up using both Spark and Redshift, each one to fulfill the specific use case it is best suited for.