Table of Contents
Apache Airflow has come a long way since it was first started as an internal project within Airbnb back in 2014 thanks to the core contributors’ fantastic work in creating a very engaged community while all doing some superhero lifting of their own. In this post, we’re going to go through some of the exciting things coming down the pipe as the project gears up for a very hotly anticipated 2.0 release.
Apache Airflow is an open-source workflow management system that allows you programmatically author, schedule, and monitor data pipelines in Python. It is the most popular and effective open-source tool on the market for managing workflows, with over 18,000 stars and over 1,300 contributors on Github.
Since being open-sourced in 2015, Airflow has proven to be the dominant tool in its class, beating out alternatives like Spotify’s Luigi, Pinterest’s Pinball, and even the ever-present Hadoop-centric Oozie. This is due to its core principles of configurability, extensibility, and scalability. As we’ll see, the new features being developed and the direction of the project still adheres to these principles.
Improvements that are currently being worked on in active PRs or recently merged and will be included in the ongoing Apache Airflow 1.10 release.
Joy Gao took on the herculean task of converting the front end framework from Flask Admin to Flask AppBuilder, which was an incredible feat for one person to do largely on her own. One of the primary benefits realized in this update is ground-level support for role-based authentication controls (RBAC) that will open the door for various auth backends and allow admin users to dictate who has access to specific elements of their Airflow cluster.
Along with this RBAC capability will come an improved UI that will allow for better security around user access. Our team at Astronomer is hoping to fix the UI so that refreshing the dashboard is not needed to check on the status of DAGs – we’d like to be able to view the status of our DAGs in real-time without driving ourselves crazy pressing that Refresh button.
One of the most exciting developments anticipated by the Astronomer team is the release of the Kubernetes Executor that Daniel Imberman of Bloomberg has been leading development on. This is long-awaited from the community and will allow users to auto-scale workers via Kubernetes, ensuring that resources are not wasted. This is especially important for expanding the viable use cases for Airflow, as right now we must run Airflow on a low powered EC2 instance and use it to schedule external jobs or run it on expensive hardware that is massively underutilized when tasks aren’t running. While it is being included in the 1.10 release, the release of a new executor is part of a long-term but active effort to make Airflow completely could-native, which we’ll discuss in the following section.
In a procedural change, the Airflow community moved their issue tracking from Jira to Github. This signals their growth as a community and their ability to adapt to the changing environments of issue tracking.
In addition to the short-term fixes outlined above, there are a few longer-term efforts that will have a huge bearing on the stability and usability of the project. Most of these items have been identified by the Airflow core maintainers as necessary for the v2.x era and subsequent graduation from “incubation” status within the Apache Foundation.
A very desirable feature (at least from users of Airflow that we work with) is first class support for an API to control everything from connection creation to DAG pausing to requesting usage metrics. Right now, the API is strictly experimental with functionality being added regularly, so, in practice, if you want to create this behavior, you end up writing a plugin that manipulates the underlying MySQL or PostgreSQL. Given that much of the current functionality in the UI is based on direct modification of the database and not via any API, the inclusion of a first-class API that handles all functionality would mean that everything done in the UI could also be done in the CLI, further expanding the use cases Airflow could facilitate.
As mentioned above in relation to the Kubernetes Executor, the most significant long-term push in the project is to make Airflow cloud native. Today it is still up to the user to figure out how to operationalize Airflow for Kubernetes, AWS Batch, or other cloud-based infrastructure. At Astronomer, we have done this and provide it in a dockerized package for our customers.
We feel this is an important step for the project to keep up with the changing deployment landscape and we plan to open-source what we can as we go, but knocking this out is easier said than done. One of the most fundamental problems blocking this initiative is the need for a high-availability, massively-distributed, and auto-rebalancing datastore, something that is hard to do with a simple postgresql or mysql.
A promising lead towards addressing this is added support for CockroachDB, a database following the Google Spanner whitepaper (and founded by former Google File System engineers) and designed precisely for the features listed above.
A common complaint among contributors to Airflow is the long time that it can take for Travis, the CI of choice for the Airflow project, to run all the tests when cutting a new release. This has been brought up in the past but given Airflow’s codebase has hit a scale where it can take up to an hour for Travis to run, we see this test suite finally making it over the line (and are looking forward to helping!). One factor to help in this process is the break out of plugins (which has been growing and is a large codebase in and of itself). Which brings us to…
Before we talk about this, it’s important to note that this is NOT on the official Airflow roadmap (at least not yet) but is rather something that the Astronomer team has been mulling around as we see the continued proliferation of plugins.
The brilliance of airflow plugins (and why they have contributed in no small part to the success of the entire project) is how wide-ranging they can be. They can enable your workflows to connect with GCP, AWS, and Hadoop ecosystems as well as any number of other APIs and databases rather trivially.
Ironically, this is also their weakness. Importing Google Cloud plugins and opening the door to extra processing and dependency conflicts makes zero sense if your stack is on AWS. The inherent brittleness of plugins that have to interact with constantly changing APIs by their very nature require a different release schedule from the core project, which is a slower procedure as any error could affect core functionality.
All plugins should be on their own release schedule with an independent testing suite to make sure that all updates take advantage of the latest changes in external projects. Getting this to be as easy as a pip install will be huge for making Airflow more available to other systems.
As we look toward the next few years of our roadmap, we’re doubling down on our community contributions to help Airflow retain its status as the most flexible, extensible, and reliable scheduler available, regardless of how it’s being run. In Astronomer speak, we’d recommend hanging onto your helmets – the ship is about to kick into hyperdrive.
Which future Airflow developments are you and your network excited about? Join our Slack channel to get in on the conversation and keep up-to-date with the latest news and get support from the community.
This is a guest blog post by Pete DeJoy. Pete is a Product Specialist at Astronomer, where he helps companies adopt Airflow.
At intermix.io, we use Amazon Redshift as part of our stack. Amazon Redshift is an OLAP database, and a valuable tool for data teams due to its low cost and speed for analytical queries. We have a particular use case though. We’re using Amazon Redshift in an OLTP scenario, i.e. we’ve built an analytical application on top of Redshift.
The challenge of using Redshift as an OLTP is that queries can lack the low-latency that would exist on a traditional RDBMS and transactional queries. This is a result of the column-oriented data storage design of Amazon Redshift which makes the tradeoff to perform better for big data analytical workloads. This comes from eliminating the need to traverse full records (all columns) for each query. Also, there are features of a traditional RDBMS that would often be useful when querying Redshift (e.g. indices).
In this post we detail a method of using Amazon Redshift as an OLTP. We will tell the story of how we off-load OLTP workloads from Amazon Redshift to Amazon RDS. We were able to do this without writing any complex ETL or changing our data processing strategy.
intermix.io is an analytics platform that provides a single monitoring dashboard for data engineers to keep an eye on their critical data flows. We surface information for data engineers using Amazon Redshift like how to optimize warehouse resources with actionable recommendations.
For example, our “Cluster Health” dashboard gives you a very quick overview of SLA measures for any connected app, down to the individual users as well as actionable recommendations on how to improve performance and optimize costs.
If you’d like to instantly see personalized recommendations for your cluster, you can start a 14-day free trial on this link.
There are three steps for acquiring the data we need to provide our analytics to our customers:
We use Amazon Redshift for:
serving data to our product dashboard (this is the piece that was giving us trouble)
In the early stage of intermix.io, our dashboard pages would pull data from Redshift directly. But as we added more and more customers, we realized the dashboard pages were not performing well. This was due to two factors:
We did use the best practices to manage our Redshift cluster, including the proper set-up of our workload management (WLM) and all 15 performance tuning techniques. But we also knew that using Redshift for this OLTP use case — serving data to the dashboard – was not going to work long term. So we started looking for solutions.
Our ideal solution would:
Amazon Redshift is a columnar-oriented petabyte-scale data warehouse. It is an OLAP database suitable for analytical queries and applications. It uses a modified version of PostgreSQL 8.
The solution was to use Amazon RDS as a ‘cache’ by leveraging the Dblink feature of RDS. This allows us to make any Amazon Redshift table available as a relation in RDS, without the need to explicitly copy that data over. Dblink handles moving the data at the block level.
The benefits of this solution included:
Before we go over the installation instructions, we’ll cover some of the features that using Dblink with Redshift provides.
The main features presented in this post require a Postgres instance with Postgres version >= 9.4 (some features listed in this section require more recent versions of Postgres). We use an RDS-hosted version of Postgres but other types of instances will work as long as they meet the minimum requirements.
For installing on Redshift, perform the following commands:
GRANT SELECT ON all TABLES IN SCHEMA data to <amazon_redshift_username>;
Enable dblink and postgres_fdw in the Postgres RDS instance (only needs to be done once per database).
CREATE EXTENSION postgres_fdw;
CREATE EXTENSION dblink;
Create the server link and user mapping.
CREATE SERVER redshift_server FOREIGN DATA WRAPPER redshift_fdw
OPTIONS (host '<amazon_redshift_ip>', port '<port>', dbname '<database_name>', sslmode 'require');
CREATE USER MAPPING FOR <rds_postgresql_username> SERVER redshift_server
OPTIONS (user '<amazon_redshift_username>', password '<password>');
You are now ready to query your Redshift data in RDS.
This approach allows you to query Redshift while inside of a Postgres instance. It bridges data but doesn’t confer any performance benefits. For example, to get a list of users that were updated this year.
FROM dblink('redshift_server', $REDSHIFT$
WHERE updated_on >= '2018-01-01';
$REDSHIFT$) AS t1 (id int);
To create a queryable cache of the Redshift data, we create a materialized view. You can also select a subset of rows and a subset of columns if you don’t require the full table.
CREATE MATERIALIZED VIEW mv_users AS
FROM dblink('redshift_server', $REDSHIFT$
SELECT id, updated_on
$REDSHIFT$) AS t1 (id int, updated_on timestamp);
A materialized view will not self-update. So we need an approach to refresh the data. If you only wish to execute live queries (no materialized views) then you can ignore this section.
Easier and faster but with concurrent reads blocked.
REFRESH MATERIALIZED VIEW mv_users;
This will block concurrent reads, so if you need to be able to always read from the view follow the next approach.
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_users;
This requires a unique index to work. For example, after creating the view:
CREATE UNIQUE INDEX mv_user_id_updated_on_idx ON data.mv_users (id, updated_on);
If the data you retrieve takes more than 5 minutes to transfer or you don’t mind refreshing synchronously, you can issue the REFRESH commands above inside your code or in a periodic task as needed.
If retrieving the data from Redshift takes fewer than 5 minutes, AWS’s Lambda is a good approach to use here. Be mindful of concurrency, for example it’s bad practice to issue multiple concurrent refreshes on a single table while another refresh is currently happening on the same table. We’ll have more about how we approach this in a subsequent blog post.
The above refresh methods query the full set of data each time (even if nothing has changed), if you don’t have updates on your data you may want to use the following approach instead:
Dblink ingestion incurs regular data transfer costs, hence if you don’t want to be charged for the network traffic of synchronizing the data please have your Postgres RDS instance in the same AZ as your Redshift leader node.
You need to ensure that the user permissions for the users querying data via Dblink matches that of the same users querying directly on Redshift, so that additional permissions aren’t granted by querying Dblink.
Amazon Redshift supports identity-based policies (IAM policies). We recommend using IAM identities (e.g. a user, a group, a role) to manage cluster access for users vs. creating direct logins in your cluster. That’s because nobody ever keeps track of those logins. For example, at intermix.io we use IAM to generate temporary passwords for our clusters.
In the AWS console, keep an eye on the following Postgres RDS metrics to ensure that the instance isn’t overwhelmed:
And then, of course, we use our own intermix.io dashboard to monitor the performance of our Amazon Redshift clusters and our data pipelines. A few select key metrics we pay attention to are:
For example, in image 2 you can see a distribution of all queries for a certain time interval. We can see how much memory each query consumes, if it’s disk-based or not, and what the total query return time (execution time + queue time) is.
Not only that but if you click on a single query we surface every single detail about it as well as actionable insights based on your cluster about how to improve performance instantly.
The total on-disk size limit for Postgres on RDS is 16TB, so the data you link cannot be greater than that (such a transfer would take at least 3.6 hours on a 10Gbit connection). The total table size on Redshift can be greater than 16TB but your materialized view query cannot select a quantity of rows and columns combined with index overhead that would exceed 16TB.
A current limitation of the DNS resolving on RDS is such that lookups don’t resolve to private IPs for DNS queries performed inside of Dblink. Hence if you’re running in a VPC without publicly routable Redshift then you’ll need to setup your own DNS to resolve to the private IP or hardcode Redshift Leader node private IPs when provisioning Dblink.
With the approach in this post, we showed you how it’s possible to use Amazon Redshift in an OLTP scenario. By using Amazon RDS as a ‘cache’ and leveraging its Dblink feature, we can make any Amazon Redshift table available as a relation in RDS, without the need to explicitly copy that data over.
So if your using Amazon Redshift in an OLTP scenario for building data apps, and you want to monitor the performance of those data apps – schedule a call with us. We’d love to look at your set-up, and help you with a free trial of intermix.io to tune your Amazon Redshift cluster.
Monitoring your data apps with app tracing gives you better control and can help train new hires to use tools. Amazon Redshift training, for example, can become much easier when you use Intermix.io Data Tracing. If you don’t know how monitoring data apps benefit your organization and tools, the following article will give you a few ideas.
App Tracing surfaces important information about how apps & users interact with your data. It can help answer questions like:
Data apps typically fall into one or more of three categories:
Intermix can fulfill all three of these functions. It integrates with your favorite tools to show you which end-users connect to your data warehouse. It comes ready to integrate with popular tools like:
Whether you want to connect your data to more BI tools or you want insights that lead to better Amazon Redshift training, Intermix can help.
Intermix App Tracing thrives on visualization and analysis. If you want to know how much data an app uses, just click on the icon to get a graph that shows a timeline of the app’s data usage.
You can also get visualization analysis to learn:
Visualization makes information easier for everyone to understand. Whether you have years of experience in IT or you just want an overview of your data use, Intermix’s App Tracing gives you a straightforward approach to view information.
App Tracing requires the data app to annotate the executed SQL with a comment. The comment encodes metadata about the application which submitted this query.
Intermix.io will automatically index all data contained in the annotation, and make it accessible as first-class labels in our system. I.e. for Discover searches, Saved Searches, and aggregations in the Throughput Analysis page.
Out of the box, we support:
In the below example, a query spike in WLM 3 causes a bottleneck in query latency. The result is that queries which would otherwise take 13-14 seconds to execute are stuck in the queue for
> greater than 3 minutes.
App Tracing detects that the majority of these queries are from Looker. How do you know which user is causing this?
Click on the chart, and a widget will pinpoint the specific Looker user(s) who ran those queries. In this example, we see that user 248 is responsible.
Armed with this information, you can now:
See all the activity for this user by heading to Discover and use the new ‘App’ filter to search for Looker user 248.
To set up an alarm to get email notifications, save that search and stream the following metrics to CloudWatch:
The following Slack conversation took place the morning we soft-launched app tracing in June 2018:
Intermix.io makes Amazon Redshit more powerful. We’ve seen clients use Intermix.io to:
If you’re using Amazon Redshift in combination with Apache Airflow, and you’re trying to monitor your DAGs – we’d love to talk! We’re running a private beta for a new Airflow plug-in with a few select customers. Go ahead and click on the chat widget on the bottom right of this window. Answer three simple questions, schedule a call, and then mention “Airflow” at the end and we’ll get you set up! As a bonus, we’ll throw in an extended trial of 4 weeks instead of 2!