Building a Better Data Pipeline – The Importance of Being Idempotent
At a glance, batch data processing seems simple. Pull data from a source, apply some business logic to it, and load it for later use. When done well, automating these jobs is a huge win. It saves time and empowers decision-makers with fresh and accurate data. But this kind of ETL (Extract, Transform, and Load) process gets very tricky, very quickly. A failure or bug in a single step of an ETL process has cascading effects which can require hours of manual intervention and cleanup. These kinds of issues can be a gigantic time sink for the analysts or engineers that have to clean up the mess. They negatively impact data quality. They erode end-user confidence. And they can make teams terrified of pushing out even the most innocuous changes to a data pipeline. To avoid falling into this kind of ETL hell, analysts and engineers need to make data pipelines that:
- Never load duplicate data
- Load no data to the final destination when a step has failed.
- Clean up after themselves
In short, they need to make data pipelines that are idempotent. Idempotent is a mathematical term describing an operation which can be applied arbitrarily many times without changing the result. Idempotent data pipelines are fault-tolerant, reliable, and easy to troubleshoot. And they’re much simpler to build than you might expect. Here are some simple steps you can take to make your data pipeline idempotent.
Load Data in Isolated Batches
The easiest way to be sure that a buggy or failed ETL run doesn’t dirty up your data is to isolate the data that you deal with in a given run. If you’re working with time-series data this is fairly straightforward. When pulling data from your initial source, simply filter it by time. Operate on an hour, a day, or a month, depending on the volume of your data. By operating on data in time-windowed batches like this, you can be confident that any bugs or failures aren’t impacting your data quality beyond a single window. If you’re working with cross-sectional data, try to deal with some explicitly-defined subset of that data in a given run. Knowing exactly what data has been impacted by a bug or failure makes clean up and remediation of data pipeline issues relatively straightforward. Just delete the data that has been impacted from your final destination and then run again. And if you isolate your data correctly, you never have to worry about duplicated or double-counted data points.
In order to avoid time-consuming cleanup of failed or buggy data pipeline runs, it’s important to build your pipeline in such a way that it either loads everything or it loads nothing at all. That is to say, it should be atomic. If you’re loading into a SQL database, achieving atomicity for your data pipeline is as simple as wrapping all of your loading statements–inserts, copies, creates, or updates–in a transaction with a BEGIN statement at the beginning and an END or COMMIT statement at the end. If the final destination for your data is not a SQL database–say an Elasticsearch cluster or an Amazon S3 bucket–building atomicity into your pipeline is a little more tricky. But it’s not impossible. You’ll simply need to implement your own “rollback” logic within the load step. For instance, if you’re writing CSV files to an Amazon S3 bucket from a python script, maintain a list of the resource identifiers for the files you’ve written. Wrap your loading logic in a try-except-finally statement which deletes all of the objects you’ve written to if the script errors out. By making your data pipeline atomic, you completely obviate the need for time-consuming cleanup after failed runs.
Now that you’ve explicitly isolated the data your pipeline deals with in a given run and implemented transactions, having your pipeline clean up after itself is a breeze. You just need to delete before you insert. This is the most efficient way to avoid loading duplicate data. Say your data pipeline is loading all of your company’s sales data from the previous month into a SQL database. Before you load it, delete any data from last month that’s already in the database. Be sure to do this within the same transaction as your load statement. This may seem scary. But if you’ve implemented your transaction correctly, your pipeline will never delete data without also replacing it. Many SQL dialects have native support for this type of operation, which is typically referred to as an upsert. For instance in PostgreSQL (which is also the SQL dialect used by Amazon Redshift) you could prevent loading duplicate data with the following:
INSERT INTO sales (employee, customer, contract_number, annual_revenue, date)
VALUES ('Janet', 5439, 50000, '2018-07-20')
ON CONFLICT (contract_number)
DO UPDATE SET employee = excluded.employee;
This SQL statement will attempt to insert a new row into the database. If a row already exists with contract number 5439, the existing row will be updated with the new employee name. In this way, duplicate rows for the same contract will never be loaded into the database. By having your pipeline clean up after itself and prevent the loading of duplicate data, backfilling data becomes as simple as rerunning. If you discover a bug in your business logic or you want to add a new dimension to your data, you can push out a change and backfill your data without any manual intervention whatsoever.
ETL pipelines can be fragile, difficult to troubleshoot, and labor-intensive to maintain. They’re inevitably going to have bugs. They’re going to fail every once in a while. The key to building a stable, reliable, fault-tolerant data pipeline is not to build it so that it always runs perfectly. The key is to build it in such a way that recovering from bugs and failures is quick and easy. The best data pipelines can recover from bugs and failures without ever requiring manual intervention. By applying the three simple steps outlined above, you’re well on your way to achieving a more fault-tolerant, idempotent data pipeline.