Start Now Login
Have Your Postgres Cake with Amazon Redshift and eat it, too.

Have Your Postgres Cake with Amazon Redshift and eat it, too.

Introduction

At intermix.io, we use Amazon Redshift as part of our stack. Amazon Redshift is an OLAP database, and a valuable tool for data teams due to its low cost and speed for analytical queries. We have a particular use case though. We’re using Amazon Redshift in an OLTP scenario, i.e. we’ve built an analytical application on top of Redshift.

The challenge of using Redshift as an OLTP is that queries can lack the low-latency that would exist on a traditional RDBMS and transactional queries. This is a result of the column-oriented data storage design of Amazon Redshift which makes the tradeoff to perform better for big data analytical workloads. This comes from eliminating the need to traverse full records (all columns) for each query. Also, there are features of a traditional RDBMS that would often be useful when querying Redshift (e.g. indices).

In this post we detail a method of using Amazon Redshift as an OLTP. We will tell the story of how we off-load OLTP workloads from Amazon Redshift to Amazon RDS. We were able to do this without writing any complex ETL or changing our data processing strategy.

How We Use Amazon Redshift

intermix.io is an analytics platform that provides a single monitoring dashboard for data engineers to keep an eye on their critical data flows.

Image 1: Simplified architecture of Amazon Redshift in an OLTP scenario

There are three steps for acquiring the data we need to provide our analytics to our customers.

1. We extract raw event data from the customer databases via the intermix.io Collector. The Collector runs as a Docker container in the customer’s environment.
2. We move the data into the intermix.io S3 environment.
3. From S3, we copy data into Redshift.

We use Amazon Redshift for:

Issues

In the early stage of intermix.io, our dashboard pages would pull data from Redshift directly. But as we added more and more customers, we realized the dashboard pages were not performing well. This was due to two factors:

We did use the best practices to manage our Redshift cluster, including the proper set-up of our workload management (WLM) and all 14 performance tuning techniques. But we also knew that using Redshift for this OLTP use case — serving data to the dashboard – was not going to work long term. So we started looking for solutions.

Our ideal solution would:

What is Amazon Redshift?

Amazon Redshift is a columnar-oriented petabyte-scale data warehouse. It is an OLAP database suitable for analytical queries and applications.  It uses a modified version of PostgreSQL 8.

Our Approach

The solution was to use Amazon RDS as a ‘cache’ by leveraging the Dblink feature of RDS. This allows us to make any Amazon Redshift table available as a relation in RDS, without the need to explicitly copy that data over. Dblink handles moving the data at the block level.

This solution had a bunch of benefits:

The Magic – 6 Steps

Before we go over the installation instructions, we’ll cover some of the features that using Dblink with Redshift provides.

The main features presented in this post require a Postgres instance with Postgres version >= 9.4 (some features listed in this section require more recent versions of Postgres). We use an RDS-hosted version of Postgres but other types of instances will work as long as they meet the minimum requirements.

Step 1: Setup on Redshift cluster

Now onto the installation. For installing on Redshift, perform the following commands:

Step 2: Setup on Postgres RDS instance

Enable dblink and postgres_fdw in the Postgres RDS instance (only needs to be done once per database).

Create the server link and user mapping.

That’s it, now you’re ready to query your Redshift data in RDS!

Step 3: Run some queries

This approach allows you to query Redshift while inside of a Postgres instance. It bridges data but doesn’t confer any performance benefits. For example, to get a list of users that were updated this year.

Step 4: Create a cached view into the data

To create a queryable cache of the Redshift data, we create a materialized view. You can also select a subset of rows and a subset of columns if you don’t require the full table.

Step 5: Keep it up to date

A materialized view will not self-update. So we need an approach to refresh the data. If you only wish to execute live queries (no materialized views) then you can ignore this section.

Refresh approach 1 (non-concurrent)
Easier and faster but with concurrent reads blocked.

This will block concurrent reads, so if you need to be able to always read from the view follow the next approach.

Refresh approach 2 (concurrent)
Non-blocking refreshes (this requires a unique index on the materialized view).

This requires a unique index to work. For example, after creating the view:

Periodic refresh
If the data you retrieve takes more than 5 minutes to transfer or you don’t mind refreshing synchronously, you can issue the REFRESH commands above inside your code or in a periodic task as needed.

Refresh Lambdas
If retrieving the data from Redshift takes fewer than 5 minutes, AWS’s Lambda is a good approach to use here. Be mindful of concurrency, for example it’s bad practice to issue multiple concurrent refreshes on a single table while another refresh is currently happening on the same table. We’ll have more about how we approach this in a subsequent blog post.

Incremental refreshes
The above refresh methods query the full set of data each time (even if nothing has changed), if you don’t have updates on your data you may want to use the following approach instead:

Step 6: Operations and monitoring

Dblink ingestion incurs regular data transfer costs, hence if you don’t want to be charged for the network traffic of synchronizing the data please have your Postgres RDS instance in the same AZ as your Redshift leader node.

You need to ensure that the user permissions for the users querying data via Dblink matches that of the same users querying directly on Redshift, so that additional permissions aren’t granted by querying Dblink.

In the AWS console, keep an eye on the following Postgres RDS metrics to ensure that the instance isn’t overwhelmed:

And then of course we use our own intermix.io dashboard to monitor the performance of our Amazon Redshift clusters and our data pipelines. A few select key metrics we pay attention to are:

Image 2: intermix.io dashboard – query memory footprint

For example, in image 2 you can see a distribution of all queries for a certain time interval. We can see how much memory each query consumes, if it’s disk-based or not, and what the total query return time (execution time + queue time) is.

Caveats and limits

The total on-disk size limit for Postgres on RDS is 16TB, so the data you link cannot be greater than that (such a transfer would take at least 3.6 hours on a 10Gbit connection). The total table size on Redshift can be greater than 16TB but your materialized view query cannot select a quantity of rows and columns combined with index overhead that would exceed 16TB.

A current limitation of the DNS resolving on RDS is such that lookups don’t resolve to private IPs for DNS queries performed inside of Dblink. Hence if you’re running in a VPC without publicly routable Redshift then you’ll need to setup your own DNS to resolve to the private IP or hardcode Redshift Leader node private IPs when provisioning Dblink.

Wrapping up

With the approach in this post, we showed you how it’s possible to use Amazon Redshift in an OLTP scenario. By using Amazon RDS as a ‘cache’ and leveraging its Dblink feature, we can make any Amazon Redshift table available as a relation in RDS, without the need to explicitly copy that data over.

So if your using Amazon Redshift in an OLTP scenario for building data apps, and you want to monitor the performance of those data apps – schedule a call with us. We’d love to look at your set-up, and help you with a free trial of intermix.io to tune your Amazon Redshift cluster.

Related content
3 Things to Avoid When Setting Up an Amazon Redshift Cluster Apache Spark vs. Amazon Redshift: Which is better for big data? Amazon Redshift Spectrum: Diving into the Data Lake! What Causes "Serializable Isolation Violation Errors" in Amazon Redshift? A Quick Guide to Using Short Query Acceleration and WLM for Amazon Redshift for Faster Queries What is TensorFlow? An Intro to The Most Popular Machine Learning Framework Titans of Data with Mirko Novakovic - How Containers are Giving Rise to New Data Services Why We Built intermix.io - “APM for Data” 4 Simple Steps To Set-up Your WLM in Amazon Redshift For Better Workload Scalability World-class Data Engineering with Amazon Redshift - Training Announcing App Tracing - Monitoring Your Data Apps With intermix.io 4 Real World Use Cases for Amazon Redshift 3 Steps for Fixing Slow Looker Dashboards with Amazon Redshift Zero Downtime Elasticsearch Migrations Titans of Data with Florian Leibert – CEO Mesosphere Improve Amazon Redshift COPY performance:  Don’t ANALYZE on every COPY Building a Better Data Pipeline - The Importance of Being Idempotent The Future of Machine Learning in the Browser with TensorFlow.js Gradient Boosting Libraries — A Comparison Crowdsourcing Weather Data With Amazon Redshift The Future of Apache Airflow Announcing Query Groups – Intelligent Query Classification Top 14 Performance Tuning Techniques for Amazon Redshift Product Update: An Easy Way To Find The Cause of Disk Usage Spikes in Amazon Redshift How We Reduced Our Amazon Redshift Cost by 28%
Ready to start seeing into your data infrastructure?
Get started with a 14-day free trial, with access to the full platform

No Credit Card Required