Have Your Postgres Cake with Amazon Redshift and eat it, too.


Introduction

At intermix.io, we use Amazon Redshift as part of our stack. Amazon Redshift is an OLAP database, and a valuable tool for data teams due to its low cost and speed for analytical queries. We have a particular use case though. We’re using Amazon Redshift in an OLTP scenario, i.e. we’ve built an analytical application on top of Redshift.

The challenge of using Redshift as an OLTP is that queries can lack the low-latency that would exist on a traditional RDBMS and transactional queries. This is a result of the column-oriented data storage design of Amazon Redshift which makes the tradeoff to perform better for big data analytical workloads. This comes from eliminating the need to traverse full records (all columns) for each query. Also, there are features of a traditional RDBMS that would often be useful when querying Redshift (e.g. indices).

In this post we detail a method of using Amazon Redshift as an OLTP. We will tell the story of how we off-load OLTP workloads from Amazon Redshift to Amazon RDS. We were able to do this without writing any complex ETL or changing our data processing strategy.

How We Use Amazon Redshift

intermix.io is an analytics platform that provides a single monitoring dashboard for data engineers to keep an eye on their critical data flows.

Image 1: Simplified architecture of Amazon Redshift in an OLTP scenario

There are three steps to acquiring the data we need to provide analytics to our customers:

1. We extract raw event data from the customer databases via the intermix.io Collector. The Collector runs as a Docker container in the customer’s environment.
2. We move the data into the intermix.io S3 environment.
3. From S3, we copy data into Redshift.
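
Step 3 uses Redshift's standard COPY command. A minimal sketch of what that load looks like, with hypothetical table, bucket, and IAM role names:

```sql
-- Load raw event data from S3 into Redshift (all names are hypothetical)
COPY raw_events
FROM 's3://intermix-example-bucket/events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS JSON 'auto';
```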

We use Amazon Redshift as the primary store for this event data and to serve the analytical queries behind our dashboards.


Issues

In the early days of intermix.io, our dashboard pages pulled data directly from Redshift. But as we added more and more customers, we realized the dashboard pages were not performing well, for the reasons described above: the latency of analytical queries, and the lack of features like indices.

We followed best practices in managing our Redshift cluster, including a proper workload management (WLM) set-up and all 14 performance tuning techniques. But we also knew that using Redshift for this OLTP use case, serving data to the dashboard, was not going to work long term. So we started looking for solutions.

Our ideal solution would serve dashboard queries with low latency, without requiring complex ETL or changes to our data processing strategy.


What is Amazon Redshift?

Amazon Redshift is a column-oriented, petabyte-scale data warehouse. It is an OLAP database suitable for analytical queries and applications. It uses a modified version of PostgreSQL 8.0.2.

Our Approach

The solution was to use Amazon RDS as a 'cache' by leveraging the dblink extension, which RDS supports. This allows us to make any Amazon Redshift table available as a relation in RDS, without the need to explicitly copy that data over. Dblink handles moving the data at the block level.

This solution gave us the low-latency, indexed reads of Postgres for our dashboard, without writing any complex ETL and without changing our data processing strategy.

The Magic: 6 Steps

Before we go over the installation instructions, we’ll cover some of the features that using Dblink with Redshift provides.

The features presented in this post require a Postgres instance running version 9.4 or later (some require more recent versions of Postgres). We use an RDS-hosted Postgres, but other types of instances will work as long as they meet the minimum version requirement.


Step 1: Setup on Redshift cluster

Now on to the installation, starting with the Redshift cluster.
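
A minimal sketch of that setup, assuming a dedicated read-only user for dblink to connect as (all names and passwords are hypothetical). You will also need to allow inbound connections from your RDS instance in the Redshift cluster's security group.

```sql
-- Create a dedicated, read-only user for dblink connections (hypothetical names)
CREATE USER dblink_user PASSWORD '<a-strong-password>';
GRANT USAGE ON SCHEMA public TO dblink_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO dblink_user;
```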

Step 2: Setup on Postgres RDS instance

Enable dblink and postgres_fdw in the Postgres RDS instance (only needs to be done once per database).
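
Both extensions ship with standard Postgres and are available on RDS:

```sql
-- One-time, per-database setup on the RDS Postgres instance
CREATE EXTENSION IF NOT EXISTS dblink;
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
```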

Create the server link and user mapping.
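
A minimal sketch, with a hypothetical server name and placeholder endpoint and credentials:

```sql
-- Foreign server pointing at the Redshift cluster
CREATE SERVER redshift_server
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'example-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com',
             port '5439', dbname 'mydb');

-- Map the local Postgres user to the Redshift user created in Step 1
CREATE USER MAPPING FOR CURRENT_USER
    SERVER redshift_server
    OPTIONS (user 'dblink_user', password '<password>');
```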

That's it: now you're ready to query your Redshift data in RDS!

Step 3: Run some queries

This approach lets you query Redshift from inside a Postgres instance. It bridges the data but doesn't by itself confer any performance benefit. For example, here is how to get a list of users that were updated this year.
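
A sketch of such a live query, assuming a hypothetical users table on the Redshift side; note that dblink requires you to declare the row type of the result:

```sql
-- The inner query runs on Redshift; the outer query runs locally on Postgres
SELECT *
FROM dblink('redshift_server', $REDSHIFT$
    SELECT id, username, updated_at
    FROM users
    WHERE updated_at >= date_trunc('year', current_date)
$REDSHIFT$) AS t(id int, username varchar, updated_at timestamp);
```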

Step 4: Create a cached view into the data

To create a queryable cache of the Redshift data, we create a materialized view. You can also select a subset of rows and a subset of columns if you don’t require the full table.
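
For example, a cached copy of the same hypothetical users table:

```sql
-- Materialize the Redshift data locally so dashboard queries never leave Postgres
CREATE MATERIALIZED VIEW users_cache AS
SELECT *
FROM dblink('redshift_server', $REDSHIFT$
    SELECT id, username, updated_at
    FROM users
$REDSHIFT$) AS t(id int, username varchar, updated_at timestamp);
```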

Step 5: Keep it up to date

A materialized view will not self-update. So we need an approach to refresh the data. If you only wish to execute live queries (no materialized views) then you can ignore this section.

Refresh approach 1 (non-concurrent)
Easier and faster but with concurrent reads blocked.
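
Using the hypothetical users_cache view from Step 4:

```sql
REFRESH MATERIALIZED VIEW users_cache;
```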

This will block concurrent reads, so if you need to always be able to read from the view, follow the next approach.

Refresh approach 2 (concurrent)
Non-blocking refreshes (this requires a unique index on the materialized view).
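
The concurrent variant of the same refresh:

```sql
REFRESH MATERIALIZED VIEW CONCURRENTLY users_cache;
```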

This requires a unique index to work. For example, after creating the view:
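
A sketch, assuming id uniquely identifies each row of the hypothetical users_cache view:

```sql
CREATE UNIQUE INDEX users_cache_id_idx ON users_cache (id);
```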


Periodic refresh
If the data you retrieve takes more than 5 minutes to transfer, or you don't mind refreshing synchronously, you can issue the REFRESH commands above from your own code or a periodic task, as needed.
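
For example, a cron entry along these lines would refresh the cache every 10 minutes (the schedule and connection string are hypothetical):

```
*/10 * * * * psql "$RDS_DSN" -c "REFRESH MATERIALIZED VIEW CONCURRENTLY users_cache;"
```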

Refresh Lambdas
If retrieving the data from Redshift takes fewer than 5 minutes, AWS Lambda is a good fit, since 5 minutes is Lambda's maximum execution time. Be mindful of concurrency: it's bad practice to issue a refresh on a table while another refresh of the same table is still running. We'll have more on how we approach this in a subsequent blog post.

Incremental refreshes
The refresh methods above query the full set of data each time, even if nothing has changed. If your data is insert-only (no updates), you may want to use the following incremental approach instead:
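
A minimal sketch of an append-only incremental load, assuming the source table never receives updates and that users_cache here is a plain table rather than the materialized view from Step 4; your refresh job would substitute in the high-water-mark timestamp of the last load:

```sql
-- Append only the rows added since the last load (timestamp placeholder is hypothetical)
INSERT INTO users_cache
SELECT *
FROM dblink('redshift_server', $REDSHIFT$
    SELECT id, username, updated_at
    FROM users
    WHERE updated_at > '<last-loaded-timestamp>'
$REDSHIFT$) AS t(id int, username varchar, updated_at timestamp);
```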


Step 6: Operations and monitoring

Dblink traffic incurs regular data transfer costs. To avoid being charged for the network traffic of synchronizing the data, place your Postgres RDS instance in the same AZ as your Redshift leader node.

Make sure that the permissions of users querying data via Dblink match those of the same users querying Redshift directly, so that querying through Dblink doesn't grant additional permissions.

In the AWS console, keep an eye on the core Postgres RDS metrics, such as CPU utilization, freeable memory, free storage space, read/write IOPS, and database connections, to ensure that the instance isn't overwhelmed.

And then of course we use our own intermix.io dashboard to monitor the performance of our Amazon Redshift clusters and our data pipelines. Key metrics we pay attention to include each query's memory footprint, whether queries go disk-based, and total query return time.

Image 2: intermix.io dashboard – query memory footprint

For example, image 2 shows a distribution of all queries for a certain time interval. We can see how much memory each query consumes, whether or not it is disk-based, and what its total return time (execution time plus queue time) is.

Caveats and limits

The total on-disk size limit for Postgres on RDS is 16TB, so the data you link cannot exceed that (a 16TB transfer would take at least 3.6 hours over a 10 Gbit connection). The total table size on Redshift can be greater than 16TB, but the rows and columns your materialized view selects, combined with index overhead, must stay within the 16TB limit.

A current limitation of DNS resolution on RDS is that lookups performed inside Dblink don't resolve to private IPs. If your Redshift cluster runs in a VPC and isn't publicly routable, you'll need to set up your own DNS to resolve the endpoint to the private IP, or hardcode the Redshift leader node's private IP when provisioning Dblink.

Wrapping up

In this post, we showed how it's possible to use Amazon Redshift in an OLTP scenario: by using Amazon RDS as a 'cache' and leveraging its dblink feature, you can make any Amazon Redshift table available as a relation in RDS, without the need to explicitly copy that data over.

So if you're using Amazon Redshift in an OLTP scenario for building data apps, and you want to monitor the performance of those data apps, schedule a call with us. We'd love to look at your set-up and help you with a free trial of intermix.io to tune your Amazon Redshift cluster.
