Here at intermix, we’re always working on new features that make your life as a data engineer easier when working with Amazon Redshift. For example, “disk full” errors in Amazon Redshift are an all-too-common performance issue—even when you thought you had plenty of space to spare in your Redshift deployment. That’s why we’ve made it easier than ever to use intermix to find the cause of Amazon Redshift disk usage spikes.
The problem of Redshift disk usage spikes is so common that Amazon has released its own guide, “How can I troubleshoot high or full disk usage with Amazon Redshift?” This page features 9 different possible causes of Redshift disk usage spikes and “disk full” errors, including query processing issues, tables with VARCHAR(MAX) columns, copying large files, and more. But what about the particular situation when you encounter a disk usage spike when running a Redshift query?
If you encounter a Redshift “disk full” error when running a query, one thing’s for sure: you’ve run out of space on one or more of the nodes in your Redshift cluster. Disk usage spikes during queries stem from one of two issues: either the query uses so much memory that intermediate results spill to disk, or your data has simply grown too large for the disks on the cluster.
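If you want a quick read on disk usage yourself before digging deeper, one common approach (sketched here against Redshift’s stv_partitions system table, which is visible only to superusers) is to aggregate used and total capacity per node:

```sql
-- Approximate disk usage per node; stv_partitions reports
-- capacity and used in 1 MB blocks.
SELECT owner AS node,
       SUM(capacity) / 1024 AS capacity_gb,
       SUM(used) / 1024     AS used_gb,
       ROUND(100.0 * SUM(used) / SUM(capacity), 1) AS pct_used
FROM stv_partitions
GROUP BY owner
ORDER BY owner;
```

A node sitting near 100% here is the likely source of “disk full” errors.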
But what if you don’t have the time or knowledge to troubleshoot the possible causes yourself? That’s exactly where intermix comes in.
In your intermix dashboard, go to the Storage Analysis tab. (Don’t have an account yet? Sign up here for a free 2-week trial.) Next, click anywhere on the Schema chart to get the list of tables for that exact time. The tables are sorted by growth, so the ones which are causing the disk usage spike will show up at the top of the list.
And that’s it! Once you’ve isolated the problematic table(s), you can take steps to curb the issue, such as removing extraneous data or adding another node to the cluster.
Since we’ve introduced this feature to intermix, we’ve made a few changes in response to best practices and user feedback:
As always, we’d love to hear your feedback, so please get in touch over chat or email. If you aren’t yet an intermix customer, sign up for your free trial and start optimizing your Redshift deployment today. And for the latest in data engineering news, check out our weekly newsletter “SF Data Weekly” with over 6,000 subscribers.
Amazon Redshift offers an attractive feature that can help organizations manage their hosting bill. It’s called concurrency scaling, and according to Amazon, it “automatically and elastically scales query processing power to provide consistently fast performance for hundreds of concurrent queries.”
Before concurrency scaling, Redshift users faced a familiar dilemma – dealing with peak demand. There were two options:
Concurrency scaling adds resources to your Redshift cluster on an on-demand basis, adding processing power during peak time and withdrawing it in quieter moments.
In terms of pricing, concurrency scaling works on a credit system that should make it free for most users. Amazon allows you to earn one free hour of scaling for every 24 hours of main Redshift cluster usage, and these credits accrue over time. Any usage outside of your credits gets billed on a per-second basis according to your Redshift agreement.
Concurrency scaling makes financial sense, but can it offer consistent service? Let’s find out.
There are three eligibility requirements for concurrency scaling. Your Redshift cluster must be:
This means that single-node clusters are not eligible. Also, note that the cluster must have had fewer than 32 nodes at creation. If your cluster originally had 50 nodes and you scale down to 32, you’re still not eligible for concurrency scaling.
Concurrency scaling does not work on all query types. For the first release, it handles read-only queries that meet three conditions:
To be routed to a concurrency scaling cluster, a query first needs to encounter queueing. Also, queries eligible for the SQA (Short Query Acceleration) queue will not run on concurrency scaling clusters.
Queuing and SQA are a function of a proper set-up of Redshift’s workload management (WLM). We recommend first optimizing your WLM because it will reduce the need for concurrency scaling. And that matters because, while AWS claims that concurrency scaling will be free for 97% of customers, you could face an additional usage charge if you exceed your credits.
We’ve also tested enabling Redshift’s automatic WLM and captured our experience with it in this blog post, “Should I Enable Amazon Redshift’s Automatic WLM?“
Concurrency scaling is enabled on a per-WLM queue basis. Go to the AWS Redshift Console and click on “Workload Management” from the left-side navigation menu. Select your cluster’s WLM parameter group from the subsequent pull-down menu.
You should see a new column called “Concurrency Scaling Mode” next to each queue. The default is ‘off’. Click ‘Edit’ and you’ll be able to modify the settings for each queue.
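For reference, the same setting lives in the queue definitions of the cluster’s wlm_json_configuration parameter. Below is a minimal sketch with scaling enabled on one queue; the query group name and slot counts are illustrative:

```json
[
  {
    "query_group": ["dashboard"],
    "query_concurrency": 5,
    "concurrency_scaling": "auto"
  },
  {
    "query_concurrency": 5,
    "concurrency_scaling": "off"
  }
]
```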
Concurrency scaling works by routing eligible queries to new, dedicated clusters. The new clusters have the same size (node type and number) as the main cluster.
The number of clusters used for concurrency scaling defaults to one (1), with the option to configure up to ten (10) total clusters.
The total number of clusters that should be used for concurrency scaling can be set by the parameter max_concurrency_scaling_clusters. Increasing the value of this parameter provisions additional standby clusters.
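If you manage parameter groups with the AWS CLI instead of the console, the same parameter can be changed with modify-cluster-parameter-group (the parameter group name below is a placeholder):

```shell
aws redshift modify-cluster-parameter-group \
    --parameter-group-name my-wlm-params \
    --parameters ParameterName=max_concurrency_scaling_clusters,ParameterValue=3
```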
There are a few additional charts in the AWS Redshift console. There is a chart called “Max Configured Concurrency Scaling Clusters” which plots the value of max_concurrency_scaling_clusters over time.
The number of Active Scaling clusters is also shown in the UI under Concurrency Scaling Activity:
The Queries tab in the UI also has a column to show if the query ran on the Main cluster or on the Concurrency Scaling cluster:
Whether a particular query ran on the main cluster or via a concurrency scaling cluster is stored in stl_query.concurrency_scaling_status.
A value of 1 means the query ran on a Concurrency Scaling cluster, and other values mean it ran on the main cluster.
redshiftcluster_2=# select distinct concurrency_scaling_status, count(*)
from stl_query
where endtime < '2019-03-29 15:00:00'
group by concurrency_scaling_status;

 concurrency_scaling_status | count
----------------------------+--------
                          2 |     21
                          0 | 310790
                          4 |  19818
                          6 |  69082
                         11 |      7
                          3 | 853546
                          8 | 228977
Concurrency scaling info is also stored in some other tables and views, such as SVCS_CONCURRENCY_SCALING_USAGE.
The following views have similar information as the corresponding STL views or SVL views:
These views work in the same way as their STL or SVL equivalents.
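For example, to tally how much concurrency scaling time the cluster has actually consumed, you can aggregate SVCS_CONCURRENCY_SCALING_USAGE by day (a sketch; verify the column list on your cluster, since system views evolve):

```sql
-- Total concurrency scaling usage per day, in seconds
SELECT TRUNC(start_time) AS day,
       SUM(usage_in_seconds) AS scaling_seconds
FROM svcs_concurrency_scaling_usage
GROUP BY TRUNC(start_time)
ORDER BY day;
```

This is also a handy cross-check against the free one-hour-of-scaling-per-24-hours credit accrual.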
We enabled concurrency scaling for a single queue on an internal cluster at approximately 2019-03-29 18:30:00 GMT. We changed the max_concurrency_scaling_clusters parameter to 3 at approximately 2019-03-29 20:30:00.
To simulate query queuing, we lowered the number of slots for the queue from 15 to 5.
Below is a chart from the intermix.io dashboard, showing the running versus queuing queries for this queue, after cranking down the number of slots.
We observed that queueing time for queries went up, peaking at over 5 minutes.
Here’s the corresponding summary in the AWS console of what happened during that time:
Redshift spun up three (3) concurrency scaling clusters as requested. It appears that these clusters were not fully utilized, even though our cluster had many queries that were queuing.
The usage chart correlates closely with the scaling activity chart:
After a few hours, we checked, and it looked like 6 queries had run with concurrency scaling. We also spot-checked two of them against the UI. We haven’t checked how this value behaves when multiple concurrency scaling clusters are active.
redshiftcluster_2=# select distinct concurrency_scaling_status, count(*)
from stl_query
where endtime > '2019-03-29 18:30:00'
group by concurrency_scaling_status;

 concurrency_scaling_status | count
----------------------------+-------
                          4 |   108
                          6 |   333
                          1 |     6
                          0 |   913
                          3 |  4495
                          8 |   304
Concurrency scaling may mitigate queue times during bursts in queries.
From this basic test, it appears that a portion of our query load improved as a result. However, simply enabling concurrency scaling didn’t fix all of our concurrency problems. The limited impact is likely due to the limitations on the types of queries that can use concurrency scaling. For example, we have a lot of tables with interleaved sort keys, and much of our workload is writes.
While concurrency scaling doesn’t appear to be a silver-bullet solution for WLM tuning in all cases, the feature is transparent and easy to use. You can start with a single concurrency scaling cluster, then monitor peak load via the console to determine whether the new clusters are being fully utilized.
Though it may not have lived up to the automatic solution advertised, concurrency scaling should become more effective over time as AWS adds more features and support. We strongly recommend enabling the feature on your WLM queues.
Here at Intermix, we’re constantly building products and innovating for Amazon Redshift users, and we’d like to think we have our finger on the pulse of what Redshift customers are saying. The number one driver of change that we’ve seen for our clients is that they’re experiencing huge growth in data and query volumes.
A 2016 report by IDG found that the average company now manages 163 terabytes (163,000 gigabytes) of information—and the figures have undoubtedly only increased since the survey was released. Of course, having more data at hand also means that you need higher query volumes to extract all of the insights they contain.
We’ve observed three main drivers behind the growth in data and query volumes among our customers:
When building data pipelines with Amazon Redshift, all of these growth factors make it quite challenging to monitor and detect anomalies across all of the queries running on the system. For example, a lot of our clients use Looker as their tool for data exploration, and one of the most common complaints with the platform is slow Looker dashboards.
One way of approaching this problem is to follow best practices for setting up your workload management in Amazon Redshift the right way. You get a more granular view of your users and the resources they use, and the WLM gives you the power to isolate them from each other.
But even after optimizing your WLM configuration, you’re still lacking critical insights. For example:
Until now, it’s been quite hard to surface all of this information in a single place. That’s why we built a way to intuitively and powerfully visualize these insights, helping to make your life as a data engineer easier when working with Amazon Redshift.
The new feature is called “Query Insights,” and it’s available immediately in your Intermix dashboard. Below, we’ll check out some examples of using the Query Insights feature in Intermix.
In this example, we’ll show how you can use Query Insights to quickly identify that a huge spike in query volume happened, as well as the specific queries that caused it.
The Intermix dashboard clearly shows that there has been a query count spike at 8:59 p.m. on January 24. In the left panel in the Intermix user interface, click on “query group” to segment the chart by query group. Click on the chart to reveal a slide-in list of query groups, sorted in descending order of Count (i.e. the number of queries). Finally, click on the “View Queries” icon to jump to the Query Details list, where you can see exactly which queries are causing the problem.
In this example, we’ll show you how to use Query Insights to find the cause of an overall increase in latency.
Click on the graph icon in the main Query Insights dashboard to group the data by app. This reveals that the “intermix_collector” app is experiencing a large increase in latency. Next, click on the “intermix_collector” app so that you can group by a custom attribute.
This app has two custom attributes, “dag” and “task”, that appear on the left navigation panel. This is because these queries are tagged using the Intermix annotation format. First group by “dag”, and then by “task”, to isolate the specific task that is seeing a spike in latency. Finally, group by “query group” to see the specific queries that are causing a problem.
In this example, let’s monitor our Looker PDT queries to find the queries which are the slowest.
First, click on the “Looker PDT” tag in the dashboard in order to only view data for Looker PDT queries. As you can see in the left navigation panel, the Looker PDT app has a few custom attributes. Click on “model” to group the queries by model name. We immediately see that the “supply chain” model is the slowest. We can click on it and then group by “query group” to find the individual queries causing a problem.
Query Insights is a tremendously valuable tool in your Redshift toolkit, but we’re only getting started. Keep your eyes open for a new feature “Transfer Insights” soon, which will allow you to monitor the users and apps that are loading data and rows into your Amazon Redshift cluster.
Want to try Query Insights out for yourself? Take the opportunity to get a personalized tour of Query Insights, as well as see what’s next on our product roadmap and provide your feature requests and feedback. Please find a time on my calendar at this link, or sign up today for a free trial of Intermix.
Amazon Redshift is a robust, enterprise-class cloud data warehouse—but that doesn’t mean it’s always the most user-friendly solution. Getting the most out of your Redshift deployment can be a challenge at the best of times (which is one reason why we wrote our article “15 Performance Tuning Techniques for Amazon Redshift”).
In particular, Redshift queries can cause performance issues and mysterious slowdowns if you don’t know how to optimize them. But without a dedicated way to analyze Redshift query performance, how can you hope to isolate the problem?
That’s why we’ve introduced the Query Groups feature to Intermix, our performance monitoring solution for Amazon Redshift. Query Groups is a powerful asset that intelligently classifies and ranks query workloads on your cluster. Using Query Groups, you can answer questions like:
Using Intermix, the queries in a Query Group are grouped together using a proprietary algorithm, and ranked by volume, execution time, and queue time. More metrics might be added in the future depending on user demand and necessity. All of the queries in a Query Group share a SQL structure and operate on the same tables.
At 8:17 a.m. on August 7, the below cluster experienced an eightfold spike in queries—wow! Typically, this type of event is caused by a handful of new queries that suddenly increased in volume. But how do you find which queries, and who ran them?
intermix.io’s Query Groups feature can quickly determine which of your queries are responsible.
First, click on the new Query Groups page in the left navigation panel. By default, Query Groups are sorted by Rank. In this case, we want to re-sort by “Rank Change,” which will order the list of Query Groups by the “fastest movers.” In other words, this will help us quickly see the groups which have been moving up the ranks in the past week.
Sure enough, we see a handful of Query Groups which suddenly started running. Clicking into the first one, we can isolate the exact queries that are causing the problem.
You could also use the same procedure to determine the queries that underwent a spike in latency or queue time.
Query Groups in Redshift and Intermix offer the potential to make your Redshift performance tuning process radically more efficient—and we’re just getting started. We plan to expand the “grouping” concept in the future to add:
Sound like what you’ve been looking for? We’re here to assist. Sign up today for a free trial of the Intermix platform. Query Groups are just one of the ways that we’ve helped Redshift customers get the most from their cloud data warehouse.
At intermix.io, we use Amazon Redshift as part of our stack. Amazon Redshift is an OLAP database, and a valuable tool for data teams due to its low cost and speed for analytical queries. We have a particular use case though. We’re using Amazon Redshift in an OLTP scenario, i.e. we’ve built an analytical application on top of Redshift.
The challenge of using Redshift as an OLTP database is that queries can lack the low latency you’d get from a traditional RDBMS handling transactional queries. This is a result of Amazon Redshift’s column-oriented storage design, which trades away fast single-record access to perform better on big-data analytical workloads, since queries no longer need to traverse full records (all columns). Redshift also lacks features of a traditional RDBMS that would often be useful when querying it (e.g. indexes).
In this post we detail a method of using Amazon Redshift as an OLTP. We will tell the story of how we off-load OLTP workloads from Amazon Redshift to Amazon RDS. We were able to do this without writing any complex ETL or changing our data processing strategy.
intermix.io is an analytics platform that provides a single monitoring dashboard for data engineers to keep an eye on their critical data flows. For data engineers using Amazon Redshift, we surface actionable recommendations, such as how to optimize warehouse resources.
For example, our “Cluster Health” dashboard gives you a very quick overview of SLA measures for any connected app, down to individual users, along with actionable recommendations on how to improve performance and optimize costs.
If you’d like to instantly see personalized recommendations for your cluster, you can start a 14-day free trial on this link.
There are three steps for acquiring the data we need to provide our analytics to our customers:
We use Amazon Redshift for:
serving data to our product dashboard (this is the piece that was giving us trouble)
In the early stage of intermix.io, our dashboard pages would pull data from Redshift directly. But as we added more and more customers, we realized the dashboard pages were not performing well. This was due to two factors:
We did follow best practices to manage our Redshift cluster, including a proper set-up of our workload management (WLM) and all 15 performance tuning techniques. But we also knew that using Redshift for this OLTP use case (serving data to the dashboard) was not going to work long term. So we started looking for solutions.
Our ideal solution would:
Amazon Redshift is a column-oriented, petabyte-scale data warehouse. It is an OLAP database suitable for analytical queries and applications. It uses a modified version of PostgreSQL 8.
The solution was to use Amazon RDS as a ‘cache’ by leveraging the Dblink feature of RDS. This allows us to make any Amazon Redshift table available as a relation in RDS, without the need to explicitly copy that data over. Dblink handles moving the data at the block level.
The benefits of this solution included:
Before we go over the installation instructions, we’ll cover some of the features that using Dblink with Redshift provides.
The main features presented in this post require a Postgres instance with Postgres version >= 9.4 (some features listed in this section require more recent versions of Postgres). We use an RDS-hosted version of Postgres but other types of instances will work as long as they meet the minimum requirements.
To install, grant the connecting user read access on the Redshift side:
GRANT SELECT ON ALL TABLES IN SCHEMA data TO <amazon_redshift_username>;
Enable dblink and postgres_fdw in the Postgres RDS instance (only needs to be done once per database).
CREATE EXTENSION postgres_fdw;
CREATE EXTENSION dblink;
Create the server link and user mapping.
CREATE SERVER redshift_server FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host '<amazon_redshift_ip>', port '<port>', dbname '<database_name>', sslmode 'require');
CREATE USER MAPPING FOR <rds_postgresql_username> SERVER redshift_server
OPTIONS (user '<amazon_redshift_username>', password '<password>');
You are now ready to query your Redshift data in RDS.
This approach allows you to query Redshift from inside a Postgres instance. It bridges the data but doesn’t confer any performance benefits. For example, to get a list of users that were updated this year:
SELECT *
FROM dblink('redshift_server', $REDSHIFT$
    SELECT id
    FROM users
    WHERE updated_on >= '2018-01-01';
$REDSHIFT$) AS t1 (id int);
To create a queryable cache of the Redshift data, we create a materialized view. You can also select a subset of rows and a subset of columns if you don’t require the full table.
CREATE MATERIALIZED VIEW mv_users AS
SELECT *
FROM dblink('redshift_server', $REDSHIFT$
    SELECT id, updated_on
    FROM users
$REDSHIFT$) AS t1 (id int, updated_on timestamp);
A materialized view will not self-update. So we need an approach to refresh the data. If you only wish to execute live queries (no materialized views) then you can ignore this section.
This approach is easier and faster, but concurrent reads are blocked:
REFRESH MATERIALIZED VIEW mv_users;
This will block concurrent reads, so if you need to be able to always read from the view follow the next approach.
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_users;
This requires a unique index to work. For example, after creating the view:
CREATE UNIQUE INDEX mv_user_id_updated_on_idx ON data.mv_users (id, updated_on);
If the data you retrieve takes more than 5 minutes to transfer, or you don’t mind refreshing synchronously, you can issue the REFRESH commands above inside your code or from a periodic task as needed.
If retrieving the data from Redshift takes fewer than 5 minutes, AWS Lambda is a good fit here. Be mindful of concurrency: for example, it’s bad practice to issue a refresh on a table while another refresh of the same table is still running. We’ll have more on how we approach this in a subsequent blog post.
The refresh methods above query the full set of data each time, even if nothing has changed. If your data is insert-only (no updates to existing rows), you may want to use an incremental approach instead:
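One incremental pattern is to keep a plain cache table in RDS instead of a materialized view and pull only rows past a watermark. This is a sketch under the assumption of an insert-only users table; the cache table, its columns, and the hardcoded watermark (which you would normally interpolate from MAX(updated_on) in the cache) are illustrative:

```sql
-- Append only the rows newer than what the cache already holds.
-- Assumes data.users_cache exists with matching columns.
INSERT INTO data.users_cache (id, updated_on)
SELECT id, updated_on
FROM dblink('redshift_server', $REDSHIFT$
    SELECT id, updated_on
    FROM users
    WHERE updated_on > '2019-03-29 00:00:00'
$REDSHIFT$) AS t1 (id int, updated_on timestamp);
```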
Dblink ingestion incurs regular data transfer costs. If you don’t want to be charged for the network traffic of synchronizing the data, place your Postgres RDS instance in the same Availability Zone as your Redshift leader node.
Make sure that the permissions of users querying data via Dblink match those of the same users querying Redshift directly, so that querying through Dblink doesn’t grant any additional access.
Amazon Redshift supports identity-based policies (IAM policies). We recommend using IAM identities (e.g. a user, a group, a role) to manage cluster access instead of creating direct logins in your cluster, because nobody ever keeps track of those logins. For example, at intermix.io we use IAM to generate temporary passwords for our clusters.
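As a concrete example of the temporary-password approach, the AWS CLI can mint short-lived database credentials via GetClusterCredentials (the cluster identifier, user, and database below are placeholders):

```shell
aws redshift get-cluster-credentials \
    --cluster-identifier my-redshift-cluster \
    --db-user analyst \
    --db-name mydb \
    --duration-seconds 900
```

The returned DbPassword expires after the requested duration, so no long-lived credential needs to be stored.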
In the AWS console, keep an eye on the following Postgres RDS metrics to ensure that the instance isn’t overwhelmed:
And then, of course, we use our own intermix.io dashboard to monitor the performance of our Amazon Redshift clusters and our data pipelines. A few select key metrics we pay attention to are:
For example, in image 2 you can see a distribution of all queries for a certain time interval. We can see how much memory each query consumes, if it’s disk-based or not, and what the total query return time (execution time + queue time) is.
Not only that but if you click on a single query we surface every single detail about it as well as actionable insights based on your cluster about how to improve performance instantly.
The total on-disk size limit for Postgres on RDS is 16TB, so the data you link cannot be greater than that (such a transfer would take at least 3.6 hours on a 10Gbit connection). A table on Redshift can be larger than 16TB, but the rows and columns your materialized view selects, plus index overhead, cannot exceed 16TB.
A current limitation of DNS resolution on RDS is that lookups performed inside Dblink don’t resolve to private IPs. If you’re running in a VPC without a publicly routable Redshift cluster, you’ll need to set up your own DNS to resolve to the private IP, or hardcode the Redshift leader node’s private IP when provisioning Dblink.
With the approach in this post, we showed you how it’s possible to use Amazon Redshift in an OLTP scenario. By using Amazon RDS as a ‘cache’ and leveraging its Dblink feature, we can make any Amazon Redshift table available as a relation in RDS, without the need to explicitly copy that data over.
So if you’re using Amazon Redshift in an OLTP scenario to build data apps, and you want to monitor the performance of those apps, schedule a call with us. We’d love to look at your set-up and help you with a free trial of intermix.io to tune your Amazon Redshift cluster.