Amazon Redshift Best Practices for Designing Queries
As a data warehouse, Amazon Redshift makes analyzing petabytes of data quick, streamlined, and cost-effective, no matter how widely your data is spread. With Amazon Redshift, you can enjoy over 10x the performance of other data warehouses thanks to the combination of massively parallel processing (MPP), machine learning (ML), and SSD-based columnar storage.
However, even with the extreme power that Amazon Redshift offers, it’s still possible to experience hurdles and setbacks when scaling your workloads. That’s why following the best practices for designing queries in Amazon Redshift is essential to making the most of the platform’s potential.
The Best Practices of Designing Queries
Amazon Redshift has laid out some very helpful guidelines that will improve the performance of your Redshift account. Below, we’ll review the best practices that specifically apply to designing queries.
Follow The Best Practices of Table Design
If you want to maximize query performance within Amazon Redshift, the first recommendation you should follow is to utilize the best table design practices. This ensures that your queries have a solid foundation, allowing them to perform at their best.
In brief, the best practices for table design include:
- Selecting the best sort key
- Choosing the best distribution style
- Using COPY to apply compression encodings automatically
- Defining constraints for your primary key and foreign key
- Using the smallest column size possible
- Using date/time data types for columns containing dates
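As an illustrative sketch of how those practices combine, the table definition below (column names and types are hypothetical, loosely modeled on the TICKIT sample schema) uses a date/time sort key, an explicit distribution key, right-sized columns, and declared key constraints:

```sql
-- Hypothetical table illustrating the practices above.
create table sales (
    salesid   integer      not null,
    listid    integer      not null distkey,   -- distribute on the common join key
    qtysold   smallint     not null,           -- smallest column size that fits
    pricepaid decimal(8,2),
    saletime  timestamp    not null sortkey,   -- date/time type, not char/varchar
    primary key (salesid),
    foreign key (listid) references listing (listid)
);
```

Compression encodings can then be applied automatically when the table is first loaded with COPY.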
Avoid Unnecessarily Using The “All Columns” Query
Using the command SELECT * is a common practice in Amazon Redshift, but the platform advises against it. This command returns every column of the tables you're querying. To ensure the best possible performance, Amazon suggests querying only the columns you actually need.
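A minimal illustration, using the TICKIT-style column names that appear elsewhere in this article:

```sql
-- Instead of returning every column:
--   select * from sales;

-- Request only the columns the analysis actually needs:
select salesid, qtysold, pricepaid
from sales;
```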
Use CASE Expressions
The CASE expression can be used anywhere expressions are valid. It’s a generic, conditional expression similar to IF/THEN/ELSE statements found in other development languages. Amazon Redshift suggests that you use a CASE expression when performing complex aggregations rather than selecting multiple times from the same table.
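As a sketch of that pattern (the price bands here are arbitrary), the query below makes a single pass over SALES and lets CASE split the aggregation, rather than selecting twice from the same table with different filters:

```sql
-- One scan of SALES; CASE routes each row into the right aggregate.
select
    sum(case when pricepaid < 100 then qtysold else 0 end) as low_price_qty,
    sum(case when pricepaid >= 100 then qtysold else 0 end) as high_price_qty
from sales;
```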
Avoid Cross-Joins Unless Critical
Unless they’re absolutely necessary to the query you’re trying to perform, Amazon Redshift recommends against using cross-joins. Joins without a join condition result in the Cartesian product (the set that contains all ordered pairs) of both tables.
Typically, a cross-join is executed as a nested-loop join, and it’s the slowest type of join to perform. Therefore, you should explore other join types that perform faster to design efficient queries.
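To illustrate, assuming SALES carries a listid column matching LISTING's key:

```sql
-- Cross-join: no join condition, so every SALES row is paired with every
-- LISTING row (a Cartesian product, executed as a nested-loop join):
--   select * from sales, listing;

-- Preferred: an equality join condition lets Redshift choose a faster
-- join type, such as a hash join or merge join.
select sales.salesid, listing.listtime
from sales
join listing on sales.listid = listing.listid;
```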
Use Sub-Queries Under The Right Conditions
Sub-queries are helpful, particularly when one of the tables you are querying is used only for predicate conditions. They are also a good fit when the sub-query returns fewer than about 200 rows. For example, the following query uses a sub-query instead of joining the LISTING table:
```sql
select sum(sales.qtysold)
from sales
where salesid in (select listid from listing where listtime > '2020-01-14');
```
Restrict Your Datasets As Much As Possible
You should always restrict the dataset in your queries to the greatest degree possible. You can do this with predicates. When writing a predicate, make sure the least expensive operator is being used. For example, you're better off using a comparison condition operator than a LIKE operator; LIKE, in turn, is preferable to the SIMILAR TO and POSIX operators.
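A sketch of that relative cost ordering, using TICKIT-style table and column names:

```sql
-- Cheapest: a simple comparison condition on the column.
select count(*) from listing where listtime > '2020-01-14';

-- More expensive: a LIKE pattern match.
select count(*) from event where eventname like 'Ham%';

-- Most expensive: SIMILAR TO or POSIX regular-expression matching.
select count(*) from event where eventname similar to '%(Ham|Opera)%';
```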
Use a WHERE Clause When Possible
When restricting your dataset, another best practice is to use a WHERE clause whenever possible. The query planner can then use row order to determine which blocks contain matching records, allowing it to skip over many disk blocks. Without a WHERE clause, the execution engine must scan all participating columns in their entirety to complete your query.
Avoid Functions In Your Predicates
When using predicates to restrict the dataset for your query, avoid applying functions to the predicate columns. Doing so drives up the query's cost, because the function must be evaluated for a larger number of rows and can prevent the planner from using the sort order to skip blocks.
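For example, assuming SALES is sorted on saletime, the two predicates below select the same rows, but only the second can take advantage of the sort order:

```sql
-- The function must be evaluated for every row, and the planner
-- cannot use saletime's sort order to skip blocks:
--   select count(*) from sales where date_trunc('day', saletime) = '2020-01-14';

-- A plain range predicate on the sorted column is much cheaper:
select count(*)
from sales
where saletime >= '2020-01-14' and saletime < '2020-01-15';
```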
Use Predicates to Filter Tables for Joins
When tables participate in joins, you should use predicates to filter those tables, even if the predicates apply redundant filters. In the end, your query will return the same result, but Amazon Redshift will filter the join tables before scanning them, which allows it to skip unnecessary blocks efficiently. However, if you filter a column that’s part of the join condition, there’s no need to use this redundant filtering approach.
Here’s an example: let’s say you want to join SALES and LISTING to locate sales of listings published after January 1, 2020, grouping the results by seller. Both tables are sorted by date, so the following query joins the two tables (SALES and LISTING) using the common key and filters on listing.listtime, pulling out values greater than January 1, 2020:
```sql
select listing.sellerid, sum(sales.qtysold)
from sales, listing
where sales.salesid = listing.listid
  and listing.listtime > '2020-01-01'
group by 1 order by 1;
```
In this example, the WHERE clause includes no predicate for sales.saletime, forcing the query execution engine to scan the entire SALES table. Adding that filter is a good idea if you know it will result in fewer rows participating in the join condition. The following version substantially reduces execution time:
```sql
select listing.sellerid, sum(sales.qtysold)
from sales, listing
where sales.salesid = listing.listid
  and listing.listtime > '2020-01-01'
  and sales.saletime > '2020-01-01'
group by 1 order by 1;
```
Utilize Sort Keys in The GROUP BY Clause
When using a GROUP BY clause, make use of sort keys to enable the query planner to aggregate more efficiently. If your GROUP BY list uses only sort key columns (including the distribution key), your query may be able to use one-phase aggregation.
You can confirm that one-phase aggregation is utilized for a query by using the EXPLAIN command. One-phase aggregation will be marked by “XN GroupAggregate” in the query’s aggregation step.
When building your GROUP BY list, you must include the first sort key, followed by any other sort keys you want to use, in sort-key order. For instance, you can use the first, second, and third sort keys, but not the first and third sort keys.
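As a sketch, for a hypothetical table t with the compound sort key (a, b, c):

```sql
-- Eligible for one-phase aggregation: an in-order prefix of the sort key
-- (assuming the distribution key is among these columns).
select a, b, count(*) from t group by a, b;

-- Not eligible: the GROUP BY list skips the second sort key.
select a, c, count(*) from t group by a, c;
```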
Match Column Order Across Clauses
If your query uses both the GROUP BY and ORDER BY clauses, make sure the columns appear in the same order in both. The first example below follows this practice; the second forces an additional sort step:
group by a, b, c order by a, b, c
group by c, b, a order by a, b, c
Querying Spatial Data
Spatial data describes the position and shape of a geometry within a defined space, known as a spatial reference system. The GEOMETRY data type lets you store spatial data in Amazon Redshift, optionally along with a spatial reference system identifier (SRID). As of now, Redshift supports only two-dimensional (2D) geometry data.
Examples of spatial data include map directions, store locations, and shipping routes. This data can play a major role in a business’ reporting and forecasting. To correctly and efficiently query spatial data in Amazon Redshift, you’ll need to use SQL functions.
With spatial data, you can create queries that will:
- Calculate the distance between two points.
- Check whether a given area contains another.
- Check if a given linestring intersects another linestring or area.
Redshift's spatial SQL functions support the following geometry representations:
- Well-known text (WKT)
- Extended well-known text (EWKT)
- Well-known binary (WKB) representation
- Extended well-known binary (EWKB)
The SQL functions you can use to query your spatial data are plentiful. However, they all require valid geometry (except for the ST_IsValid function); otherwise, the function's behavior is undefined.
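A brief sketch of the operations listed above, using small hand-made geometries in well-known text (the coordinates are arbitrary example values):

```sql
-- Distance between two points (Euclidean, in coordinate units): 5.
select ST_Distance(
    ST_GeomFromText('POINT(0 0)'),
    ST_GeomFromText('POINT(3 4)')
);

-- Does the polygon contain the point?
select ST_Contains(
    ST_GeomFromText('POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))'),
    ST_GeomFromText('POINT(5 5)')
);

-- Do two linestrings intersect?
select ST_Intersects(
    ST_GeomFromText('LINESTRING(0 0, 10 10)'),
    ST_GeomFromText('LINESTRING(0 10, 10 0)')
);
```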
Querying Data with Federated Queries
With the use of federated queries, Amazon Redshift enables you to analyze data across your databases, data warehouses, and data lakes.
A federated query also enables you to integrate live data from external databases: Amazon RDS for PostgreSQL, Amazon RDS for MySQL, and Amazon Aurora with PostgreSQL or MySQL compatibility. You might use federated queries for a number of applications, including circumstances where you need to incorporate live data into BI initiatives.
To make data ingestion to Redshift simpler, you could use federated queries to directly query your operational databases; to quickly apply transformations to your data; or to load data into target tables without ETL pipelines.
When running a federated query, Amazon Redshift pushes part of the computation down to the remote operational databases. This reduces data movement across the network, thereby improving your query's performance. Redshift also uses its parallel processing capacity to support federated queries where appropriate.
The steps of using a federated query vary depending upon whether you use PostgreSQL or MySQL. Here’s an example using PostgreSQL, which first requires you to:
- Create external schemas referencing your databases/data warehouses/data lakes. Set the schema to public.
- Show the row count of each data source, then create a view of the tables in each. Show the count of total views and set a predicate to limit your results.
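For PostgreSQL, the external schema from the first step might be created like this (the endpoint, IAM role, and secret ARN below are placeholders you would replace with your own values):

```sql
create external schema apg
from postgres
database 'dev'
schema 'public'
uri 'your-rds-endpoint.example.us-west-2.rds.amazonaws.com'
port 5432
iam_role 'arn:aws:iam::123456789012:role/myFederatedQueryRole'
secret_arn 'arn:aws:secretsmanager:us-west-2:123456789012:secret:myRdsSecret';
```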
Once you follow those steps, you could run a federated query like this:
```sql
SELECT extract(year from l_shipdate) as year,
       extract(month from l_shipdate) as month,
       count(*) as orders
FROM lineitem_all
WHERE extract(month from l_shipdate) = 1
  AND l_quantity < 2
GROUP BY 1, 2
ORDER BY 1, 2;

 year | month | orders
------+-------+---------
 1992 |     1 |  196019
 1993 |     1 | 1582034
 1994 |     1 | 1583181
 1995 |     1 | 1583919
 1996 |     1 | 1583622
 1997 |     1 | 1586541
 1998 |     1 | 1583198
 2016 |     1 |   15542
 2017 |     1 |   15414
 2018 |     1 |   15527
 2019 |     1 |     151
```
Amazon Redshift has a Getting Started guide for both <a rel="noreferrer noopener" href="https://docs.aws.amazon.com/redshift/latest/dg/getting-started-federated.html" target="_blank">PostgreSQL</a> and MySQL.
Querying External Data Using Redshift Spectrum
With Amazon Redshift Spectrum, you can query both structured and semistructured data from your Amazon S3 files without the need to load the data into your Redshift tables.
Spectrum queries use parallelism at massive scale to ensure fast executions even with the largest datasets. The processing primarily occurs in the Spectrum layer while the majority of the data stays in Amazon S3 the whole time. When using Spectrum, you can query the same dataset concurrently in Amazon S3 without needing to copy your data for every cluster.
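Before querying, you register an external schema and define an external table over the files in S3. A sketch of the two steps (the catalog database name, IAM role ARN, and S3 path are placeholders):

```sql
-- External schema backed by the AWS Glue Data Catalog.
create external schema spectrum
from data catalog
database 'spectrumdb'
iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole'
create external database if not exists;

-- External table over tab-delimited text files in S3.
create external table spectrum.sales (
    salesid   integer,
    pricepaid decimal(8,2),
    saletime  timestamp
)
row format delimited fields terminated by '\t'
stored as textfile
location 's3://your-bucket/tickit/spectrum/sales/';
```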
To improve the performance of your Spectrum queries, you should first review the query plan to see what steps have been pushed to the Spectrum layer. They may include S3 Seq Scan, S3 HashAggregate, S3 Query Scan, Seq Scan PartitionInfo, and/or Partition Loop.
Below is an example of a query plan for a query that joins a local table with an external table. This query uses both the S3 Seq Scan and S3 HashAggregate steps:
```
QUERY PLAN
-----------------------------------------------------------------------------
XN Limit  (cost=1001055770628.63..1001055770628.65 rows=10 width=31)
  ->  XN Merge  (cost=1001055770628.63..1001055770629.13 rows=200 width=31)
        Merge Key: sum(sales.derived_col2)
        ->  XN Network  (cost=1001055770628.63..1001055770629.13 rows=200 width=31)
              Send to leader
              ->  XN Sort  (cost=1001055770628.63..1001055770629.13 rows=200 width=31)
                    Sort Key: sum(sales.derived_col2)
                    ->  XN HashAggregate  (cost=1055770620.49..1055770620.99 rows=200 width=31)
                          ->  XN Hash Join DS_BCAST_INNER  (cost=3119.97..1055769620.49 rows=200000 width=31)
                                Hash Cond: ("outer".derived_col1 = "inner".eventid)
                                ->  XN S3 Query Scan sales  (cost=3010.00..5010.50 rows=200000 width=31)
                                      ->  S3 HashAggregate  (cost=3010.00..3010.50 rows=200000 width=16)
                                            ->  S3 Seq Scan spectrum.sales location:"s3://awssampledbuswest2/tickit/spectrum/sales" format:TEXT  (cost=0.00..2150.00 rows=172000 width=16)
                                                  Filter: (pricepaid > 30.00)
                                ->  XN Hash  (cost=87.98..87.98 rows=8798 width=4)
                                      ->  XN Seq Scan on event  (cost=0.00..87.98 rows=8798 width=4)
```
To improve the performance of this example query, Amazon Redshift suggests:
- Reduce the number of columns included in your queries.
- Optimize parallel processing by using multiple files of 64 MB or larger. Keep files around the same size.
- Keep smaller, frequently used tables in your Amazon Redshift Database.
- Keep larger fact tables in your Amazon S3 account.
- Push processing to the Spectrum layer as often as possible.
- Whenever possible, include filters and aggregations in your queries that can be pushed to Spectrum.
- Use Apache Parquet to format your data files. Parquet will store the data in a columnar format, allowing Spectrum to ignore unnecessary columns while scanning. If your data is in text format, Spectrum will have to scan everything.
- Limit the data that is scanned by using partitions.
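The last two suggestions can be combined: storing the external table as Parquet and partitioning it by a commonly filtered column. A sketch with placeholder names and paths:

```sql
create external table spectrum.sales_part (
    salesid   integer,
    pricepaid decimal(8,2)
)
partitioned by (saledate date)
stored as parquet
location 's3://your-bucket/tickit/sales_partitioned/';

-- Register one partition; queries filtering on saledate = '2020-01-01'
-- scan only this partition's files.
alter table spectrum.sales_part
add partition (saledate = '2020-01-01')
location 's3://your-bucket/tickit/sales_partitioned/saledate=2020-01-01/';
```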
Amazon Redshift specifies which operations can be pushed to the Spectrum layer and which ones can’t.
Build High-Performance Data Products
Following the best practices for querying data in Amazon Redshift will help you make the most out of your processing capabilities. Of course, creating exceptional data products takes more than good query practices.
With Intermix, you can build better data products by capturing metadata directly from your data warehouse and all of the tools that tap into it. Track only what you care to analyze, gaining insight into user engagement, cost, and performance. Interested in learning more? Start for free and see what Intermix can do for you.