In the past few years, we’ve seen rapid adoption of a new data analytics stack, based on cloud warehouses like Amazon Redshift, Google BigQuery and Snowflake. This stack has three layers:
On AWS, the stack looks like this:
Companies using this new stack deal with three growth challenges:
Companies aim to make every department data driven. Data use cases typically fall into one of three categories:
For each of the above use cases, a company will need different applications, tools and people to interface with the data.
It’s critical that the raw data is consistent across this growing number of people and tools. If the data is not consistent, it becomes impossible to determine whether an error in the data is caused by a modeling problem or a raw data problem.
In response, companies invest in modeling data in the warehouse and delivering prepared data sets to analysts and data science teams. These prepared data sets are built by continuously running transformations on the data and validating the outputs. The goal is to get all your data into a single place that can store large volumes and run fast queries against them.
Modeling in the data warehouse requires building fast and robust data pipelines.
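As a sketch of what “modeling in the warehouse” looks like in practice, a prepared data set is often just a table rebuilt on a schedule from raw data. The schema and table names below are hypothetical:

```sql
-- Hypothetical prepared data set, rebuilt on a schedule by a pipeline job.
CREATE TABLE prepared.daily_signups AS
SELECT date_trunc('day', created_at) AS day,
       count(*) AS signups
FROM raw.users
GROUP BY 1;
```

A scheduler (Airflow, Talend, cron) would drop and recreate, or incrementally update, a table like this so analysts always query a consistent, prepared view of the raw data.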
Amazon Redshift is a great place to build data pipelines because of its fast performance and low cost of storage.
Performance management is critical, because slow queries impact data pipelines in the following ways:
Building data pipelines on Amazon Redshift is a matter of doing some thoughtful up-front planning and ongoing monitoring as your data volume, users and cluster grow.
Here are some tips we’ve learned from our experience helping customers with Amazon Redshift. For a more comprehensive list, please check out our “Top 14” blog post at this link.
The Amazon Redshift Workload Manager (WLM) is critical to managing query performance. Amazon Redshift runs queries in a queueing model. The default WLM configuration has a single queue with five slots. 99% of the time, this default configuration will not work for you, and you will need to tweak it.
Configuring the WLM for your workloads is critical to:
For more information on setting up Amazon Redshift WLM, see our recent blog post on the Automatic WLM feature.
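For illustration, a WLM configuration with separate queues for ETL and analyst workloads might look roughly like the fragment below. The queue layout, group names, and memory splits are assumptions for this example; verify the exact JSON keys against the `wlm_json_configuration` parameter documentation before applying anything:

```json
[
  {
    "user_group": ["etl_users"],
    "query_concurrency": 2,
    "memory_percent_to_use": 50
  },
  {
    "user_group": ["analysts"],
    "query_concurrency": 5,
    "memory_percent_to_use": 30
  },
  {
    "query_concurrency": 3,
    "memory_percent_to_use": 20
  }
]
```

The idea is to give long-running ETL queries a small number of slots with lots of memory each, give short analyst queries more concurrency, and leave a final catch-all queue as the default.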
Amazon Redshift prices are based on the size of your cluster, i.e. it couples compute and storage. You’ll have to keep adding nodes for storage, even though you may not need the additional computing power of the additional vCPUs.
With Redshift Spectrum, you can leave data as-is in your S3 data lake, and query it via Amazon Redshift. This approach makes sense when you have data that doesn’t require frequent access. Leave your “hot” data in Amazon Redshift, and your “cold” data in S3.
With Redshift Spectrum, you get the best of both worlds: a) keep all your historical data, along with the performance of Amazon Redshift and b) benefit from the cost savings of using S3. See a more detailed write-up on Amazon Spectrum at this link.
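As a sketch of the hot/cold split, cold data can be exposed to Redshift as an external table over S3. All names, the IAM role ARN, and the S3 path below are placeholders; the statements follow the documented `CREATE EXTERNAL SCHEMA` / `CREATE EXTERNAL TABLE` syntax:

```sql
-- Hypothetical names; assumes an IAM role with access to S3 and the Glue catalog.
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Cold events live in S3 as Parquet; Redshift only stores the table definition.
CREATE EXTERNAL TABLE spectrum_schema.events_cold (
  event_id BIGINT,
  event_ts TIMESTAMP,
  payload  VARCHAR(256)
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/events/cold/';
```

Once defined, queries can join `spectrum_schema.events_cold` against hot tables stored in the cluster as if it were a local table.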
Security and access controls should be first-class considerations for any data project. A common pattern is to separate the ‘prepared’ data sets from the ‘raw’ data sets. This can be done easily by creating one schema for each.
Here are some rules of thumb to implement separation of concerns for the various actors in the data warehouse:
Setting up your access controls in this way will prevent problems down the road.
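A minimal sketch of the schema separation described above, using hypothetical schema and group names, could look like this:

```sql
-- Hypothetical schema and group names for raw vs. prepared separation.
CREATE SCHEMA raw;
CREATE SCHEMA prepared;

CREATE GROUP etl_users;
CREATE GROUP analysts;

-- The ETL process can read and write raw data.
GRANT ALL ON SCHEMA raw TO GROUP etl_users;

-- Analysts only see the prepared data sets.
GRANT USAGE ON SCHEMA prepared TO GROUP analysts;
GRANT SELECT ON ALL TABLES IN SCHEMA prepared TO GROUP analysts;
```

With this layout, a modeling bug can be traced by comparing `prepared` against `raw`, and analysts can’t accidentally query or modify raw data.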
intermix.io helps data teams spot problems in data pipelines before they turn into fire drills. It’s a SaaS product that provides a single view of every tool, user, and query that touches the data in your Amazon Redshift cloud warehouse.
Intermix.io works by ingesting metadata & other logs from Amazon Redshift. Intermix.io can see all the SQL queries executed on the data warehouse, as well as metadata about storage and tables.
The intermix.io Talend integration works by embedding a comment block in the Talend SQL query before the SQL query is submitted to Amazon Redshift. The comment block contains metadata about the Talend job like job name, user name, and other attributes. intermix.io then introspects the SQL query to parse the comment block, and stores the query attributes.
The result is the ability to group query latency metrics by Talend job name and user – providing an easy way to spot issues in individual Talend jobs.
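The annotation approach described above can be sketched in a few lines: embed a metadata comment at the top of the SQL, then parse it back out when analyzing the query logs. The exact comment format intermix.io uses is not public, so the marker, attribute names, and helper functions here are illustrative assumptions:

```python
import json
import re

def annotate_query(sql, attributes):
    """Prepend a metadata comment block (hypothetical format) to a SQL query."""
    return "/* INTERMIX_META " + json.dumps(attributes) + " */\n" + sql

def parse_annotation(sql):
    """Recover the metadata dict from an annotated query, or None if absent."""
    match = re.match(r"/\* INTERMIX_META (.*?) \*/", sql)
    return json.loads(match.group(1)) if match else None

# Hypothetical Talend job attributes.
annotated = annotate_query(
    "SELECT count(*) FROM orders;",
    {"job_name": "load_orders", "user": "talend_etl"},
)
print(parse_annotation(annotated)["job_name"])  # load_orders
```

Because the comment survives into the warehouse’s query logs, latency metrics can later be grouped by `job_name` without changing how the query executes.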
With intermix.io Query Insights, find the individual queries that cause your Talend data pipelines to suffer. See latency and query throughput for all Talend jobs, and drill-down into the individual jobs and SQL queries.
Once you spot a query that is taking a long time, drill into it to get detailed, actionable recommendations on how to re-write it so it runs faster.
The intermix.io dashboard makes it easy to see the P95 SLA you are providing to users, and quickly drill into the users who are running expensive queries.
Improving SLA metrics requires a two-pronged approach:
By following the advice in this article, you will be able to respond to and fix problems fast. The result will be fewer support requests, and peace of mind that when you return to your desk on Monday morning, you won’t have to deal with any surprises and complaints.
We’ve assembled a number of useful resources for Amazon Redshift practitioners:
In this blog post, we are going to cover the difference between Amazon Redshift vs Amazon Redshift Spectrum vs Amazon Athena vs Amazon Aurora (you probably guessed that one…) along with a practical example of when you would use each of these technologies.
We’ll also give a high-level overview of database concepts. Let’s get into it.
When dealing with databases, you will often hear two terms:
Engineers love acronyms, and if you just keep “OLAP” and “OLTP” in mind, you can have a seat at the table and look smart.
Let’s use United Airlines as an example to describe what the two acronyms mean.
United Airlines sells tickets for their flights. When a customer purchases a ticket on United’s website or mobile app, United needs to keep a record of that purchase in their production database.
They will use an OLTP style database to store the record.
At the end of the month, the CFO at United wants to know “how many tickets did we sell this month?”. To get the answer, they could query the production database. However, those queries would put more load onto the production database, which would slow it down.
That’s not desirable, as a slow database will have a negative impact on selling tickets.
And so the answer is to use an analytics database or data warehouse that holds a replica of all transactions in the production database. The process of moving data from production to analytics is called “ETL” (extract, transform, load).
Once the data is in the data warehouse, you can go crazy with even the most complex queries (e.g. “how many tickets did we sell in France with our iOS App in the 2nd week of July, and which ones came from net new customers?”), without affecting the production database.
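To make the example concrete, here is a small sketch of that analytic query against a replicated tickets table. SQLite stands in for the warehouse, and the schema and rows are invented for illustration:

```python
import sqlite3

# In-memory stand-in for the analytics warehouse (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tickets (
        id INTEGER PRIMARY KEY,
        country TEXT, channel TEXT, sold_at TEXT, is_new_customer INTEGER
    )
""")
conn.executemany(
    "INSERT INTO tickets (country, channel, sold_at, is_new_customer) VALUES (?, ?, ?, ?)",
    [
        ("FR", "ios", "2019-07-09", 1),
        ("FR", "ios", "2019-07-10", 0),
        ("FR", "web", "2019-07-09", 1),
        ("US", "ios", "2019-07-09", 1),
    ],
)

# "How many tickets did we sell in France with our iOS app in the
#  2nd week of July, and how many came from net new customers?"
row = conn.execute("""
    SELECT count(*) AS total,
           sum(is_new_customer) AS net_new
    FROM tickets
    WHERE country = 'FR' AND channel = 'ios'
      AND sold_at BETWEEN '2019-07-08' AND '2019-07-14'
""").fetchone()
print(row)  # (2, 1)
```

The same query run against the production database would compete with ticket sales for resources; run against the replica, it is harmless.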
Back to our United Airlines example. Over time, the data volume adds up, and it gets expensive to store a copy of all historic ticket sales in the data warehouse. It’s also not very useful to have data from e.g. 3 or 5 years ago in your warehouse, when all you really care about is the last fiscal month, quarter and year.
And at that point, you move your data from your warehouse into your data lake, where storage is roughly 4x cheaper.
Now say there’s the occasional ask for historic data – “how many tickets did we sell 5 years ago in July, and how does that compare with July this year?”. And so you either move the data back into your warehouse to answer that question, or you use a query engine that can query the data directly in your data lake, and / or maybe join it with data in your warehouse.
A query engine doesn’t store data – it just completes the job of running the query. The “serverless” part means you don’t need to spin up a server to run the engine.
And with that background, we can now answer the question.
All four are Amazon AWS products, and we’ll add Amazon S3 into the mix for a complete answer.
In our United Airlines example, United would use:
In short, you end up with a stack where Aurora is your production database, S3 your data lake with a long term history of your production data, and then you have a choice of three AWS products to run analytics (Redshift, Redshift Spectrum, Athena) on top of that production data.
So that, of course, begs the question – when do you use which product for your analytics?
Which product to use is a function of the cost, complexity and execution speed for your workloads and queries.
Amazon Redshift excels at running complex analytic queries, joins and aggregations over large datasets because it leverages high-performance local disks, sophisticated query execution, and join-optimized data formats. Redshift is so powerful that you can run the “T” part of ETL within the warehouse, rather than in an external processing engine like Spark or Hadoop. In fact, it’s a shift from “ETL” to “ELT”, which is fodder for an entire blog post, but a good concept to keep in mind. If execution speed for queries and transformations is of the essence, then using Amazon Redshift is the way to go.
But you do have to load data into your warehouse, with a data engineer or DBA to manage your Redshift cluster against the storage and CPU requirements you need. And at its cheapest pricing (dense storage nodes with 3-year RIs), Redshift runs at ~$1,000 / TB / Year, about 4x as expensive as S3, which runs at about $250 / TB / Year.
That’s where Redshift Spectrum comes into play. With Redshift Spectrum, Amazon Redshift users can take advantage of inexpensive S3 storage and still scale out to pull, filter, aggregate, group and sort data.
Because Spectrum is serverless, there’s nothing to provision or manage. Pricing for Spectrum is $5 per terabyte of data scanned, and combined with cheaper S3 storage than Redshift, it can be a more cost-efficient proposition than storing data in a Redshift cluster only. You pay only for the queries you run against the data that you actually scan.
The trade-off is that Redshift Spectrum queries do run slower than queries in an Amazon Redshift cluster, mainly because of data movement between S3 and the cluster. But if you’re fine with that trade-off, and the priority is cost, then a combination of S3 / Spectrum is a great choice.
But you do have to watch the frequency at which your Spectrum queries run, and how much they cost you. If scanning your data with Spectrum costs more than just storing it in Redshift, it’s time to move the data back into the cluster.
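Using the rough figures cited above ($1,000/TB/year for Redshift dense storage, ~$250/TB/year for S3, $5/TB scanned for Spectrum), the break-even can be sketched as a back-of-the-envelope calculation. The numbers are the article’s approximations, not current AWS pricing:

```python
# Approximate figures from the text, not current AWS pricing.
REDSHIFT_PER_TB_YEAR = 1000    # dense storage nodes, 3-year RI
S3_PER_TB_YEAR = 250
SPECTRUM_PER_TB_SCANNED = 5

def annual_spectrum_cost(tb_stored, tb_scanned_per_year):
    """Yearly cost of keeping data in S3 and querying it via Spectrum."""
    return tb_stored * S3_PER_TB_YEAR + tb_scanned_per_year * SPECTRUM_PER_TB_SCANNED

def cheaper_in_redshift(tb_stored, tb_scanned_per_year):
    """True if storing the data in the Redshift cluster is the cheaper option."""
    return tb_stored * REDSHIFT_PER_TB_YEAR < annual_spectrum_cost(tb_stored, tb_scanned_per_year)

# 1 TB of cold data, fully scanned 10x per month: S3 + Spectrum wins ($850 vs $1,000).
print(cheaper_in_redshift(1, 120))   # False
# Same data scanned 20x per month: move it back into the cluster ($1,450 vs $1,000).
print(cheaper_in_redshift(1, 240))   # True
```

In this toy model, the crossover for 1 TB sits at about 150 full scans per year; partitioning and columnar formats reduce the bytes scanned per query and push the crossover further out.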
Athena follows the same logic as Spectrum, except that you’re going all-in on serverless and skip the warehouse entirely. That scenario usually only works for simple, interactive queries that don’t require large-scale aggregations.
But none of these options are “either / or” choices. We see all three analytics engines deployed with customers, depending on the use case.
Earlier this year we launched an update showing query recommendations on how to improve SQL queries to enhance performance. In most cases, improving query performance requires re-writing the SQL and re-submitting the query.
We got a lot of great feedback from teams that used the recommendations to improve queries. Two requests, in particular, stood out:
Our team has been busy working to answer these questions.
We’re pleased to announce a preview of a new enhancement to Query Details which surfaces a graph of the query execution. The purpose of the graph is to visually pinpoint which portion of the query execution was the bottleneck and took the most time.
Armed with this data you can go back to your SQL and know exactly where to start looking.
The size of the nodes indicates elapsed time: the bigger the node, the more time was spent on that portion of the query execution.
Nodes may represent any of the following things:
Clicking on any node will open detailed metrics including memory used, rows and data processed, skew, and other useful information on what happened in that operation.
We’re opening up the feature to existing customers to get feedback. (We plan to release the feature to customers on our ‘Growth’ plan later this year.)
It’s easy to request a graph and we highly encourage it. Simply find a query (the bigger the better) you’d like to see a graph for and send us a request. We’ll reply with a URL that will contain an interactive graph for your query.
There are two ways to make a request:
No problem – if you’re not an existing customer but want to see a graph for one of your queries, simply sign up for a two-week free trial (no credit card required) and request one during your trial.
One of the things we hear most often from our customers is that data is becoming more and more critical to their business, necessitating data SLAs.
Take for example one customer who builds hardware devices for consumers. They use their Amazon Redshift cluster to monitor and provide support for customers on the health of those devices (billions of events per day) and also to manage the complex supply chain for their overseas production lines.
This company relies on the data warehouse to run their business, and if the data pipelines don’t complete in time, or queries are slow or fail – the business loses revenue and customer satisfaction plummets.
We see this trend across all of our customers. In response, businesses are beginning to demand service level agreements (SLAs) for queries and reports built on top of the data warehouse. These SLAs have a number of different service level indicators – the most prominent are:
In fact, we have one customer that demands a 10-second SLA for any query executed through their BI reporting tool. Analysts get fed up with watching the “spinning wheel of death” while waiting for a dashboard to load or a query to complete.
Measuring service levels for query latency is trickier than it seems. To understand why, consider the sheer scale of workloads that a typical data warehouse handles. Our average customer runs over one million queries per week on their Amazon Redshift cluster. These queries span three categories of applications:
At the same time, each category of workload is growing:
So a single SLA across all the queries is meaningless. What you really need is to measure the SLA on a very granular basis – for each data application, down to the individual user level. For a real-world example, consider a company using Looker as their BI application. How do you filter out all the noise to see only the Looker queries? How do you monitor the performance of a single model?
Measuring SLAs is hard, but it is only the beginning of the journey. Once you are able to measure the SLA for each of your data applications, the next step is to improve it by finding and fixing the issues impacting your workloads and users.
The Cluster Health Dashboard helps data teams measure and improve SLAs. It does this by surfacing:
To see latency distribution for any connected app (see here for a list of supported apps), simply modify the filter at the top-right of the screen.
Query latency follows a power-law distribution – and most queries in the warehouse run rather quickly, with a long tail of slow queries.
Measuring query latency is best done by looking at different points across that distribution – the most common being the P95 and P99 latencies. These are shown prominently at the top of the page.
Further, we show you a distribution of latencies across the entire cluster, grouped by user (by default). Click on the chart to quickly go to the queries on the Discover page.
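The per-user percentile view described above can be sketched with a nearest-rank percentile over a query log. The log entries and user names below are invented for illustration:

```python
from collections import defaultdict
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value >= p percent of the sample."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical (user, latency_seconds) query log.
log = [("looker", 0.4), ("looker", 0.9), ("looker", 12.0),
       ("etl", 30.0), ("etl", 45.0), ("etl", 600.0)]

by_user = defaultdict(list)
for user, latency in log:
    by_user[user].append(latency)

for user, latencies in sorted(by_user.items()):
    print(user, "P95:", percentile(latencies, 95))
```

Even in this tiny sample, the power-law shape is visible: the P95 for the “looker” user is dominated by a single slow outlier, which is exactly the query you would drill into.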
Example of SLA metrics for “Looker PDT” queries.
Our team has significant experience running Amazon Redshift at scale. In fact, we run a very large fleet of Amazon Redshift clusters as part of our product architecture. Based on this experience, we surface observations and recommendations on how to optimize the cost and performance of the cluster. The Recommendations cover things like:
The “Top 5 Longest Running Queries” list defaults to showing you the slowest queries in your cluster over the past day.
Click on any of these queries to go to a detail view showing recommendations to optimize the query.
In recent months, there’s been a wave of acquisitions and fundings in the business intelligence (“BI tools”) and analytics space:
That’s a lot of movement in a short time for a segment of enterprise software that’s been somewhat static in the past 25 years.
A major driver in the uptick of M&A and venture activity are cloud warehouses like Amazon Redshift, Google BigQuery and Snowflake. They have brought down the cost and complexity to build a data platform, in a shift away from Hadoop, with BI tools as the catalyst to make data exploration and visualization available to a much wider audience.
We believe that visualizations and charts are a commodity; what seems to matter to buyers of analytics tools is the “eye candy” of the dashboards. But much of the hard work to build a data platform happens underneath the visualization layer, e.g. data acquisition and ingestion, ETL pipelines and data discovery layers.
There’s a bit of confusion on how these tools work and their use cases. So in this post, we’re helping you find the answers, by looking at the current BI tool landscape and their differentiators.
BI tools facilitate exploration and visualization of data, and for the purpose of this post, we’re only including tools that support the major data warehouses (Amazon Redshift, Snowflake, Google BigQuery, Azure SQL Data Warehouse).
Let’s start by dividing BI tools into three simple categories:
We believe the distinction between “legacy” and “modern” is key.
In an old world of on-premise warehouses like Teradata and Oracle, the cost of running queries and experiments on your warehouse was expensive. Extracting the data into the BI tool made sense, to drive down the cost and allow users to be flexible in transforming data and exploring new types of queries.
But cloud warehouses have become so performant and scalable that running transformations and queries in-database is much more efficient. With the separation of storage and compute, there are few reasons left to run queries anywhere else. And the benefit is that employees explore one common, current dataset.
Let’s go through each of these categories and explore their key representatives in more detail.
Metabase is an open-source business intelligence tool designed for non-technical users to provide data insights and visualizations. One of the main advantages of Metabase is its excellent visualizations and the fact that you can create dashboards almost without any prior experience.
You can create dashboards and visualizations using a simple question creator with no programming. If the question creator is not enough for you, you can use basic SQL to create more complex queries.
Another useful feature is “Pulses”. With Pulses, you can send updates on specific queries (“questions”) to a Slack channel or email.
Metabase supports multiple relational and non-relational databases as data sources, among them Amazon Redshift, Snowflake, BigQuery, PostgreSQL, SQLite, MySQL, Druid, MongoDB, and more. You can see the detailed list in the documentation.
Superset is a modern BI tool with a simple interface that allows you to create and share dashboards. Superset comes with a lot of different, rich visualization types included. To create queries and dashboards, you need to know basic SQL.
Superset started as a project at Airbnb, as a fully customizable application to visualize and explore massive amounts of data in a fast and intuitive way. Today, Superset is one of the leading open source BI solutions.
Superset provides an advanced security management system, with flexible adjustment of access levels for products, features and data sets. Superset also has integrations with the major authentication backends. Another attractive Superset feature is the SQL/IDE editor with interactive querying.
Superset supports multiple data source types through SQLAlchemy, a Python ORM that is compatible with the most common databases (PostgreSQL, Amazon Redshift, MySQL, Oracle, SQLite, Microsoft SQL Server).
Redash is a simple and powerful BI tool, suitable for advanced users since you need to know SQL to work with it. Redash provides an interactive query editor for sharing both the dataset and the query that generated it.
Redash allows you to use query results as data sources to join different databases and make advanced reports and dashboards. With Redash you can set up reminders and automatic notifications (Alerts) to notify a group of people or an individual user when a specific query reaches a preset value. The biggest advantage of Redash is the vast array of different visualizations that users can create.
Redash provides the most impressive list of data sources of all the open source BI tools in our review. You can use almost any data source you can imagine, from CSV to Elasticsearch, and of course the major data warehouses like Amazon Redshift, Snowflake and BigQuery. You can explore the full list here.
Tableau is one of the leading BI solutions, offering a wide range of visualization capabilities, extensive documentation, and a simple interface. Tableau is friendly for beginners: most functions are no more than two clicks away, filters are easy to find, and all operations are clearly documented.
A key differentiator for Tableau is data blending, a method for combining data from numerous databases and sources. Tableau also allows multiple users to work on a report in real-time simultaneously.
Tableau supports a wide range of data sources. These can be files (CSV, JSON, MS Excel, etc.), relational and non-relational databases (PostgreSQL, MySQL, SQL Server, MongoDB, etc.) and cloud warehouses (Amazon Redshift, Oracle Cloud, Google BigQuery, Microsoft Azure).
Tableau provides several pricing plans for different needs:
A free trial is also available.
Domo is a powerful cloud-based BI tool with solid data visualization capabilities that integrates with multiple data sources, including spreadsheets, databases, social networks, and almost any other existing cloud or local software solutions.
Domo provides a wide range of connectors (over 100 connectors right out of the box) and impressive sharing features. But its user interface is not intuitive, and this tool is much less friendly to newcomers than some other competitors.
There are several pricing options:
Standard – $83/user per month (billed annually).
Professional – $160/user per month (billed annually).
Enterprise – $190/user per month (billed annually).
Qlik Sense is a powerful BI solution. Its Associative Engine automatically finds relationships inside the data. This simplifies the study of data and allows easy combination of any data sources, no matter how large or complex they are, into a single view.
Qlik Sense lets you explore data at any level of detail you need, study the shape of the data and pinpoint outliers. AI-powered insight suggestions help ensure no meaningful pattern escapes you. Qlik Sense also lets you dive into advanced analytics calculations from R and Python with simple clicks.
Qlik Insight Bot adds AI-powered conversational analytics capabilities to Qlik Sense. With it, you can receive comprehensive reports, ask questions and discover insights using natural language.
Cloud Basic – Free
Cloud Business – $15 per user/month (billed annually).
Power BI is a business intelligence tool from Microsoft, with all the benefits the Microsoft services ecosystem provides. Power BI connects with the company’s main products, such as MS Excel, Azure Cloud Service, and SQL Server. Power BI is marked as a leader in Gartner’s Magic Quadrant for Analytics and Business Intelligence Platforms.
The Power BI interface is simple and clear, and it will be familiar to Windows users. Visualizations are created by drag-and-drop: to create any graphic, click on the desired item and drag it to an empty space in the report.
Power BI supports plenty of ways to incorporate or import your data (streaming data, cloud services, Excel spreadsheets and third-party connections) and provides a simple API for integration with your applications.
Looker is a BI tool of a new generation, fundamentally different from other solutions. It offers LookML, its own data modeling language, which simplifies the description of dimensions, calculations (measures), and data relationships in a database. With LookML, database queries are more descriptive and less error-prone.
Once you have created a data model with LookML describing in which dimensions you want to group the data and how you want to evaluate it, you can open access to this model to other users, who can use it to create their own queries, reports, and dashboards through a simple interface. Thus, the work is divided and optimized: data analysts get a convenient tool for modeling, and business users are freed from the need to write complex SQL queries.
Looker provides educational materials that ease the up-front investment of learning LookML and data modeling before you can build visualizations. (Learn more about Looker and Amazon Redshift Dashboards.)
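To give a flavor of LookML, here is a minimal sketch of a view definition. The view, table, and field names are hypothetical, and real models typically carry many more dimensions and measures:

```lookml
# Hypothetical view over a tickets table in the warehouse.
view: tickets {
  sql_table_name: analytics.tickets ;;

  dimension: country {
    sql: ${TABLE}.country ;;
  }

  measure: total_sold {
    type: count
  }
}
```

Business users never see this file; they pick `country` and `total_sold` in Looker’s UI, and Looker generates the SQL from the model.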
Looker also innovates in how teams collaborate on BI projects. Each project or data analysis can be saved as a Git project, so you can use the full power of modern version control when collaborating on data.
Looker supports about 45 different data sources. You can see the full list on this page.
Looker’s price ranges from $3,000 to $5,000 per month for 10 users. Additional users are $50 per month.
With intermix.io, Looker users get all the tools they need to optimize their queries running on Amazon Redshift. For more info on how this feature works, visit intermix.io/looker.
Chartio is a great BI tool for advanced users with excellent data blending, data transformation, data sharing capabilities and robust security options.
Chartio provides two modes of work: a drag-and-drop interactive mode and a SQL mode. While its UI is quite powerful, it is not as easy for inexperienced users as the UIs of some competitors. Chartio has built a visual version of SQL, which enables you to explore, transform and visualize data on the fly through a drag-and-drop interface.
Chartio’s visual SQL lets you quickly combine and compare data from different sources. For example, you can pull data from Google Analytics and compare it with campaign history from Mailchimp, all in a few clicks, aided by extensive documentation.
Chartio supports about a hundred different data sources, including relational and non-relational databases and a number of SaaS partners. You can see the full list on this page.
With intermix.io, Chartio users get all the tools they need to optimize their queries running on Amazon Redshift. For more info on how this feature works, visit intermix.io/chartio.
Mode is a lightweight framework for business intelligence that runs directly in the browser. The tool allows running complex queries using SQL, Python, or R. The query editor is intuitive and simple, even for novices. For Python and R, you can perform data analysis in Mode Notebooks; more than 60 Python and R libraries for data wrangling and analysis are available out of the box, and you can install other public packages if needed. Mode also supports the creation of reports, which you can customize and share with different stakeholders.
Mode shines at fast, lightweight data analysis. In-browser operation, drag-and-drop reporting, a simple interface, a large number of built-in features, and integration with SQL and popular data analytics languages like Python and R make Mode a great choice for fast prototyping and sharing ideas.
Mode Analytics has two pricing options:
With intermix.io, Mode users get all the tools they need to optimize their queries running on Amazon Redshift. For more info on how this feature works, visit intermix.io/mode.
Periscope Data is one of the newer BI tools. It supports connections to different data sources and data preparation using SQL, Python, and R. It also allows you to perform predictive analytics and unstructured data analysis (for example, natural language processing).
Periscope Data has a lot of built-in visualizations. The provided templates even include exotic charts such as radar charts, cohort grids, and gauge charts, and charting with maps is also available. In addition, you can use Python and R libraries to build your own graphs, so there is plenty of room for customization.
As many users say, there is virtually no learning curve for Periscope Data. It is extremely simple, with an easy-to-understand drag-and-drop interface; if you need to find something, it is usually no more than a few clicks away from your current location in the app. Sharing and collaboration capabilities are also great.
Enterprise pricing plans for Periscope Data are not listed on the official website. However, a free trial is available for customers to get started with the software; after that, they can consult the sales team for pricing.
With intermix.io, Periscope Data users get all the tools they need to optimize their queries running on Amazon Redshift. For more info on how this feature works, visit intermix.io/periscope.
On the chart below, you can see the search trends for the three open source BI tools described in this article: Metabase, Apache Superset, and Redash. At the time of writing, the most popular tool is Metabase, approximately twice as popular as its closest competitor, Redash. You can also see that while the popularity of Apache Superset and Redash remains relatively stable, the popularity of Metabase has started to grow in the last couple of months.
As mentioned earlier, “legacy” commercial BI tools include Tableau, Domo, Qlik, and Power BI. Their search popularity on Google Trends is depicted in the graph below. Tableau is the undoubted leader; its popularity is substantially higher than that of all the other tools, with Power BI second. Domo is a little more popular than Qlik, but they are at roughly the same level. There is no clear trend for any platform, apart from one-off peaks driven by events such as the Tableau acquisition by Salesforce. The sideways nature of the trend for all of these tools underscores their maturity in the BI world.
The leader among the “modern” BI tools is even clearer than among the “legacy” tools: Looker towers over the rest. To appreciate the gap, remember that Google Trends displays relative popularity. At the time of writing, Looker scores 37 points, while Chartio, Mode, and Periscope Data score 2, 3, and 1 point respectively. Looker had a huge peak in popularity due to its acquisition by Google. Beyond that, there is no clear trend for any of these tools; they all remain at the same level.
In this article, we highlighted some of the most popular business intelligence tools. We split them into open source and commercial offerings. In turn, commercial tools were loosely split into “legacy” and “modern”.
“Legacy” BI tools are probably the most well-known. They are regarded as classic platforms, and every BI analyst should know at least some of them.
“Modern” BI tools have emerged relatively recently. These tools provide new value for data users. In many aspects they are very good, and they can be considered worthy competitors to the “legacy giants”.
We briefly described the features of each tool and tried to measure their popularity and trends. Despite the large number of BI tools available today, there is no Swiss Army knife. In each case, a person or organization should consider their use case, assess what they need and which resources are available, and then decide which tool is the optimal choice.