The Company

Udemy is an online learning marketplace. They provide a platform where instructors around the world can host their content, and where students can find courses that fit their needs, interests, or professional fields.

Data plays an integral role for Udemy. Analysts need raw data to build reports that help instructors understand what they can do to help their students learn more effectively. They also need to monitor signup rates, student activity levels, and other metrics to confirm that their advertising and business models are working, and to better match students with content and instructors.

The People

Nathan Sullins is the Engineering Manager of the data warehouse team at Udemy. Nathan’s team makes sure that important raw data is collected and accessible in Amazon Redshift for in-house analysts to be able to generate reports. These reports then help Udemy make smart decisions to grow their platform.

Udemy has a team of ten analysts. In addition, over 200 employees across product, finance, sales, and marketing can create their own analytics via Chartio, running close to 1,000 different dashboards every day.

The Challenge

As Udemy’s data volume grew, it wasn’t clear which schemas were growing the fastest. The team also noticed that queries and dashboards were slowing down, which frustrated the analysts. Some of the major workflows also failed during peak usage. Udemy added more nodes, but that didn’t translate into a 1:1 performance increase.

Nathan set out a few goals:

  1. Make queries fast again
  2. Ensure workflows don’t fail
  3. Find fast-growing schemas and tables

“Our queries were slow, and we knew that we needed to modify the WLM configuration. But if you start to look online for resources on how best to do that, there’s not a lot available. That was the challenge we wanted to get help with.”

Nathan Sullins

The Process

We started with a trial, and a joint Slack channel helped us collaborate in real time. Nathan looked at Udemy’s workload patterns. The analysts were data hungry, running lots of experiments and reports, so the cluster was bombarded by dashboard queries. The queries themselves took only seconds to execute, but queue wait time was causing the dashboards to hang, along with the frustration that followed.
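
To make the distinction between queue wait time and execution time concrete, here is a minimal sketch in Python. It assumes each query's timings have already been exported as seconds (in Redshift, the STL_WLM_QUERY system table reports total_queue_time and total_exec_time in microseconds). The QueryTiming class, the summarize helper, and the sample numbers are all hypothetical and are not Udemy's data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class QueryTiming:
    """Hypothetical record for one query's timings."""
    query_id: int
    queue_seconds: float  # time spent waiting for a free WLM slot
    exec_seconds: float   # time spent actually running


def summarize(timings):
    """Return average queue time, average execution time, and the
    share of total latency that was spent waiting in the queue."""
    queued = sum(t.queue_seconds for t in timings)
    total = sum(t.queue_seconds + t.exec_seconds for t in timings)
    return {
        "avg_queue": mean(t.queue_seconds for t in timings),
        "avg_exec": mean(t.exec_seconds for t in timings),
        "queue_share": queued / total if total else 0.0,
    }


# Made-up sample: short queries that mostly sit in the queue.
sample = [
    QueryTiming(1, queue_seconds=8.0, exec_seconds=2.0),
    QueryTiming(2, queue_seconds=10.0, exec_seconds=1.0),
    QueryTiming(3, queue_seconds=6.0, exec_seconds=3.0),
]
summary = summarize(sample)
print(summary)
```

A high queue_share with a low avg_exec is the pattern described above: the queries are quick, but they spend most of their time waiting for a slot.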

We also included the analysts in the trial from the start to understand their requirements: when they needed results from their dashboards, how often, and how they had set up their data transformations. The needs of the analysts served as the starting point for a new cluster configuration.

The Solution

Udemy used our “Throughput Analysis” and “Memory Analysis” features to re-configure the Workload Management (WLM) queues. Queues, concurrency and memory settings were balanced with the different workloads. The result was a drop in average queue time for all queries from 11 seconds to below 1 second. With data flowing again, aborted queries and failed workflows started to disappear as well.
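
As an illustration of what balancing queues, concurrency, and memory can look like, the sketch below builds a manual WLM layout in Python and runs two sanity checks on it. The property names (query_group, query_concurrency, memory_percent_to_use) follow Redshift's WLM JSON configuration, but the queue names, the numbers, and the check_wlm and memory_per_slot helpers are hypothetical. This is not the configuration Udemy ended up with. The 50-slot ceiling is passed in as a parameter and should be checked against the current Redshift limits.

```python
import json

# Hypothetical queue layout; names and numbers are illustrative only.
wlm_queues = [
    {"query_group": ["dashboard"], "query_concurrency": 10, "memory_percent_to_use": 30},
    {"query_group": ["etl"], "query_concurrency": 3, "memory_percent_to_use": 50},
    {"query_concurrency": 5, "memory_percent_to_use": 20},  # default queue
]


def check_wlm(queues, max_total_slots=50):
    """Return a list of problems found in a queue layout (empty if none)."""
    problems = []
    total_memory = sum(q.get("memory_percent_to_use", 0) for q in queues)
    total_slots = sum(q.get("query_concurrency", 0) for q in queues)
    if total_memory > 100:
        problems.append(f"memory adds up to {total_memory}%, above 100%")
    if total_slots > max_total_slots:
        problems.append(f"{total_slots} slots exceed the limit of {max_total_slots}")
    return problems


def memory_per_slot(queue):
    """Percent of cluster memory each slot gets, assuming an even split."""
    return queue["memory_percent_to_use"] / queue["query_concurrency"]


problems = check_wlm(wlm_queues)
config_json = json.dumps(wlm_queues)  # JSON string form of the layout
print(problems, [round(memory_per_slot(q), 1) for q in wlm_queues])
```

The trade-off the sketch exposes is the one behind the tuning described above: raising concurrency shortens queue waits for many short dashboard queries, but it leaves each slot with less memory, so heavier transformation workloads are better served by a separate queue with fewer, larger slots.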

We also used the “Storage” page in the dashboard to identify fast-growing tables, and drill down into the data sources and queries driving that growth. 
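
The idea behind spotting fast-growing tables can be sketched as a comparison of two size snapshots. The example below assumes table sizes in MB have been captured at two points in time (Redshift's SVV_TABLE_INFO view reports a table's size in 1 MB blocks). The fastest_growing helper, the schema and table names, and the sizes are hypothetical, and this is not how the "Storage" page is implemented.

```python
def fastest_growing(before, after, top_n=3):
    """Rank tables by absolute growth between two snapshots.

    before/after map (schema, table) -> size in MB. Tables missing
    from the earlier snapshot are treated as having started at 0 MB.
    """
    growth = []
    for key, new_size in after.items():
        delta = new_size - before.get(key, 0)
        if delta > 0:
            growth.append((key, delta))
    growth.sort(key=lambda item: item[1], reverse=True)
    return growth[:top_n]


# Made-up snapshots taken a week apart.
last_week = {
    ("events", "page_views"): 1200,
    ("sales", "orders"): 400,
    ("ref", "countries"): 5,
}
this_week = {
    ("events", "page_views"): 1900,
    ("sales", "orders"): 450,
    ("ref", "countries"): 5,
    ("events", "clicks"): 300,
}

top = fastest_growing(last_week, this_week)
for (schema, table), delta in top:
    print(f"{schema}.{table}: +{delta} MB")
```

Summing the same deltas per schema instead of per table would give the schema-level view that was one of the original goals.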

The end result was a much leaner cluster that is simpler to manage. The new cluster configuration and its performance served as the starting point for more productive work with data at Udemy. Udemy will continue to face growth in both data volume and the number of people who need access to it. With the new setup and understanding, they are in a position to meet this growth head on.