The Advantages of Being Data-Driven (and a Little Amazon Redshift Training)

To win in today’s market, companies must invest in both their infrastructure and their people. Data-first companies like Netflix, Uber and Tinder are dominating their industries. Mentions of “AI” are often heard in advertisements, product launches, and earnings calls. Businesses are scrambling to re-imagine their tech infrastructure and provide their people with things like Amazon Redshift training.

Why is this happening now?

  • Businesses can store and process vast quantities of data thanks to cost-effective warehousing platforms like Amazon Redshift.
  • Machine learning algorithms, which leverage this data, are open-source and available to everyone.
  • Data scientists know how to derive actionable insights from machine learning tools.

Data Is the New Differentiator

Moving towards a data-driven organization can be daunting for decision-makers. Learning investments such as Amazon Redshift training can sometimes produce limited results, because modern data platforms are complex enough to intimidate even the most experienced IT professionals.

The Shift in Data Platforms

We’re all generating a lot more data than ever before. Storing it is a challenge; analyzing it is an even bigger challenge. Over the years, we’ve seen highly structured data warehouses give way to more anarchic data lakes, which in turn are giving way to multi-tiered structures like data lakehouses.

Evolution of Data Platforms

As a result of the shift, traditional concepts of “ETL” are being re-imagined for a new world where:

  • Disparate sources contain structured and unstructured data
  • Terabytes of new data appear every day, often with no set expiry date
  • Workloads rise due to machine learning and data democratization

In response, teams are building complex data assembly lines that share common characteristics:

  • A data lake stores all the raw, unstructured data, forever
  • Several analytics databases process data and warehouse it
  • An orchestration layer coordinates the flow of data between databases
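The three layers above can be sketched in a few lines of Python. This is a deliberately minimal, hypothetical illustration (all names such as `run_pipeline` and `raw_events` are invented for this example), not a real orchestration framework:

```python
# Minimal sketch of a data assembly line: a "lake" of raw records,
# a transform step, and an orchestration function coordinating the flow.

def extract(lake):
    """Pull raw records from the data lake (here, just a list)."""
    return list(lake)

def transform(records):
    """Clean records: keep only those with a user_id."""
    return [r for r in records if r.get("user_id") is not None]

def load(records, warehouse):
    """Load cleaned records into the analytics store (here, a dict of tables)."""
    warehouse.setdefault("events", []).extend(records)
    return len(records)

def run_pipeline(lake, warehouse):
    """Orchestration layer: coordinate lake -> transform -> warehouse."""
    return load(transform(extract(lake)), warehouse)

raw_events = [{"user_id": 1, "action": "click"},
              {"user_id": None, "action": "view"},
              {"user_id": 2, "action": "buy"}]
warehouse = {}
loaded = run_pipeline(raw_events, warehouse)
print(loaded)  # 2
```

In practice each of these functions would be a separate job scheduled by an orchestrator such as Apache Airflow, but the division of responsibilities is the same.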

Data lake architectures pose new challenges. Since lakes store data without any oversight of the contents, defined cataloging mechanisms are needed to make that data usable. Without them, the data isn't reliable. Companies need data pipelines that offer governance, consistency, and access controls for the data lake.

New technologies like Amazon Redshift provide all the storage and processing power required to run a data-first business, but you still need the tools to turn that data into actionable insights.

Data Use Cases

Companies use their data for three main purposes:

  1. Analytics applications. Customer- or internal-facing dashboards that present data for analysis and reporting.
  2. Machine learning. Data scientists pull data to develop models. Once the models are ready, new data is fed into them continuously to operationalize AI.
  3. Traditional reporting. Ad hoc queries run by business analysts and citizen users.

The Data Engineer

Building and managing complex data assembly lines requires a new skill set.

The last time we saw a comparable shift was back when cloud computing was first developed. Running cloud apps required an operational mindset (i.e., uptime, cost, and performance) coupled with an ability to write code. The DevOps role was born out of a need to manage infrastructure as code. As a result, organizations had to establish new teams and hire new types of employees. These employees needed different tools to do their jobs.

Similarly, a new role has emerged to manage data pipelines: the data engineer. Data engineers manage complex data flows by writing code that manipulates data, but they are also accountable for the uptime, performance, and cost of those data flows. This skill set is a combination of DevOps and data analyst, with a sprinkling of database administrator. They also need platform knowledge, such as Amazon Redshift training.

Building a data assembly line involves:

1. Data Acquisition and Cataloging

Raw data often exists in application silos, stranded. 93% of enterprises have a multi-cloud strategy, while 87% have a hybrid cloud strategy. So, for many organizations, data exists across multiple clouds, plus their own data centers. They will need to inspect and catalog this data before they can realize any value from it.
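As a rough illustration of the "inspect and catalog" step, here is a hypothetical sketch that looks at raw records from two sources and builds a simple catalog entry (field names, inferred types, row counts) for each, so downstream users can discover what the lake actually contains:

```python
# Build a simple catalog entry per source by inspecting its records.

def catalog_entry(source_name, records):
    """Infer field names and Python types from a sample of records."""
    fields = {}
    for record in records:
        for key, value in record.items():
            fields.setdefault(key, type(value).__name__)
    return {"source": source_name, "fields": fields, "row_count": len(records)}

crm_rows = [{"email": "a@example.com", "plan": "pro"}]
app_rows = [{"user_id": 42, "event": "login"}]

catalog = [catalog_entry("crm", crm_rows), catalog_entry("app", app_rows)]
print(catalog[0]["fields"])  # {'email': 'str', 'plan': 'str'}
```

Real catalogs (such as the AWS Glue Data Catalog) do far more, including schema versioning and partition tracking, but the core idea is the same: metadata about the data makes the lake discoverable.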

2. Security

Security is a fundamental part of any data pipeline. The data owners must audit all access, and also ensure that the right people and processes have the right permissions.
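The two requirements above, permissions for the right people and an audit trail of every access, can be sketched in a few lines. The names here (`PERMISSIONS`, `read_table`) are hypothetical, standing in for what a real system would delegate to IAM policies and audit logging:

```python
# Permission check plus audit trail for table reads.

PERMISSIONS = {"alice": {"orders", "customers"}, "etl_job": {"orders"}}
AUDIT_LOG = []

def read_table(principal, table):
    """Allow the read only if permitted, and record every attempt."""
    allowed = table in PERMISSIONS.get(principal, set())
    AUDIT_LOG.append({"who": principal, "table": table, "allowed": allowed})
    if not allowed:
        raise PermissionError(f"{principal} may not read {table}")
    return f"rows from {table}"

print(read_table("alice", "orders"))  # rows from orders
```

Note that the audit entry is written before the permission check raises, so denied attempts are recorded too.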

3. Transformation and Cleaning

Before anyone can trust data, they have to transform and cleanse it. Consider how you identify a customer across the data – is it by email, name, or some unique ID? If you want to combine two data sources, which one of those do you use? This stage involves data validation, removal of duplication, and handling of null values.
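The three tasks named above, choosing a canonical customer key, removing duplicates, and handling nulls, might look like this in a minimal cleansing step (an illustrative sketch using email as the canonical key, not a prescription):

```python
# Cleanse records: canonicalize the customer key, de-duplicate, handle nulls.

def cleanse(records):
    seen = set()
    out = []
    for r in records:
        email = (r.get("email") or "").strip().lower()
        if not email:            # null/missing key: skip rather than load bad rows
            continue
        if email in seen:        # de-duplicate on the canonical key
            continue
        seen.add(email)
        out.append({"email": email, "name": r.get("name") or "unknown"})
    return out

dirty = [{"email": "A@x.com", "name": "Ann"},
         {"email": "a@x.com ", "name": "Ann"},
         {"email": None, "name": "Ghost"}]
print(cleanse(dirty))  # [{'email': 'a@x.com', 'name': 'Ann'}]
```

Whether to drop, quarantine, or impute records with missing keys is a business decision; the point is that the rule must be explicit before anyone downstream can trust the table.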

4. Performance

Mobile and IoT generate huge volumes of data. Running fast queries on huge data volumes requires careful planning, tuning, and configuration of data analytics infrastructure.

Most companies start by hiring data engineers to implement reliable data pipelines, and they may also hire data scientists with analytics skills. For a data pipeline, you may also require staff with specialist knowledge, such as Hadoop, ETL, or Amazon Redshift training.

Data Apps

Unlike end-user applications, data apps run jobs on the data assembly line. There are three categories of data apps:

  1. Data integration services. Vendors who move data from external systems or applications into your data lake. Examples include Informatica, Stitch Data, Fivetran, Alooma, and ETLeap.
  2. Workflow orchestration. Systems that implement workflows running various jobs on your data assembly line, such as transformations and ETL. Examples include Apache Airflow, Pinterest's Pinball, and Spotify's Luigi.
  3. Analysis. Data science, reporting, and visualization apps. Examples include Tableau, Jupyter notebooks, Mode Analytics, Looker, Chartio, and Periscope Data.

New Problems

The fundamental problem solved by the data engineer is to ensure that the data assembly line is working:

  • Are data flows operating normally?
  • Do my data tables contain the correct results?
  • Are data apps able to access the data quickly?

Data pipelines are constantly evolving.

This requires answering questions in real time across multiple systems:

  • Is query latency increasing for this app? If so, why?
  • Are queries completing as expected?
  • What are my key data flows? How is their performance changing over time?
  • Why is this table growing so quickly?
  • Why has this table stopped updating?
  • Did this data load operation complete successfully? Were all the rows captured?
  • Do I have contention issues, and what is causing them?
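To make the first question concrete, here is a hypothetical sketch of a latency-regression check: compare the average latency of the most recent query samples for an app against its earlier baseline (the function name and threshold are illustrative assumptions):

```python
# Flag a latency regression: recent average vs. historical baseline.

def latency_regressed(latencies_ms, window=3, threshold=1.5):
    """Return True when the mean of the last `window` samples exceeds
    `threshold` times the mean of all earlier samples."""
    baseline, recent = latencies_ms[:-window], latencies_ms[-window:]
    if not baseline or len(recent) < window:
        return False
    return (sum(recent) / len(recent)) > threshold * (sum(baseline) / len(baseline))

print(latency_regressed([100, 110, 95, 105, 300, 320, 310]))  # True
print(latency_regressed([100, 110, 95, 105, 100, 105, 98]))   # False
```

A production monitor would pull these samples from the warehouse's query logs (in Redshift, system tables expose per-query execution times) and alert on the flag.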

A New Type of Monitoring

To accomplish this, you need new types of metrics. Traditional network monitoring metrics like CPU and network utilization are irrelevant when monitoring data assembly lines, because data flows operate at a different layer.

A monitoring tool for data flows must consider:

  • Query information
    • Query text, execution times, cost information, and runtime information.
  • Data app context
    • Monitoring app performance requires correlating app context with query performance. This allows you to measure the latency and cost of individual users and data flows.
  • Data metadata
    • Understanding data flows requires knowing how queries interact with tables. This lets you determine how data is moving across your data assembly line.
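As a small illustration of the "data app context" point above, suppose each query record in a log carries an app tag; aggregating latency per app is then straightforward (the log fields here are hypothetical):

```python
# Correlate app context with query performance: average latency per app.
from collections import defaultdict

def latency_by_app(query_log):
    totals, counts = defaultdict(float), defaultdict(int)
    for q in query_log:
        totals[q["app"]] += q["latency_ms"]
        counts[q["app"]] += 1
    return {app: totals[app] / counts[app] for app in totals}

log = [{"app": "dashboard",   "latency_ms": 120},
       {"app": "dashboard",   "latency_ms": 80},
       {"app": "ml_training", "latency_ms": 900}]
print(latency_by_app(log))  # {'dashboard': 100.0, 'ml_training': 900.0}
```

The hard part in practice is not the aggregation but the tagging: getting every query annotated with the app, user, and data flow that issued it.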

We set out to solve these problems. Our mission is to provide data engineers with a single dashboard to help them monitor their mission-critical data flows, and when there are problems, to make sure they are the first to know and to understand why.

Find out more about building platforms with our SF Data Weekly newsletter, or hop on the Intermix Slack Community and join the conversation.

Mark Smallcombe
