To win in today’s market, companies must build core competencies in advancing their use of data. Data-first companies are dominating their industries: consider Netflix vs. network TV, Tinder vs. Match.com, or Stitch Fix vs. The Mall. Mentions of “AI” are everywhere in advertisements, product launches, and earnings calls.
Why is this happening now?
Data is the new differentiator:
These trends have led to a shift in how companies build data platforms: away from single, monolithic data warehouses and toward data-lake-based architectures. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, and then decide at a later time how to use it.
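As a minimal sketch of the “store as-is, decide later” idea, the snippet below uses a local directory as a hypothetical stand-in for a data lake: raw events with varying shapes are written verbatim, and a schema is applied only at read time. All names and records here are illustrative, not a real lake API.

```python
import json
from pathlib import Path

# Hypothetical local stand-in for a data lake: land raw events as-is.
lake = Path("lake/events")
lake.mkdir(parents=True, exist_ok=True)

raw_events = [
    {"user": "a@example.com", "action": "click", "ts": "2019-01-01T10:00:00"},
    {"user": "b@example.com", "action": "purchase", "amount": 19.99},  # shape varies
]
(lake / "batch-001.json").write_text("\n".join(json.dumps(e) for e in raw_events))

# Later, decide how to use it: interpret the data at read time ("schema on read").
events = [json.loads(line) for line in (lake / "batch-001.json").read_text().splitlines()]
purchases = [e for e in events if e.get("action") == "purchase"]
print(len(purchases))  # 1
```

The point of the sketch is that nothing about the data’s eventual use had to be decided at write time.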
As a result of the shift, traditional concepts of “ETL” are being re-imagined for a new world where:
The response has been that teams are building complex data assembly lines which have common characteristics:
Data lake architectures pose new challenges. Since data is stored without any oversight of its contents, there must be defined mechanisms for cataloging it in order to make it usable. Without this, data cannot be found or trusted. So meeting the needs of the company requires that data assembly lines apply governance, consistency, and access controls to the data lake.
Companies use the data for the following purposes:
Building and managing complex data assembly lines requires a new skill set.
The last time we saw a similar shift was 10 years ago, when cloud computing first emerged. Running cloud apps required an operational mindset (i.e., uptime, cost, and performance) coupled with an ability to write code. The DevOps role was born out of a need to manage infrastructure as code. As a result, engineering teams had to establish new teams and hire new types of employees. These employees needed different tools to do their jobs.
Similarly, a new role has emerged to manage data assembly lines: the data engineer. Data engineers manage complex data flows by writing code that manipulates data. But they are also accountable for the uptime, performance, and cost of those data flows. This skill set is a combination of DevOps and data analyst, with a sprinkling of database administrator. No wonder they are in high demand!
Building a data assembly line involves:
1. Collection and Cataloging.
Raw data is stranded in application silos. 23% of enterprise application workloads live in the cloud, growing to 52% in 5 years. Over half of enterprises intend to have “multi-cloud” architectures. So within five years, over half of a company’s data will be in at least two clouds plus their own data centers.
This data must be inspected and catalogued so it can be understood by analytics databases for processing and analysis.
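One way to picture cataloging is as building an index of what each dataset contains. The sketch below infers a minimal catalog entry, the union of observed fields and their types, from a batch of raw records; the records and field names are hypothetical, and real catalogs (e.g., in a metastore) track far more.

```python
# Hypothetical raw records landed in the lake; shapes may drift between batches.
records = [
    {"id": 1, "email": "a@example.com", "signup": "2019-01-01"},
    {"id": 2, "email": "b@example.com", "plan": "pro"},
]

# Build a minimal catalog entry: every field ever observed, with its types,
# so downstream analytics databases know what the dataset contains.
catalog = {}
for rec in records:
    for field, value in rec.items():
        catalog.setdefault(field, set()).add(type(value).__name__)

for field in sorted(catalog):
    print(field, sorted(catalog[field]))
```

Without an index like this, a consumer would have to scan the raw files themselves to learn that `plan` only appears in some records.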
2. Security.
Data must be secured to ensure that data assets are protected. Access to data must be audited. Access rights to data must be assigned to the correct teams and tools.
3. Transformation and Cleaning.
Data must be reduced and cleaned so it can be trusted. Consider how a customer is identified across the data – by email, name, or some unique ID? If I want to combine two data sources, which identifier do I use? Duplicate data should be removed, and data should be validated to ensure it is complete, without any gaps.
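These three concerns – deduplication, validation, and choosing an identifier to combine sources on – can be sketched in a few lines. Everything below is illustrative: the sources, the use of email as the customer key, and the required fields are assumptions, not a prescribed pipeline.

```python
# Hypothetical customer records from two sources; email serves as the join key.
source_a = [
    {"email": "a@example.com", "name": "Ada"},
    {"email": "a@example.com", "name": "Ada"},  # duplicate
    {"email": "b@example.com", "name": None},   # incomplete record
]
source_b = [{"email": "a@example.com", "lifetime_value": 120.0}]

# 1) Deduplicate on the chosen identifier.
seen, deduped = set(), []
for rec in source_a:
    if rec["email"] not in seen:
        seen.add(rec["email"])
        deduped.append(rec)

# 2) Validate: drop records with missing required fields.
clean = [r for r in deduped if all(r.get(f) is not None for f in ("email", "name"))]

# 3) Combine sources on the shared key.
ltv = {r["email"]: r["lifetime_value"] for r in source_b}
combined = [{**r, "lifetime_value": ltv.get(r["email"])} for r in clean]
print(combined)  # [{'email': 'a@example.com', 'name': 'Ada', 'lifetime_value': 120.0}]
```

Note that the choice of key in step 3 is exactly the design decision the paragraph raises: a different identifier would produce a different combined result.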
4. Performance.
Data volumes are enormous today because of mobile and IoT. Running fast queries on huge data volumes requires careful planning, tuning, and configuration of data analytics infrastructure.
The first hire for most data teams is usually a data scientist, but this is the wrong first hire to make. The data engineer should be the first hire of any data team. Without a data engineer, the company’s data is stranded and unusable.
Unlike end-user applications, data apps run jobs on the data assembly line. There are three categories of data apps.
Monitoring the performance of these data apps is critical to building reliable data assembly lines.
The fundamental problem solved by the data engineer is to ensure that the data assembly line is working. This requires answering questions in real-time across multiple systems:
Are data flows operating normally?
Do my data tables contain the correct results?
Are data apps able to access the data quickly?
In order to accomplish this, you need new types of metrics. Traditional infrastructure monitoring metrics like CPU and network utilization are irrelevant when monitoring data assembly lines, because data flows operate at a different layer.
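To make the “different layer” concrete, the sketch below checks data-level signals – table freshness and row counts – rather than host metrics. The table stats, thresholds, and alert wording are all hypothetical; a real monitoring tool would pull these values from the warehouse and pipeline systems themselves.

```python
from datetime import datetime, timedelta

# Hypothetical data-level stats for one table: properties of the data flow,
# not of CPUs or networks.
table_stats = {
    "last_loaded_at": datetime(2019, 3, 1, 11, 30),
    "row_count": 1_250_000,
    "expected_min_rows": 1_000_000,
    "max_staleness": timedelta(hours=1),
}

def check_flow(stats, now):
    """Return a list of alerts for a single table's data flow."""
    alerts = []
    if now - stats["last_loaded_at"] > stats["max_staleness"]:
        alerts.append("stale: data has not loaded recently")
    if stats["row_count"] < stats["expected_min_rows"]:
        alerts.append("incomplete: row count below expected minimum")
    return alerts

print(check_flow(table_stats, now=datetime(2019, 3, 1, 13, 0)))
# ['stale: data has not loaded recently']
```

A CPU graph would show nothing unusual in this scenario; only a metric defined on the data itself reveals that the flow has stopped delivering.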
A monitoring tool for data flows must consider:
We started intermix.io to solve these problems. Our mission is to provide data engineers with a single dashboard to monitor their mission-critical data flows. And if there are problems, to make sure they are the first to know – and to tell them why.
Today, Uncork Capital and S28 Capital, along with PAUA Ventures, Bastian Lehmann (CEO of Postmates), and Hasso Plattner (Founder of SAP), are backing us to help reach this goal. If we are successful, then all companies will have the tools they need to win with data.
Photo by Hans-Peter Gauster