Why We Built intermix.io – “APM for Data”
To win in today’s market, companies must build core competencies in advancing their use of data. Data-first companies are dominating their industries: for example, Netflix vs. network TV, Tinder vs. Match.com, and Stitch Fix vs. the mall. Mentions of “AI” are often heard in advertisements, product launches, and earnings calls.
Why is this happening now?
- The cloud made it cheap and easy to collect, store and process data.
- Machine learning algorithms, which leverage this data, were once the secret sauce of a company. But everything changed when tech giants began open sourcing their algorithms a few years ago.
- Data scientists are trained to use this machine learning software.
Data is the new differentiator:
- WeWork is collecting a proprietary data set on real estate, demographics and how people work
- Udemy is collecting proprietary data on how students interact with course content
- Stitch Fix and Zappos are collecting proprietary data on clothing preferences
The Shift in Data Platforms
These trends have led to a shift in how companies build data platforms. The shift is away from single, monolithic data warehouses to data lake based architectures. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, and then decide at a later time how to use it.
As a result of the shift, traditional concepts of “ETL” are being re-imagined for a new world where:
- Source data resides in multiple clouds
- Enormous data volumes (terabytes of new data per day) are expected to be stored forever
- Rising workloads due to machine learning and data democratization
The response has been that teams are building complex data assembly lines which have common characteristics:
- A data lake (e.g. S3) stores all the raw, unstructured data, forever
- Several analytics databases process data and warehouse it
- An orchestration layer coordinates the flow of data between databases
Data lake architectures pose new challenges. Since data is stored without any oversight of its contents, it needs defined mechanisms for cataloging in order to be usable. Without a catalog, data cannot be found or trusted. So meeting the needs of the company requires that data assembly lines apply governance, consistency, and access controls to the data lake.
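To make the cataloging idea concrete, here is a minimal sketch in Python. It is only an illustration: the `CatalogEntry` and `Catalog` names, the fields, and the `s3://example-lake/...` paths are all hypothetical, and a real catalog would live in a dedicated service rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata record for one data set in the lake."""
    path: str      # where the raw data lives (illustrative path)
    owner: str     # team accountable for the data
    schema: dict   # column name -> type, as plain strings
    allowed_roles: set = field(default_factory=set)  # roles that may read it


class Catalog:
    """In-memory stand-in for a data catalog (an assumption, not a real API)."""

    def __init__(self):
        self._entries = {}

    def register(self, name, entry):
        self._entries[name] = entry

    def find(self, column):
        """Return the names of data sets that contain the given column."""
        return sorted(n for n, e in self._entries.items() if column in e.schema)

    def can_read(self, name, role):
        """Access control check: may this role read the data set?"""
        return role in self._entries[name].allowed_roles


catalog = Catalog()
catalog.register("orders", CatalogEntry(
    path="s3://example-lake/raw/orders/",
    owner="data-eng",
    schema={"order_id": "string", "email": "string", "total": "double"},
    allowed_roles={"analyst", "data-eng"},
))
catalog.register("clicks", CatalogEntry(
    path="s3://example-lake/raw/clicks/",
    owner="web",
    schema={"email": "string", "url": "string"},
    allowed_roles={"data-eng"},
))
```

With entries like these, `catalog.find("email")` lists every data set that carries an email column, and `catalog.can_read("clicks", "analyst")` answers whether a given role has access.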
Data Use Cases
Companies use the data for the following purposes:
- Analytics applications. Customer- or internal-facing dashboards that present data for analysis and reporting.
- Machine learning. Data scientists pull data to develop models. Once ready, training data sets are fed continuously into these models to operationalize AI.
- Traditional reporting. Ad hoc queries run by business analysts and citizen users.
The Data Engineer
Building and managing a complex data assembly line requires a new skill set.
The last time we saw a similar shift was 10 years ago when cloud computing was first developed. Running cloud apps required an operational mindset (i.e. a focus on uptime, cost, and performance) coupled with an ability to write code. The DevOps role was born out of a need to manage infrastructure as code. As a result, engineering organizations had to establish new teams and hire new types of employees. These employees needed different tools to do their jobs.
Similarly, a new role has emerged to manage data assembly lines: the data engineer. Data engineers manage complex data flows by writing code that manipulates data. But they are also accountable for the uptime, performance, and cost of those data flows. This skill set is a combination of DevOps and data analyst, with a sprinkling of database administrator. No wonder they are in high demand!
Building a data assembly line involves:
1. Data acquisition and cataloging.
Raw data is stranded in application silos. 23% of enterprise application workloads live in the cloud, growing to 52% in 5 years. Over half of enterprises intend to have “multi-cloud” architectures. So within five years, over half of a company’s data will be in at least two clouds plus their own data centers.
This data must be inspected and catalogued so it can be understood by analytics databases for processing and analysis.
2. Security.
Data must be secured to ensure that data assets are protected. Access to data must be audited. Access rights to data must be assigned to the correct teams and tools.
3. Transformation and Cleaning.
Data must be reduced and cleaned so it can be trusted. Consider how a customer is identified across the data: is this done by email, name, or some unique ID? If I want to combine two data sources, which one of those do I use? Duplicate data should be removed, and data should be validated to ensure it is complete, without any gaps.
4. Performance.
Data volumes are enormous today because of mobile and IoT. Running fast queries on huge data volumes requires careful planning, tuning, and configuration of data analytics infrastructure.
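As a small illustration of the transformation and cleaning step, the sketch below combines two sources, identifies customers by a normalized email address, drops duplicates, and sets aside incomplete rows. The field names, the sample records, and the choice of email as the identifying key are assumptions made for this example.

```python
def normalize_email(email):
    """Normalize an email so the same customer matches across sources."""
    return email.strip().lower()


def clean(records, required=("email", "name")):
    """Drop incomplete rows and duplicates, keyed on normalized email.

    Returns (cleaned, rejected). The first occurrence of each email wins.
    """
    seen = set()
    cleaned, rejected = [], []
    for row in records:
        # Validate completeness: every required field must be non-empty.
        if any(not row.get(name) for name in required):
            rejected.append(row)
            continue
        key = normalize_email(row["email"])
        if key in seen:
            continue  # duplicate customer, skip it
        seen.add(key)
        cleaned.append({**row, "email": key})
    return cleaned, rejected


# Hypothetical rows from two application silos.
crm = [
    {"email": "Ann@Example.com", "name": "Ann"},
    {"email": "bob@example.com", "name": ""},
]
billing = [
    {"email": "ann@example.com ", "name": "Ann B."},
    {"email": "cy@example.com", "name": "Cy"},
]

cleaned, rejected = clean(crm + billing)
```

Here the two "Ann" rows collapse into one because their emails normalize to the same key, and the row with the empty name is rejected rather than silently loaded.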
The first hire for most data teams is usually a data scientist, but this is the wrong first hire to make. The data engineer should be the first hire of any data team. Without a data engineer, the company’s data is stranded and unusable.
Unlike end-user applications, data apps run jobs on the data assembly line. There are three categories of data apps.
- Data integration services. Vendors who move data from external systems or applications, into your data lake. Examples are Informatica, Stitch Data, Fivetran, Alooma, ETLeap.
- Workflow orchestration. These are systems that implement workflows that run various jobs on your data assembly line, such as transformations and ETL. Examples are Apache Airflow, Pinterest’s Pinball, Spotify’s Luigi.
- Analysis. These are data science, reporting and visualization apps. Examples are Tableau, Jupyter notebooks, Mode Analytics, Looker, Chartio, Periscope Data.
Monitoring the performance of these data apps is critical to building reliable data assembly lines.
The fundamental problem solved by the data engineer is to ensure that the data assembly line is working.
Are data flows operating normally?
Do my data tables contain the correct results?
Are data apps able to access the data quickly?
This requires answering questions in real time across multiple systems:
- Is query latency increasing for this App? If so, why?
- Are queries being killed, or aborted?
- What are my key data flows? How is their performance changing over time?
- This table is growing quickly, what is causing that?
- This table has stopped updating, why?
- Did this data load operation complete successfully? Were all the rows captured?
- Do I have contention issues and what is causing them?
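Two of the questions above can be sketched in a few lines of Python: "this table has stopped updating" and "is query latency increasing". The function names, thresholds, timestamps, and sample durations are illustrative assumptions; in practice the inputs would come from the databases' own system tables or logs.

```python
from datetime import datetime, timedelta
from statistics import median


def stale_tables(last_updated, now, max_age):
    """Return tables whose last update is older than max_age."""
    return sorted(t for t, ts in last_updated.items() if now - ts > max_age)


def latency_increased(durations, window=3, factor=1.5):
    """Compare the median of the most recent `window` query durations
    against the median of all earlier ones."""
    if len(durations) < 2 * window:
        return False  # not enough history to compare
    baseline = median(durations[:-window])
    recent = median(durations[-window:])
    return recent > factor * baseline


# Hypothetical observations.
now = datetime(2024, 1, 2, 12, 0)
last_updated = {
    "orders": datetime(2024, 1, 2, 11, 30),
    "events": datetime(2024, 1, 1, 6, 0),
}
stale = stale_tables(last_updated, now, max_age=timedelta(hours=6))

durations = [1.0, 1.2, 0.9, 1.1, 2.5, 2.8, 3.0]  # seconds, oldest first
slower = latency_increased(durations)
```

Medians are used instead of means so that a single slow query does not trigger an alert on its own; that is a design choice for this sketch, not a requirement.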
A new type of monitoring
To accomplish this, you need new types of metrics. Traditional infrastructure monitoring metrics like CPU and network utilization are irrelevant when monitoring data assembly lines, because data flows operate at a different layer.
A monitoring tool for data flows must consider:
- Query information. Query text, execution times, cost information, and runtime information.
- Data app context. Monitoring app performance requires correlating app context to query performance. This allows you to measure latency and cost of individual users and data flows.
- Data metadata. Understanding data flows requires knowing how queries interact with tables. This lets you determine how data is moving across your data assembly line.
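One possible way to correlate app context with query performance is to tag each query with a leading SQL comment and read the tag back from the query log. The sketch below assumes that approach purely for illustration: the JSON comment format, the `annotate` and `latency_by_app` helpers, and the sample log are hypothetical, not a description of how any particular product works.

```python
import json
import re
from collections import defaultdict

# Matches a leading /* {...} */ comment that carries JSON context.
ANNOTATION = re.compile(r"/\*\s*(\{.*?\})\s*\*/", re.DOTALL)


def annotate(sql, app, user):
    """Prefix a query with a comment describing which app and user ran it."""
    context = json.dumps({"app": app, "user": user}, sort_keys=True)
    return f"/* {context} */ {sql}"


def parse_context(query_text):
    """Extract the context from an annotated query, or {} if none is present."""
    match = ANNOTATION.match(query_text)
    return json.loads(match.group(1)) if match else {}


def latency_by_app(query_log):
    """Average execution time per app from (query_text, seconds) pairs."""
    totals = defaultdict(lambda: [0, 0.0])  # app -> [count, total seconds]
    for text, seconds in query_log:
        app = parse_context(text).get("app", "unknown")
        totals[app][0] += 1
        totals[app][1] += seconds
    return {app: total / count for app, (count, total) in totals.items()}


# Hypothetical query log entries.
log = [
    (annotate("SELECT count(*) FROM orders", "dashboard", "ann"), 2.0),
    (annotate("SELECT * FROM orders LIMIT 10", "dashboard", "bob"), 4.0),
    ("SELECT 1", 0.5),  # no annotation, so it is attributed to "unknown"
]
averages = latency_by_app(log)
```

The same parsed context could be grouped by user instead of app to measure latency per individual user.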
We started intermix.io to solve these problems. Our mission is to provide data engineers with a single dashboard to help them monitor their mission-critical data flows. And if there are problems, they should be the first to know, and to know the reason why.
Today, Uncork Capital and S28 Capital, along with PAUA Ventures, Bastian Lehmann (CEO of Postmates), and Hasso Plattner (founder of SAP), are backing us to help reach this goal. If we are successful, then all companies will have the tools they need to win with data.
Photo by Hans-Peter Gauster