Why We Built intermix.io – “APM for Data”

To win in today’s market, companies must build core competencies in advancing their use of data. Data-first companies are dominating their industries: Netflix vs. network TV, Tinder vs. Match.com, Stitch Fix vs. the mall. Mentions of “AI” now pepper advertisements, product launches, and earnings calls.

Why is this happening now?

Data is the new differentiator.

The Shift in Data Platforms

These trends have led to a shift in how companies build data platforms. The shift is away from single, monolithic data warehouses toward data-lake-based architectures. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, and then decide at a later time how to use it.
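To make “store as-is, decide later” concrete, here is a minimal sketch of landing raw data in a data lake. It assumes S3 as the storage layer; the bucket name and key layout are hypothetical.

```python
import json

import boto3  # assumes AWS credentials are configured in the environment

s3 = boto3.client("s3")

# Land a raw event in the lake exactly as it arrived. Decisions about
# schema and usage happen later, at read time.
event = {"user_id": 42, "action": "signup", "ts": "2018-06-01T12:00:00Z"}
s3.put_object(
    Bucket="example-data-lake",  # hypothetical bucket
    Key="raw/events/2018/06/01/event-0001.json",
    Body=json.dumps(event).encode("utf-8"),
)
```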

As a result of the shift, traditional concepts of “ETL” are being re-imagined for this new world.

The response has been that teams are building complex data assembly lines that share common characteristics.

Data lake architectures pose new challenges. Since data is stored without any oversight of the contents, there must be defined mechanisms for cataloging the data in order to make it usable. Without this, data cannot be found or trusted. So meeting the needs of the company requires that data assembly lines assign governance, consistency, and access controls to the data lake.

Data Use Cases

Companies use the data for the following purposes:

  1. Analytics applications. Customer- or internal-facing dashboards that present data for analysis and reporting.
  2. Machine learning. Data scientists pull data to develop models. Once ready, training data sets are fed continuously into these models to operationalize AI.
  3. Traditional reporting. Ad hoc queries run by business analysts and citizen users.

The Data Engineer

Building and managing complex data assembly lines requires a new skill set.

The last time we saw a similar shift was about ten years ago, when cloud computing first took hold. Running cloud apps required an operational mindset (i.e., uptime, cost, and performance) coupled with the ability to write code. The DevOps role was born out of the need to manage infrastructure as code. As a result, engineering teams had to establish new teams and hire new types of employees, and those employees needed different tools to do their jobs.

Similarly, a new role has emerged to manage data assembly lines: the data engineer. Data engineers manage complex data flows by writing code that manipulates data. But they are also accountable for the uptime, performance, and cost of those data flows. This skill set is a combination of DevOps engineer and data analyst, with a sprinkling of database administrator. No wonder they are in high demand!

Building a data assembly line involves:

  1. Data acquisition and cataloging.

Raw data is stranded in application silos. Today, 23% of enterprise application workloads live in the cloud, a share projected to grow to 52% within five years. Over half of enterprises intend to adopt “multi-cloud” architectures. So within five years, over half of a company’s data will live in at least two clouds plus its own data centers.

This data must be inspected and catalogued so analytics databases can process and analyze it.
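As a sketch of what “inspect and catalogue” can look like, the snippet below samples a hypothetical raw CSV and records its schema as a minimal catalog entry. A production pipeline would write this to a real catalog (for example, AWS Glue or the Hive Metastore) rather than a dict.

```python
import pandas as pd

def build_catalog_entry(path, source):
    """Sample a raw file and return a minimal catalog record."""
    sample = pd.read_csv(path, nrows=1000)  # sample; don't scan the whole file
    return {
        "source": source,
        "path": path,
        "columns": {col: str(dtype) for col, dtype in sample.dtypes.items()},
    }

# Hypothetical raw file dropped into the lake by an upstream system.
entry = build_catalog_entry("raw/crm/contacts.csv", source="crm")
print(entry)
```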

  2. Security.

Data must be secured to ensure that data assets are protected. Access to data must be audited, and access rights must be assigned to the correct teams and tools.
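As one illustration, assuming an S3-backed data lake, access rights can be expressed as a least-privilege policy like the sketch below; the bucket and prefix are hypothetical, and the audit trail would come from services such as CloudTrail or S3 access logging.

```python
import json

# Grant a hypothetical analytics team read-only access to one
# prefix of the data lake, and nothing else.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-data-lake",
                "arn:aws:s3:::example-data-lake/raw/*",
            ],
        }
    ],
}
print(json.dumps(read_only_policy, indent=2))
```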

  3. Transformation and Cleaning.

Data must be reduced and cleaned so it can be trusted. Consider how a customer is identified across the data: by email, by name, or by some unique ID? If you want to combine two data sources, which identifier do you use? Duplicate data should be omitted, and data should be validated to ensure it is complete, without any gaps.
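Here is a minimal sketch of that kind of de-duplication and validation, using pandas and two toy sources that both identify customers by email:

```python
import pandas as pd

# Two toy sources that identify the same customers by email.
crm = pd.DataFrame({"email": ["a@x.com", "a@x.com", "b@y.com"],
                    "name": ["Ann", "Ann", "Bob"]})
billing = pd.DataFrame({"email": ["a@x.com", "b@y.com"],
                        "plan": ["pro", "free"]})

# Pick one identifier (email) as the join key and drop duplicates first.
customers = crm.drop_duplicates(subset="email").merge(
    billing, on="email", how="left")

# Validate completeness: every customer should have a billing record.
missing = customers["plan"].isna().sum()
assert missing == 0, f"{missing} customers have no billing record"
```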

  4. Performance.

Data volumes are enormous today because of mobile and IoT. Running fast queries on huge data volumes requires careful planning, tuning, and configuration of data analytics infrastructure.
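As one hedged example of that kind of tuning, assuming an Amazon Redshift warehouse and hypothetical connection details, choosing distribution and sort keys up front is much of what keeps large-table queries fast:

```python
import psycopg2  # Redshift speaks the PostgreSQL wire protocol

# Connection details are hypothetical.
conn = psycopg2.connect(host="example-cluster.redshift.amazonaws.com",
                        port=5439, dbname="analytics",
                        user="etl", password="...")

# DISTKEY co-locates rows that join on user_id; SORTKEY lets the planner
# skip blocks when queries filter on created_at.
ddl = """
CREATE TABLE events (
    user_id    BIGINT,
    event_type VARCHAR(64),
    created_at TIMESTAMP
)
DISTKEY (user_id)
SORTKEY (created_at);
"""
with conn.cursor() as cur:
    cur.execute(ddl)
conn.commit()
```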

The first hire for most data teams is usually a data scientist, but that is the wrong first hire to make. The data engineer should be the first hire of any data team. Without a data engineer, the company’s data remains stranded and unusable.

Data Apps

Unlike end-user applications, data apps run jobs on the data assembly line. There are three categories of data apps:

  1. Data integration services. Vendors that move data from external systems and applications into your data lake. Examples are Informatica, Stitch Data, Fivetran, Alooma, and ETLeap.
  2. Workflow orchestration. Systems that implement the workflows that perform various jobs on your data assembly line: transformations, ETL, and so on. Examples are Apache Airflow, Pinterest’s Pinball, and Spotify’s Luigi. (A minimal Airflow sketch follows this list.)
  3. Analysis. Data science, reporting, and visualization apps. Examples are Tableau, Jupyter notebooks, Mode Analytics, Looker, Chartio, and Periscope Data.
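
To make the orchestration category concrete, here is a minimal Apache Airflow sketch that chains two steps of an assembly line; the task bodies are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract():
    print("pull raw data into the lake")        # placeholder task body

def transform():
    print("clean and load into the warehouse")  # placeholder task body

dag = DAG("assembly_line",
          start_date=datetime(2018, 1, 1),
          schedule_interval="@daily")

t_extract = PythonOperator(task_id="extract",
                           python_callable=extract, dag=dag)
t_transform = PythonOperator(task_id="transform",
                             python_callable=transform, dag=dag)

t_extract >> t_transform  # transform runs only after extract succeeds
```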

Monitoring the performance of these data apps is critical to building reliable data assembly lines.

New Problems

The fundamental problem the data engineer solves is ensuring that the data assembly line is working:

Are data flows operating normally?
Do my data tables contain the correct results?
Are data apps able to access the data quickly?

Data pipelines are constantly evolving.

This requires answering questions in real time, across multiple systems.

A New Type of Monitoring

To accomplish this, you need new types of metrics. Traditional infrastructure monitoring metrics like CPU and network utilization are irrelevant when monitoring data assembly lines, because data flows operate at a different layer.

A monitoring tool for data flows must consider metrics at that layer: whether data flows run on schedule, whether tables contain correct and fresh results, and whether data apps can access the data quickly.
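As a sketch of what metrics at the data-flow layer can look like, the decorator below records duration and rows processed per pipeline step; a real system would ship these to a monitoring backend rather than a log, and the step names here are hypothetical.

```python
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dataflow")

def track_flow(step_name):
    """Record per-step data-flow metrics: duration and rows processed."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            rows = fn(*args, **kwargs)  # convention: a step returns its row count
            log.info("step=%s duration_s=%.2f rows=%d",
                     step_name, time.time() - start, rows)
            return rows
        return wrapper
    return decorator

@track_flow("load_events")
def load_events():
    return 1000  # placeholder: load data, return rows processed

load_events()
```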

We started intermix.io to solve these problems. Our mission is to give data engineers a single dashboard to monitor their mission-critical data flows, and, when there are problems, to make sure they are the first to know and the first to understand why.

Today, Uncork Capital and S28 Capital, along with PAUA Ventures, Bastian Lehmann (CEO of Postmates), and Hasso Plattner (founder of SAP), are backing us to help us reach this goal. If we are successful, then all companies will have the tools they need to win with data.

---

Photo by Hans-Peter Gauster
