The Advantages of intermix.io (and a little Amazon Redshift training)
To win in today’s market, companies must invest in both their infrastructure and their people. Data-first companies like Netflix, Uber and Tinder are dominating their industries. Mentions of “AI” are often heard in advertisements, product launches, and earnings calls. Businesses are scrambling to re-imagine their tech infrastructure and provide their people with things like Amazon Redshift training.
Why is this happening now?
- Businesses can store and process vast quantities of data thanks to cost-effective warehousing platforms like Amazon Redshift.
- Machine learning algorithms, which leverage this data, are open-source and available to everyone.
- Data scientists know how to derive actionable insights from machine learning tools.
Data is the new differentiator:
- Amazon rely on a data-driven recommendation engine for 35% of retail sales
- American Express have used data analytics to increase online customer acquisitions by 40%
- Starbucks use data from their 13 million app users to craft menus and identify promising new locations
Moving towards a data-driven organization can be daunting for decision-makers. Learning investments such as Amazon Redshift training can sometimes produce limited results, thanks to the complex nature of modern data platforms, which can intimidate even the most experienced IT professionals.
The Shift in Data Platforms
We’re all generating a lot more data than ever before. Storing it is a challenge; analyzing it is an even bigger challenge. Over the years, we’ve seen highly structured data warehouses give way to more anarchic data lakes, which in turn are giving way to multi-tiered structures like data lakehouses.
As a result of the shift, traditional concepts of “ETL” are being re-imagined for a new world where:
- Disparate sources contain structured and unstructured data
- Terabytes of new data appear every day, often with no set expiry date
- Workloads rise due to machine learning and data democratization
In response, teams are building complex data assembly lines that share common characteristics:
- A data lake stores all the raw, unstructured data, forever
- Several analytics databases process data and warehouse it
- An orchestration layer coordinates the flow of data between databases
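As a rough illustration of what an orchestration layer does, the sketch below orders a handful of pipeline stages so that each one runs only after the stages it depends on. The stage names are hypothetical, and the standard-library `graphlib` module stands in for a real orchestrator such as Apache Airflow.

```python
from graphlib import TopologicalSorter

# Hypothetical stages of a data assembly line.
# Each stage maps to the set of stages that must finish before it starts.
PIPELINE = {
    "load_raw_to_lake": set(),
    "transform_in_warehouse": {"load_raw_to_lake"},
    "refresh_dashboard_tables": {"transform_in_warehouse"},
    "export_training_set": {"transform_in_warehouse"},
}


def run_order(pipeline):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


order = run_order(PIPELINE)
print(order)
```

A production orchestrator adds scheduling, retries, and alerting on top of this ordering, but the dependency graph is the core idea.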
Data lake architectures pose new challenges. Because lakes store data without any oversight of its contents, the data needs defined cataloging mechanisms to be usable. Without them, the data isn’t reliable. Companies need data pipelines that bring governance, consistency, and access controls to the data lake.
New technologies like Amazon Redshift provide all of the storage and processing power required to run a data-first business, but you still need the tools to turn that data into actionable insights.
Data Use Cases
Companies use the data for the following purposes:
- Analytics applications. Customer- or internal-facing dashboards that present data for analysis and reporting.
- Machine learning. Data scientists pull data to develop models. Once ready, training data sets are fed continuously into these models to operationalize AI.
- Traditional reporting. Ad hoc queries run by business analysts and citizen users.
The Data Engineer
Building and managing complex data assembly lines requires a new skill set.
The last time we saw a comparable shift was back when cloud computing was first developed. Running cloud apps required an operational (i.e., uptime, cost, and performance) mindset coupled with an ability to write code. The DevOps role was born out of a need to manage infrastructure as code. As a result, engineering organizations had to establish new teams and hire new types of employees. These employees needed different tools to do their jobs.
Similarly, a new role has emerged to manage data pipelines: the data engineer. Data engineers manage complex data flows by writing code that manipulates data. But they are also accountable for the uptime, performance, and cost accounting for the data flows. This skillset is a combination of DevOps and data analyst, with a sprinkling of a database administrator. They also need platform knowledge, such as Amazon Redshift training.
Building a data assembly line involves:
1. Data Acquisition and Cataloging
Raw data often exists in application silos, stranded. 93% of enterprises have a multi-cloud strategy, while 87% have a hybrid cloud strategy. So, for many organizations, data exists across multiple clouds, plus their own data centers. They will need to inspect and catalog this data before they realize any value.
2. Security
Security is a fundamental part of any data pipeline. The data owners must audit all access, and also ensure that the right people and processes have the right permissions.
3. Transformation and Cleaning
Before anyone can trust data, they have to transform and cleanse it. Consider how you identify a customer across the data – is it by email, name, or some unique ID? If you want to combine two data sources, which one of those do you use? This stage involves data validation, removal of duplication, and handling of null values.
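The cleaning stage described above can be sketched in a few lines. This is a minimal example under stated assumptions: customers are identified by email address, and the field names `email` and `name` are hypothetical. It validates each record, removes duplicates by normalized email, and fills a null name from a later duplicate when one is available.

```python
def normalize_email(email):
    """Lower-case and trim an email; return None for missing or blank values."""
    if email is None:
        return None
    cleaned = email.strip().lower()
    return cleaned or None


def clean_customers(records):
    """Validate records, de-duplicate by normalized email, and fill null names."""
    seen = {}
    rejected = []
    for record in records:
        key = normalize_email(record.get("email"))
        if key is None or "@" not in key:
            rejected.append(record)  # fails validation: no usable identifier
            continue
        name = record.get("name")
        name = name.strip() if name else None
        if key in seen:
            # Duplicate: keep the first record, but fill a null name if we can.
            if seen[key]["name"] is None and name:
                seen[key]["name"] = name
            continue
        seen[key] = {"email": key, "name": name}
    return list(seen.values()), rejected


rows = [
    {"email": "Ana@Example.com ", "name": None},
    {"email": "ana@example.com", "name": "Ana"},
    {"email": None, "name": "Bo"},
    {"email": "not-an-email", "name": "Cy"},
]
clean, rejected = clean_customers(rows)
print(clean)
print(len(rejected))
```

Choosing email as the key is only one option. If two sources disagree on identifiers, a mapping table to a shared unique ID is usually the safer choice.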
4. Performance
Mobile and IoT generate huge volumes of data. Running fast queries on huge data volumes requires careful planning, tuning, and configuration of data analytics infrastructure.
Most companies start by hiring data engineers to implement reliable data pipelines, and they may also hire data scientists with analytics skills. For a data pipeline, you may also require staff with specialist knowledge, such as Hadoop, ETL or Amazon Redshift training.
Unlike end-user applications, data apps run jobs on the data assembly line. There are three categories of data apps.
- Data integration services. Vendors who move data from external systems or applications, into your data lake. Examples are Informatica, Stitch Data, Fivetran, Alooma, ETLeap.
- Workflow orchestration. These are systems that implement workflows that perform various jobs on your data assembly line, such as transformations and ETL. Examples are Apache Airflow, Pinterest’s Pinball, Spotify’s Luigi.
- Analysis. These are data science, reporting and visualization apps. Examples are Tableau, Jupyter notebooks, Mode Analytics, Looker, Chartio, Periscope Data.
The fundamental problem solved by the data engineer is to ensure that the data assembly line is working:
- Are data flows operating normally?
- Do my data tables contain the correct results?
- Are data apps able to access the data quickly?
This requires answering questions in real-time across multiple systems:
- Is query latency increasing for this App? If so, why?
- Are queries completing as expected?
- What are my key data flows? How is their performance changing over time?
- This table is growing quickly. What is causing that?
- This table has stopped updating. Why?
- Did this data load operation complete successfully? Were all the rows captured?
- Do I have contention issues and what is causing them?
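Two of the questions above, "is query latency increasing?" and "has this table stopped updating?", reduce to simple comparisons once the raw numbers are available. The sketch below assumes you already have query durations for two time windows and a last-update timestamp per table. The function names, table names, and thresholds are illustrative only.

```python
from datetime import datetime, timedelta
from statistics import mean


def latency_change(durations_before, durations_after):
    """Relative change in mean query latency between two time windows.

    Assumes both windows are non-empty and the earlier mean is non-zero.
    """
    before = mean(durations_before)
    after = mean(durations_after)
    return (after - before) / before


def stale_tables(last_updated, now, max_age):
    """Names of tables whose most recent update is older than max_age."""
    return sorted(name for name, ts in last_updated.items() if now - ts > max_age)


now = datetime(2024, 1, 1, 12, 0)
last_updated = {
    "orders": now - timedelta(minutes=10),
    "events": now - timedelta(hours=7),
}

change = latency_change([2.0, 2.0], [3.0, 3.0])
stale = stale_tables(last_updated, now, timedelta(hours=6))
print(change)  # 0.5, i.e. mean latency rose by 50%
print(stale)   # ['events']
```

The hard part in practice is collecting these inputs across multiple systems in real time, not the arithmetic itself.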
A New Type of Monitoring
To accomplish this, you need new types of metrics. Traditional monitoring metrics like CPU and network utilization are irrelevant when monitoring data assembly lines, because data flows operate at a different layer.
A monitoring tool for data flows must consider:
- Query information. Query text, execution times, cost information, and runtime information.
- Data app context. Monitoring app performance requires correlating app context to query performance. This allows you to measure latency and cost of individual users and data flows.
- Data metadata. Understanding data flows requires knowing how queries interact with tables. This lets you determine how data is moving across your data assembly line.
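To make these three ingredients concrete, here is a minimal sketch that combines them into one record per query. It assumes a hypothetical convention in which each data app prefixes its SQL with a JSON comment naming itself, and it uses a deliberately crude regular expression to find table names. A real tool would more likely rely on a SQL parser or the warehouse's system tables, and nothing here is meant to describe how any particular product works.

```python
import json
import re
from collections import defaultdict

# Hypothetical convention: queries start with a flat JSON comment,
# e.g. /* {"app": "dashboard"} */
ANNOTATION = re.compile(r"^\s*/\*\s*(\{.*?\})\s*\*/", re.DOTALL)
# Crude table detection: the identifier following FROM, JOIN, INTO, or UPDATE.
TABLE_REF = re.compile(r"\b(?:from|join|into|update)\s+([\w.]+)", re.IGNORECASE)


def parse_query(sql, seconds):
    """Combine query info, app context, and touched tables into one record."""
    match = ANNOTATION.match(sql)
    context = json.loads(match.group(1)) if match else {}
    body = sql[match.end():] if match else sql  # skip the annotation comment
    tables = sorted({name.lower() for name in TABLE_REF.findall(body)})
    return {"app": context.get("app", "unknown"), "tables": tables, "seconds": seconds}


def seconds_by_app(records):
    """Total execution time per data app."""
    totals = defaultdict(float)
    for record in records:
        totals[record["app"]] += record["seconds"]
    return dict(totals)


queries = [
    ('/* {"app": "dashboard"} */ SELECT * FROM sales.orders o '
     'JOIN sales.customers c ON o.cid = c.id', 1.5),
    ('/* {"app": "dashboard"} */ SELECT count(*) FROM sales.orders', 0.5),
    ("SELECT 1", 0.25),
]
records = [parse_query(sql, seconds) for sql, seconds in queries]
print(records[0])
print(seconds_by_app(records))
```

With records shaped like this, latency and cost can be grouped by app, by user, or by table, which is the correlation the list above calls for.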
We started intermix.io to solve these problems. Our mission is to provide data engineers with a single dashboard to help them monitor their mission-critical data flows. And if there are problems, they should be the first to know, along with the reason why.