How to fully understand a data pipeline? You need to break it.

Tanmay Gupta
Published in Analytics Vidhya · 4 min read · May 19, 2021


So, you just got hired as a Data Scientist or ML Engineer? This one is for all the new folks feeling lost on an established team.

If, like me, you’re a Data Scientist or ML Engineer at a large company with messy enterprise data, you’re no stranger to complex data pipelines that you don’t fully understand. This is especially true on an established project where the data experts sit in one city and your team in another. No one really understands your ML pipeline. Not completely, anyway. Different databases have different hierarchies with unclear documentation, forecasts are made at one level and presented at various aggregated levels, and the mapping between tables is never complete! It’s a mess.

Everyone’s comprehension stops at the comfort of the lines they’ve written and contributed themselves. Sure, they understand the high level. But picture this: a client asks during delivery why some of their forecasts are missing, so you read every line of code in the preprocessing module to figure out where their data could have been filtered out, and after a few hours of reruns, commenting and uncommenting, you’re still lost. That’s when you learn the high level won’t cut it. You need to know the pipeline inside and out to diagnose a fire in a critical moment like that. And to be perfectly honest, you need such fires to truly master the pipeline, too.

So, how does one begin to navigate their pipeline intelligently to learn its nuances in the absence of such a fire? I say, you start by breaking your pipeline. What you need is a challenge.

Obviously, I don’t mean messing up your production systems (that’s at your own risk). I mean creating a test file that is the smallest subsample of data needed to run your full pipeline and pass a series of tests. In the process of creating such a file, you will break, bend, and beat your pipeline. And in that iteration lies all the intuition you need.

Wait, what should I do?

What I’m proposing is an exercise: synthetically generate a sample test file that you’ll use to test the functionality of your team’s pipeline.
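To make that concrete, here’s a sketch of what generating such a file might look like for a toy forecasting schema. Everything here is an assumption for illustration, including the columns (`store_id`, `category`, `date`, `sales`) and the output path; swap in whatever your pipeline’s input contract actually expects.

```python
import numpy as np
import pandas as pd

# Hypothetical schema: a tiny forecasting dataset with a couple of the
# many categories and just enough history per series to fit a model.
rng = np.random.default_rng(42)
dates = pd.date_range("2021-01-01", periods=10, freq="W")

rows = []
for category in ["electronics", "groceries"]:  # a subset, not all categories
    for store_id in [1, 2]:
        for date in dates:
            rows.append({
                "store_id": store_id,
                "category": category,
                "date": date,
                "sales": float(rng.integers(10, 100)),
            })

tiny = pd.DataFrame(rows)
tiny.to_csv("tests/data/tiny_sample.csv", index=False)
print(f"{len(tiny)} observations written")  # 40 rows: small, but runnable
```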

A benefit of this exercise is that at the end of it, you have data to run an end-to-end test on Travis (or any other CI tool)!

It’s like running an experiment, where you put an input (our tiny test file) through a set of transformations and functions, and monitor the output (model predictions). If the output is what you’d expect, awesome! If not, you’ll learn why that’s the case and either change your expectations or change the files!
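In pytest terms, that experiment might look something like the sketch below. The `run_pipeline` entry point and the output columns are hypothetical stand-ins for whatever your team’s code actually exposes:

```python
import pandas as pd

from my_project.pipeline import run_pipeline  # hypothetical entry point


def test_pipeline_end_to_end():
    # Input: the tiny synthetic file built earlier.
    tiny = pd.read_csv("tests/data/tiny_sample.csv", parse_dates=["date"])

    # The "experiment": push the input through every transformation and the model.
    predictions = run_pipeline(tiny)

    # Monitor the output: every (store, category) we fed in should come back out.
    expected = set(zip(tiny["store_id"], tiny["category"]))
    returned = set(zip(predictions["store_id"], predictions["category"]))
    assert expected == returned, f"Missing forecasts for: {expected - returned}"

    # Sanity-check the predictions themselves, not their accuracy.
    assert predictions["forecast"].notna().all()
```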

Okay, but how do I do it?

The exercise: challenge yourself to create a sample dataset that can successfully run through your Machine Learning pipeline and has the following characteristics (you can be creative here, but this is the least you should expect from it):

  1. Has the fewest observations your pipeline can run on before failing with some minimum-data error (a train_test_split error, a model that refuses to fit, etc.); see the sketch after this list for one way to find that floor
  2. Contains observations from some, but not all, categories if your schema has categorical data. If you have time series data, try fewer data points per series to see whether your pipeline can handle that. The idea is to test how your pipeline behaves when the data changes or isn’t what you expect.
  3. Passes all the tests already implemented in the pipeline (usually assert statements, e.g., a minimum number of observations per category, or pytest suites)
  4. (Advanced) Is flexible enough to test the pipeline’s full functionality, from fetching from the database to posting the results
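
For the first characteristic, one crude but honest way to find that floor is to keep shrinking the file until the pipeline raises, noting every failure mode along the way. A sketch, reusing the hypothetical `run_pipeline` from above:

```python
import pandas as pd

from my_project.pipeline import run_pipeline  # hypothetical entry point

tiny = pd.read_csv("tests/data/tiny_sample.csv", parse_dates=["date"])

# Shrink the file one observation at a time until the pipeline breaks.
# The last size that still ran is the minimum your test file needs.
minimum = None
for n in range(len(tiny), 0, -1):
    try:
        run_pipeline(tiny.head(n))
        minimum = n  # this size still works; try smaller
    except Exception as exc:  # we *want* to meet every failure mode here
        print(f"Broke at {n} rows with {type(exc).__name__}: {exc}")
        break

print(f"Minimum viable test file: {minimum} observations")
```

Every exception this loop surfaces is a lesson about an assumption buried somewhere in the code, which is exactly the intuition you’re after.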

It is important to remember that you don’t care about model performance here. Obviously, if you retain only the minimum amount of data your pipeline needs to run, your model isn’t going to perform all that well.

The idea is to ensure you have infrastructure in place so that your pipeline is robust enough to handle some exceptions, and where it isn’t, that it logs the situation or fails loudly when it breaks. The last thing you want is for your model to produce faulty predictions because of data quality errors, and for your client to be the one who tells you.
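Here is a sketch of what failing loudly could look like. The column names and threshold are placeholders; the point is that data quality problems show up in your logs rather than in your client’s inbox:

```python
import logging

import pandas as pd

logger = logging.getLogger("pipeline.validation")

MIN_OBS_PER_CATEGORY = 8  # placeholder threshold; tune to your model's needs


def validate_input(df: pd.DataFrame) -> pd.DataFrame:
    """Fail loudly on fatal problems, log warnings on survivable ones."""
    required = {"store_id", "category", "date", "sales"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"Input is missing required columns: {missing}")

    counts = df.groupby("category").size()
    thin = counts[counts < MIN_OBS_PER_CATEGORY]
    if not thin.empty:
        # Survivable, but worth knowing about before the client notices.
        logger.warning("Categories below %d observations: %s",
                       MIN_OBS_PER_CATEGORY, thin.to_dict())

    if df["sales"].isna().any():
        raise ValueError("Null sales values found; refusing to forecast on them")

    return df
```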

How to use this test file?

Use this file as a test in your testing environment before pushing your code to production. With GitHub, you can configure a Travis build (or any CI tool) to run every time you open a PR from your branch, and that build can run your pipeline against your test file to evaluate the change.
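For that to work, your pipeline needs a way to take the test file as input instead of hitting the production database. One common pattern, sketched with a hypothetical command-line entry point (the CI job would then simply run `python run_pipeline.py --input tests/data/tiny_sample.csv` followed by the test suite):

```python
import argparse

import pandas as pd

from my_project.io import fetch_from_database  # hypothetical production loader
from my_project.pipeline import run_pipeline  # hypothetical entry point


def main() -> None:
    parser = argparse.ArgumentParser(description="Run the forecasting pipeline")
    parser.add_argument(
        "--input",
        default=None,
        help="Optional CSV to run instead of the production database (for CI)",
    )
    args = parser.parse_args()

    if args.input:
        data = pd.read_csv(args.input, parse_dates=["date"])  # the tiny test file
    else:
        data = fetch_from_database()  # normal production path

    predictions = run_pipeline(data)
    print(f"Produced {len(predictions)} forecasts")


if __name__ == "__main__":
    main()
```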

This is a simple and quick way for you to start making real contributions to your team.

If your team already has such a test file, ask your PM how you can make it more robust (trust me, they’ve recently hit a bug that isn’t covered by a test yet). If not, get to it.

In summary: create a minimal test file with just enough observations to run through the pipeline, pass all the tests, and check the functionality against what’s expected. The goal is to get you onboarded to your new ML project as quickly as possible. If this helps you, go for it!

My test file has 41 observations. How many do you have?
