
by: Zank, CEO of Bennett Data Science
Many years ago, Google published a wonderful document called Rules of Machine Learning: Best Practices for ML Engineering. It’s one of the most important documents I’ve ever read on machine learning, not because it taught me how to use the latest algorithm or gave me ideas about how to better segment users for a client, but because it lays the foundation for so many of the things data scientists do every day. In the authors’ own words, it’s like a “style guide” for machine learning.
In this article, I’ll go over some of the parts that matter most to executives.
From a high level, here’s where the document starts:
  1. Make sure your pipeline is solid end to end
  2. Start with a reasonable objective
  3. Add common-sense features in a simple way
  4. Make sure that your pipeline stays solid
Let’s look more closely at each of these.

ONE: Make sure your pipeline is solid end to end

The pipeline is the essential starting place. But what is a “pipeline”? It is the infrastructure that gets data to a data science model and provides the framework to train and retrain the model, keeping it fresh over time.
It’s important to prioritize the pipeline, even over the type of data science model. The emphasis should be on the infrastructure and data delivery, not the model. A solid pipeline should be able to support everything from a simple heuristic to a more complex predictive model. That way, your data science team can iterate from simple to complex, accuracy should improve along the way, and the team can take ownership of steady progress. The idea is to build the chassis of a race car and install a standard engine. Then, as conditions require, upgrade the engine. If the chassis was built correctly from the start, it will support the new engine and the car will go faster. If not, it will fall apart just when it most needs to hold together. It’s very liberating to work on a team that has this type of pipeline in place; every iteration is easier and more rewarding.
On the other hand, with a just-good-enough approach to pipeline building, each model iteration generates more support tickets for DevOps. The data science team gets deflated as timelines drag on. This is a classic case where some additional time up front saves tons of future headaches.
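To make the chassis-and-engine idea concrete, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: the `Pipeline`, `MeanHeuristic`, and `SegmentMean` classes, the `load_rows` function, and the `segment` and `target` fields are assumptions, not anything taken from Google’s document. The point is only that the pipeline code stays the same while the model behind it is swapped.

```python
from typing import Callable, Dict, List


class MeanHeuristic:
    """Standard engine: always predict the historical average."""

    def __init__(self) -> None:
        self.mean = 0.0

    def fit(self, rows: List[Dict]) -> None:
        targets = [row["target"] for row in rows]
        self.mean = sum(targets) / len(targets) if targets else 0.0

    def predict(self, row: Dict) -> float:
        return self.mean


class SegmentMean:
    """Upgraded engine: predict the average within each segment."""

    def __init__(self) -> None:
        self.fallback = MeanHeuristic()
        self.means: Dict[str, float] = {}

    def fit(self, rows: List[Dict]) -> None:
        self.fallback.fit(rows)
        grouped: Dict[str, List[float]] = {}
        for row in rows:
            grouped.setdefault(row["segment"], []).append(row["target"])
        self.means = {seg: sum(v) / len(v) for seg, v in grouped.items()}

    def predict(self, row: Dict) -> float:
        # Unseen segments fall back to the overall average.
        return self.means.get(row["segment"], self.fallback.predict(row))


class Pipeline:
    """Chassis: delivers data, retrains the model, and serves predictions."""

    def __init__(self, load_rows: Callable[[], List[Dict]], model) -> None:
        self.load_rows = load_rows
        self.model = model

    def retrain(self) -> None:
        self.model.fit(self.load_rows())

    def score(self, row: Dict) -> float:
        return self.model.predict(row)


def load_rows() -> List[Dict]:
    # Stand-in for a real data source (warehouse query, event stream, ...).
    return [
        {"segment": "a", "target": 1.0},
        {"segment": "a", "target": 3.0},
        {"segment": "b", "target": 8.0},
    ]


simple = Pipeline(load_rows, MeanHeuristic())
simple.retrain()

upgraded = Pipeline(load_rows, SegmentMean())  # same chassis, new engine
upgraded.retrain()
```

Because both models expose the same `fit` and `predict` methods, moving from the heuristic to the segment-aware model is a one-line change for the data science team rather than a new infrastructure request.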


TWO: Start with a reasonable objective

If you know me at all, you know that this is something I find essential to get right, as early as possible. The objective is the thing that your data science is designed to optimize. Want to increase personalization? Then the objective should be to show the right product to the right person at the right time. Early on, when there isn’t much data available, popularity might be a good proxy for that objective, and it’s easy to implement. Then, as more data becomes available, you can implement more complex models. The objective never changes, but the model can grow in complexity, delivering better and better personalization. Without the right objective, your data scientists will be working on the wrong problem.
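As a rough illustration of popularity as an early proxy, the sketch below recommends the best-selling products a user hasn’t bought yet. The function name `popularity_recommender`, the `(user, product)` input format, and the sample data are all assumptions made for this example.

```python
from collections import Counter


def popularity_recommender(purchases, k=3):
    """Build a recommender from (user, product) pairs.

    The returned function suggests the k most purchased products
    that the given user has not bought yet.
    """
    counts = Counter(product for _, product in purchases)
    ranked = [product for product, _ in counts.most_common()]

    owned = {}
    for user, product in purchases:
        owned.setdefault(user, set()).add(product)

    def recommend(user):
        seen = owned.get(user, set())
        return [product for product in ranked if product not in seen][:k]

    return recommend


# Hypothetical purchase history.
purchases = [
    ("ann", "socks"), ("bob", "socks"), ("cat", "socks"),
    ("ann", "hat"), ("bob", "hat"),
    ("cat", "scarf"),
]

recommend = popularity_recommender(purchases)
print(recommend("cat"))  # ['hat']
print(recommend("dan"))  # ['socks', 'hat', 'scarf'] (new user, no history)
```

A later, more personalized model can replace `recommend` while serving exactly the same objective: the right product for the right person.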

 

THREE: Add common-sense features in a simple way

This is a governor against unnecessary complexity. At Bennett Data Science, we make a habit of starting simple. Clients often ask if we’re going to use this method or that, but when the suggestions are unnecessarily complex or black-box approaches (such as neural nets), I always suggest using simpler, easy-to-explain models. Features are no different. Want to throw in the kitchen sink? There’s a time for that, and it’s not usually at the beginning of a project. More often, it’s best to use simple, powerful predictors to prove efficacy, and only then add complexity.

 

FOUR: Make sure that your pipeline stays solid

Now we’ve come full circle. Are you tracking your data science models? Are you making sure they don’t drift? And are all the inputs still available? We’ve seen a case where location data stopped feeding a predictive model. The model didn’t break or complain (for example, by logging an error we would have seen), so we didn’t find the problem for months. The root cause was that the client didn’t have adequate monitoring set up on the pipeline. Monitoring is an important step in making sure the pipeline stays solid.
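A check for silently missing inputs does not have to be elaborate. The sketch below is one possible approach, not the monitoring used in the case above: it measures how often each expected feature is missing in a batch of model inputs and reports any feature above a threshold. The names `missing_rates` and `check_inputs`, the 20% threshold, and the sample rows are assumptions for illustration.

```python
def missing_rates(rows, features):
    """Return the fraction of rows in which each feature is absent or None."""
    total = len(rows)
    rates = {}
    for feature in features:
        missing = sum(1 for row in rows if row.get(feature) is None)
        # An empty batch is treated as fully missing so it also raises an alert.
        rates[feature] = missing / total if total else 1.0
    return rates


def check_inputs(rows, features, max_missing=0.2):
    """Return a human-readable alert for each feature that is missing too often."""
    rates = missing_rates(rows, features)
    return [
        f"{name}: {rate:.0%} missing"
        for name, rate in rates.items()
        if rate > max_missing
    ]


# Hypothetical batch in which the location feed has gone quiet.
batch = [
    {"age": 30, "location": None},
    {"age": 41},
    {"age": 25, "location": None},
]

alerts = check_inputs(batch, ["age", "location"])
print(alerts)  # ['location: 100% missing']
```

Run on a schedule and wired to whatever alerting the team already uses, a check like this turns a months-long silent failure into a same-day notification.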
There’s a lot more than this in the document. I highly recommend reading it and making it mandatory reading for everyone on your technical team who works with data science or data scientists.

We’re experts in these methodologies. Don’t hesitate to contact us if you would like to learn more about how we’ve helped companies big and small set up their data science initiatives for success.

Zank Bennett is CEO of Bennett Data Science, a group that works with companies from early-stage startups to the Fortune 500. BDS specializes in working with large volumes of data to solve complex business problems, finding novel ways for companies to grow their products and revenue using data, and maximizing the effectiveness of existing data science personnel. https://bennettdatascience.com
