by: Zank, CEO of Bennett Data Science
There’s a wonderful document that Google created many years ago, called Rules of Machine Learning: Best Practices for ML Engineering, and it’s one of the most important documents I’ve ever read on machine learning. That’s not because it taught me how to use the latest algorithm or gave me ideas about how to better segment users for a client, but because it lays the foundation for so many of the things data scientists do every day. In their own words, it’s like a “style guide” for machine learning.
In this article, I’ll go over some of the most important parts to executives; fundamental
From a high level, here’s where the document starts:
Make sure your pipeline is solid end to end
Start with a reasonable objective
Add common-sense features in a simple way
Make sure that your pipeline stays solid
Let’s look more closely at each of these.
ONE: Make sure your pipeline is solid end to end
The pipeline is an essential starting place. But what is a “pipeline”? A pipeline is the infrastructure that gets data to a data science model and provides the framework to train and retrain the model, keeping it fresh over time.
It’s important to prioritize the pipeline, even over the type of data science model. The emphasis should be on the infrastructure and the data delivery, not the model. When a solid pipeline is in place, it should be able to support everything from a simple heuristic model to a more complex predictive model. In this manner, your data science team can iterate from simple to complex; accuracy should improve, and the team can own steady progress and improvements. The idea is to build the chassis of a race car and install a standard engine. Then, as conditions require, upgrade the engine. If the chassis was built correctly from the start, it will support the new engine and go faster. If not, it will fall apart just when it’s most important to hold together. It’s very liberating to work on a team that has this type of pipeline in place; every iteration is easier, more rewarding.
On the other hand, with a just-good-enough approach to pipeline building, each model iteration brings more support tickets to DevOps. The data science team gets deflated as timelines drag on. This is a classic case where some additional time up front saves tons of future headaches.
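To make the chassis-and-engine idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Model` interface, the two toy models, and the `run_pipeline` function are hypothetical names, not anything taken from Google’s document or from a real client system. The point is only that the pipeline code stays the same while the model plugged into it changes.

```python
from typing import Protocol, Sequence


class Model(Protocol):
    """Hypothetical interface: anything the pipeline can train and score with."""

    def fit(self, rows: Sequence[dict], labels: Sequence[float]) -> None: ...
    def predict(self, row: dict) -> float: ...


class MeanHeuristic:
    """The standard 'engine': always predicts the average label seen in training."""

    def __init__(self) -> None:
        self.mean = 0.0

    def fit(self, rows, labels):
        self.mean = sum(labels) / len(labels) if labels else 0.0

    def predict(self, row):
        return self.mean


class PerSegmentMean:
    """An upgraded 'engine': average label per segment, with a global fallback."""

    def __init__(self, key: str) -> None:
        self.key = key
        self.global_mean = 0.0
        self.segment_means: dict = {}

    def fit(self, rows, labels):
        self.global_mean = sum(labels) / len(labels) if labels else 0.0
        totals: dict = {}
        for row, label in zip(rows, labels):
            segment = row.get(self.key)
            running_sum, count = totals.get(segment, (0.0, 0))
            totals[segment] = (running_sum + label, count + 1)
        self.segment_means = {k: s / n for k, (s, n) in totals.items()}

    def predict(self, row):
        # Unseen segments fall back to the global average.
        return self.segment_means.get(row.get(self.key), self.global_mean)


def run_pipeline(model: Model, train_rows, train_labels, new_rows):
    """The 'chassis': the same train-then-score steps for any model."""
    model.fit(train_rows, train_labels)
    return [model.predict(row) for row in new_rows]


# Made-up example data: swapping the model requires no pipeline changes.
rows = [{"city": "A"}, {"city": "A"}, {"city": "B"}]
labels = [1.0, 3.0, 10.0]
simple_scores = run_pipeline(MeanHeuristic(), rows, labels, [{"city": "A"}])
better_scores = run_pipeline(PerSegmentMean("city"), rows, labels, [{"city": "A"}])
```

In a real system the chassis would also cover data loading, retraining schedules, and deployment, but the design choice is the same: the model is a replaceable part behind a small, stable interface.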
TWO: Start with a reasonable objective
If you know me at all, you know that this is something I find essential to get right, and as early as possible. The objective is the thing that your data science is designed to optimize. Want to increase personalization? Then the objective should be to show the right product to the right person at the right time. But early on, when there isn’t much data available, popularity might be a good proxy for your objective that’s easy to implement. Then, as more data becomes available, you can implement more complex models. The objective never changes, but the model can grow in complexity, moving you closer to your objective of personalization. Without the right objective, your data scientists will be working on the wrong problem.
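Here is a rough sketch of what a popularity proxy might look like in Python. The purchase log, the function names, and the rule of skipping products a user already owns are all assumptions for illustration; a real recommender would need to handle far more than this.

```python
from collections import Counter


def most_popular(purchases, k=3):
    """Return the k most-purchased product IDs from (user, product) pairs."""
    counts = Counter(product for _user, product in purchases)
    return [product for product, _count in counts.most_common(k)]


def recommend(user, purchases, k=3):
    """Recommend popular products that this user has not already bought."""
    owned = {product for buyer, product in purchases if buyer == user}
    ranked = most_popular(purchases, k=len(purchases))
    return [product for product in ranked if product not in owned][:k]


# Made-up purchase log of (user, product) pairs.
purchases = [
    ("ann", "shoes"), ("bob", "shoes"), ("cat", "hat"),
    ("bob", "hat"), ("dan", "shoes"), ("ann", "scarf"),
]
new_visitor_recs = recommend("eve", purchases, k=2)
ann_recs = recommend("ann", purchases, k=2)
```

A brand-new visitor simply gets the best sellers, while a returning customer gets the best sellers they don’t already own. Later, a more complex model can sit behind the same `recommend` call without changing the objective.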
THREE: Add common-sense features in a simple way
This is a governor against unnecessary complexity. At Bennett Data Science, we make a habit out of starting simple. Clients often ask if we’re going to use this method or that, but when the suggestions are unnecessarily complex or black-box approaches (such as neural nets), I always suggest using simpler, easy-to-explain models. Features are no different. Want to throw in the kitchen sink? There’s a time for that, and it’s not usually at the beginning of a project. More often, it’s best to use simple, powerful predictors to prove efficacy, and only then add complexity.
FOUR: Make sure that your pipeline stays solid
Now we’ve come full circle. Are you tracking your data science models? Are you making sure they don’t drift? Are all the inputs still available? We’ve seen an instance where location data stopped feeding a predictive model. The model didn’t break or complain (by logging an error we would have seen), so we didn’t find the problem for months. The root cause was that the client didn’t have adequate monitoring set up on the pipeline. Monitoring is an important step in making sure the pipeline stays solid.
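A check for the silent missing-input problem described above might look something like the following Python sketch. The feature names, the 50% threshold, and the `check_inputs` function are hypothetical; this is one simple way to flag a feed that has gone quiet, not a full monitoring setup.

```python
def missing_rate(rows, feature):
    """Fraction of rows where the feature is absent or None."""
    if not rows:
        return 1.0  # no data at all is treated as fully missing
    missing = sum(1 for row in rows if row.get(feature) is None)
    return missing / len(rows)


def check_inputs(rows, expected_features, max_missing=0.2):
    """Return one alert message per feature whose missing rate is too high."""
    alerts = []
    for feature in expected_features:
        rate = missing_rate(rows, feature)
        if rate > max_missing:
            alerts.append(f"{feature}: {rate:.0%} of rows missing")
    return alerts


# Made-up batch of model inputs in which the location feed has mostly stopped.
batch = [
    {"age": 30, "location": None},
    {"age": 41},
    {"age": None, "location": "NYC"},
    {"age": 25, "location": None},
]
alerts = check_inputs(batch, ["age", "location"], max_missing=0.5)
```

Run on every batch before scoring and wired to an alert, a check like this would turn months of silent degradation into a same-day notification.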
There’s a lot more than this in the document. I highly recommend reading it and making it mandatory reading for everyone on your technical team who works with data science or data scientists.
We’re experts in these methodologies. Don’t hesitate to contact us if you would like to learn more about how we’ve helped companies big and small set up their data science initiatives for success.
The PDF can be found here: http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf
The (newer) web page is here: https://developers.google.com/machine-learning/guides/rules-of-ml/