…and number 2 will really shock you! – Please excuse the cheap shot at click-bait titles. I’ve always wanted to try that 🙂.
Data science is growing like mad. Everyone’s got piles of data and smart companies are asking how they can leverage their data to make their products more personalized and profitable. That makes a lot of sense. But too often, data science initiatives fail to see the light of day.
From a very high level, I’ll give an overview of the main parts of a data science project, then show where things fall apart – failing their missions to provide utility or insights.
At a high level, data science, like just about any other discipline, requires two main things to be effective:
- Excellent communication between data scientists and stakeholders, and
- Access to the right data
For this article, I’m going to assume we’re not talking about poor communication or non-existent data, as these are generally recognized as non-starters.
Let’s look at the steps in our proven data science process:
- Business understanding
- Data assessment and understanding
- Choose model candidates
- Data preparation
- Pipeline construction
- Modeling
- Evaluation
- Deployment
- Monitoring and maintenance
- Optimization
That’s a lot of words and steps. It boils down to this: figure out what you’re going to do and get agreement on that. See if there’s enough of the right data. Build a good pipeline to make sure that data is available. Go off and do some great data science work, then make sure the parties you spoke with at the outset actually USE what you made.
If you do this effectively, you’re way ahead of the curve. So where does it go south?
Three Reasons why Data Science Projects Fail:
- No or misaligned objectives. Objective gathering is very time-consuming (think weeks, not hours or days) and gets to the core of a business. Should data science help optimize for user engagement in terms of time on site, or number of social shares, or referrals, or reduced churn? It’s complex and requires strong product-team alignment. It’s so easy to get this wrong, and when data science moves on a project before there’s a strong understanding of objectives, failure is all but guaranteed.
- No plan for deployment. Deployment is where all the data science work comes to life. It’s where the predictive models get used by (usually) other parts of the company. If data scientists build a recommender, it’s in the deployment phase that a marketing team uses the recommendations to email personalized offers to customers. We hear over and over that this part is often completely taken for granted. “Some other team will handle it for us when we get there” is all too common. Success in deployment involves conversations that happen at the very beginning of a project. Without an internal customer (such as a marketing team) and a plan for utilization, the project should never kick off.
- The Pipeline. This is the bedrock of the entire technical process. It refers to the steps where data is delivered to the model and how that data is carried from the deployed model to the final end-user. Getting this right, and early, is essential. Trying to complete a data science project without a robust pipeline would be like trying to drive a car with no chassis. It might look OK from the outside, but it would go nowhere.
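To make the pipeline idea a bit more concrete, here’s a minimal sketch in Python. Everything in it is an assumption made up for illustration: the `Customer` record, the sample data, the toy `churn_model` scoring rule and the 0.5 threshold are all hypothetical stand-ins, not anything from a real project. The point is only the shape: data flows in, gets prepared, is scored by a model, and the result is handed to an internal customer (here, a list of customer IDs a marketing team could email).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Customer:
    # Hypothetical raw record, as it might come out of a data warehouse.
    customer_id: str
    days_since_last_order: int


def load_customers() -> List[Customer]:
    # Stand-in for a real database query (made-up sample data).
    return [Customer("a1", 3), Customer("b2", 45), Customer("c3", 120)]


def prepare(customers: List[Customer]) -> List[Tuple[str, List[float]]]:
    # Data preparation: turn raw records into (id, feature vector) pairs.
    return [(c.customer_id, [float(c.days_since_last_order)]) for c in customers]


def churn_model(features: List[float]) -> float:
    # Toy scoring rule standing in for a trained model (an assumption):
    # the longer since the last order, the higher the churn score, capped at 1.
    return min(features[0] / 90.0, 1.0)


def deliver(scores: List[Tuple[str, float]], threshold: float = 0.5) -> List[str]:
    # Hand-off to the end user: IDs of customers who should get an offer.
    return [cid for cid, score in scores if score >= threshold]


def run_pipeline(model: Callable[[List[float]], float] = churn_model) -> List[str]:
    # The whole path: raw data -> features -> model scores -> deliverable.
    rows = prepare(load_customers())
    scores = [(cid, model(features)) for cid, features in rows]
    return deliver(scores)


if __name__ == "__main__":
    print(run_pipeline())
```

Because the model is passed in as an argument, a better model can be swapped in later without touching the loading or delivery steps, which is a big part of why getting the pipeline in place early pays off.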
These topics are so important that I’ve been asked to speak about them at several conferences this summer. I’ll post events here as they get closer.
At Bennet Data Science, we recently helped one company drop its model deployment time from months to days. If you need help in one of these areas, please hit reply and let’s have a quick talk to see how to get things going again.
Of Interest
The Brain Behind the Music
If you’re a Spotify listener, you’ve probably spent significant time on your personalized “Discover Weekly” playlist. (If this is news to you, go check it out. It’s usually fantastic.) This article dives into the three components of music recommendations. It’s quite beautiful.
https://medium.com/s/story/spotifys-discover-weekly-how-machine-learning-finds-your-new-music-19a41ab76ef
Deploying Data Science Models is Hard
In this article, the author recounts the challenges of handing “over his or her data model to engineering for implementation. And it’s during this step that some of the most common data science problems appear.” You can expect to hear a lot more about model deployment in the upcoming weeks.
https://towardsdatascience.com/why-is-machine-learning-deployment-hard-443af67493cd
Here’s a Sampling of Google’s Data Science Interview Brain Teasers
Can you get any of these right? I’ve never been a fan of (or used) these sorts of interview brain-teaser questions, as I believe they cause undue anxiety in candidates, not to mention how ridiculous I would feel trying to trick people who are generally already anxious. But in this context, they’re actually fun to read through!
https://towardsdatascience.com/googles-data-science-interview-brain-teasers-7f3c1dc4ea7f