We’re often asked to deliver value early in a new engagement. This makes a lot of sense: customers want a return on investment, and the sooner they see it, the better. But which should we aim for, speed or accuracy? Or is there a way to account for both?
Here’s Our Take
Rather than there being a right way and a wrong way, we tend to approach challenges with our customers in mind.
When data scientists sit down to start a project, they generally have a goal in mind. For a company that shows movies online, the goal might sound like this: successfully automate the process of picking the right movie for a person to watch whenever that person logs on.
That goal would generally be achieved by building a predictive model that shows users a screen with highly personalized movie recommendations. Personalizing recommendations for each person, at any moment they log on, is a very involved process that will likely take months to do well.
On top of that, this type of work generally requires buy-in from multiple teams: the data scientists building it, the developers who have to integrate it into the front-end so that users can see it, and the marketing team that wants to show off the latest features.
In other words, it’s a complex process.
Upwards of three out of four predictive models made by data scientists never make it to deployment, the stage where they are actually used to predict or automate something. There are several reasons for this, not the least of which are poor initial planning and a lack of resources or expertise.
But with the proliferation of tools currently at our disposal to deploy models quickly, could data scientists shorten the whole process? And, if we do, what would be the consequence(s)?
This is exactly what several clients have asked me recently and I want to talk about some of the up- and down-sides.
The Up- and Down-Sides of Faster Deployment
At a high level, we generally carry out a version of the following steps for our clients:
- Data – We gather data, clean it up, and put it somewhere easily accessible (think data warehouse)
- Model – We use that clean data to build a predictive model
- Deploy – We deploy that predictive model so our client can realize its value
Overall, we call this the data science pipeline.
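To make the three steps above concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the field names (`visits`, `purchased`), the function names, and the threshold "model" are all hypothetical stand-ins, not how any particular client pipeline is built.

```python
from statistics import mean


def prepare_data(raw_rows):
    """Data step: drop incomplete records and coerce types.

    The field names are hypothetical, chosen only for illustration.
    """
    clean = []
    for row in raw_rows:
        if row.get("visits") is None or row.get("purchased") is None:
            continue  # skip records with missing values
        clean.append(
            {"visits": int(row["visits"]), "purchased": bool(row["purchased"])}
        )
    return clean


def build_model(clean_rows):
    """Model step: a deliberately trivial stand-in for a predictive model.

    It predicts a purchase when visits reach the midpoint between the
    average visits of buyers and of non-buyers. It assumes both groups
    are present in the data; otherwise mean() raises an error.
    """
    buyers = [r["visits"] for r in clean_rows if r["purchased"]]
    non_buyers = [r["visits"] for r in clean_rows if not r["purchased"]]
    threshold = (mean(buyers) + mean(non_buyers)) / 2
    return lambda visits: visits >= threshold


def deploy(model, new_visits):
    """Deploy step: apply the model to new inputs so someone can act on it."""
    return [model(v) for v in new_visits]


raw = [
    {"visits": "10", "purchased": 1},
    {"visits": "2", "purchased": 0},
    {"visits": None, "purchased": 1},  # incomplete, removed in the Data step
    {"visits": "8", "purchased": 1},
    {"visits": "1", "purchased": 0},
]
model = build_model(prepare_data(raw))
predictions = deploy(model, [3, 7])
```

In a real engagement each function would be far more involved, but the shape stays the same: each step feeds the next, which is why cutting corners early affects everything downstream.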
If we build the pipeline correctly, good data will flow through model creation to deployment and provide our clients with a rock-solid system they can use to automate their processes. And that’s how we generally approach these problems. Now, if we need to get to deployment much sooner, there are some interesting ways we can “skip” ahead.
For example, we can “deploy” by providing daily email reports to a team in need of insights. Perhaps we’re ranking a list of potential buyers. Instead of interfacing directly with a CRM in the “Deploy” phase, we could send a simple email. And instead of using all the data available and writing robust code to clean data in a very repeatable way, we could use only a few of the most important data points to make a decision.
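As a rough sketch of what that shortcut might look like, the Python below ranks potential buyers using just two data points and formats the result as the body of a daily email. Everything here is an assumption made for illustration: the field names, the hand-picked weights, and the function names are hypothetical, and actually sending the email is left out.

```python
def score_buyer(buyer):
    """Score a potential buyer from two data points.

    The weights are hypothetical and hand-picked, not fitted to data.
    """
    return 0.7 * buyer["recent_visits"] + 0.3 * buyer["past_orders"]


def build_daily_report(buyers, top_n=3):
    """Rank buyers by score and format the top ones as plain text."""
    ranked = sorted(buyers, key=score_buyer, reverse=True)[:top_n]
    lines = ["Top prospects for today:"]
    for rank, buyer in enumerate(ranked, start=1):
        lines.append(f"{rank}. {buyer['name']} (score {score_buyer(buyer):.1f})")
    return "\n".join(lines)


buyers = [
    {"name": "Ana", "recent_visits": 5, "past_orders": 2},
    {"name": "Ben", "recent_visits": 1, "past_orders": 10},
    {"name": "Cy", "recent_visits": 8, "past_orders": 0},
]
report = build_daily_report(buyers, top_n=2)
print(report)
```

The trade-off is visible in the code itself: there is no data validation, no fitted model, and no CRM integration, which is what makes it fast to ship and fragile to keep.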
There are times to do things this way, especially for projects where showing a minimum viable working prototype is more important than spending six months and a lot of money building something robust and long-lasting. After lengthy discussions about the risks, some projects move forward with this approach.
However, it exposes us to a lot of long-term risk in exchange for a short-term payoff.
At Bennett Data Science, we’ve had success of late with these sorts of high-gain agile projects: finding lift first, then going back and establishing a robust pipeline to support them only after proving value. Perhaps this is one of the ways data science can see more success in building useful tools more quickly and delivering more value overall.
The question of speed versus accuracy is one that we must answer each time we start a new project. Hopefully, this gives you some food for thought for the next time you’re in a hurry to show results but are concerned about the longevity and viability of your work.
What do you think of this approach? I’d love to hear from you.
It’s time for a Bill of Data Rights
As the US Senate debates a new bill, a data-governance expert presents a plan to protect liberty and freedom in the digital age. This essay argues that “data ownership” is a flawed, counterproductive way of thinking about data. Not only does it fail to fix existing problems; it also creates new ones. Instead, writer Tisne argues, a new framework is needed that gives people the right to stipulate how their data is used without requiring them to take ownership of it themselves.
Scented Candles and Covid Infections
A researcher has found a correlation between negative reviews for scented candles and a rise in coronavirus infections. Kate Petrova, who works on the Harvard Study of Adult Development, questioned whether this could be because of the Covid-related loss of smell, and posted a fun data thread on Twitter about the “unexpected victims” of the pandemic.
Data-Driven? Think Again
For a decision to be data-driven, it has to be the data — as opposed to something else entirely — that drive it. Seems so straightforward, and yet it’s so rare in practice because decision-makers lack a key psychological habit.