by: Zank, CEO of Bennett Data Science
I’m such an optimist that that title is a bit abrupt for me. But it’s true: I get asked a very similar, quite ridiculous question all the time. And the ridiculousness has absolutely nothing to do with the “asker”. Rather, it has everything to do with the general perception of what data scientists are capable of. In this post, I’ll dispel some pretty insane ideas that smart people hold more often than you might think.
The question was, simply: “We have a bunch of users, but not much data on them, can you use some data science magic to tell me anything I want to know about them?” It was followed by some assessment of data science, like, “everyone is using data science to do these things these days, when can we start?!”
The fact is, these aren’t the worst questions. The elephant in the room is that most people don’t understand how data science fundamentally works. Let’s dive into THAT! It’s pretty easy to explain, and it has to do with training.
How Data Science Works
Imagine you’re at a cabin out in the woods. It’s a cabin you frequent, and each time you’re there, you pick and catalog mushrooms (not those mushrooms) as poisonous or not using some fancy machine you have for just that purpose.
Alongside the fancy machine’s verdict, you take 22 measurements of things like cap-shape, stalk shape and odor. Over the years you collect 8,124 examples. Then something crazy happens! The zombie apocalypse comes and there you are, safe in your cabin in the woods. Now, if only you had some food. With the last little bit of battery you have in your laptop you code up a quick predictive model that predicts, for a given set of the 22 measurements, whether or not a mushroom is poisonous.
In this (rather ridiculous) case, your predictive model might just keep you alive! Here’s a bit about how it works.
Of the 8,124 mushrooms you measured, you train your predictive model on most of them; let’s call it 6,000 mushroom examples.
Now you have a model. But you still have no idea how well it will perform. This is where the remaining 2,124 mushroom examples come in. You test your model on these remaining examples by hiding the poisonous-or-not answer from the model. Then the model predicts poisonous or not, and you compare its prediction with the real answer to see if it was right. That gives us an “accuracy” score.
If the accuracy is high enough (technically, we want a very, very low false negative rate here, meaning almost no poisonous mushrooms get labeled as safe!) you might trust it to keep you from eating a mushroom that would be your end.
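For the curious, here’s a rough sketch of those steps in Python. To be clear about what’s made up: the data is synthetic (three invented measurements instead of 22, plus a rule I invented where smelly mushrooms are usually poisonous), and the “model” is a deliberately simple one-rule classifier that I picked for illustration. It is not what you’d use on real mushrooms; it just shows the train, test and score loop.

```python
import random
from collections import Counter, defaultdict

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic stand-in data (an assumption for illustration, not real mushrooms).
ODORS = ["almond", "anise", "none", "foul", "fishy"]
CAP_SHAPES = ["bell", "convex", "flat"]
STALK_SHAPES = ["enlarging", "tapering"]

def make_mushroom():
    """Return (measurements, is_poisonous) for one made-up mushroom."""
    odor = random.choice(ODORS)
    features = {
        "odor": odor,
        "cap_shape": random.choice(CAP_SHAPES),
        "stalk_shape": random.choice(STALK_SHAPES),
    }
    # Invented rule: bad-smelling mushrooms are poisonous, with 5% label noise.
    poisonous = odor in ("foul", "fishy")
    if random.random() < 0.05:
        poisonous = not poisonous
    return features, poisonous

mushrooms = [make_mushroom() for _ in range(8124)]
random.shuffle(mushrooms)

# Train on 6,000 examples and hold the rest back for testing.
train, test = mushrooms[:6000], mushrooms[6000:]

def fit_one_rule(examples):
    """Pick the single measurement whose values best predict the label."""
    best = None
    for name in examples[0][0]:
        votes = defaultdict(Counter)
        for features, label in examples:
            votes[features[name]][label] += 1
        # For each value of this measurement, predict its most common label.
        rule = {value: counts.most_common(1)[0][0] for value, counts in votes.items()}
        correct = sum(rule[f[name]] == label for f, label in examples)
        if best is None or correct > best[0]:
            best = (correct, name, rule)
    _, name, rule = best
    return name, rule

def predict(model, features):
    name, rule = model
    # For a value never seen in training, play it safe and say "poisonous".
    return rule.get(features[name], True)

model = fit_one_rule(train)

# Hide the answers from the model, let it predict, then compare.
predictions = [predict(model, features) for features, _ in test]
labels = [label for _, label in test]

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(test)

# A false negative is a poisonous mushroom the model called safe.
false_negatives = sum(y and not p for p, y in zip(predictions, labels))
false_negative_rate = false_negatives / sum(labels)

print(f"held-out examples: {len(test)}")
print(f"measurement the model relies on: {model[0]}")
print(f"accuracy: {accuracy:.3f}")
print(f"false negative rate: {false_negative_rate:.3f}")
```

Notice that the model never sees the held-out labels until scoring time. That’s the whole point: the score tells you how the model does on mushrooms it hasn’t met. And because accuracy lumps both kinds of mistakes together, the false negative rate is tracked separately, since calling a poisonous mushroom safe is the mistake that matters here.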
But what does this have to do with the most-ridiculous question? A lot, actually.
What if you never measured any mushrooms; no measurements of cap-shape or stalk shape or color, and certainly no measurement of poisonous status? Would there be any way at all of knowing if a mushroom is poisonous or not based on data? Um, data you didn’t collect? No, of course not. In other words, we can’t predict what we haven’t seen before. It’s just not possible. Can we go out and buy datasets? Yes, of course. Companies like Acxiom sell just that. Generally, augmenting your own data with purchased data can make models smarter, but if you’re hoping to predict a purchase or a click or adoption, your data scientist will need lots of historical examples of those events from which to construct a predictor of future behavior.
It’s not that the question is that ridiculous; it’s that with all the hype surrounding data science, it becomes easy to see our profession as a panacea. It is not! It’s an incredibly smart tool to help companies in the right position to be helped. This is exactly why we offer a Machine Learning Readiness service to our clients. It’s a way for them to get the answers to all their questions. I don’t truly believe any question is a bad one. Rather, we take a lot of pride in being able to help companies understand exactly how they can benefit from predictive analytics.
If you think we could do the same for you, please contact us today. Our first consultation is always free. We’d love to talk with you!
P.S. My silly mushroom example isn’t so fabricated after all. If you’re interested in this problem, it turns out that UC Irvine has this exact dataset! I love talking about it, because it’s so easy to conceptualize. Here’s the link to the official page: https://archive.ics.uci.edu/ml/datasets/mushroom