
Don’t have enough data for that new A.I. initiative? Simulate some. Let’s take a look at why that can be advantageous and when you might want to simulate big piles of data rather than waiting for all those clicks or visits or rides to roll in.

We’re going to focus on Monte Carlo simulations, named after the famous casino in Monaco and originally inspired by the problem of computing the odds of winning a game of solitaire. In this type of simulation, we make assumptions about the type and distribution of the data we have. Then we use algorithms to “play the game” over and over, generating data for us. Here’s why this is useful.
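To make the mechanics concrete, here’s a minimal sketch in Python of the classic Monte Carlo estimate of π: we assume points fall uniformly in the unit square, “play the game” many times, and count how often a point lands inside the quarter circle.

```python
import random

def estimate_pi(n_samples=1_000_000):
    """Estimate pi by sampling random points in the unit square
    and counting how many land inside the quarter circle."""
    inside = 0
    for _ in range(n_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # Area of quarter circle / area of square = pi/4, so scale by 4.
    return 4 * inside / n_samples

print(estimate_pi())  # converges toward 3.14159... as n_samples grows
```

The estimate tightens as the sample count grows, which is the core bargain of every Monte Carlo simulation: trade compute for answers that are hard to derive analytically.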

For card games like solitaire or multi-deck blackjack, there are so many permutations of hands that computing win probabilities by hand quickly becomes intractable. Instead, we can simulate the results, and after playing thousands of hands, have a pretty good idea of our odds of winning with a given strategy. Using this method, one could work out exactly when to hit and when to stand, for example.
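As a hedged illustration (not real casino advice), here’s a heavily simplified blackjack simulator: it assumes an infinite deck, no splits or doubles, and counts pushes as losses. Sweeping the “stand at” threshold shows how simulation can surface a decent strategy without any probability theory.

```python
import random

def draw():
    # Infinite-deck assumption: each draw is independent.
    # Face cards count as 10; aces start at 11 and are reduced on bust.
    return random.choice([2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 11])

def hand_total(cards):
    total, aces = sum(cards), cards.count(11)
    while total > 21 and aces:  # demote aces from 11 to 1 as needed
        total -= 10
        aces -= 1
    return total

def play_hand(stand_at):
    """One simplified hand: player hits until reaching stand_at,
    dealer hits until 17. Returns True if the player wins.
    Pushes are counted as losses for simplicity."""
    player = [draw(), draw()]
    while hand_total(player) < stand_at:
        player.append(draw())
    if hand_total(player) > 21:
        return False
    dealer = [draw(), draw()]
    while hand_total(dealer) < 17:
        dealer.append(draw())
    return hand_total(dealer) > 21 or hand_total(player) > hand_total(dealer)

def win_rate(stand_at, n=100_000):
    return sum(play_hand(stand_at) for _ in range(n)) / n

# Compare strategies: when should we stop hitting?
for stand_at in (14, 16, 18):
    print(f"stand at {stand_at}: win rate {win_rate(stand_at):.3f}")
```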

Let’s look at how we could use this for purposes outside of gambling.

Election prediction – knowing the polling results, voter turnout in previous years for given counties, and a few other details, we could run a Monte Carlo simulation of an entire nationwide election (and in fact this is done). The same can be done with sports, weather, and even customer purchases.
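Here’s a toy sketch of that idea. The counties, polled shares, turnout figures, and error assumptions below are all made up for illustration; a real model would calibrate them from polling and historical data.

```python
import random

# Hypothetical county-level inputs: (name, polled share for candidate A, expected turnout).
counties = [
    ("County 1", 0.52, 40_000),
    ("County 2", 0.47, 25_000),
    ("County 3", 0.55, 60_000),
]

def simulate_election(polling_error=0.03, turnout_noise=0.10):
    """One simulated election: jitter each county's polled share and
    turnout, then tally the overall vote. Returns True if A wins."""
    votes_a = votes_total = 0.0
    for _, share, turnout in counties:
        t = turnout * random.gauss(1.0, turnout_noise)
        s = min(max(random.gauss(share, polling_error), 0.0), 1.0)
        votes_a += t * s
        votes_total += t
    return votes_a / votes_total > 0.5

n = 10_000
wins = sum(simulate_election() for _ in range(n))
print(f"Candidate A wins in {wins / n:.1%} of simulations")
```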

Product recommenders can leverage a host of customer data. But how much data is enough, how useful is it, and how effectively will algorithms leverage it? Questions like these prompt me to simulate customer data for most recommender projects I work on. It’s rare that a client comes to us with comprehensive customer data, so when they want to weigh the value of an additional parameter against its complexity, we’ll often simulate the additional field under our assumptions and carefully watch the results. In this manner, we can help companies choose the right data to go after.
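A simplified sketch of what that looks like in practice: the field names and distributions below are illustrative assumptions, with “loyalty_tier” standing in for the hypothetical extra parameter a client is deciding whether to collect.

```python
import random

random.seed(42)
CATEGORIES = ["books", "music", "outdoors", "kitchen"]

def simulate_customer(has_loyalty_field=False):
    """Generate one synthetic customer record under our distributional
    assumptions. 'loyalty_tier' is the hypothetical extra field whose
    value we want to evaluate before anyone collects it for real."""
    customer = {
        "visits_per_month": max(1, int(random.gauss(6, 3))),
        "favorite_category": random.choice(CATEGORIES),
    }
    if has_loyalty_field:
        # Assumption: loyalty tier correlates with visit frequency.
        customer["loyalty_tier"] = min(3, customer["visits_per_month"] // 4)
    return customer

# Build two datasets, then feed each to the recommender and compare lift.
baseline = [simulate_customer() for _ in range(10_000)]
enriched = [simulate_customer(has_loyalty_field=True) for _ in range(10_000)]
print(baseline[0], enriched[0], sep="\n")
```

If the recommender’s offline metrics barely move on the enriched dataset even under generous assumptions, that’s a strong signal the field isn’t worth the collection cost.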

Simulations are fairly straightforward to understand, but can be quite challenging to set up correctly. We highly recommend trusting a professional for this type of work 🙂

Here are some external links where you can learn more:
https://www.thisismetis.com/blog/what-is-a-monte-carlo-simulation-part-1
https://towardsdatascience.com/an-overview-of-monte-carlo-methods-675384eb1694

Of Interest

Computers generally don’t get many laughs…
but one may be writing the caption of your favorite cartoon, and you would never know. See how one data scientist automatically captions New Yorker cartoons for the magazine’s weekly caption contest: nobody knows you’re a bot

More about deepfakes:
Deepfake techniques, which present realistic AI-generated videos of real people doing and saying fictional things, have significant implications for determining the legitimacy of information presented online. Yet the industry doesn’t have a great data set or benchmark for detecting them. There’s a push for more research and development in this area, and to ensure that there are better open-source tools to detect deepfakes. Facebook, the Partnership on AI, Microsoft, and academics from Cornell Tech, MIT, University of Oxford, UC Berkeley, University of Maryland, College Park, and University at Albany-SUNY are coming together to build the Deepfake Detection Challenge (DFDC). Read more about it here: https://ai.facebook.com/blog/deepfake-detection-challenge/

Do we require too wide a skillset from data scientists?
There’s no doubt that data science is a multi-disciplinary field. But how much should we expect from a high-priced senior data scientist? How about a junior analyst? How broad, and how deep, can we reasonably expect their skills to be? That’s what this Forbes article dives into: https://www.forbes.com/sites/cognitiveworld/2019/09/11/the-full-stack-data-scientist-myth-unicorn-or-new-normal/#2646182f2c60