
Generally speaking, data scientists build “models” that predict things such as customer churn, what a customer might want next while browsing your website, or even what you yell into your phone in a loud room.

But how long does it take them to build a useful model? Data scientists are expensive to keep around. And if building models takes forever and a day, two bad things happen:

  1. It costs a lot
  2. Over time, struggling to pull good data to even start the model-building process becomes demoralizing to the data scientist(s)

Thankfully, there are ways to prevent this. This week I want to show you one of the ways data is typically stored, why that’s not how data scientists want it, and what you can do about data storage to make life easier (read: save a lot of time and money!).

What you see below is how we’ve helped many of our most successful clients reorganize their approach to data, freeing their data science teams to get more done in less time.

Let’s say we have some user data that comes from log files (this is vastly oversimplified… I even left out timestamps):

User Name, Age, Zip Code
Bob, 34, 92089
Ann, 29, 06378
Tim, 62, 42567
Bob, 34, 92021

If we pretend that the user names are unique (there can only be one Bob), then we have two rows for the same user – Bob – each with a different zip code. Clearly Bob moved. Since log files are written in order, we can assume the later row holds his current address. And if his old address is of no interest to us, then we’ve got surplus data we don’t need and won’t ever use. In other words, we can consolidate the table by removing the first row, the one with his old address, like this:

User Name, Age, Zip Code
Ann, 29, 06378
Tim, 62, 42567
Bob, 34, 92021

Now we have all we need: the most current age and zip code for each of our users.
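
For readers who want to try this, here is a minimal sketch of that consolidation in Python, using only the standard library. It assumes the rows arrive in chronological order (remember, the timestamps were left out), so the last row seen for a user is the current one. The function name latest_row_per_user is made up for this example.

```python
import csv
import io

# The sample log from above. Zip codes stay as text so "06378" keeps its leading zero.
RAW_LOG = """User Name,Age,Zip Code
Bob,34,92089
Ann,29,06378
Tim,62,42567
Bob,34,92021
"""


def latest_row_per_user(rows):
    """Keep only the last row seen for each user name.

    Assumes rows are in chronological order, so "last" means "most current".
    """
    latest = {}
    for row in rows:
        name = row["User Name"]
        # Drop any earlier entry so the user takes the position of the newest row.
        latest.pop(name, None)
        latest[name] = row
    return list(latest.values())


rows = list(csv.DictReader(io.StringIO(RAW_LOG)))
users = latest_row_per_user(rows)

for user in users:
    print(user["User Name"], user["Age"], user["Zip Code"])
```

With real logs you would sort by timestamp first rather than trust the file order.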

Why is this sort of simplification so important?
Data scientists need to build rich user profiles, but with multiple entries per user, complexity increases very quickly. Imagine a customer table with hundreds if not thousands of duplicates for 10 million users. It’s a nightmare! And we see it all the time!

Such situations require us to “clean” the data, which can take months depending on myriad factors such as the number of rows and columns, the integrity of the data, and our objective.

The same is true with product data. We generally don’t need multiple rows of data for each product, especially since products don’t generally change (without a whole new SKU).

What DO Data Scientists Want?
Eighty percent of the work we do relies on having the following datasets:

  1. A user table without duplicates
  2. A product table without duplicates
  3. An interaction table that shows when a user interacted with a product

Generally, this format affords us a greatly simplified and quite powerful look into a business.
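
To make the three datasets concrete, here is one way they could be sketched in Python. This is an illustration, not a prescribed schema: the field names (sku, title, kind, at), the sample product, and the sample date are all assumptions made up for this example, and duplicate_keys is a hypothetical helper for checking the "without duplicates" requirement.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class User:
    name: str       # unique key: exactly one row per user
    age: int
    zip_code: str   # text, so leading zeros survive


@dataclass(frozen=True)
class Product:
    sku: str        # unique key: exactly one row per product
    title: str


@dataclass(frozen=True)
class Interaction:
    user_name: str  # points at User.name
    sku: str        # points at Product.sku
    kind: str       # e.g. "view" or "purchase" (assumed labels)
    at: datetime    # when the interaction happened


def duplicate_keys(keys):
    """Return the set of keys that appear more than once."""
    seen, dupes = set(), set()
    for key in keys:
        if key in seen:
            dupes.add(key)
        seen.add(key)
    return dupes


users = [User("Ann", 29, "06378"), User("Tim", 62, "42567"), User("Bob", 34, "92021")]
products = [Product("SKU-1", "Example product")]
interactions = [Interaction("Bob", "SKU-1", "purchase", datetime(2020, 1, 15, 9, 30))]

# The user and product tables must have no repeated keys.
assert not duplicate_keys(u.name for u in users)
assert not duplicate_keys(p.sku for p in products)
```

Only the interaction table is allowed to repeat a user or a product, because that is where history lives.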

From such simple, clean data tables we can build very intelligent recommenders, cluster users based on attributes and purchase behavior, or find similar products when something is out of stock. We can also compute customer lifetime value (LTV) or churn. In fact, most of what we need to do for the majority of our clients (especially in e-commerce) can be achieved with these three datasets.
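
As a rough illustration of why clean tables make this easy, the sketch below totals purchase value per user by joining the three datasets. It is not a real LTV model, just the kind of first-pass number such a model might start from. All names, prices, and interactions here are invented for the example.

```python
# Hypothetical, already-deduplicated tables keyed by their unique identifier.
users = {"Ann": {"age": 29}, "Bob": {"age": 34}}
products = {"SKU-1": {"price": 20.0}, "SKU-2": {"price": 5.0}}

# Interaction table: (user name, product SKU, kind of interaction).
interactions = [
    ("Ann", "SKU-1", "purchase"),
    ("Ann", "SKU-2", "purchase"),
    ("Bob", "SKU-2", "view"),
    ("Bob", "SKU-2", "purchase"),
]


def spend_per_user(users, products, interactions):
    """Sum the price of every purchased product for each user."""
    totals = {name: 0.0 for name in users}
    for user_name, sku, kind in interactions:
        if kind == "purchase":
            # One row per product means this lookup is unambiguous.
            totals[user_name] += products[sku]["price"]
    return totals


totals = spend_per_user(users, products, interactions)
for name, total in totals.items():
    print(name, total)
```

Because each user and each product appears exactly once, every lookup has a single answer and no cleanup step is needed before the calculation.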

We’ve implemented this methodology for our customers, and we’ve seen it take months off their model deployment time and make exploratory work much faster. Think days versus months.

If you think your team could be more efficient, it could be as simple as a data access issue. Contact us and let’s have a chat about how we can help.

Of Interest

How to Ensure Your Data Science Projects are Successful Every Time
More than half of data science projects never get used by organizations. This might lead some to believe that there are flaws within the data, analytics tools or underlying ML models, but that’s not the case. Failures to launch typically stem from an inability to bridge model outputs with real business or organizational next steps. https://towardsdatascience.com/how-to-ensure-your-data-science-projects-are-successful-every-time-8348ce5e37a0

Can we Solve Diversity with 100,000 Fake A.I. Faces?
Diversity in A.I. is getting a lot of (due) attention lately. Here’s an interesting take: Generated Photos is a free-to-use Google Drive full of algorithmically generated faces that the creators say will solve a lack of diversity in stock imagery. Spoiler: It won’t. https://www.vice.com/en_us/article/mbm3kb/generated-photos-thinks-it-can-solve-diversity-with-100000-fake-ai-faces

Take a Look at Uber’s Massive Collection of Open Source A.I. Tools
Uber Has Been Quietly Assembling One of the Most Impressive Open Source Deep Learning Stacks in the Market. This article looks at some of Uber’s top machine learning open-source projects and where you can find them. https://towardsdatascience.com/uber-has-been-quietly-assembling-one-of-the-most-impressive-open-source-deep-learning-stacks-in-b645656ddddb