Data Cleaning

Your Team’s not Holding You Back, Your Data Is

By | Tech Tuesdays
Reading Time: 3 minutes

Generally speaking, data scientists build “models” that predict things such as customer churn, what a customer might want next while browsing your website, or even what you yell into your phone in a loud room.

But how long does it take them to build a useful model? Data scientists are expensive to keep around. And if building models takes forever and a day, two bad things happen:

  1. It costs a lot
  2. Over time, struggling to pull good data to even start the model-building process becomes demoralizing to the data scientist(s)

Thankfully, there are ways to prevent this from happening. This week I want to show you one of the ways data are typically stored, why that’s not how data scientists want it, and what you can do about data storage to make life easier (read: save a lot of time and money!).

What you see below is how we’ve helped many of our most successful clients reorganize their approach to data, unlocking data science teams to get more done in less time.

Let’s say we have some user data that comes from log files (this is vastly oversimplified… I even left out timestamps):

Name, Age, Zip Code
Bob, 34, 92089
Ann, 29, 06378
Tim, 62, 42567
Bob, 34, 92021

If we pretend that the user names are unique (there can only be one Bob), then we have two rows for the same user – Bob – each with a different zip code. Clearly Bob moved. But if his old address is of no interest to us, then we’ve got surplus data we don’t need and won’t ever use. In other words, we can consolidate the table by removing the first row with his old address like this:

Name, Age, Zip Code
Ann, 29, 06378
Tim, 62, 42567
Bob, 34, 92021

Now we have all we need — the most current ages and zip codes from our users.
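If your logs live in a pandas DataFrame, this consolidation is nearly a one-liner. Here’s a minimal sketch, assuming rows arrive in chronological order so that the last entry per user is the current one (with timestamps, you’d sort on those first):

import pandas as pd

# Toy version of the log table above; a real one would carry timestamps.
logs = pd.DataFrame({
    "Name": ["Bob", "Ann", "Tim", "Bob"],
    "Age": [34, 29, 62, 34],
    "Zip Code": ["92089", "06378", "42567", "92021"],
})

# Keep only the most recent row per user.
users = logs.drop_duplicates(subset="Name", keep="last").reset_index(drop=True)
print(users)   # Ann, Tim, Bob -- one row each, with the latest zip codes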

Why is this sort of simplification so important?
Data scientists need to build rich user profiles, but with multiple entries per user, complexity increases very quickly. Imagine a customer table with hundreds if not thousands of duplicates for 10 million users. It’s a nightmare! And we see it all the time!

Such situations require us to “clean” the data, which can take months depending on myriad factors such as the number of rows and columns, the integrity of the data, and our objective.

The same is true with product data. We generally don’t need multiple rows of data for each product, especially since products don’t generally change (without a whole new SKU).

What DO Data Scientists Want?
Eighty percent of the work we do relies on having the following datasets:

  1. A user table without duplicates
  2. A product table without duplicates
  3. An interaction table that shows when a user interacted with a product

Generally, this format affords us a greatly simplified and quite powerful look into a business.

From such simple, clean data tables we can build very intelligent recommenders, cluster users based on attributes and purchase behavior, or find similar products when something is out of stock. We can also compute customer lifetime value (LTV) or churn. In fact, most of what we need to do for the majority of our clients (especially in e-commerce) can be achieved with these three datasets.
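To make that concrete, here’s a toy sketch of the three tables and the kind of join most of our work starts from. The table and column names are illustrative, not a prescribed schema:

import pandas as pd

users = pd.DataFrame({"user_id": [1, 2], "age": [29, 62]})
products = pd.DataFrame({"product_id": [10, 11], "category": ["shoes", "hats"]})
interactions = pd.DataFrame({
    "user_id": [1, 1, 2],
    "product_id": [10, 11, 10],
    "event": ["view", "purchase", "purchase"],
})

# One join, and we're ready for recommenders, clustering, LTV or churn work.
enriched = interactions.merge(users, on="user_id").merge(products, on="product_id")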

We’ve implemented this methodology for our customers, and we’ve seen it take months off their model deployment time and enable very fast exploratory work. Think days versus months.

If you think your team could be more efficient, it could be as simple as a data access issue. Contact us and let’s have a chat about how we can help.

Of Interest

How to Ensure Your Data Science Projects are Successful Every Time
More than half of data science projects never get used by organizations. This might lead some to believe that there are flaws within the data, analytics tools or underlying ML models, but that’s not the case. Failures to launch typically stem from an inability to bridge model outputs with real business or organizational next steps. https://towardsdatascience.com/how-to-ensure-your-data-science-projects-are-successful-every-time-8348ce5e37a0

Can we Solve Diversity with 100,000 Fake A.I. Faces?
Diversity in A.I. is getting a lot of (due) attention lately. Here’s an interesting take: Generated Photos is a free-to-use Google Drive full of algorithmically-generated faces that the creators say will solve a lack of diversity in stock imagery. Spoiler: It won’t. https://www.vice.com/en_us/article/mbm3kb/generated-photos-thinks-it-can-solve-diversity-with-100000-fake-ai-faces

Take a Look at Uber’s Massive Collection of Open Source A.I. Tools
Uber Has Been Quietly Assembling One of the Most Impressive Open Source Deep Learning Stacks in the Market. This article looks at some of Uber’s top machine learning open-source projects and where you can find them. https://towardsdatascience.com/uber-has-been-quietly-assembling-one-of-the-most-impressive-open-source-deep-learning-stacks-in-b645656ddddb

How Large Companies use A.I. to Increase Engagement

By | Tech Tuesdays
Reading Time: 4 minutes

Engagement is essential.

For a lot of businesses, personalizing great content or products is the key to high customer engagement. For companies with lots of products where users cannot sift through everything on their own, a recommender system is an effective and valuable tool with which they can show the right item to the right user at the right time.

These recommender systems work well until the number of products skyrockets, or when products are changed and/or added very rapidly. In such conditions, relevant and personalized recommendations become much more challenging to deliver.

The best modern example of a company whose recommender system is faced with this challenge is YouTube, where 300 hours of video are uploaded every minute!

We’ve all experienced searching for a five-minute video and losing an hour or more to the YouTube rabbit hole. That’s not a mistake. The average mobile viewing session lasts more than 40 minutes, up more than 50% year-over-year.

YouTube gets engagement.

YouTube captures your attention by providing incredibly accurate recommendations, in near real-time. But how is that even possible with such colossal data rates?

YouTube has billions of potential videos to show you. Even the most efficient recommender algorithm cannot scale to billions of products and update frequently enough to incorporate all the new content. So YouTube did something smart; they created a two-step solution to the problem.

  1. For each user, in real-time, YouTube uses some high-level information about what types of videos were watched in the past to filter through billions of options and generate a few hundred candidate videos tailored to each user
  2. A very accurate recommender algorithm sorts through this narrowed-down selection of videos and ranks them according to what the user is most likely to engage with (see the sketch below)
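In code, the split might look something like this toy sketch. Everything here (the catalog, the category-based candidate filter, and the stand-in ranking score) is an illustrative assumption, not YouTube’s actual system:

import random

CATALOG = [{"id": i, "category": random.choice(["music", "sports", "news"])}
           for i in range(100_000)]   # stand-in for billions of videos

def generate_candidates(history, k=300):
    """Step 1: a cheap filter that cuts the catalog to a few hundred items."""
    liked = {video["category"] for video in history}
    pool = (v for v in CATALOG if v["category"] in liked)
    return [v for _, v in zip(range(k), pool)]

def rank(history, candidates):
    """Step 2: the accurate (expensive) model, faked here with a random score."""
    return sorted(candidates, key=lambda v: random.random(), reverse=True)

history = [{"id": 42, "category": "music"}]
top_ten = rank(history, generate_candidates(history))[:10]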

You don’t have to look for any accuracy metrics here. YouTube recommendations are about as good as they get, and that’s shown by the incredible engagement numbers: Around 5 billion videos are watched on YouTube every single day by their nearly 30 million daily visitors!

Clearly, the two-step recommendation process works well for them.

This is generally how we approach “Big Data” in practice; we first look for ways to simplify or cull down the huge number of examples. This might include:

  • Training our models on a short time window of data, removing the need for years of data collection that might only add a trivial amount of increased accuracy
  • Performing intelligent segmentation that allows us to omit irrelevant items and create smaller, more manageable models
  • Using special techniques to omit or combine multiple data fields

We often work with large datasets, but they usually don’t stay big for long; by employing some smart decisions about what’s important to the user and the company (good and fast recommendations that drive engagement), we often find that data science doesn’t have to be overly complex or incredibly computationally expensive.

If you’re running into capacity issues or are struggling to find ways to engage your customers with an increasingly large data set, contact me at zank@bennettdatascience.com and tell me about it. You may be a simple solution away from a better customer experience!


Read about the YouTube recommender here: 
https://hackernoon.com/youtubes-recommendation-engine-explained-40j83183

And here are some mind-blowing stats from YouTube in 2019: 
https://merchdope.com/youtube-stats/

Of Interest

A.I. ‘Outperforms’ Doctors Diagnosing Breast Cancer
A study in the journal Nature suggests that A.I. is more accurate than doctors in diagnosing breast cancer from mammograms. But before we start replacing all our doctors with iPads, there’s a lot more to treating a human than a simple diagnosis. This topic is surrounded by ethics and privacy and myriad other concerns. I believe that at best, A.I. is an important tool to assist physicians, not replace them. https://www.bbc.com/news/health-50857759

The Hidden Dangers in Algorithmic Decision Making
While we’re on the topic of A.I. making decisions for us, have you ever considered how the content you consume all day was chosen for/recommended to you? From YouTube videos to the latest Spotify playlist to your Instagram feed to the movies you watch at night… All of those sources use algorithmic recommenders to deliver your content. The problem is, with so many of us in that loop of recommendations, these algorithms can generate a tremendous amount of bias for the content we consume. Read more about this fascinating and very real dilemma here: https://towardsdatascience.com/the-hidden-dangers-in-algorithmic-decision-making-27722d716a49

A.I. is a Fast-Growing Field and That’s not Going to Change any Time Soon
A.I. jobs are on the upswing, as are the capabilities of A.I. systems. The speed of deployments has also increased exponentially. It’s now possible to train an image-processing algorithm in about a minute – something that took hours just a couple of years ago. Everything from A.I. conference attendance to the shrinking amount of time required to build and deploy intelligent models speaks to the huge growth of A.I. https://www.zdnet.com/article/artificial-intelligence-the-score/


Want More Revenue? Focus on Customer Engagement!

By | Tech Tuesdays
Reading Time: 4 minutes

“Customer engagement is where the heart is”
– Neil Patel (1)

A study by Constellation Research reported that companies who improve engagement can increase cross-sell revenue by 22 percent, up-sell revenue by 38 percent and order size by 5 to 85 percent. (2)

Engagement is huge. For some companies, it’s their mission: Maximize Customer Engagement. Think about big companies like Facebook or Netflix. They strive for our engagement. It’s everything to them. Without engagement, there’s no revenue or profit. Without profits, there can be no company (sorry WeWork). Instagram was massive before Facebook bought them. Instagram was (and still is) all about engagement.

But it’s not just these behemoths that have the data required to effectively increase engagement. In fact, most companies collect data that can help maximize engagement.

If engagement is a big goal for you or your company, you may be asking yourself how your data can help. Here are a few ways data scientists use data to help companies increase customer engagement:

1-to-1 Personalization
Personalization is not only convenient for your customers, it’s what they expect! You know how you feel when your recommendations are spot on? It’s like they know you. When this happens, it builds trust and trust supports continued engagement. Here are a couple of ways in which we do that for our customers:

  • Personalized Product Recommenders
    Product recommenders are algorithms that drive engagement because they “pay attention” to each click or action a customer takes. Product recommenders use this valuable information to provide 1-to-1 personalized product recommendations in real-time. This is valuable in several touchpoints that drive up engagement:
  1. Product pages – tailoring the browsing experience to the customer
  2. Emails – recommending specific content for that customer
  • Marketing materials designed to speak to specific segments
    There are many ways to achieve segmentation for marketing, from splitting an audience on gender and/or age, to building informative customer avatars or profiles. When our clients have sufficient customer data, we can create very specific segments, and we see engagement spike due to hyper-personalized marketing. For clients with less data or who are just starting out, we can employ a few fluid onboarding questions to capture customer preferences. These preferences immediately impact engagement with marketing materials. It’s important to remember that when a customer gives us valuable onboarding data, it’s our obligation to act on it. For example, no meal delivery service should suggest steak to a vegetarian.

Reduce Poor Engagement – Predict Positive and Negative Reviews or Churn
To increase engagement, we often look at large collections of data showing customer churn. Then we build a model to predict when a customer has a high probability of churning. A retention team can use this information to intervene in real-time and hopefully prevent customer loss. The same process can be applied to negative reviews to predict these before they occur. This provides opportunities to reach out with personalized messages with the goal of re-engaging dissatisfied customers.
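As a rough illustration, here’s what the bones of such a churn model can look like in scikit-learn. The features and synthetic data are assumptions for the sketch; a real model would be trained on your own customer history:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))   # e.g. tenure, monthly spend, support tickets
y = (X[:, 0] + rng.normal(size=1000) < 0).astype(int)   # 1 = churned

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Churn probability per customer; a retention team acts on the highest scores.
churn_risk = model.predict_proba(X_test)[:, 1]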

These engagement techniques are each data-driven. If you’re collecting customer data (to create rich customer segments), product data (containing the attributes of your products) and interaction data (how your customers interact with your products), you probably have everything you need to use these effective data-driven methods.

Many companies, however, collect piles of engagement data and never use it effectively. For example, have you ever received an email showing you a smattering of the latest products that you weren’t interested in at all? Unfortunately, these non-personalized messages are still common.

Showing a customer new or popular items is a solid first step towards driving engagement, but it never works as well as true data-driven 1-to-1 personalization. We usually see a 30% uptick in customer engagement when moving from such non-personalized approaches to true 1-to-1 personalization.

If you’ve made the transition to personalized treatment of your customers, please hit reply; I’d love to hear how it’s going! And if you’re interested in learning how your specific company/organization’s data can be used to enjoy the benefits of 1-to-1 personalization, feel free to reach out to us at zank@bennettdatascience.com. The first consultation is always on us!

Read more on this topic here:
https://sloanreview.mit.edu/projects/using-analytics-to-improve-customer-engagement/

Links to referenced articles
(1) https://neilpatel.com/blog/analytics-can-strengthen-engagement/
(2) https://www.constellationr.com/blog-news/research-summary-why-live-engagement-marketing-supercharges-event-marketing

Of Interest

Cutting Costs – New Hardware for Making Recommendations to Billions of Customers
The biggest recommendation services in the world are switching to powerful GPUs for their huge calculations and seeing massive benefits. They’re able to save time and expense by replacing large CPU clusters using hundreds of nodes with a single chip. In doing this, costs drop by 90%. https://www.nextplatform.com/2019/12/19/ai-recommendation-systems-get-a-gpu-makeover/

Most Hackers Aren’t Criminals
Read this interesting article about ethical hackers who spend their days breaking into secure systems before their adversaries do. It’s not A.I. specifically, but quite interesting. https://www.nytimes.com/2019/11/07/opinion/hackers-hacking.html

And for a bit of levity, check out this excuse generator you can use if you’re hacked: https://whythefuckwasibreached.com/

Kaggle First Place Winner Cheated, $10,000 Prize Declared Irrecoverable
Kaggle is a data science competition site, where data scientists (or teams of them) compete for cash prizes, usually by providing the most accurate solution to a problem. Over the years this approach to data science has proven controversial more than once. In this latest debacle, read about how a team obtained private data, constructed a fake A.I. model, and got away with the money from a platform for adopting neglected pets. https://towardsdatascience.com/kaggle-1st-place-winner-cheated-10-000-prize-declared-irrecoverable-bb7e1b639365


What I Learned Teaching my Computer to Play Tic-Tac-Toe

By | Tech Tuesdays
Reading Time: 4 minutes

Good morning. This is David Albrecht, data scientist at Bennett Data Science, filling in for Zank.

The other day, I saw a post from someone explaining that their father went his entire life having never heard of Tic-Tac-Toe. It blew my mind. Can you imagine that?! Blasphemy, if you ask me.

As a kid, I played it a lot on restaurant napkins with my younger brother. It’s a funny little game – easy to explain and yet difficult to completely master. Sometimes he’d win and sometimes I would, but neither of us could ever really remember how to ensure a win (or at least a tie) each time. We’d forget how we got duped by that double-whammy four games ago, or whether we should take the middle spot to start the game. Our memories just couldn’t hold all the details.

Computers are the best at these sorts of tasks. They can be programmed to do the same things over and over and remember the past. For Tic-Tac-Toe, however, we’d need to be experts at the game to program a computer to be one as well.

Instead, and much more interestingly, we can create an expert.

Reinforcement Learning is a branch of Artificial Intelligence in which the computer learns the best move at every step of a process in order to eventually win, which we call receiving a “reward”.

Different from more typical A.I. tasks, Reinforcement Learning requires the system to understand that there is a sequence of events that ends in a desired outcome.

Recently, this sort of A.I. has hit a major milestone by beating a top professional player in StarCraft II, a notoriously complex video game that requires careful thought in order to outsmart the opponent and finally win.

Not only useful for video games, Reinforcement Learning can be used to efficiently automate other tasks like stock market trading, computer resource management, traffic light control, and robotics, to name a few.

Some of the world’s most successful tech companies use it too. Did you know that Netflix uses artwork tailored to your preferences to try and get you to watch the movies they recommend? They do, and they use special types of Reinforcement Learning to do it!

They’re incredibly powerful algorithms.

The main drawback, however, is that we usually need data describing every possible state, action, and reward. That means that if you want to know what actions to take to sell your most expensive widget, the agent must have access to historic sales data that shows sales of the widget. Otherwise, there’s no way for it to learn! Luckily, in some cases, it’s possible to simulate that data.

How does it work?
We can teach a computer to play a simple game with multiple steps such as Tic-Tac-Toe. To do this, we take two players who start by playing random moves against each other. Over the course of each game, one of them, the “agent,” is going to learn how to play, and the other is going to continue playing randomly.

For each step of the game, called a “state”, we save an entry in a large table so that the agent can look up a state and remember the moves that resulted in winning or losing. The agent is given a reward for a win and a penalty for a loss, and informed by this data it’ll continue to play the moves that resulted in a win and avoid the moves that tended to lead to a loss. Over time, the agent learns that the best first move is a corner or the center, and it effectively learns how to play a game with multiple steps.
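Here’s a stripped-down sketch of that lookup table and its updates. It’s an illustration of the idea, not the code from the repo linked below:

from collections import defaultdict

values = defaultdict(float)   # board state (as a string) -> estimated value
ALPHA = 0.1                   # learning rate: how fast we trust new outcomes

def update(visited_states, reward):
    """After a game, nudge every visited state toward the final reward."""
    for state in visited_states:
        values[state] += ALPHA * (reward - values[state])

def best_move(candidate_states):
    """During play, pick the next board state with the highest learned value."""
    return max(candidate_states, key=lambda s: values[s])

# One imaginary game the agent won (+1) after passing through two boards:
update(["X........", "X...O...X"], reward=1.0)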

As you can imagine, this process is similar to other systems of steps, such as traffic optimization, a robot moving and picking things up, and movie selections based on artwork.

This process is called Reinforcement Learning because we reinforce the actions that we want the agent to take. What’s most interesting here is that we don’t even have to be experts in playing Tic-Tac-Toe to teach the agent how to be an expert – it will learn on its own through trial and error just like we would!

Once the agent becomes an expert, it can be used to make decisions at every step of the game or sales process so that, when it gets to the end of the game, it has the best chance of winning. Or the best chance of selling that widget!

Reinforcement Learning is one way we can learn to automate tasks when we have sufficient training data. Have you tried using Reinforcement Learning or another type of A.I. in your business? Please hit reply and let us know what you think.

If you’d like to see our simulation in action, you can see all the code here:
https://github.com/dpalbrecht/TicTacWhoa

Of Interest

If you read one thing this week…
Read this hierarchy of needs for A.I. projects. In clear terms, this post explains all the complexity required to launch data science projects. It’s important to understand all that goes into a deployed predictive model, as each step along the long road is tied to great time and expense.
https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007

Employees need to understand analytics to succeed
A study from Deloitte found that 67% of employees are not comfortable accessing data or using insights with company tools, and that “the majority of companies today adopt a fragmented, siloed approach to analytics tools and data, which correlates with diminished business success.”
https://medium.com/@datagran/democratization-of-data-science-79ce64b4b98c

Here’s a huge pile of A.I. resources
Here, you’ll find links to a long list of Machine Learning Resources, many with explanations.
https://medium.com/mlait/a-z-machine-learning-resources-5a7e29d9c45c

Is that Data Science Company a Fake?

By | Tech Tuesdays
Reading Time: 4 minutes

Today’s newsletter is a reaction to an article, How to Recognize A.I. Snake Oil. The author writes about the large and growing number of fake A.I. claims companies are making. Unfortunately, it can sometimes be quite difficult to tell real from fake A.I.

I love this topic. It’s also quite timely, as a colleague I respect a lot wrote me recently. In his wonderful stream of consciousness style, he asked me the following:

What makes someone’s machine learning or AI efforts in marketing better than others? Everyone has a model and algorithm. Are some trained better? Is the answer “prove it” and the more performant algorithm wins? Or is it tied to the sustainability of ML over time? Curious how customers make that distinction as many I know who buy AI/ML as part of a system are about an inch deep?

DB, thank you very much for initiating this conversation with me!!

My knee-jerk reaction is to say, ask a professional (data scientist) to properly vet the company. But of course that doesn’t scale, and really, that’s not what data scientists are for.

But there’s a lot in his questions and comments. Let’s unpack it a bit.

What about the word “better”? How is one A.I. better than another? The simple/naive answer is, the one that performs best on the metric(s) used to measure maximization of some objective. But that’s (wordy as heck and also) not what he had in mind.

Rather, the questions are part of a growing dilemma: there’s so much confusion around what A.I. is and what it can do that companies are getting away with using phrases like “data-driven personalization” or “cutting-edge A.I.” when in fact they have little or none of either. Snake oil!

I’ve seen my very own words featured on the websites of companies that I had been talking with. They were bold enough to take phrases from my initial call with them and paste them on the front page of their websites a few days later. I have to say it was flattering, but even more so, illuminating.

Here’s another example: I hear companies promise all the time that they use machine learning to do customer segmentation. Then, behind the scenes, they’re using male/female or arbitrarily chosen age brackets to perform segmentation. Are these segments really purchasing/behaving differently? The hope is yes, but there’s zero analysis done behind the scenes to confirm it. But it sort of works. I’m reminded here of the Anchorman quote:

60% of the time, it works every time.

These examples are (almost) amusing. But this can be very frustrating and expensive for companies looking to pay for intelligent services. In some cases, job candidates are overlooked because the A.I. in charge of vetting didn’t pick up the correct number of keywords. What are applicants doing? They’re taught to use words like “Cambridge” or “Oxford” in white text in their digital resumes. Humans can’t see it but computers drink it up! Hardly intelligent on the part of the machine.

This is the result of companies exploiting the confusion around our field and slapping the A.I. label on anything and everything. After all, it helps with fundraising and getting that next client. There’s a lot of ethics involved here. But it gets worse.

Want to feel a little uneasy? Read this list of areas where A.I. could potentially help:

  • Predicting criminal recidivism
  • Predicting job performance
  • Predictive policing
  • Predicting terrorist risk
  • Predicting at-risk kids

The question here is, can social outcomes be predicted?

I won’t spoil everything in the article, but the answer is, of course, hardly. And:

“We must resist the enormous commercial interests that aim to obfuscate this fact.”

Please read the fascinating (PDF) article on this topic here: https://www.cs.princeton.edu/~arvindn/talks/MIT-STS-AI-snakeoil.pdf

Of course, there are good companies out there providing fantastic data-driven products and services to their clients. But A.I. practitioners face real challenges as we hope to differentiate our companies from those offering little or no intelligent use of data.

And if you’re wondering whether to get involved with a company or organization offering “cutting edge A.I.”, ask to speak with their head of data science. In many cases, you’ll know right away.

Of Interest

Senators Want Answers About Algorithms That Provide Black Patients Less Healthcare
This makes sense. When data scientists train algorithms on datasets where minorities are underrepresented, results can be life-threatening. This sort of “bias” is the topic of this article. We’re all responsible for paying attention to the ethical issues that come from how we interpret and use our data. Cambridge Analytica ring a bell?
https://arstechnica.com/tech-policy/2019/12/senators-want-answers-about-algorithms-that-provide-black-patients-less-healthcare/

Gender Bias in A.I. and What we can do to fix It
There is a big bias against the pronoun “hers” in the datasets used to train most of the language models we use today. The result: bias in how we address women in A.I. products. The source of the bias is the perfect metaphor for bias in A.I. more broadly. Read more in this fascinating, and quite long, article. The author poses the problem well, then offers an elegant solution.
https://medium.com/@robert.munro/bias-in-ai-3ea569f79d6a

Who Better Than Pinterest to Discuss A.I. for Images
In this technical article, Pinterest scientists talk about how they use image embeddings throughout their search and recommendation systems, powering experiences like browsing related content and searching for exact products to shop for. They describe a multi-task deep metric learning system that learns a single unified image embedding to power multiple visual search products.
https://arxiv.org/abs/1908.01707


A.I. and Highway Speed Traps

By | Tech Tuesdays
Reading Time: 4 minutes

When you’re driving down the street in your car (or it’s driven by A.I., if you’re reading this in the near future) and you want to know if you’re speeding, you look at the speedometer. Pretty simple.

That’s a measurement of a rate (mph or km/h), and a really important one at that. If you’re speeding, you want to know. And generally, we trust the speedometer to be accurate.

A.I. systems have similar measures. These are the numbers that we produce to help us understand how our models are performing. These metrics help us make all sorts of decisions:

  • Is the new model “better than” the old model? If so, it may be time to test the new model in production.
  • Is the current model still performing? Drift or changes to input data can have big effects on revenue; we monitor metrics to identify and correct for drift early on.
  • Do we have the necessary data to predict churn or fraud? Metrics tell us how accurately we can predict such events.

Ok, but what does this have to do with driving a car and reading the speedometer?

Say you’re speeding down the freeway at 90 mph and the speed limit is 70 mph. Well, you’re in a prime position to get a ticket. Is it your fault? Yes, of course it is! You were speeding. But what if your speedometer said you were going 70 mph? That doesn’t matter, it’s still your fault. You made a mistake when you trusted your speedometer.

Specifically, you chose a false negative answer to the question, “Am I speeding?”. You said, “no”, and you were wrong (false negative).

What’s the penalty for a false negative like this? Well, probably a few hundred dollars and maybe a weekend day in traffic school! No thank you!

Could you go back to the car manufacturer, and demand an explanation? Of course you could. And if they sent a lot of cars onto the streets that made errors like these, they would certainly hear about it. In other words, the penalty of false negatives for car manufacturers is very real.

What about the other way around: false positives? In the car example, a false positive to the question “Am I speeding?” would be a “yes” when you were actually driving at or below 70 mph. No big deal. You’d simply slow down. The consequence here is entirely different.

Now we have enough information to put together a compelling way to think about how accurate speedometers need to be.

As a driver, I would take a higher false positive rate if it meant keeping my false negative rate VERY low. In practice this would mean that occasionally I might buy a car that shows I’m driving 72, when I’m really only going 70. In this case I would slow down a bit, but would never be in danger of a ticket. I would be willing to accept that tradeoff much more often than the other case; only very rarely would I want to have a speedometer that says I’m going 68 mph when I’m really going 75 mph.

The tradeoff I’m discussing here is illustrated beautifully in a plot called a “receiver operating characteristics curve” (no one calls it that, it’s abbreviated ROC curve and pronounced like the word “rock”). It allows data scientists to set a false positive rate based on an acceptable false negative rate. It does this by showing the relationship between the true positive rate and the false positive rate, and allowing someone to choose the acceptable risk (the operating point).
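Here’s a short sketch of reading an operating point off a ROC curve with scikit-learn. The labels and scores are synthetic, and the 5% false negative budget is just an example business rule:

import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                  # 1 = "speeding"
scores = y_true + rng.normal(scale=0.8, size=1000)      # noisy model scores

fpr, tpr, thresholds = roc_curve(y_true, scores)

# Business rule: keep the false negative rate under 5%,
# i.e. demand a true positive rate of at least 95%.
ok = tpr >= 0.95
print(f"operate at threshold {thresholds[ok][0]:.2f}, "
      f"accepting a false positive rate of {fpr[ok][0]:.0%}")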

Generally, the operating point is determined by product leaders, as it has a lot to do with business risk (think about the risk to an automobile manufacturer when its speedometers read too low for the actual speed).

We’ve used ROC analysis often. One example is fraud detection for financial institutions where a false positive means calling a customer or sending an SMS when there was no fraud event.

That’s an inconvenience, but compared to a false negative (when fraud happens and it is not detected), it’s a lot less expensive for the bank. In this case, banks are usually willing to send out a few unwarranted messages to customers in lieu of losing money.

You can read more here about ROC curves, the importance of picking the right operating point, and why it’s the job of product leads to do this:
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5

Of Interest

Biased Algorithms are Easier to fix Than Biased People
In one study published 15 years ago, two people applied for a job. Their résumés were about as similar as two résumés can be. One person was named Jamal, the other Brendan. In a study published late last year, two patients sought medical care. Both were grappling with diabetes and high blood pressure. One patient was black, the other was white. Both studies documented racial injustice: In the first, the applicant with a black-sounding name got fewer job interviews. In the second, the black patient received worse care. It’s easier for analytics professionals to de-bias data and generate race-independent predictions than it is for individuals to change their unconscious bias.
https://www.nytimes.com/2019/12/06/business/algorithm-bias-fix.html

Using Algorithms to Understand the Biases in Your Organization
There’s no doubt there’s bias in A.I. algorithms. This article talks about using A.I. to identify that bias. Organizations should use statistical algorithms for the magnifying glasses they are: Algorithms can aggregate individual data points with the purpose of unearthing patterns that people have difficulty detecting.
https://hbr.org/2019/08/using-algorithms-to-understand-the-biases-in-your-organization

Soft Skills for Data Science
Part of being successful as a data scientist involves a lot of math and analytical skills. That’s fairly well accepted. To be truly successful, however, requires a few soft skills as well. In this article, you will see how something like skepticism can be essential to succeeding as an analytics professional.
https://towardsdatascience.com/soft-skills-for-data-science-fee73ae4821a


Happy New Year – Future Data Science Trends for 2020

By | Tech Tuesdays
Reading Time: 3 minutes

Tomorrow we embark on the 20’s! Finally, a decade that’s fun to say. Did anyone ever refer to the last decade as the tens? That’s to say nothing of the “naughts”.

This week, I’d like to look at the direction I think data science is headed in this next year.

Firstly, I foresee that deep learning will continue to dominate natural language processing and image processing. These technologies power everything from self-driving cars to the digital assistants we talk to each day (Siri, Alexa, Google Assistant, to name a few).

Companies will find new and innovative ways to leverage increasingly accurate deep learning models as they’re released. Deep text and visual understanding is now within reach of anyone with a credit card and the small budget required to rent a cluster of virtual machines for a few hours.

These costs are now too low to be a barrier to entry, and practitioners are getting easier to find all the time.

Secondly, as deep learning progresses and becomes more accessible, classical machine learning methods will see a lot less use and adoption. Simultaneously, the big three cloud platforms (Amazon Web Services, Google Cloud Platform and Azure) will continue to make it easier to deploy and monitor neural network models at scale.

The result is that development operations teams will become less important to data scientists. New data science teams will have internal resources for handling data (the data engineer) and deployment (the machine learning engineer).

Thirdly, reinforcement learning is coming up a lot more lately. Here’s an example to help motivate the need for it: Imagine you’re running an A/B test to determine which product to show to a group of users, and the click-through rate for A is 10% better than B.

Well, you should always show A, right? Right! But that assumes that the click is the desired outcome. It might be. But maybe it’s customer lifetime value (LTV). Typical A/B tests don’t look downstream enough to measure LTV. That’s where reinforcement learning comes in. It’s much more challenging to get right and can be labor intensive, but it works well when done right, and it’s getting a lot more attention these days.
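As a taste of the idea, here’s a toy epsilon-greedy bandit, one simple RL-adjacent alternative to a fixed A/B split. The reward fed in could be a downstream value signal rather than a click; all numbers here are made up:

import random

estimates = {"A": 0.0, "B": 0.0}   # running mean reward per variant
counts = {"A": 0, "B": 0}
EPSILON = 0.1                      # fraction of traffic that keeps exploring

def choose_variant():
    if random.random() < EPSILON:
        return random.choice(["A", "B"])        # explore
    return max(estimates, key=estimates.get)    # exploit the current winner

def record_reward(variant, reward):
    counts[variant] += 1
    estimates[variant] += (reward - estimates[variant]) / counts[variant]

variant = choose_variant()
record_reward(variant, reward=random.random())  # stand-in for an LTV signal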

Other trends I see getting a lot more attention in the coming year(s):

  • A.I. driven Security
  • Explainability of A.I. “black box” models
  • Better voice assistants (Google, Siri, Alexa)
  • Wider adoption of autonomous vehicles through legislation and advances in A.I.

And finally, if you’re worried, here’s an overview of seven jobs that may be gone by 2030: https://fortune.com/2019/11/19/artificial-intelligence-will-obliterate-these-jobs-by-2030/

Happy New Year!!

Of Interest

Safer A.I. – Take a Look at Safety Gym
While much work in data science to date has focused on algorithmic scale and sophistication, safety — that is, safeguards against harm — is a domain no less worth pursuing. This is particularly true with applications like self-driving vehicles, where a machine learning system’s poor judgment might contribute to an accident. https://openai.com/blog/safety-gym/
https://venturebeat.com/2019/11/21/openai-safety-gym/amp/

The 8 Minute Guide to how Your Business can Solve Problems with A.I. and Machine Learning
There are a tremendous number of applications of A.I. to many types of businesses. But what are these applications? How do companies know if they can use their data to impact products and revenue? This easy-to-read article gets at the heart of those questions. https://towardsdatascience.com/the-8-minute-guide-to-how-your-business-can-solve-problems-with-ai-and-machine-learning-b7e66f4b484e

The 6 Research Directions of Deep Recommendation Systems That Will Change the Game
Over the past couple of years, there have been big changes in the recommendation domain, shifting from traditional matrix factorization algorithms to state-of-the-art deep learning-based methods. This post looks at the main reasons why this happened and what the research says about building modern (product) recommenders. https://towardsdatascience.com/recommendation-system-series-part-3-the-6-research-directions-of-deep-recommendation-systems-that-3a328d264fb7


The Holidays Screw it all up for Data Scientists

By | Tech Tuesdays
Reading Time: 4 minutes

Know what I usually think about around the holidays? Models that suddenly don’t work at all.

I should explain.

A lot of what we do day in and day out revolves around repeatable patterns of user or customer behavior. If users/customers browsed, clicked and bought in a completely random way, we would have no way to predict what they want. Personalization would drop off drastically.

Imagine going to Amazon and buying random products that don’t seem to go together at all. This rarely happens, but is exactly what we see around the holidays – you should see the random things we get for our house during Halloween!

We worked for a company several years ago that tracked their engagement and related revenue by the hour. We’d see huge drops during Christmas and several other holidays. We completely expected it, but it was still terrifying! What if the revenue drop was a false negative for detecting a system failure? In other words, what if a holiday coincided with some sort of tech snafu that just happened to land on that day? Improbable, but not impossible.

Of course, the way to handle this is to simply relax a bit over the big holidays. Sit back and have a sip of eggnog as all your customers head off in their cars or open presents or dress up like demons and terrorize the neighborhood. For companies that have been around long enough, it’s possible to look back over the years and predict these periodic events.

Holidays are different from seasonality, where weather patterns over month-long timelines affect user behavior. I’m talking about the fast, one-day-and-they’re-gone events. We know how to handle seasonality pretty well, but is there anything we can do about holidays, especially across multiple countries where traditions and time zones differ?

To answer this, I went looking for a Python package that might help. A Python package is a bit of code designed to accomplish a specific task. Often packages are open source, so we’re free to use them commercially. With little effort I discovered one, and it’s called, unimaginatively, holidays.

Essentially, the package returns true/false for the question:

Is today (or any date) a holiday in a given location?

It can also list the major holidays for most countries around the world.
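Using it takes only a few lines. Here’s a minimal sketch (pip install holidays first); the state="CA" argument is my assumption, added to reproduce the state-specific entries in the U.S. listing below:

import datetime
import holidays

us = holidays.US(state="CA", years=2019)   # state="CA" is an assumption here
pt = holidays.Portugal(years=2019)

# The true/false question from above:
print(datetime.date(2019, 7, 4) in us)   # True
print(datetime.date(2019, 7, 4) in pt)   # False

# And the full listings:
for day, name in sorted(us.items()):
    print(day, name)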

Here’s what that looks like for the U.S. vs. Portugal:

Holidays in The United States for 2019:
2019-01-01 New Year’s Day
2019-01-21 Martin Luther King, Jr. Day
2019-02-15 Susan B. Anthony Day
2019-02-18 Washington’s Birthday
2019-03-31 César Chávez Day
2019-04-01 César Chávez Day (Observed)
2019-05-27 Memorial Day
2019-07-04 Independence Day
2019-09-02 Labor Day
2019-10-14 Columbus Day
2019-11-11 Veterans Day
2019-11-28 Thanksgiving
2019-12-25 Christmas Day

Holidays in Portugal for 2019:
2019-01-01 Ano Novo
2019-03-05 Carnaval
2019-04-19 Sexta-feira Santa
2019-04-21 Páscoa
2019-04-25 Dia da Liberdade
2019-05-01 Dia do Trabalhador
2019-06-10 Dia de Portugal
2019-06-13 Dia de Santo António
2019-06-20 Corpo de Deus
2019-08-15 Assunção de Nossa Senhora
2019-10-05 Implantação da República
2019-11-01 Dia de Todos os Santos
2019-12-01 Restauração da Independência
2019-12-08 Imaculada Conceição
2019-12-24 Vespera de Natal
2019-12-25 Christmas Day
2019-12-26 26 de Dezembro
2019-12-31 Vespera de Ano novo

How each or any of these holidays may affect your business is going to be entirely related to your customers and the type of business you have, but it’s a start.

Are you curious about the different holidays in some of your target countries? Here’s a link to the Python notebook I used to generate the output above:
https://colab.research.google.com/drive/1kRL3-pLg0fbgD07gKMpjMy-BRl7D3MC2

Moreover, here’s a link to a five-minute guide that shows the ins and outs of using the holidays package:
https://towardsdatascience.com/5-minute-guide-to-detecting-holidays-in-python-c270f8479387

May your sales be strong, and your nerves even stronger this holiday season 🙂

Happy Holidays from all of us at Bennett Data Science!

Of Interest

Data Science Books you should read in 2020
As we get into the new year, here’s a jump on a few good reads to help you or your favorite data scientist kick off the 20’s!
https://towardsdatascience.com/data-science-books-you-should-read-in-2020-358f70e1d9b2

Optimizing Blackjack Strategy through Monte Carlo Methods
Ever wondered how exactly to play blackjack to maximize your chances of winning? This article gives all the Python code required to simulate thousands of hands, allowing you to change the way you “play” each hand and viewing the results. A Monte Carlo simulation approach relies on random sampling of a model, observing the rewards returned by the model, and collecting information during normal operation to define the average value of its states. The value of all possible combinations of player and dealer hands in Blackjack can be judged through repeated Monte Carlo simulations, opening the way for optimized strategies.
https://towardsdatascience.com/optimizing-blackjack-strategy-through-monte-carlo-methods-cbb606e52d1b

Using A.I. to Separate Songs into Their Individual Instruments
While not a broadly known topic, the problem of source separation has interested a large community of music signal researchers for a couple of decades now. It starts from a simple observation: music recordings are usually a mix of several individual instrument tracks (lead vocal, drums, bass, piano etc..). The task of music source separation is: given a mix can we recover these separate tracks.
https://deezer.io/releasing-spleeter-deezer-r-d-source-separation-engine-2b88985e797e


Machine Learning is Dead

By | Tech Tuesdays
Reading Time: 3 minutes

Companies are moving away from machine learning, quickly.

We see it every day; companies are leveraging the power of deep learning – that subset of A.I. based on neural networks that were designed to mimic how our brains learn and function. Deep learning models generally outperform classical machine learning methods. Drastically. There are quite a few good reasons for this, and I will discuss some of them here, along with the advances we’ve made for our clients over the past few years.

Where anything related to image processing is concerned, we can often use transfer learning to extract deep knowledge from images. Companies have released pre-trained deep learning models to the public. These models were trained on millions of examples and have “learned” an incredible amount about what’s in digital images. We use a technique called transfer learning to piggyback off this work, and it’s not only useful for images. We can also use pre-trained neural nets for natural language processing.

I’ve written here about the sometimes scary power of the biggest text models, like GPT-2. We can no longer tell whether a human or a machine wrote a block of text. Classical machine learning methods such as latent semantic analysis (LSA) are still powerful for text processing, but when the application allows, leveraging the deep power of the newest language models can provide that magic that standard machine learning methods cannot.

We’ve used image similarities to help our clients successfully search through millions of images for duplicates or similar images, in only a few milliseconds. This is possible because we no longer need to think of images as large matrices of red, green and blue color values. Using transfer learning, we reduce large, complex images to hundreds or thousands of numbers. Computers are quite good at handling arrays of numbers this way. This simplification gives us great freedom to compare images quickly and at scale.
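For the curious, here’s a hedged sketch of that pipeline using Keras’ publicly available ResNet50 weights: a pre-trained network turns each image into a 2,048-number vector, and “similar images” becomes a fast vector comparison. The model choice and sizes are illustrative, not the exact stack we deploy:

import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

# include_top=False + pooling="avg" yields one 2048-number embedding per image.
model = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def embed(images):
    """images: array of shape (n, 224, 224, 3), RGB pixel values."""
    return model.predict(preprocess_input(images.astype("float32")))

def most_similar(query_vec, catalog_vecs):
    """Cosine similarity against the whole catalog in one vectorized pass."""
    q = query_vec / np.linalg.norm(query_vec)
    c = catalog_vecs / np.linalg.norm(catalog_vecs, axis=1, keepdims=True)
    return np.argsort(c @ q)[::-1]   # catalog indices, best match first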

We built our text-based travel recommender (TRecs) atop the BERT deep learning language model as well as the LSA method mentioned above. It provides incredible fidelity, leveraging deep destination knowledge from our extensive use of transfer learning applied to our travel problem. We are able to search through thousands of destinations and identify the top 50 in only a few milliseconds.

As your company gets into A.I., are you using classical machine learning methods or have you started to leverage neural networks for their power? Neural networks aren’t appropriate to use for every case, but gone are the days when using their incredible power required weeks or months of training on millions of labeled examples, costing hundreds of thousands of dollars.

I recommend assessing your use of A.I. and speaking with someone familiar with neural networks to see if you may be missing a lot of potential upside. We’ve employed deep learning models for many recent clients, across all the major cloud platforms, with fantastic success. We’ve seen this innovative technology help companies increase engagement, improve personalization and drive revenue. Don’t be left behind.

Of Interest

From Google – A Simulation Platform for Recommenders
Significant advances in machine learning, speech recognition, and language technologies are rapidly transforming the way in which recommender systems engage with users. As a result, collaborative interactive recommenders (CIRs) — recommender systems that engage in a deliberate sequence of interactions with a user to best meet that user’s needs — have emerged as a tangible goal for online services. https://ai.googleblog.com/2019/11/recsim-configurable-simulation-platform.html

Here’s why so Many Data Scientists are Leaving Their Jobs
There are many reasons why data scientists leave their jobs. For example, many companies hire data scientists without a suitable infrastructure in place to start getting value out of AI. This contributes to the cold start problem in AI. Couple this with the fact that these companies fail to hire senior/experienced data practitioners before hiring juniors, and you’ve got a recipe for a disillusioned and unhappy relationship for both parties. Read more here: https://towardsdatascience.com/why-so-many-data-scientists-are-leaving-their-jobs-a1f0329d7ea4

The 5 Most Useful Techniques to Handle Imbalanced Datasets
Say you have a classification problem where you need to train a model to detect an event, say, churn. But most of your customers (lucky you) don’t churn. What do you do when most of your training data contains non-events (customers who didn’t churn)? This article dives into this problem, called class imbalance, and gives some ideas for handling it. Read more here: https://towardsdatascience.com/the-5-most-useful-techniques-to-handle-imbalanced-datasets-6cdba096d55a


Is Your Data Holding You Back?

By | Tech Tuesdays
Reading Time: 3 minutes

Want to know what holds projects back? Access to good, clean and reliable data.

Sure, it makes sense – we’ve got to get that new product or feature out, and like yesterday. But I’m not talking about endlessly sciencing away in the corner; I’m talking about building rock-solid data pipelines that feed your data science team. We often see 80-90% of our time spent cleaning and structuring data so we can use it for analysis, model building and display. Too often we spend weeks or months working with bad data.

And bad data delays projects that rely on predictive models. Here are a few reasons why:

  • Bad data leads to inaccurate models – before building models, data needs to be checked for outliers, distributions, collinearity and so much more. Without these checks, model performance suffers.
  • Bad data takes longer to process each time – imagine building a few different models using the same messy dataset that must be cleaned anew for each new model. Sure, data scientists can write code to handle the cleaning, but that’s not the place for it, and asking data scientists to maintain this sort of work is suboptimal. It’s much better to use a data engineer and put some process around the data warehousing efforts.
  • Bad data makes observing the data more cumbersome – think about your data dashboards. If you’re running a company, data dashboards report the health of sales, revenue, customer satisfaction and all sorts of other metrics and key performance indicators (KPIs). When your data are a mess, reporting suffers and mistakes frequently show up in dashboards.

The solution to all of this is to talk to the teams that use data in your company. Ask them what they’re doing and what format they’d like most of the data to be in. Then go create those data stores. Data science probably uses data differently than the rest of your company, and that’s ok. Let’s not force them (or any other team) to jump through hoops to do their jobs.

Ask your team leaders, “What architecture would give your team 80% of the data they need?” Listen to what they say and make the changes you can afford. This alone can save your organization huge sums of money!

In one case, we helped a client that had over 20 different data stores available to the data science team. Everything was stored in those tables. From a high level, it looked like a good approach; the team had access to everything it could possibly need. Upon closer look, however, it was information overload:

  • Queries hundreds of lines long, even for everyday tasks such as exploratory data analysis or rudimentary model building
  • Data heterogeneity exploded, as there was no single accepted way to pull certain types of standard data, such as user profiles or recent transactions

The result was huge delays to almost all data science efforts. And worse, frequent discussions on the validity of results (never mind the lack of experiment repeatability).

Simplifying this architecture took months, but when it was complete, team throughput rose by 75%. It was a massive win, but also bittersweet, since it could have been done so much better from the outset!

Of Interest

A Newsletter I Highly Recommend
My friend and colleague Luigi writes a weekly newsletter, ML in Production. Like this one you’re reading now, it comes out on Tuesdays. If you’re interested in some really important guidance on how to deploy, manage and think about your data science projects in production, this will be a wonderful addition to your weekly reading. https://mlinproduction.com/newsletter/

Recommender Models – Code and Comparison
Need to recommend the right thing to your customers? Need to rank items to show your users? Here’s a huge repo of recommender models along with benchmark comparisons of each against the MovieLens dataset. This is a highly valuable collection and could be the starting point for anyone working on product recommendations. https://github.com/microsoft/recommenders

Simulations for Recommendations
When you’re showing news (or other) articles to users in succession, it’s important to consider each new click when ranking content to determine what to show next. That’s exactly the challenge this codebase simulates so that data scientists can try various algorithms. See the code here: https://github.com/mjugo/StreamingRec