
Want to know what holds projects back? A lack of access to good, clean and reliable data.

Sure, the pressure makes sense: we’ve got to get that new product or feature out, and like yesterday. But I’m not talking about endlessly sciencing away in a corner. I’m talking about building rock-solid data pipelines that feed your data science team. We often see 80-90% of our time spent cleaning and structuring data so we can use it for analysis, model building and display. Too often we spend weeks or months working with bad data.

And bad data delays projects that rely on predictive models. Here are a few reasons why:

  • Bad data leads to inaccurate models – before building models, data needs to be checked for outliers, distributions, collinearity and so much more. Without these checks, model performance suffers.
  • Bad data takes longer to process each time – imagine building a few different models using the same messy dataset that must be cleaned each time, for each new model. Sure, data scientists can write code to handle the cleaning, but modeling code is not the place to do it, and asking data scientists to maintain this sort of work is suboptimal. It’s much better to bring in a data engineer and put some process around the data warehousing efforts.
  • Bad data makes observing the data more cumbersome – think about your data dashboards. If you’re running a company, data dashboards report the health of sales, revenue, customer satisfaction and all sorts of other metrics and key performance indicators (KPIs). When your data are a mess, reporting suffers and mistakes frequently show up in dashboards.
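
To make the first point concrete, here is a minimal sketch of two of the checks mentioned above: outliers and collinearity. It uses only the Python standard library, and everything in it is an assumption for illustration: the function names, the 1.5 × IQR outlier rule and the 0.95 correlation threshold are common defaults, not something taken from a particular project.

```python
from math import sqrt
from statistics import mean, quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / (sx * sy)


def collinear_pairs(columns, threshold=0.95):
    """Return column-name pairs whose |correlation| meets the threshold."""
    names = sorted(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                pairs.append((a, b))
    return pairs


# Hypothetical example data, made up for this sketch.
order_values = [10, 12, 11, 13, 12, 11, 10, 500]
features = {
    "price_usd": [1, 2, 3, 4],
    "price_cents": [100, 200, 300, 400],
    "visits": [5, 1, 4, 2],
}

print(iqr_outliers(order_values))
print(collinear_pairs(features))
```

Checks like these are cheap to run once, upstream, in the pipeline. That is the whole argument: run them where the data lands, not again inside every model notebook.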

The solution to all of this is to talk to the teams that use data in your company. Ask them what they’re doing and what format they’d like most of the data to be in. Then go create those data stores. Data science probably uses data differently than the rest of your company, and that’s ok. Let’s not force them (or any other team) to jump through hoops to do their jobs.

Ask your team leaders, “What architecture would give your team 80% of the data they need?” Listen to what they say and make the changes you can afford. This alone can save your organization huge sums of money!

In one case, we helped a client that had over 20 different data stores available to the data science team. Everything was stored in those tables. From a high level, it looked like a good approach; the team had access to everything it could possibly need. On closer inspection, however, it was information overload:

  • Queries hundreds of lines long, even for everyday tasks such as exploratory data analysis or rudimentary model building
  • Data heterogeneity exploded, as there was no single accepted way to pull certain types of standard data, such as user profiles or recent transactions

The result was huge delays to almost all data science efforts and, worse, frequent debates about the validity of results (never mind the lack of experiment repeatability).

Simplifying this architecture took months, but when it was complete, team throughput rose by 75%. It was a massive win, but also bittersweet, since it could have been done so much better from the outset!
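
What does "a single accepted way to pull standard data" look like in practice? Here is a rough sketch, not the client's actual setup: one small, shared function that everyone calls for recent transactions instead of each person writing their own query. The table name, columns and function names are all hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3


def recent_transactions(conn, user_id, limit=5):
    """The one agreed-upon way to pull a user's latest transactions."""
    rows = conn.execute(
        "SELECT id, amount, created_at FROM transactions "
        "WHERE user_id = ? ORDER BY created_at DESC, id DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()
    return [{"id": r[0], "amount": r[1], "created_at": r[2]} for r in rows]


def demo_connection():
    """Build a throwaway in-memory database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions ("
        "id INTEGER PRIMARY KEY, user_id INTEGER, "
        "amount REAL, created_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO transactions (user_id, amount, created_at) "
        "VALUES (?, ?, ?)",
        [
            (1, 10.0, "2024-01-01"),
            (1, 20.0, "2024-01-02"),
            (2, 99.0, "2024-01-03"),
            (1, 30.0, "2024-01-04"),
        ],
    )
    return conn


conn = demo_connection()
for row in recent_transactions(conn, user_id=1, limit=2):
    print(row)
```

The point of the design is that the definition of "recent transactions" lives in one place. When it changes, it changes for every analysis at once, which goes a long way toward settling arguments about whose numbers are right.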

Of Interest

A Newsletter I Highly Recommend
My friend and colleague Luigi writes a weekly newsletter, ML in Production. Like this one you’re reading now, it comes out on Tuesdays. If you’re interested in some really important guidance on how to deploy, manage and think about your data science projects in production, this will be a wonderful addition to your weekly reading.

Recommender Models – Code and Comparison
Need to recommend the right thing to your customers? Need to rank items to show your users? Here’s a huge repo of recommender models along with benchmark comparisons of each against the MovieLens dataset. This is a highly valuable collection and could be the starting point for anyone working on product recommendations.

Simulations for Recommendations
When you’re showing news (or other) articles to users in succession, it’s important to consider each new click when ranking content to determine what to show next. That’s exactly the challenge this codebase simulates so that data scientists can try various algorithms. See the code here: