Getting data science to work

Predictive models are more than just the predictions

A toppling bar chart

When building a predictive model, most junior (and many senior) data scientists fall into a trap of thinking that the success of their work depends precisely on how accurate the predictions are. Here is an example: a data scientist gets pulled into a project to “help predict which customers will most likely open a marketing email.” They spend weeks figuring out how to do it: linear regression or random forest? Which features should be included in the model? What time period should we build the prediction over? Should the model have a regularization component to avoid over-fitting? The data scientist combs Google Scholar, scours Cross Validated, and emails their colleagues for advice. Then comes narrowing down to a single model: this one has a better R² but that one includes fewer features. After deep pondering, a choice is made and the data scientist feels satisfied.

Unfortunately, the model never gets put to use, making the entire effort worthless.

A perfect case study for this is the Netflix prize: Netflix paid out a $1 million prize to the team that could best predict which movies people would enjoy. Unfortunately, the solution was never implemented. It turns out the solution that won the prize was so complex that the work involved for the engineering team to implement it wasn’t worth the accuracy gains. The winning solution was actually an ensemble of 107 different machine learning methods, and so trying to keep 107 methods running in production just didn’t make sense.

When building a predictive model, you can’t just go for the one that best fits the data; you need to design a model that people will want to implement. To do this, the model needs sign-off from many people in the organization. It needs sign-off from the engineering team who will be monitoring and maintaining the model. That often means implementing a simpler model: you can code a logistic regression in just about any system, but having a deep learning network run reliably on data stored in a legacy database may be a more daunting task. It also means creating a solution quickly: if you could spend three weeks using a logistic regression to get a “decent” solution, then why spend six months looking for a better one? By that time the business problem may no longer be relevant, or, whoops, your funding ran out months ago.
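To make the “you can code a logistic regression in just about any system” point concrete, here is a minimal sketch in plain Python with no libraries at all. Everything in it is an assumption for illustration: the single feature (a customer’s past open rate), the eight made-up data points, the learning rate, and the function names. It shows how little machinery a simple model needs; it is not a recommendation to hand-roll your own in production.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, x):
    # weights[0] is the intercept; the rest line up with the features in x.
    z = weights[0] + sum(w * xj for w, xj in zip(weights[1:], x))
    return sigmoid(z)


def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit a logistic regression by batch gradient descent on the log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = predict_proba(weights, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


# Made-up toy data: one feature (past open rate) and whether the email was opened.
past_open_rate = [[0.1], [0.2], [0.3], [0.4], [0.6], [0.7], [0.8], [0.9]]
opened = [0, 0, 0, 0, 1, 1, 1, 1]

weights = fit_logistic(past_open_rate, opened)
print("P(open | past open rate 0.9):", round(predict_proba(weights, [0.9]), 3))
print("P(open | past open rate 0.1):", round(predict_proba(weights, [0.1]), 3))
```

Once the weights are fit, scoring a customer is one weighted sum and one sigmoid, which is why this kind of model can be ported to SQL, a spreadsheet, or whatever the engineering team already maintains.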

Just as importantly, you need to have the business stake-holders buy in. These are the people who are going to be using the results of the model (and likely paying for the development). In my position this tends to be marketing executives who are going to use the tools to predict customer behavior. For others that could be the product design teams or the logistics departments. These people need to trust that the model will provide a better solution than not having one at all, which is not easy because models take work to maintain.

So how do you get business stake-holders on board? By presenting the model in a clear narrative. Show them, through PowerPoint or otherwise, what data you are using for the model and why you chose it. What choices did you make when building the model, and how do they affect the results? Does the model tend to do a good job at prediction, and how much potential revenue gain is there from having it in place? Does the feature importance in the model tell you anything about how you could be marketing differently?
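One number that tends to land well in that kind of presentation is lift: how much more often the customers the model ranks highest actually respond, compared with customers overall. The sketch below is one possible way to compute it. The function name, the 20% cutoff, and the ten scores and outcomes are all made up for illustration.

```python
def top_fraction_lift(scores, outcomes, fraction=0.2):
    """Response rate among the top-scored `fraction` of customers,
    divided by the overall response rate.

    Assumes at least one positive outcome, otherwise the overall rate is zero.
    """
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(len(ranked) * fraction))
    top_rate = sum(outcome for _, outcome in ranked[:k]) / k
    overall_rate = sum(outcomes) / len(outcomes)
    return top_rate / overall_rate


# Made-up model scores and whether each customer actually opened the email.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
opened = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

lift = top_fraction_lift(scores, opened, fraction=0.2)
print(f"Top 20% of customers open {lift:.1f}x as often as average")
```

A statement like “the customers we would target open emails about three times as often as average” is far easier for a marketing executive to act on than an R² value.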

You’d be amazed at how much colorful shapes can help sell your data science solution

If you can effectively get people to agree “yes this is a thing we should have,” it becomes much easier to put the model in place. People will be excited to start watching the results of the model help the business, and eager to get you working on other projects. You can’t get to this state until you have the ability to present your model to the people who will be using it, which means spending time on creating presentations around the work instead of prioritizing building a better model.

It’s often difficult to know when to stop: if you can always make the model a little better, why not do it? The best approach is to let outside constraints make the decision for you. If throughout your model development you keep a presentation of the model ready and update it whenever the model changes, then it’s easy to keep making the model better until some other factor (like your boss) requires the model to be put in place. This is most easily done using tools like R Markdown (R) or Jupyter Notebooks (Python) that let you easily export results with a click of a button.

So as you work on building new predictive capabilities for your business, remember that getting other people to want to use them is just as hard as creating them. While it’s easy (and fun!) to go deep into data science to try and build the best predictive model, don’t fail by forgetting that you’re working in a group of people who will need to be involved too.


If you want a ton of ways to help grow a career in data science, check out the book Emily Robinson and I wrote: Build a Career in Data Science. We walk you through getting the skills you need to be a data scientist, finding your first job, then rising to senior levels.