[This is the final post in my series on hiring data scientists. If you want an overview of the skills I look for in candidates, the archetypes of candidates I see, and the interview questions, start from the beginning.]
So we’ve had a candidate come in and do an interview. They’ve given good answers to our questions in math and statistics as well as programming and databases. They have effectively expressed that they understand how to think in a business context and have described previous experiences where they had to be smart and get things done. The final step before we hire them is to give them a case study: a controlled environment where we can see how well they perform on a real data science task.
The case study is designed to mimic what it is like to work as a data scientist as closely as possible. This minimizes the chance for us that the candidate will take the job and not be able to do it, and minimizes the chance for the candidate that we’ll surprise them with a job that’s less than what they wanted. What do we want to mimic about working on the data science team at my consulting firm? Our job involves:
We have a constraint that the candidate shouldn’t spend more than a weekend on this case study: they’re a human being with a life and we want to respect that. I’ve heard of companies having candidates spend weeks working for them in a temporary status to see if they are a good fit before hiring them full time, which seems cruel. I’ve also interviewed at companies that gave a case study but only an hour to work on it. That doesn’t give candidates an environment similar to the actual job, since no one really works under those absurd deadlines.
We give the case study as the last step in the process since we only want someone spending time on it if there’s a good chance we’ll hire them. I’ve seen companies that email every applicant a case study, and again, that’s highly immoral. There’s some deep truth in the fact that the interview process, which has a high power imbalance, is filled with immoral behavior. Let’s not dwell on that! 👉😎👉
Below, I’ll describe in more detail how we created our case study. Then I will discuss the process candidates go through to do the case study, and what makes the best candidates.
The case study is based on a real project we’ve done. We took the data from the project and heavily anonymized it. Then we ask the candidate to solve the same business problem that our client gave us. This ensures that the candidate isn’t dealing with a toy problem; it’s an actual real-world situation. There isn’t a single elegant solution to the case study because there wasn’t a single elegant solution to the actual project. The candidate has to decide how to overcome the lack of clarity.
The particular business problem was that an executive wanted a new marketing strategy based on the customer data their department collected. This was an extremely open-ended problem, and thus the case study is open-ended too. The candidate must decide how to think about the customer population and how to analyze the many facets of the data in a way that helps shape a strategy.
The data itself is messy. It spans four tables which are not straightforward to join. The columns come in many formats: dates, strings, numbers, and categorical variables, many of which have missing data or ambiguous meanings. None of these challenges were intentionally put in there; they are all artifacts of the original data.
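To make that kind of messiness concrete, here is a minimal sketch of the first checks someone might run on tables like these: how much of a column is missing, and how well the keys of two tables line up before joining them. The table and column names (`customers`, `orders`, `customer_id`, `cust`, `signup`) are made up for illustration and are not from the actual case study data.

```python
def missing_rate(rows, column):
    """Share of rows where the column is absent, None, or a blank string."""
    if not rows:
        return 0.0
    blanks = sum(
        1 for r in rows
        if r.get(column) is None or str(r.get(column)).strip() == ""
    )
    return blanks / len(rows)


def key_overlap(left_rows, right_rows, left_key, right_key):
    """Count distinct keys found in both tables and on only one side."""
    left = {r.get(left_key) for r in left_rows} - {None}
    right = {r.get(right_key) for r in right_rows} - {None}
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Hypothetical toy tables: the key has a different name in each table,
# and some values are missing in different ways.
customers = [
    {"customer_id": 1, "signup": "2019-01-02"},
    {"customer_id": 2, "signup": ""},
    {"customer_id": 3, "signup": None},
    {"customer_id": 4},
]
orders = [{"cust": 1}, {"cust": 1}, {"cust": 5}, {"cust": None}]

print(missing_rate(customers, "signup"))
print(key_overlap(customers, orders, "customer_id", "cust"))
```

Checks like these don’t solve anything on their own, but they show quickly which joins will silently drop rows and which columns need a question to the data owner.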
Before turning the original data into the case study data, I anonymized the heck out of it. I took a small random sample of the original data, then removed any reference to the actual company, the actual products, or the actual customers. I added tons of complex noise and complications to the data, going well beyond merely adding white noise. The goal was to (1) ensure there was no way someone could infer anything from the data, such as the location of military bases or customer sexuality, and (2) keep some of the underlying structure so it wasn’t only boring noise.
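As a rough illustration of the three steps in that paragraph (sample, strip identifying references, add noise), here is a deliberately simplified sketch. It is not the process actually used for the case study, which involved far more complex noise, and it is not a privacy guarantee on its own. The function name, field names, and the ±20% noise range are all assumptions chosen for the example.

```python
import random


def anonymize(rows, id_field, drop_fields, numeric_fields, sample_size, seed=0):
    """Sample rows, drop identifying fields, replace IDs, and perturb numbers."""
    rng = random.Random(seed)
    # Step 1: keep only a small random sample of the original rows.
    sample = rng.sample(rows, min(sample_size, len(rows)))

    tokens = {}  # real ID -> meaningless token, so repeated IDs stay consistent
    result = []
    for row in sample:
        # Step 2: remove fields that name the company, products, or customers.
        clean = {k: v for k, v in row.items() if k not in drop_fields}
        real_id = clean[id_field]
        if real_id not in tokens:
            tokens[real_id] = f"cust_{len(tokens):05d}"
        clean[id_field] = tokens[real_id]

        # Step 3: perturb numeric values (hypothetical +/-20% multiplicative noise).
        for field in numeric_fields:
            if clean.get(field) is not None:
                clean[field] = clean[field] * rng.uniform(0.8, 1.2)
        result.append(clean)
    return result


# Hypothetical input rows, invented for this example.
original = [
    {"customer_id": i, "name": f"Person {i}", "spend": 100.0, "segment": "a"}
    for i in range(50)
]
anonymized = anonymize(original, "customer_id", {"name"}, {"spend"}, sample_size=10)
print(anonymized[0])
```

Simple noise like this preserves rough structure, such as which segments spend more, but real anonymization needs much more care about what can be inferred by combining columns.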
If you’re making a case study, I cannot stress enough how important it is to get a problem and data that mimic a real project. Here is an example of a bad case study I received when interviewing. I was sent an Excel file with a single unlabeled table of numeric data. I was then told to fit the best model to it, without any context for what the data meant or what the business wanted to use it for. All they wanted me to do was use a machine learning algorithm on the data and send my predictions back. This would give them no indication of how good I was at cleaning data, finding insights, or presenting results effectively. Instead of doing the case study I actually removed myself from the interview process. Given how they ran it, it was clear they didn’t share my values.
Once we know the candidate is moving to the case study stage, we immediately send them the case study files and schedule the presentation. The files include a 3-page document explaining the context of the business problem, the objective the executive is trying to achieve, and a guide for what we are looking for in a successful presentation. When scheduling the presentation, we don’t worry if we are giving the candidate more than a weekend. It’s fine with us if they want to do all the work in one weekend or spread it out over several, as long as they limit the overall hours to “a weekend’s worth.”
The candidates can use whatever tools they desire: R, Python, Excel, SAS, Fortran, whatever. Since they don’t have much time to work on this, it doesn’t make sense for us to force them to use a tool they don’t know. The point of the case study is to see them working at their best. We do, however, ask that they share the code for the analysis with us through GitHub or email, so we can see how they structure code.
We intentionally do not include a data dictionary with the initial set of files, but do tell the candidate they can get the data dictionary from us if they email someone on our team. We also encourage them to email us with any questions they have about the data as they work on it. This forces them to communicate with us at least once during the case study, which lets us see how they format a request. The skill of communicating questions is important for our team since so much of our job requires us to communicate with client data owners.
When the candidate is ready to present, we schedule them to come in and present in front of employees at my consulting firm. Typically, there will be two people viewing the presentation: a senior member of the data science team and a senior consultant from the business side. The data science person is there to get a feel for how good the candidate’s technical skills are, whereas the business consultant is there to pretend to be the client. We give the candidate 20 minutes to go through their work, followed by 10 minutes of Q&A. During the Q&A, we ask them both technical questions about the techniques they chose and the insights they found, and business questions about what they think the analysis means for the broader business.
To succeed at this case study, the candidate has to do a number of things right. First, they have to do a good analysis. That means they’ve found insights in the data that are free of errors and meaningful to the business. Second, they have to be able to explain what they’ve done and why the business should care. They must create a compelling narrative that a person without data science expertise can understand, and use slides and visualizations to show it. Third, they have to be able to think on their feet during the Q&A section. They do not have to have all the answers, but they have to know when to elaborate on a point and when to say they don’t have an answer. Finally, they have to act professionally and courteously toward the people they are presenting to, which includes not getting defensive at difficult questions.
We typically see people struggle with the case study in a few different ways. The most common issue is a lack of narrative in the presentation. Instead of presenting an insight and what it means for the business, the candidate will present a bunch of graphs and leave it to the viewers to find the meaning in them. Another issue is balancing the technical level of the presentation: often candidates use really technical terms that the pretend-client doesn’t understand. Rarely do we have a candidate who makes it to the case study and performs poorly on the technical-skills side. Usually those people don’t make it to the case study in the first place.
Our hiring process is far from perfect, but the case study especially does a good job of helping us see what the candidate would be like when they are here. We’ve learned so much about people from these case studies, and the more we do, the more examples we have of different ways to approach the data. If you’re considering hiring a data scientist for your team, I think a case study is essential for your process too.
This concludes my series on hiring data scientists. I hope that after reading it, you feel better equipped to hire for your own team. Or if you’re a data science job applicant, you can use this as a guide to how a good team will be thinking about you. If you’ve found it helpful, please let me know! You can find my contact info on my website, or check out my Twitter.
If you want a ton of ways to help grow a career in data science, check out the book Emily Robinson and I wrote: Build a Career in Data Science. We walk you through getting the skills you need to be a data scientist, finding your first job, then rising to senior levels.