I thought it would be interesting to talk about a few dangers of data science. Here’s one: confirmation bias.
As a data scientist, you have a client. That client may be a colleague, a customer, or another piece of analytics. If you trace the paths far enough, you are going to find a group of people who have a vested interest in the results of your analysis. If you’re in the business world it’s all about the Benjamins, and if you’re in the academic world it revolves around tenure.
If you are training to become a data scientist, I hope somebody has told you that you will be pressured to produce a specific result by someone with a special interest. I guarantee it. At Nielsen, my group built the analytics systems that were used to carry out marketing return on investment studies for Nielsen clients: big Fortune 500 companies that spent millions or billions of dollars on advertising. A big part of an ROI study was to decompose sales by sales driver, for example: what were the sales due to TV? Pricing considerations? Facebook? Always, without exception, clients had expectations about what these numbers would be. If they decided to make a big push into digital during the previous year, you bet your ass they’re looking for high digital ROI. Or that ROI for “diet” advertising was higher than “classic”. Or that macroeconomic factors were what dragged down sales, and so on.
Moreover, there were always specific expectations from the team carrying out the analysis! The biggest one was that the results would be different, but not too different, from the results of the study carried out the year before. Different, so that the client would feel like they got their money’s worth, but not so different that the data, methodology, or modeling would be questioned or dismissed.
Let’s be clear: this is not a Nielsen-specific problem. In fact, our team and all of the modeling teams at Nielsen went to huge pains to make sure that these kinds of biases did not affect our findings, from incorporating these concerns into our training, to reporting in our software systems, to automation that would prevent even the temptation of fudging the numbers. We ran into cases where competitors were clearly putting a thumb on the scale to produce the result that was expected, and by holding firm we sometimes lost deals. You may find yourself in the same situation. By the way, the more complicated your data and model are, the easier it is to fudge!
Sometimes people make things complicated to avoid accountability.
If you are going to call yourself a data scientist, you are going to have to have a strong spine and not let these pressures get to you, or your models. You have an obligation to listen to those with domain expertise, to be realistic, and to understand that your model is just that: a model. A model is a representation of reality, and not reality itself. Human factors can and should affect how you build and reason about your models. Here’s the point: you need to do this in a level-headed, honest way, even if your client, or their client, doesn’t like it. Don’t just do what people tell you to do, or say what you are expected to say. Use your brain and use your data.