At Impact Makers, we’re about the people—the impact makers. We spent some time getting to know our former Data Scientist Jonathan Kuhn, PhD, whose stories reflect the past, the present and the future of data analytics.
IM: Why is data science important today?
Jonathan Kuhn: Twenty years ago, we lacked the computing firepower, methodology and software to analyze anything except simple categorical or numeric data. Today, we have tools to analyze all sorts of recorded history as though it were data. Each year we record orders of magnitude more data than the year before. With so much data and firepower, we have reached a time where we underestimate what we can do. We now study relationships every day that are so subtle that twenty years ago we would have described them as butterfly effects. The opportunities are obvious. The pitfalls are less obvious, but equally important. Data science will be responsible for sorting the opportunities from the pitfalls. I imagine that the California Gold Rush felt just like this.
Fascinating. Data is gold. Where do you envision the data science industry expanding and contracting?
JK: Like many disciplines, the focus of data science swings like a pendulum between rigor and discovery. We are pushing the extremes of discovery right now, but we’re seeing the need for rigor begin to pull the pendulum back. Another aspect of the pendulum pattern is explanatory vs. predictive models. Right now, we are heavily weighted towards predictive models.
The great news is that we have the history of the last pendulum swing to guide us through the next steps. The last time we were here we solved many of the issues of bias, power and multiple comparisons. These will once again be challenging but we have our experiences in the disciplines of probability and statistics to help us map the way forward.
Jonathan, you’re kind of the king of storytelling in the office. Can you share a story with us about a problem you solved with data science which had a big impact?
JK: I was developing methodologies to predict disease state changes for diabetes, COPD and a few other diseases. The challenge was to find new predictors. In order to do that, we created a hybrid simulation/predictive model. This is a variant of supervised machine learning, but since we needed the model to be explanatory, we created a model of models. It was an interesting process which ended up directing chart reviews which ended up improving prediction. Though it was successful, it felt kind of roundabout and onerous at the time.
Then I was presented with a problem which at first glance seemed super simple: identifying pharmacies that were diverting drugs, which means transferring controlled substances illegally for illicit use. There was one big problem: Though it was obvious that many pharmacies were diverting drugs, there was next to no information about which ones were diverting. Desperate to find a solution, I used the same process for disease state change with one notable difference: We made this into a game of cops and robbers. Instead of robbers, it was simulated diversions. I built simulations of the kinds of diversions I learned of from retired DEA agents who worked as consultants. Jason Owen, a friend of mine from grad school, played the role of cop. He used the structures I developed for disease state change in order to catch all the drug diverters.
Here is where it got fun. Jason would catch about 60% of my simulated diversions. But since we were using real-time active data, he was also catching real diverters. When he identified one, he would check with me. If it wasn’t me then we would analyze it and mark it for investigation. At first, he was mostly catching me. But I got smarter at my diversions and he improved the models. After a few rounds of this and a lot of automation we had adaptive models which were catching drug diversion in real-time.
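The inject-and-detect loop described above can be sketched in miniature. Everything below is hypothetical: the pharmacy names, the order volumes, the 3.5x spike pattern and the z-score detector are stand-ins for the real simulations and models, which the interview does not detail.

```python
import random

random.seed(7)

def simulate_diversion(orders):
    """Play the 'robber': pick one pharmacy and inflate its
    controlled-substance order volume (a made-up diversion pattern)."""
    tampered = dict(orders)
    culprit = random.choice(list(tampered))
    tampered[culprit] *= 3.5  # exaggerated ordering
    return tampered, culprit

def detect(orders, z_cut=2.0):
    """Play the 'cop': flag pharmacies whose volume is an outlier
    relative to the group (a toy stand-in for the real models)."""
    vals = list(orders.values())
    mean = sum(vals) / len(vals)
    sd = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5
    return {p for p, v in orders.items() if sd and (v - mean) / sd > z_cut}

# Baseline monthly order volumes for ten pharmacies (made up).
orders = {f"pharmacy_{i}": random.uniform(80, 120) for i in range(10)}

rounds = 50
caught = 0
for _ in range(rounds):
    tampered, culprit = simulate_diversion(orders)
    if culprit in detect(tampered):
        caught += 1

print(f"caught {caught}/{rounds} simulated diversions")
```

In the real project each round would also sharpen both sides: the simulated diversions get subtler and the detector gets retrained, which this sketch omits.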
In this scenario, I got to be the bad guy and now instead of one step behind the bad guys I was the smartest bad guy and the model was keeping up with me. It was both fun and rewarding.
You mentioned good guys and bad guys. What makes a good data scientist versus a bad data scientist?
JK: Any scientist who understands that there are generalists and specialists, and acts appropriately, will be a great data scientist. Conversely, any data scientist who has a success in one area and attempts to use the same template in other areas is doomed to learn the hard way. You know the adage: when you are carrying a hammer, everything looks like a nail? Data scientists are very prone to this issue. After a successful build of a data model or machine learning algorithm, it is easy to forget the decisions you made along the way to implementation.
A good data scientist guides the stakeholders through the process without miring them in technical details. It is important to understand whether their goal is explanatory or predictive. Learn the nuances of the situation and treat everything as a potential source of bias. Ask questions like: Why is that data missing? Are all outcomes recorded, or only the wins (losses)?
What are some misconceptions about data science?
JK: The biggest one that comes to mind is that data science is only about prediction and reporting. While not as sexy at the moment, descriptive models are a huge part of data science that isn’t getting much press.
You may wonder why descriptive models are important. Well, how do you know that the reason a predictive model is effective isn’t that it is modeling (and perpetuating) inappropriate social bias (inertia modeling)? Ignorance is not an excuse for producing models that perpetuate inappropriate biases. Also, descriptive models are responsible for numerous breakthroughs. So while we make process breakthroughs with predictive models, we can only improve our understanding with descriptive models.
I know you love buzzwords. Help us understand the connections among data analytics, data science and machine learning.
JK: Data science is really four different disciplines: analysis, predictive modeling, mining and science. Data analytics and machine learning are parts of data science, but they don’t form a complete package without science. On small teams, everyone must be at least competent in three of the disciplines. On large teams you can have specialists who act in only one of the disciplines. But all four disciplines must be represented.
Data analysis is the measurement of the strength of patterns, relationships and correlations, whether they were discovered by experience, theory, mining or experiment. Data mining is the search for patterns, relationships and correlations. Data mining uses analytics along the way to score the strength of the patterns, relationships and correlations it finds.
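As a concrete illustration of scoring the strength of a relationship, here is a minimal sketch that computes a Pearson correlation coefficient by hand on made-up data; the variables and the linear relationship are assumptions for the example, not anything from the interview.

```python
import random

random.seed(0)

# Toy data: a noisy linear relationship between a predictor and an outcome.
x = [i / 10 for i in range(100)]
y = [2.0 * xi + random.gauss(0, 1.0) for xi in x]

def pearson(a, b):
    """Pearson correlation coefficient: a standard score of the
    strength of a linear relationship, ranging from -1 to 1."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    norm_a = sum((ai - mean_a) ** 2 for ai in a) ** 0.5
    norm_b = sum((bi - mean_b) ** 2 for bi in b) ** 0.5
    return cov / (norm_a * norm_b)

r = pearson(x, y)
print(f"correlation strength r = {r:.2f}")
```

A mining step would search many candidate variable pairs; analysis is the part that scores each candidate with a measure like this one.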
Predictive modeling is a process that uses data mining and probability to forecast outcomes.
Machine learning (ML) is an automated method that accomplishes data mining and predictive modeling.
Science asks the simple question: “Could this pattern, relationship, correlation or result be reasonably well explained by random chance?” If it could, then it is not significant. While the question is simple, how to answer it is the subject of the disciplines of probability and statistics.
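One direct way to ask that question of a result is a permutation test: shuffle the group labels many times and see how often chance alone reproduces the observed gap. A minimal sketch on made-up samples (the data and group names are hypothetical):

```python
import random

random.seed(1)

# Two made-up samples. Did group B really outperform group A,
# or could the gap be reasonably well explained by random chance?
a = [12.1, 11.8, 12.4, 11.9, 12.0, 12.2, 11.7, 12.3]
b = [12.6, 12.9, 12.5, 13.0, 12.7, 12.8, 12.4, 13.1]

observed = sum(b) / len(b) - sum(a) / len(a)

pooled = a + b
hits = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(pooled)  # break any real group structure
    pa, pb = pooled[:len(a)], pooled[len(a):]
    if sum(pb) / len(pb) - sum(pa) / len(pa) >= observed:
        hits += 1

p_value = hits / trials
print(f"observed gap = {observed:.2f}, p = {p_value:.4f}")
```

A small p-value means random relabeling almost never produces a gap this large, so chance is a poor explanation; a large one means the "pattern" is not significant.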
Another aspect of data science is understanding and avoiding pitfalls such as bias, multiple comparisons, and kitchen-sink models. Again, these are all cases where the outcome could be reasonably well explained by random chance. Having experience with these phenomena helps design analytics, data collection and even guide ML algorithms to more fruitful and pragmatic outcomes.
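The multiple-comparisons pitfall in particular is easy to demonstrate by simulation: run enough tests on pure noise and some will look significant. A toy sketch, with every number made up (a rough two-sigma cutoff stands in for a proper hypothesis test):

```python
import random

random.seed(2)

def noise_comparison(n=30):
    """One comparison on pure noise: two samples from the same
    distribution, declared 'significant' if their means differ
    by more than ~2 standard errors of the difference."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    gap = abs(sum(a) / n - sum(b) / n)
    return gap > 2 * (2 / n) ** 0.5  # sd of a difference of means

trials = 1000
tests_per_trial = 20
trials_with_a_hit = sum(
    any(noise_comparison() for _ in range(tests_per_trial))
    for _ in range(trials)
)
rate = trials_with_a_hit / trials
print(f"{rate:.0%} of trials found a 'discovery' in pure noise")
```

Each individual comparison is false-positive only about 5% of the time, but across 20 comparisons the chance of at least one spurious "discovery" climbs past half, which is exactly why kitchen-sink models and unguarded mining produce results that chance explains perfectly well.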
Drive decisions by transforming numbers into meaning and action. Check out Impact Makers’ Data & Analytics services including data science.