On Data Science, The Mets, and Gerrymandering
[CRC Press] Why is data science growing as an academic subject area?
[Benjamin Baumer] There are a lot of reasons! The obvious one is that the job market for data science skills is so strong, and higher education is responding to that demand. At the same time, people are recognizing that even within a liberal arts curriculum, data science has a lot to offer. Indeed, at Smith we see data science as an integral part of the college's strategic plan for the future, and strengthening faculty and student capacities for critical analysis of data is a key component of that vision. Since virtually all disciplines collect data of some type, there is no limit to the interdisciplinary possibilities for data scientists, and these connections can be especially close in liberal arts colleges.
[CRC] What is the relationship between big data, computer science, and statistics?
[BB] Everyone seems to agree that data science is somewhere in the intersection of computer science and statistics. But I wonder whether we might be better served by seeing computer science and statistics as sub-disciplines within data science. Imagine if there were only three things: mathematical sciences, data science, and engineering. If you build things, maybe you identify with the engineers. If you work with data, maybe you identify with the data scientists. If your work is mostly theoretical, maybe you identify with the mathematicians. This type of realignment might be helpful in reducing some of the turf wars that we see springing up.
[CRC] According to one source, data scientist has been chosen as the best job in America two years in a row. What background does one need to be successful in this interdisciplinary field?
[BB] Certainly, one needs a solid foundation in both computer science and statistics. That said, one does not need to know everything that typically goes into those majors. In the Curriculum Guidelines for Undergraduate Programs in Data Science, we spent a lot of time thinking about this question. We tried to come up with a cohesive major in data science that integrated the most important concepts from each discipline. For computer science, programming, algorithms, and databases seemed essential, but we decided to make software engineering and systems architecture optional. In math and statistics, statistical modeling, statistical inference, and linear algebra seemed essential, but calculus and mathematical statistics were left optional. One thing that we all agreed upon was that some sort of applied capstone project in a specific application domain was a necessary component. In general, domain knowledge is really important, so whatever domain you end up working in, you need to learn about it in its own right.
[CRC] What made you become interested in data science and R?
[BB] Mostly natural curiosity. As we note in the book, we take a broad view of data science, and so I would say that data science is simply how I make sense of the world. As a scientist, I try to form opinions on the basis of the available evidence, and evidence in the form of data just happens to be the kind of evidence that makes the most sense to me. So I would say that I have *always* been interested in data science, even before it existed as a field.
Once I was exposed to R, there was no turning back. It has all of the attributes that you would want: open source, fully scriptable and reproducible, extensible, powerful, and backed by a huge, active community of developers and statisticians.
[CRC] You were a statistical analyst for the NY Mets and won the 2016 Contemporary Baseball Analysis award. How did you end up in that position? What aspects of the job did you find interesting and/or frustrating?
[BB] Here again, my job with the Mets was that of a data scientist, even before that job title was used. I worked the entire data analysis cycle: integrating raw data from a variety of sources, administering and developing our internal relational database management system, administering and developing our internal website, building statistical models for player evaluation and other purposes, and presenting those results to decision-makers. It was an incredibly rewarding experience that I wouldn't trade for anything.
When I came to Smith, that experience immediately informed my teaching, because I could easily navigate between what we were teaching and what I knew students needed to know. The increased interest in data science locally and nationally has really helped me find my place and understand my role in academia.
[CRC] What types of real-world problems does the book address?
[BB] We use data from a variety of fields in the book, including sports, politics, epidemiology, ecology, entertainment, travel, and of course Twitter. I think the gerrymandering example from the chapter on spatial data best captures what we hope to achieve. The real-world problem here is that while voters in North Carolina in the 2012 election were split roughly 50/50 in their votes for president, 10 of the 13 congressional seats went to one party. This is numerical evidence that there might be some imbalance in how the districts are constructed. To get this information, we need some data wrangling skills and some intuitive summary statistics. However, to really see what is going on, we need to incorporate spatial information about the districts. This requires a different set of skills since now we are working with spatial data. Bringing the numerical and the spatial data together allows us to construct a powerful visualization that brings the most important issues into stark relief. This is the full power of data science: the ability to bring disparate data sources together and transform them into meaningful information in context.
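The numerical side of the gerrymandering example above can be sketched in a few lines. What follows is a minimal illustration in Python, not code from the book. The function name `vote_and_seat_share` and the district counts are hypothetical: the counts are invented toy numbers chosen only to mimic the pattern described (an even statewide vote, a 10-to-3 seat split) and are not actual North Carolina returns.

```python
from typing import List, Tuple


def vote_and_seat_share(districts: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (statewide vote share, seat share) for party A.

    Each district is a (votes_for_A, votes_for_B) pair. A seat goes to
    whichever party has more votes in that district.
    """
    total_a = sum(a for a, _ in districts)
    total_b = sum(b for _, b in districts)
    seats_a = sum(1 for a, b in districts if a > b)
    return total_a / (total_a + total_b), seats_a / len(districts)


# Hypothetical toy data (NOT real election results): party A's voters are
# packed into 3 lopsided districts, while party B narrowly wins the other 10.
toy_districts = [(80, 20)] * 3 + [(41, 59)] * 10

vote_share_a, seat_share_a = vote_and_seat_share(toy_districts)
print(f"Party A vote share: {vote_share_a:.2f}")
print(f"Party A seat share: {seat_share_a:.2f}")
```

Comparing the two shares is the kind of intuitive summary statistic mentioned above. It flags a possible imbalance but cannot show how the district lines produce it, which is where the spatial data comes in.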
[CRC] How does coding factor into the book?
[BB] Extensively. We see coding as essential to the modern practice of data science, so code is interwoven into the text in all but the first two chapters. We see code as expressive in the same way as written English, so we strive to have meaningful interplay between the code we write, our discussion of how that code works and what it is doing, and the narrative we tell.
[CRC] Why is this book significant to today’s statisticians and scientists?
[BB] I get a lot of questions from people (students, faculty, and professionals) who are interested in data science but aren't sure how to start learning more about it. Most of the resources that are out there focus on learning a specific technical skill, whether it be a programming language, a particular package, or a statistical modeling technique. What we tried to do was to develop these skills together, in the context of a meaningful application. We realized early on that we would have to sacrifice depth in order to achieve the breadth that we wanted. My hope is that this book can serve as the primary text for a variety of introductory and intermediate courses, or for independent learners.