Guest blog post by Sarah Aerni
What is the profession of data science really about? How does one best become a data scientist or grow a career as one? What does the Data Science Central community think about these questions? (Please chime in!)
We’ve all read about the shortage of data scientists from McKinsey, heard about the salaries, and know about the volume of recruiter emails. As a practicing data scientist at Pivotal (a leading vendor in open source, big data platforms specifically used for data science), I was recently interviewed on careers in data science. Because it has been a popular topic on Data Science Central, I wanted to share some of this perspective and see what other practitioners thought.
What Is Data Science About? What Is The Heritage And Current Practice?
From my own historical perspective, looking back to high school algebra, no one told us math could predict when someone is headed to the ER or even the ICU, and help prevent it. Depending on when you graduated college, you may have heard about algorithmic trading or analyzing the human genome, but, until very recently, we certainly didn’t hear about sentiment analysis on social media or machine learning on sensor data. Now, anyone with internet access is exposed to apps driven by data science on YouTube, Facebook, Twitter, and their phone. Analytics are embedded in every part of our lives, personal and business.
Today, most of the people I know in data science enjoy blazing new trails with the latest technology and lots of data, solving problems that the world could never address before. Our data science practice at Pivotal isn’t about reports, basic analytics, or business intelligence on data sitting inside traditional enterprise apps like CRM, ERP, SCM, or anything else that took a paper-based workflow and stuck it in a database. While the data science heritage is most closely related to statistics, data science is more exploratory: our team is in search of new discoveries and “eureka” moments.
We don’t know what is possible when we start work: we have only a compass, not a map. We start by looking at ten terabytes of data from 20 different systems that no one has ever looked at holistically before. We let the data take us places and envision what is possible with it, challenging everything we find along the way. We only know the data can be used to uncover, interpret, and optimize things in new ways that create value. Then, we use math to create a new method to improve something. Outside of our tribe, people often need an example to really grasp the fact that we go beyond pie chart creation and forecast roll-ups. We generate those stories for them and, importantly, because we draw on a lot more information, we tell them how accurate those stories probably are and what they are likely to be in the future.
Looking To The Future Of Data Science
At Pivotal, I am engaged across a wide range of customers and problems and see several pragmatic themes for where the discipline is headed and how our careers may unfold:
1. Data science is creeping into every department and industry, much like business intelligence did as it emerged into the world and moved past the accounting department. Specialization in data science is happening, but the principles of data science will still be taken from one area and applied to another. There will still be great power in interdisciplinary thinking.
2. For data scientists, leadership skills such as communication will become more important. Data-driven business is where the world is going, and we provide guidance. We envision and communicate a possible outcome, bring various stakeholders and SMEs together, and make a case to operationalize our work with significant changes to teams, processes, and technologies.
3. Tenacity and curiosity will continue to drive our day-to-day agenda. We will forever be looking to figure out why models aren’t performing as well as expected, try others, and improve results. We continuously back up our logic under scrutiny from ourselves and stakeholders. Then, we will add more data and do it again, all the while holding on to a healthy dose of skepticism to keep us probing and questioning: Why did this work? Should I really believe it?
4. More and more, leading companies are putting data-driven, predictive, and prescriptive insights into real-time processes and other operational aspects of a company—embedded in mobile customer apps, connected to power generators, driving media purchase decisions, or optimizing supply chains. More processes are going to include this type of data-driven support.
5. We will have to wrangle even more data together, from more systems, in greater volumes, and across broader types. For example, we will all build models on customer-centric views consisting of one integrated dataset including web page, mobile app, log file, device signal, document, email, phone call, social media, video, and traditional business performance data. In medicine, that view could be of a patient; in a pharma company, it could be of a disease or drug.
6. Our tools and data platforms will continue to become more cost-effective and powerful. Algorithms will continue to evolve quickly due to online collaboration and sharing “in the wild.” For example, Pivotal really focuses on massively parallel processing, and our tools typically include SQL, R, MADlib, and Python. Then, something like Apache Spark™ comes into play, and we start using MLlib. Just days ago, our own Pivotal GemFire product was open-sourced into Project Geode. I can only imagine where the community will take it now.
What have I missed that you see in our future? Where do you agree or disagree? How do these outlooks impact our profession?
If you want to read a bit about how our data science practice works, you can read my interview on the Pivotal blog.