Industrialising Data Science

Guest blog post by Harry Powell

The application of pattern recognition technology to large datasets has revolutionised the digital economy. But digital represents only 5% of GDP in OECD countries: the remaining 95% is still largely untouched by data science (DS). The larger “old economy” companies are just beginning their data journey and data science is yet to be institutionalised: outside the tech leviathans, DS is still a cottage industry, with artisan DS crafting bespoke prototypes to their own standards.

If DS is to fulfil its promise, it needs to industrialise. This blog explains what I mean by this, and proposes a number of issues which must be addressed if it is to do so.

Most DS blogs are technical: algorithms, distributed computation, visualisation etc. The rest are case studies of projects where these techniques are applied to a domain. I would like to look beyond these issues, interesting and important as they may be, to the general structure of what we are trying to do here: applying automated pattern recognition systems to real world problems in an industrial way.

To the extent that there are any articles on industrialisation, they are written by consultants and business school people who have never written a Spark job or run a regression. This leaves them somewhat empty, often substituting jargon for insight. And they often take a polarised position for effect: “Data Science will save the world”; “Data Science is dead”; etc. So this is my attempt to frame a debate on the basis of practical experience. Let me know what you think.

The core questions are as follows: How can we turn an activity into an industry? How can we build a DS framework that meets the needs of businesses and people? How can we ensure quality? How can we build sustainably? How can we be responsible? How can we as DS benefit from the value we create?

The note below highlights a number of areas. I guess there will be more. My intention is to address each idea in turn in a blog of its own. At some point.

A common definition of Data Science

Everyone seems to mean something slightly different by the term “data science”. And then they disagree about who is a “real” data scientist and who is a fake. If we can’t agree on who we are and what we do, what hope have the businesses who might want to use our skills? We also need to be able to articulate why what we do is genuinely different to the analysts, data engineers, software developers and quants who used to occupy the “brilliant nerd” space, and who think they still do.

I am not even sure if Data Scientist and Big Data really deserve capitalisation.

My view is that there is a spectrum of different activities that could validly claim to be data science, from analysts who run decision tree models in R, to data engineers with a copy of Sqoop and Oozie, to guys building Bayes networks in Spark.

A common understanding of the dimensions and boundaries of our space, and a taxonomy of roles that exist within it will enable DS to explain ourselves and our value to the world. Without it the perception of DS will revert to the lowest common denominator.

Recognising the importance of production Data Science

In my view, the value of DS is not just in the impact of analytics, but in the ability to execute analytics at an unprecedented level of detail and to deliver that content direct to users on demand. Business Analytics serves (relatively few) actionable insights to (relatively few) managers. Data science technologies allow Amazon to recommend a different subset of products directly to millions of individuals. You can’t do that manually.

A key component of the value of DS is automation, the only way to bring the marginal cost of production low enough for this to be viable. In my team we put as much emphasis on automation as we do on pattern recognition and distributed computation. We don’t think about delivering actionable insights, but about applications which deliver actionable insights. I get the feeling that this is not universal yet.

Production is not just about delivery and latency. DS apply complex techniques to large data sets. If the results are going to be used in business-critical situations, how do you know they are right? Many DS still use hard-to-test code such as SQL. Some use testable code, but haven’t adopted test frameworks. Some are yet to use version control or continuous integration. Industrialisation of DS will mean raising our standards of quality to those of production software, and it is important to realise that we aren’t there yet.
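As a minimal sketch of what testable analytics code can look like, assume Python and a hypothetical `spend_per_customer` aggregation (the function name and the sample data are invented for illustration, not taken from any real pipeline). The logic lives in a plain function, so a test can pin down its behaviour on a tiny, hand-checked input before it is run at scale.

```python
from collections import defaultdict


def spend_per_customer(transactions):
    """Sum transaction amounts per customer.

    `transactions` is an iterable of (customer_id, amount) pairs.
    Hypothetical example logic, used only to illustrate testability.
    """
    totals = defaultdict(float)
    for customer_id, amount in transactions:
        # Fail loudly on bad input rather than silently producing a wrong total.
        if amount < 0:
            raise ValueError(f"negative amount for {customer_id}: {amount}")
        totals[customer_id] += amount
    return dict(totals)


def test_spend_per_customer():
    # Tiny input whose expected totals can be checked by inspection.
    sample = [("a", 10.0), ("b", 2.5), ("a", 5.0)]
    assert spend_per_customer(sample) == {"a": 15.0, "b": 2.5}


if __name__ == "__main__":
    test_spend_per_customer()
    print("ok")
```

A test function of this shape can be collected by a runner such as pytest and executed on every commit, which is where version control and continuous integration come in.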

Exploration vs Production

The recognition of production as a key element of value implies the following tension. On the one hand data science is exploratory: you are employing analytical techniques to play around with data to find out new things, and to do this you need freedom. On the other hand data science is production-oriented: you need to build robust applications which deliver consistent quality. Ultimately, the exploration can only be evaluated in production, and the exploratory phase must be adjusted in light of what you’ve learned. So you shouldn’t have two discrete phases where models are designed and then implemented. You need to design as you build and build as you design.

Industrial data science must reconcile this tension by developing practical approaches to team structure and working practices, and by adopting technologies that enable exploratory scientists to develop production code without compromising intellectual momentum. In particular we need to think about how the Agile Methodology can be adapted for data science, and how to document the process. My colleagues have been working on the Agile DS manifesto which I cannot wholeheartedly endorse, but I think it is a worthwhile first attempt (see http://www.datasciencemanifesto.org).

Embracing changing technologies

Techy nerds tend to discuss technological choices ad nauseam, and DS are no exception. Each DS joins a tribe and rants at the other tribe, betraying that the decision is not as important as the implied investment following from that decision. I defend Scala/R/Python not so much because of its superiority, but because I have spent years learning that technology, and if it were not the best, I would be an idiot. “People’s Front of Judea? Splitters!”

All of this is as it should be, and good fun. But there is a more important challenge for DS as an industry. DS has been built upon rapid technological change which shows no sign of slowing down. Open source technologies are engaged in chaotic, constant revolution. Even the technologies which win survive for only a few years before the world moves on. Anyone want to use MapReduce or Hive? Where will Spark and YARN be in a few years? What will happen to a business’s stock of analyses built in obsolete environments?

How should a business manage the adoption of technological change? To be clear, this is not a question of choosing individual technologies that will last: they won’t. How can a business embrace technological development and yet at the same time preserve the pre-eminence of industrial production? If a firm ignores change by picking winners and accepting long procurement and deployment cycles, it risks losing the race. If it adopts every technology that comes along, it will bequeath a hotchpotch of incompatible applications.

The answer probably lies in moving away from the specifics of individual implementations towards shared design standards, in particular a move towards APIs and reactive services, and common coding/development practices. The principles of Minimum Viable Product (MVP) and the Single Responsibility Principle (SRP) will become increasingly important, as will the reality of continuous improvement and refactoring. You need to design your applications so that components can be adapted to embrace new tech. What those practices are is up to us.
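One way to picture components that can be swapped as technology changes is a small interface that callers depend on, with implementations behind it that can be replaced. The sketch below assumes Python; `Recommender`, `PopularityRecommender`, `HistoryRecommender` and `homepage_items` are hypothetical names invented for illustration, not a description of any real system.

```python
from typing import Dict, List, Protocol, Set


class Recommender(Protocol):
    """Hypothetical contract: the only thing callers are allowed to rely on."""

    def recommend(self, user_id: str, k: int) -> List[str]:
        ...


class PopularityRecommender:
    """A first implementation: the same top-ranked items for every user."""

    def __init__(self, ranked_items: List[str]) -> None:
        self._ranked_items = ranked_items

    def recommend(self, user_id: str, k: int) -> List[str]:
        return self._ranked_items[:k]


class HistoryRecommender:
    """A later replacement: skips items the user already owns."""

    def __init__(self, ranked_items: List[str], owned: Dict[str, Set[str]]) -> None:
        self._ranked_items = ranked_items
        self._owned = owned

    def recommend(self, user_id: str, k: int) -> List[str]:
        seen = self._owned.get(user_id, set())
        return [item for item in self._ranked_items if item not in seen][:k]


def homepage_items(recommender: Recommender, user_id: str) -> List[str]:
    # The caller knows only the interface, so the engine behind it can change
    # without this function being rewritten.
    return recommender.recommend(user_id, k=2)


if __name__ == "__main__":
    items = ["lawn mower", "kettle", "gift box"]
    print(homepage_items(PopularityRecommender(items), "u1"))
    print(homepage_items(HistoryRecommender(items, {"u1": {"lawn mower"}}), "u1"))
```

The design choice here is that each class has one job and the calling code never names a concrete engine, so replacing one implementation with another built on newer technology is a local change.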

Data Science in old organisations

Data science offers great promise to businesses, but its implementation places equally great strain on those businesses’ organisations. For digital business models to generate returns they need to do things differently, and that’s hard.

As information becomes a core component of a firm’s products, so the problems of legacy data systems are intensified. Layering Hadoop, unstructured data and machine learning on top of a failing RDBMS with an outdated schema is a recipe for disaster. But fixing legacy systems is difficult. Your firm has probably already tried and failed. DS needs to reach out beyond its comfort zone to help data architects adapt their tried-and-mistrusted designs for the modern age. IMHO the answer does not lie in grandiose data lakes, but in simplifying data warehouses and building APIs to serve de-normalised data to users. My team has been developing a Spark design to achieve this, which we can share soon.

Equally, the Agile principles mentioned above may strain your firm’s current delivery architecture. Companies will fight to disintermediate and serve customers directly, and to do that, business units will want to control the full delivery pipeline themselves. If you want to try out a new product or a new price, you need to deliver it tomorrow, not in line with an existing quarterly release cycle. DS moves analytics to the front office. We need to become the product owners.

All this has unresolved implications for where DS teams sit within organisations: do they sit in IT, Analytics, or should they be embedded in the business itself? What is the balance between embedding for relevance and centralising for technical excellence? Can large organisations live with the freedoms (admin access, Linux, access to production data, etc.) expected and required by scientists but which are normally denied to analysts and developers? How can a business manage the potential for conflict with incumbent conventional IT and analytical functions who may want a piece of the Big Data action but perhaps don’t have the right expertise or working patterns?

In my experience, DS tend to spend too much of our energies navigating organisations inappropriately configured for innovation. As we industrialise, the question will be how much of the change needs to come from DS and how much from the organisations themselves.

Communication with business leaders

DS are sometimes unwilling communicators. We are often more interested in finding answers and building applications than telling the world about them. And when we do try to explain what we are doing, we often bore generalists with technical details. Most DS role specs include communication skills as a key attribute, but it is often traded off against technical skill when both aren’t available in one unit.

But should all the onus be on DS? Just as business organisations must adapt to the centrality of data to their product lines, so business leaders need to engage more deeply in technical issues. Many CEOs will defer to their CIO on all matters technical, challenging only on matters of budget and delivery. Can you imagine Zuckerberg being so passive? CEOs are not idiots, but they have become lazy and over-reliant on IT advisors. This is not good enough.

Communication is a reciprocal process. DS need to become better storytellers and business leaders need to become better storylisteners.

How can DS help the boardroom understand DS issues? It should not come down to individual charisma. In my experience business leaders learn from a combination of their own experience and stories: case studies, taught in business schools and written in journals like HBR. There is a real shortage of case studies which focus on the kind of engagement that CEOs need to have with our technology. Many of these articles are glorified sales pitches, where success is reported as having flowed exclusively from the engagement of some smart consultants (who wrote the case). But in reality, CEOs need to be able to understand what different DS approaches will do to their core business. They can’t just decide on budget and then outsource. Can you imagine doing that with your other core products?

Equally, we need to find a way of making this technology exciting without resorting to hype. How can we evangelise yet keep expectations realistic? How can we stay truthful and spot Big Data frauds (you know, the guys who hang around conferences and write presentations on Big Data, but are yet to run a MapReduce job)?

Part of this is developing a business language which embraces the technical nature of what we do. We need to find a way of communicating which does not shy away from the decisions and implications but which allows business leaders to link the technology challenges to business outcomes.

I think there is some way to go yet.

A career path for Data Scientists

If we are going to build a DS industry, we are going to need some data scientists (see taxonomy question above). Everyone knows about the supposed shortfall of analytical talent expected over the next 5 years. To meet this shortfall we need to find and develop talent, but first we will need to agree about the roles and skillsets that make up a data science team.

Firstly, what does it take to be a DS? What skills and aptitudes do they need to have? How can you identify a good one? What credentials should a DS have? In particular can you create a DS via a specialised course of learning like an MSc in Data Science (I am not so sure about this one)?

Secondly, how can businesses offer DS career paths in an organisation? DS are human capital. When they walk out of the door they take value with them. Businesses have become bad at managing technology talent, encouraged by the offshoring/outsourcing narrative in which technical talent is a commodity. In my experience, DS are remarkably long term in their outlook. Most of them want to make a difference in their organisations, and they recognise that this can take time to effect. To keep these people, non-tech businesses need to develop career paths that recognise the contributions that DS make over the long term.

The days when talent was prepared to work for nothing in bad conditions for the sake of an interesting problem are over. At tech-enlightened firms you can do cool work and put your kids through college: you don’t have to become a general manager to go up the pay scale. Businesses that don’t address this may struggle in a data-science-led world.

But it’s not just down to the businesses to change. If DS is to industrialise, we need to come to a common understanding of what is expected of the DS and what is expected of the business. If you want to make an impact with DS in a corporation, you may have to wear a shirt and turn up before 10am. It’s a tough world out there.

The economics of Data Science

Every business function has to pull its weight and data science is no exception. At least DS are in a good position to measure their impact and so make a claim on the value they generate. But perhaps a more difficult question is that of resource allocation. Where should DS resource be applied in order to generate greatest value? In particular, is it better to have lots of DS applications which perform moderately, or should resource be concentrated on a few very highly performant applications? Conventionally, businesses are thought to display diminishing marginal returns; that is, for each additional unit of effort, you get a lower return. Some people summarise this in an “80:20” rule. You don’t get much benefit from going for the final few percentage points of performance.

But it is not clear that all DS applications exhibit this behaviour. In my opinion (but it would be great to measure this, somehow) the value of a personalisation engine is not in doing the easy part (I am a man therefore I like lawn mowers) but in the difficult part (Each Friday I travel to an area where there is a mosque, therefore I might want to buy a present at Eid).
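To make the allocation question concrete, here is a toy calculation in Python. It assumes, purely for illustration, that the value of an application is its effort raised to a power `alpha`; the power-law form, the budget and the number of applications are all made-up assumptions, not measurements.

```python
def total_value(efforts, alpha):
    """Total value when each application returns effort ** alpha.

    alpha < 1 models diminishing marginal returns, alpha > 1 increasing
    returns. The power-law form is an illustrative assumption only.
    """
    return sum(effort ** alpha for effort in efforts)


BUDGET = 10.0  # hypothetical units of DS effort
N_APPS = 5     # hypothetical number of candidate applications

# Option 1: lots of applications which each perform moderately.
spread = [BUDGET / N_APPS] * N_APPS
# Option 2: all effort concentrated on one highly tuned application.
concentrated = [BUDGET] + [0.0] * (N_APPS - 1)

if __name__ == "__main__":
    for alpha in (0.5, 2.0):
        print(
            f"alpha={alpha}: spread={total_value(spread, alpha):.2f}, "
            f"concentrated={total_value(concentrated, alpha):.2f}"
        )
```

With `alpha = 0.5` (diminishing returns) spreading the effort gives the higher total, while with `alpha = 2.0` (returns that grow with effort, as suggested for the personalisation engine) concentrating it wins. Which regime a real application sits in is exactly what would need to be measured.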

As we industrialise, we will need to help businesses allocate scarce resources to DS projects.

Data Science and society

DS like to get their hands on data and make use of it for interest and profit. But we are increasingly aware of our duties to the originators of that data (in some sense the “owners” of that data). There are all sorts of issues around consent and privacy and a significant risk that data, if it gets into the wrong hands, can be used to the detriment of those people.

DS have a duty of care to protect private data, and yet we have neither a common, agreed set of values about this nor an equivalent set of practical steps to ensure that data isn’t compromised.

An industry body to promote Data Science

The standard of public debate about DS is low, with hysterical newspapers making hyperbolic claims about the danger of this new technology. This may well become more problematic as the DS industry establishes itself and the technology becomes pervasive. DS needs to find some kind of forum through which it can get its message across, and an industry body might be a good way to do this.

 

I guess there may be a few more Big Issues. Let me know what you think.

I will do what I can to address these issues in more depth in the coming weeks.
