2014 In Review: Five Data Science Trends

2014 was another transformative, exciting year for data science. Summarizing even one measly year of progress is difficult! Here, in no particular order, are five trends that caught my attention. They are likely to continue in 2015.

Adoption of higher productivity analytics programming environments. Traditional languages and environments such as C, C++, and SAS are diminishing in importance as R, Python, and Scala ascend. It is not that data scientists are dumping the old stuff; it is that a flood of new data scientists has entered the fray, overwhelmingly choosing more modern environments. These newer systems provide language conveniences as well as a rich set of built-in (or easy to install) libraries that handle higher abstraction analytics-related tasks. Modern data scientists don’t want to write CSV read routines, JSON parsers, SQL INSERTs or logging systems. R is notable in the sense that its productivity gains come from its packages and community support, not from the language itself. R’s clunkiness will be its downfall as Python, Scala, and other languages gain greater traction within the analytics community and narrow the libraries gap.
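To make the "built-in libraries" point concrete, here is a minimal sketch using only Python's standard library. The sample data, the table name, and the helper function names are all made up for illustration; the point is simply that CSV reading, SQL INSERTs, JSON output, and logging each take a line or two rather than a hand-written routine.

```python
import csv
import io
import json
import logging
import sqlite3

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("trends")

# Hypothetical sample data standing in for a real CSV file.
RAW_CSV = "item,count\napples,3\npears,5\nplums,2\n"


def load_rows(text):
    """Parse CSV text into a list of dicts with the standard csv module."""
    return list(csv.DictReader(io.StringIO(text)))


def store_rows(conn, rows):
    """Insert the parsed rows into an SQLite table using named placeholders."""
    conn.execute("CREATE TABLE IF NOT EXISTS inventory (item TEXT, count INTEGER)")
    conn.executemany(
        "INSERT INTO inventory (item, count) VALUES (:item, :count)", rows
    )
    conn.commit()


def largest_item(conn):
    """Return the row with the highest count as a plain dict."""
    row = conn.execute(
        "SELECT item, count FROM inventory ORDER BY count DESC LIMIT 1"
    ).fetchone()
    return {"item": row[0], "count": row[1]}


rows = load_rows(RAW_CSV)
conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
store_rows(conn, rows)
log.info("stored %d rows", len(rows))
summary = json.dumps(largest_item(conn))
```

None of this is specific to Python; R and Scala offer comparable conveniences through their own packages.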

Machine learning is king. Data science, however you define it, is a broad discipline, but the thunder belongs to machine learning. Or rather, machine learning is becoming synonymous with data science. Optimization, for example, is increasingly being described as a backend algorithm supporting prediction or classification. Objective functions are described as “learning functions”, and so on. We are collectively inventing definitions and classifications as we go along, so in some sense this is merely semantics. There is a real problem, however: we risk ignoring the collective wisdom of past masters, throwing rather broad but shallow techniques at models with rather particular structure. Somewhere in a coffee shop right now, somebody is using a genetic algorithm to solve a network model.

Visualization is everywhere. To the point that it’s starting to get annoying. Whether in solutions such as Tableau or TIBCO Spotfire, add-ons such as Excel Power View or Frontline’s XLMiner Visualization app (plug!), or programming libraries such as matplotlib or d3.js, the machinery to build good visualizations is maturing. The explosion of infographics in popular media has raised expectations: users expect visualizations to guide and inform them at all stages of the analytics lifecycle. As you have no doubt seen, the problem is that it is so easy to build shitty viz: bar charts with no content; furious heat maps signifying nothing. We’ll start to see broader, if unarticulated, consensus on appropriate visualizations and visual metaphors for the results of quantitative analysis. I hope.

Spark is supplanting Hadoop. Apache Spark wins on performance, has well-designed streaming, text analytics, machine learning, and data access libraries, and has huge community momentum. This was all true at the beginning of 2014, but now at the end of 2014 we are starting to see a breakthrough in industry investment. Hadoop isn’t going anywhere, but in 2015 many new, “big time” big data projects will be built on Spark. The more flexible graph-based pipeline at the heart of Spark is begging for great data scientists to exploit – what will 2015 bring?

Service oriented architectures are coming to analytics. The ubiquity of REST-based endpoints in web development, combined with a new culture of code sharing, has engendered a new “mixtape” development paradigm. A kid (or even an older guy…) can whip out their MacBook, create a web service on Django, deploy it on AWS using Elastic Beanstalk, connect it to interactive visualizations in an afternoon, and submit an app that night. Amazon has built a $1B+ side business on the strength of cloud-only services, and these same forces will drive analytics forward. The prime mover in data science is not big data. It is cloud. RESTful, service-based analytics services will explode.
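For a sense of how little code a service-based analytics endpoint needs, here is a rough sketch. It deliberately uses Python's standard `http.server` module rather than Django or Elastic Beanstalk, so that it runs with no setup. The `/score` path, the `score` function, and the "model" (a plain mean) are hypothetical stand-ins, not anything from a real product.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def score(payload):
    """Hypothetical 'model': return the mean of the submitted values."""
    values = payload["values"]
    return {"mean": sum(values) / len(values)}


class ScoreHandler(BaseHTTPRequestHandler):
    """Answers POST /score with a JSON result; everything else is a 404."""

    def do_POST(self):
        if self.path != "/score":
            self.send_error(404, "unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            result = score(json.loads(self.rfile.read(length)))
        except (ValueError, KeyError, TypeError, ZeroDivisionError):
            # Malformed JSON, a missing "values" key, or an empty list.
            self.send_error(400, "expected a JSON object with a values list")
            return
        body = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def make_server(port=0):
    """Build (but do not start) a server; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), ScoreHandler)
```

Calling `make_server(8000).serve_forever()` would start it locally, after which any client that can POST JSON (a dashboard, a notebook, another service) can consume the result. A production service would of course sit behind a real framework and deployment platform.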

Author: natebrix

Follow me on twitter at @natebrix.

5 thoughts on “2014 In Review: Five Data Science Trends”

  1. Love this set – and would add that demand for higher productivity and machine learning also support the growing self-service data prep category – where machine-driven algorithms and semantic/syntactic understanding across varied data sets provide analysts with clarity about what their data looks like, how dirty or inconsistent it might be, how it can be better shaped, how they might combine varied data together easily, etc.

    It will be critical that end-user data prep is part of an analytic workflow, where data is brought together as quickly as BI questions are being asked…as opposed to something that is only done at the beginning of the exercise. I think 2015 will see this category explode in the same way we saw data visualization and discovery in 2011-14.

  2. “R’s clunkiness will be its downfall as Python, Scala, and other languages … narrow the libraries gap” Could be – but R is gaining tools (magrittr and pipeR packages) that allow simplified (or at least more readable) code using pipes. It’s also not entirely clear that the library gap is closing all that much. Do the other languages have the equivalent of Hadley Wickham’s plyr, dplyr, ggvis etc. packages?

    “Optimization, for example, is increasingly being described as a backend algorithm supporting prediction or classification.” True among the ML crowd; the rest of us still see it mainly as a way to solve tangible/real-world problems (such as optimizing the investment in hardware for the ML folks to play with).

    “Somewhere in a coffee shop right now, somebody is using a genetic algorithm to solve a network model.” Sadly true … and after wasting a few hours convincing him-/herself that the problem is NP-hard and therefore cannot be solved exactly (regardless of problem dimensions). I suspect, though, that an ML person would choose a neural net in preference to a GA.🙂
