Analytics Decathlon: 10 tasks every pro should know

I tried to think of 10 fundamental tasks that every analytics programmer should know how to do. I kept the list task-oriented (“do this”) rather than concept-oriented (“understand this”), and I tried to make sure it covers data preparation, carrying out a computation, sharing results, and code maintenance. Here goes:

  1. Read data from a CSV file.
  2. Sort a large multi-keyed dataset.
  3. Roll up numerical values based on a hierarchy. For example, given sales figures for all US grocery stores, produce state- and national-level sales.
  4. Create a bar chart with labels and error bars. Make sure the chart presents information clearly and beautifully. Read Tufte.
  5. Create a histogram with sensible bins. I include a second visualization item not only because histograms are so frequently used, but also because thinking about how to bin data causes one to think more deeply about how results should be summarized to tell a story.
  6. Perform data classification. Classification algorithms place items into different groups based on similarity. Several different popular machine learning approaches focus on this problem, for example k-means and decision trees. 
  7. Fit a linear regression model.
  8. Solve a linear programming problem. I am an optimization guy, so this should not surprise you. Optimization is underutilized, which is strange considering it sits atop Analytics Mountain.
  9. Invoke an external process. Call an arbitrary executable program, preparing the necessary input, processing the output, and handling any errors.
  10. Consume and publish projects from a source control repository. I use “source control repository” loosely – simply: you need to know how to share code with others. For example: GitHub, SourceForge, or CRAN.
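
As a rough illustration of tasks 1 through 3, here is a minimal Python sketch using only the standard library. The sample data, the column names (state, store, sales), and the function names are all made up for the example, and the in-memory sort assumes the dataset fits in RAM; a truly large dataset would need an external sort or a database.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data standing in for a real CSV file on disk.
SAMPLE_CSV = """state,store,sales
WA,Seattle-01,120.5
CA,Fresno-02,80.0
WA,Spokane-03,45.25
CA,Oakland-01,99.75
"""


def read_rows(text):
    """Task 1: parse CSV text into a list of dicts, converting sales to float."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["sales"] = float(row["sales"])
    return rows


def sort_rows(rows):
    """Task 2: sort on two keys, state ascending and then sales descending."""
    return sorted(rows, key=lambda r: (r["state"], -r["sales"]))


def roll_up(rows):
    """Task 3: aggregate store-level sales into state and national totals."""
    by_state = defaultdict(float)
    for row in rows:
        by_state[row["state"]] += row["sales"]
    return dict(by_state), sum(by_state.values())


rows = sort_rows(read_rows(SAMPLE_CSV))
state_totals, national_total = roll_up(rows)
print(state_totals)
print(national_total)
```

To read from an actual file, the same `csv.DictReader` can be given a file object opened with `open(path, newline="")` instead of the `io.StringIO` wrapper.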

It’s even better if you know how each of these tasks is actually implemented (at a high level)!
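
To give one example of what “at a high level” might mean, a one-variable linear regression (task 7) comes down to the closed-form least-squares formulas: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the means. This is only a sketch; the function name `fit_line` and the data are invented for illustration, and in practice you would normally call a library routine.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = a*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x; zero means every x is identical.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    # Sum of cross-deviations of x and y.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points lying exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```

Writing this out once makes it easier to see what the library call is doing, and why it fails when all the x values are the same.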

I intentionally skipped a few items:

  • Read XML. The key is to be able to process and produce structured data. Naturally tabular data, which can always be written out to CSV, seems to be more important in my experience.
  • Regular expressions. Really handy, but not vital. If you are focusing exclusively on text analytics then the situation is different.
  • Programming language X. This is worthy of a separate post – but I think it is unwise from a professional development and productivity standpoint to be religious about any particular programming language or environment: C++, .NET, Java, Python, SAS, R, MATLAB, AMPL, GAMS, etc. Not all languages and environments are created equal, but no single environment provides everything an analytics pro needs in all situations (at this point). Those who claim that a particular environment or language is uniquely qualified to perform a scientific programming task are frequently unfamiliar with the alternatives.
  • Writing unit tests. I am a huge proponent of writing unit tests and test-driven development, but this is not as important for consultants or academics. My hope is that the thought of sharing code (number 10) scares most people enough to make sure that their code is correct and presentable.

This list is meant to provoke discussion. What do you think? What’s missing? What’s wrong?

Author: natebrix

Follow me on Twitter at @natebrix.

4 thoughts on “Analytics Decathlon: 10 tasks every pro should know”

  1. Great list, but I’d argue that tests are equally important for academics as long as they deal with any software development. Errors even in published papers are not rare (Erwin was writing about one recently), and unit tests can at least reduce the number of implementation mistakes.

  2. You missed “clean data” (or maybe “identify outliers”, although to my mind not all outliers are errors).

    Also, would “listen to customers” and “verbally explain results” fit? (I stuck in “verbally” because you’ve already mentioned graphics.)

    1. Fair enough! My list was pretty focused on the technical side, and I think “clean data” definitely merits a spot. I could probably do a whole separate list of 10 “soft skills” – and the two points you mentioned would both make it, along with project management and a few others besides. Something to think about on the drive to work tomorrow…
