I tried to think of 10 fundamental tasks that every analytics programmer should know how to do. I’m trying to keep it task-oriented (“do this”) rather than concept-oriented (“understand this”). In thinking about this list I tried to make sure that I accounted for data preparation, carrying out a computation, sharing results, and code maintenance. Here goes:
- Read data from a CSV file.
- Sort a large multi-keyed dataset.
- Roll up numerical values based on a hierarchy. For example, given sales figures for all US grocery stores, produce state- and national-level sales.
- Create a bar chart with labels and error bars. Make sure the chart presents information clearly and beautifully. Read Tufte.
- Create a histogram with sensible bins. I include a second visualization item not only because histograms are so frequently used, but also because thinking about how to bin data causes one to think more deeply about how results should be summarized to tell a story.
- Perform data classification. Classification algorithms assign items to groups. Several popular machine learning approaches focus on this problem, for example decision trees when the groups are known in advance, and k-means (strictly speaking a clustering method) when groups must be discovered from similarity alone.
- Fit a linear regression.
- Solve a linear programming problem. I am an optimization guy, so this should not surprise you. Optimization is underutilized, which is strange considering it sits atop Analytics Mountain.
- Invoke an external process. Call an arbitrary executable program, preparing the necessary input, processing the output, and handling any errors.
- Consume and publish projects from a source control repository. I use “source control repository” loosely: you need to know how to share code with others. For example: GitHub, SourceForge, or CRAN.
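
To make the first three tasks concrete, here is a minimal sketch in Python using only the standard library. The sample data, the column names (`state`, `store`, `sales`), and the helper names are all made up for illustration, and the sort assumes the dataset fits in memory, which a truly large dataset may not.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data standing in for a CSV file on disk.
SAMPLE_CSV = """state,store,sales
WA,Seattle-01,120.5
CA,Fresno-02,80.0
WA,Spokane-03,99.5
CA,Oakland-01,150.0
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, converting sales to float."""
    reader = csv.DictReader(io.StringIO(text))
    return [{**row, "sales": float(row["sales"])} for row in reader]


def sort_rows(rows):
    """Multi-key sort: state ascending, then sales descending."""
    return sorted(rows, key=lambda r: (r["state"], -r["sales"]))


def roll_up(rows):
    """Roll store-level sales up to state totals and a national total."""
    by_state = defaultdict(float)
    for row in rows:
        by_state[row["state"]] += row["sales"]
    return dict(by_state), sum(by_state.values())


rows = sort_rows(read_rows(SAMPLE_CSV))
state_totals, national_total = roll_up(rows)
print(state_totals, national_total)
```

For a real file, `open(path, newline="")` would replace the `io.StringIO` wrapper. Data that does not fit in memory would call for a database or an external sort instead.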
It’s even better if you know how each of these tasks is actually implemented (at a high level)!
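
As one example of knowing the implementation, here is a rough sketch of simple (one-variable) linear regression by ordinary least squares, written from the textbook formulas rather than taken from any particular library. The function name `fit_line` and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points that lie exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

In practice you would reach for a statistics package, which also reports standard errors and handles multiple predictors. The point here is only that the core computation is a few sums.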
I intentionally skipped a few items:
- Read XML. The key is to be able to process and produce structured data. Naturally tabular data, which can always be written out to CSV, seems to be more important in my experience.
- Regular expressions. Really handy, but not vital. If you are focusing exclusively on text analytics then the situation is different.
- Programming language X. This is worthy of a separate post – but I think it is unwise from a professional development and productivity standpoint to be religious about any particular programming language or environment: C++, .Net, Java, Python, SAS, R, Matlab, AMPL, GAMS, etc. Not all languages and environments are created equal, but no single environment provides everything an analytics pro needs in all situations (at this point). It is frequently the case that those who claim that a particular environment or language is uniquely qualified to perform a scientific programming task are unfamiliar with the alternatives.
- Writing unit tests. I am a huge proponent of writing unit tests and test-driven development, but this is not as important for consultants or academics. My hope is that the thought of sharing code (number 10) scares most people enough into making sure that their code is correct and presentable.
This list is meant to provoke discussion. What do you think? What’s missing? What’s wrong?