Programming for Data Scientists – Types of Programming

This post is part of a series that discusses programming for data scientists. Let’s discuss the different types of programming tasks a data scientist is likely to encounter: scripting, applications programming, and systems programming. These are general-purpose terms that take on specific meaning when applied to data science.

Data scientists often start their careers doing a lot of scripting. Scripting generally involves issuing a sequence of commands to link data and external processes. In other words, it is the glue that holds a larger system together. For example, a script may retrieve poll results from a number of different websites, consolidate key fields into a single table, then push a CSV version of this table to a Python script for analysis. In another system, a script may generate AMPL input files for an optimization model, invoke AMPL, and kick off a database upload script on completion. Python, R, and Matlab all provide REPL environments that can be used for scripting computationally oriented tasks. Good scripters are highly skilled and highly productive; however, a formal education in software engineering principles is not required. Most scripts are intended to have a finite (and short) lifespan and to be used by only a few people, though it is often the case that these assumptions are incorrect!
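As a minimal sketch of this kind of glue, the following consolidates two differently shaped poll feeds into one CSV. The sources and field names are hypothetical, and the in-memory records stand in for what a real script would fetch over HTTP:

```python
import csv
import io

# Stand-ins for poll results fetched from two different websites;
# note that each site uses its own field names.
source_a = [{"candidate": "Smith", "pct": 48}, {"candidate": "Jones", "pct": 52}]
source_b = [{"who": "Smith", "share": 51}, {"who": "Jones", "share": 49}]

def consolidate(a, b):
    """Map each source's fields onto a single common schema."""
    rows = [{"source": "A", "candidate": r["candidate"], "pct": r["pct"]} for r in a]
    rows += [{"source": "B", "candidate": r["who"], "pct": r["share"]} for r in b]
    return rows

def to_csv(rows):
    """Render the consolidated table as CSV text for the next stage."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["source", "candidate", "pct"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

csv_text = to_csv(consolidate(source_a, source_b))
print(csv_text)
```

The interesting (and fragile) part of real glue scripts is exactly this field mapping: each upstream source has its own schema, and the script's job is to normalize them.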

Applications programming involves producing a system to be used by a client such as a paying customer, a colleague in another department, or another analyst. Some applications programming scenarios start as scripts. For example, the AMPL “connector script” described above may be turned into a full Java module installed on a server that handles requests from a web page to invoke Gurobi. In many cases, the “model building” phase in an analytics project is essentially an applications programming task. An important aspect of applications programming is that the client has a different set of skills than you do, and therefore a primary focus is to present a user interface (be it web, command-line, or app-based) that is productive for the client to use. Unlike a script, an application may have several tiers of functionality that are responsible for different aspects of the system. For example, a three-tiered architecture is so named for its principal components: the user interface component, the data access component, and the business rules component that mediates between the user and the data required to carry out the task at hand. Each of these components, in turn, may have subcomponents. In our Java/Gurobi example above, there may be subcomponents for interpreting and checking user input, preparing the data necessary for the optimization model, creating and solving the optimization model, and producing a solution report in a user-friendly form.
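That subcomponent breakdown can be sketched in outline. The function names and the toy “model” below are purely illustrative, and the solve step is stubbed out rather than actually invoking a solver such as Gurobi:

```python
def validate_input(raw):
    """User-interface tier: interpret and check the client's request."""
    if "demand" not in raw:
        raise ValueError("missing 'demand'")
    return {"demand": float(raw["demand"])}

def prepare_data(request):
    """Data access tier: assemble the data the model needs (stubbed)."""
    return {"capacity": 100.0, "demand": request["demand"]}

def solve_model(data):
    """Business-rules tier: build and solve the optimization model.
    A real implementation would call a solver; here we stub the answer."""
    return {"shipped": min(data["demand"], data["capacity"])}

def format_report(solution):
    """Present the solution in a user-friendly form."""
    return f"Plan ships {solution['shipped']:g} units."

def handle_request(raw):
    """The pipeline a web request would flow through, tier by tier."""
    return format_report(solve_model(prepare_data(validate_input(raw))))

print(handle_request({"demand": "80"}))
```

The point of the structure is that each tier can be developed, tested, and replaced independently; for instance, the stubbed `solve_model` could be swapped for a real solver call without touching the other tiers.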

Systems programming deliverables are intended for other programmers to use. They provide a set of programming interfaces or services that allow a programmer to carry out a related set of tasks. Traditionally, systems programming has been associated with computer operating systems such as Windows and Linux. However, systems programming has long been a critical component in analytics solutions. A classic example is the LAPACK linear algebra library, designed by Jack Dongarra and others. LAPACK has in turn served as the basis for countless other libraries, ranging from signal processing to quadratic programming. NumPy is a more recent example. If you are building a numerical library of your own, you are undertaking a systems programming task. Hallmarks of well-designed systems libraries are consistent naming, coherent API design, and high performance. While such qualities are also important for applications, they are not as critical because the aim is different.
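This layering is easy to see from Python: NumPy’s linear algebra routines are backed by LAPACK, so a one-line call rests on decades of systems programming.

```python
import numpy as np

# Solve the linear system A x = b. Under the hood, numpy.linalg.solve
# dispatches to a LAPACK driver routine.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)
print(x)  # [2. 3.]
assert np.allclose(A @ x, b)
```

The systems programmer’s consistent naming and careful API design are what make it possible for the applications programmer to treat all of this as one reliable call.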

Designers of systems libraries often have deep software engineering experience, as well as technical depth in the area for which the library is intended. It’s easy to fall into the trap of believing that systems programming is therefore inherently more challenging or superior. It’s not that simple. An amusing entry in the Tao of Programming that contrasts the challenges of applications and systems programming makes this point well! Systems programmers often make awful applications programmers, and vice versa. Both camps have the potential required for both types of programming; it is simply that they often have not developed the muscles required for the other discipline, similar to sprinters and long-distance runners. As a data scientist, you may be asked to sprint, run a marathon, or something in between, so get used to it!


Programming for Data Scientists – Guidelines

As a data scientist, you are going to have to do a lot of coding, even if you or your supervisor do not think that you will. The nature of your coding will depend greatly on your role, which will change over time. For the foreseeable future, an important ingredient for success in your data science career will be writing good code. The problem is that many otherwise well-prepared data scientists have not been formally trained in software engineering principles, or even worse, have been trained to write crappy code. If this sounds like your situation, this post is for you!

You are not going to have the time, and may not have the inclination, to go through a full crash course in computer science. You don’t have to. Your goal should be to become proficient and efficient at building software that your peers can understand and your clients can effectively use. The strategy for achieving this goal is purposeful practice by writing simple programs.

For most budding data scientists, it is a good idea to develop an understanding of the basic principles of software engineering, so that you will be prepared to succeed no matter what is thrown at you. More important than the principles themselves is to put them into practice as soon as you can. You need to write lots of code in order to learn to write good code, and a great way to start is to write programs that are meaningful to you. These programs may relate to a work assignment, to an area of analytics that you know and love, or to a “just for fun” project. Many of my blog posts are the result of me trying to develop my skills in a language that I’m trying to learn, or not very good at. A mistake many beginners make is to be too ambitious in their “warm-up” programs. Your warm-up programs should not be intellectually challenging, at least in terms of what they are actually trying to do. They should focus your attention on how to effectively write solutions in your programming environment. Doing a professional-grade job on a simple problem is a stepping stone to greater works. Once you nail a simple problem, choose another one that exercises different muscles, rather than a more complicated version of the problem you just solved. If your warm-up was a data access task, try charting some data, or writing a portion of an algorithm, or connecting to an external library that does something cool.
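A warm-up in this spirit can be as small as parsing a CSV and summarizing one column. The data below is made up, and the point is not the arithmetic but doing a clean, complete job on a tiny task:

```python
import csv
import io
from statistics import mean

# A tiny "data access" warm-up: parse a CSV and summarize one column.
data = """city,temp
Austin,31
Boston,22
Chicago,24
"""

reader = csv.DictReader(io.StringIO(data))
temps = [float(row["temp"]) for row in reader]
print(f"{len(temps)} cities, mean temp {mean(temps):.1f}")
```

Even at this scale there are professional habits to practice: using the standard `csv` module instead of splitting strings by hand, converting types explicitly, and formatting the output for a reader.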

There are many places where you can find summaries of software engineering basics: books, online courses, and blogs such as Joel Spolsky’s (here’s one example). Here I will try to summarize a few guidelines that I think are particularly important for data scientists. Many data scientists already have formal training in mathematics, statistics, economics, or a hard science. For this group, the basic concepts of computing (memory, references, functions, control flow, and so on) are easily understood. The challenge is engineering: building programs that will stand the test of time.

Be clear. A well-organized mathematical paper states its assumptions, uses good notation, has a logical flow of connected steps, organizes its components into theorems, lemmas, corollaries, and so on, succinctly reports its conclusions, and produces a valid result! It’s amazing how often otherwise talented mathematicians and statisticians forget this once they start writing code. Gauss erased the traces of the inspiration and wandering that led him to his proofs, preferring instead to present a seamless, elegant whole with no unnecessary pieces. We needn’t always go that far, whether in math or in programming, but it’s good to keep this principle in mind: don’t be satisfied with stream-of-consciousness code that runs without an error. A number of excellent suggestions for writing clear code are given here [pdf], in particular to use sensible names and to structure your code logically into small pieces. 

Keep it simple. Don’t write code you think you will need, write the code you actually need. Fancy solutions are usually wrong, or at least they can be broken up into simpler pieces. If you have made it too simple, you will figure it out soon enough, whereas if you have made things too complicated you will be so busy trying to fix your code that you may never realize it.

Pretend that you are your own customer. If you are writing a library for others to use, you can start by writing example programs that use the library. Of course at the beginning these examples won’t work – the point is to force yourself to understand how your solution will be used so that you can make it as easy and fun to use as possible. It also forces you to think about what should happen in the case of user or system errors. You may also discover additional tasks that your solution should carry out in order to make life simple. These examples may pertain not only to the entire solution, but to small portions of your solution. By writing tests and examples early – that is, by practicing unit testing and test driven development – you can ensure high quality from the start.
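As a sketch of what “client-first” development might look like, here is a hypothetical library function `zscores` (the name and behavior are illustrative, not from any real library) together with the example-as-test that would be written first:

```python
from statistics import mean, pstdev

def zscores(values):
    """The library function itself: standardize a list of numbers.
    Thinking like a client forces the error cases to the surface."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    m, s = mean(values), pstdev(values)
    if s == 0:
        raise ValueError("values are constant; z-scores undefined")
    return [(v - m) / s for v in values]

def test_zscores_centered():
    """The example program a client would write -- it doubles as a unit test."""
    zs = zscores([1.0, 2.0, 3.0])
    assert abs(mean(zs)) < 1e-12  # standardized values are centered at zero

test_zscores_centered()
print("ok")
```

Writing `test_zscores_centered` before `zscores` existed is what would force the design questions: what happens on a single value? On constant input? Those decisions end up in the function's first few lines.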

Learn how to debug. Most modern languages are associated with development environments that have sophisticated debuggers. They allow you to step through your code bit by bit, inspecting the data and control flow along the way. Learn how to set breakpoints, inspect variables, step in and out of functions, and all of the keyboard shortcuts associated with each. Computational errors in particular can be very hard to catch by simply reviewing code, so you’ll want to become adept at using the debugger efficiently.
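In Python, for example, you can drop into the built-in debugger with a single `breakpoint()` call. The function below is an illustrative sketch of where you might pause to watch a running total, with the debugger commands noted in comments:

```python
def moving_average(xs, window):
    """Sliding-window average. If this returned surprising results,
    pausing inside the loop lets you watch 'total' evolve:
    uncomment breakpoint(), then in pdb use n (next line),
    p total (print a variable), and c (continue)."""
    out = []
    total = 0.0
    for i, x in enumerate(xs):
        # breakpoint()  # uncomment to pause here under pdb
        total += x
        if i >= window:
            total -= xs[i - window]  # drop the element leaving the window
        if i >= window - 1:
            out.append(total / window)
    return out

print(moving_average([1, 2, 3, 4, 5], 3))  # [2.0, 3.0, 4.0]
```

Off-by-one errors in windowed computations like this are a classic case where stepping through two or three iterations in the debugger beats staring at the code.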

Write high performance code. I wrote about this subject in a previous post. The key is to measure. Rico Mariani does an awesome job of describing why measurement is so important in this post. Rookie data scientists frequently spend too much time tuning their code without measuring.
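The standard library’s `timeit` module makes measuring easy. This sketch times two equivalent ways of summing squares before deciding whether either needs tuning:

```python
import timeit

# Measure before you tune: time two ways of summing squares.
setup = "xs = list(range(10_000))"
loop_version = "total = 0\nfor x in xs:\n    total += x * x"
sum_version = "total = sum(x * x for x in xs)"

t_loop = timeit.timeit(loop_version, setup=setup, number=200)
t_sum = timeit.timeit(sum_version, setup=setup, number=200)
print(f"explicit loop: {t_loop:.3f}s  sum+generator: {t_sum:.3f}s")
```

The numbers, not your intuition, should decide whether this code is worth optimizing at all; often the measurement shows the “slow” part is somewhere else entirely.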

Add logging and tracing to your code. Your code may run as part of a web application, a complicated production system, or may be self-contained, but it’s always a good idea to add logging and tracing statements to your code. Logging is intended for others to follow the execution of your code, whereas tracing is for your own benefit. Tracing is important because analytics code often has complicated control flow and is more computationally intensive than typical code. When users run into problems, such as incorrect results, sluggish performance, or “hangs”, often the only thing you have to go on are the trace logs. Most developers add too little tracing to their code, and many add none at all. Just like the rest of your code, your trace statements should be clear, easy to follow, and tuned for the application. For example, if you are writing an optimization algorithm then you may wish to trace the current iterate, the error, the iteration number, and so on. Sometimes the amount of tracing information can become overwhelming, so add switches that help you control how much is actually traced.
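A minimal sketch in Python using the standard `logging` module: the logger level acts as the switch, per-iteration tracing goes to DEBUG, and the summary goes to INFO. The bisection routine is illustrative:

```python
import logging

# The level is the trace "switch": set it to logging.INFO to silence
# the per-iteration detail without touching the algorithm.
logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("solver")

def bisect_sqrt(target, tol=1e-8):
    """Find sqrt(target) by bisection, tracing each iterate."""
    lo, hi = 0.0, max(1.0, target)
    for it in range(200):
        mid = (lo + hi) / 2
        err = mid * mid - target
        # Trace the iteration number, current iterate, and error.
        log.debug("iter=%d x=%.10f err=%.3e", it, mid, err)
        if abs(err) < tol:
            break
        if err > 0:
            hi = mid
        else:
            lo = mid
    log.info("converged after %d iterations", it)
    return mid

root = bisect_sqrt(2.0)
print(f"{root:.6f}")
```

When a user later reports a “hang” or a wrong answer, this per-iteration trace is often the only evidence you will have of what the algorithm actually did.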

In future posts in this series, I will talk about developing as part of a team, choice of language and toolset, and the different types of programming tasks a data scientist is likely to encounter over the course of their career.