Programming for Data Scientists – Types of Programming

This post is part of a series on programming for data scientists. Let’s discuss the different types of programming tasks a data scientist is likely to encounter: scripting, applications programming, and systems programming. These are general-purpose terms that take on specific meanings when applied to data science.

Data scientists often start their careers doing a lot of scripting. Scripting generally involves issuing a sequence of commands to link data and external processes. In other words, it is the glue that holds a larger system together. For example, a script may retrieve poll results from a number of different websites, consolidate key fields into a single table, then push a CSV version of this table to a Python script for analysis. In another system, a script may generate AMPL input files for an optimization model, invoke AMPL, and kick off a database upload script on completion. Python, R, and Matlab all provide REPL environments that can be used for scripting computationally oriented tasks. Good scripters are highly skilled and highly productive; however, a formal education in software engineering principles is not required. Most scripts are intended to have a finite (and short) lifespan and to be used by only a few people, though these assumptions often turn out to be incorrect!
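To make the poll example concrete, here is a minimal sketch of that kind of glue script. Everything in it is an assumption made for illustration: the two sources, their column names, the numbers, and the function names `consolidate` and `to_csv` are all hypothetical, and the "websites" are replaced by in-memory strings so the script runs on its own.

```python
import csv
import io

# Hypothetical raw poll results, standing in for data fetched from two websites.
# Each source uses its own column names, which is typical of glue-code work.
SOURCE_A = "candidate,pct,date\nSmith,48,2016-10-01\nJones,44,2016-10-01\n"
SOURCE_B = "name,support,poll_date\nSmith,46,2016-10-03\nJones,47,2016-10-03\n"

# Map each source's column names onto one shared set of key fields.
FIELD_MAPS = {
    "site_a": {"candidate": "candidate", "pct": "percent", "date": "date"},
    "site_b": {"name": "candidate", "support": "percent", "poll_date": "date"},
}


def consolidate(sources):
    """Merge rows from every source into one list of dicts with shared keys."""
    rows = []
    for site, text in sources.items():
        mapping = FIELD_MAPS[site]
        for record in csv.DictReader(io.StringIO(text)):
            row = {new: record[old] for old, new in mapping.items()}
            row["source"] = site  # remember where each row came from
            rows.append(row)
    return rows


def to_csv(rows):
    """Serialize the consolidated table as CSV text for the next tool."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=["source", "candidate", "percent", "date"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    table = consolidate({"site_a": SOURCE_A, "site_b": SOURCE_B})
    print(to_csv(table))
```

In a real script the strings would come from HTTP requests and the CSV would be written to a file or piped to the analysis step. The shape stays the same: fetch, reconcile field names, hand off.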

Applications programming involves producing a system to be used by a client such as a paying customer, a colleague in another department, or another analyst. Some applications programming scenarios start as scripts. For example, the AMPL “connector script” described above may be turned into a full Java module installed on a server that handles requests from a web page to invoke Gurobi. In many cases, the “model building” phase in an analytics project is essentially an applications programming task. An important aspect of applications programming is that the client has a different set of skills than you do, and therefore a primary focus is to present a user interface (be it web, command-line, or app-based) that is productive for the client to use. Unlike a script, an application may have several tiers of functionality that are responsible for different aspects of the system. For example, a three-tiered architecture is so named for its principal components: the user interface component, the data access component, and the business rules component that mediates between the user and the data required to carry out the task at hand. Each of these components, in turn, may have subcomponents. In our Java/Gurobi example above, there may be subcomponents for interpreting and checking user input, preparing the data necessary for the optimization model, creating and solving the optimization model, and producing a solution report in a user-friendly form.
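The three-tier idea can be sketched in a few lines. This is a toy under stated assumptions, not the Java/Gurobi system described above: the class names, the supplier data, and the "pick the cheapest supplier" rule are all hypothetical stand-ins for a real data store and a real optimization model.

```python
class DataAccess:
    """Data tier: the only code that knows where the data lives."""

    def __init__(self, unit_costs):
        # Hypothetical in-memory store; a real tier might query a database.
        self._unit_costs = dict(unit_costs)

    def unit_costs(self):
        return dict(self._unit_costs)


class BusinessRules:
    """Middle tier: checks the request and decides what to do with the data."""

    def __init__(self, data):
        self._data = data

    def cheapest_order(self, quantity):
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        costs = self._data.unit_costs()
        supplier = min(costs, key=costs.get)  # stand-in for a real model
        return {
            "supplier": supplier,
            "quantity": quantity,
            "total": costs[supplier] * quantity,
        }


def command_line_ui(rules, raw_text):
    """UI tier: parses what the client typed and formats a readable answer."""
    try:
        quantity = int(raw_text)
        result = rules.cheapest_order(quantity)
    except ValueError as err:
        return f"Could not process request: {err}"
    return (
        f"Order {result['quantity']} units from {result['supplier']} "
        f"for {result['total']:.2f}"
    )


if __name__ == "__main__":
    rules = BusinessRules(DataAccess({"acme": 2.5, "globex": 2.0}))
    print(command_line_ui(rules, "10"))
```

The point of the separation is that each tier can change on its own: swapping the command-line function for a web handler, or the dictionary for a database, should not touch the business rules.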

Systems programming deliverables are intended for other programmers to use. They provide a set of programming interfaces or services that allow a programmer to carry out a related set of tasks. Traditionally, systems programming has been associated with computer operating systems such as Windows and Linux. However, systems programming has long been a critical component in analytics solutions. A classic example is the LAPACK linear algebra library, designed by Jack Dongarra and others. LAPACK has in turn served as the basis for countless other libraries, ranging from signal processing to quadratic programming. NumPy is a more recent example. If you are building a numerical library of your own, you are undertaking a systems programming task. Hallmarks of well-designed systems libraries are consistent naming, coherent API design, and high performance. While such qualities are also important for applications, they are not as critical because the aim is different.
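As a small illustration of what consistent naming and API design mean in practice, here is a sketch of a toy vector library. The `vec_` prefix, the function names, and the conventions are assumptions invented for this example; this is not how LAPACK or NumPy are organized, and a pure-Python version like this makes no attempt at high performance.

```python
"""A hypothetical toy library: every public function follows the same rules.

Conventions (assumed for this sketch):
  * public names start with vec_
  * vector arguments come first, scalars last
  * inputs are never modified; results are always new objects
  * mismatched lengths always raise ValueError
"""


def _check_same_length(x, y):
    """Shared validation so every function fails the same way."""
    if len(x) != len(y):
        raise ValueError(f"length mismatch: {len(x)} != {len(y)}")


def vec_add(x, y):
    """Return the elementwise sum of x and y as a new list."""
    _check_same_length(x, y)
    return [a + b for a, b in zip(x, y)]


def vec_scale(x, alpha):
    """Return a new list with every element of x multiplied by alpha."""
    return [alpha * a for a in x]


def vec_dot(x, y):
    """Return the dot product of x and y."""
    _check_same_length(x, y)
    return sum(a * b for a, b in zip(x, y))


if __name__ == "__main__":
    print(vec_add([1, 2], [3, 4]))
    print(vec_scale([1, 2], 3))
    print(vec_dot([1, 2], [3, 4]))
```

A programmer who has learned one of these functions can guess how the others behave, which is much of what makes a library pleasant to build on.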

Designers of systems libraries often have deep software engineering experience, as well as technical depth in the area for which the library is intended. It’s easy to fall into the trap of believing that systems programming is therefore inherently more challenging or superior. It’s not that simple. An amusing entry in the Tao of Programming that contrasts the challenges of applications and systems programming makes this point well! Systems programmers often make awful applications programmers, and vice versa. Both camps have the potential required for both types of programming; it is simply that they often have not developed the muscles required for the other discipline, much like sprinters and long-distance runners. As a data scientist, you may be asked to sprint, run a marathon, or something in between, so get used to it!

Author: natebrix

Follow me on Twitter at @natebrix.
