As a data scientist, you are going to have to do a lot of coding, even if you or your supervisor do not think that you will. The nature of your coding will depend greatly on your role, which will change over time. For the foreseeable future, an important ingredient for success in your data science career will be writing good code. The problem is that many otherwise prepared data scientists have not been formally trained in software engineering principles, or even worse, have been trained to write crappy code. If this sounds like your situation, this post is for you!
You are not going to have the time, and may not have the inclination, to go through a full crash course in computer science. You don’t have to. Your goal should be to become proficient and efficient at building software that your peers can understand and your clients can effectively use. The strategy for achieving this goal is purposeful practice by writing simple programs.
For most budding data scientists, it is a good idea to try to develop an understanding of the basic principles of software engineering, so that you will be prepared to succeed no matter what is thrown at you. More important than the principles themselves are to put them into practice as soon as you can. You need to write lots of code in order to learn to write good code, and a great way to start is to write programs that are meaningful to you. These programs may relate to a work assignment, or to an area of analytics that you know and love, or a “just for fun” project. Many of my blog posts are the result of me trying to develop my skills in a language that I’m trying to learn, or not very good at. A mistake many beginners make is to be too ambitious in their “warm up” programs. Your warm up programs should not be intellectually challenging, at least in terms of what they are actually trying to do. They should focus your attention on how to effectively write solutions in your programming environment. Doing a professional-grade job on a simple problem is a stepping-stone for greater works. Once you nail a simple problem, choose another one that exercises different muscles, rather than a more complicated version of the problem you just solved. If your warm up was a data access task, try charting some data, or writing a portion of an algorithm, or connecting to an external library that does something cool.
There are many places where you can find summaries of software engineering basics: books, online courses, and blogs such as Joel Spolsky’s (here’s one example). Here I will try to summarize a few guidelines that I think are particularly important for data scientists. Many data scientists already have formal training in mathematics, statistics, economics, or a hard science. For this group, the basic concepts of computing (memory, references, functions, control flow, and so on) are easily understood. The challenge is engineering: building programs that will stand the test of time.
Be clear. A well-organized mathematical paper states its assumptions, uses good notation, has a logical flow of connected steps, organizes its components into theorems, lemmas, corollaries, and so on, succinctly reports its conclusions, and produces a valid result! It’s amazing how often otherwise talented mathematicians and statisticians forget this once they start writing code. Gauss erased the traces of the inspiration and wandering that led him to his proofs, preferring instead to present a seamless, elegant whole with no unnecessary pieces. We needn’t always go that far, whether in math or in programming, but it’s good to keep this principle in mind: don’t be satisfied with stream-of-consciousness code that runs without an error. A number of excellent suggestions for writing clear code are given here [pdf], in particular to use sensible names and to structure your code logically into small pieces.
Keep it simple. Don’t write code you think you will need, write the code you actually need. Fancy solutions are usually wrong, or at least they can be broken up into simpler pieces. If you have made it too simple, you will figure it out soon enough, whereas if you have made things too complicated you will be so busy trying to fix your code that you may never realize it.
Pretend that you are your own customer. If you are writing a library for others to use, you can start by writing example programs that use the library. Of course at the beginning these examples won’t work – the point is to force yourself to understand how your solution will be used so that you can make it as easy and fun to use as possible. It also forces you to think about what should happen in the case of user or system errors. You may also discover additional tasks that your solution should carry out in order to make life simple. These examples may pertain not only to the entire solution, but to small portions of your solution. By writing tests and examples early – that is, by practicing unit testing and test driven development – you can ensure high quality from the start.
Learn how to debug. Most modern languages are associated with development environments that have sophisticated debuggers. They allow you to step through your code bit by bit, inspecting the data and control flow along the way. Learn how to set breakpoints, inspect variables, step in and out of functions, and all of the keyboard shortcuts associated with each. Computational errors in particular can be very hard to catch by simply reviewing code, so you’ll want to become adept at using the debugger efficiently.
Write high performance code. I wrote about this subject in a previous post. The key is to measure. Rico Mariani does an awesome job of describing why measurement is so important in this post. Rookie data scientists frequently spend too much time tuning their code without measuring.
Add logging and tracing to your code. Your code may run as part of a web application, a complicated production system, or may be self-contained, but it’s always a good idea to add logging and tracing statements to your code. Logging is intended for others to follow the execution of your code, whereas tracing is for your own benefit. Tracing is important because analytics code often has complicated control flow and is more computationally intensive than typical code. When users run into problems, such as incorrect results, sluggish performance, or “hangs”, often the only thing you have to go on are the trace logs. Most developers add too little tracing to their code, and many not at all. Just like the rest of your code, your trace statements should be clear, easy to follow, and tuned for the application. For example, if you are writing an optimization algorithm then you may wish to trace the current iterate, the error, and iteration number, and so on. Sometimes the amount of tracing information can become overwhelming, so add switches that help you to control how much is actually traced.
In future posts in this series, I will talk about developing as part of a team, choice of language and toolset, and the different types of programming tasks a data scientist is likely to encounter over the course of their career.