Programming for Data Scientists – Guidelines

As a data scientist, you are going to have to do a lot of coding, even if you or your supervisor do not think that you will. The nature of your coding will depend greatly on your role, which will change over time. For the foreseeable future, an important ingredient for success in your data science career will be writing good code. The problem is that many otherwise well-prepared data scientists have not been formally trained in software engineering principles, or, even worse, have been trained to write crappy code. If this sounds like your situation, this post is for you!

You are not going to have the time, and may not have the inclination, to go through a full crash course in computer science. You don’t have to. Your goal should be to become proficient and efficient at building software that your peers can understand and your clients can effectively use. The strategy for achieving this goal is purposeful practice by writing simple programs.

For most budding data scientists, it is a good idea to try to develop an understanding of the basic principles of software engineering, so that you will be prepared to succeed no matter what is thrown at you. More important than the principles themselves is to put them into practice as soon as you can. You need to write lots of code in order to learn to write good code, and a great way to start is to write programs that are meaningful to you. These programs may relate to a work assignment, or to an area of analytics that you know and love, or a "just for fun" project. Many of my blog posts are the result of me trying to develop my skills in a language that I'm trying to learn, or am not yet good at.

A mistake many beginners make is to be too ambitious in their "warm-up" programs. Your warm-up programs should not be intellectually challenging, at least in terms of what they are actually trying to do. They should focus your attention on how to effectively write solutions in your programming environment. Doing a professional-grade job on a simple problem is a stepping-stone for greater works. Once you nail a simple problem, choose another one that exercises different muscles, rather than a more complicated version of the problem you just solved. If your warm-up was a data access task, try charting some data, or writing a portion of an algorithm, or connecting to an external library that does something cool.

There are many places where you can find summaries of software engineering basics: books, online courses, and blogs such as Joel Spolsky’s (here’s one example). Here I will try to summarize a few guidelines that I think are particularly important for data scientists. Many data scientists already have formal training in mathematics, statistics, economics, or a hard science. For this group, the basic concepts of computing (memory, references, functions, control flow, and so on) are easily understood. The challenge is engineering: building programs that will stand the test of time.

Be clear. A well-organized mathematical paper states its assumptions, uses good notation, has a logical flow of connected steps, organizes its components into theorems, lemmas, corollaries, and so on, succinctly reports its conclusions, and produces a valid result! It’s amazing how often otherwise talented mathematicians and statisticians forget this once they start writing code. Gauss erased the traces of the inspiration and wandering that led him to his proofs, preferring instead to present a seamless, elegant whole with no unnecessary pieces. We needn’t always go that far, whether in math or in programming, but it’s good to keep this principle in mind: don’t be satisfied with stream-of-consciousness code that runs without an error. A number of excellent suggestions for writing clear code are given here [pdf], in particular to use sensible names and to structure your code logically into small pieces. 

Keep it simple. Don’t write code you think you will need, write the code you actually need. Fancy solutions are usually wrong, or at least they can be broken up into simpler pieces. If you have made it too simple, you will figure it out soon enough, whereas if you have made things too complicated you will be so busy trying to fix your code that you may never realize it.

Pretend that you are your own customer. If you are writing a library for others to use, you can start by writing example programs that use the library. Of course at the beginning these examples won't work – the point is to force yourself to understand how your solution will be used so that you can make it as easy and fun to use as possible. It also forces you to think about what should happen in the case of user or system errors. You may also discover additional tasks that your solution should carry out in order to make life simple. These examples may pertain not only to the entire solution, but to small portions of your solution. By writing tests and examples early – that is, by practicing unit testing and test-driven development – you can ensure high quality from the start.
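As a small sketch of this examples-first habit: below, the test for a hypothetical normalize() helper is written first, forcing the interface to be designed from the caller's point of view (the function name and behavior are illustrative assumptions, not from any particular library).

```python
def normalize(values):
    """Scale values to the range [0, 1]. Raises ValueError on empty input."""
    if not values:
        raise ValueError("normalize() requires at least one value")
    lo, hi = min(values), max(values)
    if lo == hi:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def test_normalize():
    # Written before normalize() existed: it pins down the happy path...
    assert normalize([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]
    # ...and makes error behavior part of the interface, too.
    try:
        normalize([])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError on empty input")

test_normalize()
```

Notice that the empty-input question had to be answered before a single line of the implementation was written – that is the point of the exercise.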

Learn how to debug. Most modern languages are associated with development environments that have sophisticated debuggers. They allow you to step through your code bit by bit, inspecting the data and control flow along the way. Learn how to set breakpoints, inspect variables, step in and out of functions, and all of the keyboard shortcuts associated with each. Computational errors in particular can be very hard to catch by simply reviewing code, so you’ll want to become adept at using the debugger efficiently.
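Even outside a full IDE, you can practice these habits: Python's built-in breakpoint() drops you into the pdb debugger mid-computation, where you can inspect variables and step line by line. The rolling_mean function below is a made-up example, with the breakpoint behind an opt-in environment variable so the code still runs unattended.

```python
import os

def rolling_mean(xs, window):
    """Mean over a sliding window; a handy place to practice the debugger."""
    out = []
    for i in range(len(xs) - window + 1):
        chunk = xs[i:i + window]
        if os.environ.get("TRACE_DEBUG"):  # opt-in so the code runs unattended
            breakpoint()                   # inspect i, chunk, out; 'c' continues
        out.append(sum(chunk) / window)
    return out

print(rolling_mean([1, 2, 3, 4], 2))  # → [1.5, 2.5, 3.5]
```

Running with TRACE_DEBUG=1 pauses at every window, which is exactly the kind of step-through inspection that catches subtle computational errors a code review would miss.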

Write high performance code. I wrote about this subject in a previous post. The key is to measure. Rico Mariani does an awesome job of describing why measurement is so important in this post. Rookie data scientists frequently spend too much time tuning their code without measuring.
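A minimal sketch of "measure first" using Python's standard timeit module; the two sum-of-squares variants are illustrative stand-ins for whatever alternatives you are actually comparing.

```python
import timeit

def loop_version(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

def builtin_version(n):
    return sum(i * i for i in range(n))

# Never tune on a hunch: time both candidates on a representative input,
# averaged over several runs, before deciding which one to keep.
for fn in (loop_version, builtin_version):
    t = timeit.timeit(lambda: fn(100_000), number=20)
    print(f"{fn.__name__}: {t:.3f}s")
```

Two lines of timing code settle an argument that could otherwise eat an afternoon of premature refactoring.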

Add logging and tracing to your code. Your code may run as part of a web application, a complicated production system, or may be self-contained, but it's always a good idea to add logging and tracing statements to your code. Logging is intended for others to follow the execution of your code, whereas tracing is for your own benefit. Tracing is important because analytics code often has complicated control flow and is more computationally intensive than typical code. When users run into problems, such as incorrect results, sluggish performance, or "hangs", often the only thing you have to go on is the trace log. Most developers add too little tracing to their code, and many add none at all. Just like the rest of your code, your trace statements should be clear, easy to follow, and tuned for the application. For example, if you are writing an optimization algorithm then you may wish to trace the current iterate, the error, the iteration number, and so on. Sometimes the amount of tracing information can become overwhelming, so add switches that help you to control how much is actually traced.
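Here is one way this might look in Python, using the standard logging module to trace an iterative algorithm. The bisection solver is a made-up example; the logging level acts as the switch, so flipping DEBUG to WARNING silences the trace without touching the algorithm.

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("solver")

def bisect_sqrt(target, tol=1e-6):
    """Find sqrt(target) by bisection, tracing each iterate as suggested above."""
    lo, hi = 0.0, max(1.0, target)
    iteration = 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        error = mid * mid - target
        # Trace the current iterate, the error, and the iteration number.
        log.debug("iter=%d x=%.6f error=%.2e", iteration, mid, error)
        if error > 0:
            hi = mid
        else:
            lo = mid
        iteration += 1
    return (lo + hi) / 2

print(bisect_sqrt(2.0))
```

When a user reports a "hang", a trace like this immediately tells you whether the solver is diverging, oscillating, or simply grinding through a large problem.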

In future posts in this series, I will talk about developing as part of a team, choice of language and toolset, and the different types of programming tasks a data scientist is likely to encounter over the course of their career.

Fantasy Football Ratings 2014

I have prepared fantasy football ratings for the 2014 NFL season based on the data from last year’s season. I hope you will find them useful! You can download the ratings here.

These ratings are reasonable but flawed. The strengths of the ratings are:

  • They are based on player performance from the 2013 season, using a somewhat standard fantasy scoring system. (6 points for touchdowns, -2 for turnovers, 1 point per 25 passing yards, 1 point per 10 rushing or receiving yards, and reasonable scoring for kickers.)
  • The ratings are comparable across positions because the rating represents the expected number of fantasy points that a player will score compared to a "replacement level" player at that position. I call this "Fantasy Points Over Replacement": FPOR.
  • Touchdowns are a key contributor to fantasy performance, but they are fickle: they often vary dramatically between players of the same overall skill level, and even between seasons for the same player. In a previous post I showed that passing and rushing touchdowns are lognormally distributed against yards per attempt. I have accounted for this phenomenon in the rankings. Loosely speaking, this means that a player who scored an unexpectedly high number of touchdowns in 2013 will be projected to score fewer in 2014.
  • The ratings do a rough correction for minor injuries. Players that play in 10 or more games in a season are rated according to the number of points they score per game. Therefore a player who missed, say, two games in 2013 due to injury is not disadvantaged compared to one that did not.
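To make the scoring and "points over replacement" ideas concrete, here is a hedged Python sketch. The scoring constants follow the list above, but the replacement-level rule (taking the Nth-ranked player at each position as the baseline) is my assumption, not necessarily the exact method behind these ratings.

```python
def fantasy_points(pass_yds=0, rush_yds=0, rec_yds=0, tds=0, turnovers=0):
    """Somewhat standard scoring, per the list above."""
    return pass_yds / 25 + (rush_yds + rec_yds) / 10 + 6 * tds - 2 * turnovers

def points_over_replacement(players, replacement_rank):
    """players: (name, season points) pairs for ONE position.

    Baseline = the replacement_rank-th best player's points (an assumed
    definition of "replacement level"); the FPOR-style score is each
    player's points above that baseline, comparable across positions.
    """
    ranked = sorted(players, key=lambda p: p[1], reverse=True)
    baseline = ranked[replacement_rank - 1][1]
    return [(name, pts - baseline) for name, pts in ranked]
```

For example, a quarterback with 250 passing yards and 2 touchdowns in a game scores 250/25 + 12 = 22 points under this system.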

There are several weaknesses:

  • I have data for several previous seasons but do not use it. This would stabilize the ratings, and we could probably account for maturation / aging of players from season-to-season.
  • Rookies are not rated.
  • We do not account for team changes. This factor is often very important, as a backup for one team may end up as a starter for another, dramatically affecting fantasy performance. (I actually have a pretty good heuristic for accounting for this, but I have implemented it only in SAS, not Python, and I no longer have access to a SAS license.)
  • Players who missed a large portion of the 2013 season are essentially penalized for 2014, even if they are expected to return fully.
  • I have not rated defense/special teams.

You may want to adjust the rankings accordingly. Here are the top 25 rated players (again, the full ratings are here):

Name Position RawPts AdjPts FPOR
LeSean McCoy RB 278.6 199.3125 85.96875
Jamaal Charles RB 308 194 80.65625
Josh Gordon WR 218.6 176.3571429 71.57142857
Matt Forte RB 261.3 177.46875 64.125
Calvin Johnson WR 219.2 157.7142857 52.92857143
DeMarco Murray RB 205.1 155.4642857 42.12053571
Reggie Bush RB 185.2 153.4285714 40.08482143
Jimmy Graham TE 217.5 113.90625 35.90625
Antonio Brown WR 197.9 140.53125 35.74553571
Knowshon Moreno RB 236.6 148.6875 35.34375
Adrian Peterson RB 203.7 147.5357143 34.19196429
Marshawn Lynch RB 239.3 145.59375 32.25
Le’Veon Bell RB 171.9 142.9615385 29.61778846
Demaryius Thomas WR 227 134.0625 29.27678571
A.J. Green WR 208.6 133.6875 28.90178571
Eddie Lacy RB 207.5 141.5 28.15625
Andre Johnson WR 170.7 131.90625 27.12053571
Alshon Jeffery WR 182.1 131.34375 26.55803571
Peyton Manning QB 519.98 172.48125 22.18125
Stephen Gostkowski K 176 176 22
Drew Brees QB 435.68 172.2 21.9
Ryan Mathews RB 184.4 133.5 20.15625
DeSean Jackson WR 187.2 124.875 20.08928571
Pierre Garcon WR 162.6 124.3125 19.52678571
Jordy Nelson WR 179.4 123.1875 18.40178571

FPOR is the adjusted, cross-position score described earlier. RawPts is simply 2013 fantasy points. AdjPts are the points once touchdowns have been “corrected” and injuries accounted for.

We will see how the ratings work out! If I have time I will post a retrospective once the season is done.

It’s Not My System

Those of us who build and inhabit systems often forget how arbitrary they are. I am reminded of this every time I go through airport security. I fly occasionally but not often, just frequently enough to be reminded of the variations in the collection of regulations and tasks that James Fallows calls “security theater”. Shoes on? Shoes off? Ring on? Off? Cell phone on? Off? Boarding pass in hand? Toiletries in a plastic bag? Shoes off? On? One trip it is one way and the next it is the opposite. Some TSA agents bark out the policies, seemingly annoyed at the fact that we’re getting it all wrong. Maybe some of these things have remained the same for years – I don’t know; it’s not my system. Just tell me what to do.

Don’t get me started on website passwords. Six characters. Eight characters. Letter and a number. Upper and a lower. Case insensitive. Special character. No special character. Different from the last ten. I don’t remember my password, so shoot me. It’s not my system. Just tell me what to do.

Windows 8. Swipe down. Swipe to the side. Scroll to the corner. Drag to the corner. Use the “share charm”. Right click. Don’t right click. It’s not my system. Just tell me what to do.

You: person who is about to build a system that I will use some day. I will never give it as much thought as you, or at least I hope not to. I am a capable guy. I’m no dummy. I just don’t care as much about the thing you built as you do. Except for certain cases where I care with an intensity that you will probably never understand, because I may miss my flight, or need to pay a bill, or need to connect to wireless. It’s not my system. It’s yours. Just tell me what to do.

Negative Space and Analytical Models

Bertrand Russell said, “Mathematics, rightly viewed, possesses not only truth, but supreme beauty – a beauty cold and austere, like that of sculpture.” Analytical models, borne of math and forged with code, should possess the same properties.

A painting, or a sculpture, or a piece of music, is not made better by cramming more stuff into it, a lesson George Lucas famously unlearned. Adding too much to an analytical model results in overfitting, confusion, and mistakes. We must resist this temptation, and move to the next level: viewing the empty spaces as elements that enhance the whole. This is the concept of negative space.

A sales model should account not only for those who do purchase, but those who do not – otherwise you will overestimate the contributions of factors that sometimes generate sales (I am looking at you, Twitter). A scheduling model must consider not only which options are possible, but those that are not. A social model should consider not only those who are active tweeters, bloggers, and commenters, but also those who do their talking off the grid. Otherwise those who shout the loudest in one general direction are mistakenly interpreted as having real influence. There is no truth, and no beauty, in that!

401k Simulation Using Analytic Solver Platform

You can build a pretty decent 401k simulation in a few minutes in Excel using Analytic Solver Platform:


Let’s give it a shot! You can download the completed workbook here.

First, let’s build a worksheet that calculates 401k balances for 10 years. At the top of the worksheet let’s enter a yearly contribution rate:


Let’s compute 401k balances for the next 10 years, based on this contribution. A simple calculation for the balance for a given year involves five factors:

  1. The 401k balance for the previous year.
  2. The rate of return for the 401k.
  3. The previous year’s salary.
  4. The rate of increase in the salary (your raise).
  5. The rate of contribution (entered above).

In row 6 we will enter the starting values for return, salary increase, salary, and balance in columns B, C, D, E respectively. For now let's assume:

  • Return = 0.05
  • Salary Increase = 0.05
  • Balance = 5,000
  • Salary = 100,000

With a couple of small assumptions, the new balance is old balance * (1 + return) + contribution rate * salary. In the next row we will compute Year 1, using these formulas:

  • Salary = D6 * (1 + C6). This simply means that this year’s salary is last year’s adjusted by raise. (Obviously salary could be modeled differently depending on when the raise kicks in.)
  • Balance = E6*(1 + B6)+D6*$B$3. There are two terms. The first is the old balance grown by the portfolio return. The second is last year's salary times the contribution rate (cell B3).

We can fill these values down, giving us the 401k balance for the entire period:
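Outside Excel, the same recurrence can be sketched in a few lines of Python. The 5% contribution rate is an assumed value for cell B3, since the workbook's actual rate isn't shown here.

```python
def project_balances(years=10, balance=5_000.0, salary=100_000.0,
                     ret=0.05, raise_rate=0.05, contrib=0.05):
    """Year-by-year 401k balances, mirroring the spreadsheet formulas."""
    balances = [balance]
    for _ in range(years):
        # =E6*(1+B6) + D6*$B$3 : grow the old balance, add the contribution
        balance = balance * (1 + ret) + salary * contrib
        # =D6*(1+C6) : apply the raise for next year
        salary = salary * (1 + raise_rate)
        balances.append(balance)
    return balances

print(project_balances()[-1])  # balance after 10 years
```

Sanity check on Year 1: 5,000 * 1.05 + 100,000 * 0.05 = 10,250, which is what the fill-down produces in the worksheet under these assumptions.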


Here’s the thing: we don’t actually know what our portfolio return and salary increases will be in future years. They’re uncertain. We can use Analytic Solver Platform to turn the wild guesses in columns B and C into probability distributions. Using simulation we can then determine the most likely range for future 401k balances.

For portfolio return, a reasonable thing to do is to go back and look at past performance. Rates of return for the S&P 500 (and other financial instruments) are given on this page. Using the “From Web” feature of Power Query (or by simply copy-pasting) you can bring this data into another Excel worksheet with no sweat:


Now let’s turn this historical data into a probability distribution we can use in our model. Select the S&P 500 historical return data and select Distributions –> Distribution Wizard in the Analytic Solver Platform tab:


Fill in the first page of the wizard:


Select “continuous values” in the next step, “Fit the data” in the next, and then pick an empty cell for “Location” in the final step. In the cell that you selected, you will see a formula something like this:

=PsiWeibull(3.55593208704872,0.692234009779183, PsiShift(-0.509633992648591))

This is a Weibull distribution that fits the historical data. If you hit “F9” to recalculate the spreadsheet you will see that the value for this cell changes as a result of sampling from this distribution. Each sample is a different plausible yearly return. Let’s copy this formula in place of the 0.05 values we entered in column B of our original spreadsheet. If we click on the “Model” button in the Analytic Solver Platform ribbon, we will see that these cells have been labeled as “Uncertain Variables” in the Simulation section.

For Salary Increase we will do something simpler. Let’s just assume that the increase will be between 2% and 7% each year. Enter =PsiUniform(0.02, 0.07) in cell C6, and fill down.

The last thing we need to do is to define an “output” for the simulation, called an Uncertain Function. When we define Uncertain Functions, we get nice charts and stats for these cells when we run a simulation. Click on the Balance entry for Year 10, then click on the arrow next to the “+” in the Model Pane, and then Add Uncertain Function. Your Model Pane will look something like this:


And your spreadsheet will look something like this:


Now all we need to do is click Simulate in the ribbon. Analytic Solver Platform draws samples for the uncertain variables (and evaluates everything in parallel for fast performance) and then shows you a chart of the different possible 401k balances. As you can see, the possible balances vary widely but are concentrated around $100,000:
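For readers without Analytic Solver Platform, the same Monte Carlo experiment can be sketched in plain Python. The Weibull parameters are the ones the Distribution Wizard fitted above (note that Python's weibullvariate takes scale then shape, and the shift is applied by subtraction); the 5% contribution rate for cell B3 is an assumption.

```python
import random
import statistics

random.seed(0)  # reproducible runs

def simulate_final_balance(trials=2_000, years=10,
                           balance0=5_000.0, salary0=100_000.0, contrib=0.05):
    """Monte Carlo over the spreadsheet's balance recurrence."""
    finals = []
    for _ in range(trials):
        balance, salary = balance0, salary0
        for _ in range(years):
            # Shifted Weibull fit: scale 0.6922, shape 3.5559, shift -0.5096
            ret = random.weibullvariate(0.692234, 3.555932) - 0.509634
            raise_rate = random.uniform(0.02, 0.07)  # =PsiUniform(0.02, 0.07)
            balance = balance * (1 + ret) + salary * contrib
            salary *= 1 + raise_rate
        finals.append(balance)
    return finals

finals = simulate_final_balance()
print(f"median 10-year balance ≈ {statistics.median(finals):,.0f}")
```

Plotting a histogram of finals reproduces the kind of distribution chart the Simulate button shows: a wide spread of outcomes with a clear central mass.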



Here’s the great thing: you can now extend this spreadsheet to your heart’s content, building simulations that incorporate more factors. If you want to get really fancy, you can correlate yearly returns. Check out the extensive help for more.

Spark Summit Keynote Notes

Here is a summary of my key impressions from the Day 1 keynotes of the 2014 Spark Summit.

This year’s Spark Summit was one of the deepest, most interesting technical conferences I have attended, and I don’t say that lightly. It is easy to get caught up in the excitement of a conference filled with enthusiasts, but trust me when I say that conventional MapReduce–based Hadoop is over and technologies like Spark will be part of the tipping point that will turn Big Data hype into real applications and much more widespread deployment. Hadoop is legacy.

Spark, like Hadoop, is more than one “thing”. The base component of Spark is a cluster computation engine that is like MapReduce on steroids. Instead of the simple two-stage “map then reduce” computational model, Spark supports more general DAG-structured computational flows (Microsoft watchers will remember Dryad). This, in itself, is a big innovation, especially for analytics scenarios. Indeed, Spark has been shown to be 10, 100, even 1000 times faster than Hadoop on a number of real workloads. More important than this, in my view, is that Spark includes higher level libraries for data access and analytics, surfaced in clean, consistent APIs that are available from three languages: Java, Scala, and Python. Three important Spark components are MLlib, a machine learning library; GraphX, a graph processing library; and Spark SQL, introduced at this conference. An analogue for those familiar with the Microsoft ecosystem is the .Net Framework – .Net provides languages, a runtime, and a set of libraries together. The integration of the pieces makes each much more useful.

The Summit is organized and principally sponsored by Databricks (tagline: “making big data easy”). This is the company founded by the Berkeley-based creators of Spark. Ion Stoica, CEO of Databricks, kicked off Monday’s festivities, introducing Databricks Cloud, a web based Spark workbench for doing big data analytics. You can find screenshots on the Databricks Cloud site, or on twitter. Key points:

  • Databricks Cloud is currently in a private beta.
  • It’s a standalone web interface.
  • It has a command-line “REPL” interface.
  • The examples I saw were in Scala (which is kind of like a mix of F# and Java).
  • You can bring in data from Amazon S3 or other sources using Spark SQL (more on that in future posts).
  • It includes pre-canned datasets such as a twitter snapshot/firehose (can’t tell which).
  • You can do SQL queries right from the REPL.
  • It has incredibly simple, clean looking visualizations tied to results.
  • You can drag and drop dashboard components that correspond to REPL / program outputs. You can configure how often these components are “refreshed”.
  • We were shown a live 10-minute demo creating a dashboard to filter live tweets, based on a similarity model authored with the help of MLlib and trained on Wikipedia.
  • Databricks Cloud would be quite useful even as a standalone, single node analytics workbench, but recall that all of this is running on top of Spark, on the cluster without any “parallel programming” going on by the user. 
  • Everything you create in the workbench is Spark 1.0 compliant, meaning you can move it over to any other Spark 1.0 distribution without changes.

The development plan is sound, and there is a ton of corporate support for Spark from Cloudera, Hortonworks, Databricks, SAP, IBM, and others. If time permits I will summarize some of the other keynotes and sessions.

INFORMS Big Data 2014 Conference Notes

Many moons have passed since my last conference report: let’s do this. I will follow Steven Sinofsky’s style and try to keep this fact-based, saving a few more subjective thoughts for the end. I attended the INFORMS Big Data Conference in San Jose, representing Frontline Systems. At the conference we announced the release of the newest version of Analytic Solver Platform.

This was the very first INFORMS Big Data Conference. INFORMS was originally focused on “operations research” aka optimization aka prescriptive analytics. In recent years, it has embraced analytics more broadly: the spring “practice conference” was rebranded as a “Business Analytics” conference, an analytics professional certification program was rolled out, and last week INFORMS introduced an analytics maturity model for organizations. Holding a “big data” conference is a natural extension.

The conference was relatively small. There were between 15 and 20 exhibitors, including Frontline. The two biggest guns were SAS and FICO. There were several booths that were connected to academia: a couple of graduate programs and a booth for a company run by students. There were also several booths by smaller analytics and/or big data firms, mostly offering web-based experiences for authoring and visualizing predictive models. I think it is fair to say that the majority of the exhibitors have an analytics, as opposed to a big data, emphasis.

There were several technology workshops on Sunday, followed by two days of talks. Monday’s keynote was given by Bill Franks, Chief Analytics Officer at Teradata and author of Taming The Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics. Tuesday’s keynote was given by Michael Svilar of Accenture. The talks were divided into several tracks, among them Big Data 101, Case Studies, and Emerging Trends. The talks in the Big Data 101 track were generally well attended. The attendance in other tracks varied widely. In general, talks given by people from “cutting edge” Analytics 10.0 organizations such as Kaggle were quite popular.

Diego Klabjan, the founding director of Northwestern University’s MS in Analytics, gave a nice overview of the basic Hadoop stack beyond basic Java-based MapReduce, in particular Pig and Hive. This talk, along with others in the “Big Data 101” track, functioned as a mini “survey course” in various aspects of practicing analytics on Big Data platforms. These talks were helpful for framing technologies and concepts, weaving them into a much more coherent totality.

Paul Kent, VP of Big Data at SAS, gave a talk in the “Emerging Trends” track titled “Big Data and Big Analytics – So Much More Gunpowder!” Paul’s talk focused on four themes: abundance, Hadoop, SAS on Hadoop, and Big Data ideas for organizations. We find ourselves in an “era of abundance” because the cost of storing information has become less than the cost of making the decision to throw it away. We can use this data to answer questions that had not even been formulated at the time of data collection. Paul summarized the Hadoop ecosystem which supports the collection and processing of such data. He went on to describe several SAS offerings which interact with Hadoop in various ways. It was interesting to me to learn how many SAS procedures are now supported “on node” for high performance, among them HPMIXED, HPNEURAL, HPFOREST, HPSVM, and so on. SAS’s continued investment in Hadoop is reflective of a more general challenge: how can organizations realize the potential of Big Data and “Big Data Analytics” when they often have large existing investments in “good old fashioned” storage and analytics?

Paul provided his own definition of Big Data: it is “the amount of data or complexity that puts you out of your comfort zone”. Indeed I heard several different definitions of Big Data at the conference. This variety is indicative of the buzz that surrounds Big Data, particularly among commercial organizations looking to position their offerings.

Frontline has long been in the business of providing analytics solutions to business analysts, and has had a strong presence at various INFORMS conferences for years. I spent a lot of time at the Frontline booth talking to students, professors, business analysts and consultants. Our booth had lots of traffic, and it was interesting to note both how many familiar “INFORMS faces” came by, as well as the number of people who had never heard of Frontline before. In this sense, the INFORMS Big Data conference achieved its mission of connecting the traditional analytics and big data communities. It was interesting to note how many questions there were about Excel’s capabilities. Many did not seem to realize that Excel’s row and column limits increased years ago, and that PowerPivot can bring in much larger data sets with ease – let alone features offered in the recently released Power BI. The hype cycle has largely left Excel behind.

In short, the conference was worth Frontline’s time. We were able to tell the story of not only our most recent release (go get it!) but also Frontline’s overall value proposition. INFORMS Big Data was a nice first bridge between two communities that really should be one. The conference, bluntly, should be much more interesting in a couple of years time, as hype diminishes and case studies increase.