Four Things I Learned from Jack Dongarra

Opening the Washington Post today brought me a Proustian moment: encountering the name of Jack Dongarra. His op-ed on supercomputing involuntarily recalled to mind the dusty smell of the third floor MacLean Hall computer lab, xterm windows, clicking keys, and graphite smudges on spare printouts. Jack doesn’t know it, but he was a big part of my life for a few years in the 90s. I’d like to share some things I learned from him.

I am indebted to Jack. Odds are you are too. Nearly every data scientist on Earth uses Jack’s work every day, and most don’t even know it. Jack is one of the prime movers behind the BLAS and LAPACK numerical libraries, among many others. BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra Package) are programming libraries that provide foundational routines for manipulating vectors and matrices. These routines range from the rocks and sticks of addition, subtraction, and scalar multiplication up to finely tuned engines for solving systems of linear equations, factorizing matrices, determining eigenvalues, and so on.

Much of modern data science is built upon these foundations. They are hidden by layers of abstractions, wheels, pips and tarballs, but when you hit bottom, this is what you reach. Much of ancient data science was built upon them too, including the solvers I wrote as a graduate student when I was first exposed to his work. As important as LAPACK and BLAS are, that’s not the reason I feel compelled to write about Jack. It’s more about how he and his colleagues went about the whole thing. Here are four lessons:

Layering. If you dig into BLAS and LAPACK, you quickly find that the routines are carefully organized. Level 1 routines are the simplest “base” routines, for example adding two vectors. They have no dependencies. Level 2 routines are more complex because they depend on Level 1 routines – for example multiplying a matrix and a vector (which can be implemented as repeatedly taking the dot product of vectors, a Level 1 operation). Level 3 routines use Level 2 routines, and so on. Of course all of this is obvious. But we dipshits rarely do what is obvious, even these days. BLAS and LAPACK not only followed this pattern, they told you they were following this pattern.
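If you want to poke at these layers yourself, here is a minimal sketch using SciPy’s low-level BLAS bindings (assuming you have NumPy and SciPy handy), with one routine from each level:

```python
import numpy as np
from scipy.linalg import blas

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
A = np.eye(3)
B = np.ones((3, 3))

# Level 1: vector-vector. ddot is a dot product; daxpy computes a*x + y
# (daxpy overwrites its y argument, hence the copy).
d = blas.ddot(x, y)
z = blas.daxpy(x, y.copy(), a=2.0)

# Level 2: matrix-vector, conceptually built from Level 1 dot products.
v = blas.dgemv(1.0, A, x)   # v = 1.0 * (A @ x)

# Level 3: matrix-matrix, conceptually built from Level 2 products.
C = blas.dgemm(1.0, A, B)   # C = 1.0 * (A @ B)
```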

I guess I have written enough code to have acquired the habit of thinking this way too. I recall having to rewrite a hilariously complex beast of project scheduling routines when I worked for Microsoft Project, and I tried to structure my routines exactly in this way. I will spare you the details, but there is no damn way it would have worked had I not strictly planned and mapped out my routines just like Jack did. It worked, we shipped, and I got promoted.

Naming. Fortran seems insane to modern coders, but it is of course awesome. It launched scientific computing as we know it. In the old days there were tight restrictions on Fortran variable names: 1-6 characters from [a-z0-9]. With a large number of routines, how does one choose names that are best for programmer productivity? Jack and team zigged where others might have zagged and chose names with very little connection to English words:

“All driver and computational routines have names of the form XYYZZZ”

where X represents data type, YY represents type of matrix, and ZZZ is a passing gesture at the operation that is being performed. So SGEMV means “single precision general matrix-vector multiplication”.
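The scheme is so regular that a toy decoder fits in a dozen lines of Python. The lookup tables below are deliberately tiny and illustrative, nowhere near the full BLAS/LAPACK catalog:

```python
DTYPES = {"S": "single precision", "D": "double precision",
          "C": "complex", "Z": "double complex"}
MATRICES = {"GE": "general", "SY": "symmetric", "TR": "triangular"}
OPS = {"MV": "matrix-vector multiply", "MM": "matrix-matrix multiply",
       "SV": "solve a linear system"}

def decode(name):
    """Split an XYYZZZ-style routine name into its three parts."""
    x, yy, zzz = name[0], name[1:3], name[3:]
    return f"{DTYPES[x]} {MATRICES[yy]} {OPS[zzz]}"

print(decode("SGEMV"))  # single precision general matrix-vector multiply
print(decode("DGESV"))  # double precision general solve a linear system
```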

This scheme is not “intuitive” in the sense that it is not named GeneralMatrixVectorMultiply or general_matrix_vector_multiply, but it is predictable. There are no surprises and the naming scheme itself is explicitly documented. Developers of new routines have very clear guidance on how to extend the library. In my career I have learned that all surprises are bad, so sensible naming counts for a lot. I have noticed that engineers whom I respect also think hard about naming schemes.

Documentation. BLAS and LAPACK have always had comprehensive documentation. Every parameter of every routine is documented, the semantics of the routine are made clear, and “things you should know” are called out. This set a standard that high-quality libraries (such as the tidyverse and Keras – mostly) have proudly carried forward.

Pride in workmanship. I can’t point to a single website or routine as proof, but the pride in workmanship in the Netlib has always shone through. It was in some sense a labor of love. This pride makes me happy, because I appreciate good work, and I aspire to good work. As a wise man once said:

Once a job is first begun,
Never leave it ’till it’s done.
Be the job great or small,
Do it right or not at all.

Jack Dongarra has done it right. That’s worth emulating. Read more about him here [pdf] and here.

2018 NCAA Tournament Picks

Every year since 2010 I have used data science to predict the results of the NCAA Men’s Basketball Tournament. In this post I will describe the methodology that I used to create my picks (full bracket here). The model has Virginia, Michigan, Villanova, and Michigan State in the Final Four, with Virginia defeating Villanova in the championship game.

Here are my ground rules:

  • The picks should not be embarrassingly bad.
  • I shall spend no more than one work day on this activity (and 30 minutes for this post). This year I spent two hours cleaning up and running my code from last year.
  • I will share my code and raw data. (The data is available on Kaggle. The code is not cleaned up but here it is anyway.)

I used a combination of game-by-game results and team metrics from 2003-2017 to build the features in my model. Here is a summary:

I also performed some post-processing:

  • I transformed team ranks to continuous variables using a heuristic created by Jeff Sonos.
  • Standard normalization.
  • One hot encoding of categorical features.
  • Upset generation. I found the raw results not interesting enough, so I added a post-processing function that looks for games where the win probability for the underdog (a significantly lower seed) is quite close to 0.5. In those cases the model picks the underdog instead; see the sketch after this list.
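Here is roughly what that upset pass looks like; the function name, threshold, and record layout are stand-ins for illustration, not my actual code:

```python
def apply_upsets(games, threshold=0.45):
    """Flip picks where a big underdog is close to a coin flip.

    games: dicts with keys 'favorite', 'underdog', 'seed_gap', and
    'p_underdog' (the model's probability that the underdog wins).
    """
    picks = []
    for g in games:
        big_gap = g["seed_gap"] >= 4             # "significantly lower seed"
        near_coin_flip = g["p_underdog"] >= threshold
        picks.append(g["underdog"] if big_gap and near_coin_flip
                     else g["favorite"])
    return picks
```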

The model predicts the probability that one team defeats another, for all pairs of teams in the tournament. It is implemented in Python and uses logistic regression; a minimal sketch of the idea follows below. The model usually performs well. Let’s see how it does this year!
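Here is that sketch. The features are randomly generated stand-ins: imagine each row as one historical game, encoded as the difference between the two teams’ features and labeled by whether team 1 won:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))   # stand-in for team1-minus-team2 features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Win probability for a new matchup's feature difference:
matchup = rng.normal(size=(1, 4))
p_team1_wins = model.predict_proba(matchup)[0, 1]
```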

Advice for Underqualified Data Scientists

A talented individual seeking entry-level data science roles recently asked me for advice. “How can you show a potential employer that you’d be an asset when on paper your resume doesn’t show what other candidates have?”

I’ll stick to data science, but much of what I share applies to other roles, too.

Let’s think about the question first. Why do coursework and skills matter to employers? It depends. Different employers have different philosophies about how they evaluate candidates. Most job listings specify required skills and qualifications for applicants, for example “must have 3-5 years of experience programming in R or Python.” Usually there is more to the story. Sometimes employers don’t expect candidates to meet all the criteria. Other times, the criteria are impossible to meet.

In most situations, employers are looking for additional attributes not provided in the job listing. Some employers will tell you their philosophy by listing the attributes they value: “ability to deal with ambiguous situations”, “being a team player”, “putting the customer first”, “seeks big challenges”, and so on. Others don’t. Even if they tell you, you don’t typically know which attributes are most important. What really matters? If I am a so-so programmer but a brilliant statistician, do I have a shot?

Individuals who make hiring decisions have a mental image of how a successful candidate will perform on the job. This mental image includes possessing and using a certain set of skills. Qualifications such as a degree, a certificate, or code on github provide part (but only part) of the evidence necessary to assure hiring managers that they are making a sound decision.

Let’s be simplistic and say that employers consider both “explicit skills” and “implicit skills”. Examples of explicit skills are demonstrated knowledge or capability with programming language X, technology Y, or methodology Z. Examples of implicit skills might be the ability to break down a complicated problem into its constituent parts, dealing with ambiguity, working collaboratively, and so on. Certainly some employers are very focused on finding candidates with explicit skills, sometimes to the exclusion of implicit skills.

A reframing of the question is then: “If I sense that a potential employer is looking for certain explicit skills and I don’t think I have them, what do I do?” Here are some ideas:

Provide evidence you are good at acquiring explicit skills. Give an example of learning an explicit skill. (“No, I don’t know R, but I know Python. In my blah blah class I had to learn Python so I could apply it to XYZ problem, and it was no big deal. I did ABC and now my code is up on github. Learning R is really not a big deal; I’m confident I could hit the ground running. What would you have in mind for me for my first project?”)

Emphasize your implicit skills. Game-plan the questions you’ll be asked and think about how you’d highlight what you believe to be your differentiating skills. (Without sounding like a politician.) By the way, now that I think about it, I followed my own advice when I interviewed at Market6 (now 84.51). I talked about the fact that I had worked in both software engineering and data science roles, and that this made me uniquely qualified to work at a company that was trying to deliver data science at scale through SaaS offerings.

Do your own screening. Focus your search on employers who seem to value implicit skills. Rule out others. Do your research prior to applying. Ask friends or contacts. Early in your conversations with employers you can ask the recruiter about their philosophy. Not every job is right for you, so try and figure out which ones are.

That’s all I’ve got. I will close by telling two quick stories.

First story: My first job after finishing my PhD was as an entry level software engineer at Microsoft. When I interviewed, I was fortunate because Microsoft weighted implicit skills highly in their evaluation process. One of my favorite bosses at Microsoft was a classics major (as in Euripides, not the Stones). Another engineering manager started his career localizing dialog box messages into French. Oui, c’est vrai. Both had, and continue to have, a very strong set of implicit skills. They, in turn, looked for implicit skills. Talent comes in many different packages.

Second story: I believe that for early career stage positions it’s important to weight implicit skills more highly than explicit ones. Sometimes it’s a relief if certain explicit skills aren’t there! Several years ago, I had an entry level scientific researcher on my team who did not know how to code, in a position where lots of coding was required. This individual had very deep knowledge of optimization and statistics, was a hard worker, and was incredibly motivated. I was thrilled that they didn’t know how to code because then I could teach them! No bad habits!

Two Frustrations With the Data Science Industry

I saw some serious BS about data science on LinkedIn last night. This is nothing new, but this time I couldn’t help myself. I went on a small rant:

I don’t give a shit if you call yourself a data scientist, an analyst, a machine learning practitioner, an operations research specialist, a data engineer, a modeler, a statistician, a code poet, or a squirrel. I don’t care if you have a PhD, if you went to MIT or a community college, if you were born on a farm or in a city, or if Andrew Ng DMs you for tips. I want to know what you can do, if you can share, if you can learn, if you can listen, and if you can stand for what is right even if it’s unpopular. If we’re good there, the rest we can figure out together.

I must have tapped into something, so I’d like to explain myself a bit more thoroughly.

My rant is rooted in two frustrations about data science.

My first frustration relates to overclassification. How many different terms can we use to refer to data scientists? I honestly don’t know. I have it on good authority that there are six types of data scientists. No, wait, there are seven. Strike that, eight. Actually there are ten. Stop the insanity!

The industry itself is also subject to this kind of sillified stratification. I don’t know what the hell I do anymore. Is it operations research? Statistics? Analytics? Machine Learning? Artificial Intelligence? All of it? It depends on which thought leadership piece I read. And what is the current state of this field, anyway? Are we in the age of Analytics 2.0? Or is it 3.0? Is big data saving the world, or is it in the “trough of disillusionment”? I find all of this unhelpful.

Why is this happening? The use of computer models to learn from data has been around for at least five decades now, but data science has moved from an unnamed, specialized backwater into a rapidly growing and vital industry. This growth has created a market for teaching others about this hot new field. It has also led to the organization of a hierarchy of those who are “in the know” and those who are not. These are the factors driving the accelerating creation of labels and classifications.

However, knowing the names of things does not constitute understanding of essence; the proliferation of labels under the banner of “thought leadership” is often a gimmick; and as Martin Gardner said, inventing your own terminology is a sign of a crank. Debates about terminology often draw us away from doing good data science. Maybe it’s just me but sometimes I get the feeling these distractions are on purpose. They don’t help anyone solve any problems, that’s for sure.

The second frustration I have is overreliance on credentials. As opposed to academic or research positions, my own work in industry has been focused on the practical use of data science to address business problems. More often than not, I’ve worked as part of a team to get the job done. What matters for people like me is whether problems actually get solved, in a reasonable amount of time with a reasonable amount of expense.

I have encountered situations where employers would only consider applicants who had graduated from certain schools, or with certain degrees, or with a certain number of years of experience with a certain specific technical skill. All of these qualifications are proxies for what actually matters: whether someone can meaningfully contribute to team-based analytical problem solving. Focusing on proxies results in both Type I and Type II errors: hiring scientists with great credentials but an inability to deliver (“all hat and no cattle”), or even worse, missing out on the opportunity to hire the proverbial “unicorn” because they didn’t tick the right box. I’ve seen both happen. These proxies are not without their uses: if I really require the development of an MINLP solver to solve optimization models with a particular structure, the right candidate very likely has a PhD. The point is not to confuse correlation with causation. Having a PhD does not make me a great data scientist. Nor does github, nor Coursera, nor Kaggle points. We need to dig deeper.

I suppose I should end positively. The last part of my rant was an appeal to inclusiveness and an appeal to pragmatism. Practical data science means making tradeoffs, large and small, every single day. It means seeing the big picture but also being willing to dig into the details. Let’s take this same practical mindset in growing our skills and building our teams.

2015 NFL Statistics by Player and Team

I have downloaded stats for the recently completed 2015 NFL regular season from yahoo.com, cleaned the data, and saved it in CSV format. The files are located here. If you prefer a github repository, check here. The column headers should be self-explanatory.

You will find seven CSV files, which you can open in Excel or Google Sheets:

  • QB: quarterback data.
  • RB: running backs.
  • WR: wide receivers.
  • TE: tight ends.
  • K: kickers. I have broken out attempted and made field goals by distance into separate columns for convenience.
  • DEF: defensive stats by team.
  • ST: special teams stats by team.

Enjoy!

Nathan’s Reading List: 7/17/2015

Enjoy. And thank you Pocket.

Nathan’s Reading List: 7/7/2015

Here are a few interesting things I read last week:

Domain Expertise and the Data Scientist

When it comes to analytics, domain expertise matters. If you are going to be an effective data scientist doing, say, marketing analytics, you’re going to need to know something about marketing. Supply chain optimization? Supply chain. Skype? Audio and networking.

The point is so obvious that it’s surprising how often it is overlooked. Perhaps this is because in the decades before the terms “data science” and “analytics” entered common usage, the programmers, statisticians, and operations researchers who filled data science roles were simply known as “analysts” or “quants”. They were associated with their industry rather than their job function. Now that the broad “data scientist” label has entered general usage, it is difficult to speak to the domain-specific skills and knowledge required by all data scientists, because they are so industry-specific. I can give marketing analytics data scientists all kinds of advice, but what good would that do most of you?

Many new data scientists have math, stats, hard science, or analytics degrees and do not have deep training in the industries where they are hired. This was common in the 90s when investment firms hired physics grads to become quants. At Nielsen, all of my college graduate hires were trained in something other than media and advertising. The challenge for these newbies is to learn the domain skills they need – on the job! A few words of advice:

Do your homework, but not too much. You may be provided with some intro reading, for example PowerPoint training decks, books, or research papers. It’s obviously a good idea to read these materials, but don’t get your hopes up. I find that these materials often suffer from two flaws: 1) they are organization- rather than industry-specific (for example, describing how a marketing mix model is executed at Nielsen, rather than how marketing mix models work generally), and 2) they are too deep (for example, an academic paper describing a particular type of ARIMA analysis). In the beginning you will want to get the lay of the land, so seek out “for dummies” materials such as undergraduate texts or even general purpose books for laypeople.

Seek out experts in other job functions. Unless you are an external consultant, you will usually have coworkers whose job it is to have tons of domain expertise. For example, at Nielsen, Analytics Development team members worked in the same office as analysts and consultants, whose job it was to carry out projects for clients (rather than building the underlying models and systems). In another organization, it may be a software developer who is building a user interface. Or it may be the client themselves. These experts will understand the underlying business problems to be addressed, and hopefully be able to describe them in plain language. They may also be well acquainted with the practical difficulties of delivering projects in your line of work. Finally, they are likely to have lots of their own resources for learning.

Teach someone else. The best way to learn something is to have to explain it to someone else, so a great technique is to prepare a presentation or whitepaper about a process or model that is underdocumented, or to write an “executive summary” of something complicated. Even better is to write a “getting started” guide for someone in your role. Even if it is never used, it is a good way to crystallize the domain-specific information you need to learn to do your job.

It’s Not My System

Those of us who build and inhabit systems often forget how arbitrary they are. I am reminded of this every time I go through airport security. I fly occasionally but not often, just frequently enough to be reminded of the variations in the collection of regulations and tasks that James Fallows calls “security theater”. Shoes on? Shoes off? Ring on? Off? Cell phone on? Off? Boarding pass in hand? Toiletries in a plastic bag? Shoes off? On? One trip it is one way and the next it is the opposite. Some TSA agents bark out the policies, seemingly annoyed at the fact that we’re getting it all wrong. Maybe some of these things have remained the same for years – I don’t know; it’s not my system. Just tell me what to do.

Don’t get me started on website passwords. Six characters. Eight characters. Letter and a number. Upper and a lower. Case insensitive. Special character. No special character. Different from the last ten. I don’t remember my password, so shoot me. It’s not my system. Just tell me what to do.

Windows 8. Swipe down. Swipe to the side. Scroll to the corner. Drag to the corner. Use the “share charm”. Right click. Don’t right click. It’s not my system. Just tell me what to do.

You: person who is about to build a system that I will use some day. I will never give it as much thought as you, or at least I hope not to. I am a capable guy. I’m no dummy. I just don’t care as much about the thing you built as you do. Except for certain cases where I care with an intensity that you will probably never understand, because I may miss my flight, or need to pay a bill, or need to connect to wireless. It’s not my system. It’s yours. Just tell me what to do.

Things I Wish I Had Learned in School

Here’s a list of subjects of professional relevance I wish I had invested more time in as a starry-eyed youngster.

Presentation skills. Perhaps you are like me: more naturally drawn towards building things, preferring that someone else explain what it is and what it does. I don’t have that luxury. I have to explain, train, or convince colleagues, clients, or partners every day: formally and informally, conceptually and practically, by phone, by Skype, and in person. I’ve had a lot of practice these five years, but boy would it have helped to have come out of school more prepared. My experience as a teaching assistant in graduate school was extremely helpful; I recommend all graduate students sign up to do classroom teaching if they can. Even so, formal training would have helped. Presenting is tough. You need to meet your audience where they are, with your demeanor and content, while staying on message and being yourself.

Statistics. I half-assedly audited a couple of stats classes in grad school but never really took them seriously. Big mistake! Who knew at the time (the early seventeenth century) that we would see not one but two revolutions in statistics: the mainstreaming of Bayesian statistics and the emergence of analytics as a discipline. These days, if you know stats and can code you can write your own ticket. Even as a journalist.

Writing. I had the good fortune to attend the University of Iowa, which has the best writing program in the country, but I didn’t fully take advantage of it. Blogging has helped to compensate a little, but I don’t write frequently enough, and when I do, the material is often hastily thrown together. The sad thing is that few notice. Standards are low, so it is easy to get away with being a poor writer in technical disciplines. I would stand to gain little materially by improving my writing, yet writing appears on my list because it is a pleasurable activity.

Graphic design. Look at that Wikipedia definition: “the art of communication, stylizing, and problem-solving through the use of type, space and image.” Who wouldn’t want to do that? Notice that I didn’t say web design: too limiting and too focused on technology. Type/space/image problems have come up again and again in my professional life, and I find that the time I’ve spent paying heed to these concerns has always paid off. Well executed graphic design is a joy to create and a wonder to behold.