How I Create Presentation Content

Over the years, I have developed an approach for building content for presentations where a PowerPoint or Google Slides deck is required. Here it is, in case it is helpful to you.

I don’t go through this whole process every single time. However, if I have no previous content to draw from, I have adequate time to prepare, and I need to do a good job, this is what I do:

1. Write down bullet points for the following. [15 minutes]

  • Who is the audience for the presentation?
  • Who else is likely to see the presentation?
  • Why do they care about the subject?
  • What do I want them to know about the topic? (3-5 things)
  • What do I hope to accomplish by presenting? (1-3 things)
    • This is what you want, not what the audience wants. e.g. for a project to be funded, to hire more people, to give management confidence in the team’s capabilities.

2. Write a short essay about the topic. [1 hour]

  • Make an outline of the main points, based on the above.
  • Write a 1-3 page essay based on the outline.
  • Assume very little prior knowledge.
  • Don’t use jargon.
  • Do it quickly and don’t worry about flow or structure. I try to write the essay in less than an hour.

3. Do something else for a day or two.

4. Revise your essay. [30 minutes]

  • Read it, without consulting your prior notes.
  • Pull out the main points based on your reading (or ask someone else to read it and tell you).
  • Think about whether the main points should change, or be reworded.
  • Edit the essay as necessary.

5. Make a skeleton for the deck. [30 minutes – 1 hour]

  • Make a single slide with the main points as bullets.
  • Make a single blank slide for each main point, with the main point as the title.
  • Ask myself which basic concepts need to be explained, and make one slide for each concept.
    • You will reuse these “concept” slides in future decks.
    • Even if your audience is already familiar with the concept, build the slide anyway. You can put it in the appendix. Most of the time, your audience is not as familiar with basic concepts as you think.
  • Think about whether there is a good real world example that makes the point.
  • Don’t worry about whether the ordering of slides makes sense.

6. Build slides. [several hours]

  • Look at one slide at a time.
  • If you have content from previous decks that fits, go ahead and pull it in now, as long as it really does fit.
  • Think about the point you want to make on the slide.
    • There should be only one main point.
    • If there is more than one point, break into more than one slide.
    • If you end up with too many slides on one topic, don’t worry about that now.
  • Think about a picture, data visualization, or table that might make the point well. Re-use old content if you can. Don’t worry about making the visual content look nice at first.
  • Work in whatever order feels comfortable.
  • Keep doing this until most of the slides have content.
  • Do not do an intro slide or a conclusion slide.

7. Edit slides. [several hours]

  • Flip through all of the slides and see if there is a logical story being told.
    • Example: “Situation, Obstacle, Action, Result”
  • Move slides to the appendix if they seem extraneous.
  • Start asking for feedback from people you trust, even before you are done editing.
  • Think about ‘frequently asked questions’ that might come up. If they are important, put them in the main flow. Otherwise, make a slide in the appendix.
  • Now get very picky about wording and presentation:
    • Remove extra words
    • Always use the same terminology for the same concept
    • Spell out all acronyms the first time you use them
    • Use consistent fonts, visual styles, and alignments
    • Format all tables properly
    • Always label the axes of charts, and give all charts meaningful titles
    • Add arrows and short comments for things that require particular emphasis. Lower the cognitive burden on your audience.
    • Here are some more tips from a blog post I wrote seven years ago…
  • Make an intro slide last.
  • Consider omitting a ‘conclusion’ slide – you often don’t need it.


Creating Equation Haikus using Constraint Programming

Bob Bosch pointed out (in a response to Shea Serrano, wonderfully…) that 77 + 123 = 200 is a haiku in English:

seventy seven
plus one hundred twenty three
equals two hundred

How many such haikus are there? Well, we can use a constraint satisfaction problem (CSP) solver to give us an idea (Python code here for the impatient).

A traditional haiku consists of three lines: the first line has five syllables, the second seven, and the last five, for a total of seventeen syllables.

We can write equations that describe our problem. First, let’s define the function s[n] as the number of syllables in the English words for n. For example, s[874] = 7; try saying “eight hundred seventy four” out loud. It’s not that hard in English to compute s[n]; at least, I don’t think I messed it up.

Let’s call A the number in the first line of the haiku, B the second, and C the third. If we want a haiku with the right number of syllables and for the equation it describes to hold, all we need is:

s[A] = 5
1 + s[B] = 7 (since “plus” is one syllable)
2 + s[C] = 5 (since “equals” is two syllables)
A + B = C

If you let A=77, B=123, C=200 as in Bob’s example, you will see the above equations hold.
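The whole search can also be sketched in plain Python. This is a brute-force sketch of my own, not the post’s code (which uses the constraint package), and my syllable counts are a rough reimplementation that may differ from the author’s in small details:

```python
# Syllable counts for number words. (My own rough reimplementation;
# the post's code may differ in details.)
ONES = [0, 1, 1, 1, 1, 1, 1, 2, 1, 1,   # zero .. nine
        1, 3, 1, 2, 2, 2, 2, 3, 2, 2]   # ten .. nineteen
TENS = [0, 0, 2, 2, 2, 2, 2, 3, 2, 2]   # (unused, unused), twenty .. ninety

def s(n):
    """Syllables in the English words for n, 1 <= n <= 9999."""
    if n >= 1000:  # "X thousand Y"
        return s(n // 1000) + 2 + (s(n % 1000) if n % 1000 else 0)
    if n >= 100:   # "X hundred Y"
        return s(n // 100) + 2 + (s(n % 100) if n % 100 else 0)
    if n >= 20:
        return TENS[n // 10] + ONES[n % 10]
    return ONES[n]

SYL = {n: s(n) for n in range(1, 10000)}
C_dom = [n for n in SYL if SYL[n] == 3]  # "equals" supplies 2 of the 5

# For each candidate C, try every split A + B = C with the right
# syllable counts: 5 for A, 6 for B ("plus" supplies the 7th).
haikus = [(c - b, b, c)
          for c in C_dom
          for b in range(1, c)
          if SYL[b] == 6 and SYL[c - b] == 5]
```

Enumerating over C first keeps the search small: very few numbers up to 9999 have only three syllables.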

Using the constraint package in Python, it’s easy to create this model and solve it. Here is the code. A great thing about CSP solvers is that they will give you not just one solution to the model, but all of them. Restricting A, B, and C to at most 9999, it turns out there are 279 such haikus, including

eight thousand nineteen
plus nine hundred eighty one
equals nine thousand

and the equally evocative

one hundred fourteen
plus one hundred eighty six
equals three hundred

Of course, this is not the only possible form for an equation haiku. For example, why not:

A + B + C
= D

I invite you to modify the code and find other types of equation haikus!


Optimizing 19th Century Typewriters

The long title for this post is: “Optimizing 19th Century Typewriters using 20th Century Code in the 21st Century”.

Patrick Honner recently shared Hardmath123’s wonderful article “Tuning a Typewriter”. In it, Hardmath123 explores finding the best way to order the letters A-Z on an old and peculiar typewriter. Rather than having a key for each letter as in a modern keyboard, the letters are laid out on a horizontal strip. You shift the strip left or right to find the letter you want, then press a key to enter it:

[Photo of the typewriter’s letter strip]

What’s the best way to arrange the letters on the strip? You probably want to do it in such a way that you have to shift left and right as little as possible. If consecutive letters in the words you’re typing are close together on the strip, you will minimize shifting and type faster.

The author’s approach is to:

  • Come up with an initial ordering at random,
  • Compute the cost of the arrangement by counting how many shifts it takes to type out three well-known books,
  • Try to find two letters that when you swap them results in a lower cost,
  • Swap them and repeat until you can no longer find an improving swap.
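That swap loop can be sketched in a few lines of Python. This is my own reconstruction with a toy frequency matrix; Hardmath123’s actual code scores arrangements against three full books:

```python
import itertools
import random

def cost(F, perm):
    # Total shifts: perm[i] is the strip position of letter i, and
    # F[i][j] counts how often letter i precedes letter j.
    n = len(F)
    return sum(F[i][j] * abs(perm[i] - perm[j])
               for i in range(n) for j in range(n))

def improve(F, perm):
    # Keep trying 2-swaps; stop at a local optimum where no
    # single swap lowers the cost.
    best = cost(F, perm)
    improved = True
    while improved:
        improved = False
        for a, b in itertools.combinations(range(len(perm)), 2):
            perm[a], perm[b] = perm[b], perm[a]
            c = cost(F, perm)
            if c < best:
                best, improved = c, True
            else:
                perm[a], perm[b] = perm[b], perm[a]  # undo the swap
    return perm, best

# Toy example with 4 "letters"; letter 0 very often precedes letter 3.
F = [[0, 1, 0, 9],
     [0, 0, 1, 0],
     [1, 0, 0, 0],
     [0, 2, 0, 0]]
random.seed(0)
perm = random.sample(range(4), 4)  # random initial ordering
perm, shifts = improve(F, perm)
```

The loop strictly decreases the cost at every accepted swap, so it always terminates.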

This is a strong approach that leads to the same locally optimal arrangements, even when you start from very different initial orderings. It turns out that this is an instance of a more general optimization problem with an interesting history: quadratic assignment problems. I will explain what those are in a moment.

Each time I want to type a letter, I have to know how far to shift the letter strip. That depends on two factors:

  1. The letter that I want to type in next, e.g. if I am trying to type THE and I am on “T”, “H” comes next.
  2. The location of the next letter relative to the current one. For example, if H is immediately to the left of T, then the location is one shift away.

If I type in a bunch of letters, the total number of shifts can be computed from two matrices:

  • A frequency matrix F. The entry in row R and column C is a count of how often letter R precedes letter C. If I encounter the word “THE” in my test set, then I will add 1 to F(“T”, “H”) and 1 to F(“H”, “E”).
  • A distance matrix D. The entry in row X and column Y is the number of shifts between positions X and Y on the letter strip. For example, D(X, X+1) = 1 since position X is next to position X+1.

Since my problem is to assign letters to positions, if I permute the rows and columns of D according to the assignment, multiply the result elementwise with F, and sum the entries, I get the total number of shifts required. We can easily compute F and D for the typewriter problem:

  • To obtain F, we can just count how often one letter follows another and record entries in the 26 x 26 matrix. Here is a heatmap for the matrix using the full Project Gutenberg files for the three test books:

[Heatmap of the letter-pair frequency matrix F]

  • The distance matrix D is simple: if position 0 is the extreme left of the strip and 25 the extreme right, d_ij = abs(i - j).

The total number of shifts is obtained by summing f_ij * d_p(i),p(j) for all i and j, where letter i is assigned to location p(i).
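In code, this objective is a short double sum (a sketch; F is the frequency matrix above and p maps letters to strip positions):

```python
def total_shifts(F, D, p):
    """Sum of F[i][j] * D[p[i]][p[j]], where letter i sits at position p[i]."""
    n = len(p)
    return sum(F[i][j] * D[p[i]][p[j]] for i in range(n) for j in range(n))

# The strip distance matrix for a 26-position strip: D[x][y] = |x - y|.
N = 26
D = [[abs(x - y) for y in range(N)] for x in range(N)]
```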

Our problem boils down to finding the permutation that minimizes this sum. Since the cost involves products of entries from two matrices, this is referred to as a Quadratic Assignment Problem (QAP). In fact, problems very similar to this one are part of the standard test suite of problems for QAP researchers, called “QAPLIB”. The so-called “bur” problems have similar flow matrices but different distance matrices.

We can use any QAP solution approach we like to try to solve the typewriter problem. Which one should we use? There are two types of approaches:

  • Those that lead to provably global optimal solutions,
  • Heuristic techniques that often provide good results, but no guarantees on “best”.

QAP is NP-hard, so finding provably optimal solutions is challenging. One approach for finding optimal solutions, called “branch and bound”, boils down to dividing and conquering by making partial assignments, solving less challenging versions of these problems, and pruning away assignments that cannot possibly lead to better solutions. I have written about this topic before. If you like allegories, try this post. If you prefer more details, try my PhD thesis.

The typewriter problem is size 26, which counts as “big” in the world of QAP. Around 20 years ago I wrote a very capable QAP solver, so I recompiled it and ran it on this problem – but didn’t let it finish. I am pretty sure it would take at least a day of CPU time to solve, and perhaps more. It would be interesting to see if someone could find a provably optimal solution!

In the meantime, this still leaves us with heuristic approaches. Here are a few possibilities:

  • Local optimization (Hardmath123’s approach finds a locally optimal “2-swap”)
  • Simulated annealing
  • Evolutionary algorithms

I ran a heuristic written by Éric Taillard called FANT (Fast ant system). His 1998 code still ran on my laptop, and within seconds it found the same permutation as Hardmath123. By the way, the zero-based permutation is [9, 21, 5, 6, 12, 19, 3, 10, 8, 24, 1, 16, 18, 7, 15, 22, 25, 14, 13, 11, 17, 2, 4, 23, 20, 0] (updated 12/7/2018 – a previous version of this post gave the wrong permutation. Thanks Paul Rubin for spotting the error!)

You can get the data for this problem, as well as a bit of Python code to experiment with, in this git repository.

It’s easy to think up variants to this problem. For example, what about mobile phones? Other languages? Adding punctuation? Gesture-based entry? With QAPs, anything is possible, even if optimality is not practical.

Overview: Reductionism in Art and Brain Science

The subtitle of Eric Kandel’s Reductionism in Art and Brain Science is “Bridging the Two Cultures”. Which cultures? Why bridge them?

Kandel’s two cultures are science and humanities: the physical nature of the universe, and the nature of human experience. His claim is that neither culture understands the other’s methodologies or goals. I can accept this claim, though I have no deep experience with either culture.

The real reason Kandel chose to write a book about art and brain science is that he knows a lot about brain science, a lot about art, and likes to talk about them. The intellectual justification is that the modern versions of art and brain science are reductionist in nature. This commonality suggests that the gap between cultures is narrower than it may first appear.

“Reductionist” is a scary and pretentious word, but all it means is breaking things down into parts. Scientific reductionism seeks to explain a complex phenomenon by examining one of its components on “a more elementary, mechanistic level”. This examination often involves experiments on the edge of our understanding, zeroing in on a single component by controlling for others. Kandel explains that modern art uses methodologies similar to those used by scientists: probing the limits of what can be explained or predicted with existing models of reality. This is so because abstract art (unlike representational art) does not seek to show the world in a familiar, three-dimensional way. It explores relationships between shapes, spaces, colors. This is reductionist:

[de Kooning, Excavation, 1950]

De Kooning applies certain artificial restrictions in this work, permitting other dimensions to run free. It is a kind of experiment.

Another reason for the joint study of art and brain science is that their forms of reductionism are related in a fascinating way: the brain subsystems that we use to form our perceptions of reality are highly activated when appreciating certain kinds of abstract art. Kandel pays special attention to the investigative, experimental work of the New York School. When we understand more about how our brain perceives, Kandel says, we can better appreciate what is captivating and unique about abstract art. This kind of investigation need not be a joy kill, as physicist Richard Feynman explains here.

This leads to Kandel’s examination of the nature of perception. Perception is more than what comes in from the outside. Perception based on visual input alone is incomplete and ambiguous: we must apply additional context in order to make sense of a flawed, jittery two dimensional projection of objects in a three-dimensional world. Our eyes are not enough. It’s long been known that the inverse optics problem can only be solved with additional top-down information (see this Quanta article for more). Top-down information helps me figure out that’s not the Statue of Liberty, it’s just a plastic ornament on the dash. Chalk drawings sometimes fool us nonetheless:


Just as perception is incomplete without top-down information, so is art. Art is incomplete without the perceptual and emotional involvement of the viewer (sayeth Riegl). Gombrich calls this the “beholder’s share”: the viewer’s interpretation of what is seen in personal terms. The beholder’s share is what supplies meaning to the picture. It reflects our consciousness and humanity.

As Kandel explains and is widely known, our brain has both top-down and bottom-up processing systems. Bottom-up systems build up higher level conclusions from small pieces of data: three edges joined make a triangle. In the case of visual processing, top-down and bottom-up systems work together to process images, allowing us to interpret the information they contain. Here is a picture:


One of Kandel’s key points is that abstract art relies more heavily on top-down processes than figurative art. Stripping away easily pattern-matched representational images puts more weight on our top-down systems, leaning on our beholder’s share: our imaginations, emotions, and creativity.

That’s quite a bit to digest, yet I’ve summarized only parts of the first four chapters, chapters that motivate Kandel’s dive into topics that clearly bring him joy and intellectual stimulation:

First, the societal and technological change that catalyzed the evolution of Western painting. Oversimplifying: Western painting evolved to depict an increasingly realistic representation of the world (sometimes with help, as chronicled by Hockney and in Tim’s Vermeer) until the advent of photography. Photography can represent reality more accurately than painting, “thus a dialogue emerged through the two art forms”. Painting needed to go elsewhere. A search for an alternative niche began, one of which was greater abstraction. This leads to a discussion of the abstraction of the figurative image by Turner, Mondrian, and others.

Next, investigations of color abstraction,

[Morris Louis, Alpha Pi 1960-1961]

[James Turrell, Meeting]

a return to figuration, reimagined:

[Alex Katz, Anna Wintour]

and finally the conclusion: that a deeper understanding of brain science will inform and enhance art. It’s fascinating stuff, filled with well-chosen examples.

Kandel provides an entertaining and thoughtful read. In so doing, Kandel takes another step down a path reaching back to the ancient worlds on all continents and through the golden ages of many of the world’s great cultures.

The Origin of CC and BCC

Those born into the computer age unwittingly use metaphors without awareness of their origins. I will explain one such metaphor painfully, and at length: CC.

Before the advent of word processors and personal computers, the typewriter was the dominant tool for producing professional documents. Here is a picture of Robert Caro’s typewriter, the Smith-Corona Electra 210:


An advantage of typewriters is that they can produce legible, consistently formatted documents. A disadvantage is that they do not scale: 1000 copies requires 1000 times the work, unless other accommodations are made. The mass production of a single document was the primary job of the printing press. Later, the mimeograph and the photocopier began to be used in certain schools and organizations. But printing presses, mimeographs, and photocopiers were all expensive.

So along with these more sophisticated tools, there was a simpler, more primal method for document duplication – the carbon copy. Carbon paper, a sheet with bound dry ink on one side, was placed in between two conventional sheets of paper, and the triplet fed into the typewriter. When the keys of the typewriter were struck, the type slug pressed against the ribbon, marking the top page with a character. The slug also pressed against the carbon paper, pressing the dried ink on the back side of the carbon paper onto the second plain sheet of paper, making the same mark. In this way, one key press marked two pages at once. Magic!

Style guidelines for typewritten letters and documents directed authors to indicate when multiple copies of the same document were being distributed to multiple recipients. The notation for this notification was to list the recipients after “cc”, for “carbon copy”.

In other words, an abbreviation for the means of duplication became a notification of duplication.

A variant is the “blind carbon copy”, or BCC. This originally meant carrying out the physical act of duplication – using carbon paper – but omitting the notification. Hence the “blind”: if you are looking at the document, you cannot determine the list of recipients. This carried over to email too.

If you are my age or older, you already knew this. If you are younger, you very likely did not. I was interested in computers in middle school but computer classes were not available. I did the next best thing and took a typing class. That’s how I learned.

Chaining Machine Learning and Optimization Models

Rahul Swamy recently wrote about mixed integer programming and machine learning. I encourage you to go and read his article.

Though Swamy’s article focuses on mixed integer programming (MIP), a specific category of optimization problems for which there is robust, efficient software, his article applies to optimization generally. Optimization is goal seeking: searching for the values of variables that lead to the best outcomes. Optimizers solve for the best variable values.

Swamy describes two relationships between optimization and machine learning:

  1. Optimization as a means for doing machine learning,
  2. Machine learning as a means for doing optimization.

I want to put forward a third, but we’ll get to that in a moment.

Relationship 1: you can always describe predicting in terms of solving. A typical flow for prediction in ML is

  1. Get historical data for:
    1. The thing you want to predict (the outcome).
    2. Things that you believe may influence the predicted variable (“features” or “predictors”).
  2. Train a model using the past data.
  3. Use the trained model to predict future values of the outcome.

Training a model often means “find model parameters that minimize prediction error on the training data”. Training is solving. Here is a visual representation:

[Diagram: solving (training) as a subroutine of predicting]
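“Training is solving” can be made concrete with a tiny sketch. The data here is invented for illustration; we fit a one-parameter model by explicitly minimizing squared error with gradient descent:

```python
# Fit y ~ w * x by minimizing squared prediction error over w.
# (Toy data, invented for illustration.)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

def loss(w):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))

w = 0.0
for _ in range(500):
    # Gradient of the squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys))
    w -= 0.01 * grad  # standard gradient descent step

# w converges to the least-squares solution sum(x*y) / sum(x*x).
```

A closed-form answer exists for this toy problem, of course; the point is only that “train” and “solve” name the same act.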

Relationship 2: you can also use ML to optimize. Swamy gives several examples of steps in optimization algorithms that can be described using the verbs “predict” or “classify”, so I won’t belabor the point. If the steps in our optimization algorithm are numbered 1, 2, 3, the relationship is like this:

[Diagram: predicting as a subroutine inside an optimization algorithm]

In these two relationships, one verb is used as a subroutine for the other: solving as part of predicting, or predicting as part of solving.

There is a third way in which optimization and ML relate: using the results of machine learning as input data for an optimization model. In other words, ML and optimization are independent operations but chained together sequentially, like this:

[Diagram: a machine learning model feeding an optimization model]

My favorite example involves sales forecasting. Sales forecasting is a machine learning problem: predict sales given a set of features (weather, price, coupons, competition, etc). Typically, businesses want to go further than this. They want to take actions that will increase future sales. This leads to the following chain of reasoning:

  • If I can reliably predict future sales…
  • and I can characterize the relationship between changes in feature values and changes in sales (‘elasticities’)…
  • then I can find the set of feature values that will increase sales as much as possible.

The last step is an optimization problem.

But why are we breaking this apart? Why not just stick the machine learning (prediction) step inside the optimization? Why separate them? A couple of reasons:

  • If the ML and optimization steps are separate, I can improve or change one without disturbing the other.
  • I do not have to do the ML at the same time as I do the optimization.
  • I can simplify or approximate the results of the ML model to produce a simpler optimization model, so it can run faster and/or at scale. Put a different way, I want the structure of the ML and optimization models to differ for practical reasons.
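Here is a toy sketch of such a chain, with made-up numbers: a least-squares fit of a demand curve (the ML step), followed by a grid search for the revenue-maximizing price (the optimization step):

```python
# Step 1 (ML): fit sales ~ a + b * price by ordinary least squares.
# (All numbers invented for illustration.)
prices = [1.0, 1.5, 2.0, 2.5, 3.0]
sales = [95.0, 87.0, 80.0, 70.0, 62.0]
n = len(prices)
mp = sum(prices) / n
ms = sum(sales) / n
b = sum((p - mp) * (s - ms) for p, s in zip(prices, sales)) \
    / sum((p - mp) ** 2 for p in prices)
a = ms - b * mp

# Step 2 (optimization): over allowed prices, maximize predicted
# revenue = price * predicted sales. The optimizer only sees (a, b),
# not the raw data -- the two steps are chained, not intertwined.
candidates = [1.00 + 0.05 * k for k in range(41)]  # $1.00 .. $3.00
best_price = max(candidates, key=lambda p: p * (a + b * p))
```

Because the fitted model is passed along as two numbers, either step can be swapped out (a fancier forecaster, a real MIP solver) without touching the other.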

In the machine learning world it is common to refer to data pipelines. But ML pipelines can involve models feeding models, too! Chaining ML and optimization like this is often useful, so keep it in mind.

Overview: Consciousness and the Brain

Consciousness and the Brain, written by French neuroscientist Stanislas Dehaene, is a fascinating overview of the mechanisms, boundaries, and possibilities of consciousness, from the point of view of an applied researcher. The work of the great Dutch primatologist Frans de Waal led me to Dehaene; time permitting, I will summarize some of de Waal’s insights in future posts.

The primary claim of Consciousness and the Brain is that genuine consciousness is indicated by conscious access: the ability for attended information to enter awareness and become reportable to others. It is the capacity to bring to mind accessible perceptions and thoughts. Vigilance, the state of wakefulness, and attention, the focusing of mental resources onto specific information, are not sufficient to constitute consciousness in Dehaene’s view.

“Consciousness is global information broadcasting within the cortex: it arises from a neuronal network whose reason to be is the massive sharing of pertinent information throughout the brain.” This network is referred to as a global neuronal workspace or simply “global workspace”. It enables a number of capabilities including:

  • purely mental operations;
  • the ability to keep data in mind indefinitely;
  • performing arbitrary mental operations;
  • reporting to others;
  • the enablement of autonomy (augmenting the spontaneous activity that also occurs in the brain).

Experiments can teach us much about consciousness and the theory of the global workspace. Dehaene focuses on experiments with “minimal contrast”: “a pair of experimental situations that are minimally different but only one of which is consciously perceived”. Riding the line in this way helps determine the mechanisms that govern conscious perception.

Unconscious mechanisms play a massive role in our lives and being. While much of our mental activity, and much of what drives our lives, is unconscious, virtually all of the brain’s regions can participate in both conscious and unconscious thought. Unconscious mechanisms can be of a higher order than we may assume; for example, conscious attention is not required to bind the elements of a scene together. The unconscious binding together of systems occurs in vision, language, attention, and even certain mathematics. (Notably, several of these examples match areas of recent progress in machine learning using neural networks.) These unconscious operations can be detected and measured through careful experiment.

That said, consciousness provides unique capabilities that even sophisticated unconscious processing cannot. This is because “conscious perception transforms incoming information into an internal code that allows it to be processed in unique ways.” For example:

  • the stable perception of objects moving through our visual space, even as we ourselves move;
  • the ability to store and retain lasting thoughts, which permit us to learn over time;
  • the ability to sequentially process information according to rules, similar to the functioning of a digital computer;
  • the ability to route information to arbitrary systems in our brain;
  • the ability to share our thoughts and perceptions through the use of language.

Through brain imaging and experimentation we are now able to identify the signatures of conscious thought. These signatures include:

  1. The ignition of our parietal and prefrontal circuits.
  2. A late slow wave in our brains, referred to as a “P3 wave”.
  3. A late and sudden burst of high-frequency oscillations.
  4. The synchronization of information exchanges across distant brain regions.

These signatures are detectable through various means, including fMRI (functional magnetic resonance imaging), MEG (magnetoencephalography), and EEG (electroencephalography).


Because consciousness appears to emerge from the building, looping, and coordination of signals from different brain subsystems (see above), consciousness lags the real world. Consciousness is formed from loops in the brain. It is these loops that permit construction of mental images given incomplete sensory data; for example, there are massive differences between the raw, imperfect visual data that enters our eyes (e.g. the blind spot where the optic nerve attaches to the retina, or our limited color perception outside the center of our attention) and our conscious perception.

The ability to read the traces of conscious thought allows us to theorize about consciousness. Global neuronal workspace theory claims that the human brain has developed efficient long-distance networks to select relevant information and disseminate it throughout the brain. Consciousness is an evolved device that allows us to attend to a piece of information and keep it active. Conscious information can then be routed to other areas based on goals.

Within our brain there are collections of neurons that send reinforcing signals to each other under certain conditions, for example when a particular person, event, or sensation is perceived or remembered. During conscious perception, a small subset of workspace neurons becomes active, while most others are inhibited. The panoply of signaling related to inhibited clusters (for example, all clusters that do not pertain to the 2016 Chicago Cubs) forms a recognizable signal, which is referred to as the P3 wave. In other words, a primary signifier of conscious thought is the signal resulting from the repression of neurons. The P3 wave is in a sense a “negative thought signal”.

Not all unconscious thought is the same. Several types can be defined. Preconscious thought is information already encoded by an active assembly; it can become conscious at any time. Subliminal thought is input given or processed so weakly that we lack the capability of attending to it. Disconnected patterns are mental activities that have no relation to conscious thought, for example our regular breathing. Diluted mental activity is neural information that has been “downsampled” for use by other systems in the brain, and therefore cannot be brought to consciousness: for example, a visual pattern that flickers so fast that you cannot see it. Early levels in our visual system may register this flickering, but it is transformed away by later levels.

Detecting and theorizing about consciousness allows us to ask, and potentially answer, deep questions. For example, a series of recent experiments seems to indicate the presence of consciousness within severely injured patients who cannot express their consciousness. An example is the case of Jean-Dominique Bauby, the author and subject of The Diving Bell and the Butterfly. A patient without the ability to communicate or move can be asked to imagine, for example, riding a bicycle for 30 seconds. The patient’s brain can be scanned via MRI during this period, and compared to a control group to establish that the neuronal clusters responding to bike riding, along with the signatures of consciousness, are present.

These questions can be asked of non-humans as well. Dehaene says “I would not be surprised if we discovered that all mammals, and probably many species of birds and fish, show evidence of a convergent evolution to the same sort of conscious workspace” as found in humans. (Here is the primary connection to Frans de Waal’s work.) Experiments show that monkeys, dolphins, and even rats and pigeons possess at least the rudiments of metacognition: thinking about thinking. “Animal behavior bears the hallmark of a conscious and reflexive mind.”

It is then worth asking what is uniquely human about human consciousness. Dehaene provides informed speculation. Perhaps it is our ability to combine our core brain systems using a “language of thought”: the inner voice that is nearly always present inside of us. Perhaps also it is our capacity to compose our thoughts using nested or recursive structures of symbols. This implies that language evolved as an internal representational device, not just as a communication system with other hairless apes. The ability to compose and nest may underlie many of our unique human abilities, such as the ability to craft complex tools, perform higher mathematics, and our self-consciousness. An examination of brain areas that are particularly well-developed among humans (as opposed to other primates) seems to support these theories at a high level.

Dehaene closes by theorizing what it would take to build artificial consciousness using computers. This topic, while fascinating, is not a primary subject of Dehaene’s book, and is best saved for another time.

Other Minds: Consciousness and Evolution

I highly recommend Other Minds, by Peter Godfrey-Smith. It’s a fascinating exploration of the minds of cephalopods, which, independently of vertebrates, developed sophisticated nervous systems and what any reasonable person would call intelligence. In so doing, Godfrey-Smith explores the tree of life, the origins and components of complex thought and consciousness, and the ways of formless, curious creatures deep below. Read this book!


Godfrey-Smith describes two important revolutionary periods in Earth’s evolutionary history. In each case, a means of communication between organisms became a means of communication within them.

The first is the Sense-Signaling revolution. Roughly 700 million years ago, the first organisms that we could reasonably call animals – sensing and acting organisms – evolved. Just a bit later, around 542 million years ago according to Godfrey-Smith, certain organisms began to develop not only sensing mechanisms, but signaling mechanisms too. Both sensing and signaling, directed outward, provide evolutionary benefits: they help animals navigate and influence their environments. There is another advantage: these same sensing and signaling mechanisms can be used inside the organism itself to better coordinate its sense-action loop. Input is processed through the senses, and then signaled in a targeted fashion to another part of the organism (be it tentacle, flipper, paw, or hand) to generate a specific response. This internal sensing and signaling happens only in higher-order organisms like animals. The internalization of sensing and signaling marks the beginnings of the development of the nervous system. Hundreds of millions of years later, the sensing and signaling mechanisms found in animals are often incredibly complex.

The second is Language. Less than half a million years ago, human language emerged from simpler forms of communication. Language is nothing but an elaborate, auditory form of sensing and signaling. Its more rudimentary forms are used by our primate cousins to warn, coax, plead, and threaten. In humans, these signals became more universal in expressive power, and more nuanced (despite recent examples to the contrary).

Godfrey-Smith, building on Hume, Vygotsky and others, notes that speech is not only for others, but for ourselves. Each of us has an inner dialogue that runs through our heads from waking to sleep, and even in our dreams. Our inner speech is inseparable from our conscious selves. In a beautiful passage, Godfrey-Smith writes, “inner speech is a way your brain creates a loop, intertwining the construction of thoughts and the reception of them.” This loop not only helps us direct our action, but it can clarify, integrate, and reinforce our conceptions. For Godfrey-Smith and others such as Baars and Dehaene (whom I’ll cover in a future post since Consciousness and the Brain is amazing), our inner speech is a necessary ingredient in our integrated subjective experience as human beings. It helps us to direct our thoughts in a deliberate, planful way – the “System 2 thinking” that Kahneman writes about in Thinking, Fast and Slow.

The Sense-Signaling and Language revolutions were both forerunners of radical planetary change: in the first case, the Cambrian explosion; in the second, the rise of Homo sapiens to primacy.

Speaking of us: it is interesting to compare the revolutions described by Godfrey-Smith to those described by Yuval Harari in Sapiens: A Brief History of Humankind. Harari’s, as summarized by Wikipedia, are the following:

  • “The Cognitive Revolution (c. 70,000 BCE, when Sapiens evolved imagination).
  • The Agricultural Revolution (c. 10,000 BCE, the development of farming).
  • The unification of humankind (the gradual consolidation of human political organisations towards one global empire).
  • The Scientific Revolution (c. 1500 CE, the emergence of objective science).”

Harari’s Cognitive Revolution, in my view, maps reasonably well to Godfrey-Smith’s Language revolution. The remaining items in Harari’s list, when considered in Godfrey-Smith’s context, seem like nearly inevitable consequences of the first. Perhaps I am giving us too little credit, or perhaps too much.

Four Things I Learned from Jack Dongarra

Opening the Washington Post today brought me a Proustian moment: encountering the name of Jack Dongarra. His op-ed on supercomputing involuntarily recalled to mind the dusty smell of the third floor MacLean Hall computer lab, xterm windows, clicking keys, and graphite smudges on spare printouts. Jack doesn’t know it, but he was a big part of my life for a few years in the 90s. I’d like to share some things I learned from him.

I am indebted to Jack. Odds are you are too. Nearly every data scientist on Earth uses Jack’s work every day, and most don’t even know it. Jack is one of the prime movers behind the BLAS and LAPACK numerical libraries, and many more. BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra Package) are programming libraries that provide foundational routines for manipulating vectors and matrices. These routines range from the rocks and sticks of addition, subtraction, and scalar multiplication up to finely tuned engines for solving systems of linear equations, factorizing matrices, determining eigenvalues, and so on.
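You don’t have to dig far to see these foundations from Python: SciPy exposes thin wrappers over the very same routines. A minimal sketch (the `scipy.linalg.blas` and `scipy.linalg.lapack` wrappers are real; the matrices here are toy data):

```python
import numpy as np
from scipy.linalg import blas, lapack

a = np.array([[2.0, 1.0], [1.0, 3.0]])
x = np.array([1.0, 1.0])

# dgemv: double-precision general matrix-vector multiply, y = alpha * A @ x
y = blas.dgemv(1.0, a, x)

# dgetrf: LU factorization of a general matrix (a LAPACK routine)
lu, piv, info = lapack.dgetrf(a)
```

NumPy’s higher-level calls like `np.linalg.solve` bottom out in these same routines.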

Much of modern data science is built upon these foundations. They are hidden by layers of abstractions, wheels, pips and tarballs, but when you hit bottom, this is what you reach. Much of ancient data science was built upon them too, including the solvers I wrote as a graduate student when I was first exposed to his work. As important as LAPACK and BLAS are, that’s not the reason I feel compelled to write about Jack. It’s more about how he and his colleagues went about the whole thing. Here are four lessons:

Layering. If you dig into BLAS and LAPACK, you quickly find that the routines are carefully organized. Level 1 routines are the simplest “base” routines, for example adding two vectors. They have no dependencies. Level 2 routines are more complex because they depend on Level 1 routines – for example multiplying a matrix and a vector (because this can be implemented as repeatedly taking the dot product of vectors, a Level 1 operation). Level 3 routines use Level 2 routines, and so on. Of course all of this is obvious. But we dipshits rarely do what is obvious, even these days. BLAS and LAPACK not only followed this pattern, they told you they were following this pattern.
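The layering idea fits in a few lines. A sketch in plain Python standing in for Fortran (the names are BLAS-inspired, not the real routines):

```python
def dot(x, y):
    # Level 1: a vector-vector operation with no dependencies
    return sum(xi * yi for xi, yi in zip(x, y))

def gemv(A, x):
    # Level 2: matrix-vector multiply, built entirely on the Level 1 dot
    return [dot(row, x) for row in A]
```

Each level is testable and tunable on its own, which is exactly why the organization pays off.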

I guess I have written enough code to have acquired the habit of thinking this way too. I recall having to rewrite a hilariously complex beast of project scheduling routines when I worked for Microsoft Project, and I tried to structure my routines exactly in this way. I will spare you the details, but there is no damn way it would have worked had I not strictly planned and mapped out my routines just like Jack did. It worked, we shipped, and I got promoted.

Naming. Fortran seems insane to modern coders, but it is of course awesome. It launched scientific computing as we know it. In the old days there were tight restrictions on Fortran variable names: 1-6 characters from [a-z0-9]. With a large number of routines, how does one choose names that are best for programmer productivity? Jack and team zigged where others might have zagged and chose names with very little resemblance to English words.

“All driver and computational routines have names of the form XYYZZZ”

where X represents data type, YY represents type of matrix, and ZZZ is a passing gesture at the operation that is being performed. So SGEMV means “single precision general matrix-vector multiplication”.

This scheme is not “intuitive” in the sense that it is not named GeneralMatrixVectorMultiply or general_matrix_vector_multiply, but it is predictable. There are no surprises and the naming scheme itself is explicitly documented. Developers of new routines have very clear guidance on how to extend the library. In my career I have learned that all surprises are bad, so sensible naming counts for a lot. I have noticed that engineers whom I respect also think hard about naming schemes.
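The scheme is regular enough that a few lookup tables decode any routine name. A toy decoder of my own invention (the tables are heavily abbreviated; the full lists live in the LAPACK documentation):

```python
# Decode BLAS/LAPACK names of the form XYYZZZ.
TYPE = {"S": "single precision", "D": "double precision",
        "C": "complex", "Z": "double complex"}
MATRIX = {"GE": "general", "SY": "symmetric", "TR": "triangular"}
OP = {"MV": "matrix-vector multiply", "MM": "matrix-matrix multiply",
      "SV": "solve", "TRF": "factorize"}

def decode(name):
    x, yy, zzz = name[0], name[1:3], name[3:]
    return f"{TYPE[x]} {MATRIX[yy]} {OP[zzz]}"
```

So `decode("SGEMV")` gives back “single precision general matrix-vector multiply” – no surprises, by construction.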

Documentation. BLAS and LAPACK have always had comprehensive documentation. Every parameter of every routine is documented, the semantics of the routine are made clear, and “things you should know” are called out. This has set a standard that high quality libraries (such as the tidyverse and Keras – mostly) have carried forward, extending this proud and helpful tradition.

Pride in workmanship. I can’t point to a single website or routine as proof, but the pride in workmanship in the Netlib has always shone through. It was in some sense a labor of love. This pride makes me happy, because I appreciate good work, and I aspire to good work. As a wise man once said:

Once a job is first begun,
Never leave it ’till it’s done.
Be the job great or small,
Do it right or not at all.

Jack Dongarra has done it right. That’s worth emulating. Read more about him here [pdf] and here.


2018 NCAA Tournament Picks

Every year since 2010 I have used data science to predict the results of the NCAA Men’s Basketball Tournament. In this post I will describe the methodology that I used to create my picks (full bracket here). The model has Virginia, Michigan, Villanova, and Michigan State in the Final Four, with Virginia defeating Villanova in the championship game.

Here are my ground rules:

  • The picks should not be embarrassingly bad.
  • I shall spend no more than one work day on this activity (and 30 minutes for this post). This year I spent two hours cleaning up and running my code from last year.
  • I will share my code and raw data. (The data is available on Kaggle. The code is not cleaned up but here it is anyway.)

I used a combination of game-by-game results and team metrics from 2003-2017 to build the features in my model.

I also performed some post-processing:

  • I transformed team ranks to continuous variables using a heuristic created by Jeff Sonas.
  • Standard normalization.
  • One hot encoding of categorical features.
  • Upset generation. I found the results to be not interesting enough, so I added a post-processing function that looks for games where the win probability for the underdog (a significantly lower seed) is quite close to 0.5. In those cases the model picks the underdog instead.
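The upset-generation step could be sketched as follows. This is a hypothetical reconstruction (the function name, input shape, and threshold are my own; only the rule matches the description above):

```python
def apply_upsets(games, threshold=0.05):
    # games: list of (favorite, underdog, underdog_win_probability) tuples.
    picks = []
    for favorite, underdog, p_underdog in games:
        if p_underdog > 0.5:
            picks.append(underdog)   # the model already favors the underdog
        elif 0.5 - p_underdog < threshold:
            picks.append(underdog)   # close call: manufacture an upset
        else:
            picks.append(favorite)
    return picks
```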

The model predicts the probability one team defeats another, for all pairs of teams in the tournament. The model is implemented in Python and uses logistic regression. The model usually performs well. Let’s see how it does this year!
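The pairwise setup might look something like this scikit-learn sketch (the features and data below are made up; only the overall shape – difference features in, win probability out – matches the description):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy sketch: each row is (team A metrics - team B metrics),
# label 1 if team A won the game.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))  # e.g. rank, efficiency, seed differences
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)
p_a_beats_b = model.predict_proba(X[:1])[0, 1]
```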