Nathan Brixius

Review of “Parfit” by David Edmonds

Parfit: A Philosopher and His Mission to Save Morality is the story of a philosopher who gambled with his life.

David Edmonds is warm and particular in his profile of Derek Parfit, a British philosopher born in China who considered questions of personal identity, choice, ethics, and morality.

I came for the philosophy but stayed for the human drama. Parfit is chronicle of a journey into the monomaniacal. A boy with eccentricities enthusiastically expressed through a range of eclectic interests becomes a man with singular focus on his philosophy. This focus draws him near to some and alienates others.

By the end of the book I was looking back wistfully at Parfit’s own youth and was left wondering whether his path was inevitable. Whether the path was worth it. The reflexive answer to the latter question is, “Of course it was worth it! Look at the ideas he gave to the world and how that influenced so many others!” I can’t be sure of that. I can’t put those things on the scale. Ironically, exploring this question only leads me back to the moral theories of Parfit’s own work.

I bought Parfit’s Reasons and Persons many years ago and dropped it after a few pages. It was way too long and way too difficult for me. Many have said the same of Proust. Like Proust’s protagonist, Parfit was faced with two paths: devotion to philosophy or a fully lived life. Edmonds describes the personal (and at times touching) impact that Parfit’s work had on the philosophical community. So it can be said that his paths were joined, but not in Parfit. This is just as the two paths of In Search of Lost Time (art and human connection) are joined not in Marcel, but in Swann’s daughter.

Edmonds’ work makes plain further parallels between Proust and Parfit. Each eventually wins you over to their inner world. If you stick with In Search of Lost Time, Proust becomes a friend, or at the very least a roommate in a very small apartment. I suspect knowing Parfit would have been a similar experience. They wear you down, or you leave. Both men are monomoniacal: Proust in his cork lined room, living through intermediaries, writing letters to his neighbors and treating life as a distraction from art. Parfit in his upstairs hovel, living through his colleagues, writing detailed notes in response to anything he is sent, and treating life as a distraction from his philosophy.

Parfit did not believe in the self, unlike Proust. He understood “us” to be psychologically continuous thinking machines, and I am understood as “me” by others because “they” recognize “me” as a thinking machine. Proust on the other hand, believed almost exclusively in the self. The entire arc of In Search of Lost Time considered the exploration of two possible paths of the self, those of “Swann” (art) and “Guernmantes” (human connection). Both Proust and Parfit were obsessed with exploring as many of the details and consequences of their paths as possible, blowing out the length and latency of their books to implausible widths.

But back to the self. Parfit made an indelible mark on those he worked with, and also with those who have closely explored his work. In the memories of others there lies some fraction of psychological continuity with Parfit’s former physical being. By his own definition, Parfit is only mostly dead. He lives on now through his work, just as Proust believed that there can be eternality in a work of art.

So did Bulgakov. Woland in The Master and Margarita famously said “manuscripts don’t burn”. This is a rare case where the devil was right.

Parfit was a fun read and far shorter than Parfit’s own works, yet I felt as if I came away of something of his essence.

Don Knuth’s MIP, 64 years later

In 1960, the famed computer scientist Don Knuth wrote a technical paper in which he considered an integer programming model for minimizing memory access latency of an IBM 650. I’ve included a screenshot of the 51 variable, 43 constraint problem at the end of this post.

Prof Knuth describes the problem and his attempts to solve it in an unpublished note on his website. Knuth used a technique of Ralph Gomory published the same year, which he somehow became aware of despite the absence of email, the web, arxiv, or large language models. I asked ChatGPT to render an image of Gomory and Knuth pensively sitting atop a mainframe. It refused for privacy reasons. Here is the best it could do:

Knuth ran Gomory’s algorithm on an IBM 650 with less than 10K of memory, but was unable to find an optimal solution. Thirty-five years later (when I was an undergrad), Dimitris Alevras successfully found the optimum value of 22996 using CPLEX on a SPARCstation 5. I am proud to say I have coded on such a machine! Alveras reported that “32,200 branch-and-bound nodes” were needed to solve the problem to optimality. No CPU time is reported in Knuth’s writeup, but we can safely assume the solution time was minutes or likely hours.

I grabbed the problem formulation from Knuth’s writeup and translated it into Python code (copilot was helpful for this purpose). An MPS file with the model can be found here.

I found that the open source solver SCIP was able to find an optimal solution in two tenths of a second, and that Gurobi was able to solve it in less than 0.1s.

This is another testament to the amazingly wonderful effectiveness of operations research!

Caitlin Clark Would Score More Points than Pete Maravich in the Same Number of Games If She Shoots As Much As He Did, Even Without the Three Point Shot

It’s impossible to compare players from different eras. That stops precisely no one from doing so.

On March 3, 2024, Caitlin Clark became the all time leading scorer in major college basketball, passing Pete Maravich when she scored her 3668th point.

Some have pointed out that Pete played only 83 games over three seasons, whereas Caitlin has played 129 over four seasons, as of March 3 2024. Additionally, the three point shot did not exist when Pete played college ball. As a result some have characterized Pete as being in a “different stratosphere”. What does the math say? I took 15 minutes and gave it a shot.

Courtesy of sports-reference.com (founded by an Iowa grad), here are key statistics for Pete Maravich and Caitlin Clark as of March 2 2024. (This is before Caitlin broke Pete’s record, but don’t worry about that right now.)

	g	fg2	fga2	fg2%	fg3	fga3	fg3%	ft	fta	ft%
pete	83	16.7	38.1	0.438	0.0	0.0	0.000	10.8	13.9	0.8
caitlin	129	5.4	9.7	0.557	3.9	10.2	0.382	5.8	6.8	0.9

Pete Maravich and Caitlin Clark (as of March 3 2024)

Without the three point shot, Caitlin would score 503 fewer points. Furthermore, if Caitlin had played only 83 games instead of 129, her point total goes down to 2025, far behind Pete.

But we haven’t adjusted for the fact that Pete shot much more than Caitlin. 38.1 total shot attempts per game would result in 27.9 attempts from 2 point range by Caitlin per game. This would take her back up to 83 * [2 * (27.9 * 0.557 + 3.9) + 5.8] = 3708 points in 83 adjusted games, more than Pete’s 3667.

Now, if you say that some of Caitlin’s additional shots would have been from three point range (and would count for two), I would say you are right. But then I would say that Caitlin is no dummy and she would not shoot that many more long distance shots, and they would be higher percentage shots than the ones she takes right now, because she would know they are worth two not three. And then I would say that since she would be taking more two point shots, she would be getting more free throws so we should adjust those upward as well.

What’s the lesson here? Don’t compare players from different eras. But if you do, please don’t cherry pick which factors you adjust for.

Here’s a spreadsheet with all of the above – let me know if I missed something.

A Fight With A Cannon

Ed asked me if I had read Victor Hugo. I wish I could say it were on a bench on one of the trails that ran through the old Microsoft campus, perhaps next to a flower bed or a fountain or “Lake Bill”. But that would not be true – it was in a dark office with cheap blinds in an anonymous office building. I don’t know if it’s been torn down, but I hope it has.

At the time, I didn’t know how old Ed was. Ed died about ten years ago, so I have since learned that he was in his seventies when we worked together. I haven’t had the pleasure of working directly with someone of that age since. I am the worse for it.

Ed was a mentor of mine because he was wise, and insightful, and giving of his time. We had no formal schedule. Sometimes I’d wander down to his office, or we would hang out after a meeting, or we would just bump into each other and chat for a few minutes. Not all our interactions were meaningful, but all were memorable.

I’d had a rough year at work prior to meeting Ed. I was on a dream team with a dream job. I was working on exactly what I wanted to work on, with people who were much more talented, and smart, and motivated than me. I had reconnected with what I loved: math, operations research, and coding. Then the economy tanked, and Microsoft laid off several thousand employees, including nearly all of the experimental organization of which my team was a part. When the layoffs went down, our team was one of the only ones spared in our org, but only because we were moved into a much larger organization in a different part of the company. During the course of this change my manager and greatest supporter was obliged by circumstance to depart, leaving me as the leader of a small team in unfamiliar waters, surrounded by unfamiliar people who did not believe in and did not understand why we were doing what we were doing. Despite directives to shut down our small team and “align” with the charter of our new organization, I lobbied layer upon layer of management to preserve our team, and our charter, making the case that what we built was a small jewel to be placed in the very center of the strategic crown. Despite small victories, and the fact that I was right, I wasn’t really convincing anyone. Ed could see it. I could see it, but I couldn’t accept it. I kept lobbying, and fighting, and revising my arguments. Even the victories felt like losses.

So Ed asked me if I’d read Victor Hugo. I knew that Hugo wrote Les Miserables. I hadn’t read it, but had seen it performed. But I knew already that Ed wasn’t going to say something about Les Miz; that would be too obvious. Ed had started his career at IBM in 19XX, for some small value of XX. He was raised in the mainframe age, toiled and experimented and built through the minicomputer era, through the PC era, into the compute cluster era, and into what would become the age of cloud computing.

He forgot very little.

Which is why, when he asked me if I knew of Victor Hugo, it was not to reference Les Miz, but “A Battle with A Cannon”. On a French ship of the line, a twenty four pound cannon breaks free of its moorings, causing chaos and destruction.

It was the fault of the gun captain, who had neglected to fasten the screw-nut of the mooring-chain, and had insecurely clogged the four wheels of the gun carriage; this gave play to the sole and the framework, separated the two platforms, and the breeching.

The gun whirls around deck, placing everyone aboard at risk. The gunner heroically steps in:

Suddenly, in the midst of this inaccessible ring, where the escaped cannon was leaping, a man was seen to appear, with an iron bar in his hand. He was the author of the catastrophe, the captain of the gun, guilty of criminal carelessness, and the cause of the accident, the master of the carronade. Having done the mischief, he was anxious to repair it. He had seized the iron bar in one hand, a tiller-rope with a slip-noose in the other, and jumped, down the hatchway to the gun-deck.

The gunner is assisted by a haggard passenger, who turns out to be a general. Together they subdue the gun. The gunner is then addressed by the captain of the ship in front of an anxious crew, and the captain turns to the general to decide the gunner’s fate.

The general takes the Cross of Saint Louis from the captain’s uniform and pins it on the gunner, to the cheers of the crew. Then he says:

Carelessness has compromised this vessel. At this very hour it is perhaps lost. To be at sea is to be in front of the enemy. A ship making a voyage is an army waging war. The tempest is concealed, but it is at hand. The whole sea is an ambuscade. Death is the penalty of any misdemeanor committed in the face of the enemy. No fault is reparable. Courage should be rewarded, and negligence punished.

The gunner is shot.

I apologize for spelling out what is obvious, but Ed wanted me to consider myself in the position of the captain, and whether I would be willing to both receive a medal and be shot for my service. The point of this story is not to share what happened. (I was shot.)

I have thought about this story, and Ed, many times since. I learned that sometimes, I am the gunner. Sometimes, I am the captain, dealing with a gun careening around on deck, in danger even though I wasn’t around when the cannon was loosed. Sometimes I am the general. Other times, I am the fucking cannon. Sometimes I think I might be the boat. And once in a while I am a pirate, the wind on my face and a cutlass loose in my hand, peering through a looking glass at the chaos and smiling at my prize.

In times of stress or confusion, whether pangs of regret or rivers of injustice, I call to mind a boat, a cannon, a gunner, a captain. Who is the gunner? Who is the captain? Who is the boat? Who is the cannon?

I wish I could ask Ed. Not because he’d answer, but because he’d understand.

Data, Anecdotes, Maps, Territories

Incredibly, I am now N years old. In the years that have passed since I was N – 20, I have found that one of the most important questions I can ask myself at any time is, “What the hell is going on?” If I can answer that question, most of the time I know what to do.

On LinkedIn I had an interesting discussion with an old friend about data and decision making. I mentioned the popular Jeff Bezos saying, “when data and anecdotes disagree, the anecdotes are usually right.” Some had doubts about whether it is true. Others felt it is not well expressed. I empathize with both points but I have found this principle quite useful.

Lex Fridman’s recent podcast with Jeff Bezos covered this very topic, starting about 83 minutes in.

The key bit is this:

“You forget the truth of why you were watching that metric in the first place. [..] You need to constantly be on guard. [People are] managing to metrics that they don’t really understand. They don’t know why they exist, and the world may have shifted under them a little, and the metrics are no longer as relevant as they were when somebody ten years ago invented the metric.”

This is a restatement of the even older aphorism, “the map is not the territory”.

Before I continue, I want to make clear that data is good, and data should be used for decision making. I believe that most organizations vastly underutilize or misinterpret their data. The first order of business for such organizations should be to try to fix this problem, and not get too caught up in anecdotes. Why is data good? Because our judgment is bad – but bad in very useful ways. This has been discussed at length by social scientists, most notably Kahnamen and Tversky in “Thinking Fast and Slow”. Our “System 1” brain is frequently biased: we overemphasize negative outcomes, rely on information that readily comes to mind, and anchoring our assessments in previously set expectations. The list goes on. Data helps us to push past these biases and try to take a more reasoned “System 2” view.

So why are anecdotes useful? Why should we listen to them? Jeff’s point is nuanced. He is not suggesting that a single anecdote trumps a terabyte of data. He is not suggesting that we should not collect data. He is pointing out that the map is not the territory. The map may be out of date. The map may not be detailed enough to make the decision we want to make. We may have the map upside down. We may have sailed off the edge of the earth.

Moreover, even if the data is “right”, we may have done a crappy job of interpreting it. For example, Sam Savage’s Flaw of Averages entertainingly describes the perils of making decisions on single point estimates. Customers may like your product 90% of the time. That’s pretty good. But what if we are making a decision that is much more pertinent to the remaining 10%? The objection to this is “aha, but you can still make a data informed decision.” And that is true. But if the relevant metric, that is, the relevant transformation and pivot of the data, is not brought to fore, the relevant assessment will not happen. Interesting anecdotes often serve the purpose of highlighting the blind spots in how we look at our data.

I once worked for a manager who frequently reduced complicated problems down to rather pithy statements about data. “Ah, we just need to look at the P95.” An overreliance on aggregate metrics can blind us from understanding the variability and variety in the territory. This is often where key battles are won and lost.

I hesitate to use military analogies, but anecdotes can be much like “letters from the front lines”. They tell us what is going on in the world itself; the terra firma. Data, on the other hand, is indirect and necessarily lower fidelity (for as they say, the plural of anecdotes is data). A general who relies only on maps, computer screens, and PowerPoints, and fails to understand what the troops are saying is a poor one. (Orson Welles tells a likely apocryphal story about George Marshall that makes this point rather well.)

Good data scientists notice weird things and look into them. Little outliers, or blips, or things that don’t seem right. They do this even when the big important metrics say everything is fine. These little blips are, in essence, anecdotes. They should not be ignored. Nor should they be blindly followed. They should be investigated. For me, this is the point of the saying.

One last philosophical point. Metrics, on their own, do not require us to form a theory. Neither does code, on its own. I believe that good decisions, if they are good on purpose and not by accident, rely on some kind of theory about the world. A theory that can be tested and re-evaluated. This is entirely compatible with science, and compatible with data, and compatible with anecdotes. Anecdotes can point the way to situations where we do not understand things; where our theory is incomplete. Feynman used the analogy of chess to make this point.

So, yes, the aphorism is probably not well stated. The point is that quite often you ignore anecdotes at your peril!

Stumbling Towards Rightness

There is an Amazon leadership principle called “Are Right, A Lot”. It sounds kind of stupid. What the hell am I supposed to do about that? Some people are obviously more right than others. But why? They’re born that way? They work harder? How am I supposed to “Be Right, A Lot”?

In Kabbalah they say “God is a verb”, and so it is with rightness. “Right” is a process, not a state. The point is not to “be right” but “get to right”. The other day I was asked for advice from a talented colleague who is earlier in their career than I. I told them to pick one thing, even if it is small, and learn more about it than anyone else on the team. Number one without question: top of the heap, A-number-one. It’s useful to have that knowledge of course, but that’s not the point. People start to recognize you as an expert, but that’s not the point. It builds your personal brand, but that’s not the point. The point is to learn what it takes to be the best at something: learning what it takes to “get to right”. Learning what that feels like. It tends to require a great deal of focus, exploration, reliance on multiple sources of information, critical thinking, and failure. I think it’s important to go through that.

Even better, once you “get to right” in one area, it is starts to become easier to “get to right” in another. Best of all, you also begin to recognize this quality in others, which helps you as you grow into positions of leadership and influence.

I figured this out quite a few years ago, and it’s served me well. But I think I have learned something new:

Along with this pattern of rightness there emerges a “rightness instinct” that arises from our pattern matching brain. It starts to feel like you can “just tell” when people are right, even in situations where you don’t have all of the necessary information. Stuff starts to “sound right” or “sound wrong”. And it works! People can get really good at developing this instinct. Perhaps you have noticed leaders with this quality. They ask the incisive question. They immediately probe the most difficult area. They uncover the place where there was the most debate. It seems like magic.

This quality can be incredibly helpful, and impressive to watch, but also I think dangerous. One can begin to trust the instinct above what developed the instinct in the first place. Orson Welles referred to magicians who fall under their own spell of their own magic “becoming a shut eye”:

Orson Welles falling under his own spell

“A shut eye is the fellow who begins to believe himself. […] Suppose you are a night clerk in a hotel. When a fellow [comes to the desk] you look at them carefully and you [decide where to put them in the hotel] based on various pieces of evidence. When you have been there a little longer you glance and you tell them. Then after a little longer you don’t have to look at all. You’ve seen it, but the computer inside has made all of the deductions without your being conscious of it. The mind reader begins to do it without thinking and begins to say it’s true.”

A shut eye has become too accustomed to being right. They keep the instinct and discard what got them there in the first place. (Perhaps you’ve noticed leaders with this quality as well.) Because they are nearly always right, and they believe in their own powers, it’s hard to catch the rare cases where they, in fact, are wrong. This, eventually, becomes their undoing.

So what is to be done? I am not sure, but I have a theory: become a different kind of magician. Become the kind who reveals their trick. Penn and Teller illustrated this beautifully in this clip:

For this kind of magician, along with the question or conclusion comes the meta: the disclosure of the principle or instinct. Instead of saying, “You’re not going to meet the latency requirement” you might say instead “in these kinds of projects I typically find that latency requirements are tough to meet, because model accuracy requirements are often too demanding. Is that the case here?”

I’m finding that this method helps to demystify rightness, and open conversation. It also allows for the infinitesimal chance that you are wrong! As Penn and Teller demonstrate, the magician who reveals their trick is still magical. And far more endearing, for there is a problem with people who are always right but fail to disclose their methods. They are annoying. They can also seem distant, for most of us are so incredibly wrong most of the time. We are stumbling towards rightness.

Character Heatmaps in Proust’s In Search of Lost Time

This post is a continuation of a series where I use natural language processing (NLP) to analyze the text of Proust’s In Search Of Lost Time (ISLT). Here is the Python code for this series.

In my previous post I showed that the five most frequently mentioned characters in ISLT are Albertine (2338 mentions), Swann (1338), Charlus (1303), Robert Saint-Loup (1091) and Odette (971). But when do these references occur?

Readers of ISLT know that characters come and go as the story progresses. Swann and Odette play prominent roles in Volume 1 (Swann’s Way), whereas Albertine does not enter the story until Volume 2, coming to the fore as an object of the Narrator’s obsessions in Volumes 5 (The Prisoner) and 6 (The Fugitive).

I thought it would be interesting to try and visualize these transitions, so I recorded the chapter and paragraph of each reference of each character, in effect producing “coordinates” for every portion ofthe text. For example, the first proper name in the text (appropriately, François Ier), is (1, 1). I counted the number of references for each name within each of the 486 chapters.

These counts constitute a kind of “character heat map”: when characters are referenced frequently in a chapter, these counts are high, and are zero if a character is not mentioned at all.

I created the following heat map of the top five characters by ripping off some stuff I read on stackoverflow:

The tick marks represent the seven volumes of ISLT and each slender vertical line represents a chapter. Of course, not all chapters are the same length so the diagram is not accurate in that respect. From this visualization you can see the following:

Albertine does enter the picture in volume II and dominates volumes V and VI;
Odette features heavily in Volume I (especially in “Swann in Love”), with only intermittent appearances thereafter;
Saint-Loup is introduced in Volume II and taking over the focus from the Swann-Odette relationship in Volume I;
Swann is a central figure in Volume I and has occasional reappearances thereafter. When he enters the story he is typically a central figure.

Proust treats places as characters of their own, so it’s also interesting to examine where places are referenced. Here are four frequently referenced places. Combray is where the story starts, of course, whereas Balbec later becomes a place of relaxation and refuge to which Proust returns repeatedly. Doncières is where Saint-Loup (an aspiring officer) is stationed, and so references to Doncières are correlated to those of Saint-Loup. Paris, the beating heart of society, is mentioned throughout ISLT.

I also produced heat maps of five of the most frequently occurring words (each a theme):

There are other ways to visualize and juxtapose references to characters, places, and concepts, of course, and perhaps in future posts I will explore further.

For the next post in this series, however, we will examine the context in which characters are mentioned. Which are referred to in positive and negative terms, and when do changes in sentiment occur?

Names and Places in Proust’s In Search of Lost Time

This post is a continuation of a series where I use natural language processing (NLP) to analyze the text of Proust’s In Search Of Lost Time (ISLT). Here is the Python code for this series.

In my previous post I examined which words most frequently occur in ISLT. You may have noticed that both “faire” and “faisait” – variations of the verb “to do” – appear in the list. If we group variations of the same root word together (for example in English, “am”, “are”, and “was” all group with “be”), then we end up with a better characterization of word frequency. This process is called lemmatization. Lemmatizing reduces the number of unique words in ISLT from 38,662 to 20,948.

The table below summarizes word frequencies for this reduced set. Verbs are dominant: “to do” and “to have” on top, closely followed by “to see”, “to be”, “to know”, “to come”, “to take”, “to say”, “to go”, and “to want”.

faire	5669
avoir	4790
y	3759
voir	3506
être	3307
savoir	2661
venir	2605
dire	2406
aller	2398
vouloir	2378
Albertine	2339
croire	2162
jour	2124
femme	2028
chose	2022
grand	2012
pouvoir	1988
connaître	1934
donner	1856
petit	1827

Twenty most frequently occurring lemmatized words in ISLT

But we should pivot from verbs, shouldn’t we? After all, names and places are of deeper interest to the Narrator of ISLT; in fact the third part of Swann’s Way is called “Place Names: The Name”.

Which names and places are most referenced in ISLT? Using spaCy, I was able to determine the answer. Here are the top 30 names:

Albertine	2338
Swann	1338
Charlus	1303
Saint-Loup	1091
Odette	971
Mme de Guermantes	897
Françoise	787
Gilberte	715
Mme Verdurin	660
Morel	558
Bloch	474
Andrée	383
Mme de Villeparisis	375
duc de Guermantes	328
Brichot	309
Bergotte	300
Norpois	293
Cottard	282
Monsieur	223
Jupien	221
Verdurin	202
Mme de Cambremer	182
Elstir	179
Rachel	168
Vinteuil	167
M. Verdurin	158
Legrandin	154
les Verdurin	137
Mlle Vinteuil	136
Forcheville	118

The 30 most frequently referenced names in ISLT

There are certain to be some errors in these counts, but I gave it a shot. I tried to combine references to different names for the same individual, for example “Robert” and “Saint-Loup” for Saint-Loup, “Mme Swann” and “Odette” for Odette, “Basin” for duc de Guermantes, and so on. Perhaps in the future I will revise this table slightly as I add to my list of aliases.

I will leave deeper analysis to others, but there are two obvious points worth making. The first is that characters are not always referenced by their proper names. Often definite pronouns such as “you”, “him”, “her”, etc are often used in the text. This analysis fails to account for that fact. To do so, I’d need to trace the reference of each definite pronoun back to its source, which I don’t know how to easily do. (I do know that if I did, the most referenced character in ISLT would be, of course, the narrator Marcel – who is mentioned by name only five times, twice in the second paragraph of chapter 346 and three times in the second paragraph of chapter 357, but as “je” countless times.)

Secondly, in my view, the list itself does reflect the true importance of each character within ISLT. For example, Gilberte is quite a critical figure in ISLT, linking the first volume to the last, and symbolizing the theme of regaining time…however she barely makes the top 10 in terms of frequency. Are there better ways of trying to capture the “deeper significance” of characters within a story? Perhaps a future post will revisit this topic.

Moving on, here are the ten most mentioned places:

Balbec	737
Guermantes	657
Paris	521
Combray	426
France	154
Venise	128
Élysées	111
Champs	97
Méséglise	75
Doncières	66

The ten most frequently mentioned places in ISLT

(It seems weird that Champs and Élysées are not recognized by spaCy as belonging to a single compound token; I need to look into the details on that.)

In my next post, I will further examine the use of names and places in ISLT.

2/26/2024: Thanks to reader Vince for spotting a couple of errors.

Text Analytics on Proust’s In Search of Lost Time

Seven years ago I analyzed the text of Marcel Proust’s In Search of Lost Time to find the five longest sentences (in English). This is the first of a series of posts where I use natural language processing (NLP) to analyze the text of Proust’s sprawling, introspective, and intensely human epic.

The Python code supporting this analysis is provided here. I make frequent use of the popular NLP package spaCy in my code. Unlike my previous Proust posts, here I use the original public domain French text as provided by marcel-proust.com.

The complete text of In Search of Lost Time (ISLT) is comprised of seven volumes consisting of 486 chapters. Including chapter titles there are 6,023,707 characters, including punctuation. There are 1,175,999 words in all, and 38,662 unique words. (The average adult English language vocabulary is 20,000-35,000 words.)

Here the five most used words, with their frequencies. They are boring:

de	52378
la	28637
à	27597
que	25687
et	25238

The five most frequently used words in ISLT

This is not really what we want, of course; it’s just a bunch of conjunctions, articles, and prepositions – all examples of what the NLP world calls “stop words“. The list of most common words changes significantly when we exclude spaCy’s list of French stop words. The results are given below. Even in a simple analysis such as this, some of ISLT’s key themes emerge: place, time, women, men, Albertine, Swann.

y	3759
faire	2868
Albertine	2338
eût	2038
fois	1727
vie	1724
jamais	1643
temps	1632
voir	1570
air	1458
femme	1397
moment	1363
jour	1362
Swann	1337
Charlus	1302
monde	1291
faisait	1249
chose	1246
homme	1033
yeux	1028

The twenty most frequently occurring words in ISLT, excluding stopwords

Word frequency changes in interesting ways. Here is a plot of the 200 most frequent words; a visual representation of my first table. You can see that the frequency drops off quickly from over 50,000 to under 1,000 quickly, followed by a long tail of seldom-used words:

Frequency plot of the 200 most-used words in ISLT

If I exclude stop words, I get a similar result. The curve seems to fit a power distribution reasonably well.

Frequency plot of the 200 most-used words in ISLT, excluding stop words

In the plot above, the most frequent word “y” has rank 1, the next most frequent word “faire” has rank 2, and so on. It is has been observed that the rank and frequency of words in a large corpus of text often follow’s Zipf’s law. Another interesting writeup of Zipf’s law can be found here.

If we plot the log of the rank and the log of the frequency, Zipf’s law says that the result should be linear. When we do this for all 30,000+ unique words in ISLT, we do see something that roughly follows Zipf’s law:

I don’t know what it means, or if it means anything at all…but it’s interesting.

This is just a start. In future posts I will explore references to characters and places within ISLT, and share some interesting findings.

A Lazier Way to Solve Knuth’s Skeleton Multiplication Puzzle

Last time, I described a solution to the following “skeleton multiplication puzzle” from Don Knuth’s The Art of Computer Programming:

        N N N  (1)
    x N N N N  (2)
    ---------  
      N N N 7  (3)
    N N N N    (4)
  N N N N      (5)
  N N N        (6)
  -----------  
  7 7 7 7 7 7  (N != 7)

The answer found by Knuth is the following:

        5 3 9  (1)
    x 1 4 4 3  (2)
    ---------  
      1 6 1 7  (3)
    2 1 5 6    (4)
  2 1 5 6      (5)
  5 3 9        (6)
  -----------  
  7 7 7 7 7 7

I was able to find a second solution where there was a leading zero in row (4). To find it, I considered all the prime factors of 777777 and reasoned about which pairs could fit the pattern of the puzzle.

In this post I want to show you how I originally found this solution.

I wrote a bit of code in Python to express part of the problem as a constraint programming (CSP) model. Instead of trying to work out the solution myself, I write down the conditions that must be true, and let the constraint programming solver do the work.

The basic approach for solving a puzzle using a solver is:

Write down the variables (unknowns) that I want the solver to find.
Write down all of the constraints (logical conditions) that need to be true.
Turn the above into the right API calls for the solver I am using.

Here’s how it works (and here’s a link to my code). I used the python_constraint package as my CSP solver. To specify the puzzle we need a Problem() object:

problem = Problem()

In a CSP I first define variables that I want the solver to find. The variables I care most about are n_1 and n_2. I know that n_1 < 1000 and n_2 < 10000. addVariable‘s first argument is the name of the variable and the second is the domain.

problem.addVariable("n_1", list(range(100, 999+1)))
problem.addVariable("n_2", list(range(1000, 9999+1)))

I also am interested in the digits of each number. I will call the digits for the first (d_11, d_12, d_13), and the second (d_21, d_22, d_23, d_24). Remember that none of these digits can be a seven.

not_seven = [0, 1, 2, 3, 4, 5, 6, 8, 9] # no sevens allowed!                

# n_1 = d_13 d_12 d_11                                                      
problem.addVariable("d_13", not_seven)
problem.addVariable("d_12", not_seven)
problem.addVariable("d_11", not_seven)
 
# n_2 = d_24 d_23 d_22 d_21                                                 
problem.addVariable("d_24", not_seven)
problem.addVariable("d_23", not_seven)
problem.addVariable("d_22", not_seven)
problem.addVariable("d_21", not_seven)

Now I start writing down conditions that I know to be true. The first is that the product is a bunch of sevens:

problem.addConstraint(lambda n_1, n_2: n_1 * n_2 == 777777, ['n_1', 'n_2'])

Even if you know Python this may be confusing because you may not be familiar with the python_constraint package. The point is that I am providing an anonymous (lambda) function that evaluates the product of n_1 and n_2 as the first argument, and the names of the variables I defined earlier as the second.

Next, if I put the digits d_1x together, I get n_1:

problem.addConstraint(lambda d_11, d_12, d_13, n_1:
                      100 * d_13 + 10 * d_12 + d_11 == n_0,
                      ['d_11', 'd_12', 'd_13', 'n_1'])

Same for n_2:

problem.addConstraint(lambda d_21, d_22, d_23, d_24, n_2:
                      1000 * d_24 + 100 * d_23 + 10 * d_22 + d_21 == n_2,
                      ['d_21', 'd_22', 'd_23', 'd_24', 'n_2'])

Also, as considered earlier, when I multiply the last two digits of n_1 and n_2, the result must end in 7:

problem.addConstraint(lambda d_11, d_21: (d_11 * d_21) % 10 == 7, ['d_11', 'd_21'])

Now, if I wanted to be complete, I would write down a bunch more conditions for all of the Xs that appear in rows 3-6, using the rules of long multiplication. But why would I want to do that? The point is to be lazy.

Let’s just solve the problem, ask for all the solutions, and print them out.

s = problem.getSolutions()
# s is a list of solutions. Each solution is a dict of variable values.     
     for t in s:
         print(f"{t['n_0']} x {t['n_1']} = {t['n_0'] * t['n_1']}")

Wow! The same four that I discovered above. This was “easier” for me in the sense that I already knew how to use the python_constraint package, so I could just do the coding above without having to “think”. Doing it “the old fashioned way” was in the end more fun, but requires more deliberation and focus for me.