Text Analytics on Proust’s In Search of Lost Time

Seven years ago I analyzed the text of Marcel Proust’s In Search of Lost Time to find the five longest sentences (in English). This is the first of a series of posts where I use natural language processing (NLP) to analyze the text of Proust’s sprawling, introspective, and intensely human epic.

The Python code supporting this analysis is provided here. I make frequent use of the popular NLP package spaCy in my code. Unlike my previous Proust posts, here I use the original public domain French text as provided by marcel-proust.com.

The complete text of In Search of Lost Time (ISLT) comprises seven volumes divided into 486 chapters. Counting chapter titles and punctuation, there are 6,023,707 characters. There are 1,175,999 words in all, of which 38,662 are unique. (The average adult English language vocabulary is 20,000-35,000 words.)
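Counts like these can be sketched with the standard library alone. The snippet below is a minimal illustration, not the code behind the numbers above: `corpus_stats` and its regex tokenizer are assumptions made for this example, and spaCy's tokenizer will split French text differently (elisions such as "l'heure", for instance), so the totals will not match exactly.

```python
import re
from collections import Counter

# Assumed tokenizer for this sketch: runs of Unicode letters, lowercased.
# spaCy tokenizes differently, so real counts will differ somewhat.
WORD_RE = re.compile(r"[^\W\d_]+")

def corpus_stats(text):
    """Return character, word, and unique-word counts plus the top five words."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    counts = Counter(words)
    return {
        "characters": len(text),       # includes spaces and punctuation
        "words": len(words),
        "unique_words": len(counts),
        "top": counts.most_common(5),  # (word, frequency) pairs
    }

sample = "Longtemps, je me suis couché de bonne heure. Je me suis endormi."
stats = corpus_stats(sample)
print(stats)
```

On the full text, the same function would be called on the concatenated chapters instead of `sample`.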

Here are the five most used words, with their frequencies. They are boring:

The five most frequently used words in ISLT

This is not really what we want, of course; it’s just a bunch of conjunctions, articles, and prepositions – all examples of what the NLP world calls “stop words”. The list of most common words changes significantly when we exclude spaCy’s list of French stop words. The results are given below. Even in a simple analysis such as this, some of ISLT’s key themes emerge: place, time, women, men, Albertine, Swann.

The twenty most frequently occurring words in ISLT, excluding stop words
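Filtering stop words before counting can be sketched as follows. This is an illustration under stated assumptions: `top_words` and `DEMO_STOP_WORDS` are hypothetical names, and the tiny stop list here stands in for spaCy's much larger French list so that the example runs without spaCy installed.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[^\W\d_]+")  # assumed tokenizer: runs of letters

# Tiny illustrative stop list (an assumption for this sketch). With spaCy
# installed, its French list could be passed in instead:
#   from spacy.lang.fr.stop_words import STOP_WORDS
DEMO_STOP_WORDS = {"de", "la", "le", "et", "à", "je", "me", "que", "un", "une"}

def top_words(text, stop_words=frozenset(), n=20):
    """Return the n most frequent lowercased words not in stop_words."""
    words = (w.lower() for w in WORD_RE.findall(text))
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)

sample = ("Le temps et la mémoire: le temps perdu, le temps retrouvé, "
          "et la mémoire de Swann.")
with_stops = top_words(sample, n=3)
without_stops = top_words(sample, DEMO_STOP_WORDS, n=3)
print(with_stops)
print(without_stops)
```

Comparing the two printed lists shows the same shift as in the tables: articles and conjunctions dominate the first, content words the second.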

Word frequency changes in interesting ways. Here is a plot of the 200 most frequent words, a visual representation of my first table. You can see that the frequency drops off quickly from over 50,000 to under 1,000, followed by a long tail of seldom-used words:

Frequency plot of the 200 most-used words in ISLT

If I exclude stop words, I get a similar result. The curve seems to fit a power-law distribution reasonably well.

Frequency plot of the 200 most-used words in ISLT, excluding stop words

In the plot above, the most frequent word “y” has rank 1, the next most frequent word “faire” has rank 2, and so on. It has been observed that the rank and frequency of words in a large corpus of text often follow Zipf’s law, under which a word’s frequency is roughly inversely proportional to its rank. Another interesting writeup of Zipf’s law can be found here.

If we plot the log of the rank and the log of the frequency, Zipf’s law says that the result should be linear. When we do this for all 30,000+ unique words in ISLT, we do see something that roughly follows Zipf’s law:

I don’t know what it means, or if it means anything at all…but it’s interesting.
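One way to make "roughly follows" a little more concrete is to fit a straight line to log(frequency) against log(rank) and look at the slope; a slope near -1 is what the classic form of Zipf's law predicts. The sketch below is an assumption-laden illustration, not the code behind the plot: `zipf_slope` is a hypothetical helper, and the input is synthetic data that follows the law exactly rather than the ISLT counts.

```python
import math

def zipf_slope(frequencies):
    """Least-squares slope and intercept of log(frequency) vs. log(rank).

    Expects at least two strictly positive frequencies.
    """
    freqs = sorted(frequencies, reverse=True)  # rank 1 = most frequent
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic frequencies with f = 50000 / rank, i.e. an exact Zipf curve.
ideal = [50000 / rank for rank in range(1, 201)]
slope, intercept = zipf_slope(ideal)
print(slope, intercept)
```

For this ideal input the fitted slope should come out very close to -1. Real word counts would be passed in the same way, for example the values of a `collections.Counter`, and the slope would show how far the text departs from the ideal curve.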

This is just a start. In future posts I will explore references to characters and places within ISLT, and share some interesting findings.

Author: natebrix

Follow me on twitter at @natebrix.
