Seven years ago I analyzed the text of Marcel Proust’s In Search of Lost Time to find the five longest sentences (in English). This is the first of a series of posts where I use natural language processing (NLP) to analyze the text of Proust’s sprawling, introspective, and intensely human epic.
The Python code supporting this analysis is provided here. I make frequent use of the popular NLP package spaCy in my code. Unlike my previous Proust posts, here I use the original public domain French text as provided by marcel-proust.com.
The complete text of In Search of Lost Time (ISLT) comprises seven volumes containing 486 chapters. Counting chapter titles and punctuation, there are 6,023,707 characters. There are 1,175,999 words in all, of which 38,662 are unique. (The average adult English language vocabulary is 20,000-35,000 words.)
Here are the five most used words, with their frequencies. They are boring:
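For readers who want to reproduce counts like these, here is a minimal sketch of the idea. It is not the code behind the numbers above: I am assuming a simple regex tokenizer (a word is any run of letters, lowercased) rather than spaCy's tokenizer, so the totals it gives on the full text would differ somewhat. The name `corpus_stats` is made up for this illustration.

```python
import re
from collections import Counter

# Assumption: a "word" is a maximal run of letters, compared case-insensitively.
# spaCy's French tokenizer makes different choices (e.g. around elisions).
WORD_RE = re.compile(r"[^\W\d_]+")


def corpus_stats(text):
    """Return character, word, and unique-word counts plus the top five words."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    counts = Counter(words)
    return {
        "characters": len(text),       # includes spaces and punctuation
        "words": len(words),
        "unique_words": len(counts),
        "top": counts.most_common(5),  # list of (word, frequency) pairs
    }


sample = "Longtemps, je me suis couché de bonne heure. Je me suis dit: je m'endors."
stats = corpus_stats(sample)
print(stats)
```

On the real corpus you would pass the concatenated text of all seven volumes instead of `sample`.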
This is not really what we want, of course; it’s just a bunch of conjunctions, articles, and prepositions, all examples of what the NLP world calls “stop words”. The list of most common words changes significantly when we exclude spaCy’s list of French stop words. The results are given below. Even in a simple analysis such as this, some of ISLT’s key themes emerge: place, time, women, men, Albertine, Swann.
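The filtering step itself is easy to sketch. The function below, `top_words`, is a hypothetical helper written for this post, and `DEMO_STOP_WORDS` is a tiny made-up list used only so the example runs without any dependencies. With spaCy installed, the French list should be importable as `STOP_WORDS` from `spacy.lang.fr.stop_words` and can be passed in its place.

```python
from collections import Counter


def top_words(tokens, stop_words=frozenset(), n=5):
    """Count lowercased tokens, skipping any that appear in stop_words."""
    counts = Counter(
        t.lower() for t in tokens if t.lower() not in stop_words
    )
    return counts.most_common(n)


# Hypothetical mini stop list, for illustration only; far smaller than spaCy's.
DEMO_STOP_WORDS = frozenset({"le", "la", "de", "et", "à"})

tokens = "le temps et la mémoire de Swann et le temps de Albertine".split()

unfiltered = top_words(tokens)                 # dominated by "le", "et", "de"
filtered = top_words(tokens, DEMO_STOP_WORDS)  # content words only
print(unfiltered)
print(filtered)
```

Even in this toy sentence the effect is the same as in the tables: the function words vanish and the words that carry meaning rise to the top.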
Word frequency changes in interesting ways. Here is a plot of the 200 most frequent words, a visual representation of my first table. You can see that the frequency drops off quickly from over 50,000 to under 1,000, followed by a long tail of seldom-used words:
If I exclude stop words, I get a similar result. The curve seems to fit a power-law distribution reasonably well.
In the plot above, the most frequent word “y” has rank 1, the next most frequent word “faire” has rank 2, and so on. It has been observed that the rank and frequency of words in a large corpus of text often follow Zipf’s law, under which a word’s frequency is roughly inversely proportional to its rank. Another interesting writeup of Zipf’s law can be found here.
If we plot the log of the rank against the log of the frequency, Zipf’s law says that the result should be linear. When we do this for all 30,000+ unique words in ISLT, we do see something that roughly follows Zipf’s law:
I don’t know what it means, or if it means anything at all…but it’s interesting.
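One way to put a number on "roughly follows" is to fit a straight line to the log-log data. If frequency behaves like C / rank^s, then log(frequency) = log(C) - s * log(rank), so the fitted slope estimates -s, and classic Zipf behavior corresponds to a slope near -1. The sketch below uses ordinary least squares with only the standard library. `zipf_slope` is a name I made up, and this is not necessarily how the plot above was produced.

```python
import math


def zipf_slope(frequencies):
    """Least-squares fit of log(frequency) against log(rank).

    Returns (slope, intercept). Needs at least two positive frequencies.
    """
    # Rank 1 is the most frequent word, rank 2 the next, and so on.
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic check: frequencies that follow Zipf's law exactly (f = 1000 / rank).
ideal = [1000 / rank for rank in range(1, 51)]
slope, intercept = zipf_slope(ideal)
print(slope, intercept)
```

On real word counts, pass the frequency values from a word counter. The fit weights the long tail of rare words heavily, since most ranks sit there, so treat the slope as a rough summary rather than a test of the law.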
This is just a start. In future posts I will explore references to characters and places within ISLT, and share some interesting findings.