Names and Places in Proust’s In Search of Lost Time

This post continues a series in which I use natural language processing (NLP) to analyze the text of Proust’s In Search of Lost Time (ISLT). Here is the Python code for this series.

In my previous post I examined which words most frequently occur in ISLT. You may have noticed that both “faire” and “faisait” – variations of the verb “to do” – appear in the list. If we group variations of the same root word together (for example in English, “am”, “are”, and “was” all group with “be”), then we end up with a better characterization of word frequency. This process is called lemmatization. Lemmatizing reduces the number of unique words in ISLT from 38,662 to 20,948.
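To make the grouping concrete, here is a minimal sketch of counting words by lemma. The `LEMMAS` table is a tiny hand-written stand-in that I made up for illustration; it is not the lemmatizer used for the numbers in this post, and a real one (such as the lemmatizer in spaCy's French models) covers the whole vocabulary.

```python
from collections import Counter

# Hypothetical, hand-written lemma table for illustration only.
# A real lemmatizer covers the full vocabulary.
LEMMAS = {
    "faisait": "faire",
    "avait": "avoir",
    "ai": "avoir",
    "était": "être",
    "suis": "être",
}

def lemmatize(word):
    """Return the lemma for a word, or the lowercased word itself if unknown."""
    word = word.lower()
    return LEMMAS.get(word, word)

def lemma_counts(words):
    """Count words after grouping each one under its lemma."""
    return Counter(lemmatize(w) for w in words)

words = ["faire", "faisait", "avait", "avoir", "était", "temps", "faisait"]
surface = Counter(w.lower() for w in words)
lemmas = lemma_counts(words)

print(len(surface), "unique surface forms")
print(len(lemmas), "unique lemmas")
print(lemmas.most_common(3))
```

In this toy sample, “faire” and “faisait” collapse into one entry, so the number of unique entries drops in the same way (on a much smaller scale) as it does for ISLT.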

The table below summarizes word frequencies for this reduced set. Verbs are dominant: “to do” and “to have” on top, closely followed by “to see”, “to be”, “to know”, “to come”, “to take”, “to say”, “to go”, and “to want”.

faire 5669
avoir 4790
voir 3506
être 3307
savoir 2661
venir 2605
dire 2406
aller 2398
vouloir 2378
Albertine 2339
croire 2162
jour 2124
femme 2028
chose 2022
grand 2012
pouvoir 1988
connaître 1934
donner 1856
petit 1827
Twenty most frequently occurring lemmatized words in ISLT

But we should pivot from verbs, shouldn’t we? After all, names and places are of deeper interest to the Narrator of ISLT; in fact the third part of Swann’s Way is called “Place Names: The Name”.

Which names and places are most referenced in ISLT? Using spaCy, I was able to determine the answer. Here are the top 30 names:

Mme de Guermantes 897
Mme Verdurin 660
Andrée 383
Mme de Villeparisis 375
duc de Guermantes 328
Mme de Cambremer 182
M. Verdurin 158
les Verdurin 137
Mlle Vinteuil 136
The 30 most frequently referenced names in ISLT
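For readers who want to try something similar, here is a rough sketch of how entity counts like these could be produced. It is not necessarily how the table above was built. It assumes spaCy's `fr_core_news_sm` French model, whose entity labels include `PER` for people and `LOC` for places; the helper names `count_entities` and `spacy_entities` are my own, and the sample pairs are invented stand-ins for real spaCy output.

```python
from collections import Counter

def count_entities(entities, label):
    """Count entity texts carrying the given label.

    `entities` is an iterable of (text, label) pairs.
    """
    return Counter(text for text, ent_label in entities if ent_label == label)

def spacy_entities(paragraphs, model="fr_core_news_sm"):
    """Yield (text, label) pairs for every entity spaCy finds.

    Assumes spaCy and the named French model are installed. The import is
    local so that count_entities works without them.
    """
    import spacy
    nlp = spacy.load(model)
    # Feeding paragraphs one at a time keeps each document small.
    for doc in nlp.pipe(paragraphs):
        for ent in doc.ents:
            yield ent.text, ent.label_

# Hand-made pairs standing in for spaCy output (illustrative, not real counts).
sample = [
    ("Albertine", "PER"),
    ("Balbec", "LOC"),
    ("Albertine", "PER"),
    ("Combray", "LOC"),
    ("Balbec", "LOC"),
    ("Swann", "PER"),
]
names = count_entities(sample, "PER")
places = count_entities(sample, "LOC")
print(names.most_common(2))
print(places.most_common(2))
```

With the real text, `count_entities(spacy_entities(paragraphs), "PER")` would give raw name counts, where `paragraphs` is a list of strings from the novel.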

There are certain to be some errors in these counts, but I gave it a shot. I tried to combine references to different names for the same individual, for example “Robert” and “Saint-Loup” for Saint-Loup, “Mme Swann” and “Odette” for Odette, “Basin” for duc de Guermantes, and so on. Perhaps in the future I will revise this table slightly as I add to my list of aliases.
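The alias step can be sketched as a lookup table applied to the raw counts. The `ALIASES` entries below come from the examples in the paragraph above; the function name `merge_aliases` and the raw counts are made up for illustration.

```python
from collections import Counter

# Alias table built from the examples in the text; a full table would be longer.
ALIASES = {
    "Robert": "Saint-Loup",
    "Mme Swann": "Odette",
    "Basin": "duc de Guermantes",
}

def merge_aliases(counts, aliases):
    """Fold the count of each alias into its canonical name."""
    merged = Counter()
    for name, n in counts.items():
        merged[aliases.get(name, name)] += n
    return merged

# Invented raw counts, used only to show the merge.
raw = Counter({"Robert": 3, "Saint-Loup": 5, "Odette": 4, "Mme Swann": 2, "Basin": 1})
merged = merge_aliases(raw, ALIASES)
print(merged.most_common())
```

A flat lookup like this cannot tell apart two characters who share a name, so any ambiguous alias would need to be handled separately.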

I will leave deeper analysis to others, but there are two obvious points worth making. The first is that characters are not always referenced by their proper names. Personal pronouns such as “you”, “him”, and “her” are often used in the text instead. This analysis fails to account for that fact. To do so, I’d need to trace each pronoun back to the character it refers to (a task known as coreference resolution), which I don’t know how to do easily. (I do know that if I did, the most referenced character in ISLT would be, of course, the narrator Marcel – who is mentioned by name only five times, twice in the second paragraph of chapter 345 and three times in the second paragraph of chapter 356, but as “je” countless times.)

Second, in my view, the list itself does not reflect the true importance of each character within ISLT. For example, Gilberte is quite a critical figure in ISLT, linking the first volume to the last and symbolizing the theme of regaining time. However, she barely makes the top 10 in terms of frequency. Are there better ways of trying to capture the “deeper significance” of characters within a story? Perhaps a future post will revisit this topic.

Moving on, here are the ten most mentioned places:

The ten most frequently mentioned places in ISLT

(It seems weird that Champs and Élysées are not recognized by spaCy as belonging to a single compound token; I need to look into the details on that.)

In my next post, I will further examine the use of names and places in ISLT.


Author: natebrix

Follow me on twitter at @natebrix.
