This post is a continuation of a series where I use natural language processing (NLP) to analyze the text of Proust’s In Search Of Lost Time (ISLT). Here is the Python code for this series.
In my previous post I examined which words most frequently occur in ISLT. You may have noticed that both “faire” and “faisait” – variations of the verb “to do” – appear in the list. If we group variations of the same root word together (for example in English, “am”, “are”, and “was” all group with “be”), then we end up with a better characterization of word frequency. This process is called lemmatization. Lemmatizing reduces the number of unique words in ISLT from 38,662 to 20,948.
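To make the effect of lemmatization concrete, here is a minimal sketch using only the standard library. The `LEMMAS` table is a hand-written toy mapping invented for this example; a real pipeline would get the lemma of each token from a lemmatizer (for instance spaCy's `token.lemma_` attribute), and that choice of tool is an assumption here rather than something taken from the code for this series.

```python
from collections import Counter

# Toy lemma table (hypothetical, hand-written for this example only).
# A real lemmatizer would supply this mapping for the whole vocabulary.
LEMMAS = {
    "faisait": "faire",
    "faisais": "faire",
    "avait": "avoir",
    "voyait": "voir",
}

def lemma_counts(tokens):
    """Count tokens after mapping each one to its lemma (or to itself if unknown)."""
    return Counter(LEMMAS.get(tok.lower(), tok.lower()) for tok in tokens)

# A made-up token list, not taken from ISLT.
tokens = ["faire", "faisait", "faisais", "avait", "avoir", "voyait", "jour"]

surface = Counter(tok.lower() for tok in tokens)  # counts of raw word forms
lemmas = lemma_counts(tokens)                     # counts after grouping by lemma

print(len(surface), "unique surface forms")
print(len(lemmas), "unique lemmas")
print(lemmas.most_common(3))
```

Grouping by lemma shrinks the vocabulary and concentrates the counts on the root word, which is the same effect as the drop from 38,662 to 20,948 unique words reported above.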
The table below summarizes word frequencies for this reduced set. Verbs are dominant: “to do” and “to have” are on top, closely followed by “to see”, “to be”, “to know”, “to come”, “to say”, “to go”, and “to want”.

| Lemma | Count |
| --- | --- |
| faire | 5669 |
| avoir | 4790 |
| y | 3759 |
| voir | 3506 |
| être | 3307 |
| savoir | 2661 |
| venir | 2605 |
| dire | 2406 |
| aller | 2398 |
| vouloir | 2378 |
| Albertine | 2339 |
| croire | 2162 |
| jour | 2124 |
| femme | 2028 |
| chose | 2022 |
| grand | 2012 |
| pouvoir | 1988 |
| connaître | 1934 |
| donner | 1856 |
| petit | 1827 |

But we should pivot from verbs, shouldn’t we? After all, names and places are of deeper interest to the Narrator of ISLT; in fact the third part of Swann’s Way is called “Place Names: The Name”.
Which names and places are most referenced in ISLT? Using spaCy, I was able to determine the answer. Here are the top 30 names:

| Name | Mentions |
| --- | --- |
| Albertine | 2338 |
| Swann | 1338 |
| Charlus | 1303 |
| Saint-Loup | 1091 |
| Odette | 971 |
| Mme de Guermantes | 897 |
| Françoise | 787 |
| Gilberte | 715 |
| Mme Verdurin | 660 |
| Morel | 558 |
| Bloch | 474 |
| Andrée | 383 |
| Mme de Villeparisis | 375 |
| duc de Guermantes | 328 |
| Brichot | 309 |
| Bergotte | 300 |
| Norpois | 293 |
| Cottard | 282 |
| Monsieur | 223 |
| Jupien | 221 |
| Verdurin | 202 |
| Mme de Cambremer | 182 |
| Elstir | 179 |
| Rachel | 168 |
| Vinteuil | 167 |
| M. Verdurin | 158 |
| Legrandin | 154 |
| les Verdurin | 137 |
| Mlle Vinteuil | 136 |
| Forcheville | 118 |

There are certain to be some errors in these counts, but I gave it a shot. I tried to combine references to different names for the same individual, for example “Robert” and “Saint-Loup” for Saint-Loup, “Mme Swann” and “Odette” for Odette, “Basin” for duc de Guermantes, and so on. Perhaps in the future I will revise this table slightly as I add to my list of aliases.
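The alias merging described above can be sketched as a lookup from each alias to a canonical name before counting. This is only an illustration under stated assumptions: the three alias pairs are the ones named in the paragraph, the `mentions` list is made up, and `canonical_counts` is a hypothetical helper, not the function used to build the table.

```python
from collections import Counter

# Alias -> canonical name. Only the three pairs mentioned in the text are
# listed; the full alias list behind the table is not reproduced here.
ALIASES = {
    "Robert": "Saint-Loup",
    "Mme Swann": "Odette",
    "Basin": "duc de Guermantes",
}

def canonical_counts(mentions):
    """Count mentions after replacing each known alias with its canonical name."""
    return Counter(ALIASES.get(name, name) for name in mentions)

# Made-up mentions, only to exercise the function.
mentions = ["Robert", "Saint-Loup", "Odette", "Mme Swann", "Odette", "Basin"]
counts = canonical_counts(mentions)

for name, n in counts.most_common():
    print(name, n)
```

Keeping the aliases in a single dictionary means that extending the list later only requires adding entries and rerunning the count.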
I will leave deeper analysis to others, but there are two obvious points worth making. The first is that characters are not always referenced by their proper names. Personal pronouns such as “you”, “him”, and “her” are often used in the text instead. This analysis fails to account for that fact. To do so, I’d need to trace each pronoun back to its antecedent (a task known as coreference resolution), which I don’t know how to do easily. (I do know that if I did, the most referenced character in ISLT would be, of course, the narrator Marcel, who is mentioned by name only five times, twice in the second paragraph of chapter 345 and three times in the second paragraph of chapter 356, but as “je” countless times.)
Secondly, in my view, the list itself does not reflect the true importance of each character within ISLT. For example, Gilberte is quite a critical figure in ISLT, linking the first volume to the last and symbolizing the theme of regaining time, yet she barely makes the top 10 in terms of frequency. Are there better ways of trying to capture the “deeper significance” of characters within a story? Perhaps a future post will revisit this topic.
Moving on, here are the ten most mentioned places:

| Place | Mentions |
| --- | --- |
| Balbec | 737 |
| Guermantes | 657 |
| Paris | 521 |
| Combray | 426 |
| France | 154 |
| Venise | 128 |
| Élysées | 111 |
| Champs | 97 |
| Méséglise | 75 |
| Doncières | 66 |

(It seems weird that Champs and Élysées are not recognized by spaCy as belonging to a single compound token; I need to look into the details on that.)
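One possible workaround, independent of how spaCy splits the name, is to count known multi-word place names directly in the raw text. The sketch below is an assumption-laden illustration: `COMPOUND_PLACES` and `count_compound_places` are hypothetical names, the sample sentence is made up, and it assumes the text spells the name with either a hyphen or a space between its parts. It does not explain why the two halves end up as separate entries in the table.

```python
import re
from collections import Counter

# Hypothetical list of multi-word place names to count as single units.
COMPOUND_PLACES = ["Champs-Élysées"]

def count_compound_places(text, names=COMPOUND_PLACES):
    """Count each compound name in raw text, allowing '-' or whitespace between parts."""
    counts = Counter()
    for name in names:
        # Split the name into its parts and rejoin them with a flexible separator.
        parts = [re.escape(part) for part in re.split(r"[-\s]+", name)]
        pattern = r"\b" + r"[-\s]+".join(parts) + r"\b"
        counts[name] = len(re.findall(pattern, text))
    return counts

# Made-up sample sentence, not a quotation from ISLT.
sample = "Aux Champs-Élysées, puis encore aux Champs Élysées; les champs étaient verts."
print(count_compound_places(sample))
```

The match is case-sensitive, so the common noun “champs” in the sample is not counted as part of the place name.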
In my next post, I will further examine the use of names and places in ISLT.