Character Heatmaps in Proust’s In Search of Lost Time

This post is a continuation of a series where I use natural language processing (NLP) to analyze the text of Proust’s In Search Of Lost Time (ISLT). Here is the Python code for this series.

In my previous post I showed that the five most frequently mentioned characters in ISLT are Albertine (2338 mentions), Swann (1338), Charlus (1303), Robert Saint-Loup (1091) and Odette (971). But when do these references occur?

Readers of ISLT know that characters come and go as the story progresses. Swann and Odette play prominent roles in Volume 1 (Swann’s Way), whereas Albertine does not enter the story until Volume 2, coming to the fore as an object of the Narrator’s obsessions in Volumes 5 (The Prisoner) and 6 (The Fugitive).

I thought it would be interesting to try to visualize these transitions, so I recorded the chapter and paragraph of each reference to each character, in effect producing “coordinates” for every portion of the text. For example, the first proper name in the text (appropriately, François Ier) is at (1, 1). I then counted the number of references to each name within each of the 486 chapters.

These counts constitute a kind of “character heat map”: the counts are high in chapters where a character is referenced frequently, and zero in chapters where the character is not mentioned at all.
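To make the counting step concrete, here is a minimal sketch of how per-chapter counts could be tabulated from (name, chapter, paragraph) coordinates. The function name `reference_matrix` and the toy data are hypothetical and are not taken from the code for this series; the real input would be the coordinates extracted from ISLT.

```python
from collections import Counter

def reference_matrix(references, names, n_chapters):
    """Return one row per name; entry c-1 holds the reference count for chapter c."""
    counts = Counter((name, chapter) for name, chapter, _paragraph in references)
    return [[counts[(name, ch)] for ch in range(1, n_chapters + 1)]
            for name in names]

# Hypothetical toy data: (name, chapter, paragraph) triples, not real ISLT coordinates.
references = [
    ("Swann", 1, 4), ("Swann", 1, 9), ("Odette", 2, 1),
    ("Swann", 2, 3), ("Albertine", 4, 2), ("Albertine", 4, 7),
]
names = ["Albertine", "Swann", "Odette"]
matrix = reference_matrix(references, names, n_chapters=4)

for name, row in zip(names, matrix):
    print(f"{name:10s} {row}")
```

Each row of such a matrix is one horizontal band of a heat map, so it could be handed to a plotting library (for example matplotlib's `imshow`) to draw the picture.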

I created the following heat map of the top five characters by ripping off some stuff I read on Stack Overflow:

The tick marks represent the seven volumes of ISLT and each slender vertical line represents a chapter. Of course, not all chapters are the same length so the diagram is not accurate in that respect. From this visualization you can see the following:

  • Albertine enters the picture in Volume II and dominates Volumes V and VI;
  • Odette features heavily in Volume I (especially in “Swann in Love”), with only intermittent appearances thereafter;
  • Saint-Loup is introduced in Volume II, taking over the focus from the Swann-Odette relationship of Volume I;
  • Swann is a central figure in Volume I and reappears only occasionally thereafter; when he does enter the story, he is typically at its center.

Proust treats places as characters of their own, so it’s also interesting to examine where places are referenced. Here are four frequently referenced places. Combray is where the story starts, of course, whereas Balbec later becomes a place of relaxation and refuge to which Proust returns repeatedly. Doncières is where Saint-Loup (an aspiring officer) is stationed, and so references to Doncières are correlated with references to Saint-Loup. Paris, the beating heart of society, is mentioned throughout ISLT.

I also produced heat maps of five of the most frequently occurring words (each a theme):

There are other ways to visualize and juxtapose references to characters, places, and concepts, of course, and perhaps in future posts I will explore further.

For the next post in this series, however, we will examine the context in which characters are mentioned. Which are referred to in positive and negative terms, and when do changes in sentiment occur?

Names and Places in Proust’s In Search of Lost Time

This post is a continuation of a series where I use natural language processing (NLP) to analyze the text of Proust’s In Search Of Lost Time (ISLT). Here is the Python code for this series.

In my previous post I examined which words most frequently occur in ISLT. You may have noticed that both “faire” and “faisait” – variations of the verb “to do” – appear in the list. If we group variations of the same root word together (for example in English, “am”, “are”, and “was” all group with “be”), then we end up with a better characterization of word frequency. This process is called lemmatization. Lemmatizing reduces the number of unique words in ISLT from 38,662 to 20,948.
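As a rough illustration of what grouping by lemma does to the counts, here is a minimal sketch. It uses a tiny hand-written lookup table (`LEMMAS`, hypothetical) in place of a real lemmatizer such as the one spaCy provides, so it only shows the counting idea, not how the numbers above were produced.

```python
from collections import Counter

# Hypothetical, hand-written lookup table standing in for a real lemmatizer.
LEMMAS = {
    "faisait": "faire",
    "avait": "avoir",
    "eût": "avoir",
    "vu": "voir",
}

def lemma_counts(tokens, lemmas=LEMMAS):
    """Count tokens after mapping each inflected form to its lemma."""
    return Counter(lemmas.get(tok, tok) for tok in tokens)

tokens = ["faire", "faisait", "avait", "eût", "voir", "vu", "faisait"]
surface = Counter(tokens)       # counts of the words as written
grouped = lemma_counts(tokens)  # counts after grouping by lemma

print(len(surface), "unique surface forms ->", len(grouped), "unique lemmas")
print(grouped.most_common(3))
```

The number of unique entries shrinks because several surface forms collapse into one lemma, which is the same effect that takes ISLT from 38,662 unique words to 20,948.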

The table below summarizes word frequencies for this reduced set. Verbs are dominant: “to do” and “to have” on top, closely followed by “to see”, “to be”, “to know”, “to come”, “to take”, “to say”, “to go”, and “to want”.

faire 5669
avoir 4790
y 3759
voir 3506
être 3307
savoir 2661
venir 2605
dire 2406
aller 2398
vouloir 2378
Albertine 2339
croire 2162
jour 2124
femme 2028
chose 2022
grand 2012
pouvoir 1988
connaître 1934
donner 1856
petit 1827
Twenty most frequently occurring lemmatized words in ISLT

But we should pivot from verbs, shouldn’t we? After all, names and places are of deeper interest to the Narrator of ISLT; in fact the third part of Swann’s Way is called “Place Names: The Name”.

Which names and places are most referenced in ISLT? Using spaCy, I was able to determine the answer. Here are the top 30 names:

Albertine 2338
Swann 1338
Charlus 1303
Saint-Loup 1091
Odette 971
Mme de Guermantes 897
Françoise 787
Gilberte 715
Mme Verdurin 660
Morel 558
Bloch 474
Andrée 383
Mme de Villeparisis 375
duc de Guermantes 328
Brichot 309
Bergotte 300
Norpois 293
Cottard 282
Monsieur 223
Jupien 221
Verdurin 202
Mme de Cambremer 182
Elstir 179
Rachel 168
Vinteuil 167
M. Verdurin 158
Legrandin 154
les Verdurin 137
Mlle Vinteuil 136
Forcheville 118
The 30 most frequently referenced names in ISLT

There are certain to be some errors in these counts, but I gave it a shot. I tried to combine references to different names for the same individual, for example “Robert” and “Saint-Loup” for Saint-Loup, “Mme Swann” and “Odette” for Odette, “Basin” for duc de Guermantes, and so on. Perhaps in the future I will revise this table slightly as I add to my list of aliases.
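The alias merging can be sketched as a simple mapping from each alias to a canonical name. The three aliases below are the ones mentioned above; the raw counts are made up for illustration, and `merge_aliases` is a hypothetical helper rather than the function used in the code for this series.

```python
from collections import Counter

# Aliases mentioned in the text; a real list would be much longer.
ALIASES = {
    "Robert": "Saint-Loup",
    "Mme Swann": "Odette",
    "Basin": "duc de Guermantes",
}

def merge_aliases(raw_counts, aliases=ALIASES):
    """Add each alias's count to the count of its canonical name."""
    merged = Counter()
    for name, n in raw_counts.items():
        merged[aliases.get(name, name)] += n
    return merged

# Made-up counts, for illustration only.
raw = {"Saint-Loup": 10, "Robert": 4, "Odette": 7, "Mme Swann": 2, "Basin": 1}
merged = merge_aliases(raw)
print(merged.most_common())
```

A flat mapping like this cannot handle ambiguous names (a "Robert" who is not Saint-Loup, say), which is one source of the errors mentioned above.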

I will leave deeper analysis to others, but there are two obvious points worth making. The first is that characters are not always referenced by their proper names. Personal pronouns such as “you”, “him”, and “her” are often used in the text instead. This analysis fails to account for that fact. To do so, I’d need to trace each pronoun back to the character it refers to (a task known as coreference resolution), which I don’t know how to do easily. (I do know that if I did, the most referenced character in ISLT would be, of course, the narrator Marcel – who is mentioned by name only five times, twice in the second paragraph of chapter 345 and three times in the second paragraph of chapter 356, but as “je” countless times.)

Secondly, in my view, the list itself does not reflect the true importance of each character within ISLT. For example, Gilberte is quite a critical figure in ISLT, linking the first volume to the last, and symbolizing the theme of regaining time… however, she barely makes the top 10 in terms of frequency. Are there better ways of trying to capture the “deeper significance” of characters within a story? Perhaps a future post will revisit this topic.

Moving on, here are the ten most mentioned places:

Balbec 737
Guermantes 657
Paris 521
Combray 426
France 154
Venise 128
Élysées 111
Champs 97
Méséglise 75
Doncières 66
The ten most frequently mentioned places in ISLT

(It seems weird that Champs and Élysées are not recognized by spaCy as belonging to a single compound token; I need to look into the details on that.)

In my next post, I will further examine the use of names and places in ISLT.

Text Analytics on Proust’s In Search of Lost Time

Seven years ago I analyzed the text of Marcel Proust’s In Search of Lost Time to find the five longest sentences (in English). This is the first of a series of posts where I use natural language processing (NLP) to analyze the text of Proust’s sprawling, introspective, and intensely human epic.

The Python code supporting this analysis is provided here. I make frequent use of the popular NLP package spaCy in my code. Unlike my previous Proust posts, here I use the original public domain French text as provided by marcel-proust.com.

The complete text of In Search of Lost Time (ISLT) comprises seven volumes and 486 chapters. Including chapter titles and punctuation, there are 6,023,707 characters. There are 1,175,999 words in all, of which 38,662 are unique. (The average adult English language vocabulary is 20,000-35,000 words.)

Here are the five most used words, with their frequencies. They are boring:

de 52378
la 28637
à 27597
que 25687
et 25238
The five most frequently used words in ISLT

This is not really what we want, of course; it’s just a bunch of conjunctions, articles, and prepositions – all examples of what the NLP world calls “stop words”. The list of most common words changes significantly when we exclude spaCy’s list of French stop words. The results are given below. Even in a simple analysis such as this, some of ISLT’s key themes emerge: place, time, women, men, Albertine, Swann.

y 3759
faire 2868
Albertine 2338
eût 2038
fois 1727
vie 1724
jamais 1643
temps 1632
voir 1570
air 1458
femme 1397
moment 1363
jour 1362
Swann 1337
Charlus 1302
monde 1291
faisait 1249
chose 1246
homme 1033
yeux 1028
The twenty most frequently occurring words in ISLT, excluding stopwords
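Here is a minimal sketch of the stop-word filtering step. The stop list below is a tiny hand-picked set for illustration (spaCy's French list is much longer), and the regex tokenizer is an assumption: it is a stand-in for spaCy's tokenizer, not a description of it.

```python
import re
from collections import Counter

# A tiny hand-picked stop list for illustration; not spaCy's actual French list.
STOP_WORDS = frozenset({"de", "la", "à", "que", "et", "le", "je", "me", "suis"})

def top_words(text, stop_words=frozenset(), n=5):
    """Return the n most common words, skipping any whose lowercase form is a stop word."""
    words = re.findall(r"\w+", text)
    kept = (w for w in words if w.lower() not in stop_words)
    return Counter(kept).most_common(n)

sample = "Longtemps, je me suis couché de bonne heure."
print(top_words(sample))              # every word counted
print(top_words(sample, STOP_WORDS))  # stop words excluded
```

Words are compared against the stop list in lowercase but counted as written, so proper names such as Albertine keep their capital letter in the output.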

Word frequency changes in interesting ways. Here is a plot of the 200 most frequent words, a visual representation of my first table. You can see that the frequency drops off quickly from over 50,000 to under 1,000, followed by a long tail of seldom-used words:

Frequency plot of the 200 most-used words in ISLT

If I exclude stop words, I get a similar result. The curve seems to fit a power-law distribution reasonably well.

Frequency plot of the 200 most-used words in ISLT, excluding stop words

In the plot above, the most frequent word “y” has rank 1, the next most frequent word “faire” has rank 2, and so on. It has been observed that the rank and frequency of words in a large corpus of text often follow Zipf’s law: a word’s frequency is roughly inversely proportional to its rank. Another interesting writeup of Zipf’s law can be found here.

If we plot the log of the rank and the log of the frequency, Zipf’s law says that the result should be linear. When we do this for all 30,000+ unique words in ISLT, we do see something that roughly follows Zipf’s law:

I don’t know what it means, or if it means anything at all…but it’s interesting.
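For anyone who wants to put a number on “roughly follows”, here is a minimal sketch that fits a straight line to log(frequency) against log(rank) by ordinary least squares. Under Zipf's law the slope should be close to -1. The function `zipf_slope` is a hypothetical helper, and the input is just the twenty non-stop-word frequencies from the table above, which is far too few points to say anything about the full vocabulary.

```python
import math

def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Frequencies of the twenty most common non-stop words, from the table above.
top20 = [3759, 2868, 2338, 2038, 1727, 1724, 1643, 1632, 1570, 1458,
         1397, 1363, 1362, 1337, 1302, 1291, 1249, 1246, 1033, 1028]
slope = zipf_slope(top20)
print(f"fitted log-log slope: {slope:.2f}")
```

Running the same fit over the frequencies of all unique words would give the slope of the line in the log-log plot.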

This is just a start. In future posts I will explore references to characters and places within ISLT, and share some interesting findings.

The Five Longest Proust Sentences

Previous posts have discussed Proust’s In Search Of Lost Time (best English translation here). Marco76UK asks “Out of curiosity, what is the third longest sentence?” I’ll do you two better and give the five longest English sentences in the Moncrieff translation. I am not 100% certain that I’ve got it right because sentence parsing is not as straightforward as it might seem – for example, does an internal quote ending in a period qualify as the end of a sentence? I will omit the details and simply give what seem to be the five longest sentences.

A disclaimer: Proust’s discussions of sexuality and Judaism (among other things) in ISLT are complicated and controversial. Other sources discuss these topics in detail, and perspectives vary. Reporting on the content of ISLT should not be taken as an endorsement of anything whatsoever.

#1: At 958 words, a passage linking both of the controversial topics above, a discussion of those similar to the fascinating M. de Charlus.

Their honour precarious, their liberty provisional, lasting only until the discovery of their crime; their position unstable, like that of the poet who one day was feasted at every table, applauded in every theatre in London, and on the next was driven from every lodging, unable to find a pillow upon which to lay his head, turning the mill like Samson and saying like him: “The two sexes shall die, each in a place apart!”; excluded even, save on the days of general disaster when the majority rally round the victim as the Jews rallied round Dreyfus, from the sympathy — at times from the society — of their fellows, in whom they inspire only disgust at seeing themselves as they are, portrayed in a mirror which, ceasing to flatter them, accentuates every blemish that they have refused to observe in themselves, and makes them understand that what they have been calling their love (a thing to which, playing upon the word, they have by association annexed all that poetry, painting, music, chivalry, asceticism have contrived to add to love) springs not from an ideal of beauty which they have chosen but from an incurable malady; like the Jews again (save some who will associate only with others of their race and have always on their lips ritual words and consecrated pleasantries), shunning one another, seeking out those who are most directly their opposite, who do not desire their company, pardoning their rebuffs, moved to ecstasy by their condescension; but also brought into the company of their own kind by the ostracism that strikes them, the opprobrium under which they have fallen, having finally been invested, by a persecution similar to that of Israel, with the physical and moral characteristics of a race, sometimes beautiful, often hideous, finding (in spite of all the mockery with which he who, more closely blended with, better assimilated to the opposing race, is relatively, in appearance, the least inverted, heaps upon him who has remained more so) a relief in 
frequenting the society of their kind, and even some corroboration of their own life, so much so that, while steadfastly denying that they are a race (the name of which is the vilest of insults), those who succeed in concealing the fact that they belong to it they readily unmask, with a view less to injuring them, though they have no scruple about that, than to excusing themselves; and, going in search (as a doctor seeks cases of appendicitis) of cases of inversion in history, taking pleasure in recalling that Socrates was one of themselves, as the Israelites claim that Jesus was one of them, without reflecting that there were no abnormals when homosexuality was the norm, no anti-Christians before Christ, that the disgrace alone makes the crime because it has allowed to survive only those who remained obdurate to every warning, to every example, to every punishment, by virtue of an innate disposition so peculiar that it is more repugnant to other men (even though it may be accompanied by exalted moral qualities) than certain other vices which exclude those qualities, such as theft, cruelty, breach of faith, vices better understood and so more readily excused by the generality of men; forming a freemasonry far more extensive, more powerful and less suspected than that of the Lodges, for it rests upon an identity of tastes, needs, habits, dangers, apprenticeship, knowledge, traffic, glossary, and one in which the members themselves, who intend not to know one another, recognise one another immediately by natural or conventional, involuntary or deliberate signs which indicate one of his congeners to the beggar in the street, in the great nobleman whose carriage door he is shutting, to the father in the suitor for his daughter’s hand, to him who has sought healing, absolution, defence, in the doctor, the priest, the barrister to whom he has had recourse; all of them obliged to protect their own secret but having their part in a secret shared with the others, which the 
rest of humanity does not suspect and which means that to them the most wildly improbable tales of adventure seem true, for in this romantic, anachronistic life the ambassador is a bosom friend of the felon, the prince, with a certain independence of action with which his aristocratic breeding has furnished him, and which the trembling little cit would lack, on leaving the duchess’s party goes off to confer in private with the hooligan; a reprobate part of the human whole, but an important part, suspected where it does not exist, flaunting itself, insolent and unpunished, where its existence is never guessed; numbering its adherents everywhere, among the people, in the army, in the church, in the prison, on the throne; living, in short, at least to a great extent, in a playful and perilous intimacy with the men of the other race, provoking them, playing with them by speaking of its vice as of something alien to it; a game that is rendered easy by the blindness or duplicity of the others, a game that may be kept up for years until the day of the scandal, on which these lion-tamers are devoured; until then, obliged to make a secret of their lives, to turn away their eyes from the things on which they would naturally fasten them, to fasten them upon those from which they would naturally turn away, to change the gender of many of the words in their vocabulary, a social constraint, slight in comparison with the inward constraint which their vice, or what is improperly so called, imposes upon them with regard not so much now to others as to themselves, and in such a way that to themselves it does not appear a vice.

#2: At 599 words, a passage from the opening section of Swann’s Way:

But I had seen first one and then another of the rooms in which I had slept during my life, and in the end I would revisit them all in the long course of my waking dream: rooms in winter, where on going to bed I would at once bury my head in a nest, built up out of the most diverse materials, the corner of my pillow, the top of my blankets, a piece of a shawl, the edge of my bed, and a copy of an evening paper, all of which things I would contrive, with the infinite patience of birds building their nests, to cement into one whole; rooms where, in a keen frost, I would feel the satisfaction of being shut in from the outer world (like the sea-swallow which builds at the end of a dark tunnel and is kept warm by the surrounding earth), and where, the fire keeping in all night, I would sleep wrapped up, as it were, in a great cloak of snug and savoury air, shot with the glow of the logs which would break out again in flame: in a sort of alcove without walls, a cave of warmth dug out of the heart of the room itself, a zone of heat whose boundaries were constantly shifting and altering in temperature as gusts of air ran across them to strike freshly upon my face, from the corners of the room, or from parts near the window or far from the fireplace which had therefore remained cold — or rooms in summer, where I would delight to feel myself a part of the warm evening, where the moonlight striking upon the half-opened shutters would throw down to the foot of my bed its enchanted ladder; where I would fall asleep, as it might be in the open air, like a titmouse which the breeze keeps poised in the focus of a sunbeam — or sometimes the Louis XVI room, so cheerful that I could never feel really unhappy, even on my first night in it: that room where the slender columns which lightly supported its ceiling would part, ever so gracefully, to indicate where the bed was and to keep it separate; sometimes again that little room with the high ceiling, hollowed in the form of a pyramid 
out of two separate storeys, and partly walled with mahogany, in which from the first moment my mind was drugged by the unfamiliar scent of flowering grasses, convinced of the hostility of the violet curtains and of the insolent indifference of a clock that chattered on at the top of its voice as though I were not there; while a strange and pitiless mirror with square feet, which stood across one corner of the room, cleared for itself a site I had not looked to find tenanted in the quiet surroundings of my normal field of vision: that room in which my mind, forcing itself for hours on end to leave its moorings, to elongate itself upwards so as to take on the exact shape of the room, and to reach to the summit of that monstrous funnel, had passed so many anxious nights while my body lay stretched out in bed, my eyes staring upwards, my ears straining, my nostrils sniffing uneasily, and my heart beating; until custom had changed the colour of the curtains, made the clock keep quiet, brought an expression of pity to the cruel, slanting face of the glass, disguised or even completely dispelled the scent of flowering grasses, and distinctly reduced the apparent loftiness of the ceiling.

#3: At 447 words – something in either the Fugitive or Time Regained (too lazy to look it up):

A sofa that had risen up from dreamland between a pair of new and thoroughly substantial armchairs, smaller chairs upholstered in pink silk, the cloth surface of a card-table raised to the dignity of a person since, like a person, it had a past, a memory, retaining in the chill and gloom of Quai Conti the tan of its roasting by the sun through the windows of Rue Montalivet (where it could tell the time of day as accurately as Mme. Verdurin herself) and through the glass doors at la Raspelière, where they had taken it and where it used to gaze out all day long over the flower-beds of the garden at the valley far below, until it was time for Cottard and the musician to sit down to their game; a posy of violets and pansies in pastel, the gift of a painter friend, now dead, the sole fragment that survived of a life that had vanished without leaving any trace, summarising a great talent and a long friendship, recalling his keen, gentle eyes, his shapely hand, plump and melancholy, while he was at work on it; the incoherent, charming disorder of the offerings of the faithful, which have followed the lady of the house on all her travels and have come in time to assume the fixity of a trait of character, of a line of destiny; a profusion of cut flowers, of chocolate-boxes which here as in the country systematised their growth in an identical mode of blossoming; the curious interpolation of those singular and superfluous objects which still appear to have been just taken from the box in which they were offered and remain for ever what they were at first, New Year’s Day presents; all those things, in short, which one could not have isolated from the rest, but which for Brichot, an old frequenter of the Verdurin parties, had that patina, that velvety bloom of things to which, giving them a sort of profundity, an astral body has been added; all these things scattered before him, sounded in his ear like so many resonant keys which awakened cherished likenesses in his heart, 
confused reminiscences which, here in this drawing-room of the present day that was littered with them, cut out, defined, as on a fine day a shaft of sunlight cuts a section in the atmosphere, the furniture and carpets, and pursuing it from a cushion to a flower-stand, from a footstool to a lingering scent, from the lighting arrangements to the colour scheme, carved, evoked, spiritualised, called to life a form which might be called the ideal aspect, immanent in each of their successive homes, of the Verdurin drawing-room.

#4: At 426 words, a reflection on a church from the narrator’s youth:

All these things and, still more than these, the treasures which had come to the church from personages who to me were almost legendary figures (such as the golden cross wrought, it was said, by Saint Eloi and presented by Dagobert, and the tomb of the sons of Louis the Germanic in porphyry and enamelled copper), because of which I used to go forward into the church when we were making our way to our chairs as into a fairy-haunted valley, where the rustic sees with amazement on a rock, a tree, a marsh, the tangible proofs of the little people’s supernatural passage — all these things made of the church for me something entirely different from the rest of the town; a building which occupied, so to speak, four dimensions of space — the name of the fourth being Time — which had sailed the centuries with that old nave, where bay after bay, chapel after chapel, seemed to stretch across and hold down and conquer not merely a few yards of soil, but each successive epoch from which the whole building had emerged triumphant, hiding the rugged barbarities of the eleventh century in the thickness of its walls, through which nothing could be seen of the heavy arches, long stopped and blinded with coarse blocks of ashlar, except where, near the porch, a deep groove was furrowed into one wall by the tower-stair; and even there the barbarity was veiled by the graceful gothic arcade which pressed coquettishly upon it, like a row of grown-up sisters who, to hide him from the eyes of strangers, arrange themselves smilingly in front of a countrified, unmannerly and ill-dressed younger brother; rearing into the sky above the Square a tower which had looked down upon Saint Louis, and seemed to behold him still; and thrusting down with its crypt into the blackness of a Merovingian night, through which, guiding us with groping finger-tips beneath the shadowy vault, ribbed strongly as an immense bat’s wing of stone, Théodore or his sister would light up for us with a candle the tomb of 
Sigebert’s little daughter, in which a deep hole, like the bed of a fossil, had been bored, or so it was said, “by a crystal lamp which, on the night when the Frankish princess was murdered, had left, of its own accord, the golden chains by which it was suspended where the apse is to-day and with neither the crystal broken nor the light extinguished had buried itself in the stone, through which it had gently forced its way.”

#5: At 398 words, a neat little passage about Gilberte early on in Swann’s Way, setting the stage for the final scenes a couple of thousand pages later:

The name Gilberte passed close by me, evoking all the more forcibly her whom it labelled in that it did not merely refer to her, as one speaks of a man in his absence, but was directly addressed to her; it passed thus close by me, in action, so to speak, with a force that increased with the curve of its trajectory and as it drew near to its target; — carrying in its wake, I could feel, the knowledge, the impression of her to whom it was addressed that belonged not to me but to the friend who called to her, everything that, while she uttered the words, she more or less vividly reviewed, possessed in her memory, of their daily intimacy, of the visits that they paid to each other, of that unknown existence which was all the more inaccessible, all the more painful to me from being, conversely, so familiar, so tractable to this happy girl who let her message brush past me without my being able to penetrate its surface, who flung it on the air with a light-hearted cry: letting float in the atmosphere the delicious attar which that message had distilled, by touching them with precision, from certain invisible points in Mlle. 
Swann’s life, from the evening to come, as it would be, after dinner, at her home, — forming, on its celestial passage through the midst of the children and their nursemaids, a little cloud, exquisitely coloured, like the cloud that, curling over one of Poussin’s gardens, reflects minutely, like a cloud in the opera, teeming with chariots and horses, some apparition of the life of the gods; casting, finally, on that ragged grass, at the spot on which she stood (at once a scrap of withered lawn and a moment in the afternoon of the fair player, who continued to beat up and catch her shuttlecock until a governess, with a blue feather in her hat, had called her away) a marvellous little band of light, of the colour of heliotrope, spread over the lawn like a carpet on which I could not tire of treading to and fro with lingering feet, nostalgic and profane, while Françoise shouted: “Come on, button up your coat, look, and let’s get away!” and I remarked for the first time how common her speech was, and that she had, alas, no blue feather in her hat.

Sentence Lengths in Proust’s “In Search of Lost Time”

My last post provided the complete text of Proust’s “In Search of Lost Time”, which I now want to explore a bit using Mathematica. In future posts I will do some experiments with Python as well – but for now, if you are sufficiently motivated (and can read Mathematica), you can probably translate much of my code to another language.

The first thing I did was read the entire text into Mathematica. Splitting text into sentences is not as easy as it may appear. Periods appear in abbreviations such as “M. Charlus”, “Mme. Cambremer” and so on. They also occasionally appear inside of quotes or in other places that do not indicate the end of a sentence. I handled abbreviations by simply replacing “M.” with “M”, and so on. The final result of the following three lines of code is a variable “proust” that is a list of lists: each sentence in ISLT broken up by word.

txt = Import["islt_proust.txt", CharacterEncoding -> "Unicode"];
txt2 = StringReplace[txt, x : {" M.", " Mme.", " Mlle."} :> StringDrop[x, -1]]; (* Obvious cases where '.' does not mean end of sentence *)
proust = ReadList[StringToStream[txt2], Word, RecordLists -> True, RecordSeparators -> {"."}]
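For readers who would rather follow along in Python, here is a rough equivalent of the same idea: drop the period from a few known abbreviations, split on the remaining periods, and break each sentence into words. This is a sketch under the same simplifying assumptions as the Mathematica code, not a port of it; the word regex and the sample text are my own choices, and the tokenization will differ in detail from Mathematica's.

```python
import re

# Abbreviations whose trailing period does not end a sentence.
ABBREVIATIONS = (" M.", " Mme.", " Mlle.")

def split_sentences(text):
    """Split text on '.', returning each sentence as a list of words."""
    for abbr in ABBREVIATIONS:
        text = text.replace(abbr, abbr[:-1])  # drop the trailing period
    sentences = [re.findall(r"[\w'’-]+", chunk) for chunk in text.split(".")]
    return [words for words in sentences if words]  # discard empty chunks

# Made-up sample text, for illustration only.
sample = "I met M. de Charlus. He spoke to Mme. Verdurin at length. Then he left."
sentences = split_sentences(sample)
lengths = [len(s) for s in sentences]
longest = max(range(len(sentences)), key=lengths.__getitem__)

print(lengths)
print(sentences[longest])
```

Like the original, this treats every remaining period as a sentence boundary, so periods inside quotations will still split a sentence in two.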

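Based on these rules, there are over 35,000 sentences in ISLT. Here are the positions and lengths of the five longest, followed by the longest one itself:

sentLengths = Length /@ proust; (* number of words in each sentence *)
longest5 = Ordering[sentLengths, -5]
{557, 27166, 8400, 41, 17005}

sentLengths[[longest5]]
{426, 447, 457, 599, 958}

proust[[Last[longest5]]]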
{“Their”, “honour”, “precarious”, “their”, “liberty”, “provisional”, “lasting”, “only”, “until”, “the”, “discovery”, “of”, “their”, “crime”,”their”, “position”, “unstable”, “like”, “that”, “of”, “the”, “poet”, “who”, \
“one”, “day”, “was”, “feasted”, “at”, “every”, “table”, “applauded”, “in”, “every”, “theatre”, “in”, “London”, “and”, “on”, “the”, “next”, “was”, \
“driven”, “from”, “every”, “lodging”, “unable”, “to”, “find”, “a”, “pillow”, “upon”, “which”, “to”, “lay”, “his”, “head”, “turning”, “the”, “mill”, \
“like”, “Samson”, “and”, “saying”, “like”, “him”, “The”, “two”, “sexes”, “shall”, “die”, “each”, “in”, “a”, “place”, “apart”, “excluded”, “even”, \
“save”, “on”, “the”, “days”, “of”, “general”, “disaster”, “when”, “the”, “majority”, “rally”, “round”, “the”, “victim”, “as”, “the”, “Jews”, \
“rallied”, “round”, “Dreyfus”, “from”, “the”, “sympathy”, “at”, “times”, “from”, “the”, “society”, “of”, “their”, “fellows”, “in”, “whom”, “they”, \
“inspire”, “only”, “disgust”, “at”, “seeing”, “themselves”, “as”, “they”, “are”, “portrayed”, “in”, “a”, “mirror”, “which”, “ceasing”, “to”, “flatter”, \
“them”, “accentuates”, “every”, “blemish”, “that”, “they”, “have”, “refused”, “to”, “observe”, “in”, “themselves”, “and”, “makes”, “them”, “understand”, \
“that”, “what”, “they”, “have”, “been”, “calling”, “their”, “love”, “a”, “thing”, “to”, “which”, “playing”, “upon”, “the”, “word”, “they”, “have”, \
“by”, “association”, “annexed”, “all”, “that”, “poetry”, “painting”, “music”, “chivalry”, “asceticism”, “have”, “contrived”, “to”, “add”, “to”, “love”, \
“springs”, “not”, “from”, “an”, “ideal”, “of”, “beauty”, “which”, “they”, “have”, “chosen”, “but”, “from”, “an”, “incurable”, “malady”, “like”, “the”, \
“Jews”, “again”, “save”, “some”, “who”, “will”, “associate”, “only”, “with”, “others”, “of”, “their”, “race”, “and”, “have”, “always”, “on”, “their”, \
“lips”, “ritual”, “words”, “and”, “consecrated”, “pleasantries”, “shunning”, “one”, “another”, “seeking”, “out”, “those”, “who”, “are”, “most”, \
“directly”, “their”, “opposite”, “who”, “do”, “not”, “desire”, “their”, “company”, “pardoning”, “their”, “rebuffs”, “moved”, “to”, “ecstasy”, “by”, \
“their”, “condescension”, “but”, “also”, “brought”, “into”, “the”, “company”, “of”, “their”, “own”, “kind”, “by”, “the”, “ostracism”, “that”, “strikes”, \
“them”, “the”, “opprobrium”, “under”, “which”, “they”, “have”, “fallen”, “having”, “finally”, “been”, “invested”, “by”, “a”, “persecution”, “similar”, \
“to”, “that”, “of”, “Israel”, “with”, “the”, “physical”, “and”, “moral”, “characteristics”, “of”, “a”, “race”, “sometimes”, “beautiful”, “often”, \
“hideous”, “finding”, “in”, “spite”, “of”, “all”, “the”, “mockery”, “with”, “which”, “he”, “who”, “more”, “closely”, “blended”, “with”, “better”, \
“assimilated”, “to”, “the”, “opposing”, “race”, “is”, “relatively”, “in”, “appearance”, “the”, “least”, “inverted”, “heaps”, “upon”, “him”, “who”, \
“has”, “remained”, “more”, “so”, “a”, “relief”, “in”, “frequenting”, “the”, “society”, “of”, “their”, “kind”, “and”, “even”, “some”, “corroboration”, \
“of”, “their”, “own”, “life”, “so”, “much”, “so”, “that”, “while”, “steadfastly”, “denying”, “that”, “they”, “are”, “a”, “race”, “the”, “name”, \
“of”, “which”, “is”, “the”, “vilest”, “of”, “insults”, “those”, “who”, “succeed”, “in”, “concealing”, “the”, “fact”, “that”, “they”, “belong”, “to”, \
“it”, “they”, “readily”, “unmask”, “with”, “a”, “view”, “less”, “to”, “injuring”, “them”, “though”, “they”, “have”, “no”, “scruple”, “about”, \
“that”, “than”, “to”, “excusing”, “themselves”, “and”, “going”, “in”, “search”, “as”, “a”, “doctor”, “seeks”, “cases”, “of”, “appendicitis”, “of”, \
“cases”, “of”, “inversion”, “in”, “history”, “taking”, “pleasure”, “in”, “recalling”, “that”, “Socrates”, “was”, “one”, “of”, “themselves”, “as”, \
“the”, “Israelites”, “claim”, “that”, “Jesus”, “was”, “one”, “of”, “them”, “without”, “reflecting”, “that”, “there”, “were”, “no”, “abnormals”, “when”, \
“homosexuality”, “was”, “the”, “norm”, “no”, “antiChristians”, “before”, “Christ”, “that”, “the”, “disgrace”, “alone”, “makes”, “the”, “crime”, \
“because”, “it”, “has”, “allowed”, “to”, “survive”, “only”, “those”, “who”, “remained”, “obdurate”, “to”, “every”, “warning”, “to”, “every”, “example”, \
“to”, “every”, “punishment”, “by”, “virtue”, “of”, “an”, “innate”, “disposition”, “so”, “peculiar”, “that”, “it”, “is”, “more”, “repugnant”, \
“to”, “other”, “men”, “even”, “though”, “it”, “may”, “be”, “accompanied”, “by”, “exalted”, “moral”, “qualities”, “than”, “certain”, “other”, “vices”, \
“which”, “exclude”, “those”, “qualities”, “such”, “as”, “theft”, “cruelty”, “breach”, “of”, “faith”, “vices”, “better”, “understood”, “and”, “so”, \
“more”, “readily”, “excused”, “by”, “the”, “generality”, “of”, “men”, “forming”, “a”, “freemasonry”, “far”, “more”, “extensive”, “more”, \
“powerful”, “and”, “less”, “suspected”, “than”, “that”, “of”, “the”, “Lodges”, “for”, “it”, “rests”, “upon”, “an”, “identity”, “of”, “tastes”, \
“needs”, “habits”, “dangers”, “apprenticeship”, “knowledge”, “traffic”, “glossary”, “and”, “one”, “in”, “which”, “the”, “members”, “themselves”, \
“who”, “intend”, “not”, “to”, “know”, “one”, “another”, “recognise”, “one”, “another”, “immediately”, “by”, “natural”, “or”, “conventional”, \
“involuntary”, “or”, “deliberate”, “signs”, “which”, “indicate”, “one”, “of”, “his”, “congeners”, “to”, “the”, “beggar”, “in”, “the”, “street”, “in”, \
“the”, “great”, “nobleman”, “whose”, “carriage”, “door”, “he”, “is”, “shutting”, “to”, “the”, “father”, “in”, “the”, “suitor”, “for”, “his”, \
“daughter’s”, “hand”, “to”, “him”, “who”, “has”, “sought”, “healing”, “absolution”, “defence”, “in”, “the”, “doctor”, “the”, “priest”, “the”, \
“barrister”, “to”, “whom”, “he”, “has”, “had”, “recourse”, “all”, “of”, “them”, “obliged”, “to”, “protect”, “their”, “own”, “secret”, “but”, \
“having”, “their”, “part”, “in”, “a”, “secret”, “shared”, “with”, “the”, “others”, “which”, “the”, “rest”, “of”, “humanity”, “does”, “not”, “suspect”, \
“and”, “which”, “means”, “that”, “to”, “them”, “the”, “most”, “wildly”, “improbable”, “tales”, “of”, “adventure”, “seem”, “true”, “for”, “in”, \
“this”, “romantic”, “anachronistic”, “life”, “the”, “ambassador”, “is”, “a”, “bosom”, “friend”, “of”, “the”, “felon”, “the”, “prince”, “with”, “a”, \
“certain”, “independence”, “of”, “action”, “with”, “which”, “his”, “aristocratic”, “breeding”, “has”, “furnished”, “him”, “and”, “which”, “the”, \
“trembling”, “little”, “cit”, “would”, “lack”, “on”, “leaving”, “the”, “duchess’s”, “party”, “goes”, “off”, “to”, “confer”, “in”, “private”, “with”, \
“the”, “hooligan”, “a”, “reprobate”, “part”, “of”, “the”, “human”, “whole”, “but”, “an”, “important”, “part”, “suspected”, “where”, “it”, “does”, “not”, \
“exist”, “flaunting”, “itself”, “insolent”, “and”, “unpunished”, “where”, “its”, “existence”, “is”, “never”, “guessed”, “numbering”, “its”, \
“adherents”, “everywhere”, “among”, “the”, “people”, “in”, “the”, “army”, “in”, “the”, “church”, “in”, “the”, “prison”, “on”, “the”, “throne”, \
“living”, “in”, “short”, “at”, “least”, “to”, “a”, “great”, “extent”, “in”, “a”, “playful”, “and”, “perilous”, “intimacy”, “with”, “the”, “men”, “of”, \
“the”, “other”, “race”, “provoking”, “them”, “playing”, “with”, “them”, “by”, “speaking”, “of”, “its”, “vice”, “as”, “of”, “something”, “alien”, “to”, \
“it”, “a”, “game”, “that”, “is”, “rendered”, “easy”, “by”, “the”, “blindness”, “or”, “duplicity”, “of”, “the”, “others”, “a”, “game”, “that”, \
“may”, “be”, “kept”, “up”, “for”, “years”, “until”, “the”, “day”, “of”, “the”, “scandal”, “on”, “which”, “these”, “liontamers”, “are”, “devoured”, \
“until”, “then”, “obliged”, “to”, “make”, “a”, “secret”, “of”, “their”, “lives”, “to”, “turn”, “away”, “their”, “eyes”, “from”, “the”, “things”, \
“on”, “which”, “they”, “would”, “naturally”, “fasten”, “them”, “to”, “fasten”, “them”, “upon”, “those”, “from”, “which”, “they”, “would”, \
“naturally”, “turn”, “away”, “to”, “change”, “the”, “gender”, “of”, “many”, “of”, “the”, “words”, “in”, “their”, “vocabulary”, “a”, “social”, \
“constraint”, “slight”, “in”, “comparison”, “with”, “the”, “inward”, “constraint”, “which”, “their”, “vice”, “or”, “what”, “is”, “improperly”, \
“so”, “called”, “imposes”, “upon”, “them”, “with”, “regard”, “not”, “so”, “much”, “now”, “to”, “others”, “as”, “to”, “themselves”, “and”, “in”, “such”, \
“a”, “way”, “that”, “to”, “themselves”, “it”, “does”, “not”, “appear”, “a”, “vice”}

Proust is known for his long sentences, but do the lengths fit a pattern? In a previous series of posts I showed that certain (American) football stats fit a lognormal distribution. It turns out that ISLT sentence lengths also appear to be lognormally distributed. The FindDistributionParameters function makes this pretty easy to check. We scale the fitted density so we can draw it on the same chart as a histogram of the lengths.

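(* Fit the lognormal parameters m and s to the sentence lengths by the method of moments *)
lnp = FindDistributionParameters[sentLengths, LogNormalDistribution[m, s], ParameterEstimator -> "MethodOfMoments"]
(* Scale factor: tallest histogram bin divided by the peak of the fitted density *)
ratio = Max[HistogramList[Select[sentLengths, # < 250 &]][[2]]] / Maximize[{PDF[LogNormalDistribution[m, s] /. lnp, x], x >= 0}, x][[1]]
(* Overlay the scaled density on the histogram of sentences shorter than 250 words *)
Show[Histogram[Select[sentLengths, # < 250 &]], Plot[ratio*PDF[LogNormalDistribution[m, s] /. lnp, x], {x, 0, 250.}, PlotStyle -> Thick, PlotRange -> All]]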

 

Hist1
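For readers following along in Python, here is a rough sketch of the same method-of-moments fit using only the standard library. The function names are mine, made up for illustration, and I am assuming that matching the sample mean and the population variance (dividing by n) is close enough to what Mathematica does. Treat it as a sketch, not as the code behind the chart above.

```python
import math
import statistics


def lognormal_method_of_moments(lengths):
    """Estimate (mu, sigma) of a lognormal by matching the sample mean and variance."""
    mean = statistics.fmean(lengths)
    # Assumption: use the population variance (divide by n) as the second moment.
    var = statistics.pvariance(lengths, mu=mean)
    sigma_sq = math.log(1.0 + var / mean ** 2)
    mu = math.log(mean) - sigma_sq / 2.0
    return mu, math.sqrt(sigma_sq)


def lognormal_pdf(x, mu, sigma):
    """Density of the lognormal distribution at x (zero for x <= 0)."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


def lognormal_mode(mu, sigma):
    """Location of the density's peak."""
    return math.exp(mu - sigma ** 2)


def histogram_scale(max_bin_count, mu, sigma):
    """Factor that stretches the density so its peak matches the tallest histogram bin."""
    return max_bin_count / lognormal_pdf(lognormal_mode(mu, sigma), mu, sigma)
```

Because the lognormal density peaks at exp(mu - sigma^2), the scale factor can be computed directly instead of running a numerical maximization.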

It’s also interesting to examine sentence lengths by character. Using the Select function we can extract sentences that contain a particular name (or phrase). The SentLength function defined below computes the median length of those sentences, and we use it to create a bar chart comparing median lengths across characters. Interestingly, two powerful but distant (to the author) male figures top the list.

(* Median length over all sentences, drawn as a reference line on the chart *)
medSentLength = N[Median[sentLengths]];
(* Median length of the sentences that contain the given name as a word *)
SentLength[name_] := N[Median[Map[Length, Select[proustSent, MemberQ[#, name] &]]]];
barNames = {"I", "Oriane", "Bloch", "Albertine", "Charlus", "Mother", "Swann", "Gilberte", "Father", "Christ"};
barValues = Map[SentLength, barNames]
BarChart[MapThread[Labeled[#1, #2, Above] &, {barValues, barNames}], GridLines -> {None, {medSentLength}}, GridLinesStyle -> Directive[Red, Thick, Dashed],
 Method -> {"GridLinesInFront" -> True}, ChartLabels -> {barNames}, BaseStyle -> {FontFamily -> "Helvetica", FontSize -> 12}]

Bar1
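The per-character median is simple enough to sketch in Python as well. I am assuming here that the text has already been split into sentences, each stored as a list of word tokens (the same shape as proustSent above). The function name and the toy sentences are placeholders of my own.

```python
from statistics import median


def median_sentence_length(sentences, name):
    """Median word count of the sentences that contain `name` as a token.

    Returns None when no sentence mentions the name.
    """
    lengths = [len(sentence) for sentence in sentences if name in sentence]
    return median(lengths) if lengths else None


# Toy data standing in for the tokenized text of ISLT.
toy_sentences = [
    ["Swann", "was", "late"],
    ["I", "saw", "Swann", "at", "the", "opera", "again"],
    ["Albertine", "slept"],
]
swann_median = median_sentence_length(toy_sentences, "Swann")
```

Matching on whole tokens means a name only counts when it appears as its own word, which is also how MemberQ behaves on a list of words.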

Finally, I wanted to see whether average sentence length changes significantly as we progress through the story. Can we identify trends? To examine this, I smoothed the sequence of sentence lengths with the ExponentialMovingAverage function. I wasn’t really able to conclude anything substantial. So in the spirit of Stephen Wolfram, here are some pretty pictures.

ListLinePlot[ExponentialMovingAverage[sentLengths, 0.02]]

Pic1

mm = N[MovingMedian[sentLengths, 100]];
ListLinePlot[mm]

 

Pic2 
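Both smoothers are easy to write by hand, which helps show what the pictures are actually plotting. This Python sketch assumes the exponential average starts at the first value and then moves a fraction alpha of the way toward each new value, and that the moving median is taken over every full window of consecutive sentences. The function names are my own.

```python
from statistics import median


def exponential_moving_average(values, alpha):
    """Smooth `values`; each output moves a fraction `alpha` toward the new value."""
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))  # the first output is the first value itself
        else:
            previous = smoothed[-1]
            smoothed.append(previous + alpha * (x - previous))
    return smoothed


def moving_median(values, window):
    """Median of each run of `window` consecutive values (full windows only)."""
    return [median(values[i:i + window]) for i in range(len(values) - window + 1)]


ema_example = exponential_moving_average([10, 20, 30], 0.5)  # [10.0, 15.0, 22.5]
median_example = moving_median([1, 5, 2, 8, 3], 3)           # [2, 5, 3]
```

A small alpha such as 0.02 reacts slowly, so a single enormous sentence barely moves the line. The moving median ignores such outliers entirely as long as they make up less than half of the window.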

With the help of some online sources I call out high and low points in the trend.

points = Transpose[{Range[1, Length[mm]], mm}];
medLength = Median[sentLengths];
(* http://mathematica.stackexchange.com/questions/13557/local-max-min-of-mathematica-data-sets *)
peaks =  Select[Pick[points[[2 ;; -2]],  Differences[Sign[Differences[points[[All, 2]]]]], -2], #[[2]] >=  medLength + 20 &]
valleys = Select[Pick[points[[2 ;; -2]], Differences[Sign[Differences[points[[All, 2]]]]],  2], #[[2]] <= medLength - 15 &];
ListLinePlot[points, Prolog -> {PointSize[Large], Red, Point[peaks], Green, Point[valleys]}]

HiLo1
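The trick borrowed from the linked answer is to take the sign of each step in the series and then look for places where the sign flips: a jump from +1 to -1 (a difference of -2) marks a local peak, and a jump from -1 to +1 (a difference of 2) marks a local valley. Here is a Python sketch of that idea. The function names and the toy series are mine, and positions are counted from zero rather than from one.

```python
def _sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def local_extrema(series):
    """Return (peaks, valleys) as lists of (index, value) for interior points."""
    steps = [_sign(b - a) for a, b in zip(series, series[1:])]
    turns = [b - a for a, b in zip(steps, steps[1:])]
    # turns[i] describes the interior point series[i + 1].
    peaks = [(i + 1, series[i + 1]) for i, t in enumerate(turns) if t == -2]
    valleys = [(i + 1, series[i + 1]) for i, t in enumerate(turns) if t == 2]
    return peaks, valleys


def notable_extrema(series, baseline, peak_margin, valley_margin):
    """Keep only peaks well above and valleys well below a baseline value."""
    peaks, valleys = local_extrema(series)
    high = [p for p in peaks if p[1] >= baseline + peak_margin]
    low = [v for v in valleys if v[1] <= baseline - valley_margin]
    return high, low


toy_series = [1, 3, 2, 0, 4, 4, 5, 1]
toy_peaks, toy_valleys = local_extrema(toy_series)
```

Note that flat stretches produce a sign of 0, so a peak or valley that sits on a plateau is skipped. This matches the behavior of the Mathematica version above.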

Text analytics on Proust’s “In Search of Lost Time”

This post is the first in a series on text analytics on Marcel Proust’s In Search of Lost Time. Here’s what I plan on doing:

  • Provide an interesting, extremely long plain-text data set useful for text analytics based on material in the public domain.
  • Demonstrate simple text processing tools in proprietary (Mathematica) and freely available (Python) languages.
  • Report descriptive statistics for In Search of Lost Time.
  • Perform a graph analysis of the major characters to create a visual representation of the relationships between them.

None of what I will be doing will be especially deep: these posts constitute my first and only experiments in this area. My only training in text analytics has been based on web searches to help figure out how to do the things I listed above. In other words, it’s just me fooling around in odd moments at night and on weekends over the past year.

Why Proust? About five years ago I learned that a high school friend was working through seven of literature’s great-but-lengthy classics as a kind of a challenge. I decided to join him on one leg of that journey – In Search of Lost Time [hereafter ISLT] – and three years later I made it through all seven volumes. Proust’s voice is so persistent and personal that upon completion I felt as if I had a new friend, or perhaps a younger cousin or brother. Major themes include love, memory, sexuality, finding one’s place, art, obsession, but ultimately ISLT is so expansive that it becomes impossible to categorize. It is what it is; describing it is like trying to give a eulogy. ISLT comes to mind at least once a week, even today.

Reading Proust is something of a cliche. The first two things that people mention when discussing Proust are that 1) the sentences are long and hard to follow, and 2) ISLT is unbearably long. Regarding the first point, my experience was that you get used to it. You adjust and work a little harder. The second point is more complicated. When I began ISLT I decided that I would not put any sort of timeline on finishing, or even on making progress. I would try to simply take it in for what it is, and accept it as presented. That helped immensely. That said, certain portions (sometimes hundreds of pages long) were really tedious – in particular large sections of “The Prisoner” and “The Fugitive”. The end, which in Proust-speak means the last several hundred pages, was so transcendently awesome that in retrospect I don’t care. Your mileage may vary – my point is that you should not be intimidated by it, or apologetic about giving it a try.

And now, to the main event:

Here is the link to the full text of Proust’s In Search of Lost Time.

This text includes the following volumes:

  • Swann’s Way
  • In the Shadow of Young Girls in Flower
  • The Guermantes Way
  • Sodom and Gomorrah
  • The Prisoner
  • The Fugitive
  • Finding Time Again

The source text is generously provided by the University of Adelaide under the Creative Commons License. The consolidated text provided in the link above is also provided under the same license. I created the file as follows:

  1. Downloaded each of the volumes listed above from the University of Adelaide.
  2. Removed special characters (but left in Unicode).
  3. Removed the front matter regarding web publishing of the text (substituting with the auxiliary file in the download directory along with the acknowledgement provided here).
  4. Removed book titles (while retaining chapter titles and chapter summaries provided in the later books).

I have not reformatted the text to make it pleasing to the eye. The purpose is simply to have a huge text corpus for use in future posts. If you decide to reformat the text to make it easier to read (without otherwise messing with it), I am happy to host it on your behalf, or link to it from this post provided the same licensing terms apply.

In my next post in this series I will begin my analysis of ISLT!

Update 5/11/2013: I discovered that I omitted the introduction to Sodom and Gomorrah. The file has been updated.