Text analytics on Proust’s “In Search of Lost Time”

This post is the first in a series on text analytics on Marcel Proust’s In Search of Lost Time. Here’s what I plan on doing:

  • Provide an interesting, extremely long plain-text data set useful for text analytics based on material in the public domain.
  • Demonstrate simple text processing tools in proprietary (Mathematica) and freely available (Python) languages.
  • Report descriptive statistics for In Search of Lost Time.
  • Perform a graph analysis of the major characters to create a visual representation of the relationships between them.

None of this will be especially deep: these posts constitute my first and only experiments in this area. My only training in text analytics has come from web searches to figure out how to do the things listed above. In other words, it’s just me fooling around in odd moments at night and on weekends over the past year.

Why Proust? About five years ago I learned that a high school friend was working through seven of literature’s great-but-lengthy classics as a kind of challenge. I decided to join him on one leg of that journey, In Search of Lost Time [hereafter ISLT], and three years later I made it through all seven volumes. Proust’s voice is so persistent and personal that upon completion I felt as if I had a new friend, or perhaps a younger cousin or brother. Major themes include love, memory, sexuality, finding one’s place, art, and obsession, but ultimately ISLT is so expansive that it becomes impossible to categorize. It is what it is; describing it is like trying to give a eulogy. ISLT comes to mind at least once a week, even today.

Reading Proust is something of a cliche. The first two things that people mention when discussing Proust are that 1) the sentences are long and hard to follow, and 2) ISLT is unbearably long. Regarding the first point, my experience was that you get used to it. You adjust and work a little harder. The second point is more complicated. When I began ISLT I decided that I would not put any sort of timeline on finishing, or even on making progress. I would try to simply take it in for what it is, and accept it as presented. That helped immensely. That said, certain portions (sometimes hundreds of pages long) were really tedious, in particular large sections of “The Prisoner” and “The Fugitive”. The end, which in Proust-speak means the last several hundred pages, was so transcendently awesome that in retrospect I don’t care. Your mileage may vary; my point is that you should not be intimidated by it, or apologetic about giving it a try.

And now, to the main event:

Here is the link to the full text of Proust’s In Search of Lost Time.

This text includes the following volumes:

  • Swann’s Way
  • In the Shadow of Young Girls in Flower
  • The Guermantes Way
  • Sodom and Gomorrah
  • The Prisoner
  • The Fugitive
  • Finding Time Again

The source text is generously provided by the University of Adelaide under a Creative Commons license. The consolidated text linked above is provided under the same license. I created the file as follows:

  1. Downloaded each volume of ISLT from the University of Adelaide as listed above.
  2. Removed special characters (but left in Unicode).
  3. Removed the front matter regarding web publishing of the text (replacing it with the auxiliary file in the download directory, along with the acknowledgement provided here).
  4. Removed book titles (while retaining chapter titles and chapter summaries provided in the later books).
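
Steps like these could be scripted roughly as follows. This is only a minimal Python sketch, not a record of how the file was actually built: the front-matter marker, the function names, and the reading of "special characters" as control characters are all assumptions made for illustration.

```python
import unicodedata

# Hypothetical marker that ends the web-publishing front matter in each volume.
FRONT_MATTER_END = "*** END OF FRONT MATTER ***"


def strip_front_matter(text, marker=FRONT_MATTER_END):
    """Drop everything up to and including the marker, if it is present."""
    _, found, rest = text.partition(marker)
    return rest if found else text


def remove_control_chars(text):
    """Remove control characters except newline and tab; keep other Unicode."""
    return "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )


def remove_title_lines(text, titles):
    """Delete lines that consist solely of one of the given book titles."""
    wanted = {t.casefold() for t in titles}
    kept = [ln for ln in text.splitlines() if ln.strip().casefold() not in wanted]
    return "\n".join(kept)


def consolidate(volumes, titles):
    """Clean each volume's text and join the results into one corpus string."""
    cleaned = []
    for raw in volumes:
        text = strip_front_matter(raw)
        text = remove_control_chars(text)
        text = remove_title_lines(text, titles)
        cleaned.append(text.strip())
    return "\n\n".join(cleaned) + "\n"
```

Here `volumes` is a list of strings, one per downloaded volume, and `titles` is the list of book titles to drop. Chapter titles and summaries are left alone because only lines that exactly match a book title are removed.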

I have not reformatted the text to make it pleasing to the eye. The purpose is simply to have a huge text corpus for use in future posts. If you decide to reformat the text to make it easier to read (without otherwise messing with it), I am happy to host it on your behalf, or link to it from this post provided the same licensing terms apply.
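
As a small illustration of what a plain-text corpus like this makes possible, here is a minimal Python sketch that computes a few rough counts from a string. The function name and the naive rules for what counts as a word or a sentence are assumptions for illustration only; real sentence splitting (especially for Proust) would need more care.

```python
import re


def basic_stats(text):
    """Return rough word and sentence counts for a plain-text string."""
    # Words: runs of letters, optionally joined by apostrophes or hyphens.
    words = re.findall(r"[^\W\d_]+(?:['’-][^\W\d_]+)*", text)
    # Sentences: naive split after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return {
        "words": len(words),
        "sentences": len(sentences),
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
    }
```

To try it on the full text, read the downloaded file into a string with `open(path, encoding="utf-8").read()`, where `path` is wherever you saved it, and pass the result to `basic_stats`.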

In my next post in this series I will begin my analysis of ISLT!

Update 5/11/2013: I discovered that I omitted the introduction to Sodom and Gomorrah. The file has been updated.