My last post provided the complete text of Proust’s “In Search of Lost Time” (ISLT), which I now want to explore a bit using Mathematica. In future posts I will do some experiments with Python as well, but for now, if you are sufficiently motivated (and can read Mathematica), you can probably translate much of my code to another language.
The first thing I did was read the entire text into Mathematica and split it into sentences. Splitting text into sentences is not as easy as it may appear. Periods appear in abbreviations such as “M. Charlus” and “Mme. Cambremer”. They also occasionally appear inside quotes or in other places that do not mark the end of a sentence. I handled abbreviations by simply replacing “M.” with “M”, and so on; the other cases are left alone, so the sentence counts below are approximate. The final result of the following three lines of code is a variable “proust” that is a list of lists: each sentence in ISLT broken up by word.
txt = Import["islt_proust.txt", CharacterEncoding -> "Unicode"];
txt2 = StringReplace[txt, x : {" M.", " Mme.", " Mlle."} :> StringDrop[x, -1]]; (* Obvious cases where '.' does not mean end of sentence *)
proust = ReadList[StringToStream[txt2], Word, RecordLists -> True, RecordSeparators -> {"."}]
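To see what the abbreviation rule does, here is a minimal sketch on a made-up two-sentence string. The string toy and the names toy2, stream and toyWords are illustrative assumptions, not part of the ISLT text; the StringReplace rule and the ReadList options are the same as above.

```mathematica
(* Toy input: invented for illustration only *)
toy = "I met M. Charlus today. Then Mme. Verdurin arrived.";

(* Same rule as above: drop the trailing period from the listed abbreviations *)
toy2 = StringReplace[toy, x : {" M.", " Mme.", " Mlle."} :> StringDrop[x, -1]]
(* "I met M Charlus today. Then Mme Verdurin arrived." *)

(* Same reader as above: records end at ".", words are split on whitespace *)
stream = StringToStream[toy2];
toyWords = ReadList[stream, Word, RecordLists -> True, RecordSeparators -> {"."}];
Close[stream];

Map[Length, toyWords] (* number of words in each record *)
```

Running the ReadList step on toy instead of toy2 should show the problem the rule avoids: the period after each abbreviation would end a record early.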
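Based on these rules, there are over 35,000 sentences in ISLT. Below are the positions and word counts of the five longest, followed by the longest one in full:
proustSent = proust; (* assumption: proustSent, used from here on, is the list built above *)
sentLengths = Map[Length, proustSent]; (* assumption: number of words in each sentence *)
longest5 = Ordering[sentLengths, -5]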
{557, 27166, 8400, 41, 17005}
sentLengths[[longest5]]
{426, 447, 457, 599, 958}
proustSent[[Last[longest5]]]
{“Their”, “honour”, “precarious”, “their”, “liberty”, “provisional”, “lasting”, “only”, “until”, “the”, “discovery”, “of”, “their”, “crime”, “their”, “position”, “unstable”, “like”, “that”, “of”, “the”, “poet”, “who”, \
“one”, “day”, “was”, “feasted”, “at”, “every”, “table”, “applauded”, “in”, “every”, “theatre”, “in”, “London”, “and”, “on”, “the”, “next”, “was”, \
“driven”, “from”, “every”, “lodging”, “unable”, “to”, “find”, “a”, “pillow”, “upon”, “which”, “to”, “lay”, “his”, “head”, “turning”, “the”, “mill”, \
“like”, “Samson”, “and”, “saying”, “like”, “him”, “The”, “two”, “sexes”, “shall”, “die”, “each”, “in”, “a”, “place”, “apart”, “excluded”, “even”, \
“save”, “on”, “the”, “days”, “of”, “general”, “disaster”, “when”, “the”, “majority”, “rally”, “round”, “the”, “victim”, “as”, “the”, “Jews”, \
“rallied”, “round”, “Dreyfus”, “from”, “the”, “sympathy”, “at”, “times”, “from”, “the”, “society”, “of”, “their”, “fellows”, “in”, “whom”, “they”, \
“inspire”, “only”, “disgust”, “at”, “seeing”, “themselves”, “as”, “they”, “are”, “portrayed”, “in”, “a”, “mirror”, “which”, “ceasing”, “to”, “flatter”, \
“them”, “accentuates”, “every”, “blemish”, “that”, “they”, “have”, “refused”, “to”, “observe”, “in”, “themselves”, “and”, “makes”, “them”, “understand”, \
“that”, “what”, “they”, “have”, “been”, “calling”, “their”, “love”, “a”, “thing”, “to”, “which”, “playing”, “upon”, “the”, “word”, “they”, “have”, \
“by”, “association”, “annexed”, “all”, “that”, “poetry”, “painting”, “music”, “chivalry”, “asceticism”, “have”, “contrived”, “to”, “add”, “to”, “love”, \
“springs”, “not”, “from”, “an”, “ideal”, “of”, “beauty”, “which”, “they”, “have”, “chosen”, “but”, “from”, “an”, “incurable”, “malady”, “like”, “the”, \
“Jews”, “again”, “save”, “some”, “who”, “will”, “associate”, “only”, “with”, “others”, “of”, “their”, “race”, “and”, “have”, “always”, “on”, “their”, \
“lips”, “ritual”, “words”, “and”, “consecrated”, “pleasantries”, “shunning”, “one”, “another”, “seeking”, “out”, “those”, “who”, “are”, “most”, \
“directly”, “their”, “opposite”, “who”, “do”, “not”, “desire”, “their”, “company”, “pardoning”, “their”, “rebuffs”, “moved”, “to”, “ecstasy”, “by”, \
“their”, “condescension”, “but”, “also”, “brought”, “into”, “the”, “company”, “of”, “their”, “own”, “kind”, “by”, “the”, “ostracism”, “that”, “strikes”, \
“them”, “the”, “opprobrium”, “under”, “which”, “they”, “have”, “fallen”, “having”, “finally”, “been”, “invested”, “by”, “a”, “persecution”, “similar”, \
“to”, “that”, “of”, “Israel”, “with”, “the”, “physical”, “and”, “moral”, “characteristics”, “of”, “a”, “race”, “sometimes”, “beautiful”, “often”, \
“hideous”, “finding”, “in”, “spite”, “of”, “all”, “the”, “mockery”, “with”, “which”, “he”, “who”, “more”, “closely”, “blended”, “with”, “better”, \
“assimilated”, “to”, “the”, “opposing”, “race”, “is”, “relatively”, “in”, “appearance”, “the”, “least”, “inverted”, “heaps”, “upon”, “him”, “who”, \
“has”, “remained”, “more”, “so”, “a”, “relief”, “in”, “frequenting”, “the”, “society”, “of”, “their”, “kind”, “and”, “even”, “some”, “corroboration”, \
“of”, “their”, “own”, “life”, “so”, “much”, “so”, “that”, “while”, “steadfastly”, “denying”, “that”, “they”, “are”, “a”, “race”, “the”, “name”, \
“of”, “which”, “is”, “the”, “vilest”, “of”, “insults”, “those”, “who”, “succeed”, “in”, “concealing”, “the”, “fact”, “that”, “they”, “belong”, “to”, \
“it”, “they”, “readily”, “unmask”, “with”, “a”, “view”, “less”, “to”, “injuring”, “them”, “though”, “they”, “have”, “no”, “scruple”, “about”, \
“that”, “than”, “to”, “excusing”, “themselves”, “and”, “going”, “in”, “search”, “as”, “a”, “doctor”, “seeks”, “cases”, “of”, “appendicitis”, “of”, \
“cases”, “of”, “inversion”, “in”, “history”, “taking”, “pleasure”, “in”, “recalling”, “that”, “Socrates”, “was”, “one”, “of”, “themselves”, “as”, \
“the”, “Israelites”, “claim”, “that”, “Jesus”, “was”, “one”, “of”, “them”, “without”, “reflecting”, “that”, “there”, “were”, “no”, “abnormals”, “when”, \
“homosexuality”, “was”, “the”, “norm”, “no”, “antiChristians”, “before”, “Christ”, “that”, “the”, “disgrace”, “alone”, “makes”, “the”, “crime”, \
“because”, “it”, “has”, “allowed”, “to”, “survive”, “only”, “those”, “who”, “remained”, “obdurate”, “to”, “every”, “warning”, “to”, “every”, “example”, \
“to”, “every”, “punishment”, “by”, “virtue”, “of”, “an”, “innate”, “disposition”, “so”, “peculiar”, “that”, “it”, “is”, “more”, “repugnant”, \
“to”, “other”, “men”, “even”, “though”, “it”, “may”, “be”, “accompanied”, “by”, “exalted”, “moral”, “qualities”, “than”, “certain”, “other”, “vices”, \
“which”, “exclude”, “those”, “qualities”, “such”, “as”, “theft”, “cruelty”, “breach”, “of”, “faith”, “vices”, “better”, “understood”, “and”, “so”, \
“more”, “readily”, “excused”, “by”, “the”, “generality”, “of”, “men”, “forming”, “a”, “freemasonry”, “far”, “more”, “extensive”, “more”, \
“powerful”, “and”, “less”, “suspected”, “than”, “that”, “of”, “the”, “Lodges”, “for”, “it”, “rests”, “upon”, “an”, “identity”, “of”, “tastes”, \
“needs”, “habits”, “dangers”, “apprenticeship”, “knowledge”, “traffic”, “glossary”, “and”, “one”, “in”, “which”, “the”, “members”, “themselves”, \
“who”, “intend”, “not”, “to”, “know”, “one”, “another”, “recognise”, “one”, “another”, “immediately”, “by”, “natural”, “or”, “conventional”, \
“involuntary”, “or”, “deliberate”, “signs”, “which”, “indicate”, “one”, “of”, “his”, “congeners”, “to”, “the”, “beggar”, “in”, “the”, “street”, “in”, \
“the”, “great”, “nobleman”, “whose”, “carriage”, “door”, “he”, “is”, “shutting”, “to”, “the”, “father”, “in”, “the”, “suitor”, “for”, “his”, \
“daughter’s”, “hand”, “to”, “him”, “who”, “has”, “sought”, “healing”, “absolution”, “defence”, “in”, “the”, “doctor”, “the”, “priest”, “the”, \
“barrister”, “to”, “whom”, “he”, “has”, “had”, “recourse”, “all”, “of”, “them”, “obliged”, “to”, “protect”, “their”, “own”, “secret”, “but”, \
“having”, “their”, “part”, “in”, “a”, “secret”, “shared”, “with”, “the”, “others”, “which”, “the”, “rest”, “of”, “humanity”, “does”, “not”, “suspect”, \
“and”, “which”, “means”, “that”, “to”, “them”, “the”, “most”, “wildly”, “improbable”, “tales”, “of”, “adventure”, “seem”, “true”, “for”, “in”, \
“this”, “romantic”, “anachronistic”, “life”, “the”, “ambassador”, “is”, “a”, “bosom”, “friend”, “of”, “the”, “felon”, “the”, “prince”, “with”, “a”, \
“certain”, “independence”, “of”, “action”, “with”, “which”, “his”, “aristocratic”, “breeding”, “has”, “furnished”, “him”, “and”, “which”, “the”, \
“trembling”, “little”, “cit”, “would”, “lack”, “on”, “leaving”, “the”, “duchess’s”, “party”, “goes”, “off”, “to”, “confer”, “in”, “private”, “with”, \
“the”, “hooligan”, “a”, “reprobate”, “part”, “of”, “the”, “human”, “whole”, “but”, “an”, “important”, “part”, “suspected”, “where”, “it”, “does”, “not”, \
“exist”, “flaunting”, “itself”, “insolent”, “and”, “unpunished”, “where”, “its”, “existence”, “is”, “never”, “guessed”, “numbering”, “its”, \
“adherents”, “everywhere”, “among”, “the”, “people”, “in”, “the”, “army”, “in”, “the”, “church”, “in”, “the”, “prison”, “on”, “the”, “throne”, \
“living”, “in”, “short”, “at”, “least”, “to”, “a”, “great”, “extent”, “in”, “a”, “playful”, “and”, “perilous”, “intimacy”, “with”, “the”, “men”, “of”, \
“the”, “other”, “race”, “provoking”, “them”, “playing”, “with”, “them”, “by”, “speaking”, “of”, “its”, “vice”, “as”, “of”, “something”, “alien”, “to”, \
“it”, “a”, “game”, “that”, “is”, “rendered”, “easy”, “by”, “the”, “blindness”, “or”, “duplicity”, “of”, “the”, “others”, “a”, “game”, “that”, \
“may”, “be”, “kept”, “up”, “for”, “years”, “until”, “the”, “day”, “of”, “the”, “scandal”, “on”, “which”, “these”, “liontamers”, “are”, “devoured”, \
“until”, “then”, “obliged”, “to”, “make”, “a”, “secret”, “of”, “their”, “lives”, “to”, “turn”, “away”, “their”, “eyes”, “from”, “the”, “things”, \
“on”, “which”, “they”, “would”, “naturally”, “fasten”, “them”, “to”, “fasten”, “them”, “upon”, “those”, “from”, “which”, “they”, “would”, \
“naturally”, “turn”, “away”, “to”, “change”, “the”, “gender”, “of”, “many”, “of”, “the”, “words”, “in”, “their”, “vocabulary”, “a”, “social”, \
“constraint”, “slight”, “in”, “comparison”, “with”, “the”, “inward”, “constraint”, “which”, “their”, “vice”, “or”, “what”, “is”, “improperly”, \
“so”, “called”, “imposes”, “upon”, “them”, “with”, “regard”, “not”, “so”, “much”, “now”, “to”, “others”, “as”, “to”, “themselves”, “and”, “in”, “such”, \
“a”, “way”, “that”, “to”, “themselves”, “it”, “does”, “not”, “appear”, “a”, “vice”}
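The indexing used to find the longest sentences relies on how Ordering reports positions. A small sketch with made-up word counts (toyLengths and top2 are names assumed for illustration): Ordering[list, -n] returns the positions of the n largest elements, ordered from the smaller value to the larger, which is why the longest sentence is the last entry of longest5.

```mathematica
toyLengths = {3, 9, 1, 7};        (* invented word counts *)
top2 = Ordering[toyLengths, -2]   (* {4, 2}: positions of the two largest values *)
toyLengths[[top2]]                (* {7, 9}: the values, smaller of the two first *)
toyLengths[[Last[top2]]]          (* 9: the maximum *)
```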
Proust is known for his long sentences, but do the lengths fit a pattern? In a previous series of posts I showed that certain (American) football stats fit a lognormal distribution. It turns out that ISLT sentence lengths also appear to be lognormally distributed. The FindDistributionParameters function makes the fit pretty easy. To draw the fitted density on the same chart as a histogram of counts, we rescale it so that its peak matches the tallest histogram bar.
lnp = FindDistributionParameters[sentLengths, LogNormalDistribution[m, s], ParameterEstimator -> "MethodOfMoments"]
ratio = Max[HistogramList[Select[sentLengths, # < 250 &]][[2]]] / Maximize[{PDF[LogNormalDistribution[m, s] /. lnp, x], x >= 0}, x][[1]]
Show[Histogram[Select[sentLengths, # < 250 &]], Plot[ratio*PDF[LogNormalDistribution[m, s] /. lnp, x], {x, 0, 250.}, PlotStyle -> Thick, PlotRange -> All]]
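As a rough check on what the "MethodOfMoments" estimator is doing, the sketch below fits synthetic lognormal data and compares the result with closed-form values obtained by matching the first two raw moments of a lognormal distribution. The sample is generated here with assumed parameters 3 and 0.8 and has nothing to do with ISLT; the names data, fit, m1, m2, s2 and byHand are illustrative. The two results should agree closely, assuming the estimator matches those same two moments.

```mathematica
SeedRandom[1];
data = RandomVariate[LogNormalDistribution[3, 0.8], 5000]; (* synthetic sample *)

(* Built-in fit, same call as in the post *)
fit = FindDistributionParameters[data, LogNormalDistribution[m, s],
   ParameterEstimator -> "MethodOfMoments"]

(* By hand: for a lognormal, E[X] = Exp[m + s^2/2] and E[X^2] = Exp[2 m + 2 s^2],
   so s^2 = Log[E[X^2]/E[X]^2] and m = Log[E[X]] - s^2/2 *)
m1 = Mean[data];
m2 = Mean[data^2];
s2 = Log[m2/m1^2];
byHand = {m -> Log[m1] - s2/2, s -> Sqrt[s2]}
```

Because the method of moments is driven by the sample mean and the mean of squares, a handful of extremely long sentences can pull the fitted parameters noticeably.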

It’s also interesting to examine sentence lengths by character. Using the Select function we can extract the sentences that contain a particular name (or any other single word). The SentLength function below computes the median length of those sentences, and the bar chart compares the medians across characters, with the overall median drawn as a dashed red line. Interestingly, two powerful but distant (to the author) male figures top the list.
medSentLength = N[Median[sentLengths]];
SentLength[name_] := N[Median[Map[Length, Select[proustSent, MemberQ[#, name] &]]]];
barNames = {"I", "Oriane", "Bloch", "Albertine", "Charlus", "Mother", "Swann", "Gilberte", "Father", "Christ"};
barValues = Map[SentLength, barNames]
BarChart[MapThread[Labeled[#1, #2, Above] &, {barValues, barNames}], GridLines -> {None, {medSentLength}}, GridLinesStyle -> Directive[Red, Thick, Dashed],
Method -> {"GridLinesInFront" -> True}, ChartLabels -> {barNames}, BaseStyle -> {FontFamily -> "Helvetica", FontSize -> 12}]
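The per-character median is easy to check by hand on a tiny example. The sketch below uses an invented three-sentence corpus, and a hypothetical helper toyMedian that mirrors SentLength but takes the corpus as an argument.

```mathematica
(* Invented mini-corpus: each sentence is a list of words *)
toyCorpus = {{"Swann", "came", "in"},
   {"I", "saw", "Swann", "and", "Odette", "there"},
   {"I", "slept"}};

(* Same logic as SentLength, with the corpus passed in explicitly *)
toyMedian[sents_, name_] := N[Median[Map[Length, Select[sents, MemberQ[#, name] &]]]];

toyMedian[toyCorpus, "Swann"] (* 4.5: median of lengths 3 and 6 *)
toyMedian[toyCorpus, "I"]     (* 4.: median of lengths 6 and 2 *)
```

MemberQ compares whole list elements, so a token such as "Swann’s" would not count as a mention of "Swann".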

Finally, I wanted to see whether average sentence length changes significantly as we progress through the story. Can we identify trends? To examine this, I used the ExponentialMovingAverage function, followed by a moving median over a window of 100 sentences. I wasn’t really able to conclude anything substantial. So in the spirit of Stephen Wolfram, here are some pretty pictures.
ListLinePlot[ExponentialMovingAverage[sentLengths, 0.02]]

mm = N[MovingMedian[sentLengths, 100]];
ListLinePlot[mm]
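For readers unfamiliar with the two smoothers, here is a small sketch on invented data (toyData and alpha are assumed names and values). It assumes ExponentialMovingAverage follows the usual recurrence, starting at the first value and then moving a fraction alpha of the way toward each new value; the FoldList line writes that recurrence out so the two results can be compared.

```mathematica
toyData = {0, 10, 10, 10}; (* invented data *)
alpha = 1/2;               (* smoothing constant; the post uses 0.02 *)

ExponentialMovingAverage[toyData, alpha]

(* The recurrence written out by hand *)
FoldList[#1 + alpha (#2 - #1) &, First[toyData], Rest[toyData]] (* {0, 5, 15/2, 35/4} *)

(* MovingMedian takes the median of each block of r consecutive values *)
MovingMedian[{1, 5, 2, 8, 3}, 3] (* {2, 5, 3} *)
```

A moving median over blocks of r values returns a list that is r - 1 elements shorter than its input, so positions in mm are offset slightly from sentence numbers in sentLengths.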
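With the help of some online sources, I call out the high and low points in the moving median: local maxima at least 20 words above the overall median sentence length, and local minima at least 15 words below it.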
points = Transpose[{Range[1, Length[mm]], mm}];
medLength = Median[sentLengths];
(* http://mathematica.stackexchange.com/questions/13557/local-max-min-of-mathematica-data-sets *)
peaks = Select[Pick[points[[2 ;; -2]], Differences[Sign[Differences[points[[All, 2]]]]], -2], #[[2]] >= medLength + 20 &]
valleys = Select[Pick[points[[2 ;; -2]], Differences[Sign[Differences[points[[All, 2]]]]], 2], #[[2]] <= medLength - 15 &];
ListLinePlot[points, Prolog -> {PointSize[Large], Red, Point[peaks], Green, Point[valleys]}]
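The peak and valley detection rests on one idiom: take the sign of each step, then look for places where the sign flips. A minimal sketch on an invented series (vals, turn and interior are assumed names):

```mathematica
vals = {1, 3, 2, 0, 4, 6, 5};               (* invented series *)
turn = Differences[Sign[Differences[vals]]] (* {-2, 0, 2, 0, -2} *)
interior = vals[[2 ;; -2]];                 (* the endpoints cannot be classified *)
Pick[interior, turn, -2]                    (* {3, 6}: rise followed by fall, local maxima *)
Pick[interior, turn, 2]                     (* {0}: fall followed by rise, local minimum *)
```

One limitation of this idiom is that flat tops and bottoms are skipped: for {1, 3, 3, 1} the same computation gives {-1, -1}, so no element is picked. A moving median of integer word counts may well contain such plateaus.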
