- Provide an interesting, extremely long plain-text data set useful for text analytics based on material in the public domain.
- Demonstrate simple text processing tools in proprietary (Mathematica) and freely available (Python) languages.
- Report descriptive statistics for In Search of Lost Time.
- Perform a graph analysis of the major characters to create a visual representation of the relationships between them.
None of what I will be doing will be especially deep: these posts constitute my first and only experiments in this area. My only training in text analytics has been based on web searches to help figure out how to do the things I listed above. In other words, it’s just me fooling around in odd moments at nights and on weekends over the past year.
Why Proust? About five years ago I learned that a high school friend was working through seven of literature’s great-but-lengthy classics as a kind of a challenge. I decided to join him on one leg of that journey – In Search of Lost Time [hereafter ISLT] – and three years later I made it through all seven volumes. Proust’s voice is so persistent and personal that upon completion I felt as if I had a new friend, or perhaps a younger cousin or brother. Major themes include love, memory, sexuality, finding one’s place, art, obsession, but ultimately ISLT is so expansive that it becomes impossible to categorize. It is what it is; describing it is like trying to give a eulogy. ISLT comes to mind at least once a week, even today.
Reading Proust is something of a cliche. The first two things that people mention when discussing Proust are that 1) the sentences are long and hard to follow, and 2) ISLT is unbearably long. Regarding the first point, my experience was that you get used to it. You adjust and work a little harder. The second point is more complicated. When I began ISLT I decided that I would not put any sort of timeline on finishing, or even on making progress. I would try to simply take it in for what it is, and accept it as presented. That helped immensely. That said, certain portions (sometimes hundreds of pages long) were really tedious – in particular large sections of “The Prisoner” and “The Fugitive”. The end, which in Proust-speak means the last several hundred pages, was so transcendently awesome that in retrospect I don’t care. Your mileage may vary – my point is that you should not be intimidated by it, or apologetic about giving it a try.
And now, to the main event:
This text includes the following volumes:
- Swann’s Way
- In the Shadow of Young Girls in Flower
- The Guermantes Way
- Sodom and Gomorrah
- The Prisoner
- The Fugitive
- Finding Time Again
The source text is generously provided by the University of Adelaide under the Creative Commons License. The consolidated text provided in the link above is also provided under the same license. I created the file as follows:
- I downloaded each volume of ISLT from the University of Adelaide as listed above.
- Removed special characters (but left in Unicode).
- Removed the front matter regarding web publishing of the text (substituting with the auxiliary file in the download directory along with the acknowledgement provided here).
- Removed book titles (while retaining chapter titles and chapter summaries provided in the later books).
I have not reformatted the text to make it pleasing to the eye. The purpose is simply to have a huge text corpus for use in future posts. If you decide to reformat the text to make it easier to read (without otherwise messing with it), I am happy to host it on your behalf, or link to it from this post provided the same licensing terms apply.
In my next post in this series I will begin my analysis of ISLT!
Update 5/11/2013: I discovered that I omitted the introduction to Sodom and Gomorrah. The file has been updated.
If you had a look at my calendar you would think that most of my days were the same. Daily status meetings, design reviews, one-on-one meetings with team members, client presentations, and so on. They’re not. Some days feel like a mad rush and others are more contemplative. Still others, the days with 90 minute project reviews and training and updates and alignment exercises, are a slog. On those days the mind wanders and in the odd moments when I am not checking twitter I ask myself, “What am I trying to achieve?” When I answer my own question I realize that with one tug at the bow I’m trying to hit two targets at once: the perfect product and the perfect team. The team often outlasts the product, and is often more important to the business. How tempting, and how dangerous it is to focus on the product to the detriment of the team.
Steven Sinofsky is blogging again; he’s focusing on the interplay between engineering and human factors in building technology. This focus leads him to consider how and why things are built, rather than what is built – which is why his blog makes for interesting reading. (**) In his internal blog at Microsoft Sinofsky killed countless virtual trees describing and justifying the organizational structures he feels are necessary to build big software in big companies. “Don’t ship the org chart”, Sinofsky is claimed to have said. I don’t remember if he really did, but who cares – he meant to. The point is simply for leaders to create an organizational structure where employees are best positioned to succeed. Brad Garlinghouse, formerly of Yahoo!, agrees. His widely-known “Peanut Butter Memo” warned his colleagues of the dangers of spreading organizational focus too thin at the expense of a coherent vision around which initiatives can be organized. It sounds almost tautologous when put this way, but the practical wisdom from this general point is elusive. Seven years later, Garlinghouse writes:
In a recent blog post, venture capitalist Ben Horowitz relegated corporate culture to a second-class citizen behind creating a great product. I respectfully disagree. Great products don’t come out of thin air. They are an outcome of environments where innovation can thrive and talented people are encouraged to be bold.
Garlinghouse says that building great teams is the ultimate objective because they produce worthy products. Yet a team without works is dead. Sinofsky focuses on structures and process because creating the right environment is essential. A key message is to get the right structures in place before it is time to execute. The lessons are too hard to apply on the fly in the heat of the moment, when deadlines loom, requirements stack, and you’re reminded daily of the incredible importance of the project upon which the fates of many rest, and upon which upper management is laser focused. “Our initiative to produce automated TPS reports could not possibly be more important!” And in some sense this is true. But this is the wrong time to be thinking about how to build a great team. The foundation needs to have been laid long before, so it all holds together when the storm hits. I realize I am not saying too much about what the frameworks look like, and that’s intentional: read Sinofsky’s blog and you’ll find out. It boils down to:
- Having a clear vision.
- Making sure that everyone has a clear role that they understand.
- Empowering everyone to do their job.
When I say or write something, there are actually a whole lot of different things I am communicating. The propositional content (i.e. the verbal information I’m trying to convey) is only one part of it. Another part of it is stuff about me, the communicator.
You are broadcasting on two channels and the second one – what you’re trying to say about yourself in the way that you say what you say – is often more important than the first. And that’s okay. Similarly, as a manager when I discuss project-related business goals I am also discussing how to develop the team professionally in a way that is fulfilling and productive. In practice this means that project-related tasks (e.g. “write the user guide for the advanced optimization UI”) often have a broader, hidden meaning (“understand the complexity of the processes we are asking our users to carry out”). The goal-behind-the-task can be lots of things: to provide the opportunity to do something new or fun, to mentor a new teammate, to work with someone with a different communication style, to learn a process few people understand, to struggle, to get the chance to take a breather. It is not a hidden Karate Kid-style lesson; we sometimes discuss it up front. I don’t know how to think about professional development outside the context of the things we are asked to do, and I don’t know how to get projects done without thinking deeply about the diversity of the skills and characteristics of each member of the team. This is why “matrixed organizations” confuse me: you can hit at most one target.
You’re building two things: be mindful of them both since they rest on each other.
(**) Side note: Steven’s blog is also praiseworthy because it is snark free. It’s filled with ideas, data, and observation. Truth is interesting when told well, but it feels dangerous to rely on it too much for fear of sounding boring. Perhaps the lesson here is that we should not care so much about being adored and simply write what is true.
I recently received an invitation to an annual fundraising event at my alma mater, the University of Iowa. Looking at the invitation reminded me of how strongly I believe in the value of public education. John Kenneth Galbraith called public education an “investment in man”. I’m grateful for the investment that was made in me, yet I also feel the payoffs of investment in, and the receipt of, quality public education are under-recognized and undervalued.
Before I continue: discussion of higher education in America is fraught with conversational landmines and mythmaking. Rather than set public and private against each other, I want to talk about why I went the public route. To cut to the chase: if you or your kids are going to an Ivy, that’s great. I don’t begrudge that and I don’t have a single problem with that choice. What I want is for public higher education to be given its due credit, and public graduates to be given appropriate respect for their choice. In this post, I’m exploring my experience with my education. I know others with basically similar stories. I think it’s a side of higher education that doesn’t get enough print or airtime.
Here’s what I have a problem with: I have a problem with relatively small differences between public and private higher education being overblown, and assumptions being made about graduates’ capabilities based solely or largely on their school. I have a problem with employers dismissing resumes largely or solely based on an applicant’s lack of an elite school’s degree. Additionally, I have no time for hearing or reading about a false narrative of equal opportunity access between those two streams of education. I hear that any kid who is bright enough can get into a private school and be adequately funded: false. I hear about kids in huge classes receiving mediocre undergraduate educations in public colleges, and about immersive foreign language programs in private schools, and very little in between. Most live somewhere in the middle.
My story is about living in the middle, being educated in the middle, and ultimately getting what I have needed to achieve in corporate America. I share it later in this post, and I share it as a testimony that success is possible, even without an Ivy. That smart kids… frequently don’t go to an Ivy. Often, they have made the very adult calculation that the debt accrued by heading to a private college, or the out-of-town college, is flatly not responsible or a wise investment. Which adolescent is smarter? The one who takes on thousands of extra dollars in a belief that a private college is necessary? Or the one who sucks it up and makes the more cautious choice for public, in-state tuition? Which demonstrates more critical thinking? Which capacity for personal judgment would you want in your employment?
I had a chat about college with a friend who works in a service occupation. In-state tuition at the University of Illinois is currently between $15,000 and $20,000, a far cry from the $2,000 at the University of Iowa twenty years ago. My friend has been saving for a couple of years to buy his wife a ’67 Camaro to mark their twentieth anniversary. Those cars are cheaper than you think if you’re not looking for mint condition. Last fall their basement flooded, and now every time he heads down to the rec room he thinks: this is my Camaro. This is life in the middle; the stories that play themselves out again and again, all over the country. In the same conversation, my friend told me that tuition at the University of Missouri is 9K after you establish in-state residency (one year). Technically my friend’s kids have every option available, but the financial implications are huge. Meanwhile, the rather slim differences between a great public and an Ivy are being trumped up and magnified, not only by parents who can afford to, but also by the media.
Even when we hear about the middle, the narrative is that equal opportunity is available for all. Pull yourself up by your bootstraps:
If you’re in the private system, you know something most Americans don’t – many private colleges and universities offer generous need-based financial aid packages to talented students with limited resources.
The message is that the private path is the only way to true success. Yet study after study has shown that it’s the kid more than the school that determines success. A kid capable of attending and succeeding at an Ivy and choosing to go somewhere else has pretty much the same shot at “success” as the Ivy kids. The only thing that gets in the way is optics, leading to privileged access. Further, in our increasingly global world, students who go to international universities are distorted by the same fun house mirror even though the (usually publicly funded) schools they went to are outstanding. The value of education is being dictated by branding. Even in spite of this, it’s not crazy to choose to go to a public school over a private school. When the tradeoff is years of debt for marginal utility, the choice is often obvious.
So here’s how it went for me.
My education has always been in the public classroom setting. I went to a public high school in a small Iowa town of around 30,000 with a graduating class of around 300. In high school we read Chaucer and Hesse, we built wind tunnels, we had band and debate and Model UN. They taught us to think. Many of my teachers were excellent. The educational climate in that school was matter-of-fact and earnest: educating our kids in this way is what we do. It wasn’t perfect in the sense that my every need was not catered to. In my junior high years I learned to code, spending countless hours hacking away but when I reached high school I found that there were no computer classes. The Christmas when my aunt and uncle gave our family an Apple IIe, my life was transformed. In college, I realized that many of my fellow math majors had already taken calculus and were ahead of me for no good reason. I did not receive a perfect education; it had areas of excellence and areas of weakness. That’s life in the middle.
I received a tuition scholarship to the University of Iowa because I was poor and because my grades and test scores demonstrated that I was smart. When I arrived at U of I, I was met with a campus of thirty thousand kids, countless buildings, a giant book filled with course listings. I was assigned an academic advisor, whom I visited once for about thirty minutes after I had already figured out what I wanted to take. There wasn’t a whole lot for us to talk about. I didn’t expect much else; that’s life in the middle. I ended up in several huge “intro” courses with hundreds of students (Physics, Calculus, Computer Science) as well as smaller discussion oriented courses (Rhetoric, European Conquest and Colonization, French). I was well-prepared intellectually but I had no idea what to do and no guide to lead me.
A pivotal moment in my educational life came when I took an Intro to Numerical Optimization course, largely because I knew they used Mathematica and it was cross-listed in CS and Math (I was a double major). The instructor, Florian Potra, was by turns erudite, charismatic, rambling and cutting. I loved him, and I loved the course, so when he mentioned he was looking for an undergrad research assistant, I was all in. My academic life changed that summer: long afternoons and evenings in Florian’s office, sitting at the chalkboard while Florian and one of his grad students worked out new formulations, and endless hours in the library and computer lab reading papers and writing code. College suddenly transformed from the breadth of chess, where anything is possible but the game ultimately cannot be conquered, to checkers, a narrow but bottomless well where the goal is to plumb deeper and deeper until we finally hit bottom and obtain complete understanding of all there is to know. I loved it.
My undergraduate experience carried over into graduate school, where I began to find my own intellectual voice. Eighteen months in, Florian left to take the chair at the University of Maryland. My new advisor was Kurt Anstreicher. Kurt was a whip-smart, calming presence whose intensity exceeded my own. We put an unsolved combinatorial optimization problem from 1968 in our sights, and using a combination of chess and checkers thinking we took it down. He taught me that it takes persistence, creative thinking, and partnership to achieve big goals.
I graduated with no debt, a little money in my pocket, unpolished, no connections, but with an offer to work at Microsoft. I defended my thesis on a Friday, got on a plane to Seattle on a Saturday, and started work on a Monday. Twelve years later I am a Senior Vice President at a world class company, leading a talented team, working to solve mission critical problems for the world’s top companies. I am proud of the public school education that led me to where I am. If you go to a public school, and you work hard and make connections: you can be successful.
I am painfully aware that many in our country do not experience what I experienced in the public system. This awareness is what motivates me to share my story, because I want to encourage this kind of public education – the kind that does not start and end with vocational training or positioning for the societal elite, but towards the molding of balanced, critically thinking citizens. I’d like to see us preserve and extend this model, yet I fear that it is being eaten away at both ends.
If I could say one thing to the chattering classes: equal choice in educational options is a myth, so stop judging people by where they went to school. Seriously. Stop it. Public education is not a second choice for most of us. Hard working kids choose to go to public schools. Smart kids go to public schools. Those kids become successful. I hire people, and when I do I just want to know whether the applicant can solve problems, communicate, work hard, and commit to excellence in partnership with their scholarship. I don’t care about which school they went to. I care about what they took from it.
It’s not clear to me that I should end this with a call to action, because it’s not clear who I’m even talking to. It’s simple: many in our country are seeing their only real educational options being underfunded and undervalued. These kids, who are just as worthy and deserving as anyone else, should get their shot. They’re worth our investment.
[Thanks to those who edited and provided fantastic feedback for this post.]
I apologize in advance for offending the entire analytics community.
The players in the analytics industry, like the great Houses in the kingdom of Westeros on the HBO show “Game of Thrones”, are contentious, calculating, and will stop at nothing in their drive to absolute power. Strike that – they actually all get along pretty well. But who cares: let’s match Game of Thrones houses to analytics companies. It’s a Wednesday. I haven’t read the books so I will keep this spoiler free and chock full of snark.
Let’s start with House Stark, the noble family from the North who are based in Winterfell, an ancient city known for its strength. They have been a strong presence in the North for ages, helping to protect all of Westeros from whatever lies behind the Wall. Their contributions are often taken for granted, and sometimes even mocked for their rough hewn, less refined nature. Of course, we are speaking of SAS.
House Lannister are one of the oldest and richest families in Westeros. Never far from and often in control of the locus of power, the Lannisters are sometimes loved, sometimes feared, but never forgotten. They are the House the others love to hate. They have built their house through shrewd diplomacy, marriage and acquisitions. They are known for always paying their debts, and always have the biggest and most impressive booths at INFORMS. Peter Dinklage: you’re an IBM’er.
House Greyjoy are tough and rugged. Ruling from near the sea on the West Coast, they often leave the other houses to their own devices, yet when they engage their presence is immediately felt. Their loyalties may shift but their core values never do. You don’t need a Crystal Ball to figure this out: Larry Ellison is the only analytics king with a navy. Oracle, what’s in your wallet?
House Baratheon are quite the enigma. They are obviously powerful and influential. You sometimes forget how influential – they are connected to almost every single plotline on the show. Their goal is to be present in every house in every castle. Divided into three factions (headed by Joffrey, Stannis, and Renly), they sometimes seem to be focused more on their own squabbles than the rest of Westeros. In spite of all of this, some forget that this house is even part of the show. All hail Microsoft! (PS: Welcome PASS Business Analytics attendees! Enjoy the rain and 40 degree weather.)
The land of House Tyrell is vast and plentiful. Their influence is felt not only through their own significant contributions but also through family connections: Margaery Tyrell is betrothed to Joffrey Baratheon-Microsoft. Frontline Systems – keep your feet on the ground and your head in the clouds.
One of the most fascinating aspects of Game of Thrones is the land Beyond the Wall. Above the Wall in the frozen north lie powers and forces beyond the comprehension of any of the Houses. Perhaps the King-Beyond-the-Wall may one day rule all of Westeros. Open source community, take a bow.
Members of House Targaryen once ruled Westeros, but key members of the ruling family fled King’s Landing to build their own army. I think those three dragons may be named Gu, Ro, and Bi.
See you at the fall conference!
I’ve got sports on the brain – perhaps you’ve noticed. Recently, without thinking about it too deeply, I posted former UCLA coach John Wooden’s Pyramid of Success on my office door. Wooden’s definition of success – the peace of mind which is a direct result of self-satisfaction in knowing you made the effort to become the best of which you are capable – is meaningful and practical to me. Wooden’s teams not only had unprecedented success on the basketball court (10 national championships), but also got along exceptionally well in a time when social and racial tensions in America ran very high. The successes his teams achieved were enduring and any organization would do well to try to emulate them. Wooden himself was by all accounts an admirable and decent man, in contrast to boorish, small, me-first coaches that are often seen strutting sidelines in fancy suits this time of year.
After sticking the pyramid on my door it dawned on me that my job at Nielsen is to be a coach. I’m a recruiter who tries to attract and retain players who are skilled, coachable, and pleasant to be around. I’m a bench coach who tries to make adjustments in the flow of the game – changing defenses, substituting players, hollering at the refs. I’m a practice coach who tries to teach my team about the game as best I can, requiring me to know what the heck I am talking about. I don’t score a single point myself (at least, not very many) but my team is successful based on the effort that we all put in together, for which I am accountable.
In this TED talk (1) Wooden describes the difference between winning and success. I think it’s pretty amazing that he could give an unscripted talk like this at age 98! A few nice lines include:
- Failing to prepare is preparing to fail.
- There is no progress without change, but change does not guarantee progress.
- Things will work out as they should, provided we do what we should.
If you surf YouTube for other Wooden interviews, you will find many occasions where former players say, “What Coach says sounds corny, but…” An earnest approach like Wooden’s can be disarming and even uncomfortable because the message is so raw. We often wrap ourselves in cynicism and irony to give the appearance that what we’re going through doesn’t matter that much, and can’t hurt us. The genius of Wooden is that he thought carefully about the ultimate meaning of all this dribbling, cutting, boxing out and shooting, and directed his energy to help all those under his watch to find that meaning for themselves. That meaning, rather than the banners hanging from Pauley Pavilion, is Wooden’s lasting legacy.
(1): I read the recent (dead on right) HBR article about TED. The TED brand has become a joke, but this video is worth it, I promise.
I wrote a computer program to rate the strength of every NCAA men’s college basketball team based on the Iterative Strength Rating algorithm. Last post I previewed it and now I am presenting my picks for the 2013 NCAA Tournament.
Peter Wolfe at UCLA has graciously provided scores for every single college basketball game (over 21,000), found here. I used this information to produce a rating for each team. I then produced a bracket by simply choosing the team with the higher rating.
My complete bracket is below: click to enlarge. Check my progress or lack thereof once the tournament starts by clicking here.
My Final Four is:
- Midwest: Duke beats Louisville
- West: New Mexico beats Gonzaga
- South: Georgetown beats Kansas
- East: Indiana beats Miami FL
with Duke beating Georgetown in the championship game. Notable upsets include Boise State defeating Arizona and Wisconsin, Bucknell defeating Butler, and Minnesota beating UCLA. The bracket is interesting in the sense that it is reasonable but the higher seed is not always selected.
Now, the gory details. I’ve based my rating on the Iterative Strength Rating by Boyd Nation. Here’s how ISR works. First, give each team an equal rating, say 1.0. Next, go through each game and give each team some points. The winning team gets the rating of the losing team plus a “winning bonus” of 0.25. The losing team gets the rating of the winning team minus a penalty of 0.25. Once all of the games have been scored, we can update ratings for each team by dividing the team’s total score by the number of games played. Now, we can rescore the games using the updated ratings again and again until the scores stabilize. The Net Prophet blog shows that this is a pretty good way to rate teams. (By the way: I highly recommend this blog. Scott Turner has done an amazing job evaluating a number of different approaches, all using freely available software. Kudos Scott!)
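Here is a minimal sketch of plain ISR as described above. It is not the code I ran for my picks. The input shape (a list of `(winner, loser)` pairs), the function name `isr_ratings`, and the stopping rule (quit when no rating moves by more than `tol`, or after `max_iters` passes) are all assumptions made for illustration.

```python
from collections import defaultdict


def isr_ratings(games, bonus=0.25, max_iters=1000, tol=1e-9):
    """Rate teams with a basic Iterative Strength Rating loop.

    games: list of (winner, loser) pairs (a hypothetical input format).
    bonus: the winning bonus / losing penalty, 0.25 as in the text.
    max_iters, tol: assumed stopping rule, not part of the original description.
    """
    teams = {team for game in games for team in game}
    # Step 1: every team starts with the same rating.
    ratings = {team: 1.0 for team in teams}

    for _ in range(max_iters):
        totals = defaultdict(float)
        counts = defaultdict(int)
        # Step 2: score every game using the current ratings.
        for winner, loser in games:
            totals[winner] += ratings[loser] + bonus
            totals[loser] += ratings[winner] - bonus
            counts[winner] += 1
            counts[loser] += 1
        # Step 3: new rating = total points / games played.
        updated = {team: totals[team] / counts[team] for team in teams}
        change = max(abs(updated[team] - ratings[team]) for team in teams)
        ratings = updated
        # Step 4: repeat until the ratings stabilize.
        if change < tol:
            break
    return ratings


# Toy season: A beats B, B beats C, A beats C.
toy_games = [("A", "B"), ("B", "C"), ("A", "C")]
toy_ratings = isr_ratings(toy_games)
```

On the toy season the ratings settle with A on top, B in the middle, and C at the bottom, which is what you would hope for. On real data the loop is only as good as the schedule graph: teams that never play anyone in the main cluster can't be compared with it.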
This year, I created my own variant of ISR. There are two main modifications. First, I am accounting for margin of victory. How? In a 2006 paper by Paul Kvam and Joel Sokol, the authors derive an expression for the probability that Team A will defeat Team B, given that Team A beat Team B by x points on Team A’s home court:
RH(x) = exp(0.292 x - 0.6228) / (1 + exp(0.292 x - 0.6228))
This function levels off as the margin gets higher: the values for x=21 and x=20 are almost identical and close to 1. This function is also an indirect measure of strength of victory. Given the score of a game, and taking into account the home floor, we can evaluate this function and scale the “winning bonus” – so a large margin of victory will result in a winning bonus greater than 0.25 and a smaller margin of victory will result in a smaller winning bonus.
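To make the shape of RH(x) concrete, here is a small sketch that evaluates it and scales the winning bonus. The formula and its coefficients are the ones quoted above. The scaling rule in `scaled_bonus` (multiply RH(x) by twice the base bonus, so that RH = 0.5 gives back exactly 0.25) is an assumption for illustration, since I haven't spelled out the exact scaling here, and both function names are made up. The sketch only covers a home win.

```python
import math


def rh(margin):
    """RH(x) from the text: logistic function of the home margin of victory x."""
    z = 0.292 * margin - 0.6228
    return math.exp(z) / (1.0 + math.exp(z))


def scaled_bonus(margin, base_bonus=0.25):
    """Hypothetical scaling: RH(x) = 0.5 maps to the base bonus of 0.25.

    Big margins push the bonus toward 2 * base_bonus; narrow ones fall below it.
    """
    return 2.0 * base_bonus * rh(margin)


# The curve flattens out: a 21-point win is worth barely more than a 20-point win.
blowout_gap = rh(21) - rh(20)
```

With this particular scaling a one-point home win earns a bonus a bit under 0.25 and a twenty-point win earns nearly 0.5, which matches the behavior described above even if the constants differ from what I actually used.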
The second variation is to weight games differently. I divide the season into three segments:
- The first 10 games,
- The next 10 games,
- The rest of the season.
The segments have weights: [0.8, 1.0, 1.2]. Why? Because I felt like it: games later in the season are probably a better predictor. A better approach is to find optimized weights based on tournament predictive power. After each modified ISR iteration I renormalize team ratings so they are in the range [0, 1]. Effectively this means I compute three scores for each team instead of one, but I don’t think this screws up the predictive power of the model too much given the number of observations per team (around 30).
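The weighting and renormalization steps can be sketched like this. The weights [0.8, 1.0, 1.2] and the ten-game segments come from the text. Counting games per team (rather than by calendar date) and using min-max scaling to land in [0, 1] are assumptions, and the function names are hypothetical.

```python
SEGMENT_WEIGHTS = [0.8, 1.0, 1.2]  # first 10 games, next 10 games, rest of season


def segment_weight(game_number, weights=SEGMENT_WEIGHTS, segment_length=10):
    """Weight for a team's game_number-th game of the season (1-based).

    Assumes segments are counted per team, which the text does not specify.
    """
    segment = min((game_number - 1) // segment_length, len(weights) - 1)
    return weights[segment]


def renormalize(ratings):
    """Rescale ratings into [0, 1] with min-max scaling (an assumed method)."""
    low, high = min(ratings.values()), max(ratings.values())
    if high == low:
        # Every team is rated the same; put them all in the middle.
        return {team: 0.5 for team in ratings}
    return {team: (r - low) / (high - low) for team, r in ratings.items()}


example = renormalize({"A": 2.0, "B": 1.0, "C": 1.5})
```

In a weighted version of the ISR loop, each game's points would be multiplied by its weight and the divisor would become the sum of weights instead of the number of games played.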
I ran the algorithm on the complete set of 2012-2013 college basketball games, found here courtesy of Peter Wolfe of UCLA. This list is exhaustive and includes NAIA schools, Canadian schools, exhibition games, the Washington Generals, cats and dogs living together, etc. I’m not sure the teams are fully connected, so I do a pass through all of the games once, excluding exhibitions to identify a cluster of top-tier teams (presumably all Division I and II). The algorithm is about 140 lines of Python including the code to read the data. No fancy stuff. I will post the code later.
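Until I post the real code, here is one way the cluster step could look. This is a sketch, not necessarily what my program does: it treats the schedule as a graph, drops exhibition games, and collects every team reachable from a seed team. The `(team_a, team_b, is_exhibition)` tuple format, the function name, and the choice of seed are all assumptions.

```python
from collections import defaultdict, deque


def connected_cluster(games, seed):
    """Return the set of teams linked to `seed` through non-exhibition games.

    games: list of (team_a, team_b, is_exhibition) tuples (hypothetical format).
    """
    opponents = defaultdict(set)
    for team_a, team_b, is_exhibition in games:
        if is_exhibition:
            continue  # exhibitions don't count as a connection
        opponents[team_a].add(team_b)
        opponents[team_b].add(team_a)

    # Breadth-first search outward from the seed team.
    cluster = {seed}
    queue = deque([seed])
    while queue:
        team = queue.popleft()
        for opponent in opponents[team]:
            if opponent not in cluster:
                cluster.add(opponent)
                queue.append(opponent)
    return cluster


schedule = [
    ("Duke", "Louisville", False),
    ("Louisville", "Kansas", False),
    ("Kansas", "Washington Generals", True),  # exhibition, ignored
    ("NAIA A", "NAIA B", False),              # never plays the main cluster
]
top_tier = connected_cluster(schedule, "Duke")
```

Ratings are only comparable within a connected cluster, so restricting the games to this set before running the rating loop keeps stray teams from muddying the results.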
I have been doing little NCAA models like this for a few years now, and this is the first one I am proud of. We’ll see how it does. The main difference, of course, is that I am looking at individual games rather than aggregate team statistics over a season. A colleague of mine sometimes quotes the Papa John’s slogan “Better Ingredients, Better Pizza” when referring to the use of more granular data in models. I hope this year’s pizza tastes as good as it smells. (No endorsement implied…)
It’s NCAA Tournament selection Sunday! As in past years, I am going to write a program to make my picks following two principles: code it fast and do something reasonable. I am ahead of the game this year – I’m done!
The approach I am using is a modified version of the Iterative Strength Rating as described on the Net Prophet blog. I am making two modifications:
- Incorporate margin of victory.
- Weight late-season games more than early-season games.
I ran a preliminary version of the code on games played through March 17 (note: post updated 3/18 to include games from the last week). The model seeds the teams as follows. Let’s see how close these seeds are to the actual ones released this afternoon!
|3||St Louis U.|