At last spring’s INFORMS Analytics Conference I was invited to speak about Marketing Mix Analytics at Nielsen. I thought I would (belatedly) summarize my talk for those who were not able to attend.
Nielsen’s mission is to provide the most complete understanding of what consumers watch and buy. My team builds analytics solutions that use watch and buy information to help advertisers understand where their sales come from. Our primary analytical tool to do this is called marketing mix modeling. This first post summarizes what marketing mix is about, and how modeling teams assemble the data necessary for a mix model.
Marketing mix models measure the impact of marketing (and other drivers) on sales. Simply put, we go get sales data in partnership with our clients, and find matching time series data for everything that we believe affects sales: their advertising whether TV, radio, or online; their trade activity such as features and displays in grocery stores; their pricing and discounts; events, holidays, and industry trend. Once we obtain all of this data, we build a big regression model that predicts sales based on all of these factors. This has the effect of attributing the dips and spikes in sales to corresponding dips and spikes in activity. A big ad appears in the paper: sales spike. We assume some portion of the spike is because of the ad. When we run the regression model we obtain a decomposition of sales according to the various factors in the model, based on their coefficients. This allows us to make statements such as “7% of your sales are due to your TV advertising,” or, “you lost 3% of your sales due to your competitor’s pricing strategy.”
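To make the decomposition idea concrete, here is a minimal sketch on made-up data. The driver names, coefficients, and the plain least-squares fit are all assumptions for illustration; a production mix model is considerably more elaborate.

```python
import numpy as np

rng = np.random.default_rng(0)
weeks = 104  # two years of weekly data (made up)

# Hypothetical drivers: TV activity, trade activity, price discount.
drivers = {
    "tv": rng.uniform(0, 100, weeks),
    "trade": rng.uniform(0, 50, weeks),
    "discount": rng.uniform(0, 0.3, weeks),
}
names = list(drivers)
X = np.column_stack([np.ones(weeks)] + [drivers[k] for k in names])

# Synthetic sales: a base level plus a contribution from each driver, plus noise.
true_coefs = np.array([1000.0, 2.0, 3.0, 500.0])
sales = X @ true_coefs + rng.normal(0, 20, weeks)

# Ordinary least squares fit of sales on the drivers.
coefs, *_ = np.linalg.lstsq(X, sales, rcond=None)

# Decomposition: each factor's fitted contribution, summed over the period
# and expressed as a share of total sales.
contributions = X * coefs
totals = contributions.sum(axis=0)
shares = dict(zip(["base"] + names, totals / sales.sum()))

for name, share in shares.items():
    print(f"{name:>8}: {share:6.1%} of sales")
```

Because the fit includes an intercept, the shares add up to 100% of sales, which is what lets you say "x% of your sales are due to TV."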
These kinds of statements are useful by themselves but they’re even better when you turn them into decisions that affect the future. This is done by chaining models together to provide additional insight. A marketing mix model produces coefficients and decomps – which characterize past sales for historical levels of, for example, TV advertising. We can turn those into sales response curves which predict sales for any level of activity – even levels of advertising that were not conducted historically. These curves are the basis for forecasting and optimization models for media planning. Moving from raw sales, advertising, and pricing data to a coordinated, targeted media plan is a huge leap forward, but it is not without challenges.
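As a rough illustration of a response curve, the sketch below assumes a diminishing-returns shape. The functional form and the `ceiling` and `rate` parameters are hypothetical, chosen only to show how a curve can predict sales at spend levels that were never observed.

```python
import numpy as np

def response(spend, ceiling, rate):
    """Incremental sales at a given spend, with diminishing returns (assumed form)."""
    return ceiling * (1.0 - np.exp(-rate * spend))

# Hypothetical parameters; in practice they would be derived from the mix model.
ceiling, rate = 5000.0, 0.02

spend_levels = np.array([0.0, 50.0, 100.0, 200.0, 400.0])
predicted = response(spend_levels, ceiling, rate)

# Marginal return of one extra unit of spend at each level (derivative of the curve).
marginal = ceiling * rate * np.exp(-rate * spend_levels)

for s, p, m in zip(spend_levels, predicted, marginal):
    print(f"spend={s:6.0f}  incremental sales={p:8.1f}  marginal={m:6.2f}")
```

An optimizer for media planning would compare marginal returns like these across channels when deciding where the next dollar should go.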
Textbooks and websites will tell you that marketing mix modeling is old hat, but doing it right is hard work. First of all, getting the data is difficult. The point is for the analyst team and client to dream up everything that can impact sales…and obtain matching, correct time series data for up to three years in duration. Some data, like TV or radio advertising, can be sourced from within Nielsen. Sales, revenue, and margin data comes from a combination of client and MMM vendor sources. Other data such as industry trend, macroeconomic data, and so on may come from third parties. Cleaning and verifying data is always hard, but it’s particularly hard in marketing mix because of its dimensionality. The modeled product dimension may be at the brand, sub brand, or even the PPG (price promoted group) level – a collection of UPCs. Sales data is sometimes modeled down to the store level via grocery store scanner reports. The variety and intricacy of the data used for a “straightforward” mix necessitates a data review between the analyst team and client before modeling even begins. This step alone – getting the data – sometimes takes half of the total cycle time in a mix engagement. Time is money, so defining workflows and procedures that result in quick, accurate, repeatable data acquisition is good for vendor and client alike.
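One small, repeatable check of this kind is confirming that every series covers the same weeks. The sketch below is an assumption about what such a check might look like, not a description of any actual workflow; the series names and values are made up.

```python
from datetime import date, timedelta

def weekly_dates(start, n_weeks):
    """Return n_weeks consecutive weekly dates beginning at start."""
    return [start + timedelta(weeks=i) for i in range(n_weeks)]

def coverage_report(series, expected_dates):
    """For each named series (a dict of date -> value), list the missing dates."""
    return {
        name: [d for d in expected_dates if d not in values]
        for name, values in series.items()
    }

expected = weekly_dates(date(2012, 1, 1), 8)

# Made-up inputs: sales is complete, the TV series is missing two weeks.
series = {
    "sales": {d: 1000.0 for d in expected},
    "tv": {d: 50.0 for d in expected if d not in (expected[2], expected[5])},
}

report = coverage_report(series, expected)
for name, missing in report.items():
    status = "complete" if not missing else f"missing {len(missing)} week(s)"
    print(f"{name}: {status}")
```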
I will probably regret writing this post.
So perhaps you have seen these touchless toilets in airports. They flush by themselves. Exhibit A:
Amazing. Wonderful. Sanitary. See that button on the left? That flushes the old fashioned way. You need that. Because no matter how good the sensor is, how elegant your solution for activating the sensor, no matter how impeccably designed the sensor is, you are going to need that button. And when you need that button, boy, you really need that button.
The same thing goes when you are building an “automated” analytics system that hides all of the math and complexity from your poor user. You’re going to need that button, for scarily similar reasons.
Previous posts have discussed Proust’s In Search Of Lost Time (ISLT). Marco76UK asks “Out of curiosity, what is the third longest sentence?” I’ll do you two better and give the five longest English sentences in the Moncrieff translation. I am not 100% certain that I’ve got it right because sentence parsing is not as straightforward as it might seem – for example, does an internal quote ending in a sentence qualify as the end of a sentence? I will omit the details and simply give what seem to be the five longest sentences.
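For the curious, a naive version of the search can be sketched as follows. This is not the procedure actually used for the results below. The splitting rule and the abbreviation list are assumptions, and they show why the parsing is fiddly: titles like "Mlle." look like sentence endings, and a quotation that ends in a period mid-sentence has to be left alone.

```python
import re

# Titles that end in a period but do not end a sentence (illustrative list).
ABBREVIATIONS = {"M.", "Mme.", "Mlle.", "Mr.", "Mrs.", "Dr."}

def split_sentences(text):
    """Split on whitespace that follows ., ! or ?, except after known abbreviations."""
    pieces = [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]
    sentences, current = [], ""
    for piece in pieces:
        current = f"{current} {piece}".strip()
        # A piece ending in an abbreviation is glued to whatever follows it.
        if current.split()[-1] not in ABBREVIATIONS:
            sentences.append(current)
            current = ""
    if current:
        sentences.append(current)
    return sentences

def longest_sentences(text, k=5):
    """Return the k longest sentences as (word_count, sentence) pairs."""
    counted = [(len(s.split()), s) for s in split_sentences(text)]
    return sorted(counted, key=lambda pair: pair[0], reverse=True)[:k]

sample = (
    'The narrator spends many pages falling asleep in various rooms. '
    'I met Mlle. Swann in the garden! '
    'She said "Come along." and left. Short one.'
)
for count, sentence in longest_sentences(sample, k=2):
    print(count, sentence)
```

Under this rule a closing quote after the punctuation never ends a sentence, which is one defensible answer to the internal-quote question but certainly not the only one.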
A disclaimer: Proust’s discussions of sexuality and Judaism (among other things) in ISLT are complicated and controversial. Other sources discuss these topics in detail, and perspectives vary. Reporting on the content of ISLT should not be taken as an endorsement of anything whatsoever.
#1: At 958 words, a passage linking both of the controversial topics above, a discussion of those similar to the fascinating M. de Charlus.
Their honour precarious, their liberty provisional, lasting only until the discovery of their crime; their position unstable, like that of the poet who one day was feasted at every table, applauded in every theatre in London, and on the next was driven from every lodging, unable to find a pillow upon which to lay his head, turning the mill like Samson and saying like him: “The two sexes shall die, each in a place apart!”; excluded even, save on the days of general disaster when the majority rally round the victim as the Jews rallied round Dreyfus, from the sympathy — at times from the society — of their fellows, in whom they inspire only disgust at seeing themselves as they are, portrayed in a mirror which, ceasing to flatter them, accentuates every blemish that they have refused to observe in themselves, and makes them understand that what they have been calling their love (a thing to which, playing upon the word, they have by association annexed all that poetry, painting, music, chivalry, asceticism have contrived to add to love) springs not from an ideal of beauty which they have chosen but from an incurable malady; like the Jews again (save some who will associate only with others of their race and have always on their lips ritual words and consecrated pleasantries), shunning one another, seeking out those who are most directly their opposite, who do not desire their company, pardoning their rebuffs, moved to ecstasy by their condescension; but also brought into the company of their own kind by the ostracism that strikes them, the opprobrium under which they have fallen, having finally been invested, by a persecution similar to that of Israel, with the physical and moral characteristics of a race, sometimes beautiful, often hideous, finding (in spite of all the mockery with which he who, more closely blended with, better assimilated to the opposing race, is relatively, in appearance, the least inverted, heaps upon him who has remained more so) a relief in 
frequenting the society of their kind, and even some corroboration of their own life, so much so that, while steadfastly denying that they are a race (the name of which is the vilest of insults), those who succeed in concealing the fact that they belong to it they readily unmask, with a view less to injuring them, though they have no scruple about that, than to excusing themselves; and, going in search (as a doctor seeks cases of appendicitis) of cases of inversion in history, taking pleasure in recalling that Socrates was one of themselves, as the Israelites claim that Jesus was one of them, without reflecting that there were no abnormals when homosexuality was the norm, no anti-Christians before Christ, that the disgrace alone makes the crime because it has allowed to survive only those who remained obdurate to every warning, to every example, to every punishment, by virtue of an innate disposition so peculiar that it is more repugnant to other men (even though it may be accompanied by exalted moral qualities) than certain other vices which exclude those qualities, such as theft, cruelty, breach of faith, vices better understood and so more readily excused by the generality of men; forming a freemasonry far more extensive, more powerful and less suspected than that of the Lodges, for it rests upon an identity of tastes, needs, habits, dangers, apprenticeship, knowledge, traffic, glossary, and one in which the members themselves, who intend not to know one another, recognise one another immediately by natural or conventional, involuntary or deliberate signs which indicate one of his congeners to the beggar in the street, in the great nobleman whose carriage door he is shutting, to the father in the suitor for his daughter’s hand, to him who has sought healing, absolution, defence, in the doctor, the priest, the barrister to whom he has had recourse; all of them obliged to protect their own secret but having their part in a secret shared with the others, which the 
rest of humanity does not suspect and which means that to them the most wildly improbable tales of adventure seem true, for in this romantic, anachronistic life the ambassador is a bosom friend of the felon, the prince, with a certain independence of action with which his aristocratic breeding has furnished him, and which the trembling little cit would lack, on leaving the duchess’s party goes off to confer in private with the hooligan; a reprobate part of the human whole, but an important part, suspected where it does not exist, flaunting itself, insolent and unpunished, where its existence is never guessed; numbering its adherents everywhere, among the people, in the army, in the church, in the prison, on the throne; living, in short, at least to a great extent, in a playful and perilous intimacy with the men of the other race, provoking them, playing with them by speaking of its vice as of something alien to it; a game that is rendered easy by the blindness or duplicity of the others, a game that may be kept up for years until the day of the scandal, on which these lion-tamers are devoured; until then, obliged to make a secret of their lives, to turn away their eyes from the things on which they would naturally fasten them, to fasten them upon those from which they would naturally turn away, to change the gender of many of the words in their vocabulary, a social constraint, slight in comparison with the inward constraint which their vice, or what is improperly so called, imposes upon them with regard not so much now to others as to themselves, and in such a way that to themselves it does not appear a vice.
#2: At 599 words, a passage from the opening section of Swann’s Way:
But I had seen first one and then another of the rooms in which I had slept during my life, and in the end I would revisit them all in the long course of my waking dream: rooms in winter, where on going to bed I would at once bury my head in a nest, built up out of the most diverse materials, the corner of my pillow, the top of my blankets, a piece of a shawl, the edge of my bed, and a copy of an evening paper, all of which things I would contrive, with the infinite patience of birds building their nests, to cement into one whole; rooms where, in a keen frost, I would feel the satisfaction of being shut in from the outer world (like the sea-swallow which builds at the end of a dark tunnel and is kept warm by the surrounding earth), and where, the fire keeping in all night, I would sleep wrapped up, as it were, in a great cloak of snug and savoury air, shot with the glow of the logs which would break out again in flame: in a sort of alcove without walls, a cave of warmth dug out of the heart of the room itself, a zone of heat whose boundaries were constantly shifting and altering in temperature as gusts of air ran across them to strike freshly upon my face, from the corners of the room, or from parts near the window or far from the fireplace which had therefore remained cold — or rooms in summer, where I would delight to feel myself a part of the warm evening, where the moonlight striking upon the half-opened shutters would throw down to the foot of my bed its enchanted ladder; where I would fall asleep, as it might be in the open air, like a titmouse which the breeze keeps poised in the focus of a sunbeam — or sometimes the Louis XVI room, so cheerful that I could never feel really unhappy, even on my first night in it: that room where the slender columns which lightly supported its ceiling would part, ever so gracefully, to indicate where the bed was and to keep it separate; sometimes again that little room with the high ceiling, hollowed in the form of a pyramid 
out of two separate storeys, and partly walled with mahogany, in which from the first moment my mind was drugged by the unfamiliar scent of flowering grasses, convinced of the hostility of the violet curtains and of the insolent indifference of a clock that chattered on at the top of its voice as though I were not there; while a strange and pitiless mirror with square feet, which stood across one corner of the room, cleared for itself a site I had not looked to find tenanted in the quiet surroundings of my normal field of vision: that room in which my mind, forcing itself for hours on end to leave its moorings, to elongate itself upwards so as to take on the exact shape of the room, and to reach to the summit of that monstrous funnel, had passed so many anxious nights while my body lay stretched out in bed, my eyes staring upwards, my ears straining, my nostrils sniffing uneasily, and my heart beating; until custom had changed the colour of the curtains, made the clock keep quiet, brought an expression of pity to the cruel, slanting face of the glass, disguised or even completely dispelled the scent of flowering grasses, and distinctly reduced the apparent loftiness of the ceiling.
#3: At 447 words – something in either the Fugitive or Time Regained (too lazy to look it up):
A sofa that had risen up from dreamland between a pair of new and thoroughly substantial armchairs, smaller chairs upholstered in pink silk, the cloth surface of a card-table raised to the dignity of a person since, like a person, it had a past, a memory, retaining in the chill and gloom of Quai Conti the tan of its roasting by the sun through the windows of Rue Montalivet (where it could tell the time of day as accurately as Mme. Verdurin herself) and through the glass doors at la Raspelière, where they had taken it and where it used to gaze out all day long over the flower-beds of the garden at the valley far below, until it was time for Cottard and the musician to sit down to their game; a posy of violets and pansies in pastel, the gift of a painter friend, now dead, the sole fragment that survived of a life that had vanished without leaving any trace, summarising a great talent and a long friendship, recalling his keen, gentle eyes, his shapely hand, plump and melancholy, while he was at work on it; the incoherent, charming disorder of the offerings of the faithful, which have followed the lady of the house on all her travels and have come in time to assume the fixity of a trait of character, of a line of destiny; a profusion of cut flowers, of chocolate-boxes which here as in the country systematised their growth in an identical mode of blossoming; the curious interpolation of those singular and superfluous objects which still appear to have been just taken from the box in which they were offered and remain for ever what they were at first, New Year’s Day presents; all those things, in short, which one could not have isolated from the rest, but which for Brichot, an old frequenter of the Verdurin parties, had that patina, that velvety bloom of things to which, giving them a sort of profundity, an astral body has been added; all these things scattered before him, sounded in his ear like so many resonant keys which awakened cherished likenesses in his heart, 
confused reminiscences which, here in this drawing-room of the present day that was littered with them, cut out, defined, as on a fine day a shaft of sunlight cuts a section in the atmosphere, the furniture and carpets, and pursuing it from a cushion to a flower-stand, from a footstool to a lingering scent, from the lighting arrangements to the colour scheme, carved, evoked, spiritualised, called to life a form which might be called the ideal aspect, immanent in each of their successive homes, of the Verdurin drawing-room.
#4: At 426 words, a reflection on a church from the narrator’s youth:
All these things and, still more than these, the treasures which had come to the church from personages who to me were almost legendary figures (such as the golden cross wrought, it was said, by Saint Eloi and presented by Dagobert, and the tomb of the sons of Louis the Germanic in porphyry and enamelled copper), because of which I used to go forward into the church when we were making our way to our chairs as into a fairy-haunted valley, where the rustic sees with amazement on a rock, a tree, a marsh, the tangible proofs of the little people’s supernatural passage — all these things made of the church for me something entirely different from the rest of the town; a building which occupied, so to speak, four dimensions of space — the name of the fourth being Time — which had sailed the centuries with that old nave, where bay after bay, chapel after chapel, seemed to stretch across and hold down and conquer not merely a few yards of soil, but each successive epoch from which the whole building had emerged triumphant, hiding the rugged barbarities of the eleventh century in the thickness of its walls, through which nothing could be seen of the heavy arches, long stopped and blinded with coarse blocks of ashlar, except where, near the porch, a deep groove was furrowed into one wall by the tower-stair; and even there the barbarity was veiled by the graceful gothic arcade which pressed coquettishly upon it, like a row of grown-up sisters who, to hide him from the eyes of strangers, arrange themselves smilingly in front of a countrified, unmannerly and ill-dressed younger brother; rearing into the sky above the Square a tower which had looked down upon Saint Louis, and seemed to behold him still; and thrusting down with its crypt into the blackness of a Merovingian night, through which, guiding us with groping finger-tips beneath the shadowy vault, ribbed strongly as an immense bat’s wing of stone, Théodore or his sister would light up for us with a candle the tomb of 
Sigebert’s little daughter, in which a deep hole, like the bed of a fossil, had been bored, or so it was said, “by a crystal lamp which, on the night when the Frankish princess was murdered, had left, of its own accord, the golden chains by which it was suspended where the apse is to-day and with neither the crystal broken nor the light extinguished had buried itself in the stone, through which it had gently forced its way.”
#5: At 398 words, a neat little passage about Gilberte early on in Swann’s Way, setting the stage for the final scenes a couple of thousand pages later:
The name Gilberte passed close by me, evoking all the more forcibly her whom it labelled in that it did not merely refer to her, as one speaks of a man in his absence, but was directly addressed to her; it passed thus close by me, in action, so to speak, with a force that increased with the curve of its trajectory and as it drew near to its target; — carrying in its wake, I could feel, the knowledge, the impression of her to whom it was addressed that belonged not to me but to the friend who called to her, everything that, while she uttered the words, she more or less vividly reviewed, possessed in her memory, of their daily intimacy, of the visits that they paid to each other, of that unknown existence which was all the more inaccessible, all the more painful to me from being, conversely, so familiar, so tractable to this happy girl who let her message brush past me without my being able to penetrate its surface, who flung it on the air with a light-hearted cry: letting float in the atmosphere the delicious attar which that message had distilled, by touching them with precision, from certain invisible points in Mlle. 
Swann’s life, from the evening to come, as it would be, after dinner, at her home, — forming, on its celestial passage through the midst of the children and their nursemaids, a little cloud, exquisitely coloured, like the cloud that, curling over one of Poussin’s gardens, reflects minutely, like a cloud in the opera, teeming with chariots and horses, some apparition of the life of the gods; casting, finally, on that ragged grass, at the spot on which she stood (at once a scrap of withered lawn and a moment in the afternoon of the fair player, who continued to beat up and catch her shuttlecock until a governess, with a blue feather in her hat, had called her away) a marvellous little band of light, of the colour of heliotrope, spread over the lawn like a carpet on which I could not tire of treading to and fro with lingering feet, nostalgic and profane, while Françoise shouted: “Come on, button up your coat, look, and let’s get away!” and I remarked for the first time how common her speech was, and that she had, alas, no blue feather in her hat.
Data scientists often sabotage their own work by doing a crappy job of presenting it. Don’t fall into this trap: think carefully about what you want to say, and build PowerPoint decks (*) that say it well. Data scientists often present their projects the same way that they present in a seminar or at a half-empty session at an INFORMS conference: LaTeX decks with a white background with twice as many slides as minutes to speak. Your words will fall on deaf ears if you present this way. Polished presentations need not be superficial! Tuck in your shirt and create a nice looking deck. Here’s how.
Focus on three key messages that you want your audience to understand. Orient your presentation around those. Who is your audience? Are they peers? Decision makers? New team members? Clients? Three points well suited for one audience may not be relevant for another. This often means re-working a perfectly good deck, or leaving fascinating conclusions unsaid. Do your audience a favor and tell them what they need to hear, not what they want to hear, or what you want to say. If you haven’t touched on your three main points within 5 minutes (regardless of the length of the presentation), you’ve done something wrong. I screw that up a lot.
Analytics practitioners are often in the happy situation of having too much to say. We tend not to talk about fluff, but we take it too far in the other direction. 50 slides for a 20 minute talk is too many. Depending on the topic, five, three, or even one may be enough. It’s a judgment call, but err on the side of leaving things out. You can always stash extra slides in an appendix in case the conversation leads that way, or you miraculously end up with time to spare. Nobody will measure the value of your deck by the number of slides, but by the amount of useful, relevant material.
Your presentation should tell a story. It should have a beginning, a middle, and an end. This story may be told through your verbal presentation rather than in the deck content itself. Storytelling is particularly important because your audience will not pay full attention to what you are saying, nor will they read your deck carefully. It’s not a TED talk. They will stop paying attention 5 minutes in, and rally at the end. While your presentation has been the focus of your day (or perhaps week, month, …), that is usually not true of your audience. They’re thinking about what they want on their pizza later.
Be clear, and tell the truth. Don’t sugar coat or hide from your findings. Data scientists are careful people because we don’t want to overreach, but if you add ten qualifiers to every sentence then nobody will know what the hell you are talking about, and won’t care. Call out things that are important. You can say, “This is important.” You can say, “We’re not sure about this.” You can say, “This is not what we expected.” You can say, “This is obvious.” If you can’t admit what you don’t know, then people will suspect that you don’t know what you actually do.
You will not always be around to explain your deck. Decks are passed around and read by people you have never met, if you are lucky. Their half-life can be months long. Slide titles should guide readers along without supervision. Callouts can refer readers to other resources. Stick extra slides in the back for definitions and additional supporting material if needed.
The written tone of your deck should fit the content and your audience. In particular, these considerations should govern the amount of technical detail in your deck. If the math is what’s at issue, then your deck should have equations in it. If not, then leave them out. They are often the mechanics rather than the meaning. Try to make the deck self-contained without being burdensome: include variable definitions and define acronyms the first time you use them. Avoid Greek letters unless absolutely necessary. In general, LaTeX should not be used in a PowerPoint deck. Use PowerPoint’s “math mode” (control-equal in recent versions) so that the content is editable and can be changed to the appropriate size and resolution. If you need LaTeX’s more elaborate equation formatting, consider whether you are loading in too much detail for your audience, or whether a deck is the right delivery vehicle.
Don’t simply read from your slides. There is no point in you being there if that’s all you’re doing. Your deck should be written (and you should know the material well enough) so that the slide content serves as cues for you to say what you need to say in a natural, succinct way. Practice, with a friend or by yourself. Record yourself and watch the recording if you have never done so. Be yourself, unless you are by nature inappropriate. In that case, pretend to be somebody respectable. Most people are not funny, at least to the majority of a large group. Consider leaving jokes out of the PowerPoint deck entirely, and if you must use humor, simply weave it into your verbal presentation. It’s best if the humor is natural and not pre-canned. Be culturally aware. Avoid pop culture references and American English idioms if you have a global audience (you usually do). It’s hard to avoid this completely, but if a few slip out then that ends up being your personality shining through rather than confusion and awkwardness.
Be obsessive about the visual elements in your presentation. Pretend you are Steve Jobs. If your organization has a standard PowerPoint template, use it. You are speaking on behalf of your organization and part of your job is to serve as an ambassador. Save the iconoclasm for Twitter. If your organization has style or voice guidelines, use them as well, so long as you are able to do so naturally. No organization wants you to come off as a phony.
In most cases, the following principles apply:
- Don’t overdo fonts. Three sizes and two typefaces should suffice most of the time. If one of them is Comic Sans, quit while you’re behind.
- The use of color is a good thing as long as it isn’t overdone. In this sense it is similar to the use of italics and boldface. If you use it too much, or in long stretches, it begins to become annoying. Consider the use of common user interface metaphors when using color: blue is cold, red is hot, and so on.
- Using icons instead of words can be effective.
- When referring to your company’s logos, brand names, slogans, etc., make sure you are using the most up-to-date, correctly spelled versions possible, derived from original sources. You’d be surprised how often this is botched. Be even more careful with client, partner, or competitor logos and names. Get permission first.
- Less is more. Delete unnecessary words such as adjectives and technical jargon.
- Avoid bullet point slides. Consider tables, charts, or visual evidence to make the same point.
- Be obsessive about alignment and positioning. If you have tables of similar size on consecutive slides, they should be in the exact same spot. Make sure centered elements actually are centered, titles are in the same spot, and the same font size is used for similar elements throughout. Use the “Size and Position” settings in PowerPoint.
- Tables should have appropriate cell margins, and align data correctly. For the love of Tufte, don’t left align numeric data. Use a consistent and appropriate number of decimal places.
- Follow an authority like Tufte for charts (but avoid his Twitter feed). Label the axes, and always provide units. Use appropriate scales. Be information rich.
- Be careful with stock images and clip art. It often looks dumb, and it dates very quickly. Make sure stock images and clip art illustrate an actual point you want to make rather than fill space.
- Forget animation. It’s usually not worth it, and it is often a distraction.
- Last thing: try to find as many opportunities as possible to present a deck, and modify the deck after every presentation.
All of this advice is subjective and you may come across occasions where doing the direct opposite of what I suggest is the right thing to do. That’s okay. The key is to be as intentional about the way we present our work as we are about doing it.
(*) I say “decks” and not “whitepapers” or “documentation” because PowerPoint decks are the primary information currency for many organizations. I say “PowerPoint” rather than <your favorite presentation software> because the reality in 2013 remains that PowerPoint is the standard for most organizations. You can argue about whether these things should be, or whether they will change, but they are realities just the same. Don’t let your opinions about such matters get in the way of delivering your message.
In an effort to broaden my blog readership, today’s topic is generating random unimodular matrices with a column of ones. Here’s the tweet from a fellow Hawkeye that got me thinking about this:
Possible to generate “random” totally unimodular matrices with at least one column of all ones? My simplex-pivoting students will thank you!
— Sam Burer (@sburer) September 11, 2013
A unimodular matrix is a square matrix whose determinant is +/-1, where det(M) denotes the determinant of a matrix M. Two important facts are that det(A)det(B)=det(AB), and that the determinant of a triangular matrix is the product of its diagonal entries.
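Both facts are easy to check numerically. The sketch below uses NumPy on small random matrices; the sizes, seed, and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
a = rng.integers(-5, 6, size=(n, n)).astype(float)
b = rng.integers(-5, 6, size=(n, n)).astype(float)

# Fact 1: det(A) * det(B) equals det(AB), up to floating-point rounding.
assert np.isclose(np.linalg.det(a) * np.linalg.det(b), np.linalg.det(a @ b))

# Fact 2: the determinant of a triangular matrix is the product of its diagonal.
t = np.triu(a)
assert np.isclose(np.linalg.det(t), np.prod(np.diag(t)))

# Consequence: triangular factors with unit diagonals multiply to a matrix
# whose determinant is 1.
lower = np.tril(a, k=-1) + np.eye(n)
upper = np.triu(b, k=1) + np.eye(n)
det_product = np.linalg.det(lower @ upper)
print(round(det_product, 6))
```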
So randomly generating a unimodular matrix is easy:
- Generate a lower triangular matrix L and upper triangular matrix U with 1s on the diagonal and random entries elsewhere. By the two facts above, det(L)=det(U)=1, so the product M=LU has determinant 1.
- The problem is that the product M=LU does not have a column of ones. But we can fix this. Let’s assume that we want the last column of M to contain all 1s. Then the inner product of row i of L and the last column of U should be 1. We can fiddle with the last element of U that enters this inner product (the entry in row i of U’s last column) so that this works out: changing this element will not change the determinant, because it is not on the diagonal. There is a special case for the lower right element because we do not want to change the diagonals, so in that case we should adjust L instead.
Here is the code in Python, assuming the NumPy package has been loaded. I do not use “dot” to compute val so I don’t have to special case i=0. Too long for a tweet, and the matrices that are produced are a little funny looking, and if we get unlucky we may divide by zero, but good enough for a blog post.
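    import numpy as np

    def rand_unimod(n):
        # Random integer-valued lower and upper triangular factors.
        l = np.tril(np.random.randint(-10, 10, size=(n, n))).astype('float')
        u = np.triu(np.random.randint(-10, 10, size=(n, n))).astype('float')
        for i in range(0, n):
            # Unit diagonals keep det(l) = det(u) = 1.
            l[i, i] = u[i, i] = 1.0
            if i < n - 1:
                # Pick u[i, n-1] so that row i of l times the last column of u is 1.
                val = sum([l[i, j] * u[j, n-1] for j in range(0, i)])
                u[i, n-1] = (1 - val) / l[i, i]
            else:
                # Last row: u[n-1, n-1] is on the diagonal, so adjust l[n-1, 0] instead.
                val = sum([l[i, j] * u[j, n-1] for j in range(1, i+1)])
                l[n-1, 0] = (1 - val) / u[0, n-1]
        return np.dot(l, u), l, u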
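To see the adjustment at work on something small, here is a hand-worked 3x3 sketch in plain Python. The starting entries are arbitrary values I picked for illustration, and the `matmul` helper is my own, not part of NumPy. The diagonals are already 1, so the divisions in the code above drop out and everything stays in exact integers.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


n = 3
last = n - 1

# Unit-diagonal triangular factors; the 0s marked below are placeholders
# that the adjustment step overwrites.
L = [[1, 0, 0],
     [2, 1, 0],
     [0, 3, 1]]   # L[2][0] is a placeholder
U = [[1, 4, 0],   # U[0][2] is a placeholder
     [0, 1, 0],   # U[1][2] is a placeholder
     [0, 0, 1]]

# Rows 0..n-2: choose U[i][last] so that (row i of L) . (last column of U) = 1.
for i in range(n - 1):
    partial = sum(L[i][j] * U[j][last] for j in range(i))
    U[i][last] = 1 - partial          # L[i][i] is 1, so no division is needed

# Last row: U[last][last] is on the diagonal, so adjust L[last][0] instead.
partial = sum(L[last][j] * U[j][last] for j in range(1, n))
L[last][0] = 1 - partial              # U[0][last] was set to 1 above

M = matmul(L, U)
last_column = [row[last] for row in M]
```

The adjusted entries come out as U[0][2] = 1, U[1][2] = -1 and L[2][0] = 3, and the last column of `M` is all ones. Since L and U still have 1s on their diagonals, the determinant of `M` is still 1.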
My definition(*) of analytics is:
Analytics is the practice of building models on computers to learn about the world.
Models, computers, learning are all necessary to the definition. No models? Not analytics, otherwise browsing the web would qualify. No computers? Not analytics. You’re doing math or stats or something. Not learning about the world? Not analytics. You’re playing Sim City or Grand Theft Auto or whatever. I find that many data scientists (**) don’t spend much time thinking about what analytics actually is, perhaps because analytics is such a relentlessly practical discipline, or maybe because it is so nebulous. I bet more people would be able to define machine learning, or Big Data, or optimization than analytics. Everyone’s doing analytics but nobody can say what it is. That’s worth considering in its own right, but it’s not the subject of today’s post.
I want to talk about models because that’s the piece of the puzzle in my definition that isn’t obvious. A model is simply a representation of a thing that actually exists in the world. It’s a statement: “this thing is like that one”. This is every bit as true for a model railroad as it is for an analytics model. The way you build a model is to think about the key properties of the thing you want to model (“it rides on a track, has wheels, and a funnel on top”), and then use the tools you have at your disposal to reproduce those attributes in the model.
Computer models often differ from their targets in a key respect: computer models don’t actually physically exist, whereas the things they represent (shoppers in a grocery store, oil deposits underneath the ocean floor, cells reproducing and mutating) do. It is this difference that makes computer models so useful. Since they don’t actually exist, computer models are comparatively incredibly cheap to make, change, and rebuild. Experiments can be run on computer models without fear of explosions, crashes, reactions, or lawsuits. These days we’re witnessing exponential growth in the number of experiments that are conducted each day, and it’s entirely due to the use of computer models. I’ll bet there have been more experiments conducted in the first thirteen years of the 21st century than in all of prior human existence combined.
Models purposefully represent only certain aspects of the things they represent. After all, if a model represented all of them you’d have a clone and not a model. Of the nearly limitless properties we can ascribe to any object, a model of that object ignores most, modifies others, and mimics only a select few. A toy airplane has no engine, no seats, no electrical system, no pilot, is built from different materials, at much smaller scale. It’s entirely recognizable as a toy airplane because it has wings and is fun to move around. It’s the same deal with computer models because most of the learning can be achieved through modeling a few key attributes. It’s so easy to forget this that when I am asked for advice about model building I often suggest going back to the start and building the simplest possible model that captures the phenomenon of interest. Add in the complexity later – oftentimes you don’t need it (***).
Finding the right dimensions of similarity is important. It’s not always obvious which ones are the important ones to keep. Models, like analogies, are ways of seeking truth. As with analogies, models can be “faux amis”; they can enlighten or obscure. It’s also true that models and analogies are insufficient. They can be tortured and stretched too thin. We’ve got to be careful that the conclusions that we draw based on analytics apply to the things we are modeling, and not just the models themselves. There’s the old joke about spherical cows that applies. I hope it’s not too obvious if I say that there will always be room for experimental science: watching what actually happens in the actual universe and thinking about what’s been observed.
A recent Hofstadter book (which I have not read…) discusses the mind’s necessity for analogy, and I think this is what drives us to build models so relentlessly. Model building soothes great clusters of neurons inside our skulls. Even if we didn’t enjoy building models so much, they’d be necessary because they help us to understand the incomprehensible so much more quickly than we would if we relied solely on empirical science.
(*) INFORMS defines analytics as follows: Analytics is the scientific process of transforming data into insight for making better decisions. That’s a great definition.
(**) Most data scientists don’t call themselves data scientists.
(***) I am not implying that “big data” is useless. I am speaking here of the complexity of input data rather than quantity. That said, in many, many cases I don’t think you need “big data”. So there.
You can find commonly used stats for NFL players in CSV format for the 2010, 2011, and 2012 NFL seasons at this location.
For each season there are seven files:
- QB: quarterback data.
- RB: running backs.
- WR: wide receivers.
- TE: tight ends.
- K: kickers. Attempts and made field goals are broken out by distance in separate columns.
- Def: defensive stats by team.
- ST: special teams stats by team.
Updated 8/21/2013: Added the 2010 season.