Archive
Enviable Problems: MIP and Warren Buffett
There’s been some recent talk about just how awesome Warren Buffett actually is. Summary: “What have you done for me lately?”
Warren Buffett has an enviable problem: how to get superior return on a shit-ton of money. For someone with Buffett’s skills, there are many high-return opportunities out there. The “problem” is that most of those opportunities are not of a sufficiently large scale to make an impact on Berkshire’s bottom line. A 100% return on a $1M investment would mean a lot to me, but not so much for Warren. In other words, the opportunities in the upper right corner are the ones he’s looking for:
Erwin recently posted to a link to a paper by Bob Bixby from Gurobi about the history of mixed-integer programming (MIP) solvers. The MIP community has more in common with Warren than they realize! Bixby has a nice table that summarizes improvements in MIP solver performance from 1998 – 2004:
| Speedup | |
| Algorithmic improvement (machine independent) | 3300× |
| Machine improvement | 1600× |
| Total improvement | 5,280,000× |
That’s a pretty good ROI. The challenge is how to continue to find these kinds of speedups. The Gurobi team thinks about this a lot. Ed drew me this picture once:
There are lots of MIP improvements that provide dramatic speedups. The problem is that they only apply in certain specialized circumstances (e.g. a certain kind of network model). In fact, since many optimization problems can be written as MIPs, many, many OR papers can be viewed as clever ways of speeding up certain weird subclasses of MIP. (My thesis is an example, even though I never really thought of it that way.) On the other hand, there are also improvements that apply to a large percentage of MIP models but yield a relatively modest benefit (for example certain presolve rules). The game is to find improvements that provide large speedups for large classes of MIP models. Again, the upper right corner is where you want to be. Bixby’s paper supplies some examples: dual simplex, presolve, and so on.
It’s hard to climb at the same rate as the air gets thinner. Kudos to those who have the stamina to keep going.
Defining “Better”
I hereby award the “Tweet of INFORMS 2012” prize to Marc-Andre Carle:
#orms is definitely the science of better: each software presented at#INFORMS2012 is better than its competitors![]()
(Thanks to Michael Trick for the tip!)
Okay, so what is “better” anyway? I get the sense that for many operations research insiders, “better” is another word for “faster”, but that is wrong, wrong, wrong. “Better” means different things to different people. For example:
- More accurate.
- Less prone to failure.
- Easier to use by a broader set of people.
- Faster to develop a solution.
- Easier to integrate with other systems.
- Better supported.
- Cheaper.
- Easier to customize and modify.
- Easier to share.
- Uses less memory.
- Unencumbered by intellectual property concerns.
as well as…
- Faster.
“Better” is a multiobjective problem: most of us actually want many of the things on the list. How we weight the various factors depends in part on what the software is being used for:
- For academic research,
- For rapid prototyping,
- To create a model for a consulting engagement,
- For a production system.
Some of these factors can be measured (and are, thanks to the tireless efforts of Hans Mittelmann and others) while others are more subjective. Even if we are focusing exclusively on “faster”, the picture remains complicated. In a production system what matters is how quickly users get the results they want. So we care about not only the time that the solver takes, but also:
- How long it takes to retrieve the data and assemble the model to be solved,
- The predictability of response time over different user requests,
- How the solver performs in the face of many simultaneous requests.
Solver runtime differences of 5-10% don’t matter that much, generally speaking. I like to categorize how long an operation takes in real-world terms (I stole this idea from somebody else, but I don’t remember who):
- instantaneous (subsecond)
- the time it takes to check espn.com and/or twitter (5-10 seconds)
- get coffee (a few minutes)
- have lunch (30 minutes or so)
- overnight
- a weekend
- ETERNITY
It’s usually not worth the effort making an engineering decision based solely on performance if you can’t move to a different bucket. Otherwise you probably have better things to do.
Operations Research Tom Swifties
I apologize in advance, and you’re welcome. Add your Tom Swifties in the comments!
- “Queuing theory says that the registration line will move faster if we all stand here,” Tom said optimistically.
- “Ah, but what did you expect? It was a grad student talking about Simplex at a sponsored session!” Tom pivoted.
- “Oh yeah, Gomory would positively rave over your choices,” Tom said cuttingly.
- “I don’t know the best way to the airport, but I can tell you what I did last time,” Tom said warmly.
- “I assure you darling, my derivatives are continuous in an extremely large neighborhood near the solution,” Tom said smoothly.
- “My model says that if I post this on a Tuesday, I’ll get more 20% blog hits,” Tom said analytically.
- “There may be some error in the data, but don’t worry, we’ve accounted for that,” Tom said robustly.
- “Just choose the largest slice and be done with it,” Tom said greedily.
- “Break the constraint if you want, but it’ll cost you,” Tom said softly.
- “I use Avogadro’s number in all of my big-M constraints,” Tom said unstably.
- “Things have never been better,” Tom said monotonically nondecreasingly.
Detecting the Sources of Model Infeasibility using Gurobi
Today I come to sing the praises of Irreducible Infeasible Sets (IIS). An IIS is a minimal subset of the constraints of an optimization problem that are self-contradictory. In other words, if you remove any single constraint in an IIS, then the remaining constraints are feasible. Most modern linear and mixed-integer solvers have programming APIs to return IIS information for an infeasible model.
The reason IIS is so cool is that it helps diagnose why a model is not able to be solved. This is especially useful when the optimization problem to be solved is formed with the help of input from a user who may not know anything about operations research, or when the model is complicated. IIS is like the “spell check” of linear programming. Building optimization models without IIS is like debugging using “printf”.
Here is a simple example from the lp_solve documentation:
min x + y x >= 6 y >= 6 x + y <= 11
6 plus 6 is more than 11, people. In this case, all three constraints are part of the IIS.
Here is some C# code using the Gurobi API that obtains the IIS for a very simple model. First, here is the code that forms and solves the model. This is easy to figure this out by reading the Gurobi docs. The only twist is that I have expressed the first two constraints as lower bounds on the variables x and y.
GRBEnv env = new GRBEnv(); GRBModel model = new GRBModel(env); GRBVar x = model.AddVar(6, Double.MaxValue, 1, GRB.CONTINUOUS, "x"); GRBVar y = model.AddVar(6, Double.MaxValue, 1, GRB.CONTINUOUS, "y"); model.Update(); GRBConstr c3 = model.AddConstr(x + y, GRB.LESS_EQUAL, 11, "c3"); model.Optimize(); model.ComputeIIS();
To find the IIS you need to call another method called ComputeIIS. Then you can query the constraints and variables to see if they are part of the IIS. This is accomplished using the Get method. I simply print out the constraint and variable names that are in the IIS.
// Print the names of all of the constraints in the IIS set. foreach (var c in model.GetConstrs()) { if (c.Get(GRB.IntAttr.IISConstr) > 0) { Console.WriteLine(c.Get(GRB.StringAttr.ConstrName)); } } // Print the names of all of the variables in the IIS set. foreach (var v in model.GetVars()) { if (v.Get(GRB.IntAttr.IISLB) > 0 || v.Get(GRB.IntAttr.IISUB) > 0) { Console.WriteLine(v.Get(GRB.StringAttr.VarName)); } }
If you want to write the IIS information to a text file for review, simply use the write method with the “ilp” extension on the file name:
model.Write("model.ilp");
One last thing: using LINQ (a feature of the C# language), you can simplify how we query for the IIS constraints and variables. For example, I can get all of the constraint names in one statement as follows:
var allConstraintNames =
from c in model.GetConstrs()
where c.Get(GRB.IntAttr.IISConstr) > 0
select c.Get(GRB.StringAttr.ConstrName);
LINQ is very handy for creating models and querying results. I use it often!
The diet problem revisited: McDonald’s nutritional information in 2012
The so-called “diet problem” is familiar to many students of optimization. The idea is to select the cheapest set of items from a menu that meets certain dietary restrictions, such as the amount of calories, carbohydrates, protein, and so on. The AMPL book features a version of this problem based on the McDonald’s menu from some time back. The original data (which I am pretty sure dates back to a 1993 paper from Robert Bosch) is right here.
That got me thinking – has this information changed during the last 20 years? Yes, it has – and it appears for the worse. I retrieved updated information from the McDonald’s website and compared the items from the “diet1.dat” example.
In the table below I have listed the information from diet1.dat side-by-side with the 2012 information, for the menu items that appear in both. Places where the 2012 nutrition is worse are highlighted in red; green means better. I left carbs and protein alone since people have different views on what is “better”. As you can see, nearly all the items have more calories. To be fair, it’s not clear whether portion sizes are bigger – but my bet is that they are not (call me a cynic).
| Calories | Carbohydrates | Protein | Vitamin A | Vitamin C | Calcium | Iron | ||||||||
| Name | 1993 | 2012 | 1993 | 2012 | 1993 | 2012 | 1993 | 2012 | 1993 | 2012 | 1993 | 2012 | 1993 | 2012 |
| Quarter Pounder® with Cheese | 510 | 520 | 34 | 42 | 28 | 30 | 15 | 10 | 6 | 2 | 30 | 30 | 20 | 25 |
| Big Mac® | 500 | 550 | 42 | 46 | 25 | 25 | 6 | 6 | 2 | 2 | 25 | 25 | 20 | 25 |
| Filet-O-Fish® | 370 | 380 | 38 | 39 | 14 | 16 | 2 | 2 | 0 | 0 | 15 | 15 | 10 | 15 |
| McGrilled Chicken | 400 | 350 | 42 | 42 | 31 | 28 | 8 | 4 | 15 | 8 | 15 | 15 | 8 | 20 |
| Small French Fries | 220 | 230 | 26 | 29 | 3 | 3 | 0 | 0 | 15 | 8 | 0 | 2 | 2 | 4 |
| Sausage McMuffin® | 345 | 370 | 27 | 29 | 15 | 14 | 4 | 6 | 0 | 2 | 20 | 25 | 15 | 15 |
| 1% Lowfat Milk | 110 | 100 | 12 | 12 | 9 | 8 | 10 | 10 | 4 | 4 | 30 | 30 | 0 | 0 |
| Orange Juice | 80 | 95 | 20 | 19.5 | 1 | 1.5 | 2 | 0 | 120 | 90 | 2 | 2 | 2 | 0 |
I’ll share the complete data in a future post.
(Two notes: for “McGrilled Chicken” I used the 2012 information for “Premium Grilled Chicken Classic Sandwich”, and for Orange Juice the 1993 numbers were clearly for an 8 ounce serving so I divided 2012 numbers by two.)
Yes, Virginia, there is a difference between optimization and prescriptive analytics
[This post is a nod to this famous editorial. I don't want you to think I've gone crazy. The subject of this post was inspired by the Lustig/Dietrich/Johnson/Dziekan article "The Analytics Journey" and the last bit (regarding skills) by a conversation with my boss.]
Eight-year old Virginia O’Dual writes:
Dear Blogger: I am eight years old. Some of my little friends tell me that prescriptive analytics is just a fancy way to say optimization. My advisor tells me that if it is on your blog it must be so. Tell me, is it so?
Virginia, your little friends are wrong. Analytics is not the same as optimization. I realize it may sound strange to be so blunt – after all, this is merely semantics, right? The internet is filled with people ceaselessly arguing about terminology. Millions (billions?) are afflicted with the disease of thinking that knowing the name of a thing is the same as knowing it. Virginia, you may be asking why this is any different. Am I not falling into the same trap by insisting there’s a difference? On top of that, why pick a fight about “analytics”? The word “analytics” is on everyone’s lips these days: on twitter, in magazines and newspapers, during coffee breaks at conferences. More often than not, rushes of hot air follow. So why on earth am I telling you there’s a difference? Virginia, the reason is that there is a connotation to “prescriptive analytics” that is missing from “optimization”:
Optimization speaks to process and prescriptive analytics to purpose.
[Numerical] optimization is finding the best option from a set of alternatives for a mathematical model. In more staid company it is called “operations research”. In less refined circles some simply speak of “search problems”, but this detracts from the majesty and beauty of the techniques that are so well-known and well-loved by practitioners. The theories and techniques of optimization date back centuries, though they were not labeled as such until much more recently. Optimization usually implies a formal mathematical model and a clearly articulated method that provides a guarantee about the quality of the alternatives are discovered.
Virginia, you read this blog, so you already know all of this. You’re an optimizer. You’re proud. Proud of your heritage, proud of the sound reasoning behind your methods, and proud of the profligate applications of your craft in matters of war, peace, and profit. More or less. You may consider those who use highfalutin terms such as “prescriptive analytics” to be late to the party, particularly when they speak in the language of business rather than in the more precise language of mathematics. Skepticism abounds; in the past I have even called analytics “old wine in a new bottle“. The fact that the terminology comes with a sales pitch doesn’t help: a cluster of computers, consulting services, a web service. If you listen to the hucksters, everything under the sun seems to be an essential part of an analytics “solution”: a filesystem, a database, a charting library. Even worse, the aspirations the “analytics community” speak of seem so depressingly modest. A pivot table is not analytics, we optimizers snort. We can factorize matrices already, thank you very much. Yes, my data fits on my machine. Go away so I can continue to bear the world on my shoulders like Atlas (or Dantzig). We know what you are talking about, we say to them, and each other. You are talking about optimization. (I notice a similar relationship between statisticians and the term “predictive analytics”.)
But prescriptive analytics really is something new. I have come to understand that though optimization is the primary tool of prescriptive analytics, it would be very limiting to conflate the two. Prescriptive analytics refers to turning data into decisions – but not just any decision. Decisions that are to be made by people, answering questions posed by people. This is not always the case for optimization. Sometimes decisions produced by optimization are in service of another software process. Other times the decisions aren’t really decisions in the everyday sense. Root finding and regression are properly speaking optimization, even if we don’t always label them such. Other times, the answers produced by optimization are far removed from the questions that were originally posed. “Consider a spherical cow“, a professor once said. In many cases, such as scheduling, supply chain optimization, or routing there are so many transformations, simplifications, and layers of rewriting from prose to math to AMPL (or GAMS, or AIMMS) to C and back have been undertaken that it is hard to connect solution to problem. The plot is lost. I wrote my Masters thesis on semidefinite programming formulations for truss topology design – bridge building. During my defense I was asked by a committee member if I would drive over a bridge I had built using the results of my formulation. The correct answer was “hell, no” – because I was trying to solve an optimization problem and not a prescriptive analytics problem. (I wasn’t brave enough to say so at the time, of course.)
Prescriptive analytics implies a specific purpose – building a system that automatically produces decisions to choices that humans face. As a wise man once said: “begin with the decision in mind and see where it leads you“. In the practice of prescriptive analytics this may lead you down unfamiliar roads, but brainpower and software are always provisions for the journey. Software that lies at the center of data, models, algorithms, and computational platforms. Sometimes the job simply boils down to assembling data. Real time data, historical data, disjointed and incomplete data that must be joined, refined, summarized, synthesized and completed. Once this is done we are left with nothing but a measly linear program, or even more pathetically, a sort. There is no shame in this – there are no bonus points awarded for complicated formulations in the world of analytics.
Prescriptive analytics concerns an entire process, and this process typically involves assembling data, building models, evaluating them, and presenting the results. Optimization comes into the play for the middle two steps, but the first and last are every bit as important. In fact, many times the first and last steps – assembling the data and presenting results in a form that people understand – are the most difficult and time consuming ones. And you do need tools to help you to pull them off. The problem with the analytics crowd, and the reason why optimization people sometimes roll their eyes, is that analytics people often focus entirely on the edges and ignore the middle: formulating and solving a real model. In prescriptive analytics we give all steps their proper attention, because without them the goal of delivering actionable plans for success cannot be attained.
So, Virginia, this offers fantastic opportunities to those who choose to take up the challenge of mastering prescriptive analytics – young and old alike. A practitioner of prescriptive analytics needs to understand not only the formal methods of optimization, but much more besides. They need to understand how data is obtained, manipulated, and joined together. They need to understand the art of model building, which is experimental in nature and requires knowledge of the original problem domain. They need to be able to communicate the insights produced by their models in words and in pictures. Finally, they need to be able to understand the business objectives and motivations behind their projects, so that they can not only deliver what is being asked for, but anticipate (and answer) the questions that come next.
PhD thesis: Solving Large-Scale Quadratic Assignment Problems
I notice that people are expecting to find my thesis here, so here goes. It was written in 2000 and while it’s always anguishing to look at things you did a long time ago, I am proud of the result. The “nug30″ QAP was one of the most difficult optimization problems ever solved at the time, and it probably still ranks pretty high on the list 12 years later.
Abstract:
The quadratic assignment problem (QAP) is the problem of finding an assignment of an equal number of facilities to locations that minimizes the transportation cost between facilities. The large number of possible assignments along with the interdependence of the cost on the flows and distances causes QAP to be an extremely difficult problem to solve in practice.
Branch-and-bound algorithms are typically used to find exact solutions to QAP. A branch-and-bound algorithm solves a QAP by assigning some facilities to locations, and computing lower bounds on these smaller subproblems. The use of lower bounds allows the algorithm to eliminate from consideration partial assignments that cannot lead to a solution, and via this process of elimination efficiently determine an optimal assignment.
This dissertation proposes a new branch-and-bound implementation capable of solving previously unsolved quadratic assignment problems. The first ingredient is a new continuous relaxation of QAP based on convex quadratic programming, resulting in an efficiently computed, effective lower bounding procedure. The new bound is a key component in an improved branch-and-bound algorithm that uses information provided by the continuous relaxation to intelligently extend the search for optimal solutions.
Even though the performance of our branch-and-bound algorithm surpasses previous implementations on a range of standard test problems, large computational resources are required. The solution of QAPs as small as thirty facilities and locations requires years of sequential computation. Grid computing resources allow hundreds of workstations to work on the solution of a QAP at once. The MW grid computing tool was used to produce an efficient, fault-tolerant parallel branch-and-bound implementation. This implementation was used to solve the Nugent 30 QAP, unsolved since 1968, over a period of seven days using 1000 machines.
.Net coding guidelines for operations research
Writing code, whether in C++, AMPL, GAMS, SAS, .Net, or whatever*, is a part of operations research for most of us. Here are a few thoughts on writing good .Net code. This advice is aimed especially at optimizers with a research background, who don’t have lots of experience writing “production” code.
- Give things sensible names. In particular, start with Guidelines for Names on MSDN. This guideline is actually pretty important, because good naming and organization leads to fewer bugs and more maintainable code. My experience working with code by “optimization people” is that a lot of it is poorly organized. You’d never write a poorly organized paper (would you?) so why write disorganized code. The first step in good organization is to use good names. (Tip: In Visual Studio Use the F2 key (or right click and “Rename”) instead of find/replace.)
- Exception: math code can and should use good math notation. Code that implements an optimization model should have a backing whitepaper. The whitepaper should include the full mathematical formulation of the problem. It should be possible to go back and forth between the documentation and code without too much trouble. So if in your doc you have a variable x with subscripts p, g, and a, then it is totally okay for the code to have the same names. Don’t artificially name x something like “amountToProduce” just to satisfy some MSDN guideline. For this reason I tend to avoid Greek letters, hats, and bars in my whitepapers. Careful choice of notation is important.
- Get in the habit of writing small programs to test your assumptions. Are multidimensional arrays or jagged arrays faster in .Net? How much more expensive is a hashtable lookup than an array lookup? Is it better to use SOS2 constraints or dummy integer variables for piecewise linear constraints? How much memory does a solver use for a model? These things can be empirically tested.
- Write tests for your code. In my experience, any code that does not have tests is probably wrong. Start with very simple tests that test basic, but important things.
- Document your code. Help others understand what you are doing (and more importantly) why.
- Public APIs should be designed for the person using it, not the person writing it. For example, an input data structure for a production planning module might talk about capacities, facilities, and availability. Constraints, knots, and duals wouldn’t make much sense to a user, even though they make perfect sense to us.
- Consider refactoring any function over a page long. It is probably too confusing. You can highlight a section of code, right click, and do “Extract Function” in Visual Studio.
- If you don’t understand the performance characteristics of a data structure, avoid using it until you do. The .Net framework provides TONS of built in methods and data structures. This makes it very, very easy to write correct but horribly inefficient code. For example, finding an element in List<> takes linear time, but accessing an element by key from a Dictionary<> is nearly constant time. The Count() method on an IEnumerable<> is linear time, but .Length on an array is constant time. Most of the time, arrays are great for what we do. It’s okay to use more, just understand why.
- Package complicated data in data structures. A small caveat to the last point: optimization people tend to go crazy with multidimensional arrays when an array of simple data structure is usually more appropriate. This results in self-documenting code.
- Be careful with memory. Just because .Net has garbage collection doesn’t mean you can forget about how you manage memory. Don’t unnecessarily allocate arrays inside a loop if you don’t need to. If you are careless with memory, it will kill your performance.
- Don’t reinvent the wheel. You don’t need to write any sorting code – use Array.Sort. You don’t need to scan through strings for commas – use String.Split. You don’t need to write 3.14159 – use Math.PI.
The Design Guidelines for Developing Class Libraries page on MSDN is filled with lots of good suggestions.
* I think F# and other functional programming languages could actually be great for optimization. I guess that’s why Python is catching on.
Microsoft Solver Foundation 3.1 Released
On the Solver Foundation team blog we have just announced the availability of Solver Foundation 3.1. I am very proud of the team, and proud of the code that we shipped together.
For this release we tried to rally around the feedback that we have received from email, blogs, forums, and internal use. My hope is that this has resulted in a release that includes a bunch of small, subtle improvements that will make Solver Foundation more useful and more intuitive than ever.
Enjoy.
O.R. and social networking: A Solver Foundation MIP model to suggest Twitter followers
For this month’s INFORMS blog challenge I thought it would be fun to write a simple Solver Foundation – Twitter mashup. I’m not as happy with it as I could be, but maybe someone will find it interesting.
My goal is to write a web application that will suggest interesting twitter users to follow. Here’s what the finished product looks like:
As you can see, the application asks for:
- a keyword,
- a maximum number of twitter users to suggest,
- minimum and maximum number of expected tweets per day.
and produces a list of twitter users to follow. I will describe some of the key steps in building this application, but I won’t provide all of the source simply because it would be a lot of work for me to clean up the code! I will try to provide enough so that the motivated reader can finish it off. If you want to follow along, get Solver Foundation, open Visual Studio 2010, and create a new web site (say MsfTwitter). Drag the controls shown (several Labels, one CheckBox, three TextBoxes and a Button) on the screenshot onto your main form, then double click on the button (which I named "Suggest"). Then you’ll be sitting in the codebehind file for the click button handler:
protected void Suggest_Click(object sender, EventArgs e) { }
The end goal here is to fill this with some code that will find suggestions. In order to do this, we’re going to have to know something about twitter users. For now, let’s assume that we’re able to obtain the number of tweets per day and an "interest score" that indicates how interesting a twitter user is for a particular topic. Then we can define a User data structure which is the basis for our model. Add a new C# file into the App_Code directory and add this:
using System; using System.Collections.Generic; using System.Diagnostics; using System.Linq; using System.Web; [DebuggerDisplay("{Name}, score = {InterestScore}, tweets = {Tweets}")] public class User { public string Name { get; set; } public double InterestScore { get; set; } public double TweetsPerDay { get; set; } }
Our application logic is really divided into two parts:
- Finding interesting twitter users for a given topic. This gives us a list of Twitter users, including their interest score and tweets per day.
2. Selecting a "best" subset of these users given the preferences specified on the web page.
These are actually both great analytics problems. I am going to rely on third party services to solve the first problem, and I will use Solver Foundation for the second (since that is an optimization model). So for now, let’s assume that we’ve got a list of twitter users and we want to solve problem 2. (If you’re following along in Visual Studio, now is a good time to add a reference to Solver Foundation using “Add Reference”.)
Following my standard pattern of identifying inputs, outputs, goals, and constraints, our model is as follows:
- The input is a bunch of users. Each one has a score and expected number of tweets per day.
- The output is a boolean value for each user: to follow or not to follow?
- The goal is to maximize the total interest score over all selected users.
- One constraint is that the expected number of tweets is between user specified values min_tweets and max_tweets.
- Another constraint is that the number of users that we select is at most K.
The next step is to write a class that we can use from the web page. I will call the class SuggestModel and give it one method. The signature is pretty straightforward based on my description above.
using System; using System.Collections.Generic; using System.Diagnostics; using System.Linq; using Microsoft.SolverFoundation.Common; using Microsoft.SolverFoundation.Services; public class SuggestModel { public static IEnumerable<User> Solve(IEnumerable<User> data,
int minSuggestions, int maxSuggestions, int maxTweets) { } }
Creating the C# code for Solve is also straightforward given the description of the model. The Parameters for the model are the interest scores and tweets per day. Both are indexed over the set of users. We use the SetBinding method to associate the appropriate fields of the User objects with the parameters.
public static IEnumerable<User> Solve(IEnumerable<User> data, int minSuggestions, int maxSuggestions, int maxTweets) { SolverContext context = SolverContext.GetContext(); Model model = context.CreateModel(); Set users = new Set(Domain.Any, "users"); Parameter score = new Parameter(Domain.Real, "score", users); score.SetBinding(data, "InterestScore", "Name"); // bind to the InterestScore field. Parameter tweets = new Parameter(Domain.Real, "tweets", users); tweets.SetBinding(data, "TweetsPerDay", "Name"); // bind to the TweetsPerDay field. model.AddParameters(score, tweets); Decision choose = new Decision(Domain.IntegerRange(0, 1), "choose", users); model.AddDecision(choose); model.AddGoal("goal", GoalKind.Maximize, Model.Sum( Model.ForEach(users, user => choose[user] * score[user]))); model.AddConstraint("c_choose", minSuggestions <= Model.Sum(Model.ForEach(users, user => choose[user])) <= maxSuggestions); model.AddConstraint("c_tweet", Model.Sum(Model.ForEach(users, user => choose[user] * tweets [user])) <= maxTweets); context.Solve(); return data.Where(user => choose.GetDouble(user.Name) > 0.01); }
Notice the last line: using a LINQ expression we return a list of users where the value of the choose decision is nonzero. That’s what we wanted. The decision values are integer valued and all the constraints and goals are linear, so this is a mixed integer programming model. It’s essentially a knapsack problem – no big deal.
Now we can return to "problem 1": finding interesting twitter users for the topic entered into the text box. To solve this problem I will rely on the technique (first suggested by Polya?) of being lazy – I will use other people’s code.
Twitter provides a simple REST API that is the equivalent of the "Who to Follow" button on twitter.com. You can learn more about it (and try it out) here.
To actually call this “who to follow” twitter API from our program, we need three more things:
- A twitter REST API key (so we can make calls to twitter). So I registered my app here: https://dev.twitter.com/apps/new. You can too – it’s free.
- Code to make authenticated REST calls. The authentication system is called OAuth. Shannon Whitley wrote a nice OAuth package for .Net and posted it on his blog here. Go grab that and put it in your App_Code directory. I changed the namespace to “OAuth”.
- Code to convert the API results (returned in a format called JSON). There is a Json.Net ackage written by James Newton-King that makes this easy. Follow this link for the details. In the code below where I work with JObject/JToken/JArray, add a ”using Newtonsoft.Json.Linq” statement to the top of the source file.
Since the twitter API is authenticated, there is a dance that we have to do to obtain an access token given our API key. This involves redirecting to twitter, allowing our registered application to have access to twitter data, then redirecting back to our page. I am omitting that code because it’s not really the point of this post.
Once you go through all of that mess, the code is short and sweet. Put this in the same file as your click event handler (probably Default.aspx.cs):
private oAuthTwitter _oAuth; private JArray GetTwitterUsers() { string url = String.Format(http://api.twitter.com/1/users/search.json?q={0},
HttpUtility.UrlEncode(Keyword.Text)); string json = _oAuth.oAuthWebRequest(oAuthTwitter.Method.GET, url, String.Empty); return JArray.Parse(json); }
This gives me a bunch of interesting users but I don’t have a score or expected number of tweets. We can estimate the number of tweets per day by looking at two fields in the GetRelevantUsers result: statuses_count (the number of tweets for the user account) and created_at (when the account was created). Easy.
The last ingredient is the interest scores. A colleague referred me to Klout, which fits the bill nicely. The documentation for the API is here, but it is very simple: pass a comma-delimited list of user names in the query string and you get back the "klout scores". I will use those as our interest scores. Using the same raw materials as for the twitter REST API, the code looks like this:
private JToken GetKloutScores(IEnumerable<string> users) { string uri = "http://api.klout.com/1/klout.json?key={0}&users={1}"; string usernames = HttpUtility.UrlEncode(String.Join(",", users)); string json = _oAuth.WebRequest(oAuthTwitter.Method.GET,
String.Format(uri, _kloutKey, usernames), String.Empty); var result = JObject.Parse(json); return result["users"]; }
Now we have two data sets: one from Twitter and one from Klout. We need to join this information together to get a list of users to pass into SuggestModel. We can do that in a single LINQ Join statement. Again, skipping some OAuth details, here is the code:
protected void Suggest_Click(object sender, EventArgs e) { if (_authorized) { if (SetAccessToken()) { // todo: read values for these from the TextBox controls.
int minSuggest = 1, maxSuggest = 10, tweets = 100; var twitter = GetTwitterUsers(); var klout = GetKloutScores(twitter.Select(token => (string)token["screen_name"])); IEnumerable<User> data = klout.Join( twitter, k => (string)k["twitter_screen_name"], t => (string)t["screen_name"], (k, t) => new User { Name = (string)t["screen_name"], InterestScore = (double)k["kscore"], TweetsPerDay = (int)t["statuses_count"] /
DaysBetween(t.DateTime("created_at"), DateTime.UtcNow) }); var suggestedUsers = SuggestModel.Solve(data, minSuggest, maxSuggest, tweets);
// todo: do something with the output.
}
}
} private double DaysBetween(DateTime start, DateTime end) { return (end - start).TotalDays; }
A braindead way of displaying the output is to drag a Literal control on the form and write the output as HTML. We can replace that last “todo” with this:
StringBuilder build = new StringBuilder(); build.AppendFormat("<h2>{0}</h2>", "Suggestions"); foreach (var user in suggestedUsers) { build.AppendFormat("<div>{0}, tweets/day = {1}</div>",
HttpUtility.HtmlEncode(user.Name), user.TweetsPerDay); } apiResponse.Text = build.ToString();
I have made major, major shortcuts all the way along: the error checking is poor, I could consider caching information rather than hitting twitter & Klout each time, the interest scores are not necessarily perfect, the UI and logic is all mixed together, existing follows are not considered, the app should itself be exposed as a REST service, and so on. Even with all the simplifications I have made, a lot of glue code is necessary to tie all of this together. This obscures the relative simplicity of the model, which is a shame.