Presenting Analytic Solver Platform 2014-R2

In 2014 Frontline Systems released the newest version of its flagship product, Analytic Solver Platform. You can download a free trial of Analytic Solver Platform here.

Analytic Solver Platform makes it easy to learn from your data and make good decisions quickly. You don’t have to learn a new programming language, suffer through a complex deployment process, or abandon what you already know: you can grab data from your desktop, the web, or the cloud and build powerful predictive models in minutes from Excel.


In this release of Analytic Solver Platform you’ll find world class time series, prediction, classification, data cleaning, and clustering methods in XLMiner. XLMiner’s 30+ data mining methods have been rewritten from the ground up, combining the latest advances in machine learning with a straightforward Excel interface. Data sets that crash more expensive competitive products run flawlessly in XLMiner. Better yet, XLMiner produces reports with all the information you need to make the business case for your findings, including built-in charts and visualizations.



Analytic Solver Platform works with Microsoft Power BI to turn data into insight. My recent post showed how cloud hosted data can be ingested, cleaned, and mined for insight in minutes. Analytic Solver Platform supplements Power Query’s data cleaning with additional methods to help you categorize, clean, and handle missing data, and provides built in connectors to allow you to sample and score with popular data sources including Power Pivot.

Finally, Analytic Solver Platform helps you bridge the gap between experimentation and production deployment. Using Analytic Solver Platform with SharePoint allows your organization to audit and version your models. Use Frontline’s Solver SDK to integrate simulation and optimization in your application whether you use C++, C#, or web technologies. The latest version of Solver SDK will provide support for the popular F# language, allowing your team to build predictive models in a fraction of the development cost and lines of code.


Give it a try!

Yes, it’s an error, but it’s *my* error!

This morning I learned of a problem with a .Net optimization component developed by my team. We checked the server logs and we found that an optimization failed with the following (fictionalized) call stack:

System.Collections.Generic.KeyNotFoundException: The given key was not present in the dictionary.
   at System.ThrowHelper.ThrowKeyNotFoundException()
   at System.Collections.Generic.Dictionary`2.get_Item(TKey key)
   at MyLibrary.Optimization.CreatePlanningHorizon(Int32 planId)
   at System.Linq.Enumerable.WhereSelectListIterator`2.MoveNext()
   at System.Linq.Enumerable.Sum(IEnumerable`1 source)
   at MyLibrary.Optimization.FindPlanningHorizons(BusinessObject data)
   at MyLibrary.Optimization.Initialize(BusinessObject data)
   at MyLibrary.Optimization.CreateModel(ScenarioData info)
   at Optimization.OptimizationThread.Optimize() in c:\Program Files\MyLibrary\OptimizationThread.cs:line 42

LINQ! Dictionaries! Call stacks! Help!

It turns out the situation wasn’t that complicated – it turns out the data that was being passed to the optimization library was invalid. A “plan” entity must have a range of “time period” entities associated with it. The CreatePlanningHorizon method examines the time periods associated with a plan to create a planning horizon. In this case, we were passed a plan with no time periods, which is invalid. (Again, I have fictionalized the scenario: I have renamed the entities from our actual application. The point is that we were passed invalid data according to the business rules of our application.)

It is totally cool to throw an exception in this case.  The .Net Design Guidelines for exceptions explain why – and this advice applies for other environments as well, for example Java. Don’t try to set an error code, or swallow the error and soldier on. This page states it well:

If a member cannot successfully do what it is designed to do, that should be considered an execution failure and an exception should be thrown.

So our problem is not that we are throwing an exception, it’s that the exception is confusing. The message does not tell me what the problem is, or what to do about it. The right thing to do here is to explicitly check the condition that should hold, and throw your own exception. You don’t need to define your own custom exception type for this. You can throw an existing exception type with a decent message. For example, at the time the planning horizon data is read from the database, throw a System.IO.InvalidDataException with the message: “The planning horizon with ID=42 is invalid because it has no associated time periods.” (Or something like that.) The point is that the message should indicate that there is a problem with the input data, so that the appropriate member of the development team can address the issue. The user of the application should never see this message – it’s for internal use.

Failing in the manner of your own choosing is preferable to failing in an unpredictable fashion!

Analytics Decathlon: 10 tasks every pro should know

I tried to think of 10 fundamental tasks that every analytics programmer should know how to do. I’m trying to keep it task-oriented (“do this”) rather than concept-oriented (“understand this”).  In thinking about this list I tried to make sure that I accounted for data preparation, carrying out a computation, sharing results, and code maintenance. Here goes:

  1. Read data from a CSV file.
  2. Sort a large multi-keyed dataset.
  3. Roll up numerical values based on a hierarchy. For example, given sales figures for all US grocery stores, produce state- and national-level sales.
  4. Create a bar chart with labels and error bars. Make sure the chart presents information clearly and beautifully. Read Tufte.
  5. Create a histogram with sensible bins. I include a second visualization item not only because histograms are so frequently used, but also because thinking about how to bin data causes one to think more deeply about how results should be summarized to tell a story.
  6. Perform data classification. Classification algorithms place items into different groups based on similarity. Several different popular machine learning approaches focus on this problem, for example k-means and decision trees. 
  7. Linear regression.
  8. Solve a linear programming problem. I am an optimization guy, so this should not surprise you. Optimization is underutilized, which is strange considering it sits atop  Analytics Mountain.
  9. Invoke an external process. Call an arbitrary executable program, preparing the necessary input, processing the output, and handling any errors.
  10. Consume and publish projects from a source control repository. I use “source control repository” loosely – simply: you need to know how to share code with others. For example: github, sourceforge, or CRAN.

It’s even better if you know how each of these tasks are actually implemented (at a high level)!

I intentionally skipped a few items:

  • Read XML. The key is to be able to process and produce structured data. Naturally tabular data, which can always be written out to CSV, seems to be more important in my experience.
  • Regular expressions. Really handy, but not vital. If you are focusing exclusively on text analytics then the situation is different.
  • Programming language X. This is worthy of a separate post – but I think it is unwise from a professional development and productivity standpoint to be religious about any particular programming language or environment: C++, .Net, Java, Python, SAS, R, Matlab, AMPL, GAMS, etc. Not all languages and environments are created equal, but no single environment provides everything an analytics pro needs in all situations (at this point). It is frequently the case that those who claim that a particular environment or language is uniquely qualified to perform a scientific programming task are unfamiliar with the alternatives.
  • Writing unit tests. I am a huge proponent of writing unit tests and test driven development, but this is not as important for consultants or academics. My hope that the thought of sharing code (number 10) scares most people enough into making sure that their code is correct and presentable.
    This list is meant to provoke discussion. What do you think? What’s missing? What’s wrong?

Simple SAS Performance Profiling Macros

As the old saying goes, the first rule of performance is to measure. My team is trying to tune the performance of a rather large SAS application, but it hasn’t been as easy as I thought it would be because profiling is too hard. Here’s what I understand about what SAS offers in the way of performance profiling (my focus is on CPU time):

First: SAS prints CPU time information to the SAS Log. For example:

17         data hello_world;
18           set temp.sampledata;
19         run;

NOTE: There were 42390 observations read from the data set TEMP.SAMPLEDATA.
NOTE: The data set WORK.HELLO_WORLD has 42390 observations and 37 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.03 seconds

This is simply too low-level to be useful because I am usually interested in understanding the performance of larger components rather than a single step. (Although once a problem has been narrowed down, it is very helpful to have this information produced automatically.)

Second: SAS also provides a LOGPARSE utility that parses logs to retrieve this information and put it in a more structured format. This is a bit better, but it still does not address the problem of understanding performance at a macro (component) level. I suppose this is due in part to the nature of SAS programming, which does not reduce to a call tree in the same way as many procedural languages do. It also does not address the very real problem of wanting to profile SAS Macro execution time, as well as including the time spent my external processes (such as callouts to system utilities or non-SAS components).

Third: for SCL (SAS Component Language) programs there is the SCL Performance Analyzer, which is helpful, but only if you happen to be writing SCL – which is not usually the case. Also the output of the analyzer is a little less than what I am used to.

I am not dissing SAS – I just think I am spoiled. I am not a SAS developer by background. The past decade or so I have been doing my primary development on the .Net Framework using various versions of Visual Studio. Several editions of Visual Studio 2012 include a performance analysis feature that makes it dead simple to profile .Net code. You can access it from the Debug menu:


A couple of clicks through a wizard, run your program, and you get all sorts of amazing information about the performance characteristics of the program:


It even highlights which lines took the most time:


It could hardly be much easier. Perhaps you can understand why I was so disappointed with SAS’s built-in tools.

I decided to be the change I want to see in the world and write a couple of simple, self-contained macros that I could use to profile SAS code. I am not a SAS expert (yet) so it’s likely that there are things that could be improved. Please help me out by posting your comments!

I wrote three macros: one to initialize the timing infrastructure, one to start timing a labeled code block, and one to stop. I have a few guidelines:

  • The macros are self-contained and small.
  • If profiling is disabled, the overhead is as close to zero as possible.
  • If profiling is enabled and the code blocks that are being timed are nontrivial, the overhead is small.
  • Arbitrary blocks of SAS/SAS Macro can be timed.
  • Timers can be nested.
  • Timing blocks can be given sensible names.
  • The results are reported in a SAS data set.

Here is an example of how to use the library. It times a small block of SAS named “hello”:

** … **;

data hello_world;
  set temp.sampledata;

The results are stored in a dataset called “_profile”:

Here is the source for the macros. “timer_init” takes a yes/no argument. If profiling is enabled, it creates a dataset called _tick. _tick holds the start times for all of the active timers.

** timer_init(profilingEnabled): initialize timers.       **;
**   profilingEnabled: Y means profiling is on.           **;
%macro timer_init(profilingEnabled);
  %global timerEnabled;
  %let timerEnabled=&profilingEnabled;
  %if &timerEnabled=Y %then %do;
  data _tick;
    root = datetime();

“timer_start” starts a new timer with the specified name. We rely on the datetime() method, which determines the timer resolution. Starting a new timer just means sticking a new variable in _tick.

** timer_start(desc): start timing a code block.          **;
**   desc: a description of the block.                    **;

%macro timer_start(label);
  %if &timerEnabled=Y %then %do;
  data _tick;
    set _tick;
    &label = datetime();

“timer_start” stops a timer. It’s slightly more complicated than you might think because I want to allow for the possibility of the same block being timed more than once, for example if it appears inside a SAS Macro %for loop. I suspect a better SAS programmer could simplify this code.

** timer_stop(desc): stop timing a code block.            **;
**   desc: a description of the block.                    **;

%macro timer_stop(label);
  %if &timerEnabled=Y %then %do;
  data _tick;
    set _tick;
    _now = datetime();
    _duration = _now – &label;
    &label = _now;
    call symputx(‘_duration’, _duration);
    drop _now _duration;

  %if %sysfunc(exist(_profile)) %then %do;
    data _profile;
      set _profile;
      &label = sum(&label,0) + input(&_duration, best20.);
  %else %do;
    data _profile;
      &label = input(&_duration, best20.);


Newsflash: debugging parallel programs ain’t easy

I ran into a situation recently where I was asked to debug a legacy C# program that was crashing due to multiple threads trying to write to the same file at the same time. I was asked because I was the last guy to modify it, so I guess I had no room to complain. I focused on the changes that I had made, trying to figure out how the heck I could have introduced the bug – my changes weren’t anywhere close to the source of the crash!

Then it hit me – my changes were a bunch of refactoring to make the code faster. The bug was always there, we were just more likely to hit it after my changes since each thread executes in less time. I should have probably guessed right away – there was a shared resource that was not being handled properly – but I was blinded by my assumption that my changes had to have introduced the bug. I guess that’s one moral of the story. (And now that I think about it, in the past *I* have been the guy that introduced a parallelism related bug that someone had to fix later.)

Another lesson is perhaps that it’s unwise to screw around with multicore parallelism unless you know what you are doing. Say you’ve got 4 cores, and let’s say that you get a 3x speedup out of them (which is often pretty generous). Many times I would rather be 3x slower but compleletely reliable and avoid random crashes. Microsoft’s task parallel library is kind of cool, but kind of dangerous. I’m not sure how often it’s really helpful.

.Net coding guidelines for operations research

Writing code, whether in C++, AMPL, GAMS, SAS, .Net, or whatever*, is a part of operations research for most of us. Here are a few thoughts on writing good .Net code. This advice is aimed especially at optimizers with a research background, who don’t have lots of experience writing “production” code.

  • Give things sensible names. In particular, start with Guidelines for Names on MSDN. This guideline is actually pretty important, because good naming and organization leads to fewer bugs and more maintainable code. My experience working with code by “optimization people” is that a lot of it is poorly organized. You’d never write a poorly organized paper (would you?) so why write disorganized code. The first step in good organization is to use good names. (Tip: In Visual Studio Use the F2 key (or right click and “Rename”) instead of find/replace.)
  • Exception: math code can and should use good math notation. Code that implements an optimization model should have a backing whitepaper. The whitepaper should include the full mathematical formulation of the problem. It should be possible to go back and forth between the documentation and code without too much trouble. So if in your doc you have a variable x with subscripts p, g, and a, then it is totally okay for the code to have the same names. Don’t artificially name x something like “amountToProduce” just to satisfy some MSDN guideline. For this reason I tend to avoid Greek letters, hats, and bars in my whitepapers. Careful choice of notation is important.
  • Get in the habit of writing small programs to test your assumptions. Are multidimensional arrays or jagged arrays faster in .Net? How much more expensive is a hashtable lookup than an array lookup? Is it better to use SOS2 constraints or dummy integer variables for piecewise linear constraints? How much memory does a solver use for a model? These things can be empirically tested.
  • Write tests for your code. In my experience, any code that does not have tests is probably wrong. Start with very simple tests that test basic, but important things.
  • Document your code. Help others understand what you are doing (and more importantly) why.
  • Public APIs should be designed for the person using it, not the person writing it. For example, an input data structure for a production planning module might talk about capacities, facilities, and availability. Constraints, knots, and duals wouldn’t make much sense to a user, even though they make perfect sense to us.
  • Consider refactoring any function over a page long. It is probably too confusing. You can highlight a section of code, right click, and do “Extract Function” in Visual Studio.
  • If you don’t understand the performance characteristics of a data structure, avoid using it until you do. The .Net framework provides TONS of built in methods and data structures. This makes it very, very easy to write correct but horribly inefficient code. For example, finding an element in List<> takes linear time, but accessing an element by key from a Dictionary<> is nearly constant time. The Count() method on an IEnumerable<> is linear time, but .Length on an array is constant time. Most of the time, arrays are great for what we do. It’s okay to use more, just understand why.
  • Package complicated data in data structures. A small caveat to the last point: optimization people tend to go crazy with multidimensional arrays when an array of simple data structure is usually more appropriate. This results in self-documenting code.
  • Be careful with memory. Just because .Net has garbage collection doesn’t mean you can forget about how you manage memory. Don’t unnecessarily allocate arrays inside a loop if you don’t need to. If you are careless with memory, it will kill your performance.
  • Don’t reinvent the wheel. You don’t need to write any sorting code – use Array.Sort. You don’t need to scan through strings for commas – use String.Split. You don’t need to write 3.14159 – use Math.PI.

The Design Guidelines for Developing Class Libraries page on MSDN is filled with lots of good suggestions.

* I think F# and other functional programming languages could actually be great for optimization. I guess that’s why Python is catching on.

O.R. and social networking: A Solver Foundation MIP model to suggest Twitter followers

Please note that Solver Foundation is no longer supported.

For this month’s INFORMS blog challenge I thought it would be fun to write a simple Solver Foundation – Twitter mashup. I’m not as happy with it as I could be, but maybe someone will find it interesting.

My goal is to write a web application that will suggest interesting twitter users to follow. Here’s what the finished product looks like:


As you can see, the application asks for:

  • a keyword,
  • a maximum number of twitter users to suggest,
  • minimum and maximum number of expected tweets per day.

and produces a list of twitter users to follow. I will describe some of the key steps in building this application, but I won’t provide all of the source simply because it would be a lot of work for me to clean up the code! I will try to provide enough so that the motivated reader can finish it off. If you want to follow along, get Solver Foundation, open Visual Studio 2010, and create a new web site (say MsfTwitter). Drag the controls shown (several Labels, one CheckBox, three TextBoxes and a Button) on the screenshot onto your main form, then double click on the button (which I named "Suggest"). Then you’ll be sitting in the codebehind file for the click button handler:

  protected void Suggest_Click(object sender, EventArgs e) {

The end goal here is to fill this with some code that will find suggestions. In order to do this, we’re going to have to know something about twitter users. For now, let’s assume that we’re able to obtain the number of tweets per day and an "interest score" that indicates how interesting a twitter user is for a particular topic. Then we can define a User data structure which is the basis for our model. Add a new C# file into the App_Code directory and add this:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Web;

[DebuggerDisplay("{Name}, score = {InterestScore}, tweets = {Tweets}")]
public class User {
  public string Name { get; set; }
  public double InterestScore { get; set; }
  public double TweetsPerDay { get; set; }

Our application logic is really divided into two parts:

  1. Finding interesting twitter users for a given topic. This gives us a list of Twitter users, including their interest score and tweets per day.

    2. Selecting a "best" subset of these users given the preferences specified on the web page.

These are actually both great analytics problems. I am going to rely on third party services to solve the first problem, and I will use Solver Foundation for the second (since that is an optimization model). So for now, let’s assume that we’ve got a list of twitter users and we want to solve problem 2. (If you’re following along in Visual Studio, now is a good time to add a reference to Solver Foundation using “Add Reference”.)

Following my standard pattern of identifying inputs, outputs, goals, and constraints, our model is as follows:

  • The input is a bunch of users. Each one has a score and expected number of tweets per day.
  • The output is a boolean value for each user: to follow or not to follow?
  • The goal is to maximize the total interest score over all selected users.
  • One constraint is that the expected number of tweets is between user specified values min_tweets and max_tweets.
  • Another constraint is that the number of users that we select is at most K.

    The next step is to write a class that we can use from the web page. I will call the class SuggestModel and give it one method. The signature is pretty straightforward based on my description above.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using Microsoft.SolverFoundation.Common;
using Microsoft.SolverFoundation.Services;

public class SuggestModel
  public static IEnumerable<User> Solve(IEnumerable<User> data, 
int minSuggestions, int maxSuggestions, int maxTweets) { } }

Creating the C# code for Solve is also straightforward given the description of the model. The Parameters for the model are the interest scores and tweets per day. Both are indexed over the set of users. We use the SetBinding method to associate the appropriate fields of the User objects with the parameters.

public static IEnumerable<User> Solve(IEnumerable<User> data, int minSuggestions, 
  int maxSuggestions, int maxTweets) {
  SolverContext context = SolverContext.GetContext();
  Model model = context.CreateModel();
  Set users = new Set(Domain.Any, "users");
  Parameter score = new Parameter(Domain.Real, "score", users);
  score.SetBinding(data, "InterestScore", "Name"); // bind to the InterestScore field.
  Parameter tweets = new Parameter(Domain.Real, "tweets", users);
  tweets.SetBinding(data, "TweetsPerDay", "Name"); // bind to the TweetsPerDay field.
  model.AddParameters(score, tweets);
  Decision choose = new Decision(Domain.IntegerRange(0, 1), "choose", users);
  model.AddGoal("goal", GoalKind.Maximize, Model.Sum(
    Model.ForEach(users, user => choose[user] * score[user])));
    minSuggestions <= Model.Sum(Model.ForEach(users, user => choose[user])) <= maxSuggestions);
    Model.Sum(Model.ForEach(users, user => choose[user] * tweets [user])) <= maxTweets);
  return data.Where(user => choose.GetDouble(user.Name) > 0.01);

Notice the last line: using a LINQ expression we return a list of users where the value of the choose decision is nonzero. That’s what we wanted. The decision values are integer valued and all the constraints and goals are linear, so this is a mixed integer programming model. It’s essentially a knapsack problem – no big deal.

Now we can return to "problem 1": finding interesting twitter users for the topic entered into the text box. To solve this problem I will rely on the technique (first suggested by Polya?) of being lazy – I will use other people’s code.

Twitter provides a simple REST API that is the equivalent of the "Who to Follow" button on You can learn more about it (and try it out) here.

To actually call this “who to follow” twitter API from our program, we need three more things:

  1. A twitter REST API key (so we can make calls to twitter). So I registered my app here: You can too – it’s free.
  2. Code to make authenticated REST calls. The authentication system is called OAuth. Shannon Whitley wrote a nice OAuth package for .Net and posted it on his blog here. Go grab that and put it in your App_Code directory. I changed the namespace to “OAuth”.
  3. Code to convert the API results (returned in a format called JSON). There is a Json.Net ackage written by James Newton-King that makes this easy. Follow this link for the details. In the code below where I work with JObject/JToken/JArray, add a ”using Newtonsoft.Json.Linq” statement to the top of the source file.

Since the twitter API is authenticated, there is a dance that we have to do to obtain an access token given our API key. This involves redirecting to twitter, allowing our registered application to have access to twitter data, then redirecting back to our page. I am omitting that code because it’s not really the point of this post.

Once you go through all of that mess, the code is short and sweet. Put this in the same file as your click event handler (probably Default.aspx.cs):

  private oAuthTwitter _oAuth;
  private JArray GetTwitterUsers() {
    string url = String.Format({0},
HttpUtility.UrlEncode(Keyword.Text)); string json = _oAuth.oAuthWebRequest(oAuthTwitter.Method.GET, url, String.Empty); return JArray.Parse(json); }

This gives me a bunch of interesting users but I don’t have a score or expected number of tweets. We can estimate the number of tweets per day by looking at two fields in the GetRelevantUsers result: statuses_count (the number of tweets for the user account) and created_at (when the account was created). Easy.

The last ingredient is the interest scores. A colleague referred me to Klout, which fits the bill nicely. The documentation for the API is here, but it is very simple: pass a comma-delimited list of user names in the query string and you get back the "klout scores". I will use those as our interest scores. Using the same raw materials as for the twitter REST API, the code looks like this:

  private JToken GetKloutScores(IEnumerable<string> users) {
    string uri = "{0}&users={1}";
    string usernames = HttpUtility.UrlEncode(String.Join(",", users)); 
    string json = _oAuth.WebRequest(oAuthTwitter.Method.GET, 
String.Format(uri, _kloutKey, usernames), String.Empty); var result = JObject.Parse(json); return result["users"]; }

Now we have two data sets: one from Twitter and one from Klout. We need to join this information together to get a list of users to pass into SuggestModel. We can do that in a single LINQ Join statement. Again, skipping some OAuth details, here is the code:

  protected void Suggest_Click(object sender, EventArgs e) {
    if (_authorized) {
      if (SetAccessToken()) {
        // todo: read values for these from the TextBox controls.
int minSuggest = 1, maxSuggest = 10, tweets = 100; var twitter = GetTwitterUsers(); var klout = GetKloutScores(twitter.Select(token => (string)token["screen_name"])); IEnumerable<User> data = klout.Join( twitter, k => (string)k["twitter_screen_name"], t => (string)t["screen_name"], (k, t) => new User { Name = (string)t["screen_name"], InterestScore = (double)k["kscore"], TweetsPerDay = (int)t["statuses_count"] /
DaysBetween(t.DateTime("created_at"), DateTime.UtcNow) }); var suggestedUsers = SuggestModel.Solve(data, minSuggest, maxSuggest, tweets);
// todo: do something with the output.
} private double DaysBetween(DateTime start, DateTime end) { return (end - start).TotalDays; }

A braindead way of displaying the output is to drag a Literal control on the form and  write the output as HTML. We can replace that last “todo” with this:

StringBuilder build = new StringBuilder();
build.AppendFormat("<h2>{0}</h2>", "Suggestions");
foreach (var user in suggestedUsers) {
  build.AppendFormat("<div>{0}, tweets/day = {1}</div>", 
HttpUtility.HtmlEncode(user.Name), user.TweetsPerDay); } apiResponse.Text = build.ToString();

I have made major, major shortcuts all the way along: the error checking is poor, I could consider caching information rather than hitting twitter & Klout each time, the interest scores are not necessarily perfect, the UI and logic is all mixed together, existing follows are not considered, the app should itself be exposed as a REST service, and so on. Even with all the simplifications I have made, a lot of glue code is necessary to tie all of this together. This obscures the relative simplicity of the model, which is a shame.

Converting an Access DB to XML using C#

I recently needed to import an Access database into a C# program for a sample that I will be blogging about soon. My objective was to convert the data into a more convenient form for use with my “real” application. Nothing here will be very cutting edge! A quick and dirty way to get the job done seemed to be to read the data into a DataSet and export it to XML. Access MDB files can be read using the Jet OLDB provider with OleDbConnection. Once the connection is established, the GetOleDbSchemaTable method can be used to get the table names. Then each table can be read using a select. Writing the data out to XML is easy using the built-in DataSet.WriteToXml() method. I also write out the schema file so that the columns will have the correct types when I read the data back in.

One last hitch: in .Net 4/VS2010 the OLEDB component works only with a 32-bit build. So change the Platform target in the Project properties as follows:


Here’s the code. I’ve made even less of an effort than usual to make the code “production quality”, but note that a couple of the classes I use are IDisposable so I’m taking care to wrap them in a “using” block.

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.OleDb;
using System.Linq;
using System.Text;

// None of this is foolproof...caveat emptor.
namespace AccessToDataSet {
  class Program {
    static void Main(string[] args) {
      if (args.Length == 0 || !args[0].EndsWith(".mdb", StringComparison.InvariantCultureIgnoreCase)) {
        Console.WriteLine("Please specify the path to an MDB file.");

      DataSet dataSet = new DataSet();
      using (var conn = new OleDbConnection(@"Provider=Microsoft.JET.OLEDB.4.0;" + @"data source=" + args[0])) {
        // Retrieve the schema
        DataTable schemaTable = conn.GetOleDbSchemaTable(OleDbSchemaGuid.Tables, new object[] { null, null, null, "TABLE" });
        // Fill the DataTables.
        foreach (DataRow dataTableRow in schemaTable.Rows) {
          string tableName = dataTableRow["Table_Name"].ToString();
          // I seem to get an extra table starting with ~. I can't seem to screen it out based on information in schemaTable,
          // hence this hacky check.
          if (!tableName.StartsWith("~", StringComparison.InvariantCultureIgnoreCase)) {
            FillTable(dataSet, conn, tableName);

      string name = args[0].ToLowerInvariant();
      dataSet.WriteXmlSchema(name.Replace(".mdb", ".schema.xml"));
      dataSet.WriteXml(name.Replace(".mdb", ".xml"));

    private static void FillTable(DataSet dataSet, OleDbConnection conn, string tableName) {
      DataTable dataTable = dataSet.Tables.Add(tableName);
      using (OleDbCommand readRows = new OleDbCommand("SELECT * from " + tableName, conn)) {
        OleDbDataAdapter adapter = new OleDbDataAdapter(readRows);

Which is faster: Regex.IsMatch or String.Contains?

On an internal message board here at work somebody asked:

Is there any difference in speed/memory usage for these two equivalent expressions:

Regex.IsMatch(Message, "1000");

My guess is that Message.Contains() is faster because it likely involves less machinery. Let’s try it and see.

using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

namespace TryItAndSee {
  class Program {
    static void Main(string[] args) {
      string message = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. "
      + "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in"
      + " reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt"
      + " in culpa qui officia deserunt mollit anim id est laborum.";
      Stopwatch s = new Stopwatch();
      int trials = 1000000;
      for (int i = 0; i < trials; i++) {
        bool isMatch = Regex.IsMatch(message, "nulla");
      Console.WriteLine("regex = " + s.Elapsed);
      for (int i = 0; i < trials; i++) {
        bool isMatch = message.Contains("nulla");
      Console.WriteLine("contains = " + s.Elapsed);

The output appears to confirm my guess, at least on this input:

regex    = 00:00:01.2446435
contains = 00:00:00.5458883


Niels Kuhnel reports the following:

Sure. But if you’re using RegexOptions.Compiled then IsMatch is actually faster.

Try putting:

Regex nulla = new Regex("nulla", RegexOptions.Compiled);

// Normally we have a static Regex so it isn't fair to time the initialization 
// (although it doesn't make a difference in this case)
for (int i = 0; i < trials; i++) {
 bool isMatch = nulla.IsMatch(message);

I got:

regex = 00:00:00.6902234

contains = 00:00:00.8815885

(during 10 trials it was consistently faster)

Lesson must be that if you’re searching for the same thing a lot, the dynamically compiled state machine provided by RegexOptions.Compiled is actually faster. Even if just searching for a simple string.”

Solving a Knapsack problem with Solver Foundation and LINQ

On an internal message board, somebody asked how to solve the following problem:

Let’s say I have this list of days and prices:

List<ReservationPrice> prices = new List<ReservationPrice>(); 
prices.Add(new ReservationPrice { NumberOfDays = 1, Price = 1000 }); 
prices.Add(new ReservationPrice { NumberOfDays = 2, Price = 1200 }); 
prices.Add(new ReservationPrice { NumberOfDays = 3, Price = 2500 }); 
prices.Add(new ReservationPrice { NumberOfDays = 4, Price = 3100 }); 
prices.Add(new ReservationPrice { NumberOfDays = 7, Price = 4000 }); 

What I would like to able to do now is: give me the best price from the list based on a number of days.

So if ask for 3 days the best price from the list is from child one (1000) and two (1200), but there are of course different combinations. How would an algorithm that found the best price from this list look like ?

Here’s how you could do it in Solver Foundation. You’ll note that the way we bind the solution (which items to choose) back to the ReservationPrice data structure kind of stinks. Other than that the code is pretty straightforward.

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.SolverFoundation.Services;

namespace SolverFoundationSample {

  public class ReservationPrice {
    private double _selectedValue;

    public int Id { get; set; }
    public int NumberOfDays { get; set; }
    public int Price { get; set; }
    public bool Selected {
      get { return _selectedValue > 0.0; }
      set { _selectedValue = value ? 1.0 : 0.0; }
    // It's too bad that we need this field. It would be nicer to bind
    // the Decision to the Selected property - but we need a double-valued
    // property.
    public double SelectedValue {
      get { return _selectedValue; }
      set { _selectedValue = value; }

    public override string ToString() {
      return "Number of days = " + NumberOfDays + ", price = " + Price;

  public class Knapsack {

    public static void Main(string[] args) {

    private static IEnumerable<ReservationPrice> GetData() {
      List<ReservationPrice> prices = new List<ReservationPrice>();
      prices.Add(new ReservationPrice { NumberOfDays = 1, Price = 1000, Id = 0 });
      prices.Add(new ReservationPrice { NumberOfDays = 2, Price = 1200, Id = 1 });
      prices.Add(new ReservationPrice { NumberOfDays = 3, Price = 2500, Id = 2 });
      prices.Add(new ReservationPrice { NumberOfDays = 4, Price = 3100, Id = 3 });
      prices.Add(new ReservationPrice { NumberOfDays = 7, Price = 4000, Id = 4 });
      return prices;

    private static void SolveKnapsack() {
      SolverContext context = SolverContext.GetContext();
      Model model = context.CreateModel();
      var items = new Set(Domain.Any, "items");

      var theData = GetData();

      // Duration of each reservation
      var length = new Parameter(Domain.IntegerNonnegative, "length", items);
      length.SetBinding(theData, "NumberOfDays", "Id");

      // Price for each reservation.
      var price = new Parameter(Domain.IntegerNonnegative, "price", items);
      price.SetBinding(theData, "Price", "Id");

      // The duration requirement.
      int duration = 3;

      var choose = new Decision(Domain.IntegerRange(0, 1), "choose", items);
      choose.SetBinding(theData, "SelectedValue", "Id");
      // Reserve the right number of days.
      model.AddConstraint("c", (Model.Sum(Model.ForEach(items, i => (choose[i] * length[i]))) == duration));

      // Spend as little as possible.
      model.AddGoal("g", GoalKind.Minimize, Model.Sum(Model.ForEach(items, i => (price[i] * choose[i]))));

      var solution = context.Solve();

      var selected = from d in theData
                where d.Selected
                select d.ToString();

      foreach (var s in selected) {