Presenting Analytic Solver Platform 2014-R2

Frontline Systems has released the newest version of its flagship product, Analytic Solver Platform 2014-R2. You can download a free trial of Analytic Solver Platform here.

Analytic Solver Platform makes it easy to learn from your data and make good decisions quickly. You don’t have to learn a new programming language, suffer through a complex deployment process, or abandon what you already know: you can grab data from your desktop, the web, or the cloud and build powerful predictive models from Excel in minutes.

In this release of Analytic Solver Platform you’ll find world-class time series, prediction, classification, data cleaning, and clustering methods in XLMiner. XLMiner’s 30+ data mining methods have been rewritten from the ground up, combining the latest advances in machine learning with a straightforward Excel interface. Data sets that crash more expensive competing products run flawlessly in XLMiner. Better yet, XLMiner produces reports with all the information you need to make the business case for your findings, including built-in charts and visualizations.

Analytic Solver Platform works with Microsoft Power BI to turn data into insight. My recent post showed how cloud-hosted data can be ingested, cleaned, and mined for insight in minutes. Analytic Solver Platform supplements Power Query’s data cleaning with additional methods to help you categorize, clean, and handle missing data, and provides built-in connectors that let you sample and score with popular data sources, including Power Pivot.

Finally, Analytic Solver Platform helps you bridge the gap between experimentation and production deployment. Using Analytic Solver Platform with SharePoint allows your organization to audit and version your models. Use Frontline’s Solver SDK to integrate simulation and optimization into your application, whether you use C++, C#, or web technologies. The latest version of Solver SDK will provide support for the popular F# language, allowing your team to build predictive models with a fraction of the development cost and lines of code.

Give it a try!

Yes, it’s an error, but it’s *my* error!

This morning I learned of a problem with a .Net optimization component developed by my team. We checked the server logs and found that an optimization failed with the following (fictionalized) call stack:

System.Collections.Generic.KeyNotFoundException: The given key was not present in the dictionary.
   at System.ThrowHelper.ThrowKeyNotFoundException()
   at System.Collections.Generic.Dictionary`2.get_Item(TKey key)
   at MyLibrary.Optimization.CreatePlanningHorizon(Int32 planId)
   at System.Linq.Enumerable.WhereSelectListIterator`2.MoveNext()
   at System.Linq.Enumerable.Sum(IEnumerable`1 source)
   at MyLibrary.Optimization.FindPlanningHorizons(BusinessObject data)
   at MyLibrary.Optimization.Initialize(BusinessObject data)
   at MyLibrary.Optimization.CreateModel(ScenarioData info)
   at Optimization.OptimizationThread.Optimize() in c:\Program Files\MyLibrary\OptimizationThread.cs:line 42

LINQ! Dictionaries! Call stacks! Help!

It turns out the situation wasn’t that complicated: the data being passed to the optimization library was invalid. A “plan” entity must have a range of “time period” entities associated with it. The CreatePlanningHorizon method examines the time periods associated with a plan to create a planning horizon. In this case, we were passed a plan with no time periods, which is invalid. (Again, I have fictionalized the scenario: I have renamed the entities from our actual application. The point is that we were passed invalid data according to the business rules of our application.)
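
To make the failure concrete, here is a minimal sketch of the pattern that produces a stack trace like the one above. Everything here is fictionalized to mirror the call stack; the types and data are invented for illustration, not taken from our production code:

using System;
using System.Collections.Generic;
using System.Linq;

class Optimization {
  // Hypothetical types standing in for our business entities.
  class TimePeriod { public double Start; public double End; }
  class BusinessObject { public List<int> PlanIds = new List<int>(); }

  Dictionary<int, List<TimePeriod>> timePeriodsByPlanId =
      new Dictionary<int, List<TimePeriod>>();

  double FindPlanningHorizons(BusinessObject data) {
    // Select is lazy: CreatePlanningHorizon runs only as Sum enumerates,
    // which is why it appears between WhereSelectListIterator.MoveNext and
    // Enumerable.Sum in the stack trace.
    return data.PlanIds.Select(planId => CreatePlanningHorizon(planId)).Sum();
  }

  double CreatePlanningHorizon(int planId) {
    // A plan with no time periods has no dictionary entry, so this indexer
    // throws the unhelpful KeyNotFoundException.
    List<TimePeriod> periods = timePeriodsByPlanId[planId];
    return periods.Max(p => p.End) - periods.Min(p => p.Start);
  }

  static void Main() {
    var opt = new Optimization();
    var data = new BusinessObject();
    data.PlanIds.Add(42);            // plan 42 has no time periods registered
    opt.FindPlanningHorizons(data);  // throws KeyNotFoundException
  }
}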

It is totally cool to throw an exception in this case. The .Net Design Guidelines for exceptions explain why – and this advice applies to other environments as well, for example Java. Don’t try to set an error code, or swallow the error and soldier on. This page states it well:

If a member cannot successfully do what it is designed to do, that should be considered an execution failure and an exception should be thrown.

So our problem is not that we are throwing an exception, it’s that the exception is confusing. The message does not tell me what the problem is, or what to do about it. The right thing to do here is to explicitly check the condition that should hold, and throw your own exception. You don’t need to define your own custom exception type for this. You can throw an existing exception type with a decent message. For example, at the time the planning horizon data is read from the database, throw a System.IO.InvalidDataException with the message: “The planning horizon with ID=42 is invalid because it has no associated time periods.” (Or something like that.) The point is that the message should indicate that there is a problem with the input data, so that the appropriate member of the development team can address the issue. The user of the application should never see this message – it’s for internal use.
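
Concretely, the guard might look like the following, reusing the fictionalized names from the sketch above (in our real system, the check belongs where the plan data is read from the database):

  double CreatePlanningHorizon(int planId) {
    List<TimePeriod> periods;
    // Check the business rule explicitly instead of letting the dictionary throw.
    if (!timePeriodsByPlanId.TryGetValue(planId, out periods) || periods.Count == 0) {
      throw new System.IO.InvalidDataException(
          "The planning horizon with ID=" + planId +
          " is invalid because it has no associated time periods.");
    }
    return periods.Max(p => p.End) - periods.Min(p => p.Start);
  }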

Failing in the manner of your own choosing is preferable to failing in an unpredictable fashion!

Analytics Decathlon: 10 tasks every pro should know

I tried to think of 10 fundamental tasks that every analytics programmer should know how to do. I’m trying to keep it task-oriented (“do this”) rather than concept-oriented (“understand this”). In thinking about this list I tried to make sure that I accounted for data preparation, carrying out a computation, sharing results, and code maintenance. Here goes:

  1. Read data from a CSV file.
  2. Sort a large multi-keyed dataset.
  3. Roll up numerical values based on a hierarchy. For example, given sales figures for all US grocery stores, produce state- and national-level sales. (A sketch of this one follows the list.)
  4. Create a bar chart with labels and error bars. Make sure the chart presents information clearly and beautifully. Read Tufte.
  5. Create a histogram with sensible bins. I include a second visualization item not only because histograms are so frequently used, but also because thinking about how to bin data causes one to think more deeply about how results should be summarized to tell a story.
  6. Perform data classification. Classification algorithms assign items to groups. Several popular machine learning approaches focus on grouping problems, for example k-means (which clusters by similarity) and decision trees (which classify using labeled training data).
  7. Fit a linear regression.
  8. Solve a linear programming problem. I am an optimization guy, so this should not surprise you. Optimization is underutilized, which is strange considering it sits atop Analytics Mountain.
  9. Invoke an external process. Call an arbitrary executable program, preparing the necessary input, processing the output, and handling any errors.
  10. Consume and publish projects from a source control repository. I use “source control repository” loosely; the point is simply that you need to know how to share code with others. For example: GitHub, SourceForge, or CRAN.
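
To give a flavor of what these look like in code, here is a minimal C# sketch of item 3, the hierarchy rollup, using an invented record type and made-up sales figures:

using System;
using System.Collections.Generic;
using System.Linq;

class SalesRollup {
  // Hypothetical record type; real data would come from a file or database.
  class StoreSales {
    public string State;
    public string Store;
    public double Sales;
  }

  static void Main() {
    var data = new List<StoreSales> {
      new StoreSales { State = "WA", Store = "Seattle #1",  Sales = 100.0 },
      new StoreSales { State = "WA", Store = "Spokane #7",  Sales =  50.0 },
      new StoreSales { State = "OR", Store = "Portland #2", Sales =  75.0 }
    };

    // State-level rollup: group by the parent key in the hierarchy and sum.
    foreach (var g in data.GroupBy(s => s.State)) {
      Console.WriteLine(g.Key + ": " + g.Sum(s => s.Sales));
    }

    // National level: sum over all records.
    Console.WriteLine("US: " + data.Sum(s => s.Sales));
  }
}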

It’s even better if you know how each of these tasks is actually implemented (at a high level)!

I intentionally skipped a few items:

  • Read XML. The key is to be able to process and produce structured data. Data that is naturally tabular, which can always be written out to CSV, seems to be more important in my experience.
  • Regular expressions. Really handy, but not vital. If you are focusing exclusively on text analytics then the situation is different.
  • Programming language X. This is worthy of a separate post – but I think it is unwise from a professional development and productivity standpoint to be religious about any particular programming language or environment: C++, .Net, Java, Python, SAS, R, Matlab, AMPL, GAMS, etc. Not all languages and environments are created equal, but no single environment provides everything an analytics pro needs in all situations (at this point). It is frequently the case that those who claim that a particular environment or language is uniquely qualified to perform a scientific programming task are unfamiliar with the alternatives.
  • Writing unit tests. I am a huge proponent of writing unit tests and test-driven development, but this is not as important for consultants or academics. My hope is that the thought of sharing code (number 10) scares most people into making sure their code is correct and presentable.

This list is meant to provoke discussion. What do you think? What’s missing? What’s wrong?

.Net coding guidelines for operations research

Writing code, whether in C++, AMPL, GAMS, SAS, .Net, or whatever*, is a part of operations research for most of us. Here are a few thoughts on writing good .Net code. This advice is aimed especially at optimizers with a research background, who don’t have lots of experience writing “production” code.

  • Give things sensible names. In particular, start with the Guidelines for Names on MSDN. This guideline is actually pretty important, because good naming and organization leads to fewer bugs and more maintainable code. My experience working with code by “optimization people” is that a lot of it is poorly organized. You’d never write a poorly organized paper (would you?), so why write disorganized code? The first step in good organization is to use good names. (Tip: in Visual Studio, use the F2 key (or right-click and “Rename”) instead of find/replace.)
  • Exception: math code can and should use good math notation. Code that implements an optimization model should have a backing whitepaper. The whitepaper should include the full mathematical formulation of the problem. It should be possible to go back and forth between the documentation and code without too much trouble. So if in your doc you have a variable x with subscripts p, g, and a, then it is totally okay for the code to have the same names. Don’t artificially name x something like “amountToProduce” just to satisfy some MSDN guideline. For this reason I tend to avoid Greek letters, hats, and bars in my whitepapers. Careful choice of notation is important.
  • Get in the habit of writing small programs to test your assumptions. Are multidimensional arrays or jagged arrays faster in .Net? How much more expensive is a hashtable lookup than an array lookup? Is it better to use SOS2 constraints or dummy integer variables for piecewise linear constraints? How much memory does a solver use for a model? These things can be empirically tested.
  • Write tests for your code. In my experience, any code that does not have tests is probably wrong. Start with very simple tests that test basic, but important things.
  • Document your code. Help others understand what you are doing and, more importantly, why.
  • Public APIs should be designed for the person using them, not the person writing them. For example, an input data structure for a production planning module might talk about capacities, facilities, and availability. Constraints, knots, and duals wouldn’t make much sense to a user, even though they make perfect sense to us.
  • Consider refactoring any function over a page long. It is probably too confusing. You can highlight a section of code, right click, and do “Extract Function” in Visual Studio.
  • If you don’t understand the performance characteristics of a data structure, avoid using it until you do. The .Net framework provides TONS of built-in methods and data structures. This makes it very, very easy to write correct but horribly inefficient code. For example, finding an element in a List<> takes linear time, but accessing an element by key from a Dictionary<> is nearly constant time. The Count() method on an IEnumerable<> is linear time, but .Length on an array is constant time. Most of the time, arrays are great for what we do. It’s okay to use richer data structures; just understand why. (Claims like these are easy to check with a small program; see the sketch after this list.)
  • Package complicated data in data structures. A small caveat to the last point: optimization people tend to go crazy with multidimensional arrays when an array of simple data structures is usually more appropriate. The result is self-documenting code.
  • Be careful with memory. Just because .Net has garbage collection doesn’t mean you can forget about how you manage memory. Don’t allocate arrays inside a loop if you don’t need to. If you are careless with memory, it will kill your performance.
  • Don’t reinvent the wheel. You don’t need to write any sorting code – use Array.Sort. You don’t need to scan through strings for commas – use String.Split. You don’t need to write 3.14159 – use Math.PI.
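
Two of the points above reinforce each other: write small programs to test your assumptions, and know the performance characteristics of your data structures. Here is a minimal sketch, with made-up sizes, that times a linear List<> search against a hashed Dictionary<> lookup. Exact timings will vary by machine, but the linear-versus-constant gap dominates as n grows.

using System;
using System.Collections.Generic;
using System.Diagnostics;

class LookupTiming {
  static void Main() {
    int n = 20000;  // made-up size; increase it to watch the gap widen
    var list = new List<int>();
    var dict = new Dictionary<int, int>();
    for (int i = 0; i < n; i++) { list.Add(i); dict[i] = i; }

    // List.Contains scans the list from the front on every call: linear time.
    var s = Stopwatch.StartNew();
    for (int i = 0; i < n; i++) { bool found = list.Contains(i); }
    s.Stop();
    Console.WriteLine("List.Contains:          " + s.Elapsed);

    // Dictionary.ContainsKey hashes the key: nearly constant time.
    s = Stopwatch.StartNew();
    for (int i = 0; i < n; i++) { bool found = dict.ContainsKey(i); }
    s.Stop();
    Console.WriteLine("Dictionary.ContainsKey: " + s.Elapsed);
  }
}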

The Design Guidelines for Developing Class Libraries page on MSDN is filled with lots of good suggestions.

* I think F# and other functional programming languages could actually be great for optimization. I guess that’s why Python is catching on.

Which is faster: Regex.IsMatch or String.Contains?

On an internal message board here at work somebody asked:

Is there any difference in speed/memory usage for these two equivalent expressions:

Regex.IsMatch(Message, "1000");
Message.Contains("1000");

My guess is that Message.Contains() is faster because it likely involves less machinery. Let’s try it and see.

using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

namespace TryItAndSee {
  class Program {
    static void Main(string[] args) {
      string message = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. "
      + "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in"
      + " reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt"
      + " in culpa qui officia deserunt mollit anim id est laborum.";
      Stopwatch s = new Stopwatch();
      int trials = 1000000;

      // Time Regex.IsMatch over repeated searches for the same pattern.
      s.Start();
      for (int i = 0; i < trials; i++) {
        bool isMatch = Regex.IsMatch(message, "nulla");
      }
      s.Stop();
      Console.WriteLine("regex = " + s.Elapsed);

      // Time String.Contains over the same input and pattern.
      s.Reset();
      s.Start();
      for (int i = 0; i < trials; i++) {
        bool isMatch = message.Contains("nulla");
      }
      s.Stop();
      Console.WriteLine("contains = " + s.Elapsed);
    }
  }
}

The output appears to confirm my guess, at least on this input:

regex    = 00:00:01.2446435
contains = 00:00:00.5458883

UPDATE:

Niels Kuhnel reports the following:

Sure. But if you’re using RegexOptions.Compiled then IsMatch is actually faster.

Try putting:

Regex nulla = new Regex("nulla", RegexOptions.Compiled);

// Normally we have a static Regex so it isn't fair to time the initialization
// (although it doesn't make a difference in this case)
s.Reset();
s.Start();
for (int i = 0; i < trials; i++) {
  bool isMatch = nulla.IsMatch(message);
}
s.Stop();
Console.WriteLine("regex = " + s.Elapsed);

I got:

regex = 00:00:00.6902234

contains = 00:00:00.8815885

(during 10 trials it was consistently faster)

The lesson must be that if you’re searching for the same thing a lot, the dynamically compiled state machine provided by RegexOptions.Compiled is actually faster, even if you’re just searching for a simple string.