Archive
Don’t be a hero when trying to solve set covering problems
A colleague posted an interesting question to an internal discussion board the other day:
Given a set of code blocks changed between two versions of the same binary, and approximated code coverage for the new binary, what are the optimal tests that cover all changed part of the new binary and are least costly to run?
Example: (A “1” in a cell (Bi, Tj) means that block i is tested by test j)
B1 B2 B3 B4 B5 B6 Cost T1 1 1 1 1 1 1 50 T2 0 1 1 1 1 1 20 T3 1 1 1 0 0 0 15 T4 0 0 0 1 1 1 15 The optimal solution is T3, T4 with minimized cost 15 + 15 = 30. The best greedy solution that we have is T2, T3 with cost 20 + 15 = 35.
The colleague who posted this question also posted a mathematical formulation for the problem:
Binary integer problem statement:
One constraint for each binary block that needs to be covered to make sure we archive maximum coverage. One variable for each test in the list that we try to optimize.
Block 1: X1 + X3 > 0
Block 2: X1 + X2 + X3 > 0
Block 3: X1 + X2 + X3 > 0
Block 4: X1 + X2 + X4 > 0
Block 5: X1 + X2 + X4 > 0
Block 6: X1 + X2 + X4 > 0X1, X2, X3, X4 = {0, 1}
Cost function to optimize: 50*X1 + 20*X2 + 15*X3 + 15*X4
It’s simple to model this problem using Solver Foundation. Let’s break it down:
- We have sets of tests and code blocks.
- One input parameter is a matrix of 0-1 values that indicate which tests exercise which blocks.
- Another input is the cost of running each test.
- We want to decide whether or not to run each test.
- Our constraint is to make sure that each block is covered.
- We want to minimize the total cost of running the active tests.
Here is the OML. If you run Excel, change the editing mode to manual, paste the OML, then paste the table above and set up the data binding then you can run it for yourself. If you play around you will see that you can easily solve much larger instances.
Model[
Parameters[
Sets, Tests, Blocks
],
Parameters[
Integers[0, 1],
Cover[Tests, Blocks] // the cells in the matrix below
],
Parameters[
Integers[-Infinity, Infinity],
Cost[Tests] // the cost column
],
Decisions[
Integers[0, 1],
RunTest[Tests] // whether to run test i
],
Constraints[
Constraint1 -> Foreach[{b, Blocks}, Sum[{t, Tests}, Cover[t, b] * RunTest[t]] >= 1] // ensure coverage
],
Goals[
Minimize[
Goal1 -> Sum[{t, Tests}, Cost[t] * RunTest[t]] // minimize cost
]
]
]
I posted this solution to the alias, and the subsequent feedback was very interesting! The original poster was pleased to have a solution and followed up privately with some questions about the installation requirements for Solver Foundation,etc. He was happy. Another contributor recognized this as a set-covering problem, assumed it couldn’t be solved efficiently, and started proposing LP relaxations and heuristics. Another went into details about complexity theory and the difficulty of finding polynomial time approximation algorithms for set covering problems. Another simply doubted that the proposed solution would work (but didn’t try it). While these are interesting topics for discussion, it loses sight of the fact that the situation is that an engineer wants to solve a problem. Mixed integer programming solvers (like the ones inside Solver Foundation) are incredibly robust and mature; the products of decades of research. A one-off implementation of a specialized algorithm is likely to be less efficient and less robust than a highly tuned, professional grade solver. If an off the shelf solution works, then grab it with both hands and move on with life! Don’t try and be a hero (unless you plan to get a research paper out of it).
Which is faster: Regex.IsMatch or String.Contains?
On an internal message board here at work somebody asked:
Is there any difference in speed/memory usage for these two equivalent expressions:
Regex.IsMatch(Message, "1000"); Message.Contains("1000");
My guess is that Message.Contains() is faster because it likely involves less machinery. Let’s try it and see.
using System; using System.Diagnostics; using System.Text; using System.Text.RegularExpressions; namespace TryItAndSee { class Program { static void Main(string[] args) { string message = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. " + "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in" + " reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt" + " in culpa qui officia deserunt mollit anim id est laborum."; Stopwatch s = new Stopwatch(); int trials = 1000000; s.Start(); for (int i = 0; i < trials; i++) { bool isMatch = Regex.IsMatch(message, "nulla"); } s.Stop(); Console.WriteLine("regex = " + s.Elapsed); s.Reset(); s.Start(); for (int i = 0; i < trials; i++) { bool isMatch = message.Contains("nulla"); } s.Stop(); Console.WriteLine("contains = " + s.Elapsed); } } }
The output appears to confirm my guess, at least on this input:
regex = 00:00:01.2446435 contains = 00:00:00.5458883
UPDATE:
Niels Kuhnel reports the following:
“
Sure. But if you’re using RegexOptions.Compiled then IsMatch is actually faster.
Try putting:
Regex nulla = new Regex("nulla", RegexOptions.Compiled); // Normally we have a static Regex so it isn't fair to time the initialization // (although it doesn't make a difference in this case) s.Start(); for (int i = 0; i < trials; i++) { bool isMatch = nulla.IsMatch(message); }
I got:
regex = 00:00:00.6902234
contains = 00:00:00.8815885
(during 10 trials it was consistently faster)
Lesson must be that if you’re searching for the same thing a lot, the dynamically compiled state machine provided by RegexOptions.Compiled is actually faster. Even if just searching for a simple string.”
Mixed integer linearizations for QAP using Solver Foundation
I read an interesting paper by Zhang, Beltran-Royo and Ma on the excellent Optimization Online site this morning. As an exercise I wrote a Solver Foundation program to reproduce some of the results. In previous posts I showed how to write a C# program to solve Quadratic Assignment Problems (QAP). I wrote everything from scratch, including the Gilmore-Lawler bound computation. In this post (and the ones that follow) I am going to write everything using Solver Foundation Services in an effort to show how much easier it is!
If you are coming at this from the QAP angle, I recommend reading my previous QAP posts above. If you are coming at this from the Solver Foundation angle, look at page 4 of the paper (equations 2.12 – 2.15 in particular) and look at how easy it is to write down this rather complicated model using Solver Foundation Services. If you want a short Solver Foundation, check this out, or check out our new and improved documentation.
The paper talks about ways to use mixed integer linear programs (such as those built into Solver Foundation) to solve linearizations of QAP. One popular linearization is the one proposed by Kaufman and Broeckx. This formulation introduces n^2 continuous variables in addition to the standard integer variables x_ij (which represent the optimal assignment of facilities to locations). If you look at equations 2.12 – 2.15 on page 4, you will see that the goal function involves auxiliary real variables z. 2.13 is the most important constraint. 2.14 just enforces the non-negativity of the z variables, and we also need to make sure that the standard constraints on x hold: the row and column sums are equal to 1.
The code to compute the Kaufman-Broeckx linearization is below. I hardcoded a sample problem in A and B. You can replace it to read a standard problem from QAPLIB. Notice that the goals and constraints are basically transcriptions of the equations! The equations are complicated, but the way they map to the SFS APIs is straightforward. The other issue is how to use data binding to establish the values of the “q” and “a” parameters. This is done with the help of an auxiliary Cell class, and helper methods that return the “q” and “a” items. The SetBinding calls are pretty straightforward once you’ve got those.
If I were better at LINQ I could have simplified some of the code in GetCells and GetSums. But I’m not, so I didn’t. Viewing the output shows that the goal function is 50 – which appears to be correct (that’s the optimal value for the nug05 problem). Looking at the x values you can determine the optimal permutation according to the linearization.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.SolverFoundation.Services;
namespace QapLinearization {
class Program {
// Here is the data for nug05.dat from QAPLIB.
private double[][] A = new double[][] {
new double[] { 0, 5, 2, 4, 1 },
new double[] { 5, 0, 3, 0, 2 },
new double[] { 2, 3, 0, 0, 0 },
new double[] { 4, 0, 0, 0, 5 },
new double[] { 1, 2, 0, 5, 0 }
};
private double[][] B = new double[][] {
new double[] { 0, 1, 1, 2, 3 },
new double[] { 1, 0, 2, 1, 2 },
new double[] { 1, 2, 0, 1, 2 },
new double[] { 2, 1, 1, 0, 1 },
new double[] { 3, 2, 2, 1, 0 }
};
static void Main(string[] args) {
Program p = new Program();
p.KaufmanBroeckx();
}
public void KaufmanBroeckx() {
SolverContext context = SolverContext.GetContext();
Model model = context.CreateModel();
Set N = new Set(Domain.Integer, "N");
Parameter q = new Parameter(Domain.Real, "q", N, N, N, N);
q.SetBinding(GetCells(), "Value", "I", "J", "K", "L");
Parameter a = new Parameter(Domain.Real, "a", N, N);
a.SetBinding(GetSums(), "Value", "I", "J");
model.AddParameters(q, a);
Decision x = new Decision(Domain.IntegerRange(0, 1), "x", N, N);
Decision z = new Decision(Domain.RealNonnegative, "z", N, N);
model.AddDecisions(x, z);
model.AddGoal("g", GoalKind.Minimize,
Model.Sum(Model.ForEach(N, i =>
Model.Sum(Model.ForEach(N, j => z[i, j]))
))
);
model.AddConstraint("c",
Model.ForEach(N, i =>
Model.ForEach(N, j =>
z[i, j] >= Model.Sum(Model.ForEach(N, k =>
Model.Sum(Model.ForEach(N, l => q[i, j, k, l] * x[k, l] - a[i, j] * (1 - x[i, j])))
))
))
);
model.AddConstraint("rowsum",
Model.ForEach(N, i =>
Model.Sum(Model.ForEach(N, j => x[i, j])) == 1
));
model.AddConstraint("colsum",
Model.ForEach(N, j =>
Model.Sum(Model.ForEach(N, i => x[i, j])) == 1
));
var solution = context.Solve();
Console.WriteLine(solution.GetReport());
}
// This is a helper class for data binding.
private class Cell {
public int I { get; set; }
public int J { get; set; }
public int K { get; set; }
public int L { get; set; }
public double Value { get; set; }
}
// This returns the q_ijkl as described on page 2 of the paper.
private IEnumerable<Cell> GetCells() {
for (int i = 0; i < A.Length; i++) {
for (int j = 0; j < A[i].Length; j++) {
for (int k = 0; k < B.Length; k++) {
for (int l = 0; l < B[k].Length; l++) {
yield return new Cell { I = i, J = j, K = k, L = l, Value = A[i][k] * B[j][l] };
}
}
}
}
}
// This returns the a_ij as described on page 4 of the paper.
private IEnumerable<Cell> GetSums() {
for (int i = 0; i < A.Length; i++) {
for (int j = 0; j < A[i].Length; j++) {
double sum = 0;
for (int k = 0; k < B.Length; k++) {
for (int l = 0; l < B[k].Length; l++) {
if (B[k][j] != 0) {
sum += A[i][k] * B[j][l];
}
}
}
yield return new Cell { I = i, J = j, Value = sum };
}
}
}
}
}