This post takes a look at performance variability issues that arise when scaling up the number of processor cores assigned to the CPLEX MIP solver. I summarize results from a few computational experiments I ran. While increasing the number of processor cores results in quicker runs on average, the effect on individual instances is quite difficult to predict. The general conclusions are:

- In general, making more processor cores available reduces solution times, up to 8 cores. Using 16 cores yields different conclusions depending on the performance metric you use.
- On average, using 4 cores instead of a single core approximately cuts solution times in half.
- On a single instance, using more cores may actually **lead to considerably longer (or shorter) run times.**
- The run with the most cores (16, in our test) was the fastest about 50% of the time, while 11% of our models ran faster on a single core than on 16 cores!

**Some background**

When mixed-integer linear programming (MIP) models take too much time to solve, a natural option is to run the model on a *faster* machine. Nowadays, with multi-core processors being commonplace, faster often means having more processors (or cores). A simple idea is to look at how much time is needed to prove optimality on a bunch of models while adding more cores.

In a pure branch-and-bound algorithm (the one you will find in textbooks), adding processors simply leads to more nodes being evaluated per second. However, state-of-the-art MIP solvers are several orders of magnitude faster than a pure BnB code: from version 1.2 to 11.0, CPLEX’s speedup is estimated at 29,530 times! [1] That being said, all these cut generation techniques and heuristics often lead to a phenomenon labeled performance variability [2,3]. Erwin Kalvelagen also experienced this phenomenon and wrote a blog post about it. I already had some data on this, so it seemed like a good time to share it with the community.

**Is parallel CPLEX faster overall?**

I ran CPLEX 12.4 on 90 models (85 from the “benchmark” set of MIPLIB 2010 [2] and 5 of my own) on the same machine, using 1, 2, 4, 8, and 16 cores. Rather than do a thorough analysis of run times (this is a blog, not a paper), I look at the following question: how many seconds does CPLEX need to solve these models to optimality given a certain number of cores? As expected, speed-ups are observed up to 8 cores. Using 16 cores actually results in more total clock time to solve the 90 models, although the geometric mean of run times consistently decreases as the number of cores increases (according to J-F Puget, this metric is used by the CPLEX and Gurobi teams to monitor performance).
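To see why the two metrics can disagree, here is a minimal sketch (with made-up run times, not my actual measurements): a single pathological instance can dominate the total clock time while the geometric mean still improves.

```python
import math

def geometric_mean(times):
    """Geometric mean of run times: far less sensitive to one huge
    outlier than the plain sum (or arithmetic mean)."""
    return math.exp(sum(math.log(t) for t in times) / len(times))

# Made-up run times (seconds) for three models on 8 vs 16 cores:
times_8 = [10.0, 50.0, 2000.0]
times_16 = [8.0, 40.0, 2500.0]   # two speed-ups, one pathological slowdown

print(sum(times_8), sum(times_16))   # total clock time gets WORSE: 2060.0 vs 2548.0
print(geometric_mean(times_8))       # 100.0
print(geometric_mean(times_16))      # ~92.8 -> the geometric mean IMPROVES
```

The geometric mean rewards the many modest speed-ups; the sum is hostage to the one blow-up.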

**How about individual instances?**

This overall estimate on 90 instances isn’t that helpful when you’re trying to solve a particular model. Over our small testbed, how many instances are solved faster when using Y cores instead of X, where Y > X? To my complete surprise, a considerable number of models were actually slower when using more cores. This is too small a dataset to reach general conclusions (the Gurobi team uses 1800+ models for testing purposes), but the thing to remember is that you can get very different solution times by changing the number of processors available to the solver.
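The pairwise comparison behind that question is easy to script. A sketch with hypothetical per-instance run times (not my actual data):

```python
def fraction_faster(times_x, times_y):
    """Fraction of instances solved faster with Y cores than with X cores.
    times_x[i] and times_y[i] are run times for the same model."""
    wins = sum(1 for tx, ty in zip(times_x, times_y) if ty < tx)
    return wins / len(times_x)

# Hypothetical run times (seconds), 1 core vs 16 cores, five models:
t_1core = [120.0, 300.0, 45.0, 800.0, 60.0]
t_16core = [40.0, 350.0, 20.0, 200.0, 75.0]

print(fraction_faster(t_1core, t_16core))  # -> 0.6: 16 cores win only 3 of 5
```

Even in this toy example, 2 of the 5 models are slower on 16 cores, which is the kind of surprise the testbed produced.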

In our testbed, the worst pathological case is instance *neos-1337307* from MIPLIB 2010, which solved in 9.66 seconds on 4 cores versus 2202.52 seconds on 8 cores. The largest speedup is on instance *net12*, also from MIPLIB 2010, which solved in 4187.89 seconds on one core versus 154.1 seconds on two cores. This definitely warrants further investigation into what actually happens in these cases.

#### Methodological considerations and references

The experiments were made on an IBM x3650 server with Intel Xeon X5650 processors at 2.67 GHz and 72 GB of RAM, running Windows Server 2008 R2. A single run was made on each instance.

[1] Bixby, R.E. (2010). Latest Advances in Mixed-Integer Programming Solvers, talk delivered at the CIRRELT Spring School on Combinatorial Optimization in Logistics, Montreal, May 17-20, 2010.

[2] Koch, T., T. Achterberg, *et al*. (2011). “MIPLIB 2010.” Mathematical Programming Computation 3(2): 103-163.

[3] Lodi, A. (2010). “Mixed-Integer Programming Computation.” In M. Jünger, T.M. Liebling, D. Naddef *et al*. (eds.), 50 Years of Integer Programming 1958-2008: The Early Years and State-of-the-Art Surveys. Berlin, Springer.

Nice post, we have similar experience in NTUA with some knapsack and TSP instances

Thanks for your post. As far as I can see, an Intel Xeon X5650 has 6 cores and 12 threads. Assuming that you have 2 CPUs in the computer, you have 12 cores and 24 threads. So, when you run your tests with 16 cores, several of the threads will share the same core. This could explain the bad scaling above 8 cores. Also, the Xeon X5650 decreases its clock frequency when many cores are in use. So, it is more difficult to conclude whether more cores are better. What your test shows is that with your hardware, one should not use more than 8 cores. However, if you have more “real” cores available, it may still be a good idea to use more cores.

Nice post!

We, at IBM, usually use the geometric mean instead of the sum of running times to compare performance (the same is true for Gurobi). Would you still get the same overall picture with that measure?

Regarding how to cope with MIP running-time variability, we do offer tools in CPLEX 12.5, as explained here by Tobias Achterberg:

http://yetanothermathprogrammingconsultant.blogspot.fr/2012/11/1-vs-4-threads.html?showComment=1353922220283#c5159965318923688785

Thank you for your comments. I updated my post (and the chart) by adding the geometric mean, and it does indeed continue to decrease up to 16 cores.

CPLEX 12.5’s new feature is really nice. I do wonder how it relates to the ParallelMode parameter that was already present in CPLEX 12.1-12.4. I believed that the “deterministic” setting would lead to identical runs. After reading Tobias, I’m not sure what this parameter does in deterministic mode.

Thank you for the update with geo mean!

To your point, deterministic means you can run the same model again and you’ll get the same answer, even if CPLEX uses multithreading.

It does not mean that you get the same answer if you change the random seed.

Put differently, if you use the same seed, then you can reproduce results. This is useful, for instance, when you want to debug something. If something goes wrong in an application that runs CPLEX as a subroutine, then you can ensure that CPLEX will behave the same when you run the application again.

What Tobias was hinting at is that you can gauge whether a given model is likely to lead to unpredictable running times. Indeed, you simply need to run CPLEX several times on that model with a different seed each time. If the running times stay similar, there is little variability. If they change a lot, you have a very “unpredictable” model.
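That seed-based check can be sketched as follows. The run times are hypothetical, and collecting them (by re-running CPLEX on the same model with a different random seed each time) is assumed to happen elsewhere; only the summary statistic is shown here.

```python
import statistics

def variability_score(times):
    """Coefficient of variation of run times from several runs of the
    same model with different random seeds. A large value flags a
    'high-variability' model; a small value, a stable one."""
    return statistics.stdev(times) / statistics.mean(times)

stable = [101.0, 99.0, 103.0, 98.0]    # run times barely move across seeds
erratic = [12.0, 540.0, 88.0, 1900.0]  # same model, wildly different times

print(round(variability_score(stable), 3))   # small  -> little variability
print(round(variability_score(erratic), 3))  # large  -> unpredictable model
```

Any dispersion measure would do; the coefficient of variation is just a convenient scale-free choice, since run times differ by orders of magnitude across models.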

It does not remove the variability issue. However, knowing when a model leads to high variability, and when it does not, is a first step towards developing models with low variability.

The net benefit of parallel processing with CPLEX also depends on the structure of the problems at hand and the methods you want to use to solve them. Solving a MIP doesn’t always involve independent processes, in which case having multiple processors and a multi-threaded CPLEX won’t help. On the other hand, it might make sense to use strong branching or another “smart” variable selection strategy where computationally heavy independent processes are involved: one process could dive into the tree and investigate the utility of branching on one variable as opposed to another while another process is still busy completing an LP subproblem.

I agree that the number of cores exhibits diminishing marginal returns in terms of real run times.

Hello,

More cores do not necessarily reduce solution times; that in itself is not news, and this experimentation confirms it. But for problems whose solution times are measured in days or weeks, the reduction matters: cutting three days down to a single day is important, and reducing a month of work to a week is very good news. The results of this experimentation confirm that run times are reduced by more than half.

For problems whose solution times are measured in minutes, the reduction is not important.

Node heuristics can contribute to this behavior if they play an important role in the total solve time. By changing the number of threads, you change the nodes at which the heuristics are applied. That can dramatically change the time required to get good solutions and eventually solve the model to optimality. Check the node log of your solver to see if the change in performance involves node throughput or number of nodes required to reach a solution within a specified gap. If the latter, then heuristics are probably involved, and you might want to see if additional threads help on some runs with heuristics disabled. If the former (i.e., extra threads don’t lead to faster node throughput), then unbalanced loads among the different threads (i.e., different node LP solve times in the subtrees assigned to different threads) can contribute.
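The diagnostic described above can be sketched as a small rule of thumb. The node counts, times, and thresholds below are all made up for illustration; in practice you would read the node counts and times off the solver's node log.

```python
def diagnose(nodes_a, secs_a, nodes_b, secs_b):
    """Compare two runs of the same model (e.g. 1 thread vs 8 threads).
    If the node count exploded while throughput held up, suspect
    heuristics changing the search path; if throughput itself dropped,
    suspect load imbalance across threads. Thresholds are arbitrary."""
    tp_a, tp_b = nodes_a / secs_a, nodes_b / secs_b  # nodes per second
    if nodes_b > 2 * nodes_a and tp_b >= 0.8 * tp_a:
        return "node count exploded: heuristics / search path changed"
    if tp_b < 0.8 * tp_a:
        return "throughput dropped: possible thread load imbalance"
    return "no obvious pathology"

# Hypothetical numbers read off two node logs: 50k nodes in 120 s,
# then 400k nodes in 700 s after adding threads.
print(diagnose(50_000, 120.0, 400_000, 700.0))
```

Here throughput actually improved (about 417 vs 571 nodes/s), so the eightfold jump in node count, not slower node processing, explains the longer run, which points at the heuristic/search-path effect.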

You are comparing apples and oranges here. As a previous commenter said, you only really have 12 real cores with your setup, so counting hyperthreaded cores is not useful. For all intents and purposes, if one core runs 2 threads, it will act like two cores running at half the clock frequency, minus some overhead (IBM will tell you a different story).