Simulated annealing (option METHOD/PMETHOD=ANNEALING) is an algorithm originally proposed to solve (or at least approximately solve) certain combinatorial problems such as the traveling salesman problem (minimizing the total distance required to hit each of a set of stops exactly once). As the number of stops grows, it becomes intractable to simply examine all possible routes and pick the shortest one. But the problem is also difficult to attack by changing the route so that each change (always) shortens it, because it’s very easy to get trapped in a locally shortest route: shortening the path may require taking a “right” rather than a “left” 15 or 20 moves farther back, and backtracking through all of that is basically no different from the (impossible) testing of all routes.

The key to simulated annealing is that it can make changes that either shorten or lengthen the route. The name is based upon the annealing process in metallurgy, where the energy state generally goes down, but can occasionally go up, as metal cools. At a high “temperature”, moves in the “wrong” direction are more likely, and they become less likely as the “temperature” declines. For the traveling salesman problem, there may be a key switch that requires quite a bit of later reshuffling, but which reduces the overall length. The annealing algorithm makes it possible to try that switch at a high temperature (when, at least initially, it seems to make things worse) and then steadily fix up the rest of the path as the process “cools”.

The general structure of the algorithm requires an “energy function” (the function to be optimized, or some transformation of it) and an “annealing schedule”, which gives the initial “temperature” and says when and by how much it “cools”. With RATS, the probability of moving from the current energy level e to a proposed energy level e’ when at temperature T is

   P(move) = 1/(1 + exp((e’ - e)/T))

which is known as the Metropolis condition. It’s possible to make a “bad” move (to a higher energy level), and it’s also possible to not make a “good” move. However, the probability of either of those goes down as T decreases. (Some descriptions of SA instead have the algorithm always take a good move.)
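
To make the formula concrete: at T = 5.0, a proposed move that increases the energy by 2.0 is accepted with probability 1/(1 + exp(2/5)), or about .40, while at T = 0.5 the same move is accepted with probability 1/(1 + exp(4)), or about .02. A move that decreases the energy by 2.0 is accepted with probability of about .60 at T = 5.0 and about .98 at T = 0.5, so the chance of passing up a good move also shrinks as the process cools.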

That much is common to all uses of simulated annealing. What’s specific to an application is the generation of possible moves (proposals). For the traveling salesman problem, what’s been found to work is a random choice of path reversals and path breaks. However, we’re interested in applying this to functions defined on continuous parameter spaces, not integers. Corana et al. (1987) describe an adaptive method for generating proposals in such situations. Each parameter maintains its own range (which they call v(i)). Batches of single-move proposals are generated, where each proposal randomly perturbs just one of the parameters using that parameter’s current range (a uniform draw from plus or minus the range around the current value). Moves or non-moves are determined by the Metropolis test. At the end of a batch, you look at the fraction of accepted moves for each parameter: if it’s too high (the target is 50% moves), the range is widened, and if it’s too low, the range is narrowed (the adjustment rule is sketched below). Multiple batches are run at each temperature level to form an “iteration”. At the end of an iteration, the best set of parameters seen so far is re-loaded and the temperature is reduced. At a high temperature, quite a few “bad” moves (even some really bad ones) will be accepted, and the parameter space will be searched widely. As the temperature is reduced, you will generally make only good moves, and the test ranges for the parameters will squeeze down so that you are making only (small) local changes.
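
For reference, the adjustment rule as published in Corana et al. (1987) works as follows (the RATS implementation may differ in its details). If g is the fraction of accepted moves for a parameter over the batch, its range is updated by

   v(i) <- v(i) x (1 + c(g - .6)/.4)   if g > .6
   v(i) <- v(i) / (1 + c(.4 - g)/.4)   if g < .4

and is left alone for g between .4 and .6, where c is the adjustment constant (Corana et al.’s Cu, set by the AVADJUST option described below). For example, with c = 2.0, a batch where every proposal for a parameter is accepted (g = 1) multiplies that parameter’s range by 1 + 2(1 - .6)/.4 = 3.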

METHOD=ANNEALING (and PMETHOD=ANNEALING, which is what is more typically used) are available on BOXJENK, CVMODEL, DLM, FIND, GARCH, MAXIMIZE and NLLS. Each of those instructions has a T0 option to control the initial temperature. For a maximum likelihood function (all of the above except NLLS and BOXJENK without the MAXL option), the default is the maximum of 5 and the square root of the number of parameters, which seems to work well in practice. For non-linear least squares functions, it’s 20% of the function value at the initial guesses. The ITERS (for METHOD) and PITERS (for PMETHOD) options control the number of “iterations”.
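
As a sketch of typical usage (the data setup, parameter names and option values here are purely illustrative), the following estimates the mean and standard deviation of a Normal sample by maximum likelihood, with a few annealing iterations to search for good starting values before BFGS takes over:

   * Illustrative data: 200 draws from a Normal with mean 2.0 and s.d. 3.0
   allocate 200
   seed 53531
   set x = 2.0+%ran(3.0)
   * Log likelihood element for entry t
   nonlin mu sigma
   frml logl = %logdensity(sigma^2,x-mu)
   compute mu=0.0,sigma=1.0
   * Five preliminary annealing iterations, then BFGS from the best point found
   maximize(pmethod=annealing,piters=5,method=bfgs) logl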

The other controls are set with the NLPAR instruction, and generally don’t need adjustment. ASCHEDULE controls the rate at which the temperature is lowered per iteration; the default is .85. AINNER (default 20) is the size of the inner batches described above; after each inner batch, the ranges are adjusted to try to push the acceptance rates towards .5. The AVADJUST option (default 2.0) controls the size of the range adjustment (Corana et al. call this Cu); higher values lead to more aggressive changes. AOUTER controls the number of times the inner batches are run per iteration; it is a multiplier on the number of parameters (the inner batch size is fixed). The default is 3.0; however, a minimum of 25 outer loops are done regardless of the number of parameters. For example, if you have 15 parameters, there will be 3.0 x 15 = 45 outer loops over 20 (the AINNER default) inner loops, each of which does one function evaluation per parameter; thus a single “iteration” involves 3 x 15 x 20 x 15 = 13500 function evaluations. By contrast, a BFGS iteration typically requires 15 function evaluations plus a few extra for the directional search, so you can get an idea of how much longer annealing will take. In practice, use it (at most) only for preliminary iterations and keep the number of PITERS small (5 or fewer).
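
For example (the option values here are arbitrary, chosen only to show the syntax), to cool more slowly and adjust the ranges less aggressively than the defaults, set the controls with NLPAR before estimating:

   * Slower cooling (.90 vs .85) and milder range adjustment (1.5 vs 2.0)
   nlpar(aschedule=0.90,avadjust=1.5)
   maximize(pmethod=annealing,piters=5,method=bfgs) logl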