4  Optimization: stochastic methods

4.1 Problems with steepest descent

  • (see previous slide) steepest descent may be very slow
  • Main problem: a minimum may be local; other initial values may lead to other, better minima
  • Cure: apply Gauss-Newton from many different starting points (cumbersome, costly, CPU intensive)
  • Global search:
    • apply a grid search – curse of dimensionality: e.g. for three parameters, 50 grid nodes along each direction gives \(50^3 = 125000\) evaluations (see the sketch after this list)
    • apply random sampling (same problem)
    • use search methods that not only go downhill:
      • Metropolis-Hastings (sampling)
      • Simulated Annealing (optimizing)
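
To make the curse of dimensionality concrete, here is a minimal sketch of a brute-force grid search over three parameters; the objective function `f` and the parameter ranges are hypothetical stand-ins, not part of the slides.

```python
import itertools

import numpy as np

def f(theta):
    # hypothetical objective: sum of squared deviations from an
    # (in practice unknown) optimum at (1, 2, 3)
    return sum((t - opt)**2 for t, opt in zip(theta, (1.0, 2.0, 3.0)))

# 50 grid nodes per parameter, 3 parameters: 50**3 = 125000 evaluations
grid_1d = np.linspace(-5.0, 5.0, 50)
best_val, best_theta = float("inf"), None
for theta in itertools.product(grid_1d, grid_1d, grid_1d):
    val = f(theta)
    if val < best_val:
        best_val, best_theta = val, theta

print(best_theta, best_val)  # the grid node closest to (1, 2, 3)
```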

4.2 Metropolis-Hastings

Why would one want probabilistic search?

  • global: unlikely areas are searched too (with small probability)
  • a probability distribution is richer than a point estimate:

Gauss-Newton provides an estimate \(\hat\theta\) of \(\theta\), given data \(y\). What about the estimation error \(\hat\theta - \theta\)? Second-order derivatives give approximations to standard errors, but not the full distribution.

We explain the simplified version, the Metropolis algorithm.

Metropolis algorithm

Given a point \(x_t = (\theta_{1,t},\ldots,\theta_{p,t})\) in parameter space, we evaluate whether another point \(x'\) is a reasonable alternative. If accepted, we set \(x_{t+1}\leftarrow x'\); if not, we keep \(x_t\) and set \(x_{t+1}\leftarrow x_t\).

  • if \(P(x') > P(x_t)\), we accept \(x'\) and set \(x_{t+1}=x'\)
  • if \(P(x') \le P(x_t)\), then
    • we draw \(U\), a random uniform value from \([0,1]\), and
    • accept \(x'\) if \(U < \frac{P(x')}{P(x_t)}\)

Taken together, \(x'\) is accepted with probability \(\min\left(1, \frac{P(x')}{P(x_t)}\right)\). Often, \(x'\) is drawn from a normal distribution centered at \(x_t\): \(N(x_t,\sigma^2 I)\). Suppose we were to accept it always; then \[x_{t+1}=x_t + e_t\] with \(e_t \sim N(0,\sigma^2 I)\), i.e., a pure random walk. Looks familiar?
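
A minimal sketch of this random-walk Metropolis algorithm in Python; the target `log_p` (a standard normal, up to its normalizing constant) and all tuning values are assumptions for illustration. Working on the log scale, the test \(U < P(x')/P(x_t)\) becomes \(\log U < \log P(x') - \log P(x_t)\), which avoids numerical underflow.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_p(x):
    # log of an (unnormalized) target density; here a standard normal
    # without its constant, as a stand-in for the parameter distribution
    return -0.5 * np.sum(x**2)

def metropolis(log_p, x0, sigma, n_steps):
    """Random-walk Metropolis with N(x_t, sigma^2 I) proposals."""
    x = np.asarray(x0, dtype=float)
    chain, accepted = [x.copy()], 0
    for _ in range(n_steps):
        x_new = x + rng.normal(scale=sigma, size=x.shape)  # propose x'
        # accept if log U < log P(x') - log P(x_t); this also covers
        # the P(x') > P(x_t) case automatically, since log U < 0
        if np.log(rng.uniform()) < log_p(x_new) - log_p(x):
            x, accepted = x_new, accepted + 1
        chain.append(x.copy())
    return np.array(chain), accepted / n_steps

chain, acc_rate = metropolis(log_p, x0=[5.0, 5.0], sigma=1.0, n_steps=5000)
print(acc_rate)
```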

Burn-in, tuning \(\sigma^2\)

  • When run for a long time, the Metropolis algorithm (and its generalization, Metropolis-Hastings) provides a correlated sample from the parameter distribution
  • The M and MH algorithms provide Markov Chain Monte Carlo (MCMC) samples; another, even more popular, algorithm is the Gibbs sampler (WinBUGS).
  • As the starting value may be quite unlikely, the first part of the chain (burn-in) is usually discarded.
  • If \(\sigma^2\) is too small, the chain mixes too slowly (consecutive samples are too similar, and do not describe the full PDF)
  • If \(\sigma^2\) is too large, most proposal values are not accepted
  • Often, during burn-in, \(\sigma^2\) is tuned such that the acceptance rate is close to 60% (see the sketch after this list)
  • Many chains can be run in parallel, using different starting values
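
A crude sketch of tuning \(\sigma\) during burn-in: run short batches and nudge \(\sigma\) up when the acceptance rate is above the target, down when it is below. The multiplicative update rule, batch sizes, and the target rate of 60% (taken from the slide above) are illustrative assumptions, not a standard recipe.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_p(x):
    # hypothetical unnormalized log target density (standard normal)
    return -0.5 * np.sum(x**2)

def tune_sigma(log_p, x0, sigma=1.0, target=0.60, batches=50, batch_size=100):
    """Crude burn-in tuning: after each batch of proposals, nudge sigma
    so the observed acceptance rate moves toward the target rate."""
    x = np.asarray(x0, dtype=float)
    for _ in range(batches):
        accepted = 0
        for _ in range(batch_size):
            x_new = x + rng.normal(scale=sigma, size=x.shape)
            if np.log(rng.uniform()) < log_p(x_new) - log_p(x):
                x, accepted = x_new, accepted + 1
        rate = accepted / batch_size
        # too many acceptances: steps too small, so widen the proposal;
        # too few acceptances: shrink it
        sigma *= np.exp(rate - target)
    return x, sigma

x_burned, sigma = tune_sigma(log_p, x0=[5.0, 5.0])
print(sigma)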

[Trace plots, figures omitted: little mixing (too low acceptance rate); better mixing; still better mixing; little mixing (too high acceptance rate: too much autocorrelation)]

Likelihood ratio – side track

For evaluating acceptance, the ratio \(\frac{P(x')}{P(x_t)}\) is needed, not the individual values.

This means that \(P(x')\) and \(P(x_t)\) are only needed up to a normalizing constant: if we have the values \(aP(x')\) and \(aP(x_t)\), then that is sufficient, as \(a\) cancels out.

This result is key to why MCMC and M-H are the workhorses of Bayesian statistics: there, the normalized \(P(x')\) is extremely hard to find because it calls for the evaluation of a very high-dimensional integral (the normalizing constant that makes sure that \(P(\cdot)\) is a probability), but \(aP(x')\), the likelihood of \(x\) given the data, is much easier to find.
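
A small numerical illustration of this cancellation, on the log scale where the constant \(a\) becomes an additive term \(\log a\) that drops out of the difference; the standard normal densities are chosen purely for illustration.

```python
import numpy as np

def log_p_unnorm(x):
    # log(a * P(x)) for a standard normal: the constant log(a) is dropped
    return -0.5 * x**2

def log_p_norm(x):
    # fully normalized standard normal log density
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

x_t, x_new = 0.3, 1.2
# the constant is additive on the log scale and cancels in the difference:
print(log_p_unnorm(x_new) - log_p_unnorm(x_t))  # -0.675
print(log_p_norm(x_new) - log_p_norm(x_t))      # -0.675, identical
```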

Likelihood function for normal distribution

Normal probability density function: \[Pr(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\] Likelihood for \(n\) independent observations: \[Pr(x_1,x_2,\ldots,x_n;\mu,\sigma) = \prod_{i=1}^n Pr(x_i)\] which, for fixed \(\sigma\), is proportional to \[\exp\left(-\frac{\sum_{i=1}^n(x_i-\mu)^2}{2\sigma^2}\right)\]
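
A sketch of this likelihood in Python, evaluated on the log scale (the product of many small densities underflows quickly); the synthetic sample and the parameter values are made up for illustration.

```python
import numpy as np

def norm_loglik(x, mu, sigma):
    """Log-likelihood of (mu, sigma) for an i.i.d. normal sample x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return (-n * np.log(sigma * np.sqrt(2 * np.pi))
            - np.sum((x - mu)**2) / (2 * sigma**2))

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=100)  # synthetic sample
print(norm_loglik(x, mu=2.0, sigma=1.5))      # near the true parameters
print(norm_loglik(x, mu=0.0, sigma=1.5))      # much lower
```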

4.3 Simulated annealing

Simulated Annealing is a related global search algorithm; it does not sample the full parameter distribution, but searches for the (global) optimum.

The analogy with annealing, the forming of crystals in a slowly cooling substance, is the following:

The current solution is replaced by a worse “nearby” solution with a certain probability that depends on the degree to which the “nearby” solution is worse, and on the temperature of the cooling process; this temperature slowly decreases, allowing fewer and smaller changes.

At the start, the temperature is high and the search is close to random; as the temperature decreases, the search becomes more and more local and downhill. The random uphill jumps prevent SA from getting stuck in a local minimum.
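
A minimal sketch of simulated annealing for a generic objective; the test function (a Rastrigin-type surface with many local minima), the geometric cooling schedule, and all tuning constants are assumptions; many variants exist.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(theta):
    # hypothetical multimodal objective to minimize (Rastrigin-type)
    return np.sum(theta**2) + 10 * np.sum(1 - np.cos(2 * np.pi * theta))

def simulated_annealing(f, x0, sigma=0.5, t0=10.0, cooling=0.999, n_steps=20000):
    x = np.asarray(x0, dtype=float)
    f_x, t = f(x), t0
    best_x, best_f = x.copy(), f_x
    for _ in range(n_steps):
        x_new = x + rng.normal(scale=sigma, size=x.shape)  # "nearby" solution
        f_new = f(x_new)
        # always accept improvements; accept a worse solution with a
        # probability that shrinks with the loss and with the temperature
        if f_new < f_x or rng.uniform() < np.exp(-(f_new - f_x) / t):
            x, f_x = x_new, f_new
            if f_x < best_f:
                best_x, best_f = x.copy(), f_x
        t *= cooling  # slowly cool down
    return best_x, best_f

print(simulated_annealing(f, x0=[4.0, -3.5]))
```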

A related stochastic optimization algorithm is the Genetic Algorithm.