Levenberg-Marquardt Method

The two methods we described above have problems. (1) The steepest descent method has no good way to determine the length of the step. (2) Newton's method is based on solving the linear system in Eq. (41), and the matrix to be inverted can be singular. (3) Moreover, unless it is started close to the minimum, Newton's method sometimes leads to divergent oscillations that move away from the answer: it overshoots, then overcompensates, and so on. We first discuss the cure for the second problem and then show how the first and third problems can be made to cure each other.

The standard cure to the singular matrix problem is to drop the second partials in the expression for $M$, to give

\begin{displaymath}
M_{j,k} = \sum_{i=1}^{N} \frac{1}{\sigma_i^2}
\frac{\partial\bar y_{i,model}}{\partial a_{j}}
\frac{\partial\bar y_{i,model}}{\partial a_{k}}.
\end{displaymath} (46)

This matrix is positive definite (and so nonsingular) as long as there are no more fitting parameters than data points and the fitting function varies independently in the parameters (e.g., we could not determine $a_{1}$ and $a_{2}$ separately if the fitting function depends on them only through the combination $a_{1} + a_{2}$). A brief justification for this modification is given by Press et al. in “Numerical Recipes”. The main argument is that the omitted second-derivative term contains a factor representing the departure of the measured value from the theoretical value. If the parameters correspond to a reasonable fit, these factors are randomly positive and negative and tend to cancel, so the sum is small. Outliers can make the omitted term large, but their influence on the search for the minimum is probably best discarded anyway.
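As a sketch (in Python; the function and variable names are illustrative, not from the text), the matrix of Eq. (46) can be assembled from the Jacobian of the model with respect to the parameters:

```python
import numpy as np

def curvature_matrix(jac, sigma):
    # M_{j,k} = sum_i (1/sigma_i^2) (d y_i/d a_j)(d y_i/d a_k):
    # Eq. (46), with the second-partial terms dropped.
    w = 1.0 / sigma**2                       # weights 1/sigma_i^2
    return jac.T @ (w[:, None] * jac)

# Straight-line model y = a1 + a2*x: Jacobian columns are [1, x].
x = np.array([0.0, 1.0, 2.0, 3.0])
jac = np.column_stack([np.ones_like(x), x])
sigma = np.ones_like(x)
M = curvature_matrix(jac, sigma)
```

Since this is a weighted Gram matrix $J^{T}WJ$, it is positive definite whenever the columns of the Jacobian are linearly independent, which is exactly the independent-variation condition stated above.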

The problem with overshooting can be solved by a method of Levenberg and Marquardt that combines steepest descent with Newton. The idea is to modify the matrix $M$ once more to have the form

\begin{displaymath}
M_{j,k} \rightarrow \tilde M_{j,k} =
M_{j,k} + \lambda M_{j,j} \delta_{j,k}.
\end{displaymath} (47)

In other words we increase the diagonal matrix elements in $M$ by a factor $(1+\lambda)$. Now suppose $\lambda$ is very large. Then the diagonal term dominates and the solution to Eq. (41) is just
\begin{displaymath}
\delta a_{j} = -\frac{1}{2(1+\lambda)M_{j,j}}
\frac{\partial \chi^{2}}{\partial a_{j}},
\end{displaymath} (48)

which is the steepest descent method but with an explicit choice for the step length in each direction. On the other hand, if $\lambda$ is very small, then the method becomes more Newton-like.
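The two limits can be seen concretely in a minimal sketch (assuming the linear system of Eq. (41) has the form $\tilde M\,\delta a = -\frac{1}{2}\,\partial\chi^{2}/\partial a$, as Eq. (48) suggests; the numbers are made up for illustration):

```python
import numpy as np

def lm_step(M, grad_chi2, lam):
    # Solve (M + lam*diag(M)) da = -(1/2) grad(chi^2): the system of
    # Eq. (41) with the modified matrix of Eq. (47).
    M_tilde = M + lam * np.diag(np.diag(M))
    return np.linalg.solve(M_tilde, -0.5 * grad_chi2)

M = np.array([[4.0, 6.0], [6.0, 14.0]])
grad = np.array([2.0, -1.0])

newton_like = lm_step(M, grad, lam=1e-8)    # small lambda: Newton step
descent_like = lm_step(M, grad, lam=1e8)    # large lambda: Eq. (48) limit
```

For large $\lambda$ the off-diagonal elements are negligible compared with $(1+\lambda)M_{j,j}$, so the step collapses to the componentwise form of Eq. (48).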

So how do we pick $\lambda$? We could start the Levenberg-Marquardt scheme with a small value of $\lambda$. Then we are doing Newton. If we overshoot the minimum, we would get an increase in $\chi^{2}$. Then we back up and increase $\lambda$ by, say, a factor of 10 and try again. Eventually $\lambda$ may be very large. Then the method is more like steepest descent with a step size that gets smaller each time we increase $\lambda$. So by continuing to increase $\lambda$ we are guaranteed to decrease $\chi^{2}$ eventually. If $\chi^{2}$ decreases, on the other hand, we decrease $\lambda$ by a factor of 10. In this way the method tunes $\lambda$ to the circumstances. As we approach the minimum, we expect Newton to do better, so we should find, finally, that a successful search concludes with a small value of $\lambda$.
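The factor-of-10 update schedule described above can be sketched as follows (the driver function and its interface are illustrative assumptions, not a prescribed implementation):

```python
import numpy as np

def levenberg_marquardt(chi2, grad, curv, a0, lam=1e-3, tol=1e-8, max_iter=200):
    # chi2(a) -> scalar, grad(a) -> d(chi^2)/da, curv(a) -> M of Eq. (46).
    a = np.asarray(a0, dtype=float)
    f = chi2(a)
    for _ in range(max_iter):
        M = curv(a)
        M_tilde = M + lam * np.diag(np.diag(M))        # Eq. (47)
        da = np.linalg.solve(M_tilde, -0.5 * grad(a))  # trial step
        f_new = chi2(a + da)
        if f_new < f:                 # success: accept, trust Newton more
            converged = f - f_new < tol
            a, f = a + da, f_new
            lam /= 10.0
            if converged:
                break
        else:                         # overshoot: back up, go toward descent
            lam *= 10.0
    return a, f, lam

# Toy use: straight-line fit y = a1 + a2*x to exact data, sigma_i = 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
J = np.column_stack([np.ones_like(x), x])
chi2 = lambda a: np.sum((y - J @ a) ** 2)
grad = lambda a: -2.0 * J.T @ (y - J @ a)
curv = lambda a: J.T @ J
a_fit, f_fit, _ = levenberg_marquardt(chi2, grad, curv, [0.0, 0.0])
```

Note that a rejected step leaves the parameters untouched: only $\lambda$ changes, and the smaller step is retried from the same point, exactly as described above.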

How much should we fuss over the determination of the minimum of $\chi^{2}$? As far as the parameters themselves are concerned, we should really be content with getting $\chi^{2}$ to within one of its true minimum. The reason is that an increase in $\chi^{2}$ by one corresponds roughly to a change of one standard deviation in the fitting parameters, so it is “in the noise”. However, the estimate of the error in the parameters through formula (45) is often quite sensitive to how close we are to the true minimum, so it is a good idea to adjust the stopping criterion for the minimization in accordance with the resulting effect on the error estimate.
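Formula (45) is not reproduced in this excerpt; for $M$ as in Eq. (46), the parameter covariance is conventionally estimated as $C = M^{-1}$ evaluated at the minimum, and the following sketch assumes that form:

```python
import numpy as np

# Assuming the standard error formula C = M^{-1} (our reading of the
# text's formula (45)): the one-standard-deviation parameter errors are
# the square roots of the diagonal of the inverse curvature matrix.
M = np.array([[4.0, 6.0], [6.0, 14.0]])   # M of Eq. (46) at the minimum
C = np.linalg.inv(M)
param_errors = np.sqrt(np.diag(C))        # sigma_{a_j} = sqrt(C_{jj})
```

This covariance is the quantity the text warns is sensitive to how close the minimization got to the true minimum, since $M$ is evaluated at the point where the search stopped.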