# Maximum Likelihood and Chi Square

Although the least squares method gives us the best estimate of the parameters $a$ and $b$, it is also very important to know how well determined these best values are. In other words, if we repeated the experiment many times under the same conditions, what range of values of these parameters would we get? To answer this question, we use a maximum likelihood method.

We start by assuming a probability distribution for the entire set of measurements $\{y_i\}$. We assume that the measurements of the $N$ data points are independent of each other, and that each one follows a Gaussian (normal) distribution with mean value $\bar y_i$ and standard deviation $\sigma_i$. The probability that a single experiment results in the set of values $\{y_1, \ldots, y_N\}$ is then just a product of the individual Gaussians:

$$
P(y_1, \ldots, y_N) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left[-\frac{(y_i - \bar y_i)^2}{2\sigma_i^2}\right] \qquad (20)
$$

A more complicated expression would be needed if correlations were present between the measurements of the different $y_i$. For now, we take this expression as the simplest choice.

The idea of maximum likelihood is to replace the ideal mean values $\bar y_i$ with the theoretically “expected” values $a + b x_i$ predicted by the linear-function model. The probability distribution then becomes a conditional probability $P(\{y_i\} \mid a, b)$. In other words, assuming that the intercept and slope are $a$ and $b$, it gives the probability of getting the result $\{y_i\}$ in a single measurement. But then we use the power of Bayes's theorem to turn it around and reinterpret it as $P(a, b \mid \{y_i\})$: the probability that, given the experimental result, the linear relationship is given by the parameters $a$ and $b$. Dropping the list of data points, we write this probability as

$$
P(a, b) \propto e^{-\chi^2/2}, \qquad (21)
$$

where

$$
\chi^2 = \sum_{i=1}^{N} \frac{(y_i - a - b x_i)^2}{\sigma_i^2}. \qquad (22)
$$

(The “proportional to” symbol $\propto$ is there because our reinterpretation of the probability in terms of $a$ and $b$ requires us to redo the normalization, so that the total probability is one.)

The probability $P(a, b)$ is called the likelihood function for the parameter values $a$ and $b$. We want to find the values $a^*$ and $b^*$ that are most probable, i.e. that maximize the likelihood function. Since $P(a, b) \propto e^{-\chi^2/2}$, this condition is equivalent to requiring that we minimize $\chi^2$, and it leads to the result discussed in the first section. But now we also have a way to estimate the reliability of our determination of the best values $a^*$ and $b^*$.
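As a concrete illustration, minimizing $\chi^2$ for the straight-line model reduces to solving a 2×2 linear system (the normal equations). The following sketch uses made-up data — the arrays `x`, `y`, and `sigma` are hypothetical, not from the text:

```python
import numpy as np

# Hypothetical measurements: x values, observed y values, per-point errors.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
sigma = np.array([0.2, 0.2, 0.3, 0.3, 0.4])

def chi2(a, b):
    """Chi-square of the straight-line model y = a + b*x, Eq. (22)."""
    return np.sum(((y - a - b * x) / sigma) ** 2)

# Setting the gradient of chi^2 to zero gives the normal equations
# M @ (a, b) = v, with weights w_i = 1/sigma_i^2.
w = 1.0 / sigma**2
M = np.array([[w.sum(),       (w * x).sum()],
              [(w * x).sum(), (w * x**2).sum()]])
v = np.array([(w * y).sum(), (w * x * y).sum()])
a_best, b_best = np.linalg.solve(M, v)

print(f"a* = {a_best:.3f}, b* = {b_best:.3f}, chi2_min = {chi2(a_best, b_best):.3f}")
```

Because $\chi^2$ is exactly quadratic in $a$ and $b$, this linear solve finds the minimum in one step; no iterative optimizer is needed.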

From the expression (19) we see that $P(a, b)$ is similar to a normal distribution in the variables $a$ and $b$, except that instead of one variable we have two: instead of a simple quadratic in the exponent, we have a quadratic form in the exponent. Once we have realized this, we can use standard results to estimate the error in the best-fit values.

The variance $\sigma_a^2$ in the parameter $a$ is determined from the formula

$$
\sigma_a^2 = \left\langle (a - a^*)^2 \right\rangle = \int da\, db\, (a - a^*)^2\, P(a, b). \qquad (23)
$$

The standard deviation $\sigma_a$ is the square root of the variance. This expression is a generalization of the one we used when we were dealing with a probability distribution in a single variable. It takes some not-so-difficult calculus to do the integral, but we skip it here and just quote the result in terms of the matrix $M$ defined in Eq. (10):
$$
\sigma_a^2 = \left( M^{-1} \right)_{aa}. \qquad (24)
$$

Likewise, the error in $b$ is just
$$
\sigma_b^2 = \left( M^{-1} \right)_{bb}. \qquad (25)
$$

The inverse $M^{-1}$ is called the “error matrix” for this reason. It is also called the “covariance matrix” for the best-fit parameters. Thus the diagonal matrix elements of $M^{-1}$ give the variances of the best-fit parameters. The off-diagonal elements contain information about correlations between the best-fit parameter values. That correlation is determined from the formula
$$
\operatorname{cov}(a, b) = \left\langle (a - a^*)(b - b^*) \right\rangle = \int da\, db\, (a - a^*)(b - b^*)\, P(a, b), \qquad (26)
$$

which results in
$$
\operatorname{cov}(a, b) = \left( M^{-1} \right)_{ab}. \qquad (27)
$$

The correlations between slope and intercept are also characterized by the correlation coefficient $\rho_{ab}$, which is given by
$$
\rho_{ab} = \frac{\operatorname{cov}(a, b)}{\sigma_a\, \sigma_b} = \frac{\left( M^{-1} \right)_{ab}}{\sqrt{\left( M^{-1} \right)_{aa} \left( M^{-1} \right)_{bb}}}. \qquad (28)
$$

Correlations are discussed further in Sec. 3 below.
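The error-matrix prescription of Eqs. (24)–(28) can be sketched numerically: build $M$, invert it, and read off $\sigma_a$, $\sigma_b$, and $\rho_{ab}$. The data arrays below are the same hypothetical values as in the earlier sketch, re-declared so the snippet is self-contained:

```python
import numpy as np

# Hypothetical data (not from the text), for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
sigma = np.array([0.2, 0.2, 0.3, 0.3, 0.4])

# Build the matrix M from the weights w_i = 1/sigma_i^2.
w = 1.0 / sigma**2
M = np.array([[w.sum(),       (w * x).sum()],
              [(w * x).sum(), (w * x**2).sum()]])

cov = np.linalg.inv(M)          # the "error matrix" / covariance matrix
sigma_a = np.sqrt(cov[0, 0])    # Eq. (24): error on the intercept a
sigma_b = np.sqrt(cov[1, 1])    # Eq. (25): error on the slope b
rho_ab = cov[0, 1] / (sigma_a * sigma_b)  # Eq. (28)

print(sigma_a, sigma_b, rho_ab)
```

With all $x_i \ge 0$, the off-diagonal element $(M^{-1})_{ab}$ is negative, so the slope and intercept are anticorrelated: raising the slope forces the intercept down to keep the line through the data.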

Written explicitly, we have

$$
\sigma_a^2 = \frac{1}{D} \sum_i \frac{x_i^2}{\sigma_i^2}, \qquad (29)
$$

$$
\sigma_b^2 = \frac{1}{D} \sum_i \frac{1}{\sigma_i^2}, \qquad (30)
$$

$$
\operatorname{cov}(a, b) = -\frac{1}{D} \sum_i \frac{x_i}{\sigma_i^2}, \qquad (31)
$$

where $D$ is the determinant of $M$,

$$
D = \det M = \left( \sum_i \frac{1}{\sigma_i^2} \right) \left( \sum_i \frac{x_i^2}{\sigma_i^2} \right) - \left( \sum_i \frac{x_i}{\sigma_i^2} \right)^2.
$$

This is an important result, since it allows us to assign confidence ranges for the best-fit parameters $a^*$ and $b^*$ and to determine how they are correlated.
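As a consistency check, the explicit sums in Eqs. (29)–(31) should reproduce the entries of $M^{-1}$ exactly. A quick numerical comparison, again using the same hypothetical data:

```python
import numpy as np

# Hypothetical data, for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
sigma = np.array([0.2, 0.2, 0.3, 0.3, 0.4])

w = 1.0 / sigma**2
S, Sx, Sxx = w.sum(), (w * x).sum(), (w * x**2).sum()
D = S * Sxx - Sx**2              # determinant of M

# Explicit formulas, Eqs. (29)-(31):
var_a = Sxx / D
var_b = S / D
cov_ab = -Sx / D

# Compare against the matrix-inverse form, Eqs. (24)-(27):
Minv = np.linalg.inv(np.array([[S, Sx], [Sx, Sxx]]))
print(np.allclose([var_a, var_b, cov_ab],
                  [Minv[0, 0], Minv[1, 1], Minv[0, 1]]))
```

The agreement follows from the closed form of a 2×2 inverse: $M^{-1} = \frac{1}{D}\begin{pmatrix} S_{xx} & -S_x \\ -S_x & S \end{pmatrix}$.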