
Populations and Their Means and Standard Deviations

In order to develop confidence in the result of a measurement of a single quantity, such as the length of a table top, we often repeat the measurement process a number of times. The results of the measurement vary because of difficulties in reading the meter stick scale to the last tenth of a millimeter, and for other reasons. Suppose we repeated the measurement $N$ times, getting a list of values $x_i$. Our best guess for the true value is usually the average of these values:

\begin{displaymath}
\bar x^* = \sum_{i=1}^N x_i/N,
\end{displaymath} (6)

which is also called the ``mean'' value of this sample set of observations. In our notation, $\bar x^*$ indicates our best, imperfect estimate of the true value $\bar x$. If we could repeat the measurement an infinite number of times, the mean value would ideally approach the ``true'' value of the measurement. The statistical way to describe what is happening is that our set of $N$ measurements is a sample of $N$ values taken from an infinite ``population''. The true population mean is given by
\begin{displaymath}
\bar x = \lim_{N \rightarrow \infty} \sum_{i=1}^N x_i/N.
\end{displaymath} (7)
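The approach of the sample mean to the population mean can be illustrated with a short simulation. This is only a sketch: the function names, the "true" value of 1.5, the spread of 0.002, and the choice of Gaussian scatter are all assumptions made for illustration and are not part of the notes.

```python
import random

def sample_mean(values):
    """Return the sample mean, sum(x_i)/N, as in Eq. (6)."""
    return sum(values) / len(values)

def simulated_measurements(n, true_value=1.5, spread=0.002, seed=0):
    """Draw n hypothetical measurements scattered about true_value.

    The Gaussian scatter and the default numbers are illustrative
    assumptions, not data from a real experiment.
    """
    rng = random.Random(seed)
    return [rng.gauss(true_value, spread) for _ in range(n)]

if __name__ == "__main__":
    # The sample mean should settle toward 1.5 as N grows.
    for n in (10, 1000, 100000):
        print(n, sample_mean(simulated_measurements(n)))
```

With a fixed seed the run is reproducible, and the printed means should drift closer to the assumed true value as $N$ increases.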

We might ask of this infinite population, what is the probability of getting a value of $x$ in the range $(x,x+dx)$ when we make a measurement? This probability is expressed in terms of a probability function $P(x)$ as $P(x)dx$. The factor $dx$ is necessary because as the interval width $dx$ gets smaller, the probability of getting a value in that tiny range must get smaller in proportion to $dx$. If we make enough measurements, we can begin to construct this probability function, but usually we don't make enough measurements to know it very well. So we often assume for want of any better reason that the probability is given by the Gaussian distribution function (normal distribution)

\begin{displaymath}
P(x) = \exp[-(x - \bar x)^2/(2\sigma^2)]/(\sqrt{2 \pi} \sigma)
\end{displaymath} (8)

In this expression the true mean of the population is $\bar x$ and $\sigma$ is the true ``standard deviation''. This probability is normalized so that
\begin{displaymath}
\int_{-\infty}^{\infty} P(x)dx = 1.
\end{displaymath} (9)

i.e. the probability of measuring any value of $x$ is 1. The Gaussian distribution is peaked at $x = \bar x$ and falls off on either side of $\bar x$ over a distance in $x$ that is controlled by the value of $\sigma$. If $\sigma$ is large, the falloff is slow and the most probable values of $x$ lie in a broad range around $\bar x$; if $\sigma$ is small, the falloff is rapid, and the most probable values of $x$ are narrowly clustered around $\bar x$. A property of the Gaussian distribution is that the probability of making a measurement and getting a value between $\bar x - \sigma$ and $\bar x + \sigma$ is about 68%. (This value is found by integrating the probability distribution from $\bar x - \sigma$ to $\bar x + \sigma$.) Thus in common usage, we say that for a single measured value of $x$, the result is $\bar x \pm \sigma$. The standard deviation of a quantity is sometimes called the ``error'' in that quantity, so we say the error in a single measurement is $\sigma$. The statement that $x$ lies in the range $\bar x \pm \sigma$ is one we can make with 68% confidence. That means the result of a measurement is likely to fall outside this range about 32% of the time when we repeat the experiment.
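Both the normalization and the 68% figure can be checked numerically. The sketch below integrates the Gaussian with a simple midpoint rule and compares the one-sigma probability with the closed form ${\rm erf}(1/\sqrt{2})$. The helper names and the values chosen for $\bar x$ and $\sigma$ are hypothetical.

```python
import math

def gaussian_pdf(x, mean, sigma):
    """Gaussian probability density, Eq. (8)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

def integrate(f, a, b, steps=20000):
    """Midpoint-rule estimate of the integral of f from a to b."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

mean, sigma = 1.5, 0.002  # hypothetical population parameters

def pdf(x):
    return gaussian_pdf(x, mean, sigma)

# A range of +/- 10 sigma stands in for the infinite limits of Eq. (9).
total = integrate(pdf, mean - 10 * sigma, mean + 10 * sigma)

# Probability of landing within one standard deviation of the mean.
within_one_sigma = integrate(pdf, mean - sigma, mean + sigma)

# Closed form for the same probability.
exact = math.erf(1.0 / math.sqrt(2.0))

print(total, within_one_sigma, exact)
```

The result does not depend on the particular $\bar x$ and $\sigma$ chosen, since the one-sigma probability is the same for every Gaussian.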

A measure of the width of this peak is given by

\begin{displaymath}
{\rm Var}(x) = \sigma^2 = \int_{-\infty}^{\infty}(x - \bar x)^2 P(x)dx
\end{displaymath} (10)

This is just the average of $(x - \bar x)^2$ over the population.
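As a rough check of this statement, one can draw a large sample from a Gaussian population and average $(x - \bar x)^2$ directly. This is a Monte Carlo sketch under assumed values of $\bar x$ and $\sigma$; the function name and the sample size are arbitrary choices.

```python
import random

def mean_square_deviation(samples, true_mean):
    """Average of (x - true_mean)^2 over the given samples."""
    return sum((x - true_mean) ** 2 for x in samples) / len(samples)

true_mean, sigma = 1.5, 0.002  # hypothetical population parameters
rng = random.Random(1)         # fixed seed so the run is reproducible
samples = [rng.gauss(true_mean, sigma) for _ in range(200000)]

# For a large sample this ratio should be close to 1.
ratio = mean_square_deviation(samples, true_mean) / sigma ** 2
print(ratio)
```

Here the true mean is used inside the average because the population parameters are known in a simulation; with real data only an estimate of the mean is available.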

If we made an infinite number of measurements, we would be able to determine the two parameters $\bar x$ and $\sigma$ of the distribution exactly. With a finite set of measurements, however, we can only estimate them. To estimate the mean value, we simply compute the average of the measurements $x_{i}$:

\begin{displaymath}
\bar x^{*} = \langle x\rangle = \sum_{i=1}^{N} x_{i}/N.
\end{displaymath} (11)

Notice that we have put a star on $\bar x^{*}$ to distinguish the estimate from the true value $\bar x$. The sample also permits an estimate of the population standard deviation $\sigma$. Its square is estimated by
\begin{displaymath}
{\rm Var}(x_{1}, x_{2}, \ldots{}) = \sigma^{*2}
= \sum_{i=1}^N (x_i - \langle x\rangle)^2/(N-1).
\end{displaymath} (12)

The quantity $\sigma^*$ is the estimated standard deviation, and its square is called the estimated variance of $x$ about the mean, or just the estimated variance of $x$.
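The two estimates can be computed in a few lines. The following is a minimal sketch, not a prescribed solution: the function names and the list of measurements are made up for illustration, and the comparison with Python's standard `statistics` module is included only as a cross-check.

```python
import math
import statistics

def estimated_mean(values):
    """Sample mean, Eq. (11)."""
    return sum(values) / len(values)

def estimated_std(values):
    """Estimated standard deviation, the square root of Eq. (12)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean = estimated_mean(values)
    # Divide by N - 1, not N, as in Eq. (12).
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

# Made-up measurements, for illustration only.
measurements = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(estimated_mean(measurements), estimated_std(measurements))

# statistics.stdev also uses the N - 1 denominator.
print(statistics.mean(measurements), statistics.stdev(measurements))
```

For this list the squared deviations sum to 32, so the estimated variance is $32/7$.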

Another useful formula is obtained by expanding the square on the right side to give

\begin{displaymath}
(N-1)\sigma^{*2} = \sum x_i^2 - 2 \langle x\rangle \sum x_i + N \langle x\rangle^2
= N(\langle x^2\rangle - \langle x\rangle^2).
\end{displaymath} (13)

The $\langle x^2\rangle$ means the average of $x_i^2$. In other words, the estimated variance is $N/(N-1)$ times the difference between the average of the squares and the square of the average.
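This form is convenient because the sums of $x_i$ and $x_i^2$ can be accumulated in a single pass over the data. The sketch below compares it with the direct two-pass definition; the function names and the sample data are hypothetical.

```python
def variance_two_pass(values):
    """Estimated variance from the definition: sum (x_i - <x>)^2 / (N - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def variance_from_sums(values):
    """Estimated variance from running sums: N (<x^2> - <x>^2) / (N - 1)."""
    n = 0
    sum_x = 0.0
    sum_x2 = 0.0
    for x in values:  # a single pass accumulates both sums
        n += 1
        sum_x += x
        sum_x2 += x * x
    mean_x = sum_x / n
    mean_x2 = sum_x2 / n
    return n * (mean_x2 - mean_x ** 2) / (n - 1)

# Made-up measurements, for illustration only.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(variance_two_pass(data), variance_from_sums(data))
```

One caution: when the mean is large compared with the spread, the subtraction $\langle x^2\rangle - \langle x\rangle^2$ loses precision in floating-point arithmetic, so the two-pass form is usually the safer choice.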

As an exercise in this course, you will be asked to write a program that reads a list of values $x_i$ and calculates $\bar x^*$ and $\sigma^*$.


Carleton DeTar 2009-11-18