PHYCS 6730 Lab Exercise: Correlations, Bias, and Jackknife Method

These exercises explore some advanced problems that arise in nonlinear fits to data. The files you need are in ~p6730/exercises/jackknife. The handout "Jackknife Error Estimates" should help.

Exercise 1. The spectral data

The file data contains 161 measurements of a four-component vector y. The vector components are correlated. To see this, make a scatter plot of y1 vs. y2 using your favorite plotting utility. The correlations between adjacent components are stronger than those between more distant components; plot some other pairs to see this.
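A scatter plot shows the correlation by eye; the correlation coefficient puts a number on it. The sketch below computes it for every pair of components. Because the layout of the file data is not described here, the code uses a hypothetical synthetic data set (the function synthetic_rows is an invention for illustration) in which each component is the previous one plus fresh noise, so adjacent components are the most strongly correlated.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def synthetic_rows(n, seed=1):
    """Hypothetical stand-in for the data file: n rows of four correlated values."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row, level = [], 0.0
        for _ in range(4):
            level += rng.gauss(0.0, 1.0)  # component = previous component + new noise
            row.append(level)
        rows.append(row)
    return rows

rows = synthetic_rows(161)
columns = list(zip(*rows))  # columns[0] holds y1, columns[3] holds y4
for i in range(4):
    for j in range(i + 1, 4):
        r = pearson(columns[i], columns[j])
        print(f"corr(y{i + 1}, y{j + 1}) = {r:.3f}")
```

To apply this to the real measurements, replace synthetic_rows with code that reads the four columns from the file.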

Nothing to hand in.


Exercise 2. Correlation matrix

The Maple code in mdata constructs the correlation matrix for the data. Read this file into a Maple session and answer the following questions.

  1. What is the meaning of the subscripts on y[k,i]?
  2. What is the meaning of the variables yav[i] and sd[i]? For the latter, be sure you distinguish between the population standard deviation and the standard deviation of the mean.
  3. What is the meaning of the variables z[k,i]?
  4. The matrix h is the correlation matrix. Which pair of components has the largest correlation?
  5. Perform a singular value decomposition (SVD) of the matrix h. Is this the same as diagonalization by means of an orthogonal similarity transformation?
  6. The column vectors (singular vectors) of U (same as V) are sometimes called "principal factors". They are like the normal modes of a mechanical system and define the statistically independent orthogonal directions of fluctuation of the scaled data vectors z. Each eigenvalue gives the variance due to a fluctuation in the direction of the corresponding eigenvector. What is unique about the relative signs of the eigenvector components for the largest eigenvalue? This property indicates that the largest amplitude fluctuations in this data come from all components of z moving up and down together.
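As a rough illustration of questions 2 through 6, the sketch below builds a correlation matrix from scaled deviations and finds its dominant principal factor by power iteration. It assumes that z[k,i] = (y[k,i] - yav[i])/sd[i] and that h is the average of z[k,i] z[k,j] over measurements k; the Maple script may use different conventions (for example N-1 instead of N). The matrix h_example is hypothetical, not the matrix for the lab data, and power iteration is used here only because it needs no linear algebra library.

```python
import math

def correlation_matrix(rows):
    """Correlation matrix of the columns of rows (1/N normalization assumed)."""
    n, d = len(rows), len(rows[0])
    yav = [sum(r[i] for r in rows) / n for i in range(d)]
    sd = [math.sqrt(sum((r[i] - yav[i]) ** 2 for r in rows) / n) for i in range(d)]
    # Scaled deviations: zero mean and unit variance in every component.
    z = [[(r[i] - yav[i]) / sd[i] for i in range(d)] for r in rows]
    return [[sum(zk[i] * zk[j] for zk in z) / n for j in range(d)] for i in range(d)]

def dominant_factor(h, iterations=500):
    """Largest eigenvalue and its unit eigenvector of a symmetric matrix h."""
    d = len(h)
    v = [1.0 / math.sqrt(d)] * d  # start vector; must overlap the dominant eigenvector
    for _ in range(iterations):
        w = [sum(h[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    hv = [sum(h[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * hv[i] for i in range(d))  # Rayleigh quotient
    return eigenvalue, v

# Hypothetical correlation matrix with stronger correlations between neighbors.
h_example = [
    [1.0, 0.9, 0.8, 0.7],
    [0.9, 1.0, 0.9, 0.8],
    [0.8, 0.9, 1.0, 0.9],
    [0.7, 0.8, 0.9, 1.0],
]
eigenvalue, vector = dominant_factor(h_example)
print("largest eigenvalue:", eigenvalue)
print("eigenvector:", vector)
```

Compare the signs of the printed eigenvector components with what you find for the real matrix h in Maple.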

Exercise 3. Modeling the data

Our model for the data is an exponential decay:

    y_t = A exp(-m t)
where the vector components for each measurement are labeled t = 1, 2, 3, 4. We are especially interested in the value of m. Since there are four points and only two parameters, we would ordinarily do a least chi-square fit, preferably with measured correlations taken into account. But to keep it simple for this exercise, we use
    m = (1/2) log(y2/y4)
Calculate this value. (For this purpose use the mean values.) You should get approximately 0.42949.
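The simplified estimate follows from the model by taking the ratio of the second and fourth components, which cancels the amplitude A (log is the natural logarithm):

```latex
\frac{y_2}{y_4} = \frac{A\,e^{-2m}}{A\,e^{-4m}} = e^{2m}
\quad\Longrightarrow\quad
m = \frac{1}{2}\log\frac{y_2}{y_4}
```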

We need to calculate the error in m as well. The error in m can be determined naively from the errors in y2 and y4. The Maple script has the means and population standard deviations for each component y_t. Remember, the standard deviation in the mean is found from the population standard deviation by the formula

    st dev in mean = popn st dev / sqrt(N)
Since our N is 161, don't worry about the difference between N and N-1 here. Find dy2 and dy4.

If we ignore correlations between y2 and y4, the error in m is given by the naive expression

    dm = (1/2) sqrt[ (dy2/y2)^2 + (dy4/y4)^2 ]
Calculate this error. (Remember the values of sd in your Maple file should be converted to errors in the mean by dividing by sqrt(N).) You should get about 0.0002.
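The arithmetic of this exercise can be sketched in a few lines. The numerical inputs below are hypothetical placeholders, not the values from the Maple session; substitute your own means and population standard deviations. The function names are also just for illustration.

```python
import math

def decay_constant(y2, y4):
    """m = (1/2) log(y2/y4), using the natural logarithm."""
    return 0.5 * math.log(y2 / y4)

def naive_error(y2, sd2, y4, sd4, n):
    """Error in m ignoring the correlation between y2 and y4.

    sd2 and sd4 are population standard deviations; dividing by sqrt(n)
    converts them to errors in the means.
    """
    dy2 = sd2 / math.sqrt(n)
    dy4 = sd4 / math.sqrt(n)
    return 0.5 * math.sqrt((dy2 / y2) ** 2 + (dy4 / y4) ** 2)

# Hypothetical inputs, for illustration only.
y2_mean, sd_y2 = 0.50, 0.010
y4_mean, sd_y4 = 0.20, 0.008
N = 161

print("m  =", decay_constant(y2_mean, y4_mean))
print("dm =", naive_error(y2_mean, sd_y2, y4_mean, sd_y4, N))
```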

Exercise 4. Jackknife analysis

Here we estimate the error in the mean and check for bias using a single-elimination jackknife method. The Bourne shell script ~p6730/exercises/jackknife/jacksamples.sh generates a set of jackknife sample estimates of the mean m. These are obtained by casting out the ith data set, computing the average of the remaining N-1 y2 and y4 values and using the formula above to get m for that sample. This process is repeated for each i, resulting in N jackknife estimates.
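The procedure just described can be sketched as follows. This is not the content of jacksamples.sh, only an illustration of the same idea; the function name and the short lists of measurements are hypothetical.

```python
import math

def jackknife_samples(y2, y4):
    """Single-elimination jackknife estimates of m = (1/2) log(<y2>/<y4>).

    Sample i is computed from the averages of y2 and y4 with
    measurement i left out.
    """
    n = len(y2)
    total2, total4 = sum(y2), sum(y4)
    samples = []
    for a, b in zip(y2, y4):
        mean2 = (total2 - a) / (n - 1)  # average of the remaining N-1 y2 values
        mean4 = (total4 - b) / (n - 1)  # average of the remaining N-1 y4 values
        samples.append(0.5 * math.log(mean2 / mean4))
    return samples

# Hypothetical measurements, for illustration only.
y2_values = [0.51, 0.49, 0.52, 0.48, 0.50]
y4_values = [0.21, 0.19, 0.21, 0.19, 0.20]
for i, m in enumerate(jackknife_samples(y2_values, y4_values), start=1):
    print(f"sample {i}: m = {m:.5f}")
```

Subtracting each measurement from the running totals avoids recomputing the averages from scratch for every sample.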

Run the script as follows

   cd ~p6730/exercises/jackknife ; jacksamples.sh data | meansd
The code meansd calculates the mean and population standard deviation of the list of numbers that you feed it.

Use the result and the formulas in the jackknife handout to determine the jackknife estimated error in m, and compare it with the naive estimated error. Be sure to multiply the population standard deviation by the appropriate factors to convert it to the jackknife error in m. You should get approximately 0.00013.
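As a check on the conversion factors, the sketch below uses the standard single-elimination jackknife variance, (N-1)/N times the sum of squared deviations of the jackknife samples from their mean. The handout may write this differently, and whether meansd divides by N or by N-1 is not stated here, so the hypothetical helper error_from_sd covers both cases.

```python
import math

def jackknife_error(estimates):
    """Jackknife error from N single-elimination estimates."""
    n = len(estimates)
    mean = sum(estimates) / n
    sum_sq = sum((m - mean) ** 2 for m in estimates)
    return math.sqrt((n - 1) / n * sum_sq)

def error_from_sd(sd, n, ddof=0):
    """Same error, starting from the standard deviation of the estimates.

    ddof=0: sd was computed with 1/N     -> multiply by sqrt(N-1)
    ddof=1: sd was computed with 1/(N-1) -> multiply by (N-1)/sqrt(N)
    """
    if ddof == 0:
        return math.sqrt(n - 1) * sd
    return (n - 1) / math.sqrt(n) * sd

# Sanity check on a simple average, where the jackknife error must equal
# the ordinary standard deviation of the mean.
x = [1.0, 2.0, 3.0, 4.0]
leave_one_out = [(sum(x) - v) / (len(x) - 1) for v in x]
print("jackknife error of the mean:", jackknife_error(leave_one_out))
```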

Compare the jackknife mean and the mean from the full sample to check for sampling bias. Do you detect any?
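One way to make this comparison quantitative is the standard jackknife bias estimate, (N-1) times the difference between the mean of the jackknife samples and the full-sample value. The handout may phrase it differently; the function name and the numbers below are hypothetical.

```python
def jackknife_bias(full_estimate, estimates):
    """Standard jackknife estimate of the bias in full_estimate."""
    n = len(estimates)
    jack_mean = sum(estimates) / n
    return (n - 1) * (jack_mean - full_estimate)

# Hypothetical values, for illustration only.
m_full = 0.4300
m_jack = [0.4298, 0.4303, 0.4301, 0.4299]
bias = jackknife_bias(m_full, m_jack)
print("estimated bias:      ", bias)
print("bias-corrected value:", m_full - bias)
```

A bias that is small compared with the jackknife error is not statistically significant.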

Considering the correlations you have seen in this data, explain why the jackknife error estimate is smaller than the naive error estimate. Which one should you trust?

Note that the naive error propagation formula above can be modified to take account of the correlations in the data and should give a modified result consistent with the jackknife error. Such a method is certainly less costly than the jackknife approach. However, if the value of m came from a least chi-square fit, rather than just the log of a ratio, error propagation would be nontrivial and a jackknife analysis of error would be worth the effort.
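For reference, first-order error propagation with the correlation included gives the expression below. Here rho_24 (a symbol introduced for this note) is the correlation coefficient between y2 and y4, so the covariance of the two means is rho_24 dy2 dy4:

```latex
dm = \frac{1}{2}\sqrt{\left(\frac{dy_2}{y_2}\right)^2
   + \left(\frac{dy_4}{y_4}\right)^2
   - 2\,\rho_{24}\,\frac{dy_2}{y_2}\,\frac{dy_4}{y_4}}
```

Setting rho_24 = 0 recovers the naive formula.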