Fall Semester 2000
Write a C code called mcint.cc that operates as follows:
darkstar> mcint 10000 987654321 5-D unit sphere volume: XXXX uncertainty sigma_I: YYYYThe output XXXX should be your answer obtained from simple Monte Carlo integration with 10000 random sample points (specified by the first command-line argument) using the drand() random number generator (RNG) seeded (by sdrand()) with the value given in the second command-line argument. The sigma_I value YYYY should be an estimate of the error in your answer as discussed in lecture and in Numerical Recipes in C, section 7.6.
As in Assignment 7, Exercise 1, use a makefile, ~p5720/makefiles/a08.mk, to access function prototypes in in "p5720.h" and definitions in the p5720 class library. A makefile ~p5720/makefiles/a08.mk can help with compilation and linking to the class library.
Hints:
Turn in your code mcint.cc
This problem deals with data compresibility as a measure of information in a stream of bits. In the directory ~p5720/data you will find the following files, all of exactly the same length:
The relevant differences between these files is the amount of information they carry. The first file is just a stream of random bits. This could, in principle, correspond to a large amount of unique information -- it has no "extra fluff" in terms of repeated patterns -- but we do not have a "key" to unlock this information. The second file, "randc.txt" is also random, but not in all bits -- only those bits used to represent the keyboard characters are used. The third file is extremely high-quality English text, and the last makes for much less interesting reading. (It is based on Stanley Kubrick's film "the Shining", in which an author, played by Jack Nicholson, goes beserk [that's the executive summary]. In one scene, the author works furiously at the typewriter, night and day. The result of all this effort is a ream of paper with the same sentence written over and over and over...as in "jack.txt".)
Use the utility "gzip" and the shell command "ls -l" to determine the degree of compression (or information content) in these files.
In a text file info.txt, type in the resulting length of each file using the format
randb.bin.gz: ###### bytes
randc.txt.gz: ###### bytes
(etc.)
Following this list, write a short paragraph discussing
gzip's ability to the compress these files.
Submit your info.txt file. (All the "original" data files will remain available in the class directories, so please erase your copies of all the lengthy ones -- thanks!)
This problem, of a sociological nature, involves linking C++ code with C and Fortran routines.
In U.S. presidential elections, the President is not chosen by popular vote. Instead, a candidate must win a majority of only 538 "electoral votes"; each State gets to provide electoral votes equal to the number its of U.S. Senators and Representatives (e.g., Utah gets 5 electoral votes). In all but two States (Maine and Nebraska), the winner of a State's popular vote gets all of its electoral votes. For example, if two candidates in California got 50.01% and 49.99% of the popular vote respectively, the winner gets all of California's 54 electoral votes. Owing to "discretization noise" in the process of converting individual States' popular votes into a small number of electoral votes, it is actually possible for the winner of the national popular vote to actually lose the election. This has happened twice in the history of the U.S. (Please see the National Archives and Records website for more information.) This situation is even more likely to occur when there are more than two dominant political parties.
In this exercise, you are to write a single C++ code, vote.cc, to calculate the answer to these two questions:
p(x0 = exp(-(x-mean)^2/(2*sigma^2))/sqrt(2*pi*sigma^2)
from x=50 to infinity, with mean=45 and sigma=3.
prob(k,n,p) = binom(n,k) * p^k * (1-p)^(n-k)
where p is the probabiliy that any individual voter
will vote for this candidate (here, p=0.5), and binom(n,k)
is the binomial coefficient, n!/[k!*(n-k)!].
To answer these questions, use codes in ~p5720/src called gerf.f (Fortran 77 code) and binom.c (C code). The first of these calculates the integral
Gerf(x) = Integrate[t=0 to t=x] dt exp(-(t^2)/2)/sqrt(2*pi)
The second calulates the natural log of the binomial coefficient
binom(n,k). With the approriate change of variables, the
gerf.f code can answer the first question, while the binom.c
code can help answer the second. Note that with the Binomial
distribution, the numbers involved are typically so large
(e.g., 0.5^500000= ~10^-150000) that your machine cannot represent them.
Thus work with natural logarithms and convert to back to
standard double precision numbers only at the end of the calculation.
(For more information about the statistics involved here, please check out this Statistics Page to get some basic information on the Binomial and Gaussian (or "Normal") distribution. This site from Mathematica is unfortunately shut down for legal reasons but when it comes back up it will be a good resource.... If you find a good statistics website please let the instructor know -- thanks!)
At the heart of this problem is not statistical analysis, but linking Fortran, C, and C++ codes together. This Lab Exercise on linking will help you figure out how to do this.
Turn in only your code vote.cc. Be sure the answers to the two problems are clearly printed out when your code is executed.
The C code binom.c in the ~p5720/src directory was hacked together by the instructor in 10 minutes. It is not very well designed; in particular log_binom_coeff(n,k) fails to use nice things like the Stirling approximation (e.g., log(n!) -> n*log(n)) in the large n limit, or the fact that in this limit the Binomial distribution is well approximated by a Guassian.
Starting with NIST's GAMS website, find a code (C or Fortran) which provides binomial coefficients in double precision (most [all?] codes provided by GAMS give answers accurate to machine precision).
Turn in a text file called binom.txt with the URL (http://...) containing a link to downloadable source code that calculates binomial coefficients in double precision.
This exercise is a plea for help from the instructor. A student in this course was running the code for Assignment 7, Exercise 1, and discovered that the drand() random number generator actually gave a double precision number that was less than 10^-16. The likelihood that this can happen is very small. (If we ran drand in an infinite loop, generating a million numbers a second [a modern workstation can do this easily] then we should get such a small number on average every thousand years or so.)
Is the drand() code broken?
To help answer this question, do two things:
nice +10 ~p5720/bin/rndexp_test NUMBER SEED >&! rndexp_test.log &
where NUMBER is a pretty big integer, say 1000000 to 1000000000
(a million to a billion), and SEED is a seed for the drand() random
number generator. Comments:
* The "nice +10" in front of the executable tells the shell to run this program at low priority, so the CPU doesn't get bogged down crunching a billion drand's when other people are trying to get work done too.
* The symbol >!& is really three separate shell directives. The ">" redirects output from the executable to the file rndexp_test.log, the "&" causes error messages to also get recorded in the .log file, and the "!" (bang) causes the redirect to overwrite the file it it already exists. (The & symbol may be new. It causes error messages that normally go to your shell [or rather a stream like standard output called "standard error" or "stderr"], to be redirected along with the rest of the standard output -- it is important to get all information about whether the code breaks!).
* The final & says to run your program in background so that you can continue to work on something else in your xterm window without having to wait for the program to terminate.
The code can generate about a million numbers per second, so use this rate to pick the argument NUMBER in the command line, depending on how much time you want to wait around for the code to run. Also, it would be good to use a SEED that was different from everyone else's in the class. How about picking a seed from your Utah Driver's License number (but don't use your SSN!)? Or some scrambled form of your UofU ID NUmber?
Ultimately, you will submit the rndexp_test.log file, but meanwhile, run the code as shown above and move on to the next part of the assignment.
Store the commands used to create your histogram in a script, rndexp.sm or .gpl) for example.
Hint: When building the histogram and comparing with an exponential, you must either normalize the histogram or scale your exponential by a factor of N=20000*delta_x. Either way is Ok.
Also, for a new and better (?) way of getting a .jpg file out of sm, try the instructions posted at this link.
Note: There will be some scatter, or error, in the histogram relative to the exponential, and this is Ok even for a good RNG. The thing to check is that the scatter is not much bigger than the "discreteness noise" from using finite histogram bin sizes and a limited number of sample points. An estimate the standard deviation (characteristics size) of the error is sqrt(N_i), where N_i is the number of random numbers in the i^th bin. In this exercise, you do not have to provide an estimate of the errors.
Submit the output of your rndexp_test run rndexp_test.log, along with your plotting script (e.g. rdnexp.sm) and histogram plot in jpeg form, exphist.jpg, as in Assignment 4, Exercise 3.