awk

The awk utility is used to select and transform text in files. It is actually a highly sophisticated, C-like interpreted programming language. Our use of awk will be very limited, and complete instructions will accompany each case. We give a few common, useful examples here. For more information about awk, see A. Aho, B.W. Kernighan, and P.J. Weinberger, The AWK Programming Language (Addison-Wesley, Reading, MA, 1988).

Here is an example of where you might use awk. Suppose you have a program that produces a list of $x,y$ values like this:

  2.5 0.513
  2.7 0.416
  2.9 0.213
in a file called oldfile. Suppose you want to multiply the $y$ value by 2 on every line and put the result in a file called newfile. The awk command to do this is
  awk '{print $1, 2*$2;}' oldfile > newfile
The instruction to awk sits inside the single quotes ''. It is applied to each line, one at a time. Awk sees each line as a sequence of fields separated by ``white'' space (blanks and tabs are white space). Each line in this case therefore has two fields. The single instruction says: for each line, print the first field and twice the second field. A dollar sign followed by a number n refers to field n of the current line.
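
As a rough illustration of the field splitting described above, the following sketch recreates the three-line sample file and runs the doubling command on it. The built-in variable NF (the number of fields on the current line) is not used elsewhere in this section; it is shown here only to make the splitting visible.

```shell
# Recreate the sample data from the text in a file called oldfile.
printf '2.5 0.513\n2.7 0.416\n2.9 0.213\n' > oldfile

# NF is awk's built-in count of fields on the current line.
# For this file every line prints "2" followed by its two fields.
awk '{print NF, $1, $2;}' oldfile

# The doubling command from the text: first field, then twice the second.
awk '{print $1, 2*$2;}' oldfile > newfile
cat newfile
```

With this input, newfile should contain the same three $x$ values paired with 1.026, 0.832, and 0.426.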

Awk can print the third field on all lines containing the string ans with the command

  awk '/ans/{print $3;}' oldfile
This procedure is very handy for extracting numbers from complicated output files for subsequent processing, provided you had the foresight to put a unique keyword, such as ans, on the line containing the number you want to extract. Because we chose not to redirect the output in this example, it goes to the screen. (You could redirect it to a file, of course.)
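
The sketch below shows why the keyword needs to be unique. The file name run.log, its contents, and the output file energy.dat are invented for illustration. The pattern /ans/ matches the string anywhere on a line, so a line containing a word such as ``transfer'' is selected too; testing the first field for equality is one possible stricter alternative.

```shell
# Hypothetical program output; run.log and its contents are made up.
cat > run.log <<'EOF'
iteration 1 residual 0.31
iteration 2 residual 0.02
ans energy 1.2345
transfer of 3 files done
EOF

# The command from the text: third field of every line containing "ans".
# Here it also picks up the "transfer" line, because "ans" occurs inside it.
awk '/ans/{print $3;}' run.log

# Stricter selection: the first field must be exactly "ans".
# The result is redirected to a file instead of going to the screen.
awk '$1=="ans"{print $3;}' run.log > energy.dat
cat energy.dat
```

The first command prints 1.2345 and then 3; the second leaves only 1.2345 in energy.dat.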

Awk can average all the numbers in a field. Suppose in the above list of numbers, you wanted to compute the average value of the second field. The awk command is

  awk 'BEGIN{s=0;}{s=s+$2;}END{print s/NR;}' oldfile
There are actually three instructions given to awk, written one after another; each semicolon merely ends a statement inside its braces. The first, BEGIN{s=0;}, is executed only once, before the file is read. It initializes the variable s that is to hold the sum. This step is actually not necessary, since an awk variable that has not been set has the value zero when used as a number. The second, {s=s+$2;}, is executed for each line and accumulates the sum. The third, END{print s/NR;}, is executed only after the last line has been processed. It says to print the sum, divided by NR, a special awk variable that gives the number of lines (``records'') read so far.

In these examples the awk scripts are short enough to put on the command line, enclosed between single quotes. Longer scripts are generally put in files, which contain the script text without the surrounding quotes. Scripts are easier to read if the commands are separated by line breaks, just as in a C program. If the file name extension is .awk, as in avg.awk, the emacs editor recognizes the name and enters a useful awk editing mode similar to its C and C++ modes. If we had put the averaging script above in the file avg.awk, we would run it using

  awk -f avg.awk oldfile
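
One possible layout for avg.awk is sketched here, with one instruction per line and comments introduced by #. The test NR > 0 in the END block is an addition that is not in the one-line version; it is there only to avoid a division by zero when the input is empty.

```shell
# Recreate the sample data from the text.
printf '2.5 0.513\n2.7 0.416\n2.9 0.213\n' > oldfile

# Write the averaging script to avg.awk; the quoted 'EOF' keeps the
# shell from expanding $2 before awk sees it.
cat > avg.awk <<'EOF'
# avg.awk: average the second field over all input lines
BEGIN { s = 0 }                     # optional: runs once before any input
      { s = s + $2 }                # runs for every line: accumulate the sum
END   { if (NR > 0) print s/NR }    # runs once at the end; skip empty input
EOF

awk -f avg.awk oldfile
```

For the three sample lines this prints 0.380667, the sum 1.142 divided by 3 in awk's default six-significant-digit output format.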

One can use awk to do quick calculator-style arithmetic. To evaluate $\cos(3.5)$, for example, use

  awk 'BEGIN{print cos(3.5);}'
The syntax for arithmetic expressions is much the same as in C. In this case awk is not working on any file, so our commands must be executed under BEGIN.
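
A few more calculator-style one-liners are sketched below. They rely on standard awk features not mentioned elsewhere in this section: the ^ operator for exponentiation (which C lacks), the built-in function atan2, and printf for choosing the number of digits printed.

```shell
# Exponentiation with ^: 2 to the power 10.
awk 'BEGIN{print 2^10;}'

# pi obtained from atan2, printed to four decimal places.
awk 'BEGIN{printf "%.4f\n", atan2(0, -1);}'

# The cosine example from the text, again to four decimal places.
awk 'BEGIN{printf "%.4f\n", cos(3.5);}'
```

These print 1024, 3.1416, and -0.9365 respectively.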

Sometimes we want to change the value of an awk variable each time we run a script. Instead of editing the script each time, we set the variable on the command line. Here is an awk script that selects all lines whose first field matches a specified key string:

  # keyselect.awk
  { if($1==KEYSTRING)print; }
We run it on the file oldfile with
  awk -f keyselect.awk -v KEYSTRING=ans oldfile
In this case the value of the variable KEYSTRING is set to ``ans'' on the command line. If no input file is named, awk reads from standard input instead.
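
The following sketch runs the key-selection script end to end. The input file sample.txt and its contents are invented for illustration. Because the script compares the whole first field with KEYSTRING, lines that start with ``answer'' or that contain ans in a later field are not selected.

```shell
# The script from the text, saved as keyselect.awk.
cat > keyselect.awk <<'EOF'
# keyselect.awk
{ if ($1 == KEYSTRING) print; }
EOF

# Hypothetical input; sample.txt and its contents are made up.
cat > sample.txt <<'EOF'
ans 1 0.25
answer 2 0.50
note ans 0.75
ans 3 1.00
EOF

# Set KEYSTRING on the command line with -v and name the input file.
awk -v KEYSTRING=ans -f keyselect.awk sample.txt
```

Only the first and last lines of sample.txt are printed. Running the same command with KEYSTRING=note would select the third line instead, with no change to the script.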

One can do much more with awk. Indeed, one could almost use awk as a general programming language in place of C++, Fortran, or C. Its strengths, however, lie in record manipulation rather than number crunching, and some of its most elegant applications are one-liners like the foregoing examples.