AWK Tool in Unix: AWK Grammar
AWK Grammar
At this stage it is worth recapitulating some of the grammar rules. In particular, we summarize the pattern forms that an AWK program may use.
The reader should note that, across its tools and utilities, Unix maintains largely the same regular expression conventions.
1. BEGIN {statements}: These statements are executed once, before any input is processed.
2. END {statements}: These statements are executed once, after all the lines in the input have been read.
3. expr {statements}: These statements are executed at each input line where the expr is true.
4. /regular expr/ {statements}: These statements are executed at each input line that contains a string matched by the regular expression.
5. compound pattern {statements}: A compound pattern combines patterns with && (AND), || (OR), ! (NOT) and parentheses; the statements are executed at each input line where the compound pattern is true.
6. pattern1, pattern2 {statements}: A range pattern matches each input line from a line matched by "pattern1" to the next line matched by "pattern2", inclusive; the statements are executed at each matching line.
7. "BEGIN" and "END" do not combine with any other pattern, and both always require an action. Technically, "BEGIN" and "END" do not match any input line. With multiple "BEGIN" or "END" patterns, the actions are executed in their order of appearance.
8. A range pattern cannot be part of any other pattern.
9. "FS" is a built-in variable that holds the field separator.
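The pattern forms listed above can be tried on a few lines of inline data. The following is only a sketch: the sample records (name, country, runs) are made up for illustration and are not taken from the cricket data file.

```sh
# Hypothetical tab-separated sample data: name, country, runs.
data=$(printf 'Ann\tIND\t120\nBob\tAUS\t80\nCy\tIND\t45\nDee\tPAK\t300\n')

out=$(printf '%s\n' "$data" | awk '
BEGIN                   { FS = "\t"; print "start" }   # runs once, before any input
$3 > 100                { print "expr:", $1 }          # expression pattern
/AUS/                   { print "regex:", $1 }         # regular-expression pattern
$2 == "IND" && $3 < 100 { print "compound:", $1 }      # compound pattern
/Bob/, /Cy/             { print "range:", $1 }         # range pattern, inclusive
END                     { print "lines read:", NR }    # runs once, after all input
')
printf '%s\n' "$out"
```

For each input line the rules are tried in the order in which they appear, so a line that satisfies several patterns produces several lines of output.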
Note that expressions like $3/$2 > 0.5 match when they evaluate to true. Strings are compared character by character, so "The" < "Then" and "Bonn" > "Berlin" are both true. Now, let us look at some string matching considerations. In general terms, the following rules apply.
1. /regexpr/ matches an input line if the line contains a substring matched by regexpr. As an example, /India/ matches " India " (with a space on both sides), just as it detects the presence of India in "Indian".
2. expr ~ /regexpr/ matches if the value of the expr contains a substring matched by regexpr. As an example, $4 ~ /India/ matches all input lines where the fourth field contains "India" as a substring.
3. expr !~ /regexpr/ is the same as above except that the condition is reversed. As an example, $4 !~ /India/ matches when the fourth field does not have "India" as a substring.
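The two matching operators can be compared side by side. This is an illustrative sketch; the records below are invented, with the country placed in the fourth field to mirror the $4 examples above.

```sh
# Hypothetical records: name, team, matches, country (space-separated).
out=$(printf '%s\n' 'Ravi A 10 India' 'Sam B 12 Indiana' 'Lee C 9 Japan' | awk '
$4 ~ /India/   { print $1, "matches" }         # field 4 contains India as a substring
$4 !~ /India/  { print $1, "does not match" }  # field 4 has no such substring
$4 ~ /^India$/ { print $1, "exact" }           # anchors force a whole-field match
')
printf '%s\n' "$out"
```

Note that "Indiana" satisfies the substring test but not the anchored one.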
The following is the summary of the Regular Expression matching rules.
^C : matches a C at the beginning of a string
C$ : matches a C at the end of a string
^C$ : matches a string consisting of the single character C
^.$ : matches single character strings
^...$ : matches exactly three character strings
... : matches any three consecutive characters
\.$ : matches a period at the end of a string
* : zero or more occurrences
? : zero or one occurrence
+ : one or more occurrences
The regular expression meta characters are:
1. \ ^ $ . [ ] | ( ) * + ?
2. A basic RE is one of the following:
A non meta character such as A that matches itself
An escape sequence that matches a special symbol: \t matches a tab.
A quoted meta-character such as \* that matches the meta-character literally
^, which matches beginning of a string
$, which matches end of a string
., which matches any single character
A character class such as [ABC], which matches any one of A, B or C
A character range such as [A-Za-z], which matches any single letter
A complemented character class such as [^0-9], which matches any single character that is not a digit
3. These operators combine REs into larger ones:
alternation : A|B matches A or B
concatenation : AB matches A immediately followed by B
closure : A* matches zero or more A's
positive closure : A+ matches one or more A's
zero or one : A? matches null string or one A
parentheses : (r) matches same string as r does.
The operator precedence, in increasing order, is: |, concatenation, and then the repetition operators *, + and ?. That is, *, + and ? bind more tightly than concatenation, which in turn binds more tightly than |.
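The effect of precedence is easiest to see with anchors and alternation. In this small sketch (the four input strings are arbitrary test values), the ungrouped expression is read as "begins with ab, or ends with cd", while the parenthesized one requires the whole line to be ab or cd.

```sh
out=$(printf '%s\n' 'ab' 'abX' 'Xcd' 'cd' | awk '
/^ab|cd$/   { a = a " " $0 }   # parsed as (^ab)|(cd$)
/^(ab|cd)$/ { b = b " " $0 }   # parentheses group the alternation first
END { print "ungrouped:" a; print "grouped:" b }
')
printf '%s\n' "$out"
```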
The escape sequences are:
\b : backspace; \f : form feed; \n : new-line; \c : literally c (\\ for \)
\r : carriage return; \t : tab; \ddd : octal value ddd
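A short sketch of a few of these escape sequences inside AWK string constants follows. It assumes an ASCII character set, in which octal 101, 102 and 103 are the letters A, B and C.

```sh
out=$(awk 'BEGIN {
    printf("%s\t%s\n", "name", "runs")   # \t is a tab, \n a new-line
    print "\101\102\103"                 # octal 101, 102, 103 are A, B, C in ASCII
    print "a\\b"                         # \\ stands for one literal backslash
}')
printf '%s\n' "$out"
```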
Some useful patterns are:
Now we will see the use of FILENAME (a built-in variable) and the use of range patterns.
More Examples
For the next set of examples we shall consider a file whose records describe the performance of some well-known cricketers of the recent past. This data is easy to obtain from cricket sites such as khel.com or cricinfo.com, or from the International Cricket Council's website. The data we shall be dealing with is described in Table 12.1, which holds information such as the name of the player, his country affiliation, matches played, runs scored, wickets taken, etc. We should be able to scan the cricket.data file and match lines to yield the desired results. For instance, if we look for players with more than 10,000 runs scored, we expect to see SGavaskar and ABorder. Similarly, for anyone with more than 400 wickets we expect to see Kapildev and RHadlee.
So let us begin with our example programs.
1. Example 1
Table 12.1: The cricket data file.
# In this program we identify Indian cricketers and mark them with ***.
# We try to find cricketers with most runs, most wickets and most catches
BEGIN {FS = "\t" # make the tab the field separator
printf("%12s %5s %7s %4s %6s %7s %4s %8s %8s %3s %7s %7s\n\n",
"Name","Country","Matches","Runs","Batavg","Highest","100s","Wkts",
"Bowlavg","Rpo","Best","Catches")}
$2 ~/IND/ { printf("%12s %5s %7s %6s %6s %7s %4s %8s %8s %4s %7s %7s %3s\n",
$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,"***")}
$4 > runs {runs = $4;name1 = $1}
$8 > wickets {wickets = $8;name2 = $1}
$12 > catches {catches = $12;name3 = $1}
# the opening brace must be on the same line as END
END {printf("\n %15s is the highest scorer with %6s runs",name1,runs)
printf("\n %15s is the highest wicket taker with %8s wickets",name2,wickets)
printf("\n %15s is the highest catch taker with %7s catches\n",name3,catches)
}
bhatt@falerno [AWK] =>!a
awk -f cprg.1 cricket.data
AR.Border is the highest scorer with 11174 runs
Kapildev is the highest wicket taker with 434 wickets
MTaylor is the highest catch taker with 157 catches
2. Example 2 In this example we use the built-in variable FILENAME and also match a few patterns.
# In this example we use the FILENAME built-in variable and print data from
# the first three lines of the cricket.data file. In addition we print data from
# ImranKhan to ABorder
BEGIN {FS = "\t" # make the tab the field separator
printf("%25s \n","First three players")}
NR == 1, NR == 3 {printf("%s %12s %5s \n", FILENAME, $1, $2)}
/ImranKhan/, /ABorder/ {num = num + 1; line[num] = $0}
END {printf("Player list from Imran Khan to Allen Border \n")
printf("%12s %5s %7s %4s %6s %7s %4s %8s %8s %3s %7s %7s\n\n",
"Name","Country","Matches","Runs","Batavg","Highest","100s","Wkts",
"Bowlavg","Rpo","Best","Catches")
for(i=1; i <= num; i = i+1)
print(line[i])}
bhatt@falerno [AWK] =>!a
awk -f cprg.2 cricket.data
First three players
cricket.data SGavaskar IND
cricket.data MAmarnath IND
cricket.data BSBedi IND
Player list from Imran Khan to Allen Border
At this point it is worth looking at the built-in variables that are available in AWK (see Table 12.2).
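As a small illustration of three of the standard built-in variables, the sketch below uses NR (number of records read so far), NF (number of fields in the current record) and FS (set here through the -F option). The colon-separated input is made up for the purpose.

```sh
out=$(printf 'a:b:c\nd:e\n' | awk -F: '
{ printf("record %d has %d fields, last is %s\n", NR, NF, $NF) }
END { print "total records:", NR }    # NR keeps its final value in END
')
printf '%s\n' "$out"
```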
More on AWK Grammar
Table 12.2: Built-in variables in AWK.
Continuing our discussion of the AWK grammar, we shall describe the nature of the statements and expressions permissible within AWK. The statements essentially invoke actions. In this description, "expression" may be composed of constants, variables, assignments, or function calls. Essentially, these statements are program actions as described below:
1. print expression-list
2. printf(format, expression-list)
3. if (expression) statement
4. while (expression) statement
5. for (expression; expression; expression) statement
6. for (variable in array) statement (note: "in" is a keyword)
7. do statement while (expression)
8. break: immediately leave the innermost enclosing while, for or do loop
9. continue: start the next iteration of the innermost enclosing while, for or do loop
10. next: start the next iteration of the main input loop
11. exit: go immediately to the END action
12. exit expression: same, with the value of expression returned as the status of the program
13. { statements }: a group of statements enclosed in braces
; : is an empty statement
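Several of these statements can be combined in one short program. The following is a sketch under stated assumptions: the input lines are invented, and the choice of 3 as the exit status is arbitrary.

```sh
status=0
out=$(printf '%s\n' '# comment' '3' '5' 'stop' '7' | awk '
/^#/         { next }       # skip comment lines: move on to the next input line
$1 == "stop" { exit 3 }     # run the END action, then leave with status 3
{
    n = $1; f = 1
    while (n > 1) { f = f * n; n-- }   # factorial by repeated multiplication
    fact[$1] = f
}
END {
    if (3 in fact) print "3! =", fact[3]
    if (5 in fact) print "5! =", fact[5]
    if (!(7 in fact)) print "7 was never read"
}
') || status=$?
printf '%s\n' "$out"
echo "exit status: $status"
```

The line containing 7 is never processed because exit stops the main input loop, yet the END action still runs.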
We should indicate the following here:
1. Primary Expressions are: numeric and string constants, variables, fields, function calls, and array elements
2. The following operators help to compose expressions:
(a) assignment operators = += -= *= /= %= ^=
(b) conditional operator ?:
(c) logical operators || && !
(d) matching operators ~ and !~
(e) relational operators < <= == != >= >
(f) concatenation operator (see the string operator below)
(g) arithmetic operators +, -, *, /, %, ^ and unary + and -
(h) incrementing and decrementing operators ++ and -- (pre- as well as post-fix )
(i) parentheses and grouping as usual.
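A brief sketch exercising several of these operators in a BEGIN action follows; the variable names are arbitrary.

```sh
out=$(awk 'BEGIN {
    x = 7
    x += 3; x *= 2                            # assignment operators: x is now 20
    parity = (x % 2 == 0) ? "even" : "odd"    # conditional operator
    label = "x=" x                            # concatenation by juxtaposition
    print label, parity
    print 2 ^ 10, -x                          # exponentiation and unary minus
    y = x++                                   # post-increment: y gets 20, x becomes 21
    z = ++x                                   # pre-increment: x becomes 22, z gets 22
    print y, z, x
    ok = ("abc" ~ /^a/) && !("abc" ~ /z/)     # matching combined with logical operators
    print ok
}')
printf '%s\n' "$out"
```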
For constructing the expressions the following built-in functions are available:
3. Example 3 We next look at string operations. Let us first construct a small recognizer. String concatenation is implicit: string expressions are created by writing constants, variables, fields, array elements, function values and other expressions next to each other. The program {print NR ": " $0} prefixes each line of output with its line number followed by ": ". In the example below, we shall use some of the facilities which we have discussed.
# this program is a small illustration of building a recognizer
BEGIN {
sign = "[+-]?"
decimal = "[0-9]+[.]?[0-9]*"
fraction = "[.][0-9]+"
exponent ="([eE]" sign "[0-9]+)?"
number ="^" sign "(" decimal "|" fraction ")" exponent "$"
}
$0 ~ number {print}
Table 12.3: Various string functions in AWK.
Note that in this example the regular expression is built up as a string. If /pattern/ were used instead, a meta-character would have to be escaped to be matched literally, i.e. using \. for . and so on. Run this program using gawk with the data given below:
1.2e5
129.0
abc
129.0
You should see the following as output:
bhatt@falerno [AWK] =>gawk -f flt_num_rec.awk flt_num.data
1.2e5
129.0
129.0
4. Example 4 Now we shall use some string functions that are available in AWK. We shall match partial strings and also substitute strings in output (like the substitute command in vi editor). AWK supports many string oriented operations. These are listed in Table 12.3.
Let us now suppose we have the following AWK program line:
x = sprintf("%10s %6d",$1, $2)
This program line assigns to x the string formatted as specified, without printing it. Similarly, observe the behaviour of the program segment given below:
{x = index("ImranKhan", "Khan"); print "x is :", x}
bhatt@falerno [AWK] =>!a
awk -f dummy.rec
x is : 6
In the response above, note that positions in a string are counted from 1. Also, the gsub function acts like the substitution command in vi, as shown below:
{ gsub(/KDev/, "kapildev"); print }
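The string functions mentioned so far can be tried together on a single line of input. This is a sketch only; the input record is hypothetical.

```sh
out=$(printf 'KDev IND 434\n' | awk '{
    print index("ImranKhan", "Khan")     # position of the substring, counting from 1
    print length($1), substr($1, 2, 3)   # length of field 1, then 3 characters from position 2
    n = gsub(/KDev/, "Kapildev")         # replace every match in $0; n counts the replacements
    print n, $0
    x = sprintf("%10s %6d", $1, $3)      # format into a string instead of printing
    print "[" x "]"
}')
printf '%s\n' "$out"
```

The brackets in the last line make the padding produced by the %10s and %6d widths visible.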
Let us examine the program segment below.
BEGIN {OFS = "\t"}
{$1 = substr($1, 1, 4); x = length($0); print $0, x}
{s = s substr($1, 1, 4) " "}
END {print s}
Clearly, the output lines would be like the one shown below. We have shown only one line of output.
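To see how assigning to $1 rebuilds $0 with the output field separator, the same program segment can be run on two made-up records. The names and numbers below are placeholders, not the contents of the cricket data file.

```sh
out=$(printf 'SGavaskar IND 125\nKapildev IND 131\n' | awk '
BEGIN { OFS = "\t" }
{ $1 = substr($1, 1, 4); x = length($0); print $0, x }   # $0 is rebuilt using tabs
{ s = s substr($1, 1, 4) " " }                            # collect the shortened names
END { print s }
')
printf '%s\n' "$out"
```

Each rebuilt record is 12 characters long: 4 + 3 + 3 characters of data plus two tab separators.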
We next indicate how we may count the number of centuries scored by Indian players and players from Pakistan.
/IND/ {century["India"] += $7 }
/PAK/ {century["Pakistan"] += $7 }
/AUS/ {catches["Australia"] += $12; k = k+1; Aus[k] = $0}
END {print "The Indians have scored ", century["India"], "centuries"
print "The Pakistanis have scored ", century["Pakistan"], "centuries"
print "The Australians have taken ", catches["Australia"], "catches"}
The response is:
The Indians have scored 53 centuries
The Pakistanis have scored 21 centuries
The Australians have taken 368 catches
5. Example 5 Now we shall demonstrate the use of Unix pipe within the AWK program.
This program obtains output and then pipes it to give us a sorted output.
# This program demonstrates the use of pipe.
BEGIN{FS = "\t"}
{wickets[$2] += $8}
END {for (c in wickets)
printf("%10s\t%5d\t%10s\n", c, wickets[c], "wickets") | "sort -t'\t' -k 2rn" }
The response is reproduced below:
Normally, a file or a pipe is opened when the program first uses it and stays open for the rest of the run. If the file or pipe is explicitly closed and then reused, it is reopened. The statement close(expression) closes the file or pipe denoted by expression; the string value of expression must be the same as the string used to open the file or pipe in the first place. Closing is essential if we write to a file or pipe and then read from it later in the same program. There is always a system-defined limit on the number of pipes and files that a program may have open at one time.
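A minimal sketch of close() on an output pipe follows, assuming a sort command is available on the PATH. The country codes and wicket counts are made up. Keeping the command string in a variable (here cmd, an arbitrary name) guarantees that close() receives exactly the string that opened the pipe.

```sh
out=$(printf 'IND\t434\nAUS\t355\nIND\t266\n' | awk '
BEGIN { FS = "\t"; cmd = "sort -k 2nr" }
{ wickets[$1] += $2 }
END {
    for (c in wickets)
        print c, wickets[c] | cmd   # the first use opens the pipe to sort
    close(cmd)                      # wait for sort to finish before printing more
    print "done"
}
')
printf '%s\n' "$out"
```

Without the call to close(), the sorted lines would normally appear only when AWK exits, that is, after the final print.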
One good use of pipes is in organizing input. There are several ways of providing the input data with the most common arrangement being:
awk 'program' data
AWK reads standard input if no file names are given; thus a second common arrangement is to have another program pipe its output into AWK. For example, egrep selects input lines containing a specified regular expression, and typically does this faster than AWK does. So we can type the command egrep 'IND' countries.data | awk 'program' to pre-select the input.
bhatt@falerno [AWK] =>egrep 'BRAZIL' cricketer.data | awk -f recog3.awk
egrep: NO such file or directory
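A working variant of this arrangement is sketched below. It uses grep -E (the current spelling of egrep) and made-up inline records in place of a data file; the AWK program then only sees the lines that grep lets through.

```sh
# Hypothetical records: name, country, centuries.
out=$(printf 'SG IND 34\nAB AUS 27\nKD IND 8\n' |
      grep -E 'IND' |
      awk '{ total += $3 } END { print total }')   # sum field 3 of the selected lines
printf '%s\n' "$out"
```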
6. Example 6 Now we shall show the use of command line arguments. An AWK command line may have any of the several forms below:
awk 'program' f1 f2 ...
awk -f programfile f1 f2 ...
awk -Fsep 'program' f1 f2 ...
awk -Fsep -f programfile f1 f2 ...
If a file name has the form var=text (note: no spaces), however, it is treated as an assignment of text to var, performed at the time when that argument would otherwise be accessed as a file. This type of assignment allows variables to be changed before and after a file is read.
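The timing of such assignments can be observed with two small files. This is a sketch; the file names f1 and f2, the variable name tag and the temporary directory are all illustrative choices.

```sh
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/f1"
printf 'c\n'    > "$tmp/f2"

# tag is assigned just before the following file is opened, so it differs per file.
out=$(awk '{ print tag ": " $0 }' tag=first "$tmp/f1" tag=second "$tmp/f2")
printf '%s\n' "$out"
rm -r "$tmp"
```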
The command line arguments are available to an AWK program in a built-in array called ARGV. The value of ARGC is one more than the number of arguments, because awk itself is counted as the zeroth argument. With the command line awk -f progfile a v=1 b, ARGC is 4 and the array ARGV has the following values: ARGV[0] is awk, ARGV[1] is a, ARGV[2] is v=1 and, finally, ARGV[3] is b. Note that -f progfile is not included in ARGV. Here is another sample program with its response shown:
#echo - print command line arguments
BEGIN{ for( i = 1; i < ARGC; i++ )
printf("%s , \n", ARGV[i] )}
outputs
bhatt@falerno [AWK] =>!g
gawk -f cmd_line1.awk cricket.data
cricket.data ,
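The ARGC and ARGV values described for the arguments a v=1 b can be checked directly. Because the sketch below has only a BEGIN action, AWK never tries to open a or b, so they need not exist as files. ARGV[0] is left out of the loop since its exact value depends on how the interpreter was invoked.

```sh
out=$(awk 'BEGIN {
    print "ARGC =", ARGC
    for (i = 1; i < ARGC; i++)
        printf("ARGV[%d] = %s\n", i, ARGV[i])
}' a v=1 b)
printf '%s\n' "$out"
```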
7. Example 7 Our final example shows the use of shell scripts. Suppose we wish to have a shell program in the file sh1.awk. We shall proceed as follows.
- step 1: create the file sh1.awk containing the line gawk '{print $1}' $*
- step 2: chmod sh1.awk to make it executable.
bhatt@falerno [AWK] => chmod +x sh1.awk
- step 3: Now execute it under the shell command:
bhatt@falerno [AWK] => sh sh1.awk cricket.data file1.data file2.data
- step 4: See the result.
bhatt@falerno [AWK] =>sh sh1.awk cricket.data
SGavaskar
.......
.......
MartinCrowe
RHadlee
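The same steps can be scripted end to end. The sketch below is an illustration under assumptions: it uses plain awk rather than gawk, passes the arguments as "$@" so that file names containing spaces survive, and works in a temporary directory with a two-line sample file whose contents are made up.

```sh
tmp=$(mktemp -d)

# Step 1: create the wrapper script (the name first_field.sh is arbitrary).
cat > "$tmp/first_field.sh" <<'EOF'
#!/bin/sh
# Print the first field of every line of the named files.
awk '{ print $1 }' "$@"
EOF

# Step 2: make it executable.
chmod +x "$tmp/first_field.sh"

# Steps 3 and 4: run it on a sample file and look at the result.
printf 'SGavaskar IND\nRHadlee NZ\n' > "$tmp/sample.data"
out=$(sh "$tmp/first_field.sh" "$tmp/sample.data")
printf '%s\n' "$out"
rm -r "$tmp"
```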
Here is an interesting program that swaps the fields:
#field swap: bring field 4 to 2; 2 to 3 and 3 to 4
#usage : sh sh2.awk 1 4 2 3 cricket.data to get the effect
gawk '
BEGIN {
    for (i = 1; ARGV[i] ~ /^[0-9]+$/; i++) { # collect the leading numeric arguments
        fld[++nf] = ARGV[i]
        ARGV[i] = ""                          # blank it so it is not taken as a file name
    }
    if (i >= ARGC) # no file names, so force stdin
        ARGV[ARGC++] = "-"
}
{
    for (i = 1; i <= nf; i++)
        printf("%8s", $fld[i])                # print the fields in the requested order
}
{ print "" }' $*
bhatt@falerno [AWK] =>!s
sh sh2.awk 1 2 12 3 4 5 6 7 8 9 10 11 cricket.data
SGavaskar IND 108 125 10122 51.12 236 34 1 206.00 3.25 1-34
......
......
ABorder AUS 156 156 11174 50.56 265 27 39 39.10 2.28 11-96
In the examples above we have described a very powerful tool. It is hoped that, with these examples, the reader will feel comfortable with the Unix tools suite.