The uniq utility reads sorted lines of data on its standard input, and by default removes duplicate lines. In other words, it only prints unique lines, hence the name. uniq has a number of options. The usage is as follows:
    uniq [-udc [-n]] [+n] [inputfile [outputfile]]
The options for uniq are:
-d
Print only repeated (duplicated) lines.
-u
Print only nonrepeated (unique) lines.
-c
Count lines. This option overrides -d and -u. Both repeated and nonrepeated lines are counted.
-n
Skip n fields before comparing lines. The definition of fields is similar to awk’s default: nonwhitespace characters separated by runs of spaces and/or TABs.
+n
Skip n characters before comparing lines. Any fields specified with ‘-n’ are skipped first.
inputfile
Data is read from the input file named on the command line, instead of from the standard input.
outputfile
The generated output is sent to the named output file, instead of to the standard output.
Normally uniq behaves as if both the -d and -u options are provided.
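The effect of the two selection options can be sketched on its own. The following is a minimal sketch, not the program described in this section: the shell function select_lines and its two positional flags are hypothetical names introduced for illustration, and it compares whole lines only, with no field or character skipping.

```shell
# Hypothetical helper, not part of uniq.awk.
# $1 = 1 keeps repeated lines (like -d); $2 = 1 keeps nonrepeated lines (like -u).
select_lines() {
    awk -v d="$1" -v u="$2" '
    BEGIN { if (d == 0 && u == 0) d = u = 1 }   # no option given: both apply

    # Print the finished run of identical lines if it is selected.
    function emit() {
        if ((d && count > 1) || (u && count == 1))
            print last
    }

    NR == 1    { last = $0; count = 1; next }
    $0 == last { count++; next }
               { emit(); last = $0; count = 1 }
    END        { if (NR > 0) emit() }
    '
}

printf 'a\na\nb\nc\nc\n' | select_lines 0 0   # prints a, b, and c
printf 'a\na\nb\nc\nc\n' | select_lines 1 0   # prints a and c
printf 'a\na\nb\nc\nc\n' | select_lines 0 1   # prints b
```

With neither flag set, every run of identical lines is printed once, which is the default behavior described above.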
uniq uses the getopt() library function (see Getopt Function) and the join() library function (see Join Function).
The program begins with a usage() function and then a brief outline of the options and their meanings in comments.
The BEGIN rule deals with the command-line arguments and options. It uses a trick to get getopt() to handle options of the form ‘-25’, treating such an option as the option letter ‘2’ with an argument of ‘5’. If indeed two or more digits are supplied (Optarg looks like a number), Optarg is concatenated with the option digit and then the result is added to zero to make it into a number. If there is only one digit in the option, then Optarg is not needed. In this case, Optind must be decremented so that getopt() processes it next time. This code is admittedly a bit tricky.
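The digit handling can be tried in isolation. This is a minimal sketch that does not call getopt(); the shell function field_count and the words "consumed" and "pushback" are hypothetical, chosen only to show which of the two cases applies. Its two arguments stand for the option digit returned by getopt() and the value left in Optarg.

```shell
# Hypothetical helper, not part of uniq.awk.
# $1 = option digit returned by getopt(), $2 = the value found in Optarg.
field_count() {
    awk -v c="$1" -v optarg="$2" 'BEGIN {
        if (optarg ~ /^[0-9]+$/) {
            # Two or more digits, e.g. -25: digit 2 and argument 5 give 25.
            n = (c optarg) + 0
            print n, "consumed"
        } else {
            # One digit, e.g. -5 followed by a file name: the argument
            # belongs to something else, so uniq.awk decrements Optind here.
            n = c + 0
            print n, "pushback"
        }
    }'
}

field_count 2 5          # prints: 25 consumed
field_count 5 data.txt   # prints: 5 pushback
```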
If no options are supplied, then the default is taken, to print both repeated and nonrepeated lines. The output file, if provided, is assigned to outputfile. Early on, outputfile is initialized to the standard output, /dev/stdout:
    # uniq.awk --- do uniq in awk
    #
    # Requires getopt() and join() library functions

    function usage()
    {
        print("Usage: uniq [-udc [-n]] [+n] [ in [ out ]]") > "/dev/stderr"
        exit 1
    }

    # -c    count lines. overrides -d and -u
    # -d    only repeated lines
    # -u    only nonrepeated lines
    # -n    skip n fields
    # +n    skip n characters, skip fields first

    BEGIN {
        count = 1
        outputfile = "/dev/stdout"
        opts = "udc0:1:2:3:4:5:6:7:8:9:"
        while ((c = getopt(ARGC, ARGV, opts)) != -1) {
            if (c == "u")
                non_repeated_only++
            else if (c == "d")
                repeated_only++
            else if (c == "c")
                do_count++
            else if (index("0123456789", c) != 0) {
                # getopt() requires args to options
                # this messes us up for things like -5
                if (Optarg ~ /^[[:digit:]]+$/)
                    fcount = (c Optarg) + 0
                else {
                    fcount = c + 0
                    Optind--
                }
            } else
                usage()
        }

        if (ARGV[Optind] ~ /^\+[[:digit:]]+$/) {
            charcount = substr(ARGV[Optind], 2) + 0
            Optind++
        }

        for (i = 1; i < Optind; i++)
            ARGV[i] = ""

        if (repeated_only == 0 && non_repeated_only == 0)
            repeated_only = non_repeated_only = 1

        if (ARGC - Optind == 2) {
            outputfile = ARGV[ARGC - 1]
            ARGV[ARGC - 1] = ""
        }
    }
The following function, are_equal(), compares the current line, $0, to the previous line, last. It handles skipping fields and characters. If no field count and no character count are specified, are_equal() returns one or zero depending upon the result of a simple string comparison of last and $0.
Otherwise, things get more complicated. If fields have to be skipped, each line is broken into an array using split() (see String Functions); the desired fields are then joined back into a line using join(). The joined lines are stored in clast and cline. If no fields are skipped, clast and cline are set to last and $0, respectively. Finally, if characters are skipped, substr() is used to strip off the leading charcount characters in clast and cline. The two strings are then compared and are_equal() returns the result:
    function are_equal(    n, m, clast, cline, alast, aline)
    {
        if (fcount == 0 && charcount == 0)
            return (last == $0)

        if (fcount > 0) {
            n = split(last, alast)
            m = split($0, aline)
            clast = join(alast, fcount+1, n)
            cline = join(aline, fcount+1, m)
        } else {
            clast = last
            cline = $0
        }
        if (charcount) {
            clast = substr(clast, charcount + 1)
            cline = substr(cline, charcount + 1)
        }

        return (clast == cline)
    }
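To see how skipping fields and characters changes the comparison, the same idea can be exercised on two literal lines. This is a minimal sketch under stated assumptions: the shell function same_key and the awk function key() are hypothetical names, and key() contains a small stand-in for the join() library function that glues the remaining fields together with single spaces.

```shell
# Hypothetical helper, not part of uniq.awk.
# $1 = fields to skip, $2 = characters to skip, $3 and $4 = the lines to compare.
same_key() {
    awk -v fcount="$1" -v charcount="$2" -v a="$3" -v b="$4" '
    # Reduce a line to the part that takes part in the comparison.
    function key(line,    n, i, parts, out) {
        if (fcount > 0) {
            n = split(line, parts)
            out = ""
            # Stand-in for join(): rejoin fields fcount+1 .. n with spaces.
            for (i = fcount + 1; i <= n; i++)
                out = out (i > fcount + 1 ? " " : "") parts[i]
            line = out
        }
        if (charcount > 0)
            line = substr(line, charcount + 1)
        return line ""      # force a string comparison
    }
    BEGIN {
        result = (key(a) == key(b)) ? "equal" : "different"
        print result
    }
    '
}

same_key 0 0 "1 apple" "2 apple"   # prints: different
same_key 1 0 "1 apple" "2 apple"   # prints: equal
same_key 1 1 "a Xfoo"  "b Yfoo"    # prints: equal
```

As in the description above, fields are skipped first and characters are then stripped from what remains.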
The following two rules are the body of the program. The first one is executed only for the very first line of data. It sets last equal to $0, so that subsequent lines of text have something to be compared to.
The second rule does the work. The variable equal is one or zero, depending upon the results of are_equal()’s comparison. If uniq is counting lines (the -c option), and the lines are equal, then it increments the count variable. Otherwise, it prints the count and the previous line and resets count, because the two lines are not equal.
If uniq is not counting, and if the lines are equal, count is incremented. Nothing is printed, as the point is to remove duplicates. Otherwise, if uniq is printing repeated lines and more than one line is seen, or if uniq is printing nonrepeated lines and only one line is seen, then the previous line is printed, and count is reset.
Finally, similar logic is used in the END rule to print the final line of input data:
    NR == 1 {
        last = $0
        next
    }

    {
        equal = are_equal()

        if (do_count) {    # overrides -d and -u
            if (equal)
                count++
            else {
                printf("%4d %s\n", count, last) > outputfile
                last = $0
                count = 1    # reset
            }
            next
        }

        if (equal)
            count++
        else {
            if ((repeated_only && count > 1) ||
                (non_repeated_only && count == 1))
                    print last > outputfile
            last = $0
            count = 1
        }
    }

    END {
        if (do_count)
            printf("%4d %s\n", count, last) > outputfile
        else if ((repeated_only && count > 1) ||
                (non_repeated_only && count == 1))
            print last > outputfile
        close(outputfile)
    }
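The counting path, including the reason the END rule is needed, can be reduced to a few lines. The following is a minimal sketch, not the program itself: the shell function count_runs is a hypothetical name, lines are compared whole, and the NR > 0 guard for empty input is an addition of this sketch.

```shell
# Hypothetical helper, not part of uniq.awk: count runs of identical lines.
count_runs() {
    awk '
    NR == 1    { last = $0; count = 1; next }
    $0 == last { count++; next }
               {
                   # A different line ends the run: report it and start over.
                   printf("%4d %s\n", count, last)
                   last = $0
                   count = 1
               }
    END        {
                   # The final run is still pending when input ends.
                   if (NR > 0)
                       printf("%4d %s\n", count, last)
               }
    '
}

printf 'a\na\nb\n' | count_runs
```

For this input the sketch prints a count of 2 for "a" and 1 for "b", each count right-aligned in a four-character column. Without the END rule, the line for "b" would never appear, because a run is only reported when the line after it arrives.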