AWK

Created: 04 Sep 2023
Updated: 17 Jan 2025
Figure 1: Great Auk (Thomas Bewick, 1804), the mascot adopted by AWK's bibliography
book https://www.awk.dev/
wiki https://en.wikipedia.org/wiki/AWK
wikibooks https://en.wikibooks.org/wiki/AWK
web repl https://awk.js.org/
doc https://www.gnu.org/software/gawk/manual/html_node/index.html
fansite http://awklang.org/

cli

$ awk '{ print $1 }' file.csv
$ awk -f script.awk file.csv
                           arg         description
-F   --field-separator     S           sets FS
-f   --file                <FILEPATH>  runs script
-E   --exec                <FILEPATH>  runs script (for gawk cgi)
-v   --assign              var=val     sets var to val
-c   --traditional         -           compatibility mode
-P   --posix               -           compatibility mode extra
-S   --sandbox             -           disables system() and IO redirections
-d-  --dump-variables=-    -           dump variables status to stdout
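
A minimal sketch of -F and -v used together. The sample records and the variable name prefix are made up for illustration.

```shell
# Hypothetical colon-separated input, two records.
input='root:x:0
daemon:x:1'

# -F: splits fields on ":", -v assigns prefix before the program starts.
out=$(printf '%s\n' "$input" | awk -F: -v prefix='user=' '{ print prefix $1 }')
echo "$out"
```

It should print user=root and user=daemon, one per line.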

language

#!/usr/bin/awk -f
#!/usr/bin/env -S gawk -f
BEGIN { print "Hello, World!"; }

style

types

weakly typed: a variable's type can change as the program runs

  • gawk: typeof(a) returns the type name as a string
  • untyped, unassigned
  • strnum
  • regexp (strongly typed regexp constants, eg: @/re/)

string

  • indexes start at 1

number

  • || and && return booleans AKA 0(false) and 1(true). They do NOT return the truthy value.
  • falsy: "", 0, undefined variables
  • "comparing floating-point values to see if they are exactly equal is generally a bad idea"

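    ratio = dimension[1] / dimension[2] # ratio prints as 1.77778, of type number
    if (ratio ==  1.77778 ) # does NOT work: numeric comparison against the unrounded value
    if (ratio == "1.77778") # works: string comparison, ratio is converted with CONVFMT ("%.6g")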
    
  • support for scientific notation, eg: 1e8
  • support for float
  • gawk: support for bignum through GMP
  • gawk: support for arbitrary precision through MPFR (to be removed soon?)
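
A sketch of the usual alternative to exact float comparison: compare against a tolerance. The 16/9 ratio and the 0.00001 tolerance are assumptions chosen for illustration.

```shell
out=$(awk 'BEGIN {
    ratio = 16 / 9                    # 1.7777..., printed as 1.77778 (OFMT is %.6g)
    exact = (ratio == 1.77778)        # numeric comparison: false
    diff = ratio - 1.77778
    if (diff < 0) diff = -diff        # absolute difference
    near = (diff < 0.00001)           # comparison within a tolerance: true
    print ratio, exact, near
}')
echo "$out"   # 1.77778 0 1
```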

array

  • are associative arrays (aka hashtables)
  • no need to be declared
  • for strings or numbers
  • 1D
  • 2D support can be mimicked by using [x,y] as index and (x,y) in arr for checking; the key is stored as x SUBSEP y
  • gawk: true multidimensional arrays (arrays of arrays), eg: arr[x][y]
  • index
    • are strings
    • start at 1
      • at least the ones returned by stdlib functions
      • you can make them start at 0 (zero) if you fill the array yourself with your own counter
  • https://www.gnu.org/software/gawk/manual/html_node/Controlling-Array-Traversal.html

    comp_func(i1, v1, i2, v2)  < 0 # Index i1 comes before index i2
    comp_func(i1, v1, i2, v2) == 0 # Indices i1 and i2 come together
    comp_func(i1, v1, i2, v2)  > 0 # Index i1 comes after index i2
    
  • Set the order in which a for (i in arr) loop traverses an already created array

    PROCINFO["sorted_in"] = "afunctionname" # see comp_func
    PROCINFO["sorted_in"] = "@val_num_asc"
    PROCINFO["sorted_in"] = "@val_num_desc"
    PROCINFO["sorted_in"] = "@val_str_asc"
    PROCINFO["sorted_in"] = "@val_str_desc"
    PROCINFO["sorted_in"] = "@ind_num_asc"
    PROCINFO["sorted_in"] = "@ind_num_desc"
    PROCINFO["sorted_in"] = "@ind_str_asc"
    PROCINFO["sorted_in"] = "@ind_str_desc"
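
A sketch of the [x,y] emulation of 2D arrays from the notes above, using only POSIX awk. The array name grid and its contents are hypothetical.

```shell
out=$(awk 'BEGIN {
    grid[1, 2] = "a"                          # stored under the key 1 SUBSEP 2
    grid[3, 4] = "b"
    hit  = ((1, 2) in grid) ? "yes" : "no"    # membership test, creates no slot
    miss = ((2, 1) in grid) ? "no" : "yes"
    n = split("3" SUBSEP "4", part, SUBSEP)   # recover x and y from a key
    print hit, miss, n, part[1], part[2], grid[part[1], part[2]]
}')
echo "$out"   # yes yes 2 3 4 b
```

Splitting a key on SUBSEP is the usual way to get x and y back when looping with for (key in grid).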
    

built-in variables

             DESCRIPTION                                      DEFAULT
FPAT         regex of what each field contains                "[^[:space:]]+"
FIELDWIDTHS  whitespace separated list of field widths        ""
NF           number of fields in the current record           0
NR           number of records (aka lines) read so far        0
FNR          number of records read so far, in current file   0
FS           controls the input field separator               " "
RS           controls the input record separator              "\n"
OFS          output field separator                           " "
ORS          output record separator                          "\n"
OFMT         output format for numbers                        "%.6g"
ENVIRON      array of environment variables
ARGV         array of cli arguments
ARGC         number of cli arguments                          0
ARGIND       index of ARGV being processed                    0
FILENAME     name of current input file                       ""
RLENGTH      length of string matched by match function       0
RSTART       start of string matched by match function        0
SUBSEP       subscript separator                              "\034"
IGNORECASE   non-zero: all but array subscripting ignore case 0
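
A small sketch showing NR, NF, $NF and OFS at work on made-up input.

```shell
out=$(printf 'a b c\nd e\n' | awk '
    BEGIN { OFS = "-" }              # separator that print puts between its arguments
          { print NR, NF, $NF }      # record number, field count, last field
    END   { print "total", NR }      # NR keeps the final count in END
')
echo "$out"
```

It should print 1-3-c, 2-2-e and total-2, one per line.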

built-in functions

TIME

https://www.gnu.org/software/gawk/manual/html_node/Time-Functions.html

fn        args                   returns
mktime    DATESTR, UTC?          timestamp in seconds since epoch for DATESTR
strftime  FMT, TIMESTAMP, UTC?   TIMESTAMP formatted according to FMT
systime   -                      now, TIMESTAMP in seconds since epoch
  • where DATESTR is a space separated "YYYY MM DD HH MM SS DST?", DST? being an optional 0|1 flag
  • where FMT can be "%Y-%m-%d %H:%M:%S"

BITWISE

https://www.gnu.org/software/gawk/manual/html_node/Bitwise-Functions.html

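fn      args       returns
and     v1,v2,…    bitwise AND of the arguments
xor     v1,v2,…    bitwise XOR of the arguments
or      v1,v2,…    bitwise OR of the arguments
compl   val        bitwise complement of val
lshift  val,count  val left shifted by count bits
rshift  val,count  val right shifted by count bits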

ARRAY

expr             returns                    does
asort(SRC,DST)   number of elements in SRC  sort by value, DST has idx=numeric val=old_value
asorti(SRC,DST)  number of elements in SRC  sort by index, DST has idx=numeric val=old_index
isarray(arr)     boolean                    tests whether arr is an array
delete arr[1]    -                          deletes element "1" from array
"" in arr        boolean                    coerce arr into array type (in a function?)
for (i in arr)   -                          iterates over array indexes (i)

MATH

https://www.gnu.org/software/gawk/manual/html_node/Numeric-Functions.html

fn     arg  returns
atan2  y,x  arctangent of y/x in radians, in the -π to π range
cos    x    cosine of x, with x in radians
sin    x    sine of x, with x in radians
exp    x    e raised to the power x
log    x    natural (base e) logarithm of x
sqrt   x    square root of x
int    x    integer part of x, truncated toward zero
rand   -    random number r, 0 <= r < 1
srand  x    previous seed; x is the new seed for rand()
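
A quick sketch exercising the numeric functions in the table above. Seeding with 1 is an arbitrary choice, and the random values themselves are not printed since they depend on the implementation.

```shell
out=$(awk 'BEGIN {
    pi = atan2(0, -1)                 # a common way to obtain pi
    srand(1); a = rand()
    srand(1); b = rand()              # same seed, so the same first value
    printf "%d %d %d %d %.4f %d %d\n", int(3.9), int(-3.9), sqrt(16), exp(0), pi, (a == b), (a >= 0 && a < 1)
}')
echo "$out"   # 3 -3 4 1 3.1416 1 1
```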

STRING

https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html

r=regex  s=string  t=targetstring  fs=field separator
fn        args         returns               does
sub       r,s          number of subst made  replace the first match of r with s in $0
          r,s,t        "                     replace the first match of r with s in t
gsub      r,s          "                     replace all matches of r with s in $0
          r,s,t        "                     replace all matches of r with s in t
gensub    r,s,h        modified copy of $0   replace the h'th match of r with s, $0 itself is unchanged
          r,s,h,t      modified copy of t    replace the h'th match of r with s, t itself is unchanged
substr    s,start      substring of s
          s,start,len  "
split     s,a          number of fields      stores the pieces in array a
          s,a,fs       "                     stores the pieces in array a
length    -            number of chars in $0
          s            number of chars in s
index     s,t          0 or n                position of t in s
match     s,r          index or 0            test if s contains r, sets RSTART and RLENGTH
          s,r,a                              … sets a to portions of s that match r
                                               [0] = whole matched part of s
                                               [N, "start"] = starting index of match
                                               [N, "length"] = length of match
sprintf   fmt,…        formatted string
strtonum  s            number                numeric value of s, understands 0x (hex) and leading 0 (octal)
tolower   s            lowercased s
toupper   s            uppercased s
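
A sketch that chains a few of the POSIX string functions from the table. The sample strings are made up; gensub, strtonum and the three-argument match are gawk-only and are left out.

```shell
out=$(awk 'BEGIN {
    s = "hello world"
    n = gsub(/o/, "0", s)             # s is now "hell0 w0rld", n is 2
    if (match(s, /w[0-9a-z]+/))       # sets RSTART and RLENGTH
        word = substr(s, RSTART, RLENGTH)
    cnt = split("a:b:c", parts, ":")  # parts[1..3], cnt is 3
    print s, n, word, RSTART, RLENGTH, cnt, parts[3], index(s, "ll"), toupper(parts[1])
}')
echo "$out"   # hell0 w0rld 2 w0rld 7 5 3 c 3 A
```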

operators

= += -= *= /= %= ^= Assignments
?: Ternary operator
in Array membership
~ !~ Matching

control flow

  • exit
    • on a normal rule, still runs END, but not ENDFILE
    • on BEGIN , still runs END
    • on END , stops
exit             goes immediately to the END action
exit expression  same, and uses expression as the exit status of awk
next             stops processing the current record and moves on to the next one
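
A sketch of next and exit in ordinary rules, on made-up numeric input. The exit status 3 is arbitrary.

```shell
status=0
out=$(printf '1\n2\n3\n4\n' | awk '
    $1 == 2 { next }        # skip this record and move on to the next one
    $1 == 4 { exit 3 }      # stop reading input, END still runs
            { sum += $1 }
    END     { print "sum", sum }
') || status=$?
echo "$out (exit status $status)"   # sum 4 (exit status 3)
```

Since END does not call exit again, the status given in the rule is the one awk exits with.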

output statement

close(filename)  break connection between print and filename
close(command)   break connection between print and command
system(command)  execute command, returns its exit status

getline

https://www.gnu.org/software/gawk/manual/html_node/Getline.html

form                    does                                     sets
getline                 reads next input record                  $0, NF, NR, FNR, RT
getline var             reads next input record into var         var, NR, FNR, RT
getline < file          reads next record from file              $0, NF, RT
getline var < file      reads next record from file into var     var, RT
"cmd" | getline         reads a single line of cmd into awk      $0, NF, RT
"cmd" | getline var     reads a single line of cmd into var      var, RT
"cmd" |& getline        reads from a two-way pipe                $0, NF, RT
"cmd" |& getline var    reads from a two-way pipe into var       var, RT

NOTE: call close("cmd") once done with the non two-way pipes; getline returns 1 on success, 0 at end of input and -1 on error, so loop with while ((… getline) > 0)
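
A sketch of the read loop the note describes. The command string is a stand-in for any shell command.

```shell
out=$(awk 'BEGIN {
    cmd = "echo one; echo two"                  # hypothetical command, run through sh
    while ((cmd | getline line) > 0)            # 1 = got a record, 0 = EOF, -1 = error
        joined = joined (n++ ? "," : "") line
    close(cmd)                                  # lets the same command be run again later
    print n, joined
}')
echo "$out"   # 2 one,two
```

Testing > 0 matters: a plain while (cmd | getline line) would spin forever on an error, since -1 is truthy.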

format strings

        description
%f, %F  float
%a, %A  hexadecimal float
%g, %G  float or scientific notation
%d, %i  decimal integer
%e, %E  scientific notation
%o      unsigned octal
%u      unsigned decimal integer
%x, %X  unsigned hexadecimal integer
%c      character (a number is taken as a character code)
%s      string
%%      literal "%"
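
A sketch running several of these conversions through printf. The values are arbitrary.

```shell
out=$(awk 'BEGIN {
    # %d truncates, %5.2f pads to width 5, %c turns 65 into its character
    printf "%d|%5.2f|%e|%x|%o|%c|%s|%%\n", 42.9, 3.14159, 1234.5, 255, 8, 65, "str"
}')
echo "$out"   # 42| 3.14|1.234500e+03|ff|10|A|str|%
```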

extensions

  • at /usr/share/doc/gawk/examples/lib/*.awk
    • maybe set the OS environment variable AWKPATH to that directory so @include finds them (needed at least for lsp in emacs)
  • @include "join"

    function join(array, start, end, sep,    result, i)
       if (sep == "")     sep = " "
       if (sep == SUBSEP) sep = "" # magic value
    
  • @include "assert" assert(BOOLEAN, "Reason of failure HERE")
  • @include "ord" OR @load "ordchr" https://www.gnu.org/software/gawk/manual/html_node/Extension-Sample-Ord.html
    • ord(STRING) -> NUMBER
    • chr(NUMBER) -> STRING

control flow

  • do while, while, for(;;), for(in)
  • an assignment can be used as the condition of an if

    if (disjoint = r[2] <= m1 || m2 <= r[1])
        continue
    

network

rosetta - web server

https://rosettacode.org/wiki/Hello_world/Web_server

#!/usr/bin/gawk -f
BEGIN {
    RS = ORS = "\r\n"
    HttpService = "/inet/tcp/8080/0/0"
    Hello = "<HTML><HEAD>" \
        "<TITLE>A Famous Greeting</TITLE></HEAD>" \
        "<BODY><H1>Hello, world</H1></BODY></HTML>"
    Len = length(Hello) + length(ORS)
    print "HTTP/1.0 200 OK"          |& HttpService
    print "Content-Length: " Len ORS |& HttpService
    print Hello                      |& HttpService
    while ((HttpService |& getline) > 0)
        continue;
    close(HttpService)
}

redirections

{ print "foo bar" >  "file.txt" } # file output, truncates the file when it is first opened
{ print "foo bar" >> "file.txt" } # file output, appends
{ print "foo bar" |  "grep foo" } # piped to the stdin of a command
{ print "foo bar" |& "cmd"      } # piped IO coproc/socket
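
A sketch of the file and pipe redirections together with close(). The temporary file comes from mktemp, and the variable name f is an assumption for the example.

```shell
tmp=$(mktemp)
out=$(awk -v f="$tmp" 'BEGIN {
    print "first"  >  f        # truncates f when the stream is first opened
    print "second" >  f        # same open stream, so this is written after "first"
    close(f)
    print "third"  >> f        # reopened in append mode
    close(f)
    print "b" | "sort"
    print "a" | "sort"
    close("sort")              # flush the pipe and wait for sort to finish
    while ((getline line < f) > 0) print line
}')
rm -f "$tmp"
echo "$out"
```

It should print a, b, first, second, third, one per line: sort emits its output when the pipe is closed, before awk prints the file contents.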

gotchas

  • https://www.gnu.org/software/gawk/manual/html_node/Conversion gawk always uses the period (.) as the decimal point unless explicitly told to use the locale's LC_NUMERIC, with --posix or --use-lc-numeric (-N)
  • not forcing variables to be local can cause weird issues. Below, middle is a global, so every recursive call overwrites it and the function keeps looping. The early return should happen as soon as possible: if I move the if/return 0 to the top it works just fine, OR if I make "middle" a local variable

    function binarySearch(target,    left, right) {
        middle = int((left+right)/2) # middle is NOT in the parameter list, so it is global
        print "l:", left, "r:", right, "m:", middle, "n[m]="numbers[middle]
        if (left >= right) {
            return 0
        }
        if (numbers[middle] > target) binarySearch(target, left, middle-1)
        if (numbers[middle] < target) binarySearch(target, middle+1, right)
        return numbers[middle] == target
    }
    
  • NF can be reset with NF=0 in END, then new fields added with $(++NF)=???, and the rebuilt record printed with a plain print

    { print "expression" > "filename" }
    { print "expression" | "command" }
    function add_tree (number) { # extra parameters act as local variables, like &aux in Lisp
        return number + 3
    }
    { print add_tree(36) }
    
  • if you use an array as a map or just an array, be careful when
    • checking for equality/inequality: just indexing an element to read it creates the slot, use (key in arr) to test for presence
  • if you use an array as a set to count unique values and the key is made of more than one number, separate them with a string, otherwise 1 23 and 12 3 end up as the same key

    map[x y]   = 1 # BAD
    map[x","y] = 1 # GOOD!
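
A sketch of both array gotchas above, with hypothetical array names seen, bad and good.

```shell
out=$(awk 'BEGIN {
    seen["a"] = 1
    if (seen["b"] == "") probed = 1        # merely reading seen["b"] creates it
    for (k in seen) n1++                   # counts a and b
    if ("c" in seen) found = 1             # membership test creates nothing
    for (k in seen) n2++                   # still a and b
    x = 1;  y = 23; bad[x y] = 1; good[x "," y] = 1
    x = 12; y = 3                          # a different pair of numbers
    print n1, n2, ((x y) in bad), ((x "," y) in good)
}')
echo "$out"   # 2 2 1 0
```

The third column is 1 because both pairs concatenate to "123"; with the "," separator the keys stay distinct.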
    

codebases

snippets

  • print unique lines, without sorting

    $ awk '!x[$0]++' file.txt
    
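  • wEiRd - removes leading space: assigning $1=$1 rebuilds $0 from its fields joined by OFS, which drops leading/trailing blanks and squeezes repeated ones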

    $ awk '{ $1=$1 }1' file.txt
    $ awk '{ $1=$1 }; { print }' file.txt
    $ awk '/.*/ { $1=$1 }; /.*/ { print $0 }' file.txt
    
  • array

    # gawk only: arr is an array of arrays
    function format_matrix(arr,    row, col, res) {
        for (row in arr) {
            for (col in arr[row]) res = res arr[row][col]
            res = res "\n"
        }
        return res
    }
    # map[i+((NR-1)*NF)] = $i
    function print_mat(    rid, cid) {
        print ""
        for (rid = 1; rid <= NR; rid++) {
            for (cid = 1; cid <= NF; cid++) {
                printf "%s", map[cid + ((rid-1)*NF)] # same row stride (NF) as when map was filled
            }
            printf "\n"
        }
    }
    function print_matrix_dimensions(arr) {
        printf "%dx%d\n", length(arr), length(arr[1])
    }
    
  • math

    function max(x, y) { return (x>y)?x:y  }
    function min(x, y) { return (x<y)?x:y  }
    function abs(x)    { return (x<0)?-x:x }
    
  • untested stack?

    function isEmpty()    { return idx == 0 }
    function peek()       { return stack[idx] }
    function push(el)     { print el; stack[++idx] = el }
    function pop(    tmp) { tmp = stack[idx]; delete stack[idx--]; return tmp }
    
  • tested stack? (pop takes the first element and shifts the rest down, so it behaves like a queue)

    function push(a, x) {
        "" in a # coerce into array
        a[length(a) + 1] = x
    }
    
    function pop(a, __x, __i) {
        __x = a[1]
        for (__i = 1; __i < length(a); __i++) a[__i] = a[__i + 1]
        delete a[__i]
        return __x
    }
    
  • PGM - grayscale 1-D array of a 2-D matrix

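    function array2PGM(arr,    out, idx) {
        out = "P2\n"             # format id (plain-text PGM)
        out = out NF " " NR "\n" # dimensions: width height
        out = out 9 "\n"         # max gray value
        for (idx = 1; idx <= NF * NR; idx++) # assumes arr[1..NF*NR]; for (idx in arr) has no guaranteed order
            out = out arr[idx] " "
        return out "\n"
    }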
    
  • check for empty records and fields

    length($0) == 0 { print "this is an empty record==" }
    END { if (NR == 0) print "means that we didn't process any record" }
    

implementations

gawk https://www.gnu.org/software/gawk/
mawk https://web.archive.org/web/20240202023335/https://invisible-island.net/mawk/
goawk https://github.com/benhoyt/goawk
bioawk https://github.com/lh3/bioawk
frawk https://github.com/ezrosent/frawk
wak https://github.com/raygard/wak
nawk https://github.com/onetrueawk/awk
  https://justine.lol/awk/
$ readelf -d /usr/bin/gawk | grep Shared # 689K
 0x0000000000000001 (NEEDED)             Shared library: [libsigsegv.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libreadline.so.8]
 0x0000000000000001 (NEEDED)             Shared library: [libmpfr.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgmp.so.10]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]

$ readelf -d /usr/bin/mawk | grep Shared # 155K
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
  • buffering
    • gawk unbuffered by default
    • mawk buffers by default, needs -W interactive to disable