AWK

book https://www.awk.dev/
wiki https://en.wikipedia.org/wiki/AWK
wikibooks https://en.wikibooks.org/wiki/AWK
web repl https://awk.js.org/
doc https://www.gnu.org/software/gawk/manual/html_node/index.html

language

#!/usr/bin/awk -f
BEGIN { print "Hello, World!"; }
  • Authors
    • Aho Alfred Vaino
    • Weinberger Peter Jay
    • Kernighan Brian Wilson
  • 2023
    • --csv, supports fields surrounded by double quotes
  • https://www.gnu.org/software/gawk/manual/html_node/Conversion gawk always uses the period (.) as the decimal point unless explicitly told to use the locale's LC_NUMERIC setting, via --posix or --use-lc-numeric (-N)
  • not declaring variables as local can cause weird issues. In the function below "middle" is global, so each recursive call overwrites the value its caller is still using, and the function keeps looping instead of returning early. It works just fine if the if/return 0 check is moved to the top, OR if "middle" is made a local variable

    function binarySearch(target,    left, right) {
        middle = int((left+right)/2)
        print "l:", left, "r:", right, "m:", middle, "n[m]="numbers[middle]
        if (left >= right) {
            return 0
        }
        if (numbers[middle] > target) binarySearch(target, left, middle-1)
        if (numbers[middle] < target) binarySearch(target, middle+1, right)
        return numbers[middle] == target
    }
    
  • length($0) == 0 { print "this is an empty record" } END { if (NR == 0) print "means that we didn't process any record" }
  • || and && return booleans, i.e. 0 or 1. They do NOT return the truthy operand.
  • Can reset NF = 0 in END, then append new fields with $(++NF) = ??? and finish with a plain print
  • falsey: "", 0, undefined variables
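A minimal sketch of the binarySearch fix described above, assuming a sorted array filled by split(). It differs from the notes' version in two ways: "middle" is declared as a local (extra parameter), and each recursive result is returned to the caller. It returns the 1-based index of the match, or 0 when the target is absent. The sample data is made up for illustration.

```sh
# binary search with "middle" local and early returns
result=$(awk '
function binarySearch(target, left, right,    middle) {
    if (left > right) return 0                  # empty range: not found
    middle = int((left + right) / 2)
    if (numbers[middle] == target) return middle
    if (numbers[middle] > target) return binarySearch(target, left, middle - 1)
    return binarySearch(target, middle + 1, right)
}
BEGIN {
    n = split("1 3 5 7 9 11", numbers, " ")     # sample sorted data
    print binarySearch(7, 1, n), binarySearch(4, 1, n)
}')
printf '%s\n' "$result"    # 4 0
```

Because "middle" now lives in each call's own frame, a nested call can no longer change the bounds its caller computes.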
{ print "expression" > "filename" }
{ print "expression" | "command" }
function add_three(number) { # local variables can be declared as extra parameters here, like Lisp's &aux
    return number + 3
}
{ print add_three(36) }
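A small sketch of the print | "command" form shown above, assuming a sort binary is on the PATH. All records go to a single sort process; close() makes awk wait for it before printing anything else. The input words are made up for illustration.

```sh
result=$(printf 'pear\napple\nfig\n' | awk '
{ print $0 | "sort" }          # every record is fed to one sort process
END {
    close("sort")              # wait for sort to finish and emit its output
    print "sorted " NR " records"
}')
printf '%s\n' "$result"
```

Without the close() call, the order of the sort output and the final print is not guaranteed.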
  • Unique at the time due to
    • being a scripting language
    • having associative arrays
  • can assign a value inside an if condition

    if (disjoint = r[2] <= m1 || m2 <= r[1])
        continue
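A quick sketch tying together three of the notes above: && and || yield 0 or 1 rather than an operand, unset variables are falsy, and an assignment can sit inside an if condition. The variable names (unset_var, hit) are placeholders.

```sh
result=$(awk 'BEGIN {
    a = ("abc" && 5);       print a    # 1, not 5
    b = ("" || 0);          print b    # 0, both operands are falsy
    c = (unset_var || "x"); print c    # 1, not "x"
    if (hit = (3 <= 4)) print "hit=" hit
}')
printf '%s\n' "$result"
```

The last line prints hit=1 because the comparison is evaluated first and its 0/1 result is what gets assigned and tested.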
    

network

rosetta - web server

https://rosettacode.org/wiki/Hello_world/Web_server

#!/usr/bin/gawk -f
BEGIN {
    RS = ORS = "\r\n"
    HttpService = "/inet/tcp/8080/0/0"
    Hello = "<HTML><HEAD>" \
        "<TITLE>A Famous Greeting</TITLE></HEAD>" \
        "<BODY><H1>Hello, world</H1></BODY></HTML>"
    Len = length(Hello) + length(ORS)
    print "HTTP/1.0 200 OK"          |& HttpService
    print "Content-Length: " Len ORS |& HttpService
    print Hello                      |& HttpService
    while ((HttpService |& getline) > 0)
        continue;
    close(HttpService)
}

style

Types (will be automatically coerced when needed)

https://www.gnu.org/software/gawk/manual/html_node/Variable-Typing.html

  1. Strings
    • indices start at 1
  2. Numbers
  3. Arrays
    • no fixed first index: keys are arbitrary, but split() and asort() fill from 1
    • 1D in POSIX awk (a[i, j] is a single key joined with SUBSEP); gawk adds true arrays of arrays
    • values can be strings or numbers; subscripts are always stored as strings
    • no need to be declared
    • ALWAYS associative (aka hashtables)
    • for (variable in array)
    • delete array[subscript]
    • to coerce an untyped parameter into an array inside a function body, use "" in arr
    • https://www.gnu.org/software/gawk/manual/html_node/Controlling-Array-Traversal.html
      • comp_func(i1, v1, i2, v2) < 0 Index i1 comes before index i2
      • comp_func(i1, v1, i2, v2) == 0 Indices i1 and i2 come together
      • comp_func(i1, v1, i2, v2) > 0 Index i1 comes after index i2
    • Set the order in which an already created array is traversed by for (variable in array)
      • PROCINFO["sorted_in"] = "afunctionname" like comp_func(index1, value1, index2, value2)
      • PROCINFO["sorted_in"] = "@val_num_asc"
      • PROCINFO["sorted_in"] = "@val_num_desc"
      • PROCINFO["sorted_in"] = "@val_str_asc"
      • PROCINFO["sorted_in"] = "@val_str_desc"
      • PROCINFO["sorted_in"] = "@ind_num_asc"
      • PROCINFO["sorted_in"] = "@ind_num_desc"
      • PROCINFO["sorted_in"] = "@ind_str_asc"
      • PROCINFO["sorted_in"] = "@ind_str_desc"
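A minimal sketch of the array notes above using only POSIX awk features (no PROCINFO). It shows split() filling from index 1, membership tests with "in", delete, and SUBSEP-joined subscripts. The helper count() and the sample data are made up for illustration.

```sh
result=$(awk '
function count(arr,    k, c) {          # portable element count
    for (k in arr) c++
    return c + 0
}
BEGIN {
    n = split("a b c", parts, " ")      # split() fills indices 1..n
    print n, parts[1], parts[n]
    has0 = (0 in parts); has3 = (3 in parts)
    print has0, has3                    # "in" tests membership without creating the key
    delete parts[2]
    print count(parts)
    grid[1, 2] = "x"                    # stored under the single key 1 SUBSEP 2
    key = 1 SUBSEP 2
    same = (key in grid); print same
}')
printf '%s\n' "$result"
```

The lines printed are "3 a c", "0 1", "2" and "1": there is no element 0 after split(), and grid[1, 2] is the same entry as grid[1 SUBSEP 2].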

built-in variables

          meaning                                       default
FPAT      regex of what each field contains             -
NF        number of fields in the current record        -
NR        number of records (aka lines) read so far     -
FNR       number of records read so far, in curr file   -
FS        controls the input field separator            " "
RS        controls the input record separator           "\n"
OFS       output field separator                        " "
ORS       output record separator                       "\n"
OFMT      output format for numbers                     "%.6g"
ARGC      number of cli arguments                       -
ARGV      array of cli arguments                        -
ENVIRON   array of environment variables                -
RLENGTH   length of string matched by match function    -
RSTART    start of string matched by match function     -
FILENAME  name of current input file                    -
SUBSEP    subscript separator                           "\034"
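A short sketch of a few of the variables in the table above working together. FS and OFS are set in BEGIN, then NR, NF, $1 and $NF are printed for each record. The colon-separated input is made up for illustration.

```sh
result=$(printf 'alice:30:nyc\nbob:25\n' | awk '
BEGIN { FS = ":"; OFS = " | " }    # input split on ":", output joined with " | "
{ print NR, NF, $1, $NF }          # record number, field count, first and last field
')
printf '%s\n' "$result"
```

This prints "1 | 3 | alice | nyc" and "2 | 2 | bob | 25". OFS is only inserted where print arguments are separated by commas.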

built-in functions

TIME

https://www.gnu.org/software/gawk/manual/html_node/Time-Functions.html

mktime    DATESTR, UTC?          given DATESTR, timestamp in seconds since epoch
strftime  FMT, TIMESTAMP, UTC?   TIMESTAMP formatted according to FMT
systime   -                      now, timestamp in seconds since epoch

BITWISE

https://www.gnu.org/software/gawk/manual/html_node/Bitwise-Functions.html

                    returns
and(v1,v2,…)        bitwise AND of the arguments
xor(v1,v2,…)        bitwise XOR of the arguments
or(v1,v2,…)         bitwise OR of the arguments
compl(val)          complement
lshift(val, count)  val left shifted by count bits
rshift(val, count)  val right shifted by count bits

ARRAY

                 returns                    does
asort(SRC,DST)   number of elements in SRC  sort by value, DST has idx=numeric val=old_value
asorti(SRC,DST)  number of elements in SRC  sort by index, DST has idx=numeric val=old_index
isarray(arr)     boolean

MATH

https://www.gnu.org/software/gawk/manual/html_node/Numeric-Functions.html

            returns
atan2(y,x)  arctangent of y/x, in the -π to π range
cos(x)      cosine of x, with x in radians
sin(x)      sine of x, with x in radians
exp(x)      e raised to the power x
log(x)      natural (base e) logarithm of x
sqrt(x)     square root of x
int(x)      integer part of x, truncated toward zero
rand()      random number r, 0 <= r < 1
srand(x)    previous seed; x is the new seed for rand()
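A few of the numeric functions above in one sketch. atan2(0, -1) is a common way to get π, since awk has no built-in constant for it. The seed 42 is arbitrary.

```sh
result=$(awk 'BEGIN {
    pi = atan2(0, -1)                  # angle of the point (-1, 0)
    printf "%.4f\n", pi
    print int(3.9), int(-3.9)          # truncation toward zero, not rounding
    print sqrt(16), exp(0), log(1)
    srand(42); r = rand()
    ok = (r >= 0 && r < 1); print ok   # rand() stays inside [0, 1)
}')
printf '%s\n' "$result"
```

The output is "3.1416", "3 -3", "4 1 0" and "1".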

STRING

https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html

                     returns               does
sub(r,s)             number of subst made  replace the first match of r with s in $0
sub(r,s,t)           number of subst made  replace the first match of r with s in t
gsub(r,s)            number of subst made  replace all matches of r with s in $0
gsub(r,s,t)          number of subst made  replace all matches of r with s in t
gensub(r,s,h)        modified copy of $0   replace the h'th match of r with s, $0 itself is unchanged
gensub(r,s,h,t)      modified copy of t    replace the h'th match of r with s, t itself is unchanged
substr(s,start)      substring of s        from start to the end of s
substr(s,start,len)  substring of s        at most len characters from start
split(s,a)           number of fields      stores the pieces in array a, split on FS
split(s,a,fs)        number of fields      stores the pieces in array a, split on fs
length()             number of chars in $0
length(s)            number of chars in s
index(s,t)           0 or n                position of t in s
match(s,r)           index or 0            test if s contains r, sets RSTART and RLENGTH
match(s,r,a)         index or 0            … sets a to portions of s that match r
                                               [0] = whole matched part of s
                                               [N, "start"] = starting index of match
                                               [N, "length"] = length of match
sprintf(fmt, …)      formatted string
strtonum(s)          numeric value of s
tolower(s)           lowercased s
toupper(s)           uppercased s

operators

= += -= *= /= %= ^=  Assignments
?:                   Ternary operator
in                   Array membership
~ !~                 Matching

control flow

  • exit
    • in a normal rule, still runs END, but not ENDFILE
    • in BEGIN, still runs END
    • in END, stops
exit             skips the remaining input and goes immediately to the END action
exit expression  same, with expression as awk's exit status
next             skips to the next record of input
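A small sketch of next and exit as described above. The second record is skipped, the fourth one stops input processing, and END still runs. The numeric input is made up for illustration.

```sh
result=$(printf '1\n2\n3\n4\n' | awk '
$1 == 2 { next }               # skip this record, no later rule sees it
$1 == 4 { exit }               # stop reading input, but END still runs
{ sum += $1 }
END { print "sum=" sum }')
printf '%s\n' "$result"
```

Only records 1 and 3 reach the summing rule, so the output is sum=4.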

output statement

close(filename)  break connection between print and filename
close(command)   break connection between print and command
system(command)  execute command

getline

https://www.gnu.org/software/gawk/manual/html_node/Getline.html

getline                 reads next input record                 NF, NR, FNR, RT, $0
getline var             reads next input record into var        var, NR, FNR, RT
getline < file          reads next record from file             NF, RT, $0
getline var < file      reads next record from file into var    var, RT
"cmd" | getline         reads a single line of cmd into awk     NF, RT, $0
"cmd" | getline var     reads a single line of cmd into var     var, RT
"cmd" |& getline        reads from a two-way pipe               NF, RT, $0
"cmd" |& getline var    reads from a two-way pipe into var      var, RT

NOTE: call close("cmd") on the non two-way pipes
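A minimal sketch of the "cmd" | getline var form from the table above, followed by the close() the note asks for. The command string is an arbitrary stand-in for any program that prints lines.

```sh
result=$(awk 'BEGIN {
    cmd = "echo one; echo two"
    while ((cmd | getline line) > 0)   # > 0 means a record was read; 0 is EOF, -1 an error
        lines[++n] = line
    close(cmd)                         # same string that opened the pipe
    print n, lines[1], lines[2]
}')
printf '%s\n' "$result"    # 2 one two
```

Comparing against > 0 matters: a bare while (cmd | getline line) would loop forever on an error, because -1 is truthy.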

format strings

        description
%f, %F  float
%a, %A  hexadecimal float
%g, %G  float or scientific notation, whichever is shorter
%d, %i  decimal integer
%e, %E  scientific notation
%o      unsigned octal
%u      unsigned decimal integer
%x, %X  unsigned hexadecimal integer
%c      number as a character
%s      string
%%      literal "%"
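A sketch of some of the conversions above with width, precision and flag modifiers added. The values are arbitrary.

```sh
result=$(awk 'BEGIN {
    # %d truncates, %5.2f pads to width 5, %-5s left-aligns, %05d zero-pads
    printf "%d|%5.2f|%-5s|%05d\n", 42.9, 3.14159, "ab", 42
    printf "%x %X %o %c %e %%\n", 255, 255, 8, 65, 1234.5
}')
printf '%s\n' "$result"
```

The first line comes out as "42| 3.14|ab   |00042" and the second as "ff FF 10 A 1.234500e+03 %".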

extensions

codebases

snippets

  • wEiRd - assigning $1=$1 rebuilds $0 with OFS, which removes leading and trailing whitespace and squeezes the runs between fields

    $ awk '{ $1=$1 }1' file.txt
    $ awk '{ $1=$1 }; { print }' file.txt
    $ awk '/.*/ { $1=$1 }; /.*/ { print $0 }' file.txt
    
  • array

    function format_matrix(arr,    row, col, res) {
        for (row in arr) {
            # plain concatenation: sprintf(value) would treat the data as a format string
            for (col in arr[row]) res = res arr[row][col]
            res = res "\n"
        }
        return res
    }
    function print_matrix_dimensions(arr) {
        printf "%dx%d\n", length(arr), length(arr[1])
    }
    
  • math

    function max(x, y) { return (x>y)?x:y  }
    function min(x, y) { return (x<y)?x:y  }
    function abs(x)    { return (x<0)?-x:x }
    
  • untested stack?

    function isEmpty()    { return idx == 0 }
    function peek()       { return stack[idx] }
    function push(el)     { print el; stack[++idx] = el }
    function pop(    tmp) { tmp = stack[idx]; delete stack[idx--]; return tmp }
    
  • tested stack? (pop removes from the front, so this behaves as a FIFO queue)

    function push(a, x) {
        "" in a # coerce into array
        a[length(a) + 1] = x
    }
    
    function pop(a, __x, __i) {
        __x = a[1]
        for (__i = 1; __i < length(a); __i++) a[__i] = a[__i + 1]
        delete a[__i]
        return __x
    }
    
  • PGM - grayscale 1-D array of a 2-D matrix

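    function array2PGM(arr,    out, idx) {
        out = "P2\n"              # format id (plain-text graymap)
        out = out NF " " NR "\n"  # dimensions: width height
        out = out 9 "\n"          # max gray value
        for (idx in arr)          # for-in order is unspecified unless PROCINFO["sorted_in"] is set (gawk)
            out = out arr[idx] " "
        return out "\n"
    }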
    

implementations

gawk    https://www.gnu.org/software/gawk/
mawk    https://web.archive.org/web/20240202023335/https://invisible-island.net/mawk/
nawk    https://github.com/onetrueawk/awk
        https://justine.lol/awk/
goawk   https://github.com/benhoyt/goawk
bioawk  https://github.com/lh3/bioawk
$ readelf -d /usr/bin/gawk | grep Shared # 689K
 0x0000000000000001 (NEEDED)             Shared library: [libsigsegv.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libreadline.so.8]
 0x0000000000000001 (NEEDED)             Shared library: [libmpfr.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgmp.so.10]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]

$ readelf -d /usr/bin/mawk | grep Shared # 155K
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
  • buffering
    • gawk unbuffered by default
    • mawk buffers by default, needs -W interactive to disable

tools


Created: 2023-09-04

Updated: 2024-11-16
