- 1977
- Authors
- Aho Alfred Vaino
- Weinberger Peter Jay
- Kernighan Brian Wilson
cli
$ awk '{ print $1 }' file.csv
$ awk -f script.awk file.csv
arg | description | ||
---|---|---|---|
-F | –field-separator | S | sets FS |
-f | –file | <FILEPATH> | runs script |
-E | –exec | <FILEPATH> | runs script (for gawk cgi) |
-v | –assign | var=val | sets var to val |
-c | –traditional | - | compatibility mode |
-P | –posix | - | compatibility mode extra |
-S | –sandbox | - | disables system() and IO redirections |
-d- | –dump-variables=- | - | dump variables status to stdout |
language
#!/usr/bin/awk -f #!/usr/bin/env -S gawk -f END { print "Hello, World!"; }
- gawk version changelog (by date)
- 2023 (5.3)
--csv
flag, supports fields surrounded by double quotes\u
escape sequence
- 2023 (5.3)
style
- guide https://github.com/soimort/translate-shell/wiki/AWK-Style-Guide
- CamelCase, variables
- lowercase, functions
- 4 spaces or 4 #### for local variables in a function
- case have a default branch
- use "IF" and not "ELSE IF" if the previous branch uses a "RETURN"
- use () before pipes |
types
weakly typed: their type can change as the program runs
- gnu: typeof(a)
- untyped, unassigned
- strnum
- regexp (?
string
- index start at 1
number
- || and && return
booleans
AKA 0(false) and 1(true). They do NOT return the truthy value. - falsy: "", 0, undefined variables
"comparing floating-point values to see if they are exactly equal is generally a bad idea"
ratio = dimension[1] / dimension[2] # ratio = 1.77778 of type number if (ratio == 1.77778 ) # does NOT work if (ratio == "1.77778") # works
- support for scientific notation, eg: 1e8
- support for
float
- gawk: support for
bignum
through GMP - gawk: support for arbitrary precision through MPFR (to be removed soon?)
array
- are associative arrays (aka hashtables)
- no need to be declared
- for strings or numbers
- 1D
- 2D support can be mimicked by using
[x,y]
as index and(x,y) in arr
for checking - 2D support in gawk
- index
- are strings
- start at 1
- at least the ones returned by stdlib functions
- you can make it start by 0(zero) if you use a custom variable to initialize it
https://www.gnu.org/software/gawk/manual/html_node/Controlling-Array-Traversal.html
comp_func(i1, v1, i2, v2) < 0 # Index i1 comes before index i2 comp_func(i1, v1, i2, v2) == 0 # Indices i1 and i2 come together comp_func(i1, v1, i2, v2) > 0 # Index i1 comes after in2
Set the order an already created array would be presented on a forIn
PROCINFO["sorted_in"] = "afunctionname" # see comp_func PROCINFO["sorted_in"] = "@val_num_asc" PROCINFO["sorted_in"] = "@val_num_desc" PROCINFO["sorted_in"] = "@val_str_asc" PROCINFO["sorted_in"] = "@val_str_desc" PROCINFO["sorted_in"] = "@ind_num_asc" PROCINFO["sorted_in"] = "@ind_num_desc" PROCINFO["sorted_in"] = "@ind_str_asc" PROCINFO["sorted_in"] = "@ind_str_desc"
built-in variables
- RS="^$" reads the whole file as a single record
- FPAT https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html
- For csv, FPAT = "([^,]+)|(\"[^\"]+\")"
- Instead of using FS to specify what the fields are not
- We use this to specify what are the fields, in the form of a regular expression.
DESCRIPTION | DEFAULT | |
---|---|---|
FPAT |
regex of what each field contains | "[^[:space:]]+" |
FIELDWIDTHS |
whitespace separated list field widths | "" |
NF | numer of fields in line | 0 |
NR | number of records (aka lines) read so far | 0 |
FNR | number of records read so far, in curr file | 0 |
FS | controls the input field separator | " " |
RS | controls the input record separator | "\n" |
OFS | output field separator | " " |
ORS | output record separator | "\n" |
OFMT | output format for numbers | "%.6g" |
ENVIRON | array of environment variables | |
ARGV | array of cli arguments | |
ARGC | number of cli arguments | 0 |
ARGIND |
index of ARGV being processed | 0 |
FILENAME | name of current input file | "" |
RLENGTH | length of string matched by match function | 0 |
RSTART | start of string matched by match function | 0 |
SUBSEP | subscript separator | "\034" |
IGNORECASE |
all but array subscripting will ignore case | 0 |
built-in functions
TIME
https://www.gnu.org/software/gawk/manual/html_node/Time-Functions.html
mktime | DATESTR, UTC? | given DATESTR, timestamp in seconds since epoch |
strftime | FMT, TIMESTAMP, UTC? | |
systime | - | now, TIMESTAMP in seconds since epoch |
- where DATESTR is a space separated "YYYY MM DD HH MM SS DST? 0|1"
- where FMT can be "%Y-%m-%d %H:%M:%S"
BITWISE
https://www.gnu.org/software/gawk/manual/html_node/Bitwise-Functions.html
fn | args | returns |
---|---|---|
and | v1,v2,… | |
xor | v1,v2,… | |
or | v1,v2,… | |
compl | val | complement |
lshift | val,count | val left shifted by count bits |
rshift | val,count | val right shifter by count bits |
ARRAY
<r> | returns | does |
---|---|---|
asort(SRC,DST) | number of elements in SRC | sort by value, DST has idx=numeric val=old_value |
asorti(SRC,DST) | number of elements in SRC | sort by index, DST has idx=numeric val=old_index |
isarray(arr) | boolean | |
delete arr[1] | ? | deletes element "1" from array |
"" in arr | ? | coerce arr into array type (in a function?) |
for (i in arr) | ? | iterates over array indexes (i) |
MATH
https://www.gnu.org/software/gawk/manual/html_node/Numeric-Functions.html
fn | arg | returns |
---|---|---|
atan2 | y,x | arctangent of y/x in -x to x range |
cos | x | cosine of x, with x in radians |
sin | x | sine of x, with x in radians |
exp | x | |
log | x | ntural base e logarithm of x |
sqrt | x | |
int | x | integer part of x, truncated |
rand | - | random nuber r, 0 <= r < 1 |
srand | x | x is new seed for rand() |
STRING
https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html
r=regex s=string t=targetstring fs=field separator
fn | args | returns | does |
---|---|---|---|
sub | r,s | number of subst made | substitute one r for s in $0 |
r,s,t | " | substitute one r for s in t | |
gsub | r,s | " | substitute all r for s in $0 |
r,s,t | " | substitute all r for s in t | |
gensub | r,s,h | copy of s modified | substitute h'th instance of r by s in $0 |
r,s,h,t | " | substitute h'th instance of r by s in t | |
substr | s,start | substring of s | |
s,start,len | " | ||
split | s,a | number of fields | stores the pieces in array a |
s,a,fs | " | stores the pieces in array a | |
length | - | number of chars in $0 | |
s | number of chars in s | ||
index | s,t | 0 or n position of t in s | |
match | s,r | index or 0 | test if s contains r, sets RSTART and RLENGTH |
s,r,a | … sets a to portions of s that match r | ||
[0] = whole matched part of s | |||
[N, "start"] = starting index of match | |||
[N, "length"] = length of match | |||
sprintf | fmt,… | formated string | |
strtonum | s | ||
tolower | s | lowercased s | |
toupper | s | uppercased s |
operators
= += -= *= /= %= ^= | Assigments |
?: | Ternary operator |
in | Array membership |
~ !~ | Matching |
control flow
- exit
- on a normal rule, still runs END, but not ENDFILE
- on BEGIN , still runs END
- on END , stops
exit | goes immediately to the END action |
exit expression | |
next | skips to the next line of input |
output statement
close | filename | break connection between print and filename |
close | command | break connection between print and command |
system | command | execute command |
getline
https://www.gnu.org/software/gawk/manual/html_node/Getline.html
getline | reads next input record | NF, NR, FNR, RT, $0 |
getline var | reads n.i.r. into var | NR, FNR, RT |
getline < file | reads n.i.r. from file | NF, RT, $0 |
getline var < file | reads n.i.r. from file into var | - |
"cmd" ¦ getline | reads a single line of cmd into awk | NF, RT, $0 |
"cmd" ¦ getline var | reads a single line of cmd into var | RT |
"cmd" ¦& getline | reads from a two-way pipe | NF, RT, $0 |
"cmd" ¦& getline var | reads from a two-way pipe into var | RT |
NOTE: call close("cmd")
on the non two-way pipes, maybe call getline on a while>0
format strings
- https://www.gnu.org/software/gawk/manual/html_node/Control-Letters.html
- https://www.gnu.org/software/gawk/manual/html_node/Format-Modifiers.html
- %+-width.prec(?)
description | |
---|---|
%f, %F | float |
%a, %A | float hexa |
%g, %G | float or scientific notation |
%d, %i | decimal integer |
%e, %E | scientific notation |
%o | unsigned octal |
%u | unsigned decimal integer |
%x, %X | unsigned hexadecimal integer |
%c | numbers as character |
%s | string |
%% | literal "%" |
extensions
- at /usr/share/doc/gawk/examples/lib/*.awk
- maybe set on OS environment variable
AWKPATH
(at least for lsp emacs)
- maybe set on OS environment variable
@include "join"
function join(array, start, end, sep, result, i) if (sep == "") sep = " " if (sep == SUBSEP) sep = "" # magic value
- @include "assert" assert(BOOLEAN, "Reason of failure HERE")
- @include "ord" OR @load "ordchr" https://www.gnu.org/software/gawk/manual/html_node/Extension-Sample-Ord.html
- ord(STRING) -> NUMBER
- chr(NUMBER) -> STRING
control flow
- do while, while, for(;;), for(in)
can assign a value on a if
if (disjoint = r[2] <= m1 || m2 <= r[1]) continue
network
- https://www.gnu.org/software/gawk/manual/html_node/TCP_002fIP-Networking.html
- https://www.gnu.org/software/gawk/manual/gawkinet/html_node/index.html
- https://www.gnu.org/software/gawk/manual/gawkinet/gawkinet.html#Primitive-Service
/inet[,4,6]/(udp|tcp)/lport/rhost/rport
rossetta - web server
https://rosettacode.org/wiki/Hello_world/Web_server
#!/usr/bin/gawk -f BEGIN { RS = ORS = "\r\n" HttpService = "/inet/tcp/8080/0/0" Hello = "<HTML><HEAD>" \ "<TITLE>A Famous Greeting</TITLE></HEAD>" \ "<BODY><H1>Hello, world</H1></BODY></HTML>" Len = length(Hello) + length(ORS) print "HTTP/1.0 200 OK" |& HttpService print "Content-Length: " Len ORS |& HttpService print Hello |& HttpService while ((HttpService |& getline) > 0) continue; close(HttpService) }
redirections
- https://www.gnu.org/software/gawk/manual/html_node/Redirection.html
- see getline
- in pipes, it's a good idea to call
close(cmd)
on them
{ print "foo bar" > "file.txt" } # file output { print "foo bar" >> "file.txt" } # file output { print "foo bar" | "grep foo" } { print "foo bar" |& "cmd" } # piped IO coproc/socket
gotchas
- https://www.gnu.org/software/gawk/manual/html_node/Conversion gawk always uses the period (.) as the decimal point unless told explicitly to use the local LC_NUMERIC –posix –use-lc-numeric (-N)
sometimes not enforcing variables to be local can cause weird issues. early return, should happen as soon as possible otherwise this function will keep looping… If I move the if/return0 to the top it works just fine OR if I make "middle" a local variable
function binarySearch(target, left, right) { middle = int((left+right)/2) print "l:", left, "r:", right, "m:", middle, "n[m]="numbers[middle] if (left >= right) { return 0 } if (numbers[middle] > target) binarySearch(target, left, middle-1) if (numbers[middle] < target) binarySearch(target, middle+1, right) return numbers[middle] == target }
Can redefine NF=0 at END and then add new $(++NF)=??? to later just print
{ print "expression" > "filename" } { print "expression" | "command" } function add_tree (number) { # local variables can be declared here too, like &aux return number + 3 } { print add_tree(36) }
- if you use an array as a map or just an array, be careful when
- checking for equality/inequality as just indexing the value to read it will create the slot
if you use an array as a set, to count unique values, if using more than one number, separate by a string
map[x y] = 1 # BAD map[x","y] = 1 # GOOD!
codebases
snippets
print unique lines, without sorting
$ awk '!x[$0]++' file.txt
wEiRd - removes leading space
$ awk '{ $1=$1 }1' file.txt $ awk '{ $1=$1 }; { print }' file.txt $ awk '/.*/ { $1=$1 }; /.*/ { print $0 }' file.txt
array
function format_matrix( arr, row, col, res) { for (row in arr) { for (col in arr[row]) res = res sprintf(arr[row][col]) res = res sprintf("\n") } return res } # map[i+((NR-1)*NF)] = $i function print_mat( rid, cid) { print "" for (rid = 1; rid <= NR; rid++) { for (cid = 1; cid <= NF; cid++) { printf map[cid + ((rid-1)*NR)] } printf "\n" } } function print_matrix_dimensions( arr) { printf "%dx%d\n", length(arr), length(arr[1]) }
math
function max( x,y) { return (x>y)?x:y } function min( x,y) { return (x<y)?x:y } function abs( x) { return (x<0)?-x:x }
untestes stack?
function isEmpty() { return idx == 0 } function peek() { return stack[idx] } function push(el) { print el; stack[++idx] = el } function pop( tmp) { tmp = stack[idx]; delete stack[idx--]; return tmp }
tested stack?
function push(a, x) { "" in a # coerce into array a[length(a) + 1] = x } function pop(a, __x, __i) { __x = a[1] for (__i = 1; __i < length(a); __i++) a[__i] = a[__i + 1] delete a[__i] return __x }
PGM - grayscale 1-D array of a 2-D matrix
function array2PGM(arr, out) { out = out "P2" # format id out = out NF" "NR # dimensions out = out 9 # max value for (idx in cache) out = out arr[idx] " " return out "\n" }
check for empty records and fields
length($0) == 0 { print "this is an empty record==" } END { if (NR == 0) print "means that we didn't process any record" }
implementations
$ readelf -d /usr/bin/gawk | grep Shared # 689K 0x0000000000000001 (NEEDED) Shared library: [libsigsegv.so.2] 0x0000000000000001 (NEEDED) Shared library: [libreadline.so.8] 0x0000000000000001 (NEEDED) Shared library: [libmpfr.so.6] 0x0000000000000001 (NEEDED) Shared library: [libgmp.so.10] 0x0000000000000001 (NEEDED) Shared library: [libm.so.6] 0x0000000000000001 (NEEDED) Shared library: [libc.so.6] $ readelf -d /usr/bin/mawk | grep Shared # 155K 0x0000000000000001 (NEEDED) Shared library: [libm.so.6] 0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
- buffering
gawk
unbuffered by defaultmawk
buffers by default, needs-W interactive
to disable