Back in ~2005, when I was still very new to Linux, I had this old Dell that I put Gentoo on. It was a 900MHz Intel Coppermine with a minuscule amount of memory, so things compiled very slowly. Being the wise Linux guru that I was back then, I decided that I would emerge (compile) all of X, fluxbox, OpenOffice, and probably 1 or 2 other things.
I didn't have distcc set up. It compiled for days.
One thing I really did enjoy was watching the cryptic messages crawl by, and one that always mystified me was during the configure stages:
checking for gawk... gawk
It seemed like a total nonsense word. To this day I still have fond memories every time someone mentions (g)awk.
Yes, easy reading. As a non-programmer, the only thing that lost me was the throwaway line in the code comments about "function arguments are call-by-value". I have some recollection that call-by-value and call-by-reference are options here, but I'm not entirely clear on what "call-by-value" means.
Am I right in saying that if I call a function on $2, say, then unless that variable is explicitly assigned in the function, $2 remains unchanged after the function operates? But that with call-by-reference the value of the variable itself will be altered? Something along those lines.
> Am I right in saying that if I call a function on $2, say, then unless that variable is explicitly assigned in the function, $2 remains unchanged after the function operates? But that with call-by-reference the value of the variable itself will be altered?
You've got it. At least that's what passing a variable by reference vs. by value means in other languages, such as PHP.
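In awk specifically, scalars are passed by value while arrays are passed by reference, so a minimal sketch of the difference might look like this:

    awk '
    function bump(x)   { x = x + 1 }        # scalar argument: the function gets a copy
    function fill(arr) { arr["k"] = "v" }   # array argument: the caller sees the change
    BEGIN {
        n = 5; bump(n); print n             # prints 5, unchanged
        fill(a); print a["k"]               # prints v
    }'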
General question: when would one choose to use an awk script over something more general purpose such as a python or ruby script? To me it would make sense to use the latter in most cases.
Good question. I've been working with awk for several years now, and here's how I feel about it.
AWK is old. 1977 old. Later versions that appeared (nawk and gawk are the most common) helped make it a smoother language, but it's still a pain. There are definite features you will be missing in an awk script:
- Any useful data structure slightly more complex than associative arrays. Try multidimensional arrays; it's actually fun to do. Once. (See the sketch after this list.)
- Any useful programming construct to help manage the complexity of scripts longer than a few hundred lines. No classes, no variable scoping, no namespaces in general. Not to mention an extremely permissive interpreter.
- Any easy way to deal with the environment (other than text). Try sending HTTP requests in awk; it can be a pain.
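For the curious, here is a minimal sketch of what a "multidimensional" array looks like in awk: it is really a flat associative array whose indices are joined with SUBSEP.

    awk 'BEGIN {
        grid[1,2] = "x"                 # actually stored as grid["1" SUBSEP "2"]
        for (key in grid) {
            split(key, idx, SUBSEP)     # recover the individual indices
            print idx[1], idx[2], grid[key]
        }
    }'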
In 2015, if you need to write a script, you should almost always prefer Python/Ruby over awk.
Now, if you're asking whether you should learn awk: it comes in handy. There are a lot of awk scripts in the wild, and you may need to read or edit them one day. Also, awk has a fun way of parsing input, which makes for very enjoyable one-liners. Learning awk (and some complementary utilities like sed or find) turned me into an "oh, it can be done in a quick one-liner" kind of guy. Definitely recommended.
So you get an ugly program that requires a third-party tool, rather than just a POSIX system. I would say that this solution then requires additional justification compared to the Awk no-brainer.
I use awk for monitoring scripts which must run on large-scale (>200-node) clusters, because:
1. It is lightweight -- there is no need to load shared libraries, modules, plugins, etc., and it uses less memory than the more general-purpose scripting languages do.
2. It comes with BusyBox, so if you have a BusyBox PXE boot environment you can make the AWK scripts work under it, and also work independently of any host OS that gets booted, as long as it's Unix-like. Perl, Python and Ruby are not in BusyBox, and their versions can vary a lot across different OS distributions and cluster configurations. AWK avoids this dependency because it's stable and self-contained in one executable file.
3. AWK automatically splits records into fields based on a settable FS, which makes it very easy to parse things like /proc/stat and /proc/loadavg into $1 $2 $3 ... fields. The code looks nicer and is more compact because of the automatic field variables.
4. The default RS is newline, and AWK is designed to perform actions on records. Most of the scripts I write perform an action upon receiving a newline on their standard input as a trigger. So most of the scripts tend to be of the "BEGIN { } { }" variety -- initialization in the BEGIN section, and then an unconditional action to fire at every newline, such as collecting /proc statistics and reporting them up the cluster hierarchy. AWK is naturally suited for this REPL behavior without needing any boilerplate code.
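A rough sketch of the shape described in points 3 and 4 (illustrative only, not one of the actual cluster scripts):

    # one shot: the automatic fields $1 $2 $3 are the 1-, 5- and 15-minute load averages
    awk '{ print "load:", $1, $2, $3 }' /proc/loadavg

    # trigger style: re-read /proc/loadavg and report it for every newline arriving on stdin
    awk 'BEGIN { f = "/proc/loadavg" }
         { getline load < f; close(f)
           split(load, a, " ")
           print "load1m", a[1] }'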
I personally use awk exclusively for one-liners either in an interactive shell or in a shell script. It's incredibly handy for taking a chunk of text and pulling out the bit you're interested in concisely.
Awk is optimized for working with the output of other Unix tools. It's great for writing filters for larger output, like "show me the third column of all lines in <million-line-long-logfile> that contain the word 'moose'". Of course you can do this in Python or Ruby, but it's very convenient in awk, since it's really designed for exactly that task.
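Assuming whitespace-separated columns, that filter is about as short as it gets:

    awk '/moose/ { print $3 }' million-line-long-logfile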
As others have noted, awk is great for one-liners. Just minutes ago, I wanted to find out what users and groups own all of the files within a certain deep directory tree (on Linux). I did this:
I also do this, but rather than parsing the output of ls, the right way would really be to remember the syntax of stat(1). That way you avoid any trouble with spaces and other such problems, and the command is not much longer:
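Something along these lines (I'm guessing at the exact format string, but GNU stat's %U and %G give the owner and group names) avoids parsing ls entirely:

    find . -exec stat -c '%U %G' {} + | sort | uniq -c | sort -rn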
Nice! I thought for a minute about using find, but I didn't know about those options to stat, and I just punted and used ls. I need to study the stat man page, apparently.
You kinda alluded to it, but I'm going to leave this here for reference. Here's a good explanation of why you should avoid parsing the output of `ls`: http://mywiki.wooledge.org/ParsingLs
> when would one choose to use an awk script over something more general purpose such as a python or ruby script?
You have almost answered your own question.
In situations where you are dealing with munging data which has a structure that is implicitly handled by awk (one or more files, consisting of regularly-delimited records, which break into regularly-delimited fields), it is very difficult to beat awk for succinctness.
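For instance, assuming a hypothetical access log where $1 is the client address and $10 is a byte count, summing traffic per client is a one-liner:

    awk '{ bytes[$1] += $10 } END { for (ip in bytes) print ip, bytes[ip] }' access.log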
The second thing is this: Awk has been around for decades and is part of the POSIX standard. A shell script that uses awk commands can be fully POSIX-compliant and work on any POSIX-like system with minimal changes, without the installation of third-party software.
So, with Awk you can do a lot of things that would otherwise require something like Python. In many cases you can do them with clearer code that has less clutter, and your solution is POSIX, to boot.
The downside of Awk is that it sacrifices reliability for succinctness. Awk does not detect undefined variables; any variable you mention becomes defined. It has loose arithmetic: you can increment a nonexistent variable or array element by one, and it behaves as if it had been defined as zero. Awk only has local variables in functions, and they are modeled as extra parameters (for which you could pass values, but you don't, unless perpetrating a hack).
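The extra-parameter convention looks like this (a minimal sketch; the extra whitespace before tmp is just a convention marking it as a local):

    awk '
    function sum(a, b,    tmp) {   # tmp is a "local": an argument nobody ever passes
        tmp = a + b
        return tmp
    }
    BEGIN { print sum(1, 2) }'     # prints 3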
Basically, whereas you can do actual software engineering in Python and Ruby, you'd be crazy to do it in Awk, even if the lack of libraries, data types and such isn't an impediment to doing it in Awk.
I only use it for one-liners or a-few-liners now, for unixy data-munging. (I'd still recommend the book by Aho, Kernighan, and Weinberger, because it's full of great examples.)
20-25 years ago I wrote many more things in Awk, up to a Lisp interpreter and a parser generator; but Python/Ruby/etc. have replaced it.
> when would one choose to use an awk script over something more general purpose such as a python or ruby script?
When the others are not around. Sometimes you need to work on weird old or stripped-down boxes that don't have python or even perl - but awk is (almost) always there. Like vi.
If the input is naturally arranged in columns, regardless of separation character, awk is optimal for parsing it, and not bad for actually processing it, either. This goes double if you only need certain lines of input, and you can concisely pick out those lines using a regular expression.
Piping things in and out of awk is really powerful and fast! Check out how Gary Bernhardt uses different Unix commands (including awk) to filter out dead links on his blog: http://vimeo.com/11202537
This is a cool video. I have been using Linux casually for a year and a half now, but haven't progressed too much past the desktop paradigm.
I am not uncomfortable in the terminal, but I feel more like a foreigner who's mastered a phrasebook than someone able to hold even a rudimentary conversation.
Do you (or does anyone else) have more examples similar to this, or resources that would be useful in moving toward this fluency in utilizing command line tools?
I would strongly urge you to purchase Gary Bernhardt's screencasts at http://destroyallsoftware.com. They have been extremely influential for me. I've realized that Unix should be looked at as a tool when programming. He also has strong opinions on how to test properly; check this out: https://www.youtube.com/watch?v=tdNnN5yTIeM
No, the python interpreter itself takes about 0.03 seconds to start up on an i7. Awk and sed and friends are usually about 1/10th of that. So the python interpreter startup time dominates the cost for most simple tasks. It's not enough to notice when you're typing things by hand at the shell, but the difference can be painful if you're writing bash scripts. I use a similar python tool for command line work, but I always go for sed when I'm writing bash scripts that I plan to distribute.
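A rough way to see the difference yourself (illustrative commands; the numbers vary by machine):

    time python -c 'pass'      # dominated by interpreter startup
    time awk 'BEGIN { }'       # a BEGIN-only program exits without reading input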
This seems like a good thread to ask: are there more featureful languages that derive from Awk (edit: i.e., can work in a data driven mode) but don't diverge as much as Perl did in terms of syntax?
awk applies actions to lines matching patterns. The design seems concise and limited on purpose.
Not that you shouldn't want more features, but in the context that awk is typically used, what sort of features would you want to add? I confess I'm not able to understand what "can work in a data driven mode" might mean.
To answer your question a little more directly (and yet still be almost a non-answer), additional features can be found on the other side of the pipe into which you direct awk's output.
>awk applies actions to lines matching patterns. The design seems concise and limited on purpose.
That is what I meant by a "data driven mode".
>what sort of features would you want to add?
An extended awk could at least add new types of patterns, e.g., binary, stateful and nested patterns or formal grammars, regex patterns with match groups, etc., meaning you could correctly process "real" CSV files or log files with complex structure. I believe these features could be made to fit in with the design of the language, so that they all feel like part of the same language. It could also add new functions, such as ones for Unicode text normalization.
Perl 5 does much of this and Perl 6 even introduces grammars, but because of some of its design decisions, Perl is not a joy for me to use the way Awk is. Both of its versions are just plain too big.
You can use multiple -f file options, but I am not sure if you can include files to load a library from within awk itself, nor am I aware of any options for conditional linking. Those would be nice; they would make it easy to write awk libraries.
I made a programming language called TXR which has a "data driven mode", consciously based on the concept but not derived from awk in any way, and with very different syntax.
The data driven mode isn't based on applying patterns to records, but rather patterns to entire text streams.
Here is a TXR script I use that transforms the output of the Linux kernel "checkpatch.pl" script into a conventional compiler-like error report that Vim understands:
This is "pure TXR": there are no traces of the embedded programming language TXR Lisp. Stuff that isn't @ident or @(syntax) is literal text which is matched. So the block
@type: @message
#@code: FILE: @path:@lineno:
matches a two-line pattern in the checkpatch output, extracting a type, followed by a space and colon, then the message, then on the next line a code value preceded by a hash mark, followed by ": FILE: " and then a path delimited by a colon, and a line number terminated by a colon.
The following more complicated script scans diff output (usually the output of `git show -p` or `git diff`) and produces an "errors.err" file for Vim such that "vim -q" will navigate over the diffs as a quickfix list, and the messages have some information, like how many lines were added or removed at that point by the diff.
Here, there is an overall pattern matching logic for scanning the sections of a diff output: parsing out multiple diffs, and the "hunks" within each diff, with the line number info from the hunk header and such.
Some stateful Lisp logic calculates what is needed out of the extracted pieces and produces the output as it goes.
In "vim -q" you are taken to where the changes are, and in the ":cope" window you can see the additional lines that give the original text that was modified, if applicable. For instance, the first item navigates to the line "# $2 - source file(s)" and you know that one line was edited at that point, and the original text was " $2 - source file".
The best introduction to awk is its man page. It is such a concise language that you can find pretty much everything you want to learn about it in just a few pages.
The manual page isn't an introduction; it's a reference. It says what it is (the GNU Project's implementation of the AWK programming language), and goes on to talk about how the program is operated and how the language works, but it doesn't explain why or where you'd use this tool. That's fine for a reference manual, but it's a big omission for an introduction.
I use perl one-liners, thanks to the power of -n, -i and -p switches.
perl -n -e <expression> <file>
runs the expression on each and every line of <file> (which can be STDIN, of course).
perl -p -e <expression> <file>
runs the expression on every line and then prints the (possibly modified) line, like sed.
The -i option allows editing files in place. Add -i<extension> to create a <file>.<extension> backup, just in case.
Lastly, perl allows you to use q(string) instead of 'string' and qq(string) instead of "string", which avoids a lot of escaping when typing one-liners.
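For example (one possible illustration), instead of escaping inner double quotes:

    perl -e 'print "a \"quoted\" word\n"'

you can write:

    perl -e 'print qq(a "quoted" word\n)'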
This is probably the most concise introduction to awk I've read.
For more extensions[0] and advanced features like arrays of arrays and array sorting, there's also gawk. And for larger files there's the performance-driven mawk, which can drastically increase processing speed[1].
On CentOS, at least, gawk == awk, which falsifies "patterns cannot capture specific groups to make them available in the ACTIONS part of the code". E.g.: echo "abcdef" | gawk 'match($0, /b(.*)e/, a) { print a[1]; }'
You missed the match and non-match operators ~ and !~ in your list.
Finally, I find people better understand the flexibility of awk when they realise that awk '/bob/ { print }' is shorthand for awk '$0~/bob/ { print $0; }'. This makes it clear that the pattern element is not limited to regexes or to matching the whole line.
Awk is also used a lot in bioinformatics, where you need one-liners to extract or format data. Of course, Perl/Python/R/Ruby can also be used, but in some cases Awk is just simple and graceful.
I'm actually working on (about to finish) a free Splunk app that monitors HTTP traffic via Apache logs on WHM/cPanel-based hosting servers and visualizes traffic and activity trends and patterns between IP addresses and sites.
Awk would be an excellent tool to quickly explore and "debug" log content alongside the visual tool.
Additionally I think I'd want to utilize it for malware detection.
A lot of people pipe to sed, but you don't need to. You can do regex substitutions right in awk; see sub and gsub.
    sub(r, t, s)
           substitutes t for the first occurrence of the regular expression r in
           the string s. If s is not given, $0 is used.
    gsub   same as sub except that all occurrences of the regular expression are
           replaced; sub and gsub return the number of replacements.
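For example, a substitution that might otherwise be piped through sed:

    echo "foo bar foo" | awk '{ gsub(/foo/, "baz"); print }'
    # prints: baz bar baz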
BEGIN and END are somewhat like what goes above the first %% and below the second %% in a lex "source" file. Or maybe not. I need to review the documentation.
This brief AWK intro entices me to try making a similar one for (f)lex using the author's concise format as a model.
In any event, Pattern --> Action is common to both programs.
The %% markers in lex and yacc statically organize the file into different areas. BEGIN and END have run-time semantics: do these things before applying the pattern/actions to the inputs, and do these things afterward.
1. I never mentioned yacc. What relevance does it have to my comment? I typically use (f)lex without yacc/bison to do a similar job as I would use AWK for: text processing.
2. "statically organize the file into different areas"
One is a code generator and the other is a scripting language with an interpreter; is that what you mean? In effect, this difference means little to me (except for speed of execution): I store my (f)lex programs as source files that I feed to the (f)lex code generator, then I compile the generated C code. I store my AWK scripts as source files that I feed to the AWK interpreter. I use both flex and AWK to perform a similar task: text processing.
For whatever it is worth, I get better performance from my compiled flex scanners than from my interpreted AWK scripts. But I sometimes use them for the very same text processing jobs.
AWK:

    BEGIN { define variables }
    pattern-action rules
    END { stuff to do after EOF }

(f)lex:

    { definitions }      user variables
    %%
    { rules }            pattern-action rules
    %%
    { user routines }    stuff to do after EOF
From the blog:
"_BEGIN_, which matches only before any line has been input to the file. This is basically where you can _initiate variables_ and all other kinds of state in your script."
From the Lesk and Schmidt:
So far only the rules have been described. The user needs additional options, though, to _define variables_ for use in his program and for use by Lex. These can go ... in the _definitions_ section...
From the blog:
There is also END, which as you may have guessed, will match after the whole input has been handled. This lets you clean up or do some final output before exiting.
From Lesk and Schmidt:
Another Lex library routine that the user will sometimes want to redefine is yywrap() which is called whenever Lex reaches an end-of-file.
I regularly use yywrap in the "user routines" section.
It functions much the same way as commands I use in the END section of an AWK script.
I guess one can either focus on differences or similarities. I choose the latter.
I care little about the "intended purpose" of a program. I care more about what a program can actually do.
I know, but both lex and yacc use the %% division in similar ways; that is why I mentioned it.
Simply put, your "definitions" are not stuff that is done before pattern-action rules, and "user routines" are not stuff that is done after EOF. It's all just stuff that is declared. Both sections can contain code, and that code can be called out from the pattern rules. Either section could contain a main function that calls yylex. If the lexer is reentrant, it could be re-entered from any of those places. And so on. Fact is, the %% division has nothing to do with processing order, unlike BEGIN and END in Awk.
The %% division can be used to do exactly what BEGIN and END do, and that is how I use it. Moreover, as I recalled correctly, the Lesk and Schmidt paper specifically mentions such usage.
My comment is not referring to the internal behavior of the two programs (as yours is). And the Lesk and Schmidt paper is not setting down hard and fast rules; it is only making suggestions. My comment was about how the two programs can be used to do similar work, i.e., text processing.
If you do a lot of text processing work, at some point AWK is not fast enough. I have other programs I use and flex is one of them. Specifically, scanners (filters) produced with flex.
I don't disagree that you can put stuff that is done first above the first %%, and then stuff that is done after scanning after the second %%. I just don't think that this makes %% analogous to BEGIN and END. For one thing, stuff can be moved around from one of those sections to the other, without changing the basic organization of the program. For instance, prior to the first %% you can put prototype declarations, and move everything to the bottom.
The second comma in the second sentence has got to go. "on files, usually structured" <-- that one. There are some other ones too. Other than that this is really great!
If you view the Unix shell as a language unto itself, awk and sed are both higher-level functions in that language, married via the use of pipelines (concatenative programming without a stack).
Awk & Sed are labeled as follows in my mind:
- Awk: That thing I use when I need to grep for something over more than one line and/or do some basic transformations.
- Sed: That thing with the painful syntax for doing ridiculously complicated regex substitutions.
With that said, I find sed much more difficult to use than awk and generally try to avoid it if I can. I'm even prone to just opening the file in vim and executing the replace command through that rather than using sed.
I would use sed for transformations (simple or complex) where I only care about one transformation (even though you can do multiple): Does the current line match this pattern? Change it to this other thing and move on to the next line.
I would use awk when there are a handful or more of potential transformations: Does the current line match any of these multiple patterns? Do the action that's defined for each of the patterns, then move on to the next line.
If I need to do multiple transformations, and still want to use sed, I find it easiest to create a chain of single sed transformations, piped together. Somewhere in that area a shift to awk (or python) becomes justified.
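A small sketch of that trade-off (with made-up patterns):

    sed 's/alpha/beta/g' input.txt | sed 's/gamma/delta/g'

    awk '{ gsub(/alpha/, "beta"); gsub(/gamma/, "delta"); print }' input.txt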
I found sed great for messing around with SVN repository dumps. Delete a few lines here to remove the commit that added a directory... change a few paths to pretend the files in that directory had always been somewhere else... add a few lines somewhere else.
Sed scripts are a quick way to automate simple edits to large files.
Does noting that you don't have to use / as the separator help with the syntactical pain? I personally always use |, since I'm unlikely to be using that for a pattern. Example:
grep foo somefile | sed 's|/path/to/some/file|/new/path|g'
At least the awks built into OS X and Debian have sub and gsub (and gawk adds gensub).
    sub(r, t, s)
           substitutes t for the first occurrence of the regular expression r in
           the string s. If s is not given, $0 is used.
    gsub   same as sub except that all occurrences of the regular expression are
           replaced; sub and gsub return the number of replacements.
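gensub is worth knowing because, unlike sub and gsub, it returns the result instead of modifying the target, and it supports backreferences, e.g.:

    echo "John Smith" | gawk '{ print gensub(/([A-Za-z]+) ([A-Za-z]+)/, "\\2, \\1", 1) }'
    # prints: Smith, John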
I have been very happy using awk and have used it to generate real results. It is fast, and while it is a little clunky at first, since it is a smaller language it is quick to get up to speed on.