Back in ~2005, when I was still very new to Linux, I had this old Dell that I put Gentoo on. It was a 900MHz Intel Coppermine with a minuscule amount of memory, so things compiled very slowly. Being the wise Linux guru that I was back then, I decided that I would emerge (compile) all of X, fluxbox, OpenOffice, and probably 1 or 2 other things.
I didn't have distcc set up. It compiled for days.
One thing I really did enjoy was watching the cryptic messages crawl by, and one that always mystified me was during the configure stages:
checking for gawk... gawk
It seemed like a total nonsense word. To this day I still have fond memories every time someone mentions (g)awk.
Yes, easy reading. As a non-programmer, the only thing that lost me was the throwaway line in the code comments about "function arguments are call-by-value". I have some recollection that call-by-value and call-by-reference are options here, but I'm not entirely clear on what "call-by-value" means.
Am I right in saying that if I call a function on $2, say, then unless that variable is explicitly assigned in the function, $2 remains unchanged after the function operates? But that with call-by-reference the value of the variable itself will be altered? Something along those lines.
> Am I right in saying that if I call a function on $2, say, then unless that variable is explicitly assigned in the function, $2 remains unchanged after the function operates? But that with call-by-reference the value of the variable itself will be altered?
You've got it. At least that's what passing a variable by reference vs. by value means in other languages, such as PHP.
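In awk specifically, scalars are passed by value while arrays are passed by reference, so a minimal sketch of the difference might look like this:

    awk '
    function bump(x)   { x = x + 1 }        # scalar argument: the function gets a copy
    function fill(arr) { arr["k"] = "v" }   # array argument: the caller sees the change
    BEGIN {
        n = 5; bump(n); print n             # prints 5, unchanged
        fill(a); print a["k"]               # prints v
    }'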
General question: when would one choose to use an awk script over something more general purpose such as a python or ruby script? To me it would make sense to use the latter in most cases.
Good question. I've been working with awk for several years now, and here's how I feel about it.
AWK is old. 1977 old. Later versions that appeared (nawk and gawk are the most common) helped make it a smoother language, but it's still a pain. There are definite features you will be missing in an awk script:
- Any useful data structure slightly more complex than associative arrays. Try multidimensional arrays; it's actually fun to do. Once. (See the sketch after this list.)
- Any useful programming construct to help manage the complexity of scripts longer than a few hundred lines. No classes, no variable scoping, no namespaces in general. Not to mention an extremely permissive interpreter.
- Any easy way to deal with the environment (other than text). Try sending HTTP requests in awk; it can be a pain.
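For the curious, here is a minimal sketch of what a "multidimensional" array looks like in awk: it is really a flat associative array whose indices are joined with SUBSEP.

    awk 'BEGIN {
        grid[1,2] = "x"                 # actually stored as grid["1" SUBSEP "2"]
        for (key in grid) {
            split(key, idx, SUBSEP)     # recover the individual indices
            print idx[1], idx[2], grid[key]
        }
    }'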
In 2015, if you need to write a script, you should almost always prefer Python/Ruby over awk.
Now, if you're asking whether you should learn awk: it comes in handy. There are a lot of awk scripts in the wild, and you may need to read or edit them one day. Also, awk has a fun way of parsing input, which makes for very enjoyable one-liners. Learning awk (and some complementary utilities like sed or find) turned me into an "oh, it can be done in a quick one-liner" kind of guy. Definitely recommended.
So you get an ugly program that requires a third-party tool, rather than just a POSIX system. I would say that this solution then requires additional justification compared to the Awk no-brainer.
I use awk for monitoring scripts which must run on large-scale (>200-node) clusters, because:
1. It is lightweight -- there is no need to load shared libraries, modules, plugins, etc., and it uses less memory than the more general-purpose scripting languages do.
2. It comes with BusyBox, so if you have a BusyBox PXE boot environment you can make the AWK scripts work under it, and also work independently of any host OS that gets booted, as long as it's Unix-like. Perl, Python and Ruby are not in BusyBox, and their versions can vary a lot across different OS distributions and cluster configurations. AWK avoids this dependency because it's stable and self-contained in one executable file.
3. AWK automatically splits records into fields based on a settable FS, which makes it very easy to parse things like /proc/stat and /proc/loadavg into $1 $2 $3 ... fields. The code looks nicer and is more compact because of the automatic field variables.
4. The default RS is newline, and AWK is designed to perform actions on records. Most of the scripts I write perform an action upon receiving a newline on their standard input as a trigger. So most of the scripts tend to be of the "BEGIN { } { }" variety -- initialization in the BEGIN section, and then an unconditional action to fire at every newline, such as collecting /proc statistics and reporting them up the cluster hierarchy. AWK is naturally suited for this REPL behavior without needing any boilerplate code.
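A rough sketch of the shape described in points 3 and 4 (illustrative only, not one of the actual cluster scripts):

    # one shot: the automatic fields $1 $2 $3 are the 1-, 5- and 15-minute load averages
    awk '{ print "load:", $1, $2, $3 }' /proc/loadavg

    # trigger style: re-read /proc/loadavg and report it for every newline arriving on stdin
    awk 'BEGIN { f = "/proc/loadavg" }
         { getline load < f; close(f)
           split(load, a, " ")
           print "load1m", a[1] }'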
I personally use awk exclusively for one-liners either in an interactive shell or in a shell script. It's incredibly handy for taking a chunk of text and pulling out the bit you're interested in concisely.
Awk is optimized for working with the output of other Unix tools. It's great for writing filters for larger output, like "show me the third column of all lines in <million-line-long-logfile> that contain the word 'moose'". Of course you can do this in Python or Ruby, but it's very convenient in awk, since it's really designed for exactly that task.
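Assuming whitespace-separated columns, that filter is about as short as it gets:

    awk '/moose/ { print $3 }' million-line-long-logfile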
As others have noted, awk is great for one-liners. Just minutes ago, I wanted to find out what users and groups own all of the files within a certain deep directory tree (on Linux). I did this:
I also do this, but rather than parsing the output of ls, the right way would really be to remember the syntax of stat(1). That way you avoid any trouble with spaces and other such problems, and the command is not much longer:
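Something along these lines (I'm guessing at the exact format string, but GNU stat's %U and %G give the owner and group names) avoids parsing ls entirely:

    find . -exec stat -c '%U %G' {} + | sort | uniq -c | sort -rn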
Nice! I thought for a minute about using find, but I didn't know about those options to stat, and I just punted and used ls. I need to study the stat man page, apparently.
You kinda alluded to it, but I'm going to leave this here for reference. Here's a good explanation of why you should avoid parsing the output of `ls`: http://mywiki.wooledge.org/ParsingLs
> when would one choose to use an awk script over something more general purpose such as a python or ruby script?
You have almost answered your own question.
In situations where you are dealing with munging data which has a structure that is implicitly handled by awk (one or more files, consisting of regularly-delimited records, which break into regularly-delimited fields), it is very difficult to beat awk for succinctness.
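For instance, assuming a hypothetical access log where $1 is the client address and $10 is a byte count, summing traffic per client is a one-liner:

    awk '{ bytes[$1] += $10 } END { for (ip in bytes) print ip, bytes[ip] }' access.log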
The second thing is this: Awk has been around for decades and is part of the POSIX standard. A shell script that uses awk commands can be fully POSIX-compliant and work on any POSIX-like system with minimal changes, without the installation of third-party software.
So, with Awk you can do a lot of things that would otherwise require something like Python. In many cases you can do them with clearer code that has less clutter, and your solution is POSIX, to boot.
The downside of Awk is that it sacrifices reliability for succinctness. Awk does not detect undefined variables; any variable you mention becomes defined. It has loose arithmetic: you can increment a nonexistent variable or array element by one, and it behaves as if it had been defined as zero. Awk only has local variables in functions, and they are modeled as extra parameters (for which you could pass values, but you don't, unless perpetrating a hack).
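The extra-parameter convention looks like this (a minimal sketch; the extra whitespace before tmp is just a convention marking it as a local):

    awk '
    function sum(a, b,    tmp) {   # tmp is a "local": an argument nobody ever passes
        tmp = a + b
        return tmp
    }
    BEGIN { print sum(1, 2) }'     # prints 3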
Basically, whereas you can do actual software engineering in Python and Ruby, you'd be crazy to do it in Awk, even if the lack of libraries, data types and such isn't an impediment to doing it in Awk.
I only use it for one-liners or a-few-liners now, for unixy data-munging. (I'd still recommend the book by Aho, Kernighan, and Weinberger, because it's full of great examples.)
20-25 years ago I wrote many more things in Awk, up to a Lisp interpreter and a parser generator; but Python/Ruby/etc. have replaced it.
> when would one choose to use an awk script over something more general purpose such as a python or ruby script?
When the others are not around. Sometimes you need to work on weird old or stripped-down boxes that don't have python or even perl - but awk is (almost) always there. Like vi.
If the input is naturally arranged in columns, regardless of separation character, awk is optimal for parsing it, and not bad for actually processing it, either. This goes double if you only need certain lines of input, and you can concisely pick out those lines using a regular expression.
Piping things in and out of awk is really powerful and fast! Check out how Gary Bernhardt uses different Unix commands (including awk) to filter out dead links on his blog: http://vimeo.com/11202537
This is a cool video. I have been using Linux casually for a year and a half now, but haven't progressed too much past the desktop paradigm.
I am not uncomfortable in the terminal, but I feel more like a foreigner who's mastered a phrasebook than someone able to hold even a rudimentary conversation.
Do you (or does anyone else) have more examples similar to this, or resources that would be useful in moving toward this fluency in utilizing command line tools?
I would strongly urge you to purchase Gary Bernhardt's screencasts at http://destroyallsoftware.com. They have been extremely influential for me. I've realized that Unix should be looked at as a tool when programming. He also has strong opinions on how to test properly; check this out: https://www.youtube.com/watch?v=tdNnN5yTIeM
No, the python interpreter itself takes about 0.03 seconds to start up on an i7. Awk and sed and friends are usually about 1/10th of that. So the python interpreter startup time dominates the cost for most simple tasks. It's not enough to notice when you're typing things by hand at the shell, but the difference can be painful if you're writing bash scripts. I use a similar python tool for command line work, but I always go for sed when I'm writing bash scripts that I plan to distribute.
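A rough way to see the difference yourself (illustrative commands; the numbers vary by machine):

    time python -c 'pass'      # dominated by interpreter startup
    time awk 'BEGIN { }'       # a BEGIN-only program exits without reading input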
This seems like a good thread to ask: are there more featureful languages that derive from Awk (edit: i.e., can work in a data driven mode) but don't diverge as much as Perl did in terms of syntax?
awk applies actions to lines matching patterns. The design seems concise and limited on purpose.
Not that you shouldn't want more features, but in the context that awk is typically used, what sort of features would you want to add? I confess I'm not able to understand what "can work in a data driven mode" might mean.
To answer your question a little more directly (and yet still be almost a non-answer), additional features can be found on the other side of the pipe into which you direct awk's output.
>awk applies actions to lines matching patterns. The design seems concise and limited on purpose.
That is what I meant by a "data driven mode".
>what sort of features would you want to add?
An extended awk could at least add new types of patterns, e.g., binary, stateful and nested patterns or formal grammars, regex patterns with match groups, etc., meaning you could correctly process "real" CSV files or log files with complex structure. I believe these features could be made to fit in with the design of the language, so that they all feel like part of the same language. It could also add new functions, such as ones for Unicode text normalization.
Perl 5 does much of this and Perl 6 even introduces grammars, but because of some of its design decisions, Perl is not a joy for me to use the way Awk is. Both of its versions are just plain too big.
You can use multiple -f file options, but I am not sure if you can include files to load a library from within awk itself, nor am I aware of any options for conditional linking. Those would be nice; they would make it easy to write awk libraries.
I made a programming language called TXR which has a "data driven mode", consciously based on the concept but not derived from awk in any way, and with very different syntax.
The data driven mode isn't based on applying patterns to records, but rather patterns to entire text streams.
Here is a TXR script I use that transforms the output of the Linux kernel "checkpatch.pl" script into a conventional compiler-like error report that Vim understands:
This is "pure TXR": there are no traces of the embedded programming language TXR Lisp. Stuff that isn't @ident or @(syntax) is literal text which is matched. So the block
@type: @message
#@code: FILE: @path:@lineno:
matches a two-line pattern in the checkpatch output, extracting a type, followed by a space and colon, then the message, then on the next line a code value preceded by a hash mark, followed by ": FILE: " and then a path delimited by a colon, and a line number terminated by a colon.
The following more complicated script scans diff output (usually the output of `git show -p` or `git diff`) and produces an "errors.err" file for Vim such that "vim -q" will navigate over the diffs as a quickfix list, and the messages have some information, like how many lines were added or removed at that point by the diff.
Here, there is an overall pattern matching logic for scanning the sections of a diff output: parsing out multiple diffs, and the "hunks" within each diff, with the line number info from the hunk header and such.
Some stateful Lisp logic calculates what is needed out of the extracted pieces and produces the output as it goes.
In "vim -q" you are taken to where the changes are, and in the ":cope" window you can see the additional lines that give the original text that was modified, if applicable. For instance, the first item navigates to the line "# $2 - source file(s)" and you know that one line was edited at that point, and the original text was " $2 - source file".
The best introduction to awk is its man page. It is such a concise language that you can find pretty much everything you want to learn about it in just a few pages.
The manual page isn't an introduction; it's a reference. It says what it is (the GNU Project's implementation of the AWK programming language), and goes on to talk about how the program is operated and how the language works, but it doesn't explain why or where you'd use this tool. That's fine for a reference manual, but it's a big omission for an introduction.
I use perl one-liners, thanks to the power of -n, -i and -p switches.
perl -n -e <expression> <file>
runs the expression on each and every line of <file> (which can be STDIN, of course).
perl -p -e <expression> <file>
runs the expression on every line and then prints the (possibly modified) line, like sed.
The -i option allows editing files in place. Add -i<extension> to create a <file>.<extension> backup, just in case.
Lastly, perl allows you to use q(string) instead of 'string' and qq(string) instead of "string", which avoids a lot of escaping when typing one-liners.
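For example (one possible illustration), instead of escaping inner double quotes:

    perl -e 'print "a \"quoted\" word\n"'

you can write:

    perl -e 'print qq(a "quoted" word\n)'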
This is probably the most concise introduction to awk I've read.
For more extensions[0] and advanced features like arrays of arrays and array sorting, there's also gawk. And for larger files there's the performance-driven mawk, which can drastically increase processing speed[1].
On CentOS, at least, gawk == awk, which falsifies "patterns cannot capture specific groups to make them available in the ACTIONS part of the code". E.g.: echo "abcdef" | gawk 'match($0, /b(.*)e/, a) { print a[1]; }'
You missed the match and non-match operators ~ and !~ in your list.
Finally, I find people better understand the flexibility of awk when they realise that awk '/bob/ { print }' is shorthand for awk '$0~/bob/ { print $0; }'. This makes it clear that the pattern element is not limited to regexes or to matching the whole line.
Awk is also used a lot in bioinformatics, where you need one-liners to extract or format data. Of course, Perl/Python/R/Ruby can also be used, but in some cases Awk is just simple and graceful.
I'm actually working on (about to finish) a free Splunk app that monitors HTTP traffic via Apache logs on WHM/cPanel-based hosting servers and visualizes traffic and activity trends and patterns between IP addresses and sites.
Awk would be an excellent tool to quickly explore and "debug" log content alongside the visual tool.
Additionally I think I'd want to utilize it for malware detection.
A lot of people pipe to sed, but you don't need to. You can do regex substitutions right in awk; see sub and gsub.
    sub(r, t, s)
           substitutes t for the first occurrence of the regular expression r in
           the string s. If s is not given, $0 is used.
    gsub   same as sub except that all occurrences of the regular expression are
           replaced; sub and gsub return the number of replacements.
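For example, a substitution that might otherwise be piped through sed:

    echo "foo bar foo" | awk '{ gsub(/foo/, "baz"); print }'
    # prints: baz bar baz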
BEGIN and END are somewhat like what goes above the first %% and below the second %% in a lex "source" file. Or maybe not. I need to review the documentation.
This brief AWK intro entices me to try making a similar one for (f)lex using the author's concise format as a model.
In any event, Pattern --> Action is common to both programs.
The %% markers in lex and yacc statically organize the file into different areas. BEGIN and END have run-time semantics: do these things before applying the pattern/actions to the inputs, and do these things afterward.
1. I never mentioned yacc. What relevance does it have to my comment? I typically use (f)lex without yacc/bison to do a similar job as I would use AWK for: text processing.
2. "statically organize the file into different areas"
One is a code generator and the other is a scripting language with an interpreter; is that what you mean? In effect, this difference means little to me (except for speed of execution): I store my (f)lex programs as source files that I feed to the (f)lex code generator, then I compile the generated C code. I store my AWK scripts as source files that I feed to the AWK interpreter. I use both flex and AWK to perform a similar task: text processing.
For whatever it is worth, I get better performance from my compiled flex scanners than from my interpreted AWK scripts. But I sometimes use them for the very same text processing jobs.
AWK:

    BEGIN { define variables }
    pattern-action rules
    END { stuff to do after EOF }

(f)lex:

    { definitions }      user variables
    %%
    { rules }            pattern-action rules
    %%
    { user routines }    stuff to do after EOF
From the blog:
"_BEGIN_, which matches only before any line has been input to the file. This is basically where you can _initiate variables_ and all other kinds of state in your script."
From the Lesk and Schmidt:
So far only the rules have been described. The user needs additional options, though, to _define variables_ for use in his program and for use by Lex. These can go ... in the _definitions_ section...
From the blog:
There is also END, which as you may have guessed, will match after the whole input has been handled. This lets you clean up or do some final output before exiting.
From Lesk and Schmidt:
Another Lex library routine that the user will sometimes want to redefine is yywrap() which is called whenever Lex reaches an end-of-file.
I regularly use yywrap in the "user routines" section.
It functions much the same way as commands I use in the END section of an AWK script.
I guess one can either focus on differences or similarities. I choose the latter.
I care little about the "intended purpose" of a program. I care more about what a program can actually do.
I know, but both lex and yacc use the %% division in similar ways; that is why I mentioned it.
Simply put, your "definitions" are not stuff that is done before pattern-action rules, and "user routines" are not stuff that is done after EOF. It's all just stuff that is declared. Both sections can contain code, and that code can be called out from the pattern rules. Either section could contain a main function that calls yylex. If the lexer is reentrant, it could be re-entered from any of those places. And so on. Fact is, the %% division has nothing to do with processing order, unlike BEGIN and END in Awk.
The %% division can be used to do exactly what BEGIN and END do, and that is how I use it. Moreover, as I recalled correctly, the Lesk and Schmidt paper specifically mentions such usage.
My comment is not referring to the internal behavior of the two programs (as yours is). And the Lesk and Schmidt paper is not setting down hard and fast rules; it is only making suggestions. My comment was about how the two programs can be used to do similar work, i.e., text processing.
If you do a lot of text processing work, at some point AWK is not fast enough. I have other programs I use and flex is one of them. Specifically, scanners (filters) produced with flex.
I don't disagree that you can put stuff that is done first above the first %%, and then stuff that is done after scanning after the second %%. I just don't think that this makes %% analogous to BEGIN and END. For one thing, stuff can be moved around from one of those sections to the other, without changing the basic organization of the program. For instance, prior to the first %% you can put prototype declarations, and move everything to the bottom.
The second comma in the second sentence has got to go. "on files, usually structured" <-- that one. There are some other ones too. Other than that this is really great!
If you view the Unix shell as a language unto itself, awk and sed are both higher-level functions in that language, married via the use of pipelines (concatenative programming without a stack).
Awk & Sed are labeled as follows in my mind:
- Awk: That thing I use when I need to grep for something over more than one line and/or do some basic transformations.
- Sed: That thing with the painful syntax for doing ridiculously complicated regex substitutions.
With that said, I find sed much more difficult to use than awk and generally try to avoid it if I can. I'm even prone to just opening the file in vim and executing the replace command through that rather than using sed.
I would use sed for transformations (simple or complex) where I only care about one transformation (even though you can do multiple): Does the current line match this pattern? Change it to this other thing and move on to the next line.
I would use awk when there are a handful or more of potential transformations: Does the current line match any of these multiple patterns? Do the action that's defined for each of the patterns, then move on to the next line.
If I need to do multiple transformations, and still want to use sed, I find it easiest to create a chain of single sed transformations, piped together. Somewhere in that area a shift to awk (or python) becomes justified.
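A small sketch of that trade-off (with made-up patterns):

    sed 's/alpha/beta/g' input.txt | sed 's/gamma/delta/g'

    awk '{ gsub(/alpha/, "beta"); gsub(/gamma/, "delta"); print }' input.txt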
I found sed great for messing around with SVN repository dumps. Delete a few lines here to remove the commit that added a directory... change a few paths to pretend the files in that directory had always been somewhere else... add a few lines somewhere else.
Sed scripts are a quick way to automate simple edits to large files.
Does noting that you don't have to use / as the separator help with the syntactical pain? I personally always use |, since I'm unlikely to be using that for a pattern. Example:
grep foo somefile | sed 's|/path/to/some/file|/new/path|g'
At least the awks built into OS X and Debian have sub and gsub (and gawk adds gensub).
    sub(r, t, s)
           substitutes t for the first occurrence of the regular expression r in
           the string s. If s is not given, $0 is used.
    gsub   same as sub except that all occurrences of the regular expression are
           replaced; sub and gsub return the number of replacements.
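gensub is worth knowing because, unlike sub and gsub, it returns the result instead of modifying the target, and it supports backreferences, e.g.:

    echo "John Smith" | gawk '{ print gensub(/([A-Za-z]+) ([A-Za-z]+)/, "\\2, \\1", 1) }'
    # prints: Smith, John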
I have been very happy using awk and have used it to generate real results. It is fast, and while it is a little clunky at first, since it is a smaller language it is quick to get up to speed on.