Getting started with awk
If you’ve been processing files on Linux, extracting specific data from particular columns based on a certain condition or pattern, then you must have come across the awk utility, as it gets the job done easily.
In this article, I’ll show you how to use awk through several examples, from the basics to slightly more complex use cases. So let’s get started!
Let’s consider the following file that we will be using throughout this article:
Wonderland exists | just a myth 1
Greenland exists | not that green though 2
Iceland exists | not that icy though 3
Norway exists | beautiful country 4
Now let’s see the general syntax of awk:
awk 'BEGIN{}{}END{}' path/to/file
The curly brackets after BEGIN hold code that runs once, before any input is read, and are generally used for initializing variables. The curly brackets in the middle act like a loop over the entire file: the code inside runs once for every line, so if we want to read the file and print its content, that’s where the code goes. The curly brackets after END hold the actions to run once we’re done reading the file. The BEGIN and END parts are optional, so we can skip them and just use:
awk '{}' path/to/file
Now, let’s try for instance to read the above file using awk:
awk 'BEGIN{print "awk begins here"}{print $0}END{print "Awk finished reading file"}' filename
The output will be as follows:
awk begins here
Wonderland exists | just a myth 1
Greenland exists | not that green though 2
Iceland exists | not that icy though 3
Norway exists | beautiful country 4
Awk finished reading file
You must be wondering what $0 is. In awk, it simply means the entire current line, so when we write “print $0”, awk outputs the entire line.
Now you know how to read a file with awk and print its content, and you know what $0 means. That’s great! Now let’s talk a bit about some of awk’s other built-in variables.
So, $0 stands for “an entire line”. But what if we don’t care about the entire line and just want a specific column? Or what if we want to know the number of lines and columns in a given file? Let’s find out how :)
NF: the number of fields (columns) in the current line
NR: the number of the current record (row), counted from 1
Now let’s print the line number and the number of columns for each line of our file:
awk '{print "Number of the current line is "NR",""Number of columns of the current line is "NF;}' filename
The output should be:
Number of the current line is 1,Number of columns of the current line is 7
Number of the current line is 2,Number of columns of the current line is 8
Number of the current line is 3,Number of columns of the current line is 8
Number of the current line is 4,Number of columns of the current line is 6
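The example above prints NR and NF for every line. If you only want totals, one possible approach is to read NR inside the END block, where it still holds the number of the last line read. The sketch below assumes the sample file is recreated in a temporary path (the shell variables sample, total and widest are names I made up for this illustration):

```shell
# Recreate the article's sample file in a temporary location (the path is arbitrary).
sample=$(mktemp)
cat > "$sample" <<'EOF'
Wonderland exists | just a myth 1
Greenland exists | not that green though 2
Iceland exists | not that icy though 3
Norway exists | beautiful country 4
EOF

# In the END block, NR holds the number of the last line read,
# which is the total number of lines in the file.
total=$(awk 'END{print "Total number of lines is " NR}' "$sample")
echo "$total"

# Remember the largest NF seen so far, then report it at the end.
widest=$(awk 'NF > max {max = NF} END{print "Widest line has " max " columns"}' "$sample")
echo "$widest"
```

Note that the second command puts a condition (NF > max) in front of the curly brackets, so the action only runs on lines where the condition is true.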
One question should come to mind: how are columns defined? As you may have noticed, in this example columns were split on whitespace, which is the default unless we define our own separator. Let’s see in the example below how to use separators with awk.
In our file above, besides the spaces between columns, there is also another character: “|”. We’ll use it as the separator, passing it to awk with the -F option:
awk -F"|" '{print "Number of the current line is "NR",""Number of columns of the current line is "NF;}' filename
The output should be:
Number of the current line is 1,Number of columns of the current line is 2
Number of the current line is 2,Number of columns of the current line is 2
Number of the current line is 3,Number of columns of the current line is 2
Number of the current line is 4,Number of columns of the current line is 2
Now the number of columns has changed! Separators are very helpful: they let you split each line into exactly the columns you care about.
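Counting columns is only half of the story: once a separator is set, $1 and $2 refer to the text on each side of it. Here is a small sketch of that idea, assuming the sample file is recreated in a temporary path. It uses " [|] " (space, pipe, space) as the separator so that the extracted columns carry no stray spaces; that choice, and the shell variable names, are mine rather than something from the examples above.

```shell
# Recreate the article's sample file in a temporary location (the path is arbitrary).
sample=$(mktemp)
cat > "$sample" <<'EOF'
Wonderland exists | just a myth 1
Greenland exists | not that green though 2
Iceland exists | not that icy though 3
Norway exists | beautiful country 4
EOF

# A separator longer than one character is treated as a regex.
# [|] matches a literal pipe, so " [|] " means: space, pipe, space.
right=$(awk -F' [|] ' '{print $2}' "$sample")
echo "$right"

# The same separator can also be set inside the program through the FS variable.
left=$(awk 'BEGIN{FS=" [|] "}{print $1}' "$sample")
echo "$left"
```

The first command prints everything to the right of the pipe, and the second prints everything to its left.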
Now that you’re familiar with reading files with awk and dealing with their columns, let’s jump into another interesting feature of awk, which allows us to select lines based on a pattern or regex that we specify. Take a look at the example below.
In the file from the very beginning of this article, there are these words: Wonderland, Greenland, Iceland. They all share “land”, which the first word of the last line (Norway) does not. So, in our example, we’ll look up the lines that contain the “land” pattern:
# Print only column 1 of the lines that contain 'land', instead of printing the entire line
awk '/land/{print $1}' filename
And here’s the output:
Wonderland
Greenland
Iceland
So as you guessed, $0 stands for the entire line, $1 stands for column 1, $2 stands for column 2, and so on and so forth.
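A pattern does not have to be a regex on the whole line. As a rough sketch, the commands below select lines with a numeric condition on a column and with a regex applied to a single column. The sample file is recreated in a temporary path, and the shell variable names are just for illustration.

```shell
# Recreate the article's sample file in a temporary location (the path is arbitrary).
sample=$(mktemp)
cat > "$sample" <<'EOF'
Wonderland exists | just a myth 1
Greenland exists | not that green though 2
Iceland exists | not that icy though 3
Norway exists | beautiful country 4
EOF

# Numeric condition: print column 1 of the lines whose last column is greater than 2.
big=$(awk '$NF > 2 {print $1}' "$sample")
echo "$big"

# Regex on one column only: print column 1 when it does NOT end in "land".
other=$(awk '$1 !~ /land$/ {print $1}' "$sample")
echo "$other"
```

The ~ and !~ operators test a single field against a regex, whereas a bare /land/ tests the entire line.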
Great! Now you’re familiar with the basics of awk!
As you know, awk is just like any other programming language: it has conditions, loops, and user-defined functions. In the example below, we’ll see how to define a function in awk and then use it.
awkfun='function print_sum(param){print "Sum is "param}'
awk -v sum=0 "$awkfun"'{sum+=$NF;print_sum(sum) }' filename
In the example above, we passed two extra things to awk. The first is the “-v” option, which declares the “sum” variable and sets it to 0. The second is “$awkfun”, a shell variable that holds the function definition; putting it right before the program text kind of imports the function so that we can call it in awk. Please note that there is no space between "$awkfun" and '{}', because the shell has to join them into a single program argument. Inside the loop, we add up the number in the last column of each line ($NF), and then we pass the running sum as a parameter to our function print_sum, which outputs the result. Here’s the output:
Sum is 1
Sum is 3
Sum is 6
Sum is 10
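Functions in awk can also return a value, and the definition can live directly inside the program text instead of a shell variable. The following is only a sketch of that variant: the function name add and the shell variable names are hypothetical, and the sample file is recreated in a temporary path.

```shell
# Recreate the article's sample file in a temporary location (the path is arbitrary).
sample=$(mktemp)
cat > "$sample" <<'EOF'
Wonderland exists | just a myth 1
Greenland exists | not that green though 2
Iceland exists | not that icy though 3
Norway exists | beautiful country 4
EOF

# add() is a made-up helper that returns the sum of its two arguments.
# The running total is printed once, in the END block.
result=$(awk '
function add(a, b) { return a + b }
{ sum = add(sum, $NF) }
END { print "Sum is " sum }
' "$sample")
echo "$result"
```

Here sum does not need -v, because an awk variable that has never been assigned counts as 0 in arithmetic.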
One more thing I’d love to add to this article: sometimes in awk you need to run shell commands. I’ll show you how to do that in this last example:
awk '{system("pwd")}' filename
The output will be the working directory where you are now. Since the code inside the curly brackets runs for every line, it is printed once per line of the file, so four times for our file. Just use the system function and write inside it the shell command you’d love to execute.
Congratulations! Now you’ve learned how to get specific information from specific columns using separators, patterns, and regexes. You’ve also learned how to create and call a function in awk, and how to execute shell commands from awk. Now you’re ready to go! I hope you enjoyed this article. If you have any questions or feedback, I’d love to see them in the comments.
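If the command should run only once rather than for every line, one option is to call system from the BEGIN block. system also returns the exit status of the command, which you can store and check. A minimal sketch (the names rc and out are my own):

```shell
# Run pwd once, before any input is read, and keep its exit status in rc.
# With only a BEGIN block, awk does not need an input file.
out=$(awk 'BEGIN{rc = system("pwd"); print "exit status: " rc}')
echo "$out"
```

The first line of the output is the working directory printed by pwd, followed by the status line printed by awk. Be careful when building the command string from file content, since whatever ends up in the string is handed to the shell as is.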
Happy reading :)