This article is written for Linux administrators. It teaches you how to create pipelines in a terminal using the `sed` and `awk` commands. Combining these commands allows you to filter and analyze data, troubleshoot log files, and streamline your day-to-day workflow.
`sed` and `awk` are essential for filtering and transforming text data. `awk` works well with columns, and `sed` excels at search-and-replace. The power of these tools lies in combining them into a pipeline, which will be the focus of this tutorial.
To complete this tutorial, you will need:
- Experience operating a Linux terminal. DigitalOcean's A Linux Command Line Primer is a great place to start.
- Knowledge of regular expressions and how to interpret and create them. Read An Introduction to Regular Expressions to learn more.
- Experience using common command line tools like `cut`, `head`, and so on. Check out Sed Stream Editor to Manipulate Text in Linux and How To Use the AWK language to Manipulate Text in Linux.
Let's walk through a basic example of filtering specific data from a file with `awk` and then formatting it for display with `sed`. You will use a pipeline to extract and then print the product names and prices for products with a price greater than ten dollars.
First, create a `products.txt` file in vim using the following command:
vim products.txt
Note: You don't have to use vim; you can use whichever editor works best for you.
Fill the file with the following contents:
123:T-Shirt:19.99
456:Coffee Mug:8.50
789:Headphones:49.95
Here is the full pipeline you are going to construct:
awk -F: '$3 > 10 {print $2 "\t" $3}' products.txt | sed '1iName\tPrice'
Primer: `awk`
Let's brush up on `awk`. `awk` uses the syntax `condition { action }`. Here is an example of an `awk` script:
/^int/ { print "Found an integer." }
- condition: `/^int/`
- { action }: `print "Found an integer."`
Here is how it works: For every line beginning with "int", `awk` prints the message, "Found an integer."
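You can try this yourself. Here is a quick sketch (the three sample lines are made up for illustration) that feeds some text into the script:
printf 'int count = 0\nchar name[20]\nint total = 5\n' | awk '/^int/ { print "Found an integer." }'
Two of the three lines begin with "int", so `awk` prints the message twice:
Found an integer.
Found an integer.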
Now, let's break down each part of this pipeline. Here is the `awk` portion:
awk -F: '$3 > 10 {print $2 "\t" $3}'
Here is how it works:
- `awk` matches lines where the condition holds, that is, where the price (`$3`) is greater than `10`; the action prints the product name (`$2`), followed by a tab, followed by the price.
- The `-F:` argument sets the field delimiter to `:`.
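If you want to see the intermediate result, run the `awk` portion on its own:
awk -F: '$3 > 10 {print $2 "\t" $3}' products.txt
This prints the two matching products without a header:
T-Shirt	19.99
Headphones	49.95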
Let's look at the `sed` portion of our pipeline:
sed '1iName\tPrice'
Here is how it works:
- The `1i` command inserts a new first line, consisting of "Name", a tab, and "Price", before the rest of the output.
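Note: The one-line `1iName\tPrice` form, including the `\t` escape, is a GNU `sed` extension. If you are on a system with BSD `sed` (macOS, for example), one portable alternative is to print the header yourself and skip `sed` entirely:
{ printf 'Name\tPrice\n'; awk -F: '$3 > 10 {print $2 "\t" $3}' products.txt; }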
Below is the full pipeline. Run it:
awk -F: '$3 > 10 {print $2 "\t" $3}' products.txt | sed '1iName\tPrice'
Here is the resulting output:
Name Price
T-Shirt 19.99
Headphones 49.95
Straightforward enough, right?
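As a quick variation, the formatting can live entirely in `awk`. For example, to prefix each price with a dollar sign:
awk -F: '$3 > 10 {print $2 "\t$" $3}' products.txt | sed '1iName\tPrice'
This produces:
Name	Price
T-Shirt	$19.99
Headphones	$49.95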
In this section we will create some more complex filters and transformations using `sed`, `awk`, and some other commands. As we walk through each example pipeline, go slow, be patient, run every command, observe the output, and make sure you grasp what's happening.
Let's create a pipeline that analyzes process information generated by the `ps` command. As a system administrator, it behooves you to monitor resource usage per user, allowing you to discover users who are consuming excessive memory, CPU, and so on.
Here is the full pipeline you will construct, which filters resource usage by user:
# Get process information
ps -eo pid,user,rss |
# Filter by specific user and format output
awk '$2 == "root" {print $1, "RSS:", $3/1024, "MB"}' |
# Sort by memory usage
sed '1iPID RSS(MB)' | sort -nrk 2
Begin this pipeline by generating process information using `ps`. Here is the first part of the pipeline:
ps -eo pid,user,rss
Here is how it works:
- Displays all processes using the `-e` argument.
- Using the `-o` argument, displays the `pid`, `user`, and `rss` columns.
Running this command produces the following output:
PID USER RSS
1 codespa+ 640
7 codespa+ 1792
42 root 3480
322 codespa+ 1408
355 root 1664
509 codespa+ 1536
518 codespa+ 131588
560 codespa+ 54792
981 codespa+ 62928
Now you have your fields of interest: PID, USER, and RSS.
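Aside: on Linux systems with procps-ng, `ps` can also sort its own output, so you could view the biggest memory consumers directly:
ps -eo pid,user,rss --sort=-rss | head -5
This tutorial builds the filtering and sorting into the pipeline instead, since combining small tools is the skill being practiced.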
Let's move on to the next portion of our pipeline, which uses `awk` to filter lines containing the `root` user and calculate memory usage in megabytes:
awk '$2 == "root" {print $1, "RSS:", $3/1024, "MB"}'
Here is how it works:
- The condition `$2 == "root"` selects lines where the USER field is equal to "root".
- The action `{print $1, "RSS:", $3/1024, "MB"}` displays output using the following format: `[PID] RSS: [RSS/1024] MB`
Note: Dividing the RSS value by 1024 converts kilobytes to megabytes, demonstrating how `awk` can perform calculations.
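If you would prefer rounded values, `awk`'s `printf` can format the result. A variation you could try (the tutorial keeps the plain `print`):
ps -eo pid,user,rss | awk '$2 == "root" {printf "%s RSS: %.2f MB\n", $1, $3/1024}'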
Below is our updated pipeline. Run it.
ps -eo pid,user,rss | awk '$2 == "root" {print $1, "RSS:", $3/1024, "MB"}'
You should see the following output:
1 RSS: 0.191406 MB
7 RSS: 0.148438 MB
8 RSS: 94.7695 MB
212 RSS: 7.16406 MB
821 RSS: 5.85938 MB
1883 RSS: 1.55469 MB
1884 RSS: 2.91016 MB
Let's add some commands to our pipeline to sort the output by memory usage and label the columns. Note that from here on the "RSS:" label is dropped from the `awk` action, so that the second column is purely numeric and can be sorted. Here are the `sort` and `sed` portions of the pipeline:
sort -nrk 2 | sed '1iPID RSS(MB)'
Here is how it works:
- `sort -nrk 2` sorts the text numerically (`-n`), reverses the result (`-r`), and sorts by the second column (`-k 2`), which effectively orders the output by memory usage.
- `'1iPID RSS(MB)'` uses the `i` command to insert the column heading "PID RSS(MB)" as a new first line. The sort runs before the header is inserted so that the header stays on top instead of being sorted in with the data.
Note: Sorting by memory usage (column 2) in descending order helps identify resource-intensive processes efficiently.
Here is the pipeline thus far. Go ahead and run it:
ps -eo pid,user,rss | awk '$2 == "root" {print $1, $3/1024, "MB"}' | sort -nrk 2 | sed '1iPID RSS(MB)'
This should produce output similar to the following (PIDs and values will differ on your system):
PID RSS(MB)
8 95.1914 MB
212 7.16406 MB
821 5.88672 MB
2006 3.14453 MB
2007 2.92969 MB
2009 1.09375 MB
2008 1.08594 MB
1 0.191406 MB
7 0.148438 MB
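If the report is long, you can cap it at the top consumers by appending `head`. For example, keeping the header plus the five largest processes:
ps -eo pid,user,rss | awk '$2 == "root" {print $1, $3/1024, "MB"}' | sort -nrk 2 | sed '1iPID RSS(MB)' | head -6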
Let's create a pipeline that analyzes an authentication log, searching for and then counting failed login attempts. Paying attention to events like this allows you to protect your system and respond to potential threats. Here is the full pipeline we'll build:
grep "Failed password" auth.log | sed 's/invalid user //' | awk '{print $9}' | sort | uniq -c | sort -nr
Here is how it works:
grep "Failed password"
: Filters the lines that contain “Failed password” from the authentication log.sed 's/invalid user //'
: Removes the “invalid user” part from the lines, if present.awk '{print $9}'
: Prints the ninth field, which is typically the username.sort
: Sorts the usernames alphabetically.uniq -c
: Counts the occurrences of each username.sort -nr
: Sorts the counts in descending order.
Let's walk through it.
Create a file named `auth.log` using the following command:
vim auth.log
Fill the file with the following contents:
Feb 10 15:45:09 ubuntu-lts sshd[47341]: Failed password for tedbell from 103.106.189.143 port 60824 ssh2
Feb 10 15:45:11 ubuntu-lts sshd[47341]: Connection closed by authenticating user root 103.106.189.143 port 60824 [preauth]
Feb 10 15:45:11 ubuntu-lts sshd[47339]: Failed password for root from 180.101.88.228 port 11349 ssh2
Feb 10 15:45:12 ubuntu-lts sshd[47343]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh user= rhost=103.106.189.143 user=root
Feb 10 15:45:14 ubuntu-lts sshd[47339]: Failed password for rhomboidgoatcabin from 180.101.88.228 port 11349 ssh2
Feb 10 15:45:14 ubuntu-lts sshd[47343]: Failed password for root from 103.106.189.143 port 33990 ssh2
Feb 10 15:45:16 ubuntu-lts sshd[47343]: Connection closed by authenticating user root 103.106.189.143 port 33990 [preauth]
Feb 10 15:45:16 ubuntu-lts sshd[47339]: Received disconnect from 180.101.88.228 port 11349:11: [preauth]
Feb 10 15:45:16 ubuntu-lts sshd[47339]: Disconnected from authenticating user root 180.101.88.228 port 11349 [preauth]
Feb 10 15:45:16 ubuntu-lts sshd[47339]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=180.101.88.228
Below is the first part of the pipeline, which uses `grep` to filter lines containing "Failed password". Run it:
grep "Failed password" auth.log
You should see the following output:
Feb 10 15:45:09 ubuntu-lts sshd[47341]: Failed password for tedbell from 103.106.189.143 port 60824 ssh2
Feb 10 15:45:11 ubuntu-lts sshd[47339]: Failed password for root from 180.101.88.228 port 11349 ssh2
Feb 10 15:45:14 ubuntu-lts sshd[47339]: Failed password for rhomboidgoatcabin from 180.101.88.228 port 11349 ssh2
Feb 10 15:45:14 ubuntu-lts sshd[47343]: Failed password for root from 103.106.189.143 port 33990 ssh2
You now have all the failed password entries.
Update the pipeline by adding a `sed` command, which removes any "invalid user" text:
grep "Failed password" auth.log | sed 's/invalid user //'
Running this pipeline should produce the following output:
Feb 10 15:45:09 ubuntu-lts sshd[47341]: Failed password for tedbell from 103.106.189.143 port 60824 ssh2
Feb 10 15:45:11 ubuntu-lts sshd[47339]: Failed password for root from 180.101.88.228 port 11349 ssh2
Feb 10 15:45:14 ubuntu-lts sshd[47339]: Failed password for rhomboidgoatcabin from 180.101.88.228 port 11349 ssh2
Feb 10 15:45:14 ubuntu-lts sshd[47343]: Failed password for root from 103.106.189.143 port 33990 ssh2
Update the pipeline by adding an `awk` command to print the username field (`$9`):
grep "Failed password" auth.log | sed 's/invalid user //' | awk '{print $9}'
Running this pipeline should produce the following output:
tedbell
root
rhomboidgoatcabin
root
You are making progress! Now you have all the usernames.
Update the pipeline by adding the following `sort` command to sort the usernames alphabetically:
grep "Failed password" auth.log | sed 's/invalid user //' | awk '{print $9}' | sort
Running this pipeline should produce the following output:
rhomboidgoatcabin
root
root
tedbell
Now you have an alphabetical list of usernames.
Update the pipeline by adding a `uniq` command. Using `uniq` with the `-c` argument counts the occurrences of each username. This is why the previous `sort` matters: `uniq` only collapses adjacent duplicate lines, so the usernames must be sorted first.
grep "Failed password" auth.log | sed 's/invalid user //' | awk '{print $9}' | sort | uniq -c
Running this pipeline should produce the following output:
1 rhomboidgoatcabin
2 root
1 tedbell
Now you have a user count.
Finally, update the pipeline by adding another `sort` command. Using the `-nr` arguments, `sort` orders the output by username count in descending order:
grep "Failed password" auth.log | sed 's/invalid user //' | awk '{print $9}' | sort | uniq -c | sort -nr
Running the full pipeline should produce the following output:
2 root
1 rhomboidgoatcabin
1 tedbell
All done!
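As an aside, `awk` can handle the stripping and counting by itself, using `sub()` to delete the "invalid user" text and an associative array to tally usernames. A sketch that should produce the same counts (formatting aside):
grep "Failed password" auth.log | awk '{sub(/invalid user /, ""); count[$9]++} END {for (user in count) print count[user], user}' | sort -nr
The END block runs after all lines have been read, printing each username alongside its total.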
Let's construct a pipeline that finds the top disk space-consuming directories and sorts them in descending order. Monitoring disk usage is important: a full disk can take services down, and catching heavy directories early ensures a smoother experience for users. Here is the complete pipeline you will construct:
cat disk_usage.log | awk '{print $2, $1}' | sed 's/G/ GB/' | sort -k2 -nr
Here's how it works:
- `cat disk_usage.log`: Outputs the content of `disk_usage.log`.
- `awk '{print $2, $1}'`: Swaps the columns so that the directory path comes first.
- `sed 's/G/ GB/'`: Replaces the unit 'G' with ' GB', adding a space before the unit.
- `sort -k2 -nr`: Sorts the output based on the second column (disk space) in descending numerical order.
Begin by creating an input file called `disk_usage.log`, and fill it with the following content:
2.4G /usr/local/bin
5.7G /home/user
1.2G /tmp
9.8G /var/log
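In practice, a file like this might come from the `du` command, which reports sizes in the same human-readable format. For example, something like the following (the values on your system will differ):
du -sh /usr/local/bin /home/user /tmp /var/log
For this tutorial, the hand-made file keeps the output predictable.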
Begin the pipeline by using the `cat` command to send the contents of the disk usage file to standard output (the screen):
cat disk_usage.log
Update the pipeline by adding an `awk` command to rearrange the order of the columns, displaying the directory path first:
cat disk_usage.log | awk '{print $2, $1}'
Running this pipeline should produce the following output:
/usr/local/bin 2.4G
/home/user 5.7G
/tmp 1.2G
/var/log 9.8G
Update the pipeline by adding a `sed` command that adds a space before the unit 'G' and expands it to 'GB':
cat disk_usage.log | awk '{print $2, $1}' | sed 's/G/ GB/'
Running this pipeline should produce the following output:
/usr/local/bin 2.4 GB
/home/user 5.7 GB
/tmp 1.2 GB
/var/log 9.8 GB
Update the pipeline by adding a `sort` command to sort the output based on the second column (disk space) in descending numerical order:
cat disk_usage.log | awk '{print $2, $1}' | sed 's/G/ GB/' | sort -k2 -nr
Here is an explanation of the `sort` options used above:
- `-k2`: Specifies the column (field) to use for sorting. In this case, `2` indicates the second column.
- `-n`: Tells `sort` to perform a numeric sort on the specified column (the second column in this case).
- `-r`: Reverses the sorting order, so it sorts in descending order instead of the default ascending order.
Running this pipeline produces output sorted by disk space:
/var/log 9.8 GB
/home/user 5.7 GB
/usr/local/bin 2.4 GB
/tmp 1.2 GB
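As a side note, GNU `sort` has an `-h` option that understands human-readable sizes such as `9.8G` directly, so a similar report can be produced without converting the units first. A sketch, assuming GNU coreutils:
sort -hr disk_usage.log | awk '{print $2, $1}'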
Nice work, friend!
In this tutorial you have learned how to create sophisticated pipelines using `sed`, `awk`, and other commands. Now you are ready to start experimenting, creating your own pipelines to solve day-to-day system administration problems.
I hope this tutorial helped. Thanks for reading!