Episode #28 - CLI Monday: cat, grep, awk, sort, and uniq

About Episode - Duration: 9 minutes, Published: 2014-07-14

In this episode, I would like to show you a series of commands for reviewing log data, how to extract event counts from that data, and finally how to create sparklines from those counts. This can be extremely useful for pointing you in the right direction while troubleshooting issues.

Links, Code, and Transcript


Before we jump into the demo, we first need a log file to review. Back in episode #26, we looked at downloading an example web server log file from NASA. Today, we are going to use that same NASA web server log, and look at how to extract useful count data.

Next, we need a tool to generate sparklines. This is not required, but they look pretty neat. Sparklines are basically really small charts. Personally, my tool of choice is called spark, developed by Zach Holman, and you can find it on his GitHub page. The usage examples down here illustrate how this tool works: you just provide a list of numbers, and the spark script outputs a nice little graph on the command line.
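To make the idea of a sparkline concrete, here is a rough stand-in written with awk. This is not spark's own code or algorithm, just a minimal sketch of the same idea: scale each number between the smallest and largest value, then print one of eight block characters for it. The function name sparkline is made up for this example, and unlike spark it only reads one number per line from standard input.

```shell
# sparkline: hypothetical stand-in for the idea behind spark (not spark itself).
# Reads one number per line on stdin and prints one block character per number.
sparkline() {
  awk '
    {
      v[NR] = $1 + 0                              # force a numeric value
      if (NR == 1 || v[NR] > max) max = v[NR]     # track the largest value
      if (NR == 1 || v[NR] < min) min = v[NR]     # track the smallest value
    }
    END {
      n = split("▁ ▂ ▃ ▄ ▅ ▆ ▇ █", tick, " ")     # eight bar heights
      for (i = 1; i <= NR; i++) {
        # Map min..max onto tick[1]..tick[n]; a flat series uses the lowest bar.
        idx = (max == min) ? 1 : int((v[i] - min) * (n - 1) / (max - min)) + 1
        printf "%s", tick[idx]
      }
      print ""
    }'
}

printf '1\n8\n' | sparkline
```

The smallest value always gets the shortest bar and the largest gets the tallest, so the graph shows relative size rather than absolute numbers.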

I should mention that you can find links to both the NASA and spark pages in the episode notes below. Okay, so here we have our two files sitting in a directory: the web server log and the spark script.

Actually, before we jump into looking at the commands and log data, let me explain what I am trying to show you through a couple of illustrations. Let's say, for example, that the log data looks like this, with IP address, date, and URL fields, and then you have many rows filled with data.

Say, for example, that you wanted to get a count of the IP addresses mentioned in this file. We can do that manually here, since it is a small sample size. We have 3 occurrences of this IP address, 3 of this one, and 1 of this one. That was easy, but let's say you have millions of rows. What then? Well, typically, this would be done using a database or something similar. With a database, you would write a query that looks something like this: SELECT COUNT(*) AS total, ip FROM data GROUP BY ip. This looks through all your rows and gets the occurrence count grouped by IP address. The results would look something like this. This 3 says we have 3 occurrences of this IP address in our data, up here.

Let's look at how we can do this on the command line. This is a quick and dirty pattern which can be applied to all types of log data. It is not all that efficient, but it gets the job done.

Let's use the head command to grab the first line of the NASA web server log so we can see what the fields look like. Since this file has roughly 1.8 million rows, spanning the entire month of July 1995, we can use grep as a filter to select chunks of data by day. Let's say we wanted to count the occurrences of each IP address for July 1st, 1995, from this file. I would do something like this.

head -n 1 NASA_access_log_Jul95

First grep, then just copy and paste the filter I want to search for, in this case July 1st, 1995. Next, pipe the data to awk; this tells awk to only grab the first field. Next, let's sort the IP addresses so that we can feed them into the uniq command (uniq only collapses duplicate lines that sit next to each other, which is why we sort first). Uniq allows you to dedupe data fed to it, which can be extremely useful, but there is also a switch to provide you with a count of the duplicates (this is where the magic happens). Next, let's sort the uniq output numerically and in reverse order, meaning the highest counts are at the top. Finally, let's only grab the first 15 lines of output using the head command.

grep '01/Jul/1995' NASA_access_log_Jul95 | 
  awk '{print $1}' | 
  sort | 
  uniq -c | 
  sort -h -r | 
  head -n 15

So, we just searched through this file with 1.8 million rows, plucked out records we are interested in by date, found the IP address field, and then grouped all IP addresses together by count. This is extremely quick and dirty but gets the job done.
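To tie this back to the illustration from earlier, here is a small sketch that runs the same grouping on a tiny hand-made file. The sample rows and addresses are made up for this example (they are not from the NASA log), and the counts match the 3, 3, and 1 we counted by hand. I am using sort -n -r here, which behaves the same as sort -h -r on plain counts.

```shell
# Hypothetical sample data: IP address, date, and URL on each line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
10.0.0.1 01/Jul/1995 /home
10.0.0.2 01/Jul/1995 /about
10.0.0.1 01/Jul/1995 /help
10.0.0.3 01/Jul/1995 /home
10.0.0.2 01/Jul/1995 /home
10.0.0.1 01/Jul/1995 /about
10.0.0.2 01/Jul/1995 /help
EOF

# Roughly the same idea as:
#   SELECT COUNT(*) AS total, ip FROM data GROUP BY ip
awk '{print $1}' "$sample" |   # keep only the IP address field
  sort |                       # put identical addresses next to each other
  uniq -c |                    # collapse them and prefix each with its count
  sort -n -r                   # highest counts first
```

Each output line is a count followed by the address it belongs to, which is the same shape as the database result.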

Why would you ever want to do something like this? Well, let's say, for example, that you have a web server and it is throwing high load alerts. You might want to dig through the logs to see if something is abusing the system. Maybe a URL is getting hammered, or is it just an influx of regular traffic? This would also work with firewall logs, say, for example, if you want to see whether a particular IP address or port keeps coming up. Using this pattern, you can quickly find the answer to these questions.

I should mention that we went through this pretty quickly. To really understand how this is working, you need to try it yourself. I suggest downloading the sample log file and working through this chain of commands. Start with the first command, then add the others one at a time, seeing how the output is filtered at each step. Also, I cannot recommend it enough: use the manual pages, they are extremely useful.
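If you do not have the NASA log handy, working through the chain one command at a time might look something like the sketch below. The four log lines are made up for this example; they only copy the layout of the NASA log, and the host names and URLs are placeholders.

```shell
# Hypothetical lines in the same layout as the NASA log (not real entries).
log=$(mktemp)
cat > "$log" <<'EOF'
host1.example.com - - [01/Jul/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 1024
host2.example.com - - [01/Jul/1995:00:00:05 -0400] "GET /images/logo.gif HTTP/1.0" 200 512
host1.example.com - - [01/Jul/1995:00:00:09 -0400] "GET /images/logo.gif HTTP/1.0" 200 512
host3.example.com - - [02/Jul/1995:00:00:02 -0400] "GET /index.html HTTP/1.0" 200 1024
EOF

# Step 1: filter by day and look at the raw lines that survive.
grep '01/Jul/1995' "$log"

# Step 2: add awk to keep only the first field.
grep '01/Jul/1995' "$log" | awk '{print $1}'

# Step 3: add sort so identical values end up next to each other.
grep '01/Jul/1995' "$log" | awk '{print $1}' | sort

# Step 4: add uniq -c to collapse duplicates and show a count for each.
grep '01/Jul/1995' "$log" | awk '{print $1}' | sort | uniq -c
```

On the real log, it helps to tack head -n 5 onto the end of each step so you only see a few lines at a time.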

So, how would we change the field we want to group by, let's say, for example, to URL? Well, let's jump back to the illustrations again. We have our records, but we want to know if a particular URL is getting hammered. So, let's crawl through all the records and group them together. Here we would group home, about, and help.

Again, a database query would look something like this: SELECT COUNT(*) AS total, url FROM data GROUP BY url. This looks through all your rows and gets the occurrence count grouped by URL. The results would look something like this. This 3 says we have 3 occurrences of this URL in our data, up here.

Well, we already know how to do this. But how do we change the field we want to group by? Easy. Let's just look at the first line of the log file again using the head command, and count the number of fields: 1, 2, 3, 4, 5, 6, 7. Okay, so the URL field in our dataset is the 7th one in. So, let's take our command from a couple of minutes ago, change the awk command to grab the 7th field instead of the 1st, and rerun it.

head -n 1 NASA_access_log_Jul95

grep '01/Jul/1995' NASA_access_log_Jul95 | 
  awk '{print $7}' | 
  sort | 
  uniq -c | 
  sort -h -r | 
  head -n 15

Cool, right? Personally, I just love this. It has helped me so many times. Say, for example, that you have a CGI script right at the top. Maybe it is just getting hammered, and maybe by a particular IP address, say a bot, for example.
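To follow up on that kind of hunch, one option is to filter on the URL first and then group by the first field, so you can see who is requesting that one URL. Here is a minimal sketch. The function name top_clients_for_url and the URL in the usage note are made up for this example, and it assumes the URL is the 7th field, as it is in this log.

```shell
# top_clients_for_url: hypothetical helper, not something from the episode.
#   $1 = log file
#   $2 = exact URL to match against field 7
#   $3 = number of rows to show (defaults to 15)
top_clients_for_url() {
  awk -v url="$2" '$7 == url {print $1}' "$1" |  # first field of matching requests
    sort |                                       # group identical clients together
    uniq -c |                                    # count each client
    sort -n -r |                                 # busiest clients first
    head -n "${3:-15}"
}
```

You would call it with something like top_clients_for_url NASA_access_log_Jul95 /cgi-bin/example.cgi, swapping in whichever URL showed up at the top of your own results.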

Okay, so earlier we talked about creating cool little graphs using the spark command. Let's try it by taking the command we just ran and piping the data to the script. I am just going to use awk again to grab the first field of output, which holds the counts, and then pipe those numbers over to our spark script. Pretty cool. What is so neat about this is that if you have plenty of data, you can see how the counts relate to each other.

grep '01/Jul/1995' NASA_access_log_Jul95 | 
  awk '{print $7}' | 
  sort | 
  uniq -c | 
  sort -h -r | 
  head -n 15 | 
  awk '{print $1}' | 
  ./spark

For many years I have used these commands in different configurations to solve these sorts of problems. After writing this episode, I am starting to think that this could actually be packed up into a shell script for easy distribution.
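As a rough idea of what that packaging could look like, here is a minimal sketch written as a shell function. The name count_by_field and its argument order are made up for this example, and it uses sort -n -r, which gives the same ordering as sort -h -r for plain counts.

```shell
# count_by_field: hypothetical wrapper around the command chain from this episode.
# Usage: count_by_field PATTERN FIELD FILE [LIMIT]
count_by_field() {
  pattern=$1        # text for grep to filter on, e.g. a date
  field=$2          # field number for awk to print
  file=$3           # log file to read
  limit=${4:-15}    # how many rows to show (defaults to 15)

  grep -- "$pattern" "$file" |
    awk -v f="$field" '{print $f}' |
    sort |
    uniq -c |
    sort -n -r |
    head -n "$limit"
}
```

With that defined, count_by_field '01/Jul/1995' 1 NASA_access_log_Jul95 would give the grouping by the first field, and changing the 1 to a 7 would give the grouping by URL.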

You are probably thinking that this would be more efficient, packaged up into the awk command, and you are right. However, I like to keep things modular, so that if I need to swap something in or out of this command chain, it is easy.
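For comparison, a single awk version might look something like the sketch below. The function name count_by_field_awk is made up for this example. It replaces the grep, awk, sort, and uniq steps with one pass that tallies counts in an awk array, but it still needs sort afterwards because awk does not promise any particular order when looping over an array. One difference worth knowing: index() matches the text literally, while grep treats its argument as a regular expression.

```shell
# count_by_field_awk: hypothetical single-pass variant of the command chain.
# Usage: count_by_field_awk TEXT FIELD FILE [LIMIT]
count_by_field_awk() {
  awk -v text="$1" -v f="$2" '
    index($0, text) { count[$f]++ }    # line contains TEXT: tally the chosen field
    END { for (key in count) print count[key], key }
  ' "$3" |
    sort -n -r |                       # highest counts first
    head -n "${4:-15}"
}
```

Calling count_by_field_awk '01/Jul/1995' 7 NASA_access_log_Jul95 should report the same counts as the longer chain, although rows with equal counts may come out in a different order.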

Anyways, if you use something similar or have improvements, I would love to hear about it.