In this episode, I want to show you the lsof command. We are going to review several issues and see how lsof can be used to troubleshoot what is going on.
First off, what is the lsof command? Well, the man page is packed with useful information and real world examples, but briefly, the command displays information about open files on your system. Since we are using a UNIX-like operating system, just about everything is a file: traditional config files, libraries, executables, pipes, and sockets.
man lsof
Let's just run the lsof command without any options to view the default output. As you can see, there is plenty of output about the processes running on our system, including the files, pipes, and sockets those processes are using. We can pipe this default output to wc -l to get a line count. So, on our test system, we have 1022 lines of output, but if this were a busy machine, we could have tens of thousands of lines.
lsof
lsof | wc -l
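A raw line count only tells you how much output there is. As a rough sketch, the same pipe idea can show which commands account for most of those lines. The function name count_by_command is made up for this example, and it assumes the default lsof layout, where the first line is a header and the first column is the command name.

```shell
# Read lsof output on stdin and count how many rows each command has,
# busiest command first. The header line (row 1) is skipped.
count_by_command() {
  awk 'NR > 1 { print $1 }' | sort | uniq -c | sort -rn
}
```

You would run it as lsof | count_by_command | head to see the top entries.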
Since this command can show which regular files, pipes, and sockets a given process is using, it gives you a pretty good picture of what is happening behind the scenes. Let's run the ps command to list the active processes on our test system, then we can pick something that looks interesting and use the lsof command to see what files are being used.
ps aux
Let's just pick something at random here and focus on the auditd process, to figure out what files it has open. So, let's type lsof -p, which tells lsof that we want to focus on a specific process. Then, rather than typing the process ID number, we are going to use the pidof command. This command finds the process IDs for us; all we need to do is provide the process name. Finally, we have our list of open files for the auditd process.
lsof -p `pidof auditd`
I just wanted to focus on the pidof command for a minute, and then we can discuss the lsof output. So, we typed our lsof command, then we used these special quotes, called backticks, which execute the command inside them and substitute its output into our lsof command line. So, the equivalent command would look like this: lsof -p, then up here in the ps output we find the auditd line and its process ID, in this case 1831. Pidof comes in handy if you have long numbers, or the process ID is constantly changing, say for example if it is executed via a script, or you have multiple process numbers.
lsof -p 1831
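One thing to watch for with multiple process numbers: pidof prints them separated by spaces, while lsof -p expects a comma-separated list. Here is a minimal sketch of one way to bridge that, using the $(...) form of command substitution, which does the same job as backticks. The names join_pids and lsof_by_name are hypothetical helpers for this example, not part of lsof or pidof.

```shell
# Turn a whitespace-separated PID list (as printed by pidof) into the
# comma-separated form that lsof -p expects.
join_pids() {
  # xargs with no command echoes its input back with whitespace collapsed
  # to single spaces; tr then swaps each space for a comma.
  printf '%s ' "$@" | xargs | tr ' ' ','
}

# Hypothetical wrapper: list open files for every process with this name.
lsof_by_name() {
  pids=$(pidof "$1") || return 1   # pidof exits non-zero if nothing matches
  lsof -p "$(join_pids "$pids")"
}
```

With this in place, lsof_by_name auditd would behave like the backtick version, but should keep working if more than one matching process is running.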
Okay, let's review the lsof output. So, as you would expect, we have the process binary open, a bunch of library files, and a log file. If this were a network enabled process, you would also see the socket info. I suggest you try this out on your own to see what it looks like. So why is this useful? Well, say for example that we found some interesting process that was taking lots of CPU or memory. This tool would help us see what files it is interacting with, which can be extremely useful for seeing how a process functions.
Now that we know how the lsof command works in general, let's move on to a couple of real world examples where I have used the lsof command to solve problems. In the first example, let's say that you have a filesystem which you would like to unmount. Let's just review the mounts for a second by running the df command. Since this is a virtual machine configured by Vagrant, see episode #4 for details, we have the /vagrant mount. Say for example that we wanted to do some maintenance and needed to unmount this /vagrant mount. We would run umount /vagrant. We receive an error message saying that the device is busy. How would you go about troubleshooting this? In this case, the error is actually pretty helpful, in that it tells you to use the lsof command to find any open files under /vagrant. So, let's run lsof /vagrant.
df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root  8.4G  1.2G  6.8G  15% /
tmpfs                         246M     0  246M   0% /dev/shm
/dev/sda1                     485M   33M  427M   8% /boot
/vagrant                      1.8T  282G  1.6T  16% /vagrant
umount /vagrant/
umount: /vagrant: device is busy.
        (In some cases useful info about processes that use the device is found by lsof(8) or fuser(1))
lsof /vagrant/
COMMAND  PID    USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
less    2064 vagrant   4r    REG   0,18     3978    2 /vagrant/Vagrantfile
I already know what is going on, since behind the scenes I opened a second terminal window and am using the less command to view a file from the /vagrant mount. But if this were a real system, you would likely see user processes. Let's just review what the output is telling us here. The vagrant user is running the less command and looking at /vagrant/Vagrantfile. Since I know what is happening and this is just an example, I am going to kill the process. If this were the real world, you would likely need to work with the users to stop their jobs. Finally, let's verify the file has been closed using lsof, then rerun the umount command and verify the filesystem was actually unmounted. This technique is also really useful for seeing what is happening on NFS mounted filesystems.
kill 2064
lsof /vagrant
umount /vagrant/
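On a real system there may be many lines of lsof output for a busy mount, so it can help to boil them down to who you need to talk to. The following is only a sketch under the assumption that the output uses the default column layout shown above; list_holders is a made-up name for this example.

```shell
# Read `lsof <mountpoint>` output on stdin and print one
# "PID USER COMMAND" line per process, skipping the header row.
list_holders() {
  awk 'NR > 1 { print $2, $3, $1 }' | sort -u
}
```

You would run it as lsof /vagrant | list_holders. If you only need the process IDs, lsof -t /vagrant prints them one per line with no header.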
In this final example, I wanted to talk about a weird issue that happened to me. I started receiving alerts for one of our systems saying the filesystem for a particular mount was full. So, I logged in and started to troubleshoot the issue. One of the first things I did was run the df command, just to verify that the alert that was firing was actually real. As you can see in this recreated example, the /mnt/vfs mount is 100% full. Let's head into that directory and have a look to see if we can spot where the space is being used up. I will typically run du -hs * to give me a summary of where each folder sits in terms of size. This helps me narrow down where to look. What is interesting here is that we only have one directory, called lost+found, and it is only 12K in size. So, where is the file that is filling the mount?
df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root  8.4G  1.3G  6.7G  17% /
tmpfs                         246M     0  246M   0% /dev/shm
/dev/sda1                     485M   33M  427M   8% /boot
/dev/loop0                     30M   29M     0 100% /mnt/vfs
cd /mnt/vfs/
ls -l
total 12
drwx------ 2 root root 12288 Jul 20 21:28 lost+found
du -hs *
12K     lost+found
You can probably guess where this is going, but let's jump back for a second. The df command is telling us that our 30MB device mounted as /mnt/vfs is 100% full, but then down here the du command is telling us that there is only 12K used? Something weird is happening. Let's use lsof to look for open files on the /mnt/vfs directory. So, we have some output here, but this particular line is interesting. It says the example-log-data file is opened by the vagrant user but deleted. So, what happened here? Well, in the real world, someone logged into the system and was reviewing a log that was pumping out tons of data, then forgot about their session. A background process came along and deleted the log file. However, the log was still active since someone had it open, so the space was never actually freed. The du command only counts files it can reach through the directory tree, while df reports what the filesystem itself has allocated, which is why the two disagree. So, if you ever see inconsistent info between df and du, make sure you check this out. Let's kill the process, and then verify that df looks happy.
lsof /mnt/vfs/
COMMAND  PID    USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
bash    1943    root  cwd    DIR    7,0     1024    2 /mnt/vfs
less    3243 vagrant   4r    REG    7,0 28618752   12 /mnt/vfs/example-log-data (deleted)
lsof    3250    root  cwd    DIR    7,0     1024    2 /mnt/vfs
lsof    3251    root  cwd    DIR    7,0     1024    2 /mnt/vfs
kill 3243
df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root  8.4G  1.4G  6.6G  17% /
tmpfs                         246M     0  246M   0% /dev/shm
/dev/sda1                     485M   33M  427M   8% /boot
/dev/loop0                     30M  1.4M   28M   5% /mnt/vfs
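If you run into this df versus du mismatch on a busier system, picking the deleted entries out by eye gets tedious. Here is a small sketch that filters lsof output down to just those lines. The name find_deleted is made up for this example, and it assumes the default column layout shown above and file names without spaces.

```shell
# Read lsof output on stdin and print "PID FD SIZE NAME" for every
# open file that lsof has marked as (deleted).
find_deleted() {
  awk '$NF == "(deleted)" { print $2, $4, $7, $(NF-1) }'
}
```

You would run it as lsof /mnt/vfs | find_deleted. The lsof man page also describes a +L1 option for selecting open files that have been unlinked, which may be worth trying as an alternative to filtering the text.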