- Episode #35 - ZFS on Linux (Part 1 of 2)
- Install ZFS on Debian GNU/Linux
- ZFS on Linux
- Oracle Solaris ZFS Administration Guide
- ZFS Troubleshooting Guide
- FreeNAS Community: ECC vs non-ECC RAM and ZFS
- Google Groups: ZFS w/o ECC RAM -> Total loss of data
- ZFS Administration, Appendix C - Why You Should Use ECC RAM
- ZFS on Linux: 1.7 What /dev/ names should I use when creating my pool?
- FreeNAS: Creating ZFS Datasets
- ZFS Administration, Part XI - Compression and Deduplication
- ZFS Terminology
- Nex7's Blog: ECC vs non-ECC RAM: The Great Debate
- ZFS: Read Me 1st
- FreeNAS: Hardware Recommendations
In this episode, let's continue our look at ZFS on Linux. In part one, we covered the basic ins and outs of ZFS, so this time let's focus on some of the more advanced filesystem features offered by ZFS.
If you have not already watched ZFS on Linux part #1, episode #35, I highly suggest you go do that now, as it plays well into this episode. Before we dive into compression, snapshots, quotas, and data deduplication using ZFS, I thought it would be a good idea to cover a couple of best practices, just in case these episodes inspire you to try ZFS for yourself.
The first suggestion is that if you are going to be using ZFS in production, it is highly recommended that you use ECC memory. ZFS is great at detecting and repairing data corruption via checksums, but if a bad memory module is silently corrupting data in RAM, and that bad data gets written to disk, you could be in a very bad situation. If you want to read more about this, you can check out the FreeNAS forums for a nice discussion, this good Google Groups thread, and finally the ZFS Administration Guide appendix on why you should use ECC memory. This is a large topic, and it is not limited to ZFS, so if you are interested in reading more, these links can be found in the episode notes below.
Let me just refresh your memory about our test setup. You might remember that, back in episode #35, I created a virtual environment with 10 disks, /dev/sdb through sdk, each roughly 100 MB in size. We then used these virtual devices to play around with ZFS pools. You might also remember that we can take a look at existing ZFS pools by running zpool status.
# ls -l /dev/sd[bcdefghijk]
brw-rw---- 1 root disk 8,  16 Sep 9 03:50 /dev/sdb
brw-rw---- 1 root disk 8,  32 Sep 9 03:50 /dev/sdc
brw-rw---- 1 root disk 8,  48 Sep 9 03:50 /dev/sdd
brw-rw---- 1 root disk 8,  64 Sep 9 03:50 /dev/sde
brw-rw---- 1 root disk 8,  80 Sep 9 03:50 /dev/sdf
brw-rw---- 1 root disk 8,  96 Sep 9 03:50 /dev/sdg
brw-rw---- 1 root disk 8, 112 Sep 9 04:27 /dev/sdh
brw-rw---- 1 root disk 8, 128 Sep 9 04:27 /dev/sdi
brw-rw---- 1 root disk 8, 144 Sep 9 04:28 /dev/sdj
brw-rw---- 1 root disk 8, 160 Sep 9 04:28 /dev/sdk
# zpool status
  pool: e37pool
 state: ONLINE
  scan: none requested
config:

    NAME          STATE     READ WRITE CKSUM
    e37pool       ONLINE       0     0     0
      mirror-0    ONLINE       0     0     0
        sdb       ONLINE       0     0     0
        sdc       ONLINE       0     0     0
      mirror-1    ONLINE       0     0     0
        sdd       ONLINE       0     0     0
        sde       ONLINE       0     0     0
      mirror-2    ONLINE       0     0     0
        sdf       ONLINE       0     0     0
        sdg       ONLINE       0     0     0

errors: No known data errors
# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root  8.4G  1.2G  6.9G  15% /
tmpfs                         230M     0  230M   0% /dev/shm
/dev/sda1                     485M   54M  406M  12% /boot
e37pool                       225M     0  225M   0% /e37pool
I just wanted to quickly show you how to grow ZFS pools, because I think it plays well into a second recommended best practice. So, here we have our e37pool, and it is made up of several disk mirrors. The e37pool is also mounted, and it is about 225 MB in size. Now, let's say that you have reached the capacity of this pool and you want to add some additional space, how would you go about that? Well, since ZFS is a combined volume manager and filesystem, it is actually extremely easy to add capacity to a ZFS filesystem. It just so happens that we have four spare virtual disks on our system, so let's work through adding those to this pool. Let's type, zpool add e37pool mirror sdh sdi mirror sdj sdk. This will create two mirrors, one of sdh and sdi, and a second of sdj and sdk, then add both of them to the e37pool.
# zpool add e37pool mirror sdh sdi mirror sdj sdk
We can verify this worked by running zpool status again, and as you can see we have our two new mirrors down here. We can also verify the filesystem was resized by running df again. So, before we had 225 MB of space, and now we have 396 MB. Personally, I think this is pretty cool and a major advancement over existing filesystems, in that if you wanted to do something like this with LVM, you would have to run a whole series of commands to add devices to the raid, grow the volume group, and finally grow the filesystem (there is a rough sketch of what that looks like after the output below). You might even have to take the volume off-line for some of these steps. With ZFS, it is one command, and it was all done on-line. I should also mention that you can only expand a zpool, you cannot shrink it.
# zpool status
  pool: e37pool
 state: ONLINE
  scan: none requested
config:

    NAME          STATE     READ WRITE CKSUM
    e37pool       ONLINE       0     0     0
      mirror-0    ONLINE       0     0     0
        sdb       ONLINE       0     0     0
        sdc       ONLINE       0     0     0
      mirror-1    ONLINE       0     0     0
        sdd       ONLINE       0     0     0
        sde       ONLINE       0     0     0
      mirror-2    ONLINE       0     0     0
        sdf       ONLINE       0     0     0
        sdg       ONLINE       0     0     0
      mirror-3    ONLINE       0     0     0
        sdh       ONLINE       0     0     0
        sdi       ONLINE       0     0     0
      mirror-4    ONLINE       0     0     0
        sdj       ONLINE       0     0     0
        sdk       ONLINE       0     0     0

errors: No known data errors
# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root  8.4G  1.2G  6.9G  15% /
tmpfs                         230M     0  230M   0% /dev/shm
/dev/sda1                     485M   54M  406M  12% /boot
e37pool                       396M     0  396M   0% /e37pool
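Just for comparison, here is a rough sketch of what that same expansion might look like with mdadm and LVM on top of an ext4 filesystem; the md device, volume group, and logical volume names are made up for illustration, and the exact steps vary with your setup:
# mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdh /dev/sdi
# pvcreate /dev/md1
# vgextend vg_data /dev/md1
# lvextend -l +100%FREE /dev/vg_data/lv_data
# resize2fs /dev/vg_data/lv_data
That is five commands across three different tools, versus a single zpool add.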
Okay, so now that we have refreshed our memories about ZFS pools and what they look like, let's chat about the second recommended best practice. This has to do with how I created the pool in episode #35 using Linux device names, devices like /dev/sdb, sdc, sdd, etc. Let's just start fresh by running zpool status again; I am talking about these devices down here. If this were a real production system, you would likely not want to use device names, because it creates a management nightmare if you have lots of devices. There is a great little blurb about this on the ZFS on Linux FAQ page. You actually have many choices for how you reference devices in a pool. For example, in episode #35 we used the Linux device names, but you can also use drive identifiers, things like serial numbers. Or you can choose to use physical layout information, things like PCI slot and port numbers. Finally, you can also create your own label types, which might describe the physical locations.
So, why would you want to do something like this? Well, it basically boils down to ease of management when things go wrong. Say, for example, that you think ZFS is great, and you purchase some hardware that supports lots of disks, let's say a 48 bay chassis. You have it all up and running, using ECC memory, and things are going great, until you have a failed disk. If you used the device names provided by Linux, like I did in these examples, it is likely going to be a nightmare trying to find the failed disk, as it is not clear how these device names map to physical locations. Once you have figured out the mapping, you will likely want to swap out the failed disk and rebuild the ZFS storage pool on-line. The issue is that there is an extremely high chance that when you reboot the machine down the road, your drive letters will have shifted, since when that disk failed and you replaced it, the new disk was likely given a new device name! So, if you created the pool using device names, it is now out of whack, because the device names moved around between reboots when the dead disk was removed from the system. This is fixable, just by asking ZFS to import the pool again and reread the metadata off the disks, but it requires manual intervention. So, I would suggest creating the pool using something like drive identifiers, then no matter what the drive letters are, your pool should always come back on-line without manual intervention. A handy trick is to create the pool using disk by-id names, then add a physical label to each disk carrier which indicates unique information about that disk, like the serial number. When a failure happens, it is easy to match up the failed disk information from zpool status with the physical label on each disk.
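As a quick sketch of that custom-label idea, ZFS on Linux lets you define your own aliases in /etc/zfs/vdev_id.conf, mapping a by-id name to a friendly label like a bay number. The bay aliases, example serial numbers, and testpool name below are all placeholders to show the concept, not something we will run on this test box:
# cat /etc/zfs/vdev_id.conf
alias bay01 /dev/disk/by-id/ata-EXAMPLE_SERIAL_0001
alias bay02 /dev/disk/by-id/ata-EXAMPLE_SERIAL_0002
# udevadm trigger
# zpool create testpool mirror /dev/disk/by-vdev/bay01 /dev/disk/by-vdev/bay02
After udev parses the file, the aliases should show up under /dev/disk/by-vdev, and zpool status should then report the friendly bay names, which you can match against the labels on your disk carriers.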
If you happened to create a ZFS pool using explicit /dev/sdb style drive letters, do not worry, you can convert to using disk by-id names after the fact. Let's export our ZFS pool, which is currently using /dev/sd device names, then import it using the new disk by-id names. After this is done, even if the underlying device names get shuffled around, our pool will always come back.
# zpool status
  pool: e37pool
 state: ONLINE
  scan: none requested
config:

    NAME          STATE     READ WRITE CKSUM
    e37pool       ONLINE       0     0     0
      mirror-0    ONLINE       0     0     0
        sdb       ONLINE       0     0     0
        sdc       ONLINE       0     0     0
      mirror-1    ONLINE       0     0     0
        sdd       ONLINE       0     0     0
        sde       ONLINE       0     0     0
      mirror-2    ONLINE       0     0     0
        sdf       ONLINE       0     0     0
        sdg       ONLINE       0     0     0
      mirror-3    ONLINE       0     0     0
        sdh       ONLINE       0     0     0
        sdi       ONLINE       0     0     0
      mirror-4    ONLINE       0     0     0
        sdj       ONLINE       0     0     0
        sdk       ONLINE       0     0     0

errors: No known data errors
You can export the pool, which unmounts it and effectively removes it from the system, by running zpool export, followed by the pool name, in our case e37pool. Let's verify that it is actually gone by running zpool status, and as you can see there are no pools available; let's also verify it is unmounted by running df. One cool thing about ZFS, and this is possible with some other raid setups too, is that you could actually move these disks to a different system and import the pool there, because all the metadata about the ZFS pool is stored on the disks themselves.
# zpool export e37pool
# zpool status
no pools available
# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root  8.4G  1.2G  6.9G  15% /
tmpfs                         230M     0  230M   0% /dev/shm
/dev/sda1                     485M   54M  406M  12% /boot
Now let's import our pool again, but this time using the new disk by-id names, so that we can survive devices like /dev/sdb being renamed out from under us. We are going to import the ZFS pool by reading the metadata off each disk. So, let's type zpool import, then -d /dev/disk/by-id, which tells ZFS which device directory to scan, then the -f option to force the import, and finally the pool name, in our case e37pool.
# zpool import -d /dev/disk/by-id -f e37pool
Then, let's verify it worked by running zpool status. As you can see, we are now using unique drive identifiers to reference each device, rather than the block device names like /dev/sdb. We can also verify it was mounted correctly by running df. I should mention that when we ran zpool import, it scanned the devices looking for ZFS metadata and reconstructed our pool; this could take a while if you have lots of disks.
# zpool status
  pool: e37pool
 state: ONLINE
  scan: none requested
config:

    NAME                                       STATE     READ WRITE CKSUM
    e37pool                                    ONLINE       0     0     0
      mirror-0                                 ONLINE       0     0     0
        ata-VBOX_HARDDISK_VBc142d444-59aeef39  ONLINE       0     0     0
        ata-VBOX_HARDDISK_VBabc3cdd9-ba6145ab  ONLINE       0     0     0
      mirror-1                                 ONLINE       0     0     0
        ata-VBOX_HARDDISK_VBdd3e501a-2b4e57c7  ONLINE       0     0     0
        ata-VBOX_HARDDISK_VBdb63c965-9604aaeb  ONLINE       0     0     0
      mirror-2                                 ONLINE       0     0     0
        ata-VBOX_HARDDISK_VB7f5992e7-bfa19485  ONLINE       0     0     0
        ata-VBOX_HARDDISK_VB52eabb4c-f0840b75  ONLINE       0     0     0
      mirror-3                                 ONLINE       0     0     0
        ata-VBOX_HARDDISK_VB55b43204-09b13de0  ONLINE       0     0     0
        ata-VBOX_HARDDISK_VB9983ccb7-24ace9eb  ONLINE       0     0     0
      mirror-4                                 ONLINE       0     0     0
        ata-VBOX_HARDDISK_VB16ce4ace-5fa97d8b  ONLINE       0     0     0
        ata-VBOX_HARDDISK_VB7a9e31e7-963cae3f  ONLINE       0     0     0

errors: No known data errors
# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root  8.4G  1.2G  6.9G  15% /
tmpfs                         230M     0  230M   0% /dev/shm
/dev/sda1                     485M   54M  406M  12% /boot
e37pool                       396M     0  396M   0% /e37pool
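If you are ever curious how these ids map back to the kernel device names, the by-id entries are just symlinks to the /dev/sdX nodes, so a quick listing like the following shows the mapping (I will leave the output out here):
# ls -l /dev/disk/by-id/ | grep VBOX_HARDDISK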
Okay, so that covers several recommended best practices and how you grow your ZFS pool. Let's move on to the advanced filesystem features offered by ZFS on Linux, things like compression, deduplication, snapshots, and quotas.
Up until this point, we have mainly talked about the zpool command for working with ZFS on Linux, but there is actually a second command, called just zfs. The zfs command allows you to turn various features on and off, along with getting and setting properties. The following is probably best described through diagrams. Let's say, for example, that you are working at a research lab, and your ZFS pool has 100 TB of storage. This is a lot of storage to play around with, and you will likely have many projects and people working in this storage pool, so you will likely want to shape the way they use it.
You see, with ZFS you create a storage pool, and it is presented as just one large chunk of storage, as shown by this box diagram. You will sometimes hear people say pool of storage or tank of storage; these are the most common terms, and they both refer to the same thing.
Let's say we have four projects, A, B, C, and D. All of these projects can use the same pool or tank of storage, but they all have different requirements. Let's say project A is a group of scientists working with gene sequencing data. Typically, these are large text files coming off the sequencers, so we want to add a data deduplication policy to their area. Also, since these are mainly text files, we should add compression to their area, as we can likely save lots of space. Finally, since we know they have large storage requirements, we want to enforce some type of quota, just to make sure they do not use space assigned to projects B, C, or D.
It just so happens that ZFS allows you to create areas called datasets; these look just like directories on the end system, but you can assign all types of advanced features to each dataset. We will use these datasets to carve up the large pool or tank of storage.
Next, let's say project B is mainly user data and home directories, and you have had many requests to restore files from backup, since things occasionally get deleted. You think it would be a good idea to add hourly snapshots during office hours, so that files can be quickly restored from the snapshot folder rather than from tape backup.
Actually, I should remove this hard line and add a dotted one, and the same goes for projects C and D. The reason being that project A is the only one with a quota, meaning project A cannot consume more than a set limit of storage, whereas projects B, C, and D are basically sharing the rest of the storage pool. I created datasets for projects C and D even though we do not have any special settings for them yet; this just adds flexibility down the road, say for example if you need to add snapshots or something. Anyways, now that we have a high level overview of what we want, let's look at setting this up. I think you will be blown away by how easy it is.
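Just to give you a rough idea of where we are headed, the plan above boils down to a handful of zfs commands, something like the sketch below; the quota size and snapshot name are placeholders I picked for illustration, and we will actually walk through the compression piece step by step in a moment:
# zfs set dedup=on e37pool/project-a
# zfs set compression=lz4 e37pool/project-a
# zfs set quota=100M e37pool/project-a
# zfs snapshot e37pool/project-b@hourly-0900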
So, let's just get our bearings by running df -h and zpool list. First off, let's create our four project datasets, by running zfs create e37pool/project-a, then b, c, and finally d. You will notice a convention happening here: e37pool is our large chunk of storage, and project-a, b, c, and d are the datasets used to divide up that storage. Let's run df -h again, and you will notice that we have four new mounts, one for each of our project datasets.
# zfs create e37pool/project-a
# zfs create e37pool/project-b
# zfs create e37pool/project-c
# zfs create e37pool/project-d
# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root  8.4G  1.3G  6.7G  17% /
tmpfs                         230M     0  230M   0% /dev/shm
/dev/sda1                     485M   54M  406M  12% /boot
e37pool                       395M     0  395M   0% /e37pool
e37pool/project-a             395M     0  395M   0% /e37pool/project-a
e37pool/project-b             395M     0  395M   0% /e37pool/project-b
e37pool/project-c             395M     0  395M   0% /e37pool/project-c
e37pool/project-d             395M     0  395M   0% /e37pool/project-d
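Besides df, you can also inspect the new datasets with the zfs command itself; for example, something like the following will list them and show a few of the properties we care about (I will skip the output here):
# zfs list
# zfs get compression,dedup,quota e37pool/project-a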
First, let's configure compression. I thought it would be cool to show you examples of all these features, so I downloaded a large log file that we will use as a test. Right now, I am sitting in root's home directory. Let me just list the files here; as you can see there is a NASA web server log, I have used these in previous episodes, #28 would be an example, and it is about 161 MB in size. Let me just show you the ZFS mounts again; as you can see, there is zero used space. So, let's turn on compression for project A, by running zfs set compression=lz4, where lz4 is the compression algorithm, followed by the pool and dataset we want to apply it to, in our case e37pool/project-a. That is it, files going into and coming out of the project-a dataset will now be compressed and decompressed on the fly, and this is totally transparent to the end user. Let's test this out by copying our 161 MB web server log into project A's area. Let's run df again, and you will notice that only 32 MB are used. You can also verify this by running zfs list. So, it looks like compression is working, but you can get the exact values by running zfs get compressratio e37pool/project-a. As you can see, we are getting a compression ratio of 5.12. Pretty cool! If you want to learn more about this, check out the ZFS Administration guide on Compression and Deduplication.
# pwd
/root
# ls -l
total 163908
-rw-------. 1 root root      1386 Jun  6  2013 anaconda-ks.cfg
-rw-r--r--. 1 root root      8526 Jun  6  2013 install.log
-rw-r--r--. 1 root root      3314 Jun  6  2013 install.log.syslog
-rw-r--r--  1 root root 167813770 Sep 10 03:53 NASA_access_log_Aug95
# head -n 1 NASA_access_log_Aug95
in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839