- ZFS on Linux
- Wikipedia: ZFS
- LLNL: High Performance Computing
- Sequoia’s 55PB Lustre+ZFS Filesystem
- ZFS Build: ZFS RAID levels
- ZFS: Read Me 1st
- RAID Capacity Calculator
- ZFS Raidz Performance, Capacity and Integrity
- Install ZFS on Debian GNU/Linux
- Oracle Solaris ZFS Administration Guide
- Episode #35 - ZFS on Linux (Part 1 of 2)
In part one of this two-part episode, we are going to look at ZFS on Linux. I will cover the basics of what ZFS is, how to create and work with ZFS pools, along with how to recover from common failure modes.
While researching this topic, it became clear that to really do it justice, I needed to split the episode into two parts. So, the first part will act as an intro to ZFS, covering how to create and work with ZFS pools, along with recovering from failures. Then, in the second part, we will look at some of the advanced features offered by ZFS, things like: compression, deduplication, quotas, and snapshots.
ZFS was designed from the ground up to be a highly scalable, combined volume manager and filesystem, with many advanced features. This work was done almost 10 years ago by Sun Microsystems, for their Solaris platform. In early 2010, Oracle acquired Sun Microsystems. Since its creation, ZFS has grown beyond Solaris. For example, BSD has had a working ZFS port for 7 years now. ZFS is just recently becoming popular on Linux, as the port is becoming much more stable. Since ZFS has been around for so long, there is no shortage of great documentation. If you are interested in reading about why ZFS is so great, along with a little history, the ZFS Wikipedia page is pretty good.
The ZFS on Linux port is produced by the Lawrence Livermore National Laboratory. I think this is worth talking about for a minute, as it adds some weight to the topic. Livermore is home to two of the fastest supercomputers in the world, ranked Nos. 3 and 9 in the Top 500. One of these top supercomputers, named Sequoia, just happens to run ZFS on Linux, which Livermore helped create. There is a great video about Sequoia's ZFS and Lustre implementation over on YouTube. I should also mention that there is a nice little community gathering around the ZFS on Linux project, so lots of people are using it now.
Let's jump back to the ZFS on Linux homepage and get started with installing ZFS on our system. ZFS on Linux is basically a kernel module plus user-land tools that you download, compile, and install. What is great about this is that you do not have to patch or recompile your kernel; ZFS is just a loadable module. It is actually very easy to get going, since there are packages for Debian, Fedora, Ubuntu, and RHEL, and for each OS it is just a matter of running a couple of commands. I am going to use CentOS on a virtual machine today, but the ZFS commands used throughout the demo should work across all Linux distributions.
So, let's get started. I am running CentOS 6.5 here. I have also run yum update to install all the latest updates; specifically, I wanted to make sure my kernel was up to date, since the ZFS package is going to build kernel modules against it. Okay, let's run those three commands from the ZFS on Linux website. The first one installs the EPEL repository for supporting packages. The second command installs a yum ZFS repository for the installation packages. Finally, the third command installs ZFS and supporting packages. As you can see, the install process builds kernel modules; this works in a similar manner to how you would install third-party network or video drivers. I should also mention that this install process took about 10 minutes to complete, so you are watching a heavily sped-up version. This last step is not required, but I thought it might be a good idea to reboot the machine, just to make sure everything is consistent on boot.
# cat /etc/redhat-release
CentOS release 6.5 (Final)
# yum update
No Packages marked for Update
# yum localinstall --nogpgcheck http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
# yum localinstall --nogpgcheck http://archive.zfsonlinux.org/epel/zfs-release.el6.noarch.rpm
# yum install kernel-devel zfs
Building initial module for 2.6.32-431.23.3.el6.x86_64
zavl:
 - Installing to /lib/modules/2.6.32-431.23.3.el6.x86_64/extra/
znvpair.ko:
 - Installing to /lib/modules/2.6.32-431.23.3.el6.x86_64/extra/
zunicode.ko:
 - Installing to /lib/modules/2.6.32-431.23.3.el6.x86_64/extra/
zcommon.ko:
 - Installing to /lib/modules/2.6.32-431.23.3.el6.x86_64/extra/
zfs.ko:
 - Installing to /lib/modules/2.6.32-431.23.3.el6.x86_64/extra/
zpios.ko:
 - Installing to /lib/modules/2.6.32-431.23.3.el6.x86_64/extra/
# reboot
Okay, so the machine has rebooted. Let's just verify the version again. Next, let's verify the ZFS kernel modules have been installed, by running lsmod and grepping for zfs. Looks good, since we have a bunch of modules loaded. Now that we have installed ZFS and verified that it is working, let's start to play around with it. For the demo today, I have added 10 disks to a VirtualBox VM; let's take a look at them by running ls -l /dev/sd*. The disks are /dev/sdb through sdk, and they are 100 MB each. Let's just run fdisk -l /dev/sdk to verify.
# cat /etc/redhat-release
CentOS release 6.5 (Final)
# lsmod |grep zfs
zfs                  1195533  0
zcommon                46278  1 zfs
znvpair                80974  2 zfs,zcommon
zavl                    6925  1 zfs
zunicode              323159  1 zfs
spl                   266655  5 zfs,zcommon,znvpair,zavl,zunicode
# ls -l /dev/sd*
brw-rw---- 1 root disk 8,   0 Aug 23 18:33 /dev/sda
brw-rw---- 1 root disk 8,   1 Aug 23 18:33 /dev/sda1
brw-rw---- 1 root disk 8,   2 Aug 23 18:33 /dev/sda2
brw-rw---- 1 root disk 8,  16 Aug 23 18:33 /dev/sdb
brw-rw---- 1 root disk 8,  32 Aug 23 18:33 /dev/sdc
brw-rw---- 1 root disk 8,  48 Aug 23 18:33 /dev/sdd
brw-rw---- 1 root disk 8,  64 Aug 23 18:33 /dev/sde
brw-rw---- 1 root disk 8,  80 Aug 23 18:33 /dev/sdf
brw-rw---- 1 root disk 8,  96 Aug 23 18:33 /dev/sdg
brw-rw---- 1 root disk 8, 112 Aug 23 18:33 /dev/sdh
brw-rw---- 1 root disk 8, 128 Aug 23 18:33 /dev/sdi
brw-rw---- 1 root disk 8, 144 Aug 23 18:33 /dev/sdj
brw-rw---- 1 root disk 8, 160 Aug 23 18:33 /dev/sdk
# fdisk -l /dev/sdk

Disk /dev/sdk: 104 MB, 104857600 bytes
255 heads, 63 sectors/track, 12 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
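Going back to the lsmod check for a second: if that had come back empty, the modules can usually just be loaded by hand. A minimal sketch, assuming the packages built and installed cleanly:

# modprobe zfs
# lsmod |grep zfs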
I am just going to go ahead and create a ZFS pool, and then we can review what it means. Let's type zpool create -f e35pool raidz2, followed by sdb through sdi. So, what does this command mean? Well, zpool is one of the two ZFS commands you are likely to use all the time, the other one being zfs, but I will talk more about that in part two of this episode. Here we are using the create option, which means we want to create a new pool. Then comes the -f option, which forces creation even though these are brand new disks without partition labels; ZFS will just create the labels for us. Next, we need a unique pool name for each array we want to create; we are using e35pool, and this name will make more sense in a minute. Next we specify the raid level. ZFS has its own terminology for raid levels, but raidz2 is equivalent to raid6, meaning we can have two failed disks in the array and still be okay. Finally, we specify the eight disks to use in this array.
You can look at active ZFS pools by running zpool status. Here is the e35pool we just created: it is online, and down here you can see the raid level and the disks assigned to the software raid. Pretty easy so far, right?
# zpool create -f e35pool raidz2 sdb sdc sdd sde sdf sdg sdh sdi
# zpool status
  pool: e35pool
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    e35pool     ONLINE       0     0     0
      raidz2-0  ONLINE       0     0     0
        sdb     ONLINE       0     0     0
        sdc     ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0
        sdf     ONLINE       0     0     0
        sdg     ONLINE       0     0     0
        sdh     ONLINE       0     0     0
        sdi     ONLINE       0     0     0

errors: No known data errors
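Before moving on, it is worth noting that raidz2 is only one of several layouts zpool create understands: raidz is single parity (roughly raid5), raidz3 is triple parity, and mirror is a plain mirror. I did not run these in the demo, so treat the following as a rough sketch reusing the same throw-away disk names:

# zpool create -f e35pool raidz sdb sdc sdd sde sdf
# zpool create -f e35pool raidz3 sdb sdc sdd sde sdf sdg
# zpool create -f e35pool mirror sdb sdc

Each of these would need a zpool destroy in between, since a pool name can only be in use once at a time.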
While doing research for this episode, I came across several really cool references, and I just wanted to take a minute and share them with you. The first one explains ZFS raid levels really well and covers some of the terminology. It talks about how raidz2 compares to raid6, but you can also create all sorts of raid variations with ZFS. The second link is also extremely useful, in that it offers advice from someone who has clearly used ZFS extensively, with many design suggestions and things to watch out for. If you are going to be playing around with ZFS, and especially looking at going into production, make sure you read this page! There is also a really useful raid capacity calculator that I ran across, and it supports ZFS. You can input your raid type, disk size, and number of disks, and it will help you figure out the resulting raid capacity. Finally, if you are interested in ZFS raid benchmarks, there is a fantastic link that goes into great detail about the topic. Like I was saying earlier in this episode, there is an astonishing amount of information about ZFS out there. So, if you find yourself in a jam, there is likely help available.
Okay, so let's get back to the demo. Our pool is using ZFS software raidz2, which is equivalent to raid6, meaning we can have two disk failures and still keep our array online. Before we go any further, let's run df -h and look at the output. You will notice that we have the e35pool ZFS filesystem mounted at /e35pool. So, the zpool create command we used earlier in the episode not only created a ZFS software raid, it also formatted and mounted it. Pretty neat.
# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root  8.4G  1.2G  6.9G  15% /
tmpfs                         230M     0  230M   0% /dev/shm
/dev/sda1                     485M   54M  406M  12% /boot
e35pool                       480M  128K  480M   1% /e35pool
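As a quick sanity check on that 480M figure, and as a rough rule of thumb rather than an exact formula, usable raidz capacity is about (number of disks minus parity disks) times the disk size; ZFS metadata and reserved space eat into the remainder, and that overhead is proportionally large on disks this small:

# echo $(( (8 - 2) * 100 ))
600

So roughly 600 MB of raw data capacity for our eight 100 MB disks under raidz2, which is in the right ballpark for the 480M that df reports.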
Just to show you how easy ZFS is, let's destroy this pool using zpool destroy, followed by the pool name, in this instance e35pool. Then let's run zpool status to verify our pool is gone.
# zpool destroy e35pool
# zpool status
no pools available
Now, let's create a raid10 by running zpool create -f e35pool mirror sdb sdc mirror sdd sde, and hitting enter. Let's check it out by running zpool status. So, we have our e35pool again, but this time we have two mirrors, each with two disks. You can also run the zpool list command to get details about the pool. Let's run df -h again, just to verify it is mounted. Okay, so that is creating and destroying pools in a nutshell.
# zpool create -f e35pool mirror sdb sdc mirror sdd sde
# zpool status
  pool: e35pool
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    e35pool     ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sdb     ONLINE       0     0     0
        sdc     ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0

errors: No known data errors
# zpool list
NAME      SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
e35pool   171M   238K   171M     0%  1.00x  ONLINE  -
# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root  8.4G  1.2G  6.9G  15% /
tmpfs                         230M     0  230M   0% /dev/shm
/dev/sda1                     485M   54M  406M  12% /boot
e35pool                       139M  128K  139M   1% /e35pool
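One thing I did not show in the demo: because this raid10 is just a stripe of mirrors, you can grow it later by adding another mirror vdev with zpool add, and new writes are then spread across all the mirrors. A quick sketch, assuming two spare disks:

# zpool add e35pool mirror sdf sdg
# zpool status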
There are actually many options for the zpool command. Let's just run the zpool command without any arguments, so that we get the help output. I am just going to scroll up to the top and work our way down. You will already recognize some of these, since we have already reviewed create, destroy, list, and status. But as you can see, there are many more, so let's look at a couple of them. Say, for example, that you wanted to get some statistics about the pool. We can use the iostat option, like this: zpool iostat e35pool. This gives us detailed information about the capacity, operations, and bandwidth. You can also get detailed data per device by running the same command with the -v option. Now the information is broken down by pool, mirror, and device.
# zpool
missing command
usage: zpool command args ...
where 'command' is one of the following:

    create [-fnd] [-o property=value] ... [-O file-system-property=value] ... [-m mountpoint] [-R root] <pool> <vdev> ...
    destroy [-f] <pool>
    add [-fn] [-o property=value] <pool> <vdev> ...
    remove <pool> <device> ...
    labelclear [-f] <vdev>
    list [-Hv] [-o property[,...]] [-T d|u] [pool] ... [interval [count]]
    iostat [-v] [-T d|u] [pool] ... [interval [count]]
    status [-vxD] [-T d|u] [pool] ... [interval [count]]
    online <pool> <device> ...
    offline [-t] <pool> <device> ...
    clear [-nF] <pool> [device]
    reopen <pool>
    attach [-f] [-o property=value] <pool> <device> <new-device>
    detach <pool> <device>
    replace [-f] <pool> <device> [new-device]
    split [-n] [-R altroot] [-o mntopts] [-o property=value] <pool> <newpool> [<device> ...]
    scrub [-s] <pool> ...
    import [-d dir] [-D]
    import [-d dir | -c cachefile] [-F [-n]] <pool | id>
    import [-o mntopts] [-o property=value] ... [-d dir | -c cachefile] [-D] [-f] [-m] [-N] [-R root] [-F [-n]] -a
    import [-o mntopts] [-o property=value] ... [-d dir | -c cachefile] [-D] [-f] [-m] [-N] [-R root] [-F [-n]] <pool | id> [newpool]
    export [-f] <pool> ...
    upgrade
    upgrade -v
    upgrade [-V version] <-a | pool ...>
    reguid <pool>
    history [-il] [<pool>] ...
    events [-vHfc]
    get [-p] <"all" | property[,...]> <pool> ...
    set <property=value> <pool>
# zpool iostat e35pool
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
e35pool      238K   171M      0      0     13    898
# zpool iostat e35pool -v
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
e35pool      238K   171M      0      0     12    886
  mirror    84.5K  85.4M      0      0      5    300
    sdb         -      -      0      0    471  1.71K
    sdc         -      -      0      0    467  1.71K
  mirror     154K  85.3M      0      0      6    586
    sdd         -      -      0      0    470  1.99K
    sde         -      -      0      0    470  1.99K
----------  -----  -----  -----  -----  -----  -----
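The usage output above also shows that iostat takes an optional interval and count, which is handy for watching activity live, and that there is a history subcommand that records every zpool command run against the pool. I did not capture their output here, but the invocations look like this:

# zpool iostat -v e35pool 5
# zpool history e35pool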
Let's briefly look at detecting data corruption and dealing with failures. Oftentimes, when things go bad, there will be a dead disk and it will be obvious what needs to be done: you replace it. A less common issue is data corruption, so let's try to replicate that. Let's type zpool status again to see an overview of our pool. As you can see in our configuration, and as discussed earlier, we have redundancy built in via our mirrors. So, let's corrupt the data on sdb by using the dd command to write zeros across the entire disk. Let's type zpool status again, and you will notice that everything still looks okay, as it might take a little while for ZFS to notice that the data on sdb is bad. However, we can force a data scrub; scrub is ZFS terminology for a verify operation. Let's type zpool scrub e35pool, then run zpool status again. Now you can see that ZFS picked up the issue with sdb as it crawled the pool looking for data integrity issues. There is even an action line, which provides us with the command to fix our mirror. Down here, we also see our sdb device marked as having corrupted data.
# zpool status
  pool: e35pool
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    e35pool     ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sdb     ONLINE       0     0     0
        sdc     ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0

errors: No known data errors
# dd if=/dev/zero of=/dev/sdb
dd: writing to `/dev/sdb': No space left on device
204801+0 records in
204800+0 records out
104857600 bytes (105 MB) copied, 4.82468 s, 21.7 MB/s
# zpool scrub e35pool
# zpool status
  pool: e35pool
 state: ONLINE
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h0m with 0 errors on Mon Aug 25 01:01:50 2014
config:

    NAME        STATE     READ WRITE CKSUM
    e35pool     ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sdb     UNAVAIL      0     0     0  corrupted data
        sdc     ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0

errors: No known data errors
Let's replace this bad disk with a good one via the zpool replace command. Let's type zpool replace, the pool name, so e35pool, the bad device, sdb, and then the good device, sdf. I am just going to append the zpool status command here, so that we can see the device rebuilding. Since the virtual disks I am using in this demo are so small, the rebuild happens pretty quickly, and I want to make sure we see it.
# zpool replace e35pool sdb sdf; zpool status
  pool: e35pool
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Aug 25 01:03:53 2014
        114K scanned out of 190K at 114K/s, 0h0m to go
        43K resilvered, 59.84% done
config:

    NAME             STATE     READ WRITE CKSUM
    e35pool          ONLINE       0     0     0
      mirror-0       ONLINE       0     0     0
        replacing-0  UNAVAIL      0     0     0
          sdb        UNAVAIL      0     0     0  corrupted data
          sdf        ONLINE       0     0     0  (resilvering)
        sdc          ONLINE       0     0     0
      mirror-1       ONLINE       0     0     0
        sdd          ONLINE       0     0     0
        sde          ONLINE       0     0     0

errors: No known data errors
# zpool status
  pool: e35pool
 state: ONLINE
  scan: resilvered 43K in 0h0m with 0 errors on Mon Aug 25 01:03:53 2014
config:

    NAME        STATE     READ WRITE CKSUM
    e35pool     ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sdf     ONLINE       0     0     0
        sdc     ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0

errors: No known data errors
Okay, so the status message has been updated. We can now see the device is being resilvered. Again, this is ZFS-specific terminology; resilvering is equivalent to rebuilding. Down here, you can see the rebuild happening, going from our bad device sdb to the good device sdf. Let's run zpool status again, and you can see everything is happy. Dealing with failures in ZFS is actually pretty painless.
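For completeness, a genuinely dead disk is handled much the same way. A rough sketch, assuming the failed device is sdb and the replacement disk goes into the same slot:

# zpool offline e35pool sdb
  ... swap the physical disk ...
# zpool replace e35pool sdb
# zpool status

If the replacement lands in a different slot, you would pass both the old and new device names to zpool replace, just like we did above.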
I just wanted to take a moment to talk about a couple more fantastic links for learning about ZFS. This 21-part guide is the most complete set of documentation for ZFS on Linux that I have found. There is even a section on scrubbing and resilvering, and it goes into great detail about how ZFS uses its knowledge of the filesystem to smartly repair data corruption.
You should also check out the official Oracle Solaris ZFS Administration Guide. Even though this is targeted at Solaris, almost all of it is compatible with the ZFS on Linux port. If you are using ZFS on Linux you will likely refer to these pages often.
This marks the end of ZFS on Linux - Part #1. Please continue on to ZFS on Linux - Part #2 (episode #37), which will be linked in the episode notes below when it is complete.