- ZFS on Linux
- Wikipedia: ZFS
- LLNL: High Performance Computing
- Sequoia’s 55PB Lustre+ZFS Filesystem
- ZFS Build: ZFS RAID levels
- ZFS: Read Me 1st
- RAID Capacity Calculator
- ZFS Raidz Performance, Capacity and Integrity
- Install ZFS on Debian GNU/Linux
- Oracle Solaris ZFS Administration Guide
- Episode #35 - ZFS on Linux (Part 1 of 2)
In part one of this two-part episode, we are going to look at ZFS on Linux. I will cover the basics of what ZFS is, how to create and work with ZFS pools, along with how to recover from common failure modes.
While researching this topic, it became clear that to really do it justice, I needed to split the episode into two parts. So, the first part will act as an intro to ZFS, covering how to create and work with ZFS pools, along with recovering from failures. Then, in the second part, we will look at some of the advanced features offered by ZFS, things like: compression, deduplication, quotas, and snapshots.
ZFS was designed from the ground up to be a highly scalable, combined volume manager and filesystem, with many advanced features. This work was done almost 10 years ago by Sun Microsystems, for their Solaris platform. In early 2010, Oracle acquired Sun Microsystems. Since its creation, ZFS has grown beyond Solaris. For example, BSD has had a working ZFS port for 7 years now. ZFS is just recently becoming popular on Linux, as the port is becoming much more stable. Since ZFS has been around for so long, there is no shortage of great documentation. If you are interested in reading about why ZFS is so great, along with a little history, the ZFS Wikipedia page is pretty good.
The ZFS on Linux port is produced by the Lawrence Livermore National Laboratory. I think this is worth talking about for a minute, as it adds some weight to the topic. Livermore is home to two of the fastest supercomputers in the world, ranked Nos. 3 and 9 in the Top 500. One of these top supercomputers, named Sequoia, just happens to run ZFS on Linux, which Livermore helped create. There is a great video about Sequoia's ZFS and Lustre implementation over on YouTube. I should also mention that there is a nice little community gathering around the ZFS on Linux project, so lots of people are using it now.
Let's jump back to the ZFS on Linux homepage and get started with installing ZFS on our system. ZFS on Linux is basically a kernel module plus user-land tools that you download, compile, and install. What is great about this is that you do not have to patch or recompile your kernel; ZFS is just a loadable module. It is actually very easy to get going, since there are packages for Debian, Fedora, Ubuntu, and RHEL, and for each OS it is just a matter of running a couple of commands. I am going to use CentOS on a virtual machine today, but the ZFS commands used throughout the demo should work across all Linux distributions.
So, let's get started. I am running CentOS 6.5 here. I have also run yum update to install all the latest updates; specifically, I wanted to make sure my kernel was up to date, since the ZFS package is going to build kernel modules against it. Okay, let's run those three commands from the ZFS on Linux website. The first one installs the EPEL repository for supporting packages. The second command installs a yum ZFS repository for the installation packages. Finally, the third command installs ZFS and supporting packages. As you can see, the install process builds kernel modules; this works in a similar manner to how you would install third-party network or video drivers. I should also mention that this install process took about 10 minutes to complete, so you are watching a heavily sped-up version. This last step is not required, but I thought it might be a good idea to reboot the machine, just to make sure everything is consistent on boot.
# cat /etc/redhat-release
CentOS release 6.5 (Final)
# yum update
No Packages marked for Update
# yum localinstall --nogpgcheck http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
# yum localinstall --nogpgcheck http://archive.zfsonlinux.org/epel/zfs-release.el6.noarch.rpm
# yum install kernel-devel zfs
Building initial module for 2.6.32-431.23.3.el6.x86_64
zavl:
 - Installing to /lib/modules/2.6.32-431.23.3.el6.x86_64/extra/
znvpair.ko:
 - Installing to /lib/modules/2.6.32-431.23.3.el6.x86_64/extra/
zunicode.ko:
 - Installing to /lib/modules/2.6.32-431.23.3.el6.x86_64/extra/
zcommon.ko:
 - Installing to /lib/modules/2.6.32-431.23.3.el6.x86_64/extra/
zfs.ko:
 - Installing to /lib/modules/2.6.32-431.23.3.el6.x86_64/extra/
zpios.ko:
 - Installing to /lib/modules/2.6.32-431.23.3.el6.x86_64/extra/
# reboot
Okay, so the machine has rebooted. Let's just verify the version again. Next, let's verify the ZFS kernel modules have been installed, by running lsmod and grepping for zfs. Looks good, since we have a bunch of modules loaded. Now that we have installed ZFS and verified that it is working, let's start to play around with it. For the demo today, I have added 10 disks to a VirtualBox VM; let's take a look at them by running ls -l /dev/sd*. The disks are /dev/sdb through sdk, and they are 100 MB each. Let's just run fdisk -l /dev/sdk to verify.
# cat /etc/redhat-release
CentOS release 6.5 (Final)
# lsmod |grep zfs
zfs                  1195533  0
zcommon                46278  1 zfs
znvpair                80974  2 zfs,zcommon
zavl                    6925  1 zfs
zunicode              323159  1 zfs
spl                   266655  5 zfs,zcommon,znvpair,zavl,zunicode
# ls -l /dev/sd*
brw-rw---- 1 root disk 8,   0 Aug 23 18:33 /dev/sda
brw-rw---- 1 root disk 8,   1 Aug 23 18:33 /dev/sda1
brw-rw---- 1 root disk 8,   2 Aug 23 18:33 /dev/sda2
brw-rw---- 1 root disk 8,  16 Aug 23 18:33 /dev/sdb
brw-rw---- 1 root disk 8,  32 Aug 23 18:33 /dev/sdc
brw-rw---- 1 root disk 8,  48 Aug 23 18:33 /dev/sdd
brw-rw---- 1 root disk 8,  64 Aug 23 18:33 /dev/sde
brw-rw---- 1 root disk 8,  80 Aug 23 18:33 /dev/sdf
brw-rw---- 1 root disk 8,  96 Aug 23 18:33 /dev/sdg
brw-rw---- 1 root disk 8, 112 Aug 23 18:33 /dev/sdh
brw-rw---- 1 root disk 8, 128 Aug 23 18:33 /dev/sdi
brw-rw---- 1 root disk 8, 144 Aug 23 18:33 /dev/sdj
brw-rw---- 1 root disk 8, 160 Aug 23 18:33 /dev/sdk
# fdisk -l /dev/sdk

Disk /dev/sdk: 104 MB, 104857600 bytes
255 heads, 63 sectors/track, 12 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
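Going back to the lsmod check for a second: if that had come back empty, the modules can usually just be loaded by hand. A minimal sketch, assuming the packages built and installed cleanly:

# modprobe zfs
# lsmod |grep zfs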
I am just going to go ahead and create a ZFS pool, and then we can review what it means. Let's type zpool create -f e35pool raidz2, followed by sdb through sdi. So, what does this command mean? Well, zpool is one of the two ZFS commands you are likely to use all the time, the other one being zfs, but I will talk more about that in part two of this episode. Here we are using the create option, which means we want to create a new pool. Then comes the -f option, which forces creation even though these are brand new disks without partition labels; ZFS will just create the labels for us. Next, we need a unique pool name for each array we want to create; we are using e35pool, and this name will make more sense in a minute. Next we specify the raid level. ZFS has its own terminology for raid levels, but raidz2 is equivalent to raid6, meaning we can have two failed disks in the array and still be okay. Finally, we specify the eight disks to use in this array.
You can look at active ZFS pools by running zpool status. Here is the e35pool we just created: it is online, and down here you can see the raid level and the disks assigned to the software raid. Pretty easy so far, right?
# zpool create -f e35pool raidz2 sdb sdc sdd sde sdf sdg sdh sdi
# zpool status
  pool: e35pool
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    e35pool     ONLINE       0     0     0
      raidz2-0  ONLINE       0     0     0
        sdb     ONLINE       0     0     0
        sdc     ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0
        sdf     ONLINE       0     0     0
        sdg     ONLINE       0     0     0
        sdh     ONLINE       0     0     0
        sdi     ONLINE       0     0     0

errors: No known data errors
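Before moving on, it is worth noting that raidz2 is only one of several layouts zpool create understands: raidz is single parity (roughly raid5), raidz3 is triple parity, and mirror is a plain mirror. I did not run these in the demo, so treat the following as a rough sketch reusing the same throw-away disk names:

# zpool create -f e35pool raidz sdb sdc sdd sde sdf
# zpool create -f e35pool raidz3 sdb sdc sdd sde sdf sdg
# zpool create -f e35pool mirror sdb sdc

Each of these would need a zpool destroy in between, since a pool name can only be in use once at a time.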
While doing research for this episode, I came across several really cool references, and I just wanted to take a minute and share them with you. The first one explains ZFS raid levels really well and covers some of the terminology. It talks about how raidz2 compares to raid6, but you can also create all sorts of raid variations with ZFS. The second link is also extremely useful, in that it offers advice from someone who has clearly used ZFS extensively, with many design suggestions and things to watch out for. If you are going to be playing around with ZFS, and especially looking at going into production, make sure you read this page! There is also a really useful raid capacity calculator that I ran across, and it supports ZFS. You can input your raid type, disk size, and number of disks, and it will help you figure out the resulting raid capacity. Finally, if you are interested in ZFS raid benchmarks, there is a fantastic link that goes into great detail about the topic. Like I was saying earlier in this episode, there is an astonishing amount of information about ZFS out there. So, if you find yourself in a jam, there is likely help available.
Okay, so let's get back to the demo. Our pool is using ZFS software raidz2, which is equivalent to raid6, meaning we can have two disk failures and still keep our array online. Before we go any further, let's run df -h and look at the output. You will notice that we have the e35pool ZFS filesystem mounted at /e35pool. So, the zpool create command we used earlier in the episode not only created a ZFS software raid, it also formatted and mounted it. Pretty neat.
# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root  8.4G  1.2G  6.9G  15% /
tmpfs                         230M     0  230M   0% /dev/shm
/dev/sda1                     485M   54M  406M  12% /boot
e35pool                       480M  128K  480M   1% /e35pool
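As a quick sanity check on that 480M figure, and as a rough rule of thumb rather than an exact formula, usable raidz capacity is about (number of disks minus parity disks) times the disk size; ZFS metadata and reserved space eat into the remainder, and that overhead is proportionally large on disks this small:

# echo $(( (8 - 2) * 100 ))
600

So roughly 600 MB of raw data capacity for our eight 100 MB disks under raidz2, which is in the right ballpark for the 480M that df reports.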
Just to show you how easy ZFS is, let's destroy this pool using zpool destroy, followed by the pool name, in this instance e35pool. Then let's run zpool status to verify our pool is gone.
# zpool destroy e35pool
# zpool status
no pools available
Now, let's create a raid10 by running zpool create -f e35pool mirror sdb sdc mirror sdd sde, and hitting enter. Let's check it out by running zpool status. So, we have our e35pool again, but this time we have two mirrors, each with two disks. You can also run the zpool list command to get details about the pool. Let's run df -h again, just to verify it is mounted. Okay, so that is creating and destroying pools in a nutshell.
# zpool create -f e35pool mirror sdb sdc mirror sdd sde
# zpool status
  pool: e35pool
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    e35pool     ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sdb     ONLINE       0     0     0
        sdc     ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0

errors: No known data errors
# zpool list
NAME      SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
e35pool   171M   238K   171M     0%  1.00x  ONLINE  -
# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root  8.4G  1.2G  6.9G  15% /
tmpfs                         230M     0  230M   0% /dev/shm
/dev/sda1                     485M   54M  406M  12% /boot
e35pool                       139M  128K  139M   1% /e35pool
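One thing I did not show in the demo: because this raid10 is just a stripe of mirrors, you can grow it later by adding another mirror vdev with zpool add, and new writes are then spread across all the mirrors. A quick sketch, assuming two spare disks:

# zpool add e35pool mirror sdf sdg
# zpool status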
There are actually many options for the zpool command. Let's just run the zpool command without any arguments, so that we get the help output. I am just going to scroll up to the top and work our way down. You will already recognize some of these, since we have already reviewed create, destroy, list, and status. But as you can see, there are many more, so let's look at a couple of them. Say, for example, that you wanted to get some statistics about the pool. We can use the iostat option, like this: zpool iostat e35pool. This gives us detailed information about the capacity, operations, and bandwidth. You can also get detailed data per device by running the same command with the -v option. Now the information is broken down by pool, mirror, and device.
# zpool
missing command
usage: zpool command args ...
where 'command' is one of the following:

    create [-fnd] [-o property=value] ... [-O file-system-property=value] ... [-m mountpoint] [-R root] <pool> <vdev> ...
    destroy [-f] <pool>
    add [-fn] [-o property=value] <pool> <vdev> ...
    remove <pool> <device> ...
    labelclear [-f] <vdev>
    list [-Hv] [-o property[,...]] [-T d|u] [pool] ... [interval [count]]
    iostat [-v] [-T d|u] [pool] ... [interval [count]]
    status [-vxD] [-T d|u] [pool] ... [interval [count]]
    online <pool> <device> ...
    offline [-t] <pool> <device> ...
    clear [-nF] <pool> [device]
    reopen <pool>
    attach [-f] [-o property=value] <pool> <device> <new-device>
    detach <pool> <device>
    replace [-f] <pool> <device> [new-device]
    split [-n] [-R altroot] [-o mntopts] [-o property=value] <pool> <newpool> [<device> ...]
    scrub [-s] <pool> ...
    import [-d dir] [-D]
    import [-d dir | -c cachefile] [-F [-n]] <pool | id>
    import [-o mntopts] [-o property=value] ... [-d dir | -c cachefile] [-D] [-f] [-m] [-N] [-R root] [-F [-n]] -a
    import [-o mntopts] [-o property=value] ... [-d dir | -c cachefile] [-D] [-f] [-m] [-N] [-R root] [-F [-n]] <pool | id> [newpool]
    export [-f] <pool> ...
    upgrade
    upgrade -v
    upgrade [-V version] <-a | pool ...>
    reguid <pool>
    history [-il] [<pool>] ...
    events [-vHfc]
    get [-p] <"all" | property[,...]> <pool> ...
    set <property=value> <pool>
# zpool iostat e35pool
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
e35pool      238K   171M      0      0     13    898
# zpool iostat e35pool -v
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
e35pool      238K   171M      0      0     12    886
  mirror    84.5K  85.4M      0      0      5    300
    sdb         -      -      0      0    471  1.71K
    sdc         -      -      0      0    467  1.71K
  mirror     154K  85.3M      0      0      6    586
    sdd         -      -      0      0    470  1.99K
    sde         -      -      0      0    470  1.99K
----------  -----  -----  -----  -----  -----  -----
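The usage output above also shows that iostat takes an optional interval and count, which is handy for watching activity live, and that there is a history subcommand that records every zpool command run against the pool. I did not capture their output here, but the invocations look like this:

# zpool iostat -v e35pool 5
# zpool history e35pool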
Let's briefly look at detecting data corruption and dealing with failures. Oftentimes, when things go bad, there will be a dead disk and it will be obvious what needs to be done: you replace it. A less common issue is data corruption, so let's try to replicate that. Let's type zpool status again to see an overview of our pool. As you can see in our configuration, and as discussed earlier, we have redundancy built in via our mirrors. So, let's corrupt the data on sdb by using the dd command to write zeros across the entire disk. Let's type zpool status again, and you will notice that everything still looks okay, as it might take a little while for ZFS to notice that the data on sdb is bad. However, we can force a data scrub; scrub is ZFS terminology for a verify operation. Let's type zpool scrub e35pool, then run zpool status again. Now you can see that ZFS picked up the issue with sdb as it crawled the pool looking for data integrity issues. There is even an action line, which provides us with the command to fix our mirror. Down here, we also see our sdb device marked as having corrupted data.
# zpool status
  pool: e35pool
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    e35pool     ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sdb     ONLINE       0     0     0
        sdc     ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0

errors: No known data errors
# dd if=/dev/zero of=/dev/sdb
dd: writing to `/dev/sdb': No space left on device
204801+0 records in
204800+0 records out
104857600 bytes (105 MB) copied, 4.82468 s, 21.7 MB/s
# zpool scrub e35pool
# zpool status
  pool: e35pool
 state: ONLINE
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h0m with 0 errors on Mon Aug 25 01:01:50 2014
config:

    NAME        STATE     READ WRITE CKSUM
    e35pool     ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sdb     UNAVAIL      0     0     0  corrupted data
        sdc     ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0

errors: No known data errors
Let's replace this bad disk with a good one via the zpool replace command. Let's type zpool replace, the pool name, so e35pool, the bad device, sdb, and then the good device, sdf. I am just going to append the zpool status command here, so that we can see the device rebuilding. Since the virtual disks I am using in this demo are so small, the rebuild happens pretty quickly, and I want to make sure we see it.
# zpool replace e35pool sdb sdf; zpool status
  pool: e35pool
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Aug 25 01:03:53 2014
        114K scanned out of 190K at 114K/s, 0h0m to go
        43K resilvered, 59.84% done
config:

    NAME             STATE     READ WRITE CKSUM
    e35pool          ONLINE       0     0     0
      mirror-0       ONLINE       0     0     0
        replacing-0  UNAVAIL      0     0     0
          sdb        UNAVAIL      0     0     0  corrupted data
          sdf        ONLINE       0     0     0  (resilvering)
        sdc          ONLINE       0     0     0
      mirror-1       ONLINE       0     0     0
        sdd          ONLINE       0     0     0
        sde          ONLINE       0     0     0

errors: No known data errors
# zpool status
  pool: e35pool
 state: ONLINE
  scan: resilvered 43K in 0h0m with 0 errors on Mon Aug 25 01:03:53 2014
config:

    NAME        STATE     READ WRITE CKSUM
    e35pool     ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sdf     ONLINE       0     0     0
        sdc     ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0

errors: No known data errors
Okay, so the status message has been updated. We can now see the device is being resilvered. Again, this is ZFS-specific terminology; resilvering is equivalent to rebuilding. Down here, you can see the rebuild happening, going from our bad device sdb to the good device sdf. Let's run zpool status again, and you can see everything is happy. Dealing with failures in ZFS is actually pretty painless.
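For completeness, a genuinely dead disk is handled much the same way. A rough sketch, assuming the failed device is sdb and the replacement disk goes into the same slot:

# zpool offline e35pool sdb
  ... swap the physical disk ...
# zpool replace e35pool sdb
# zpool status

If the replacement lands in a different slot, you would pass both the old and new device names to zpool replace, just like we did above.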
I just wanted to take a moment to talk about a couple more fantastic links for learning about ZFS. This 21-part guide is the most complete set of documentation for ZFS on Linux that I have found. There is even a section on scrubbing and resilvering, and it goes into great detail about how ZFS uses its knowledge of the filesystem to smartly repair data corruption.
You should also check out the official Oracle Solaris ZFS Administration Guide. Even though this is targeted at Solaris, almost all of it is compatible with the ZFS on Linux port. If you are using ZFS on Linux you will likely refer to these pages often.
This marks the end of ZFS on Linux - Part #1. Please continue on to ZFS on Linux - Part #2 (episode #37), which will be linked in the episode notes below when it is complete.