Running OpenZFS raidz on Linux instead of hardware RAID5
Introduction
For a long time I have had the idea of installing and using the ZFS file system on my Linux home server. ZFS has been available for many years, first on Sun Microsystems’ OpenSolaris in 2005 and later on FreeBSD in 2008, but for licensing reasons (its CDDL license is considered incompatible with the GPL) it was never included in the official Linux kernel, and because of this the adoption of ZFS on Linux was delayed for years.
Shortly after Oracle purchased Sun Microsystems in 2010, ZFS development became closed source. Many of the ZFS developers were not happy about this; some of them left the company and went on to start the OpenZFS(1) project (more history at the “ZFS Wikipedia page”(2)).
As the OpenZFS project says on its website: “OpenZFS is an outstanding storage platform that encompasses the functionality of traditional filesystems, volume managers, and more, with consistent reliability, functionality and performance across all distributions”. In other words, it is easy to administer and it improves the security, reliability and performance of your system.
Some of the main features of OpenZFS are:
- Pooled storage
- Copy-on-write
- Snapshots
- Data integrity verification and automatic repair
- RAID-Z
- Maximum 16 Exabyte file size
- Maximum 256 Quadrillion Zettabytes storage
Best of all, some Linux distributions have started supporting OpenZFS in their core distribution, and they patch and maintain the kernel modules and tools needed to use OpenZFS, so we don’t have to do that job ourselves.
Ubuntu introduced OpenZFS support in 2019, so I am in luck: my home server is running Ubuntu Server 22.04 LTS. Now it is time to get my hands dirty. Is OpenZFS as good as its reputation says? In this article I install and test OpenZFS on Linux and see how it replaces my hardware RAID5 installation.
Preparing the disks
I have a RAID5 device on my server using 5 SSD disks and 1 dedicated hot spare disk. All of them are connected to a Megaraid SAS 9341-8i (Check my article “Megaraid SAS 9341-8i on Linux - Cooling and initialization issues”(3) for more information). This is the RAID5 I am going to replace with OpenZFS although I will continue using the same controller.
This is my RAID5 configuration:
root@server:~# /opt/MegaRAID/storcli/storcli64 /c0 /dall show
......
TOPOLOGY :
========
---------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type State BT Size PDC PI SED DS3 FSpace TR
---------------------------------------------------------------------------
0 - - - - RAID5 Optl N 6.983 TB dflt N N dflt N N
0 0 - - - RAID5 Optl N 6.983 TB dflt N N dflt N N
0 0 0 62:1 2 DRIVE Onln N 1.745 TB dflt N N dflt - N
0 0 1 62:2 1 DRIVE Onln N 1.745 TB dflt N N dflt - N
0 0 2 62:3 0 DRIVE Onln N 1.745 TB dflt N N dflt - N
0 0 3 62:4 5 DRIVE Onln N 1.745 TB dflt N N dflt - N
0 0 4 62:5 6 DRIVE Onln N 1.745 TB dflt N N dflt - N
0 - - 62:6 4 DRIVE DHS - 1.745 TB - - - - - N
---------------------------------------------------------------------------
This configuration has been working stably and without problems for several months, so let’s hope it continues like that after all the changes I am going to make to the system.
These are the disks I have available on my home server:
root@server:~# /opt/MegaRAID/storcli/storcli64 /c0 /eall /sall show
......
Drive Information :
=================
--------------------------------------------------------------------------------
EID:Slt DID State DG Size Intf Med SED PI SeSz Model Sp Type
--------------------------------------------------------------------------------
62:0 7 JBOD - 5.458 TB SATA HDD N N 512B HGST HUS726T6TALE6L4 U -
62:1 2 Onln 0 1.745 TB SAS SSD N N 512B VO1920JEUQQ U -
62:2 1 Onln 0 1.745 TB SAS SSD N N 512B VO1920JEUQQ U -
62:3 0 Onln 0 1.745 TB SAS SSD N N 512B VO1920JEUQQ U -
62:4 5 Onln 0 1.745 TB SAS SSD N N 512B VO1920JEUQQ U -
62:5 6 Onln 0 1.745 TB SAS SSD N N 512B VO1920JEUQQ U -
62:6 4 DHS 0 1.745 TB SAS SSD N N 512B VO1920JEUQQ U -
62:7 3 JBOD - 1.746 TB SAS SSD N N 512B VO1920JEUQQ U -
--------------------------------------------------------------------------------
62:0 is a backup disk and 62:7 is my OS disk. The rest, 62:1-6, are in use by the RAID5 device and are the ones I am going to use with OpenZFS.
The first thing I have to do is destroy the RAID5 and define the disks as JBOD disks so OpenZFS can have full access to them without any interference from the controller. I hope this will work, because the documentation says that you should not use RAID controllers with OpenZFS at all. My controller does not have any cache memory, the OS can see the physical disks without problems, and I can read all the S.M.A.R.T. information from the disks when the controller presents them as JBOD disks, so I think it will work; we’ll see. I have tried to find detailed information about JBOD vs HBA mode without any luck. It seems to me that it depends a lot on the controller.
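As a sanity check before handing the disks over to OpenZFS, the S.M.A.R.T. data can be read with smartctl from the smartmontools package. A minimal sketch, assuming /dev/sdb is one of the JBOD SSDs (adjust the device name to your own system):
apt install smartmontools
# Full health report; behind this controller the JBOD disks show up as regular /dev/sdX devices
smartctl -a /dev/sdb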
After deleting the RAID5 and defining the disks as JBOD disks the system looks like this:
root@server:~# /opt/MegaRAID/storcli/storcli64 /c0 /v0 show
......
Virtual Drives :
==============
---------------------------------------------------------------
DG/VD TYPE State Access Consist Cache Cac sCC Size Name
---------------------------------------------------------------
0/0 RAID5 Optl RW Yes NRWTD - ON 6.983 TB RAID-5
---------------------------------------------------------------
root@server:~# /opt/MegaRAID/storcli/storcli64 /c0 /v0 del
Status = Success
Description = Delete VD succeeded
root@server:~# /opt/MegaRAID/storcli/storcli64 /c0 /e62 /s1-6 set jbod
root@server:~# /opt/MegaRAID/storcli/storcli64 /c0 /e62 /sall show
......
Drive Information :
=================
--------------------------------------------------------------------------------
EID:Slt DID State DG Size Intf Med SED PI SeSz Model Sp Type
--------------------------------------------------------------------------------
62:0 7 JBOD - 5.458 TB SATA HDD N N 512B HGST HUS726T6TALE6L4 U -
62:1 2 JBOD - 1.746 TB SAS SSD N N 512B VO1920JEUQQ U -
62:2 1 JBOD - 1.746 TB SAS SSD N N 512B VO1920JEUQQ U -
62:3 0 JBOD - 1.746 TB SAS SSD N N 512B VO1920JEUQQ U -
62:4 5 JBOD - 1.746 TB SAS SSD N N 512B VO1920JEUQQ U -
62:5 6 JBOD - 1.746 TB SAS SSD N N 512B VO1920JEUQQ U -
62:6 4 JBOD - 1.746 TB SAS SSD N N 512B VO1920JEUQQ U -
62:7 3 JBOD - 1.746 TB SAS SSD N N 512B VO1920JEUQQ U -
--------------------------------------------------------------------------------
The next thing I have to do is to identify the disks I am going to use with OpenZFS. For that I need the device ID the OS sees and uses to identify these disks (62:1-6):
--------------------------------------------------------------------------------
EID:Slt DID State DG Size Intf Med SED PI SeSz Model Sp Type
--------------------------------------------------------------------------------
62:1 2 JBOD - 1.746 TB SAS SSD N N 512B VO1920JEUQQ U -
62:2 1 JBOD - 1.746 TB SAS SSD N N 512B VO1920JEUQQ U -
62:3 0 JBOD - 1.746 TB SAS SSD N N 512B VO1920JEUQQ U -
62:4 5 JBOD - 1.746 TB SAS SSD N N 512B VO1920JEUQQ U -
62:5 6 JBOD - 1.746 TB SAS SSD N N 512B VO1920JEUQQ U -
62:6 4 JBOD - 1.746 TB SAS SSD N N 512B VO1920JEUQQ U -
--------------------------------------------------------------------------------
The DID column (Device ID) gives us the IDs the OS sees, so IDs 0, 1, 2, 4, 5 and 6 are the ones I have to identify and use with OpenZFS. I can use the information in the directories /dev/disk/by-path/ and /dev/disk/by-id/ to identify the disks, and then use the correct disk identifiers from /dev/disk/by-id/ when creating my first OpenZFS Pool:
root@server:~# ls -l /dev/disk/by-path/
total 0
lrwxrwxrwx 1 root root 13 Dec 10 20:28 pci-0000:01:00.0-nvme-1 -> ../../nvme0n1
lrwxrwxrwx 1 root root 9 Dec 11 16:31 pci-0000:06:00.0-scsi-0:0:0:0 -> ../../sda
lrwxrwxrwx 1 root root 9 Dec 11 16:31 pci-0000:06:00.0-scsi-0:0:1:0 -> ../../sdb
lrwxrwxrwx 1 root root 9 Dec 11 16:31 pci-0000:06:00.0-scsi-0:0:2:0 -> ../../sdc
lrwxrwxrwx 1 root root 9 Dec 11 16:31 pci-0000:06:00.0-scsi-0:0:3:0 -> ../../sdd
lrwxrwxrwx 1 root root 9 Dec 11 16:31 pci-0000:06:00.0-scsi-0:0:4:0 -> ../../sde
lrwxrwxrwx 1 root root 9 Dec 11 16:31 pci-0000:06:00.0-scsi-0:0:5:0 -> ../../sdf
lrwxrwxrwx 1 root root 9 Dec 11 16:31 pci-0000:06:00.0-scsi-0:0:6:0 -> ../../sdg
lrwxrwxrwx 1 root root 9 Dec 11 16:31 pci-0000:06:00.0-scsi-0:0:7:0 -> ../../sdh
root@server:~# ls -l /dev/disk/by-id/
lrwxrwxrwx 1 root root 13 Dec 10 20:28 nvme-Samsung_SSD_980_PRO_1TB_S5GXXXXXXXXXXXX -> ../../nvme0n1
lrwxrwxrwx 1 root root 15 Dec 10 20:29 nvme-Samsung_SSD_980_PRO_1TB_S5GXXXXXXXXXXXX-part1 -> ../../nvme0n1p1
lrwxrwxrwx 1 root root 15 Dec 10 20:29 nvme-Samsung_SSD_980_PRO_1TB_S5GXXXXXXXXXXXX-part2 -> ../../nvme0n1p2
lrwxrwxrwx 1 root root 9 Dec 11 16:31 scsi-SATA_HGST_HUS726T6TAL_V9HXXXX7 -> ../../sdh
lrwxrwxrwx 1 root root 9 Dec 11 16:31 scsi-SHP_VO1920JEUQQ_0SYXXXX5 -> ../../sdf
lrwxrwxrwx 1 root root 9 Dec 11 16:31 scsi-SHP_VO1920JEUQQ_0SYXXXX2 -> ../../sdc
lrwxrwxrwx 1 root root 9 Dec 11 16:31 scsi-SHP_VO1920JEUQQ_0SYXXXX0 -> ../../sda
lrwxrwxrwx 1 root root 9 Dec 11 16:31 scsi-SHP_VO1920JEUQQ_0SYXXXX4 -> ../../sde
lrwxrwxrwx 1 root root 9 Dec 11 16:31 scsi-SHP_VO1920JEUQQ_0SYXXXX3 -> ../../sdd
lrwxrwxrwx 1 root root 9 Dec 11 16:31 scsi-SHP_VO1920JEUQQ_0SYXXXX6 -> ../../sdg
lrwxrwxrwx 1 root root 9 Dec 11 16:31 scsi-SHP_VO1920JEUQQ_0SYXXXX1 -> ../../sdb
These are the disks I will use with OpenZFS: scsi-SHP_VO1920JEUQQ_0SYXXXX0, scsi-SHP_VO1920JEUQQ_0SYXXXX1, scsi-SHP_VO1920JEUQQ_0SYXXXX2, scsi-SHP_VO1920JEUQQ_0SYXXXX4, scsi-SHP_VO1920JEUQQ_0SYXXXX5 and scsi-SHP_VO1920JEUQQ_0SYXXXX6.
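If it is not obvious which /dev/sdX corresponds to which controller slot, a quick cross-check is to list the SCSI address and serial number per block device, since the serial numbers also appear at the end of the /dev/disk/by-id/ names. A sketch (the column selection is just a suggestion):
# HCTL is the SCSI host:channel:target:lun address; SERIAL helps match the by-id symlinks
lsblk -d -o NAME,HCTL,MODEL,SERIAL,SIZE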
Installing and configuring ZFS
As I said at the beginning, Ubuntu has full OpenZFS support in the core system, and the only thing I have to do to install and activate OpenZFS is to run this command as root:
root@server:~# apt install zfsutils-linux
This installs all the necessary packages and activates the OpenZFS modules in the kernel. I can check the version and whether the modules have been loaded with:
root@server:~# zfs version
zfs-2.1.4-0ubuntu0.1
zfs-kmod-2.1.4-0ubuntu0.1
root@server:~# lsmod |grep zfs
zfs 3825664 8
zunicode 348160 1 zfs
zzstd 491520 1 zfs
zlua 163840 1 zfs
zavl 20480 1 zfs
icp 323584 1 zfs
zcommon 106496 2 zfs,icp
znvpair 98304 2 zfs,zcommon
spl 118784 6 zfs,icp,zzstd,znvpair,zcommon,zavl
After this, everything should be ready to start using OpenZFS.
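Ubuntu’s zfsutils-linux package also installs a set of systemd units that import pools and mount datasets at boot. If something does not come up automatically later on, these are, as far as I know, the units worth checking on Ubuntu 22.04:
# zfs-import-cache imports pools listed in /etc/zfs/zpool.cache, zfs-mount mounts the datasets,
# and zfs-zed is the ZFS Event Daemon that reacts to device events
systemctl status zfs-import-cache.service zfs-mount.service zfs-zed.service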
The first thing I need to do is to define an OpenZFS Pool. This can be done with the zpool create command. I am going to call this pool data-zfs-pool01 and use raidz with 5 data disks and 1 spare disk, using the disk identifiers from /dev/disk/by-id/:
root@server:~# zpool create data-zfs-pool01 raidz scsi-SHP_VO1920JEUQQ_0SYXXXX0 scsi-SHP_VO1920JEUQQ_0SYXXXX1 scsi-SHP_VO1920JEUQQ_0SYXXXX2 scsi-SHP_VO1920JEUQQ_0SYXXXX4 scsi-SHP_VO1920JEUQQ_0SYXXXX5 spare scsi-SHP_VO1920JEUQQ_0SYXXXX6
root@server:~# zpool status
pool: data-zfs-pool01
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
data-zfs-pool01 ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
scsi-SHP_VO1920JEUQQ_0SYXXXX0 ONLINE 0 0 0
scsi-SHP_VO1920JEUQQ_0SYXXXX1 ONLINE 0 0 0
scsi-SHP_VO1920JEUQQ_0SYXXXX2 ONLINE 0 0 0
scsi-SHP_VO1920JEUQQ_0SYXXXX4 ONLINE 0 0 0
scsi-SHP_VO1920JEUQQ_0SYXXXX5 ONLINE 0 0 0
spares
scsi-SHP_VO1920JEUQQ_0SYXXXX6 AVAIL
errors: No known data errors
As simple as this, I have my first OpenZFS Pool created, available and ready to use.
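A quick way to see the capacity of the new pool is zpool list; note that for a raidz1 vdev the SIZE and FREE columns show raw capacity, before roughly one disk’s worth of parity is deducted. Just a sketch, since I did not include that output here:
# Raw pool capacity and allocation; usable space is what zfs list reports per dataset
zpool list data-zfs-pool01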
With OpenZFS I also have the option of adding dedicated devices for cache and logs. I am not sure how this works in detail yet; I will have to investigate the impact it has on how resources are used. But that is content for another article, so for now I just activate this functionality using an NVMe device I have in this server.
root@server:~# zpool add data-zfs-pool01 cache nvme-Samsung_SSD_980_PRO_1TB_S5GXXXXXXXXXXXX-part1
root@server:~# zpool add data-zfs-pool01 log nvme-Samsung_SSD_980_PRO_1TB_S5GXXXXXXXXXXXX-part2
root@server:~# zpool status
pool: data-zfs-pool01
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
data-zfs-pool01 ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
scsi-SHP_VO1920JEUQQ_0SYXXXX0 ONLINE 0 0 0
scsi-SHP_VO1920JEUQQ_0SYXXXX1 ONLINE 0 0 0
scsi-SHP_VO1920JEUQQ_0SYXXXX2 ONLINE 0 0 0
scsi-SHP_VO1920JEUQQ_0SYXXXX4 ONLINE 0 0 0
scsi-SHP_VO1920JEUQQ_0SYXXXX5 ONLINE 0 0 0
logs
nvme-Samsung_SSD_980_PRO_1TB_S5GXXXXXXXXXXXX-part2 ONLINE 0 0 0
cache
nvme-Samsung_SSD_980_PRO_1TB_S5GXXXXXXXXXXXX-part1 ONLINE 0 0 0
spares
scsi-SHP_VO1920JEUQQ_0SYXXXX6 AVAIL
errors: No known data errors
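Until I dig deeper into how the cache (L2ARC) and log (SLOG) devices behave, a simple way to keep an eye on them is to watch the per-device I/O statistics. A sketch, using the pool name from above:
# -v breaks the statistics down per vdev, including the cache and log devices; refresh every 5 seconds
zpool iostat -v data-zfs-pool01 5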
With an OpenZFS Pool in place, the next step is to create a ZFS file system with the zfs create command. I call it libvirt:
root@server:~# zfs create data-zfs-pool01/libvirt
root@server:~# zfs list
NAME USED AVAIL REFER MOUNTPOINT
data-zfs-pool01 132G 6.71T 447K /data-zfs-pool01
data-zfs-pool01/libvirt 132G 6.71T 132G /data-zfs-pool01/libvirt
The file system and the pool are mounted automatically after they are created:
root@server:~# df -h |grep zfs
data-zfs-pool01 6.8T 512K 6.8T 1% /data-zfs-pool01
data-zfs-pool01/libvirt 6.9T 133G 6.8T 2% /data-zfs-pool01/libvirt
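The default mountpoint follows the pool/dataset name, but it can be changed with a property instead of editing /etc/fstab, since ZFS mounts its datasets itself. A hypothetical example (the path is only an illustration of where a libvirt storage directory could live, not something I have configured here):
# Datasets are mounted by ZFS according to the mountpoint property, no fstab entry needed
zfs set mountpoint=/var/lib/libvirt/images data-zfs-pool01/libvirt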
And finally I activate ZFS compression on the new file system with the zfs set command:
root@server:~# zfs set compression=on data-zfs-pool01/libvirt
root@server:~# zfs get compression
NAME PROPERTY VALUE SOURCE
data-zfs-pool01 compression off default
data-zfs-pool01/libvirt compression on local
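With compression=on, recent OpenZFS versions default to the lz4 algorithm, which is usually cheap enough to leave enabled. How much it actually saves can be checked later with the compressratio property:
# Ratio between logical and physically stored size for data written after compression was enabled
zfs get compressratio data-zfs-pool01/libvirt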
This is actually all I need to do to start using OpenZFS on Linux.
Testing a disk failure
Before I start using my new OpenZFS infrastructure I want to test what would happen if a disk in my OpenZFS Pool stops working.
The first thing I need to do is to activate the autoreplace functionality in OpenZFS so the resilvering process with the spare disk starts automatically:
root@server:~# zpool set autoreplace=on data-zfs-pool01
root@server:~# zpool get autoreplace
NAME PROPERTY VALUE SOURCE
data-zfs-pool01 autoreplace on local
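As far as I understand, the automatic reaction to a failed device (activating the hot spare and honoring autoreplace) is handled by the ZFS Event Daemon, so it is worth making sure it is running before the test:
# zed listens for ZFS kernel events and triggers actions such as bringing in hot spares
systemctl status zfs-zed.service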
What better test than pulling one disk out of its enclosure? Right after doing this, zpool status shows the pool going into a DEGRADED state, the disk I pulled out going into an UNAVAIL state, and the hot spare disk taking over for the failed disk. It also starts what is called resilvering, which is the process of rebuilding the data that was stored on the failed disk from the parity information in the raidz1 vdev.
The resilvering process takes around one minute and rebuilds around 33G of data, a rebuild speed of roughly 500 MB/s. The last time I had to rebuild a RAID5 on this home server it took several hours.
root@server:~# zpool status
pool: data-zfs-pool01
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sat Dec 10 20:35:17 2022
165G scanned at 4.86G/s, 84.7G issued at 2.49G/s, 165G total
17.0G resilvered, 51.24% done, 00:00:32 to go
config:
NAME STATE READ WRITE CKSUM
data-zfs-pool01 DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
scsi-SHP_VO1920JEUQQ_0SYXXXX0 ONLINE 0 0 0
spare-1 DEGRADED 0 0 0
scsi-SHP_VO1920JEUQQ_0SYXXXX1 UNAVAIL 3 36 0
scsi-SHP_VO1920JEUQQ_0SYXXXX6 ONLINE 0 0 0 (resilvering)
scsi-SHP_VO1920JEUQQ_0SYXXXX2 ONLINE 0 0 0
scsi-SHP_VO1920JEUQQ_0SYXXXX4 ONLINE 0 0 0
scsi-SHP_VO1920JEUQQ_0SYXXXX5 ONLINE 0 0 0
logs
nvme-Samsung_SSD_980_PRO_1TB_S5GXXXXXXXXXXXX-part2 ONLINE 0 0 0
cache
nvme-Samsung_SSD_980_PRO_1TB_S5GXXXXXXXXXXXX-part1 ONLINE 0 0 0
spares
scsi-SHP_VO1920JEUQQ_0SYXXXX6 INUSE currently in use
errors: No known data errors
As soon as the resilvering process is finished, I detach the failed disk from the pool with this command:
root@server:/data-zfs-pool01# zpool detach data-zfs-pool01 scsi-SHP_VO1920JEUQQ_0SYXXXX1
And the OpenZFS Pool gets the ONLINE state back.
root@server:/data-zfs-pool01# zpool status
pool: data-zfs-pool01
state: ONLINE
scan: resilvered 33.1G in 00:01:07 with 0 errors on Sat Dec 10 20:36:24 2022
config:
NAME STATE READ WRITE CKSUM
data-zfs-pool01 ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
scsi-SHP_VO1920JEUQQ_0SYXXXX0 ONLINE 0 0 0
scsi-SHP_VO1920JEUQQ_0SYXXXX6 ONLINE 0 0 0
scsi-SHP_VO1920JEUQQ_0SYXXXX2 ONLINE 0 0 0
scsi-SHP_VO1920JEUQQ_0SYXXXX4 ONLINE 0 0 0
scsi-SHP_VO1920JEUQQ_0SYXXXX5 ONLINE 0 0 0
logs
nvme-Samsung_SSD_980_PRO_1TB_S5GXXXXXXXXXXXX-part2 ONLINE 0 0 0
cache
nvme-Samsung_SSD_980_PRO_1TB_S5GXXXXXXXXXXXX-part1 ONLINE 0 0 0
errors: No known data errors
After this test, I reverse all the changes so that my OpenZFS Pool again has all its disks available and the hot spare disk back in the AVAIL state.
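A sketch of how that reversal could look with the device names used above (not the literal commands, just the general idea; since the re-inserted disk still carries old pool labels, zpool labelclear or the -f flag may be needed):
# Swap the former spare out of the raidz1 vdev and resilver the data back onto the re-inserted original disk
zpool replace data-zfs-pool01 scsi-SHP_VO1920JEUQQ_0SYXXXX6 scsi-SHP_VO1920JEUQQ_0SYXXXX1
# Once the resilver is finished, return the freed disk to hot spare duty
zpool add data-zfs-pool01 spare scsi-SHP_VO1920JEUQQ_0SYXXXX6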
Some performance numbers
Before I destroyed the hardware RAID5 to use the disks with OpenZFS, I ran a few tests, nothing fancy, just a simple dd command to write and read a 100G file.
root@server:/data-raid/test# /sbin/sysctl -w vm.drop_caches=3
vm.drop_caches = 3
root@server:/data-raid/test# dd if=/dev/zero of=./test.img bs=256k count=409600
409600+0 records in
409600+0 records out
107374182400 bytes (107 GB, 100 GiB) copied, 57.5399 s, 1.9 GB/s
root@server:/data-raid/test# /sbin/sysctl -w vm.drop_caches=3
vm.drop_caches = 3
root@server:/data-raid/test# rm test.img
root@server:/data-raid/test# dd if=/dev/zero of=./test.img bs=256k count=409600 oflag=direct
409600+0 records in
409600+0 records out
107374182400 bytes (107 GB, 100 GiB) copied, 480.16 s, 224 MB/s
root@server:/data-raid/test# /sbin/sysctl -w vm.drop_caches=3
vm.drop_caches = 3
root@server:/data-raid/test# dd if=./test.img of=/dev/null bs=256k count=409600
409600+0 records in
409600+0 records out
107374182400 bytes (107 GB, 100 GiB) copied, 53.2423 s, 2.0 GB/s
The same tests with OpenZFS are faster than with the old hardware RAID5:
root@server:/data-zfs-pool01# /sbin/sysctl -w vm.drop_caches=3
vm.drop_caches = 3
root@server:/data-zfs-pool01# dd if=/dev/zero of=./test.img bs=256k count=409600
409600+0 records in
409600+0 records out
107374182400 bytes (107 GB, 100 GiB) copied, 48.164 s, 2.2 GB/s
root@server:/data-zfs-pool01# rm test.img
root@server:/data-zfs-pool01# /sbin/sysctl -w vm.drop_caches=3
vm.drop_caches = 3
root@server:/data-zfs-pool01# dd if=/dev/zero of=./test.img bs=256k count=409600 oflag=direct
409600+0 records in
409600+0 records out
107374182400 bytes (107 GB, 100 GiB) copied, 48.0719 s, 2.2 GB/s
root@server:/data-zfs-pool01# /sbin/sysctl -w vm.drop_caches=3
vm.drop_caches = 3
root@server:/data-zfs-pool01# dd if=./test.img of=/dev/null bs=256k count=409600
409600+0 records in
409600+0 records out
107374182400 bytes (107 GB, 100 GiB) copied, 34.6761 s, 3.1 GB/s
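These dd numbers should be taken with a grain of salt: writing zeros sequentially with a single thread is a best-case workload, and as far as I know this OpenZFS version accepts oflag=direct but still serves the I/O through the ARC, which makes the direct-write comparison with the hardware RAID5 somewhat unfair. When I find the time, something like fio would give a more realistic picture; a sketch of a random-write test, with the job parameters being only an example:
# 4k random writes, 4 parallel jobs, against files on the pool
fio --name=randwrite --directory=/data-zfs-pool01 --rw=randwrite --bs=4k --size=4G --numjobs=4 --iodepth=16 --ioengine=libaio --group_reporting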
Well, this is all for today. Now I have to learn how to use all the cool features that OpenZFS has to offer.
Footnotes
1. OpenZFS project: https://openzfs.org/
2. ZFS history: https://en.wikipedia.org/wiki/ZFS#History
3. “Megaraid SAS 9341-8i on Linux - Cooling and initialization issues”: https://e-mc2.net/blog/megaraid-sas-9341-8i-not-working-with-linux/