I had been having issues with my ZFS VM for months, but I finally fixed them!

5 months ago, I wrote about the corruption issues I was having on my main Hetzner Cloud VM, which runs a lot of stuff, including a high-traffic Mastodon instance.

For a bit of context, this VM runs Ubuntu 18.04 and has a single disk. That disk is split between a tiny ext4 partition for / and another partition holding a ZFS pool. This zpool is used as the storage pool for LXD, and thus for all the LXC containers running on it.
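
If you’re curious, wiring ZFS and LXD together this way looks roughly like this (a simplified sketch with generic commands, not my exact shell history; adapt the pool and device names):

# create the pool on the second partition of the single virtual disk
zpool create zpool-lxd /dev/sda3

# point LXD at that existing zpool and use it for container root disks
lxc storage create default zfs source=zpool-lxd
lxc profile device add default root disk path=/ pool=default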

Here was the status of my zpool at the time:

root@kokoro ~# zpool status -v
  pool: zpool-lxd
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
	entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0h56m with 7 errors on Sun Mar 3 18:29:34 2019
config:

	NAME        STATE     READ WRITE CKSUM
	zpool-lxd   ONLINE       0     0     0
	  sda3      ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x1f40>:<0x6f840>
        <0x1f40>:<0x1858a>
        <0x1956>:<0x18e13>
        <0x1956>:<0x18bcc>
        <0xc63>:<0x3604c>
        <0x8bbf>:<0x952e>
        zpool-lxd/containers/postgresql:/rootfs/var/lib/postgresql/11/main/base/20415/20819.9

As you can see, my PostgreSQL server was affected, so it was quite critical. In the blog post I linked earlier, I explained how I fixed these issues by getting rid of the bad blocks.
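
In a nutshell, it boiled down to something like this (heavily simplified, see that post for the details and the usual warnings):

# list the affected files/objects
zpool status -v zpool-lxd

# restore or delete the corrupted files (here, a PostgreSQL data file),
# then scrub again and reset the error counters
zpool scrub zpool-lxd
zpool clear zpool-lxd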

The thing is, even after moving servers, the corruption errors came back, again and again.

I first asked for help on the LXD forum, not sure whether it was a ZFS or an LXD issue. Stéphane Graber’s answer led me to think that it was probably not related to LXD but to ZFS, and that it was more likely an issue with the underlying storage than a ZFS bug.

Then I thought, maybe running ZFS inside a VM isn’t a good idea?

So I asked /r/zfs instead, and the folks there provided some very interesting insights.

Apparently it’s somewhat common to attach real block devices to a VM and use ZFS on them. That would certainly help with my corruption issues, because ZFS can heal data when it’s mirrored, whereas a traditional RAID cannot.
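
For example, with two block devices attached to the VM, you could build a mirrored pool and let ZFS self-heal from the good copy (hypothetical device names):

# two-way mirror: a checksum error on one device is repaired
# automatically from the other copy
zpool create tank mirror /dev/sdb /dev/sdc

# a scrub reads everything and fixes whatever is repairable
zpool scrub tank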

That was actually the issue: since my Hetzner VM has a single virtual disk backed by a RAID array of NVMe drives, I thought the data couldn’t get corrupted. Apparently I was wrong, and it happens more often than one might think.

I reached out to Hetzner, but they never acknowledged the issue, so I didn’t get any help from them.

My conclusion was that data corruption was silently happening on their disks, but that you wouldn’t normally notice it with classic file systems such as ext4, because they don’t actively search for errors and report them the way ZFS does.
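
That’s also why it’s worth scrubbing regularly so errors surface early; if I remember correctly, Ubuntu’s zfsutils-linux package already ships a monthly scrub cron job, but here is what a hand-rolled one could look like (example path and schedule):

# /etc/cron.d/zfs-scrub (example): scrub the pool once a month,
# then check `zpool status -v` for anything it found
0 3 1 * * root /sbin/zpool scrub zpool-lxd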

Since I couldn’t bear this situation any longer, I fixed the remaining bad blocks (as explained in my other blog post) and migrated my VM as-is to Hetzner’s Ceph-backed storage offering.

It involved shutting down the VM, making a snapshot, and creating a new Ceph-backed VM (CX41 -> CX41-CEPH) from that snapshot. Since my VM’s disk was about 120 GB, it took quite some time (less than 10 hours, I think; I don’t remember precisely since it’s been a while).

Good news: it fixed everything. It’s been 10 weeks and I haven’t had a single issue:

root@kokoro ~# zpool status -v
  pool: zpool-lxd
 state: ONLINE
  scan: scrub repaired 0B in 1h26m with 0 errors on Sun Jul 14 01:50:16 2019
config:

	NAME        STATE     READ WRITE CKSUM
	zpool-lxd   ONLINE       0     0     0
	  sda3      ONLINE       0     0     0

errors: No known data errors

I think it’s safe to assume that it was indeed a storage issue, most likely on Hetzner’s side.

You probably wonder why I didn’t go with Ceph storage from the beginning. Well, it’s mostly about performance: the Ceph backend isn’t that much slower, but local NVMe SSDs will always be faster.

I have a bit more iowait now, but not much more than I expected. Another advantage of the Ceph backend is that if the hypervisor I’m on goes down, Hetzner can restart my VM on another one.

TL;DR:

  • RAID does not prevent silent corruption
  • ZFS is nice enough to let you know about it
  • If you plan on using ZFS on Hetzner Cloud, do it on the Ceph-backed storage plans!