How I did (not) recover from a data loss (featuring ZFS, LXD and PostgreSQL)

How I averted a disaster and recovered data by deleting it.

As some may know, I’m running a Mastodon instance. It’s a quite big instance with about 1000 weekly active users and 13k local accounts in total.

My current setup is a Hetzner Cloud CX41 VM running LXC containers with LXD. They are stored on a dedicated zpool on a separate partition of the disk. In this case, I have a dedicated container for the PostgreSQL server.

I was quite surprised when I looked at the state of my zpool last week:

root@kokoro ~# zpool status -v
  pool: zpool-lxd
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
	entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0h56m with 7 errors on Sun Mar 3 18:29:34 2019
config:

	NAME STATE READ WRITE CKSUM
	zpool-lxd ONLINE 0 0 0
	  sda3 ONLINE 0 0 0

errors: Permanent errors have been detected in the following files:

        <0x1f40>:<0x6f840>
        <0x1f40>:<0x1858a>
        <0x1956>:<0x18e13>
        <0x1956>:<0x18bcc>
        <0xc63>:<0x3604c>
        <0x8bbf>:<0x952e>
        zpool-lxd/containers/postgresql:/rootfs/var/lib/postgresql/11/main/base/20415/20819.9

The hexadecimal stuff are deleted auto-snapshots of some datasets, but what is concerning is the PostgreSQL file. I tried multiple ZFS scrubs without success.

I checked the PostgreSQL logs and there was indeed a few failing queries:

root@postgresql ~# grep 'Input/output error' /var/log/postgresql/postgresql-11-main.log | tail
2019-03-06 20:54:19.224 CET [11371] ERROR: could not read block 1264349 in file "base/20415/20819.9": Input/output error
2019-03-06 20:54:19.225 CET [11284] mstdn@mstdn ERROR: could not read block 1264349 in file "base/20415/20819.9": Input/output error
2019-03-06 22:05:04.540 CET [11533] mstdn@mstdn ERROR: could not read block 1264346 in file "base/20415/20819.9": Input/output error
2019-03-07 00:02:48.709 CET [12615] mstdn@mstdn ERROR: could not read block 1264346 in file "base/20415/20819.9": Input/output error
2019-03-07 02:25:30.946 CET [13670] mstdn@mstdn ERROR: could not read block 1264346 in file "base/20415/20819.9": Input/output error
2019-03-07 04:07:40.608 CET [14447] mstdn@mstdn ERROR: could not read block 1264346 in file "base/20415/20819.9": Input/output error
2019-03-07 06:09:58.351 CET [15667] mstdn@mstdn ERROR: could not read block 1264346 in file "base/20415/20819.9": Input/output error
2019-03-07 06:56:37.034 CET [16238] mstdn@mstdn ERROR: could not read block 1264341 in file "base/20415/20819.9": Input/output error
2019-03-07 06:56:57.828 CET [16505] ERROR: could not read block 1264341 in file "base/20415/20819.9": Input/output error
2019-03-07 06:56:57.830 CET [15870] mstdn@mstdn ERROR: could not read block 1264341 in file "base/20415/20819.9": Input/output error

Now, the PSQL server is handling about ~200 queries per second, so there are actually very few queries failing.

I could also confirm there was some errors in the application’s logs:

Mar 06 22:05:04 mstdn bundle[4703]: [7160cec0-24f8-4930-b249-f85f755f900b] method=GET path=/api/v1/timelines/tag/fun format=html controller=Api::V1::Timelines::TagController action=show status=500 error='ActiveRecord::StatementInvalid: PG::IoError: ERROR: could not read block 1264346 in file "base/20415/20819.9": Input/output error
Mar 06 22:05:04 mstdn bundle[4703]: [7160cec0-24f8-4930-b249-f85f755f900b]
Mar 06 22:05:04 mstdn bundle[4703]: [7160cec0-24f8-4930-b249-f85f755f900b] ActiveRecord::StatementInvalid (PG::IoError: ERROR: could not read block 1264346 in file "base/20415/20819.9": Input/output error
Mar 06 22:05:04 mstdn bundle[4703]: [7160cec0-24f8-4930-b249-f85f755f900b]
Mar 06 22:05:04 mstdn bundle[4703]: [7160cec0-24f8-4930-b249-f85f755f900b] app/controllers/application_controller.rb:109:in `cache_collection'
Mar 06 22:05:04 mstdn bundle[4703]: [7160cec0-24f8-4930-b249-f85f755f900b] app/controllers/api/v1/timelines/tag_controller.rb:25:in `cached_tagged_statuses'
Mar 06 22:05:04 mstdn bundle[4703]: [7160cec0-24f8-4930-b249-f85f755f900b] app/controllers/api/v1/timelines/tag_controller.rb:21:in `load_statuses'
Mar 06 22:05:04 mstdn bundle[4703]: [7160cec0-24f8-4930-b249-f85f755f900b] app/controllers/api/v1/timelines/tag_controller.rb:10:in `show'

...

2019-03-04 06:08:21.734 CET [19320] mstdn@mstdn ERROR: could not read block 843725 in file "base/20415/20819.6": Input/output error
2019-03-04 06:08:21.734 CET [19320] mstdn@mstdn CONTEXT: parallel worker
2019-03-04 06:08:21.734 CET [19320] mstdn@mstdn STATEMENT: SELECT DISTINCT "statuses"."id", "statuses"."updated_at" FROM "statuses" INNER JOIN "statuses_tags" ON "statuses_tags"."status_id" = "statuses"."id" LEFT OUTER JOIN "accounts" ON "accounts"."id" = "statuses"."account_id" WHERE "statuses"."visibility" = 0 AND (statuses.reblog_of_id IS NULL) AND "statuses_tags"."tag_id" = 210 AND "statuses"."account_id" NOT IN (245693, 20285) AND "accounts"."silenced" = FALSE ORDER BY "statuses"."id" DESC LIMIT 4

...

2019-03-04 04:20:04.149 CET [18597] mstdn@mstdn ERROR: could not read block 843720 in file "base/20415/20819.6": Input/output error
2019-03-04 04:20:04.149 CET [18597] mstdn@mstdn STATEMENT: SELECT DISTINCT "statuses"."id", "statuses"."updated_at" FROM "statuses" INNER JOIN "statuses_tags" ON "statuses_tags"."status_id" = "statuses"."id" LEFT OUTER JOIN "accounts" ON "accounts"."id" = "statuses"."account_id" WHERE "statuses"."visibility" = 0 AND (statuses.reblog_of_id IS NULL) AND "statuses_tags"."tag_id" = 2721 AND "accounts"."silenced" = FALSE ORDER BY "statuses"."id" DESC LIMIT 20

This is still concerning because the corrupted file is about 1 GB.

The most important issue was that all my dumps for the past few days have been failing:

postgres@postgresql:~$ pg_dump -Fc mstdn -f mstdn.dump
pg_dump: Dumping the contents of table "statuses" failed: PQgetResult() failed.
pg_dump: Error message from server: ERROR: could not read block 1264336 in file "base/20415/20819.9": Input/output error
pg_dump: The command was: COPY public.statuses (id, uri, text, created_at, updated_at, in_reply_to_id, reblog_of_id, url, sensitive, visibility, spoiler_text, reply, language, conversation_id, local, account_id, application_id, in_reply_to_account_id) TO stdout;

The public.statuses is by far the biggest table in the database since it contains all of the local and foreign toots (43M rows) so I can’t skip it.

It turned out I couldn’t do anything that required reading this file.

I couldn’t:

dump the database
copy the database raw files (with either cp or dd)
export the LXD container or a snapshot
migrate the LXD container or a snapshot to another host
export the containers’ ZFS dataset

Each of these would fail when they encountered of the corrupted blocks of the file.

root@postgresql ~# cp /var/lib/postgresql/11/main/base/20415/20819.9 .
cp: error reading '/var/lib/postgresql/11/main/base/20415/20819.9': Input/output error
root@postgresql ~# ll /var/lib/postgresql/11/main/base/20415/20819.9
-rw------- 1 postgres postgres 1.0G Mar 11 22:57 /var/lib/postgresql/11/main/base/20415/20819.9
root@postgresql ~# ll 20819.9
-rw------- 1 root root 662M Mar 11 23:00 20819.9
root@postgresql ~# dd if=/var/lib/postgresql/11/main/base/20415/20819.9 of=20819.9.dd
dd: error reading '/var/lib/postgresql/11/main/base/20415/20819.9': Input/output error
1355008+0 records in
1355008+0 records out
693764096 bytes (694 MB, 662 MiB) copied, 14.2534 s, 48.7 MB/s
root@postgresql ~#

Also, it was out of question to restore a backup since it would mean I’d lost multiple days of data, which is way worse that a few failing requests a day. But I couldn’t leave thing like this.

My zpool seems corrupted in a way or another, since it reports some corrupted files, then after a few hours, they’re not corrupted anymore, and some other files seem to be corrupted. These symptoms indicate a failing disk, but I have virtual disk on top of the hypervisor’s SSD RAID… I really don’t know what’s wrong.

Anyway, out of all the “corrupted” files, /var/lib/postgresql/11/main/base/20415/20819.9 was not getting better, so it meant it was really corrupted.

Lost for lost, I decided to do something that would really make me loose this data forever: remove the corrupted blocks from the file.

FYI, I launched a new VM from an older snaphost in order to test this before doing it on the production VM.

As shown above, dd would fail when it encountered the first corrupted block.

The solution is to use the noerror option, which means dd will ignore the errors and continue. Our new file will have the same content but will be empty where the corrupted blocks were.

The result is - somewhat - a success. I’m not sure what the default block size is, but dd failed 256 times around 662 MiB.

root@postgresql ~# dd if=/var/lib/postgresql/11/main/base/20415/20819.9 of=20819.9.dd conv=sync,noerror
dd: error reading '/var/lib/postgresql/11/main/base/20415/20819.9': Input/output error
1355008+0 records in
1355008+0 records out
693764096 bytes (694 MB, 662 MiB) copied, 14.0675 s, 49.3 MB/s
dd: error reading '/var/lib/postgresql/11/main/base/20415/20819.9': Input/output error
1355008+1 records in
1355009+0 records out
693764608 bytes (694 MB, 662 MiB) copied, 14.0689 s, 49.3 MB/s
dd: error reading '/var/lib/postgresql/11/main/base/20415/20819.9': Input/output error
1355008+2 records in
1355010+0 records out
693765120 bytes (694 MB, 662 MiB) copied, 14.0696 s, 49.3 MB/s

...

dd: error reading '/var/lib/postgresql/11/main/base/20415/20819.9': Input/output error
1355008+254 records in
1355262+0 records out
693894144 bytes (694 MB, 662 MiB) copied, 14.3103 s, 48.5 MB/s
dd: error reading '/var/lib/postgresql/11/main/base/20415/20819.9': Input/output error
1355008+255 records in
1355263+0 records out
693894656 bytes (694 MB, 662 MiB) copied, 14.3117 s, 48.5 MB/s
2096896+256 records in
2097152+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 22.1875 s, 48.4 MB/s

Some simple calculations. Errors start at 693764096 and ends at 693894656 which means 130560 B = 127.5 KiB of corrupted/lost data. That’s 0,000003% of the database size, so I guess I should be able to live with it. It’s also infuriating (but logical) that such a tiny amount of data can make such a mess.

Let’s replace the file:

root@postgresql ~# ll 20819.9.dd
-rw-r--r-- 1 root root 1.0G Mar 11 23:37 20819.9.dd
root@postgresql ~# cd /var/lib/postgresql/11/main/base/20415/
root@postgresql /v/l/p/1/m/b/20415# mv 20819.9 20819.9.old
root@postgresql /v/l/p/1/m/b/20415# cp ~/20819.9.dd 20819.9
root@postgresql /v/l/p/1/m/b/20415# chown postgres:postgres 20819.9

Of course I stopped the PostgreSQL server during the operation.

Since then, I have not had a single failing query related to I/O:

root@postgresql ~# grep 'Input/output error' /var/log/postgresql/postgresql-11-main.log | wc -l
0
root@postgresql ~# grep 'Input/output error' /var/log/postgresql/postgresql-11-main.log.1 | tail -5
2019-03-09 16:22:46.269 CET [14465] mstdn@mstdn ERROR: could not read block 1264350 in file "base/20415/20819.9": Input/output error
2019-03-09 17:32:08.646 CET [16096] mstdn@mstdn ERROR: could not read block 1264350 in file "base/20415/20819.9": Input/output error
2019-03-09 17:43:54.887 CET [16763] mstdn@mstdn ERROR: could not read block 1264346 in file "base/20415/20819.9": Input/output error
2019-03-09 19:24:59.579 CET [17444] mstdn@mstdn ERROR: could not read block 1264350 in file "base/20415/20819.9": Input/output error
2019-03-09 21:24:06.808 CET [408] mstdn@mstdn ERROR: could not read block 1264351 in file "base/20415/20819.9": Input/output error
root@postgresql ~# date
Sat Mar 16 15:43:15 CET 2019

Success! I can now dump the database too.

As expected the 20819.9 file was not reported as corrupted by zpool status -v, but 20819.9.old was, so I deleted it since I didn’t need it anymore.

I’m not an expert in file systems or databases, but I thought it was an experience worth sharing. I definitely lost a bit of non-critical data, but it was the best outcome. I had very little downtime and I didn’t have to restore dumps which would have make me loss way more data. The same issue affecting the accounts table could have been disastrous.

I still have errors that come and go on my zpool. I am a bit concerned since I don’t know the source of these errors. Is the underlying storage corrupted or unreliable? Did I corrupt something myself? Is it a bug with ZOL? LXD? I don’t know.

This VM is not even a month old but I guess I have to start a new one from scratch and migrate my containers. Will use ZFS though? I’m not sure.