Bug 1708315 - Upgrade bcache based Fedora 29 system to Fedora 30 consistently corrupts the system beyond repair
Summary: Upgrade bcache based Fedora 29 system to Fedora 30 consistently corrupts the ...
Status: RELEASE_PENDING
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 30
Hardware: x86_64
OS: Linux
unspecified
urgent
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Keywords:
: 1707702 1707822 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-05-09 15:16 UTC by Rolf Fokkens
Modified: 2019-06-15 21:35 UTC (History)
41 users (show)

(edit)
Clone Of:
(edit)
Last Closed:


Attachments (Terms of Use)
dmesg containing EXT4-fs errors (48.66 KB, text/plain)
2019-05-09 15:16 UTC, Rolf Fokkens
no flags Details
A torrent for a test image (47.34 KB, text/plain)
2019-05-11 17:56 UTC, Rolf Fokkens
no flags Details
Simple, and nterresting scanario to reproduce the failure. (6.02 KB, text/plain)
2019-05-13 20:52 UTC, Rolf Fokkens
no flags Details
patched bcache module for 5.1.6-300.fc30.x86_64 (99.05 KB, application/octet-stream)
2019-06-09 18:01 UTC, Pierre Juhen
pierre.juhen: review?
Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Linux Kernel 203573 None None None 2019-06-12 17:25 UTC

Description Rolf Fokkens 2019-05-09 15:16:46 UTC
Created attachment 1566205 [details]
dmesg containing EXT4-fs errors

1. Please describe the problem: When running a Fedora 29 system using bcache for the root FS results in massive root FS corruption beyond repair The issue can be reproduced by only upgrading to a Fedora 30 kernel.


2. What is the Version-Release number of the kernel: 5.0.11-300.fc30


3. Did it work previously in Fedora? Ever since Fedora 21 this has worked fine


4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

4.A) Install Fedora 29 in a VM, with the following disk/FS layout:

[root@localhost ~]# lsblk 
NAME              MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda               252:0    0  16G  0 disk 
├─vda1            252:1    0   1G  0 part /boot
└─vda2            252:2    0  15G  0 part 
  └─bcache0       251:0    0  15G  0 disk 
    └─fedora-root 253:0    0  14G  0 lvm  /
vdb               252:16   0   2G  0 disk 
└─vdb1            252:17   0   2G  0 part 
  └─bcache0       251:0    0  15G  0 disk 
    └─fedora-root 253:0    0  14G  0 lvm  /
[root@localhost ~]# 

vda represents a HDD, size is 16GB
vdb represnsts an SSD which is used as a caching device

4.B) On this system enter: dnf --release=30 update "kernel*"

4.B) Do a reboot ...

4.C) use dmesg to watch a massive FS corruption


5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:
Was not able to install rwahide kernel:

ported successfully
Import of key(s) didn't help, wrong key(s)?

6. Are you running any modules that not shipped with directly Fedora's kernel?:
Nope

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Comment 1 Rolf Fokkens 2019-05-09 15:21:14 UTC
I prepared a tgz containing the Fedora 29 VM to reproduce this:

-rw-r--r--. 1 root root 5061400970 May  8 22:33 fed29-test.tgz

It's 5GB in size, Is there a proper way to share this?

Comment 2 Rolf Fokkens 2019-05-11 17:56 UTC
Created attachment 1567288 [details]
A torrent for a test image

I attached a torrent file for people who wish to test this in a simple way. Due to limited BW I opted for a torrent, uplink is about 100 Mbps.

The tgz contains 2 image files and a libvirt xml which can be used to create the VM. There is a testuser account with password "testuser".

Comment 3 Pierre Juhen 2019-05-12 04:03:55 UTC
*** Bug 1707822 has been marked as a duplicate of this bug. ***

Comment 4 Rolf Fokkens 2019-05-12 11:52:02 UTC
I noticed that kernel-5.0.14-300.fc30 is available in updates-testing and that it's built using gcc-9.1.1-1.fc30.

Unfortunately this issue is still as reproducable (100%) as before in this kernel.

Comment 5 Rolf Fokkens 2019-05-12 18:16:24 UTC
Upstream: https://marc.info/?l=linux-bcache&m=155756199506020&w=2

Comment 6 Pierre Juhen 2019-05-12 20:36:13 UTC
For me, the fact bcache is in writethrough or witeback doesn't change anything.

I reproduced it in writethrough mode.

Comment 7 Pierre Juhen 2019-05-12 22:18:56 UTC
It seems that the blocks are written almost anywhere on the backend ; I had an unmounted logical volume corrupted.

One question : does the bug occur when there is no frontend (backend in running mode) and in writethrough mode ?

Comment 8 Fredrik Chabot 2019-05-13 08:07:44 UTC
*** Bug 1707702 has been marked as a duplicate of this bug. ***

Comment 9 Rolf Fokkens 2019-05-13 10:06:43 UTC
I updated to the Fedora 30 kernel, after detaching the cache from the bcache device. Reboots afterwards (3 in total) worked fine, no ext4 issues in dmesg.

Then I recreated and reattached the cache to the bcache device. After only one reboot there were already ext4 issues in dmesg, and the VM was broken again.

Comment 10 Pierre Juhen 2019-05-13 10:25:23 UTC
Thank you Rolf,

It means that it's the metadata of the frontend that is corrupted. That's what I supposed, because if the metatada of the backend was corrupted, there would be attempts to write to non-existing blocks, and different messsages in dmesg.

Could you try to detach again the frontend, after migration and reboot with the frontend , to see if the backend is corrupted ? 

Regards,

Pierre

Comment 11 Rolf Fokkens 2019-05-13 10:42:01 UTC
Another attempt, cache_mode = none has no trouble too:

[testuser@localhost ~]$ bcache-status
--- bcache ---
UUID                        348d8cc4-1434-43b7-8554-cc17f1fff933
Block Size                  512 B
Bucket Size                 512.00 KiB
Congested?                  False
Read Congestion             2.0ms
Write Congestion            20.0ms
Total Cache Size            2 GiB
Total Cache Used            40.9 MiB	(2%)
Total Cache Unused          2 GiB	(98%)
Evictable Cache             2 GiB	(100%)
Replacement Policy          [lru] fifo random
Cache Mode                  writethrough writeback writearound [none]
Total Hits                  0
Total Misses                0
Total Bypass Hits           0	(0%)
Total Bypass Misses         24260
Total Bypassed              66.0 MiB
[testuser@localhost ~]$ cat /proc/version 
Linux version 5.0.14-300.fc30.x86_64 (mockbuild@bkernel04.phx2.fedoraproject.org) (gcc version 9.1.1 20190503 (Red Hat 9.1.1-1) (GCC)) #1 SMP Thu May 9 10:43:38 UTC 2019
[testuser@localhost ~]$ 

So indeed it's like the metadata in the cache (frontend?) that is corrupt or (probably) not processed the right way by the bcache driver. 

@Pierre I'll get back on suggestion later.

Comment 12 Pierre Juhen 2019-05-13 12:55:46 UTC
What we can add to your conclusion :

The first think that's done in reboot in Fedora 30 is an fsck on the root volume. Fsck shouldn't change anything if the filesystem is clean. In the different cases I met, it seems that the first read of the fsck (superblock) was already wrong, and the driver returned the wrong block.

It means the metadata read by the driver (the adress of the superblock in the frontend) is wrong immediately.

When it's read by a driver compiled by GCC8 -> OK

When it's read by a driver compiled by GCC9 -> KO

Hope this helps !

Regards,

Comment 13 Pierre Juhen 2019-05-13 19:39:20 UTC
@Rolf

I think my first idea is not useful. After the file system is corrupted, it is corrupted.

Another idea that might be useful :

1/ Migrate to Fedora 30 in bcache mode = none

2/ Detach the frontend 

3/ Reformat the frontend using make-bcache

4/ Re-attach the reformated frontend

Change the mode to writethrough and reboot

If it works, this will mean that we face a data alignment problem in the frontend

Thanks,

Regards,

Comment 14 Rolf Fokkens 2019-05-13 20:40:21 UTC
@Pierre, I already tried this, and it still failed. But as a matter of fact the formatting is done in userland, so it doesn't really matter because I'm running Fed29 anyway, except for the kernel.

Comment 15 Rolf Fokkens 2019-05-13 20:52 UTC
Created attachment 1568093 [details]
Simple, and nterresting scanario to reproduce the failure.

Being fed-up with wrecking my (virtual) Fed29 install over and over, I looked for simpler approach: don't wreck my root FS, wreck just another FS.

At first I tried to create a /dev/bcache0 (4 GB backing device, 1 GB  and added a plain ext4 FS on it. Did a lot of writing on it, and it all worked fine, no probelem at all.

So next I recreated /dev/bcache0 and created a VG on top, and an LV in the VG. BOOM!

[root@localhost-live ~]# vgcreate BCTEST /dev/bcache0
  Physical volume "/dev/bcache0" successfully created.
  Volume group "BCTEST" successfully created
[root@localhost-live ~]# lvcreate -L 2G -n TEST BCTEST
  Incorrect checksum in metadata area header on /dev/bcache0 at 4096
  Failed to write VG BCTEST.
[root@localhost-live ~]# 

Attached (fail.lis) are all the commands I ran to reproduce this. Unfortunately NLS was Dutch, I overlooked that because a was hurrying and didn't really expect it to fail.

It's just one sample, and I need to check its reproducability. But I'd like to share it anyway.

Comment 16 Pierre Juhen 2019-05-14 04:42:17 UTC
Thanks Rolf :

WHat you have demonstated :

The bug seems to occurs ONLY if the bcache device contains an LVM2 volume group. That's explains why so few people are affected.

Thank you for your testing. I tried to read a bit of code yesterday night and found nothing suspect. Hope this helps the developers.

Regards,

Comment 17 Fredrik Chabot 2019-05-14 08:08:32 UTC
(In reply to Pierre Juhen from comment #16)
 
> The bug seems to occurs ONLY if the bcache device contains an LVM2 volume
> group.

It also happens when you create an ext4 filesystem on the raw bcache0 device.

Comment 18 bepvte+bugzilla 2019-05-14 14:08:03 UTC
The bug occured for me even before mkfs.xfs finished running on the bcache volume in some cases. The LVM corruption is more reproducible, but this affects all types of usage.

Comment 19 Laura Abbott 2019-05-15 17:25:44 UTC
Please make sure you are continuing the discussion on the upstream thread. This really needs to be debugged by the upstream maintainers as Fedora isn't carrying anything special here.

Comment 20 Rolf Fokkens 2019-05-15 17:55:49 UTC
Except for one thing, maybe: the choice for gcc9.

Comment 21 Jeff Law 2019-05-16 14:34:58 UTC
Rolf -- understood WRT gcc-9.  If things start to look like it's a GCC issue, don't hesitate to drop a note here.

Comment 22 Pierre Juhen 2019-05-16 18:51:14 UTC
I copy here a message from Coly Li from the bcache kernel list (nobody seems to have a clue wether it's a code or a gcc9 problem) :

Such problem is very easy to produce, only a few sequential I/Os may
trigger a panic from bch_btree_iter_next_check().

Here is content of my fio script,
[global]
lockmem=1
direct=1
ioengine=psync

[job1]
filename=/dev/bcache0
readwrite=write
blocksize=4k
iodepth=1
numjobs=1
io_size=64K

In my observation, 2 sequential I/Os may trigger this panic. Here is the
kernel output when panic in bch_btree_iter_next_check() triggered,

[  153.478620] bcache: bch_dump_bset() block 1 key 0/9:
[  153.478621] bcache: bch_bkey_dump()  0x0:0x10 len 0x8 -> [check dev]
[  153.478623] bcache: bch_bkey_dump()  bucket 7560168717
[  153.478624] bcache: bch_bkey_dump()
[  153.478624]
[  153.478625] bcache: bch_dump_bset() block 1 key 3/9:
[  153.478626] bcache: bch_bkey_dump()  0x0:0x18 len 0x8 -> [check dev]
[  153.478627] bcache: bch_bkey_dump()  bucket 4400861924
[  153.478628] bcache: bch_bkey_dump()
[  153.478628]
[  153.478629] bcache: bch_dump_bset() Key skipped backwards
[  153.478629]
[  153.478630] bcache: bch_dump_bset() block 1 key 6/9:
[  153.478631] bcache: bch_bkey_dump()  0x0:0x10 len 0x8 -> [0:392192
gen 1] dirty
[  153.478632] bcache: bch_bkey_dump()  bucket 383
[  153.478635] bcache: bch_bkey_dump()
[  153.478635]
[  153.532890] Kernel panic - not syncing: Key skipped backwards
[  153.535924] CPU: 0 PID: 790 Comm: bcache_writebac Tainted: G        W
        5.1.0+ #3
[  153.539656] Hardware name: VMware, Inc. VMware Virtual Platform/440BX
Desktop Reference Platform, BIOS 6.00 04/13/2018
[  153.545002] Call Trace:
[  153.546702]  dump_stack+0x85/0xc0
[  153.548675]  panic+0x106/0x2da
[  153.550560]  ? bch_ptr_invalid+0x10/0x10 [bcache]
[  153.553178]  bch_btree_iter_next_filter.cold+0xff/0x12e [bcache]
[  153.556117]  ? btree_insert_key+0x190/0x190 [bcache]
[  153.558646]  bch_btree_map_keys_recurse+0x5c/0x190 [bcache]
[  153.561557]  bch_btree_map_keys+0x177/0x1a0 [bcache]
[  153.564085]  ? btree_insert_key+0x190/0x190 [bcache]
[  153.566688]  ? dirty_init+0x80/0x80 [bcache]
[  153.569224]  bch_refill_keybuf+0xcc/0x290 [bcache]
[  153.571609]  ? finish_wait+0x90/0x90
[  153.573525]  ? dirty_init+0x80/0x80 [bcache]
[  153.575705]  bch_writeback_thread+0x3b9/0x5c0 [bcache]
[  153.578527]  ? __kthread_parkme+0x58/0x80
[  153.580662]  kthread+0x108/0x140
[  153.582541]  ? read_dirty+0x620/0x620 [bcache]
[  153.584998]  ? kthread_park+0x90/0x90
[  153.586991]  ret_from_fork+0x3a/0x50
[  153.589236] Kernel Offset: 0x12000000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  153.594868] ---[ end Kernel panic - not syncing: Key skipped
backwards ]---

The panic happens in bch_writeback context, because bkeys in btree node
is not in linear increasing order. Adjacent two sequential write
requests is very common condition in bcache, such corrupted btree node
is not reported in recent 2~3 years. Unless the kernel is compiled with
gcc9...

It is not clear to me why the key 0:0x10 appears in same btree node
twice, and why there are 3 keys for two 4K write requests.

If anyone may have any clue, please offer. Now I continue to check how
this happens.

Thanks.

-- 

Coly Li

Comment 23 Lukas Zapletal 2019-05-20 10:33:18 UTC
I missed the warning in known issues and hit by this bug. I can confirm that it screws everything even when not using LVM. Switched to LVM cache and restoring from my last backup.

Comment 24 Salvador Ortiz 2019-05-21 04:53:52 UTC
(In reply to Jeff Law from comment #21)
> Rolf -- understood WRT gcc-9.  If things start to look like it's a GCC
> issue, don't hesitate to drop a note here.

bcache's source code for latest kernel (5.0.16) in both fc29 and fc30 are identical (after the "rpmbuild -bp" stage).  
So I can confirm that the issue is fedora's GCC9. I suspect a codegen error somewhere.

Comment 25 Jeff Law 2019-05-28 14:14:40 UTC
c#24 doesn't really confirm the bug is in gcc-9.  It's not uncommon for latent bugs to be exposed by compiler updates.  Regardless, at this point there's nothing actionable for the compiler folks.  I continue to monitor the BZ and occasionally look to see if there's any upstream progress (haven't seen any, perhaps I'm looking in the wrong places).

Comment 26 Rolf Fokkens 2019-05-28 18:22:02 UTC
Upstream is communicating here: https://www.spinics.net/lists/linux-bcache/

The issue has been confirmed, but progress on identifying the bug seems to be stalling. It's still unclear if gcc9 is causing the issue or gcc9 just exposed a bug that has been there all the time.

Comment 27 Jeff Law 2019-05-29 00:20:55 UTC
Thanks Rolf.  I'd been watching the original thread which was silent after the first few messages.

After reviewing the other thread, the thing that jumps out to me is y'all are likely going to have a tough time trying to compare gcc-8 to gcc-9.

If I were in your position I'd first try to nail down which .o compiled with gcc-9 triggers the failure.  ie build everything with gcc-8, then rebuild selective .o's with gcc-9 and retest.  Often this kind of .o binary search process can result in determining a single problematical .o file which is a huge step forward.

You may also be able to make some progress by testing with -O1.  If it works with gcc-9 -O1, but fails with gcc-9 -O2 you can bisect on the various -f options that get enabled at -O2.  It'll also be likely be easier to compare the output of gcc-9 -O1 against gcc-9 -O2.  You may also want to look at -fdump-tree-optimized.  That dumps the IR at the end of the high level optimizer in a C-ish form.  If you're looking at a single problematic file or function you may be able to spot issues in the .optimized dump file.  There's also an optimize attribute you can attach to a function definition to change the optimization level on a per-function basis.  That can help in some cases as well.

Comment 28 Pierre Juhen 2019-06-03 19:19:29 UTC
I tried something probably stupid.

Since the input from Coly Li suggest the problem might lie in btree management, I took the btree.o compiled in Fedora 29, put it in the kernel source in Fedora 30, and relinked the bcache module (kernel 5.0.17)

Put the bcache module in /usr/lib/modules, dracut and reboot.

Bcache module was not loaded and I got "Exec format error".

I expected to demonstrate that the error was somewhere in btree, but all I got is that mixing objects compiled with gcc8 and gcc9 link OK, but the resulting module cannot be loaded.

Comment 29 Jeff Law 2019-06-03 19:51:07 UTC
Not a stupid idea at all.  I'd expect you to be able to mix and match code from the two compilers without issues, we do it all the time.  Most likely there's some magic kernel bits going awry here.

Comment 30 Pierre Juhen 2019-06-05 03:38:39 UTC
I tried the same thing, but more carefully than yesterday.

Compiled the bcache module un Fedora 30 (gcc9).

Import btree.o compiled under Fedora 29 (gcc8).

Relauched the compilation of the kernel 

Got the following result :
 
  CALL    scripts/checksyscalls.sh
  DESCEND  objtool
  CHK     include/generated/compile.h
  LD [M]  drivers/md/bcache/bcache.o
  TEST    posttest
arch/x86/tools/insn_decoder_test: success: Decoded and checked 6160168 instructions
  TEST    posttest
arch/x86/tools/insn_sanity: Success: decoded and checked 1000000 random instructions with 0 errors (seed:0xbbc63b34)
Kernel: arch/x86/boot/bzImage is ready  (#2)
  Building modules, stage 2.
  MODPOST 3491 modules
ERROR: "mutex_lock" [drivers/md/bcache/bcache.ko] undefined!
make[1]: *** [scripts/Makefile.modpost:92: __modpost] Error 1
make: *** [Makefile:1290: modules] Error 2

I don't know if it can help...

Comment 31 Pierre Juhen 2019-06-07 11:33:52 UTC
Maybe good news from kernel.org : 

Congrat to Lukãs

https://bugzilla.kernel.org/show_bug.cgi?id=203573

Comment 32 Pierre Juhen 2019-06-09 17:59:57 UTC
There is a fix to be approved and inserted in an upcoming version of the kernel.

For the impatient who would like to test (not for production yet, please), I join a patched bcache module for 5.1.6-300.fc30.x86_64

Regards

Comment 33 Pierre Juhen 2019-06-09 18:01 UTC
Created attachment 1578853 [details]
patched bcache module for 5.1.6-300.fc30.x86_64

For testing purposes only.

Comment 34 Pierre Juhen 2019-06-14 04:53:42 UTC
The patch is due in kernel 5.2.5.

Regards,

Comment 35 Rolf Fokkens 2019-06-14 11:21:12 UTC
I tested the patched bcache module, and things seem perfectly stable now.

Comment 36 Pierre Juhen 2019-06-15 15:13:09 UTC
Correction : the fix is due in 5.2rc5, wich I guess, will be the first 5.2.x to reach fc30

Comment 37 Tomasz Torcz 2019-06-15 17:17:48 UTC
I guess this fix is a perfect candidate to be backported to 5.1.x, 5.0.x as a stable fix.in stabe

Comment 38 Rolf Fokkens 2019-06-15 21:32:39 UTC
-------- Forwarded Message --------
Subject: 	Patch "bcache: fix stack corruption by PRECEDING_KEY()" has been added to the 5.1-stable tree
Date: 	Sat, 15 Jun 2019 18:12:30 +0200
From: 	gregkh@linuxfoundation.org
To: 	axboe@kernel.dk, colyli@suse.de, gregkh@linuxfoundation.org, kent.overstreet@gmail.com, nix@esperi.org.uk, pierre.juhen@orange.fr, rolf@rolffokkens.nl, shhuiw@foxmail.com
CC: 	stable-commits@vger.kernel.org



This is a note to let you know that I've just added the patch titled

bcache: fix stack corruption by PRECEDING_KEY()

to the 5.1-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
bcache-fix-stack-corruption-by-preceding_key.patch
and it can be found in the queue-5.1 subdirectory.

Comment 39 Rolf Fokkens 2019-06-15 21:34:02 UTC
This is a note to let you know that I've just added the patch titled

bcache: fix stack corruption by PRECEDING_KEY()

to the 4.19-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
bcache-fix-stack-corruption-by-preceding_key.patch
and it can be found in the queue-4.19 subdirectory.


Note You need to log in before you can comment on or make changes to this bug.