Created attachment 1566205 [details]
dmesg containing EXT4-fs errors

1. Please describe the problem:
Running a Fedora 29 system that uses bcache for the root FS results in massive root FS corruption beyond repair. The issue can be reproduced by upgrading only the kernel to a Fedora 30 kernel.

2. What is the Version-Release number of the kernel:
5.0.11-300.fc30

3. Did it work previously in Fedora?
Ever since Fedora 21 this has worked fine.

4. Can you reproduce this issue? If so, please provide the steps to reproduce the issue below:

4.A) Install Fedora 29 in a VM, with the following disk/FS layout:

[root@localhost ~]# lsblk
NAME              MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda               252:0    0  16G  0 disk
├─vda1            252:1    0   1G  0 part /boot
└─vda2            252:2    0  15G  0 part
  └─bcache0       251:0    0  15G  0 disk
    └─fedora-root 253:0    0  14G  0 lvm  /
vdb               252:16   0   2G  0 disk
└─vdb1            252:17   0   2G  0 part
  └─bcache0       251:0    0  15G  0 disk
    └─fedora-root 253:0    0  14G  0 lvm  /
[root@localhost ~]#

vda represents an HDD, size is 16 GB; vdb represents an SSD which is used as a caching device.

4.B) On this system enter: dnf --release=30 update "kernel*"
4.C) Do a reboot ...
4.D) Use dmesg to watch massive FS corruption.

5. Does this problem occur with the latest Rawhide kernel? To install the Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by ``sudo dnf update --enablerepo=rawhide kernel``:
Was not able to install the Rawhide kernel: key(s) imported successfully, but the import of key(s) didn't help. Wrong key(s)?

6. Are you running any modules not shipped directly with Fedora's kernel?:
Nope

7. Please attach the kernel logs. You can get the complete kernel log for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the issue occurred on a previous boot, use the journalctl ``-b`` flag.
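For reference, a layout like the one in the lsblk output above is typically assembled roughly as follows. This is an illustrative sketch only, not the exact commands used for this VM; device names are taken from the lsblk output, bcache-tools is assumed to be installed, and udev is assumed to register the devices so that /dev/bcache0 appears.

make-bcache -B /dev/vda2      # format vda2 as the backing device (shows up as /dev/bcache0)
make-bcache -C /dev/vdb1      # format vdb1 as the caching device
# attach the cache set to the backing device using its cache-set UUID
cset_uuid=$(bcache-super-show /dev/vdb1 | awk '/cset.uuid/ {print $2}')
echo "$cset_uuid" > /sys/block/bcache0/bcache/attach
# LVM root on top of the cached device
pvcreate /dev/bcache0
vgcreate fedora /dev/bcache0
lvcreate -L 14G -n root fedora
mkfs.ext4 /dev/fedora/root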
I prepared a tgz containing the Fedora 29 VM to reproduce this:

-rw-r--r--. 1 root root 5061400970 May  8 22:33 fed29-test.tgz

It's 5 GB in size. Is there a proper way to share this?
Created attachment 1567288 [details]
A torrent for a test image

I attached a torrent file for people who wish to test this in a simple way. Due to limited bandwidth I opted for a torrent; uplink is about 100 Mbps. The tgz contains 2 image files and a libvirt XML which can be used to create the VM. There is a testuser account with password "testuser".
*** Bug 1707822 has been marked as a duplicate of this bug. ***
I noticed that kernel-5.0.14-300.fc30 is available in updates-testing and that it's built using gcc-9.1.1-1.fc30. Unfortunately this issue is still just as reproducible (100%) with this kernel.
Upstream: https://marc.info/?l=linux-bcache&m=155756199506020&w=2
For me, whether bcache is in writethrough or writeback mode doesn't change anything. I reproduced it in writethrough mode.
It seems that the blocks are written almost anywhere on the backend; I had an unmounted logical volume corrupted. One question: does the bug occur when there is no frontend (backend in running mode) and in writethrough mode?
*** Bug 1707702 has been marked as a duplicate of this bug. ***
I updated to the Fedora 30 kernel, after detaching the cache from the bcache device. Reboots afterwards (3 in total) worked fine, no ext4 issues in dmesg. Then I recreated and reattached the cache to the bcache device. After only one reboot there were already ext4 issues in dmesg, and the VM was broken again.
Thank you Rolf. It means that it's the metadata of the frontend that is corrupted. That's what I supposed, because if the metadata of the backend were corrupted, there would be attempts to write to non-existing blocks, and different messages in dmesg. Could you try to detach the frontend again, after migration and a reboot with the frontend, to see if the backend is corrupted? Regards, Pierre
Another attempt: cache_mode = none gives no trouble either:

[testuser@localhost ~]$ bcache-status
--- bcache ---
UUID                        348d8cc4-1434-43b7-8554-cc17f1fff933
Block Size                  512 B
Bucket Size                 512.00 KiB
Congested?                  False
Read Congestion             2.0ms
Write Congestion            20.0ms
Total Cache Size            2 GiB
Total Cache Used            40.9 MiB (2%)
Total Cache Unused          2 GiB (98%)
Evictable Cache             2 GiB (100%)
Replacement Policy          [lru] fifo random
Cache Mode                  writethrough writeback writearound [none]
Total Hits                  0
Total Misses                0
Total Bypass Hits           0 (0%)
Total Bypass Misses         24260
Total Bypassed              66.0 MiB
[testuser@localhost ~]$ cat /proc/version
Linux version 5.0.14-300.fc30.x86_64 (mockbuild.fedoraproject.org) (gcc version 9.1.1 20190503 (Red Hat 9.1.1-1) (GCC)) #1 SMP Thu May 9 10:43:38 UTC 2019
[testuser@localhost ~]$

So indeed it looks like the metadata in the cache (frontend?) is either corrupt or (more probably) not processed the right way by the bcache driver.

@Pierre I'll get back on your suggestion later.
What we can add to your conclusion: the first thing that's done on reboot in Fedora 30 is an fsck on the root volume. Fsck shouldn't change anything if the filesystem is clean. In the different cases I met, it seems that the first read of the fsck (superblock) was already wrong, and the driver returned the wrong block. It means the metadata read by the driver (the address of the superblock in the frontend) is wrong immediately.

When it's read by a driver compiled by GCC 8 -> OK
When it's read by a driver compiled by GCC 9 -> KO

Hope this helps! Regards,
@Rolf I think my first idea is not useful. Once the file system is corrupted, it is corrupted.

Another idea that might be useful (see the sketch below):

1/ Migrate to Fedora 30 with bcache mode = none
2/ Detach the frontend
3/ Reformat the frontend using make-bcache
4/ Re-attach the reformatted frontend

Change the mode to writethrough and reboot. If it works, this will mean that we face a data alignment problem in the frontend.

Thanks, Regards,
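A rough rendering of these steps with bcache-tools and the sysfs interface. This is an illustrative sketch only, assuming vdb1 is the caching device as in the original lsblk output; the <new-cset-uuid> is a placeholder, not a value from this bug.

echo none > /sys/block/bcache0/bcache/cache_mode          # 1/ run with cache mode = none
echo 1 > /sys/block/bcache0/bcache/detach                 # 2/ detach the frontend (dirty data is flushed first)
wipefs -a /dev/vdb1                                       # 3/ clear the old cache superblock...
make-bcache -C /dev/vdb1                                  #    ...and reformat the frontend
bcache-super-show /dev/vdb1 | grep cset.uuid              #    note the new cache-set UUID
echo <new-cset-uuid> > /sys/block/bcache0/bcache/attach   # 4/ re-attach the reformatted frontend
echo writethrough > /sys/block/bcache0/bcache/cache_mode  # then switch mode and reboot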
@Pierre, I already tried this, and it still failed. But as a matter of fact the formatting is done in userland, and I'm running the Fed29 userland anyway (only the kernel is from Fedora 30), so it doesn't really make a difference.
Created attachment 1568093 [details]
Simple and interesting scenario to reproduce the failure.

Being fed up with wrecking my (virtual) Fed29 install over and over, I looked for a simpler approach: don't wreck my root FS, wreck just another FS.

At first I tried to create a /dev/bcache0 (4 GB backing device, 1 GB cache) and added a plain ext4 FS on it. Did a lot of writing on it, and it all worked fine, no problem at all.

So next I recreated /dev/bcache0 and created a VG on top, and an LV in the VG. BOOM!

[root@localhost-live ~]# vgcreate BCTEST /dev/bcache0
  Physical volume "/dev/bcache0" successfully created.
  Volume group "BCTEST" successfully created
[root@localhost-live ~]# lvcreate -L 2G -n TEST BCTEST
  Incorrect checksum in metadata area header on /dev/bcache0 at 4096
  Failed to write VG BCTEST.
[root@localhost-live ~]#

Attached (fail.lis) are all the commands I ran to reproduce this. Unfortunately the NLS was Dutch; I overlooked that because I was hurrying and didn't really expect it to fail. It's just one sample, and I still need to check its reproducibility. But I'd like to share it anyway.
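A self-contained version of this scenario can be sketched with loop devices, so no spare disks are needed. This is a hedged, untested sketch based on the steps above (file names and sizes are my assumptions), not the content of the attached fail.lis:

truncate -s 4G /tmp/backing.img
truncate -s 1G /tmp/cache.img
backing=$(losetup --show -f /tmp/backing.img)
cache=$(losetup --show -f /tmp/cache.img)
make-bcache -B "$backing" -C "$cache"   # formatting both at once also attaches them
# (if /dev/bcache0 does not appear, register the loop devices via /sys/fs/bcache/register)
vgcreate BCTEST /dev/bcache0
lvcreate -L 2G -n TEST BCTEST           # on an affected kernel this fails with the checksum error above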
Thanks Rolf. What you have demonstrated: the bug seems to occur ONLY if the bcache device contains an LVM2 volume group. That explains why so few people are affected. Thank you for your testing. I tried to read a bit of the code last night and found nothing suspicious. Hope this helps the developers. Regards,
(In reply to Pierre Juhen from comment #16)
> The bug seems to occur ONLY if the bcache device contains an LVM2 volume
> group.

It also happens when you create an ext4 filesystem on the raw bcache0 device.
The bug occurred for me even before mkfs.xfs finished running on the bcache volume in some cases. The LVM corruption is more reproducible, but this affects all types of usage.
Please make sure you are continuing the discussion on the upstream thread. This really needs to be debugged by the upstream maintainers as Fedora isn't carrying anything special here.
Except for one thing, maybe: the choice of gcc 9.
Rolf -- understood WRT gcc-9. If things start to look like it's a GCC issue, don't hesitate to drop a note here.
I copy here a message from Coly Li from the bcache kernel list (nobody seems to have a clue whether it's a code or a gcc9 problem):

Such a problem is very easy to reproduce, only a few sequential I/Os may trigger a panic from bch_btree_iter_next_check().

Here is the content of my fio script,

[global]
lockmem=1
direct=1
ioengine=psync
[job1]
filename=/dev/bcache0
readwrite=write
blocksize=4k
iodepth=1
numjobs=1
io_size=64K

In my observation, 2 sequential I/Os may trigger this panic. Here is the kernel output when the panic in bch_btree_iter_next_check() is triggered,

[  153.478620] bcache: bch_dump_bset() block 1 key 0/9:
[  153.478621] bcache: bch_bkey_dump() 0x0:0x10 len 0x8 -> [check dev]
[  153.478623] bcache: bch_bkey_dump() bucket 7560168717
[  153.478624] bcache: bch_bkey_dump()
[  153.478624]
[  153.478625] bcache: bch_dump_bset() block 1 key 3/9:
[  153.478626] bcache: bch_bkey_dump() 0x0:0x18 len 0x8 -> [check dev]
[  153.478627] bcache: bch_bkey_dump() bucket 4400861924
[  153.478628] bcache: bch_bkey_dump()
[  153.478628]
[  153.478629] bcache: bch_dump_bset() Key skipped backwards
[  153.478629]
[  153.478630] bcache: bch_dump_bset() block 1 key 6/9:
[  153.478631] bcache: bch_bkey_dump() 0x0:0x10 len 0x8 -> [0:392192 gen 1] dirty
[  153.478632] bcache: bch_bkey_dump() bucket 383
[  153.478635] bcache: bch_bkey_dump()
[  153.478635]
[  153.532890] Kernel panic - not syncing: Key skipped backwards
[  153.535924] CPU: 0 PID: 790 Comm: bcache_writebac Tainted: G        W         5.1.0+ #3
[  153.539656] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  153.545002] Call Trace:
[  153.546702]  dump_stack+0x85/0xc0
[  153.548675]  panic+0x106/0x2da
[  153.550560]  ? bch_ptr_invalid+0x10/0x10 [bcache]
[  153.553178]  bch_btree_iter_next_filter.cold+0xff/0x12e [bcache]
[  153.556117]  ? btree_insert_key+0x190/0x190 [bcache]
[  153.558646]  bch_btree_map_keys_recurse+0x5c/0x190 [bcache]
[  153.561557]  bch_btree_map_keys+0x177/0x1a0 [bcache]
[  153.564085]  ? btree_insert_key+0x190/0x190 [bcache]
[  153.566688]  ? dirty_init+0x80/0x80 [bcache]
[  153.569224]  bch_refill_keybuf+0xcc/0x290 [bcache]
[  153.571609]  ? finish_wait+0x90/0x90
[  153.573525]  ? dirty_init+0x80/0x80 [bcache]
[  153.575705]  bch_writeback_thread+0x3b9/0x5c0 [bcache]
[  153.578527]  ? __kthread_parkme+0x58/0x80
[  153.580662]  kthread+0x108/0x140
[  153.582541]  ? read_dirty+0x620/0x620 [bcache]
[  153.584998]  ? kthread_park+0x90/0x90
[  153.586991]  ret_from_fork+0x3a/0x50
[  153.589236] Kernel Offset: 0x12000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  153.594868] ---[ end Kernel panic - not syncing: Key skipped backwards ]---

The panic happens in bch_writeback context, because the bkeys in the btree node are not in linearly increasing order. Two adjacent sequential write requests are a very common condition in bcache; such a corrupted btree node has not been reported in the recent 2~3 years. Unless the kernel is compiled with gcc9...

It is not clear to me why the key 0:0x10 appears in the same btree node twice, and why there are 3 keys for two 4K write requests. If anyone has any clue, please offer. Now I continue to check how this happens. Thanks.

-- Coly Li
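For anyone wanting to reproduce with Coly's job file: assuming it is saved as, say, seq-write.fio (a hypothetical name), it is run simply as follows. Note the job overwrites /dev/bcache0, so only run it against a throw-away bcache device.

fio seq-write.fio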
I missed the warning in the known issues and was hit by this bug. I can confirm that it screws up everything even when not using LVM. Switched to LVM cache and am restoring from my last backup.
(In reply to Jeff Law from comment #21)
> Rolf -- understood WRT gcc-9.  If things start to look like it's a GCC
> issue, don't hesitate to drop a note here.

bcache's source code for the latest kernel (5.0.16) in fc29 and fc30 is identical (after the "rpmbuild -bp" stage). So I can confirm that the issue is Fedora's GCC 9. I suspect a codegen error somewhere.
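A sketch of how such a comparison can be done (the exact source RPM names and build paths are assumptions; adjust them to whatever NVRs are actually in the fc29/fc30 repos):

# prep (unpack + patch) the fc29 kernel source and stash its bcache directory
rpm -i kernel-5.0.16-*.fc29.src.rpm
rpmbuild -bp --nodeps ~/rpmbuild/SPECS/kernel.spec
cp -r ~/rpmbuild/BUILD/kernel-*/linux-*/drivers/md/bcache /tmp/bcache-fc29
# repeat with the fc30 source RPM into /tmp/bcache-fc30, then compare
diff -ru /tmp/bcache-fc29 /tmp/bcache-fc30    # empty output means identical sources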
c#24 doesn't really confirm the bug is in gcc-9. It's not uncommon for latent bugs to be exposed by compiler updates. Regardless, at this point there's nothing actionable for the compiler folks. I continue to monitor the BZ and occasionally look to see if there's any upstream progress (haven't seen any, perhaps I'm looking in the wrong places).
Upstream is communicating here: https://www.spinics.net/lists/linux-bcache/ The issue has been confirmed, but progress on identifying the bug seems to be stalling. It's still unclear if gcc9 is causing the issue or gcc9 just exposed a bug that has been there all the time.
Thanks Rolf. I'd been watching the original thread, which was silent after the first few messages.

After reviewing the other thread, the thing that jumps out to me is y'all are likely going to have a tough time trying to compare gcc-8 to gcc-9. If I were in your position I'd first try to nail down which .o compiled with gcc-9 triggers the failure, i.e. build everything with gcc-8, then rebuild selective .o's with gcc-9 and retest. Often this kind of .o binary search can result in pinpointing a single problematic .o file, which is a huge step forward.

You may also be able to make some progress by testing with -O1. If it works with gcc-9 -O1, but fails with gcc-9 -O2, you can bisect on the various -f options that get enabled at -O2. It'll also likely be easier to compare the output of gcc-9 -O1 against gcc-9 -O2.

You may also want to look at -fdump-tree-optimized. That dumps the IR at the end of the high-level optimizer in a C-ish form. If you're looking at a single problematic file or function, you may be able to spot issues in the .optimized dump file.

There's also an optimize attribute you can attach to a function definition to change the optimization level on a per-function basis. That can help in some cases as well.
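A hedged sketch of the optimization-level part of this advice, in plain shell. It assumes gcc-9 invokes the GCC 9 compiler; the dump example is shown on a standalone file only to illustrate the flag, since compiling a real kernel source file needs the kernel's own include flags.

# list the -f options that differ between the two levels, as candidates to toggle one by one
gcc-9 -O1 -Q --help=optimizers > /tmp/O1.txt
gcc-9 -O2 -Q --help=optimizers > /tmp/O2.txt
diff /tmp/O1.txt /tmp/O2.txt

# dump the high-level optimizer's output for a suspect file (produces file.c.*.optimized)
gcc-9 -O2 -fdump-tree-optimized -c file.c

# within kbuild, a single object can be built at a different level with a per-file
# flag in the directory's Makefile, e.g.:  CFLAGS_btree.o := -O1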
I tried something probably stupid. Since the input from Coly Li suggests the problem might lie in btree management, I took the btree.o compiled in Fedora 29, put it into the kernel source tree in Fedora 30, and relinked the bcache module (kernel 5.0.17). I put the bcache module in /usr/lib/modules, ran dracut, and rebooted. The bcache module was not loaded and I got "Exec format error". I expected to demonstrate that the error was somewhere in btree, but all I got is that objects compiled with gcc8 and gcc9 link OK, but the resulting module cannot be loaded.
Not a stupid idea at all. I'd expect you to be able to mix and match code from the two compilers without issues, we do it all the time. Most likely there's some magic kernel bits going awry here.
I tried the same thing, but more carefully than yesterday. Compiled the bcache module under Fedora 30 (gcc9), imported the btree.o compiled under Fedora 29 (gcc8), and relaunched the compilation of the kernel. Got the following result:

  CALL    scripts/checksyscalls.sh
  DESCEND  objtool
  CHK     include/generated/compile.h
  LD [M]  drivers/md/bcache/bcache.o
  TEST    posttest
arch/x86/tools/insn_decoder_test: success: Decoded and checked 6160168 instructions
  TEST    posttest
arch/x86/tools/insn_sanity: Success: decoded and checked 1000000 random instructions with 0 errors (seed:0xbbc63b34)
Kernel: arch/x86/boot/bzImage is ready (#2)
  Building modules, stage 2.
  MODPOST 3491 modules
ERROR: "mutex_lock" [drivers/md/bcache/bcache.ko] undefined!
make[1]: *** [scripts/Makefile.modpost:92: __modpost] Error 1
make: *** [Makefile:1290: modules] Error 2

I don't know if it can help...
Maybe good news from kernel.org: congrats to Lukáš!

https://bugzilla.kernel.org/show_bug.cgi?id=203573
There is a fix waiting to be approved and included in an upcoming version of the kernel. For the impatient who would like to test (not for production yet, please), I attach a patched bcache module for 5.1.6-300.fc30.x86_64. Regards
Created attachment 1578853 [details]
patched bcache module for 5.1.6-300.fc30.x86_64

For testing purposes only.
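A rough outline of how a replacement module like this can be tried on a test system. The paths and the assumption that the attachment is an uncompressed bcache.ko sitting in the current directory are mine, not from the attachment; keep a backup of the original module.

moddir=/usr/lib/modules/5.1.6-300.fc30.x86_64/kernel/drivers/md/bcache
cp "$moddir/bcache.ko.xz" "$moddir/bcache.ko.xz.orig"   # keep the shipped module
xz -9 bcache.ko                                         # Fedora ships modules xz-compressed
cp bcache.ko.xz "$moddir/"
depmod -a 5.1.6-300.fc30.x86_64
dracut -f --kver 5.1.6-300.fc30.x86_64                  # bcache is loaded from the initramfs
reboot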
The patch is due in kernel 5.2.5. Regards,
I tested the patched bcache module, and things seem perfectly stable now.
Correction: the fix is due in 5.2-rc5, which, I guess, will be the first 5.2.x to reach fc30.
I guess this fix is a perfect candidate to be backported to 5.1.x and 5.0.x as a stable fix.
-------- Forwarded Message --------
Subject: Patch "bcache: fix stack corruption by PRECEDING_KEY()" has been added to the 5.1-stable tree
Date: Sat, 15 Jun 2019 18:12:30 +0200
From: gregkh
To: axboe, colyli, gregkh, kent.overstreet, nix.uk, pierre.juhen, rolf, shhuiw
CC: stable-commits.org

This is a note to let you know that I've just added the patch titled

    bcache: fix stack corruption by PRECEDING_KEY()

to the 5.1-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
    bcache-fix-stack-corruption-by-preceding_key.patch
and it can be found in the queue-5.1 subdirectory.
This is a note to let you know that I've just added the patch titled

    bcache: fix stack corruption by PRECEDING_KEY()

to the 4.19-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
    bcache-fix-stack-corruption-by-preceding_key.patch
and it can be found in the queue-4.19 subdirectory.
Tested kernel-5.1.12-200.fc29.src.rpm from updates-testing. Works OK.
Correction: kernel-5.1.12-200.fc30 from updates-testing.
Did a sudo dnf system-upgrade download --refresh --releasever=30 and have been using my laptop afterwards for a day without problems (kernel = kernel-5.1.12-200.fc30).
I also did a successful migration of 2 PCs. This bug should be closed now. Thanks to all; it was a difficult one.