Created attachment 1830465 [details]
console capture

1. Please describe the problem:
A system (OpenStack qemu/kvm-based aarch64 VM) hits "Unable to handle kernel paging request at virtual address" and becomes unresponsive. A reboot makes the system work until the issue is hit again.

2. What is the Version-Release number of the kernel:
5.14.9-300.fc35.aarch64, on the Fedora-Cloud-Base-35-20211004.n.0.aarch64 cloud image.

3. Did it work previously in Fedora? If so, what kernel version did the issue *first* appear? Old kernels are available for download at https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
N/A, but this was hit in the process of investigating https://github.com/coreos/fedora-coreos-tracker/issues/965, so it is possible that it goes back to kernel 5.13+.

4. Can you reproduce this issue? If so, please provide the steps to reproduce the issue below:
* provision a Fedora 35 aarch64 cloud-based VM in OpenStack
* try rebuilding the kernel rpm (it seems some load on the system is needed to trigger the issue, but it triggers reliably for me)

5. Does this problem occur with the latest Rawhide kernel? To install the Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by ``sudo dnf update --enablerepo=rawhide kernel``:
Yes.

6. Are you running any modules not shipped directly with Fedora's kernel?:
No.

Attached are backtraces and a console capture.
Created attachment 1830477 [details] collection of backtraces
Of note, F34 cloud-image-based VMs are not affected, even with a 5.14 kernel; at least I haven't been able to reproduce this issue there. I suspect that btrfs might be a factor.
Proposing as a blocker for F35 GA; this affects a release-blocking image: Cloud/aarch64/images/Fedora-Cloud-Base-_RELEASE_MILESTONE_.aarch64.qcow2
Is this an F36 kernel? At least the initial trace is... and I can't read the attachment. ;(
Two of the traces in the attachment are with 5.14.9-300.fc35.aarch64 . Jakub, could you mark the attachment non-private? I don't think it needs to be private. Thanks!
How can that be done? It seems there has been some change in BZ and I missed unchecking some field, and now it seems I can't uncheck it anymore. I guess I have to re-attach?
Created attachment 1830607 [details] collection of backtraces
You can click 'Details' next to an existing attachment and change it there. It's not the *most* obvious UI design ever :D Yeah, the change is that BZ now makes attachments private by default, at least for Red Hatters (I'm not sure if it's the same way for non-redhat.com accounts). I assume too many RHers were forgetting to set the flag for attachments that really *should* be private, but it's a bit annoying when you mostly work on Fedora...
(In reply to Adam Williamson from comment #8)
> You can click 'Details' next to an existing attachment and change it there.
> It's not the *most* obvious UI design ever :D

The editing link is actually placed next to the title... I was looking for it everywhere else and was perplexed that I could not find it.
I had seen the Details tab, but there wasn't any option to make it public (non-private). I assume it is not hidden under some further pop-out menu or a "show advanced fields" toggle.
You have to click "Details", then "edit details", then uncheck the "private" checkbox.
Paul, did you reproduce this? I'd be inclined to block on it, but it'd be good to check if it's reproducible I guess.
Discussed during the 2021-10-11 blocker review meeting: [0] The decision to delay the classification of this as a blocker bug was made as this is definitely a worrying bug but so far we have only a single reporter and not much detail or feedback from kernel/arm/cloud groups. We will try to get more input on this issue this week and take further action. [0] https://meetbot.fedoraproject.org/fedora-blocker-review/2021-10-11/f35-blocker-review.2021-10-11-16.00.txt
Note that as mentioned in https://github.com/coreos/fedora-coreos-tracker/issues/965 issues like this should be easily reproducible on VexxHost (OpenStack) cloud. If anyone who wants to debug would like temporary access, let me know.
Upstream bug report https://lore.kernel.org/linux-btrfs/CAJCQCtTqR8TJGZKKfwWB4sbu2-A+ZMPUBSQWzb0mYnXruuykAw@mail.gmail.com/T/#u
In bug 2006295 there's a similar call trace -> Workqueue: btrfs-delalloc btrfs_work_helper, that also directly mentions zstd compression, whereas I don't see a ZSTD related portion in this bug. However this bug does have submit_compressed_extents in the call trace, and btrfs in cloud images has zstd compression enabled. Ergo, these two reports might share the same instigator.
@jcajka what can you tell us about the host? x86_64 or aarch64? kernel and qemu version?
(In reply to Adam Williamson from comment #12)
> Paul, did you reproduce this? I'd be inclined to block on it, but it'd be
> good to check if it's reproducible I guess.

Not yet, trying to reproduce with AWS today.
It is an aarch64 host, and it should be qemu-based going by the console logs. lscpu reports Cavium's ThunderX 88XX (in case the uarch has any impact). Unfortunately I don't know more about that environment; I have attached a console dump, in case I have missed anything. @dustymabe do you think it might be possible to ask for more detail about the host environment, if that seems relevant?
(In reply to Paul Whalen from comment #18)
> (In reply to Adam Williamson from comment #12)
> > Paul, did you reproduce this? I'd be inclined to block on it, but it'd be
> > good to check if it's reproducible I guess.
>
> Not yet, trying to reproduce with AWS today.

I doubt you'll find anything. We're running aarch64 AWS tests in our FCOS CI against `next-devel` (based on F35) and those are working fine. Where we're seeing issues is OpenStack.
From earlier in the support ticket thread (from 9/22) I can see the kernel is (was):

```
4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:42:54 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux
```

I asked in the ticket for more information about the hardware/kernel/qemu, so we'll see what I get back.
> cavium's ThunderX 88XX(should the uarch have any impact).

So for reference, that's a Cavium ThunderX1 platform (CN99xx is ThunderX2). The uarch may contribute, but I suspect it's more my memory; I'll see if I can find any details in my inbox/notes.
No, what I was remembering was the QCom Amberwing platform, which was also a btrfs problem, but a completely unrelated uarch. For reference: https://bugzilla.redhat.com/show_bug.cgi?id=1949334
And the other bug's call trace looks very similar to this one:

Other bug:

Apr 14 00:53:20 fedora kernel: Call trace:
Apr 14 00:53:20 fedora kernel:  __list_del_entry_valid+0x30/0xb0
Apr 14 00:53:20 fedora kernel:  submit_compressed_extents+0x70/0x3d0
Apr 14 00:53:20 fedora kernel:  async_cow_submit+0x6c/0xc0
Apr 14 00:53:20 fedora kernel:  run_ordered_work+0xc4/0x2b0
Apr 14 00:53:20 fedora kernel:  btrfs_work_helper+0x98/0x270
Apr 14 00:53:20 fedora kernel:  process_one_work+0x1f0/0x4cc
Apr 14 00:53:20 fedora kernel:  worker_thread+0x184/0x500
Apr 14 00:53:20 fedora kernel:  kthread+0x120/0x124
Apr 14 00:53:20 fedora kernel:  ret_from_fork+0x10/0x18
Apr 14 00:53:20 fedora kernel: Code: d2802443 f2fbd5a3 eb03003f 54000340 (f9400021)
Apr 14 00:53:20 fedora kernel: ---[ end trace 4824d837c25d31e8 ]---

This bug:

[ 2164.734445] Call trace:
[ 2164.739675]  submit_compressed_extents+0x38/0x3d0
[ 2164.746728]  async_cow_submit+0x50/0xd0
[ 2164.752980]  run_ordered_work+0xc8/0x280
[ 2164.759248]  btrfs_work_helper+0x98/0x250
[ 2164.765449]  process_one_work+0x1f0/0x4ac
[ 2164.771558]  worker_thread+0x188/0x504
[ 2164.777395]  kthread+0x110/0x114
[ 2164.782791]  ret_from_fork+0x10/0x18
[ 2164.788343] Code: a9056bf9 f8428437 f9401400 d108c2fa (f9400356)
[ 2164.795833] ---[ end trace e44350b86ce16830 ]---
Can someone on aarch64 install the 5.14.9 kernel-debuginfo and kernel-debuginfo-common-aarch64 packages, and then run:

$ /usr/src/kernels/5.14.9-300.fc35.aarch64/scripts/faddr2line /usr/lib/debug/lib/modules/5.14.9-300.fc35.aarch64/vmlinux submit_compressed_extents+0x38

Note that submit_compressed_extents+0x38 is the function offset we're after; it comes from the attached "collection of backtraces", where I see:

[ 2164.583368] CPU: 2 PID: 8910 Comm: kworker/u8:3 Not tainted 5.14.9-300.fc35.aarch64 #1
...
[ 2164.739675] submit_compressed_extents+0x38/0x3d0

I already did this on x86_64, but I'm skeptical that it's a valid result.
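As an aside for anyone decoding these traces: in the `symbol+0x38/0x3d0` notation, the first hex number is the byte offset of the faulting instruction within the function and the second is the function's total size. A quick shell sketch (just string parsing, not part of faddr2line itself) to turn them into decimal:

```shell
# Decode the "symbol+offset/size" notation from a kernel backtrace line.
sym='submit_compressed_extents+0x38/0x3d0'
name=${sym%%+*}                 # symbol name before the '+'
off=${sym#*+};  off=${off%%/*}  # hex offset between '+' and '/'
size=${sym##*/}                 # hex function size after the '/'
printf '%s: offset %d of %d bytes\n' "$name" $((off)) $((size))
# prints: submit_compressed_extents: offset 56 of 976 bytes
```

faddr2line then does the real work of mapping that offset back to a source file and line using the vmlinux debuginfo.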
Ran it on one of the openQA aarch64 workers. Note the system is F34 running kernel 5.11; I don't know if that might affect the result.

[root@openqa-a64-worker03 adamwill][PROD]# /usr/src/kernels/5.14.9-300.fc35.aarch64/scripts/faddr2line /usr/lib/debug/lib/modules/5.14.9-300.fc35.aarch64/vmlinux submit_compressed_extents+0x38
submit_compressed_extents+0x38/0x3d0:
submit_compressed_extents at /usr/src/debug/kernel-5.14.9/linux-5.14.9-300.fc35.aarch64/fs/btrfs/inode.c:845
[root@openqa-a64-worker03 adamwill][PROD]#
From my rpi4 running on 5.14.9-300:

$ /usr/src/kernels/5.14.9-300.fc35.aarch64/scripts/faddr2line /usr/lib/debug/lib/modules/5.14.9-300.fc35.aarch64/vmlinux submit_compressed_extents+0x38
submit_compressed_extents+0x38/0x3d0:
submit_compressed_extents at /usr/src/debug/kernel-5.14.9/linux-5.14.9-300.fc35.aarch64/fs/btrfs/inode.c:845
I have been a bit delayed, as it seems even installing the debuginfo and devel packages can sometimes trigger the issue for me.

/usr/src/kernels/5.14.10-300.fc35.aarch64/scripts/faddr2line /usr/lib/debug/lib/modules/5.14.10-300.fc35.aarch64/vmlinux submit_compressed_extents+0x38
submit_compressed_extents+0x38/0x3d0:
submit_compressed_extents at /usr/src/debug/kernel-5.14.10/linux-5.14.10-300.fc35.aarch64/fs/btrfs/inode.c:845

I have had the 5.14.10-300.fc35.aarch64 installed and booted. I can retest with 5.14.9 tomorrow, but the result seems the same as the others'.
For the record, the BT with 5.14.10:

[ 2400.479611] Unable to handle kernel paging request at virtual address fffffffffffffdd0
[ 2400.485699] Mem abort info:
[ 2400.487797]   ESR = 0x96000004
[ 2400.490326]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 2400.494289]   SET = 0, FnV = 0
[ 2400.496699]   EA = 0, S1PTW = 0
[ 2400.499117]   FSC = 0x04: level 0 translation fault
[ 2400.502761] Data abort info:
[ 2400.504982]   ISV = 0, ISS = 0x00000004
[ 2400.507913]   CM = 0, WnR = 0
[ 2400.510194] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000091051000
[ 2400.515231] [fffffffffffffdd0] pgd=0000000000000000, p4d=0000000000000000
[ 2400.520496] Internal error: Oops: 96000004 [#1] SMP
[ 2400.524148] Modules linked in: virtio_gpu virtio_dma_buf drm_kms_helper joydev cec fb_sys_fops syscopyarea sysfillrect sysimgblt virtio_net virtio_balloon net_failover failover vfat fat drm fuse zram ip_tables crct10dif_ce ghash_ce virtio_blk qemu_fw_cfg virtio_mmio aes_neon_bs
[ 2400.543182] CPU: 3 PID: 881 Comm: kworker/u8:9 Not tainted 5.14.10-300.fc35.aarch64 #1
[ 2400.549301] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 2400.554687] Workqueue: btrfs-delalloc btrfs_work_helper
[ 2400.558737] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
[ 2400.563514] pc : submit_compressed_extents+0x38/0x3d0
[ 2400.567585] lr : async_cow_submit+0x50/0xd0
[ 2400.570849] sp : ffff800010d83c20
[ 2400.573496] x29: ffff800010d83c30 x28: 0000000000000000 x27: ffffddb466c11000
[ 2400.579127] x26: fffffffffffffdd0 x25: dead000000000100 x24: ffff0001bd91a208
[ 2400.584822] x23: 0000000000000000 x22: 0000000000000001 x21: ffff0000c3644880
[ 2400.590580] x20: ffff0000c0845000 x19: 0000000000000001 x18: ffff0000c4900bd4
[ 2400.596045] x17: ffff0000c4904a9d x16: 0000000000000006 x15: 93b479b5a6b4b1a6
[ 2400.601645] x14: d1e6bd8d113f85ce x13: 0000000000000020 x12: ffff0001fefa78c0
[ 2400.607340] x11: ffffddb466c2b500 x10: 0000000000000000 x9 : ffffddb465401c40
[ 2400.613045] x8 : ffff224d9834f000 x7 : ffff800010d83be0 x6 : ffffddb46715ad40
[ 2400.618724] x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff0000c36448a0
[ 2400.624468] x2 : 0000000000000000 x1 : ffff0001bd91a230 x0 : ffff0001bd91a230
[ 2400.630223] Call trace:
[ 2400.632181]  submit_compressed_extents+0x38/0x3d0
[ 2400.636026]  async_cow_submit+0x50/0xd0
[ 2400.639109]  run_ordered_work+0xc8/0x280
[ 2400.642229]  btrfs_work_helper+0x98/0x250
[ 2400.645480]  process_one_work+0x1f0/0x4ac
[ 2400.648659]  worker_thread+0x188/0x504
[ 2400.651658]  kthread+0x110/0x114
[ 2400.656840]  ret_from_fork+0x10/0x18
[ 2400.662287] Code: a9056bf9 f8428437 f9401400 d108c2fa (f9400356)
[ 2400.669738] ---[ end trace 9876ac43093cccb3 ]---
(In reply to Jakub Čajka from comment #29)
> For the record BT with the 5.14.10:
> [ 2400.479611] Unable to handle kernel paging request at virtual address
> fffffffffffffdd0
> [ 2400.485699] Mem abort info:
> [ 2400.487797] ESR = 0x96000004
> [ 2400.490326] EC = 0x25: DABT (current EL), IL = 32 bits
> [ 2400.494289] SET = 0, FnV = 0
> [ 2400.496699] EA = 0, S1PTW = 0
> [ 2400.499117] FSC = 0x04: level 0 translation fault
> [ 2400.502761] Data abort info:
> [ 2400.504982] ISV = 0, ISS = 0x00000004
> [ 2400.507913] CM = 0, WnR = 0
> [ 2400.510194] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000091051000
> [ 2400.515231] [fffffffffffffdd0] pgd=0000000000000000, p4d=0000000000000000
> [ 2400.520496] Internal error: Oops: 96000004 [#1] SMP
> [ 2400.524148] Modules linked in: virtio_gpu virtio_dma_buf drm_kms_helper
> joydev cec fb_sys_fops syscopyarea sysfillrect sysimgblt virtio_net
> virtio_balloon net_failover failover vfat fat drm fuse zram ip_tables
> crct10dif_ce ghash_ce virtio_blk qemu_fw_cfg virtio_mmio aes_neon_bs
> [ 2400.543182] CPU: 3 PID: 881 Comm: kworker/u8:9 Not tainted
> 5.14.10-300.fc35.aarch64 #1
> [ 2400.549301] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
> [ 2400.554687] Workqueue: btrfs-delalloc btrfs_work_helper
> [ 2400.558737] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
> [ 2400.563514] pc : submit_compressed_extents+0x38/0x3d0

Is that the pc pointing at the same code line number?
OK, for the v5.14.9 one, the offending line is:

static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
{
	struct btrfs_inode *inode = BTRFS_I(async_chunk->inode);
	struct btrfs_fs_info *fs_info = inode->root->fs_info;   <<<
	struct async_extent *async_extent;

This means async_chunk->inode is already NULL, which is possible in async_cow_start():

static noinline void async_cow_start(struct btrfs_work *work)
{
	struct async_chunk *async_chunk;
	int compressed_extents;

	async_chunk = container_of(work, struct async_chunk, work);
	compressed_extents = compress_file_range(async_chunk);
	if (compressed_extents == 0) {
		btrfs_add_delayed_iput(async_chunk->inode);
		async_chunk->inode = NULL;   <<<
	}
}

I'm not sure why we're hitting such a compressed_extents == 0 case only on aarch64, but I guess it may be related to page size (64K?). Anyway, I'll craft a patch to handle the case more gracefully and avoid the crash. Thank you all for the report!
Wait a minute, we already have handling for that case. This means the inode should be there, but some other problem is causing the bad access...
> I'm not sure why we're hitting such compressed_extents == 0 case only in
> Arch64, but I guess it may be related to page size (64K?).

Fedora's page size for aarch64 is 4K, not 64K.
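For anyone wanting to double-check the page size on a given system, it can be read at runtime (generic POSIX, nothing Fedora-specific):

```shell
# Print the kernel page size in bytes. Fedora's aarch64 kernels use 4K
# pages; some other distros build their aarch64 kernels with 64K pages.
getconf PAGESIZE
```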
OK, let me try to reproduce too. Meanwhile, can anyone also try to test the misc-next branch?

https://github.com/kdave/btrfs-devel/tree/misc-next

This branch has extra compression-related refactors (originally for subpage). Not sure if it may make a difference.
Testing Fedora-Cloud-Base-35-20211018.n.0.aarch64 on Vexxhost now with 5.14.10-300.fc35.aarch64, and I saw a hang occur while installing packages via dnf. The kernel trace mentioned something with btrfs, but I couldn't capture the output in time. I'm trying to see if I can get a copy of the trace again.
Just another data point here... I'll point out that we (Fedora CoreOS) were seeing issues with aarch64 on VexxHost when we first added the aarch64 architecture a month ago, and we don't use btrfs at all. Is it possible the btrfs traces are a symptom of a larger problem?

Context: https://github.com/coreos/fedora-coreos-tracker/issues/965
Discussed during the 2021-10-18 blocker review meeting: [0] The decision to delay the classification of this bug as a blocker and accept this as an "AcceptedFreezeException (Final)" was made as, though it's very late, we're still actively researching this and can't be sure yet if it needs to block the release. Many folks are working on it actively and we hope for developments soon. We agree it's at least serious enough to grant a freeze exception in any case. [0] https://meetbot.fedoraproject.org/fedora-blocker-review/2021-10-18/f35-blocker-review.2021-10-18-16.00.txt
In bug 1949334, we're seeing the same call trace on actual hardware (Amberwing), Fedora 34, kernel 5.11.12.

Apr 14 00:53:20 fedora kernel: Call trace:
Apr 14 00:53:20 fedora kernel:  __list_del_entry_valid+0x30/0xb0
Apr 14 00:53:20 fedora kernel:  submit_compressed_extents+0x70/0x3d0
Apr 14 00:53:20 fedora kernel:  async_cow_submit+0x6c/0xc0
Apr 14 00:53:20 fedora kernel:  run_ordered_work+0xc4/0x2b0
Apr 14 00:53:20 fedora kernel:  btrfs_work_helper+0x98/0x270
Apr 14 00:53:20 fedora kernel:  process_one_work+0x1f0/0x4cc
Apr 14 00:53:20 fedora kernel:  worker_thread+0x184/0x500
Apr 14 00:53:20 fedora kernel:  kthread+0x120/0x124
Apr 14 00:53:20 fedora kernel:  ret_from_fork+0x10/0x18
Apr 14 00:53:20 fedora kernel: Code: d2802443 f2fbd5a3 eb03003f 54000340 (f9400021)
Dusty set me up with a Vexxhost aarch64 VM to try to capture a kernel core dump for when the oops happens. But I can't reproduce it: so far I've done 6+ hours of kernel compiling, and for about 1 hour of that I concurrently ran a dnf group install of the Workstation edition. I'm not sure what to try next to trigger it and capture a vmcore.
I'm consistently getting a hang, but not an oops. I think the two might be unrelated, or perhaps there are two hangs. The consistent hang, I think, is due to this:

Oct 21 23:51:22 dusty-354.novalocal sshd[1847]: error: kex_exchange_identification: Connection closed by remote host
...
Oct 21 23:52:18 dusty-354.novalocal systemd[1]: sshd.service: Unit process 1844 (sshd) remains running after unit stopped.
Oct 21 23:52:18 dusty-354.novalocal systemd[1]: Stopped OpenSSH server daemon.
Oct 21 23:52:18 dusty-354.novalocal audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sshd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Oct 21 23:52:18 dusty-354.novalocal systemd[1]: Stopped target sshd-keygen.target.
Oct 21 23:52:18 dusty-354.novalocal systemd[1]: Stopping sshd-keygen.target...
Oct 21 23:52:18 dusty-354.novalocal systemd[1]: Condition check resulted in OpenSSH ecdsa Server Key Generation being skipped.
Oct 21 23:52:18 dusty-354.novalocal systemd[1]: Condition check resulted in OpenSSH ed25519 Server Key Generation being skipped.
Oct 21 23:52:18 dusty-354.novalocal systemd[1]: Condition check resulted in OpenSSH rsa Server Key Generation being skipped.
Oct 21 23:52:18 dusty-354.novalocal systemd[1]: Reached target sshd-keygen.target.
Oct 21 23:52:18 dusty-354.novalocal systemd[1]: sshd.service: Found left-over process 1844 (sshd) in control group while starting unit. Ignoring.

But this isn't the result, or the cause, of any oops or call trace. It does result in an apparent permanent hang of the ssh session, and at least for me it never results in a broken pipe or any such disconnect on the client side, so it looks like an indefinite hang; I have to close the GNOME Terminal tab to get rid of it. All of that might be a bug too, but it's not the bug we're looking for.
Updates from the upstream thread:

https://lore.kernel.org/linux-btrfs/ff911f0c-9ea5-43b1-4b8d-e8c392f0718e@suse.com/
https://lore.kernel.org/linux-btrfs/9e746c1c-85e5-c766-26fa-a4d83f1bfd34@suse.com/

How should we meet that request? I'm not adept enough yet at navigating Fedora infrastructure to get a scratch build of the kernel; or maybe it could be done in copr? I can get a 5.15-rc6 tarball to an aarch64 host, with a 'make localmodconfig' .config file that will compile in less than an hour, and then see if I can hit the bug, which really is elusive compared to the hang I see in comment 40. Still another option is to get the WARN_ON added to the Rawhide kernel, use a subsequent F36 cloud image with that kernel in it, and try to reproduce.
I could run a scratch build, but it will take longer than "less than an hour", so that one might be the easiest option. Plus I'd have to remember any weirdness about how the kernel package does patching. Justin could do a real build with that patch if he's willing, but today is a Red Hat company holiday so he might not be around.
I can run a scratch build if someone can provide me a patch.
Scratch builds with a one-line patch:
- 5.14.10 - https://koji.fedoraproject.org/koji/taskinfo?taskID=77712076
- 5.14.14 - https://koji.fedoraproject.org/koji/taskinfo?taskID=77711885

```
---
 fs/btrfs/inode.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8132d503c83d..334cbcb01827 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -841,6 +841,7 @@ static void free_async_extent_pages(struct async_extent *async_extent)
  */
 static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 {
+	WARN_ON(!async_chunk->inode);
 	struct btrfs_inode *inode = BTRFS_I(async_chunk->inode);
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct async_extent *async_extent;
--
```
Created attachment 1836761 [details] dmesg sysrq+t
Created attachment 1836845 [details] dmesg 5.14.10-300.fc35.dusty.aarch64 This kernel has the WARN_ON added. I'm not seeing any additional useful information though.
(In reply to Adam Williamson from comment #42)
> I could run a scratch build, but it will take longer than "less than an
> hour", so that one might be the easiest option. Plus I'd have to remember
> any weirdness about how the kernel package does patching.
>
> Justin could do a real build with that patch if he's willing, but today is a
> Red Hat company holiday so he might not be around.

I was not around, though I am now. It seems that scratch builds have been done, but we do not have a fix?
yeah, the scratch build was for further debugging rather than expected to fix things, AIUI.
Discussed during the 2021-10-25 blocker review meeting: [0] The decision to classify this bug as a "RejectedBlocker (Final)" was made as we agreed on current information to reject this as a blocker and document it as an ongoing situation with vexxhost which we will try to get resolved ASAP. This is a tough call as we cannot be sure if it specific to vexxhost's environment somehow, and if the problem lies there or in Fedora 35 itself. [0] https://meetbot.fedoraproject.org/fedora-blocker-review/2021-10-25/f35-blocker-review.2021-10-25-16.02.txt
Now I'm getting a hang while installing kernel debuginfo. top does not show dnf using any CPU; ps aux shows that dnf is still running but doing nothing; top shows high idle and wait, with no meaningful consumption by system or user processes. The sysrq+t for this does show some meaningful items, including:

[ 9602.008517] kernel: workqueue events_unbound: flags=0x2
[ 9602.016767] kernel:   pwq 8: cpus=0-3 flags=0x4 nice=0 active=2/512 refcnt=4
[ 9602.026585] kernel:     in-flight: 1296:btrfs_async_reclaim_metadata_space, 1304:btrfs_preempt_reclaim_metadata_space

I'll attach the full dmesg next, but this is the btrfs sysfs info at the time I captured the sysrq+t. I see we are somehow stuck in async_reclaim_metadata_space? Huh?

-bash-5.1$ grep -R . /sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/disk_used:167313408
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/bytes_pinned:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/bytes_used:83656704
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/dup/used_bytes:83656704
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/dup/total_bytes:268435456
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/disk_total:536870912
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/total_bytes:268435456
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/bytes_reserved:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/bytes_readonly:65536
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/bytes_zone_unusable:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/bytes_may_use:9748480
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/flags:4
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/disk_used:32768
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/bytes_pinned:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/bytes_used:16384
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/dup/used_bytes:16384
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/dup/total_bytes:8388608
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/disk_total:16777216
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/total_bytes:8388608
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/bytes_reserved:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/bytes_readonly:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/bytes_zone_unusable:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/bytes_may_use:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/flags:2
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/global_rsv_reserved:9093120
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/disk_used:3190075392
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/bytes_pinned:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/bytes_used:3190075392
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/single/used_bytes:3190075392
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/single/total_bytes:4202692608
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/disk_total:4202692608
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/total_bytes:4202692608
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/bytes_reserved:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/bytes_readonly:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/bytes_zone_unusable:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/bytes_may_use:8192
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/flags:1
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/global_rsv_size:9093120
-bash-5.1$
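To put those allocation numbers in perspective, here is the used/total arithmetic for the metadata and data pools, done with awk on the values pasted above (illustration only; it reads nothing from sysfs):

```shell
# Percentage used for each btrfs space-info class, using bytes_used and
# total_bytes from the sysfs dump above.
awk 'BEGIN {
    printf "metadata: %.1f%% used\n", 100 * 83656704   / 268435456;
    printf "data:     %.1f%% used\n", 100 * 3190075392 / 4202692608;
}'
# prints: metadata: 31.2% used
#         data:     75.9% used
```

So neither pool is anywhere near full, which makes the workers being stuck in metadata reclaim all the more puzzling.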
Created attachment 1836995 [details] dmesg sysrq+t, while dnf install kernel-debuginfo
Kernel scratch build with the patch suggestion from https://lore.kernel.org/linux-btrfs/b03fb30f-3d4b-413c-0227-6655ffeba75d@suse.com/T/#meea74bb8944ef2dca45bd28111039e377758793f
- kernel-5.14.10-300.fc35.dusty1026 - https://koji.fedoraproject.org/koji/taskinfo?taskID=77858374
Created attachment 1837629 [details] dmesg 5.14.14-300.fc35.dusty.aarch64 warnon with hang

For the record, I have been able to hit it with the WARN_ON kernel patch that Dusty built, only it took a bit more time. dmesg is in the attachment. I haven't noticed any new info in the log, but I might have missed it. I will try to reproduce it with the proposed patch, but I will not be able to post updates until next Monday.
@jcajka yeah, I don't see anything new there. If you get a chance, give this kernel a shot: https://koji.fedoraproject.org/koji/taskinfo?taskID=77858541

That contains this patch from upstream: https://lore.kernel.org/linux-btrfs/b03fb30f-3d4b-413c-0227-6655ffeba75d@suse.com/

I've tested it with help from Dusty, and it's 3 for 3 successfully installing kernel debuginfo and 6 for 6 successfully installing libreoffice.
Sorry for the longer reply time. For the record, with the patched kernel I haven't been able to reproduce this; I have been trying for 3 days now. To note, when I was resetting my environment, the original issue this time around actually led to a corrupted root fs after the first oops: the host after the reboot would end up in the emergency shell, as the rootfs couldn't be mounted.
The official fix is now in misc-next https://lore.kernel.org/linux-btrfs/20211102124916.433836-1-nborisov@suse.com/
Thanks, pulled in for 5.14.17 and 5.15.1
> this time around actually lead to corrupted root fs

Jakub, do you have any of these corrupted file systems still? I'd like to let upstream know more about this particular consequence of the bug, as it might itself be a bug that the kernel and/or btrfs-progs need to handle better. Can you try to mount such a corrupt file system again and share any btrfs messages that appear in dmesg? Also, 'btrfs check /dev/sdXY' on the corrupted file system would be helpful.
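If anyone hits the corrupted-rootfs state again, a possible sequence from the emergency shell might look like the sketch below. The device path /dev/vda2 is purely an assumption for a typical virtio-blk cloud guest; by default the script only echoes the commands, and setting DO_RUN=1 makes it execute them:

```shell
# Inspect a btrfs root that failed to mount. Dry-run by default: each
# step is printed rather than executed unless DO_RUN=1. DEV is an
# assumed device name; adjust it to match the actual root partition.
DEV=${DEV:-/dev/vda2}
run() { if [ "${DO_RUN:-0}" = 1 ]; then "$@"; else echo "+ $*"; fi; }

run mount -o ro,rescue=usebackuproot "$DEV" /mnt  # rescue mount attempt; may fail
run dmesg                                         # collect btrfs errors from the attempt
run umount /mnt                                   # unmount again if the mount worked
run btrfs check --readonly "$DEV"                 # offline check needs it unmounted
```

The mount attempt goes first so the btrfs error messages it produces are captured in dmesg, and `btrfs check` runs last, on the unmounted device, since the offline check must not race a mounted filesystem.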
FEDORA-2021-fdef34e26f has been submitted as an update to Fedora 35. https://bodhi.fedoraproject.org/updates/FEDORA-2021-fdef34e26f
FEDORA-2021-7de33b7016 has been submitted as an update to Fedora 34. https://bodhi.fedoraproject.org/updates/FEDORA-2021-7de33b7016
FEDORA-2021-8364530ebf has been submitted as an update to Fedora 33. https://bodhi.fedoraproject.org/updates/FEDORA-2021-8364530ebf
FEDORA-2021-8364530ebf has been pushed to the Fedora 33 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-8364530ebf` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-8364530ebf See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
FEDORA-2021-7de33b7016 has been pushed to the Fedora 34 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-7de33b7016` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-7de33b7016 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
FEDORA-2021-fdef34e26f has been pushed to the Fedora 35 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-fdef34e26f` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-fdef34e26f See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
(In reply to Chris Murphy from comment #58)
> > this time around actually lead to corrupted root fs
>
> Jakub, do you have any of these corrupted file systems still? I'd like to
> let upstream know more about this particular consequence to the bug, as it
> might itself be a bug that the kernel and/or btrfs-progs needs to be able to
> handle better. Can you try to mount such a corrupt file system again and
> share any btrfs messages that appear in dmesg? Also, 'btrfs check /dev/sdXY'
> for the corrupted file system would also be helpful.

Unfortunately not. To note, the patched kernel is holding up just fine. I was able to hit it (I guess I got unlucky) on vexxhost when provisioning from, I believe, the Fedora-Cloud-Base-35-20210922.n.0.aarch64 image and then updating the kernel. Previously, at worst, it was the hang. Using later composes I haven't got unlucky and didn't hit the oops during the kernel update. Is btrfs check available in the initramfs/emergency shell?
> Is btrfs check available in initramfs/emergency shell?

Yes.
FEDORA-2021-7de33b7016 has been pushed to the Fedora 34 stable repository. If problem still persists, please make note of it in this bug report.
FEDORA-2021-8364530ebf has been pushed to the Fedora 33 stable repository. If problem still persists, please make note of it in this bug report.
FEDORA-2021-fdef34e26f has been pushed to the Fedora 35 stable repository. If problem still persists, please make note of it in this bug report.