Bug 2011928 - Fedora 35 aarch64 cloud image based openstack VM hangs
Summary: Fedora 35 aarch64 cloud image based openstack VM hangs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 35
Hardware: aarch64
OS: Linux
unspecified
unspecified
Target Milestone: ---
Assignee: fedora-kernel-btrfs
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: https://ask.fedoraproject.org/t/we-ar...
Depends On:
Blocks: ARMTracker F35FinalFreezeException
TreeView+ depends on / blocked
 
Reported: 2021-10-07 18:51 UTC by Jakub Čajka
Modified: 2021-12-03 01:12 UTC (History)
36 users (show)

Fixed In Version: kernel-5.14.17-201.fc34 kernel-5.14.17-101.fc33 kernel-5.14.17-301.fc35
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-14 03:44:04 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
console capture (52.01 KB, text/plain)
2021-10-07 18:51 UTC, Jakub Čajka
no flags Details
colection of backtraces (8.16 KB, text/plain)
2021-10-08 06:56 UTC, Jakub Čajka
no flags Details
dmesg sysrq+t (296.71 KB, text/plain)
2021-10-25 13:14 UTC, Chris Murphy
no flags Details
dmesg 5.14.10-300.fc35.dusty.aarch64 (52.57 KB, text/plain)
2021-10-25 14:45 UTC, Chris Murphy
no flags Details
dmesg sysrq+t, while dnf install kernel-debuginfo (174.88 KB, text/plain)
2021-10-25 19:34 UTC, Chris Murphy
no flags Details
dmesg 5.14.14-300.fc35.dusty.aarch64 warnon with hang (57.79 KB, text/plain)
2021-10-27 14:48 UTC, Jakub Čajka
no flags Details

Description Jakub Čajka 2021-10-07 18:51:11 UTC
Created attachment 1830465 [details]
console capture

1. Please describe the problem:
System(Openstack qemu/kvm based aarch64 VM) hits "Unable to handle kernel paging request at virtual address" and becomes unresponsive. Reboot makes the system work until the issue is hit again.

2. What is the Version-Release number of the kernel:
5.14.9-300.fc35.aarch64 on Fedora-Cloud-Base-35-20211004.n.0.aarch64 cloud image.

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
N/A but this got hit in process of investigating issue, https://github.com/coreos/fedora-coreos-tracker/issues/965, so it is possible that it is kernel 5.13+

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:
* provision Fedora 35 aarhc64 cloud based VM in openstack
* try rebuilding kernel rpm(it seems there is need for some load on the system to trigger the issue, but it seems to reliably trigger for me)

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:
Yes.

6. Are you running any modules that not shipped with directly Fedora's kernel?:
No.

In attachment are backtraces and console capture.

Comment 1 Jakub Čajka 2021-10-07 18:52:18 UTC
Created attachment 1830477 [details]
collection of backtraces

Comment 2 Jakub Čajka 2021-10-07 19:14:55 UTC
To note f34 cloud image based VMs are not affected even with 5.14 kernel. At least I haven't been able to reproduce this issue, I suspect that the btrfs might be a factor.

Comment 3 Paul Whalen 2021-10-07 19:30:05 UTC
Proposing as a blocker for F35 GA, this affects a release blocking image - Cloud/aarch64/images/Fedora-Cloud-Base-_RELEASE_MILESTONE_.aarch64.qcow2

Comment 4 Kevin Fenzi 2021-10-07 22:42:59 UTC
This is a f36 kernel? At least the initial trace is... the attachment I can't read. ;(

Comment 5 Adam Williamson 2021-10-07 22:53:21 UTC
Two of the traces in the attachment are with 5.14.9-300.fc35.aarch64 . Jakub, could you mark the attachment non-private? I don't think it needs to be private. Thanks!

Comment 6 Jakub Čajka 2021-10-08 06:55:42 UTC
How that can be done? It seems that there has been some change in the BZ and I have missed to uncheck some filed and it seems I can't uncheck it anymore. I guess I have to re-attach??

Comment 7 Jakub Čajka 2021-10-08 06:56:25 UTC
Created attachment 1830607 [details]
colection of backtraces

Comment 8 Adam Williamson 2021-10-08 07:00:23 UTC
You can click 'Details' next to an existing attachment and change it there. It's not the *most* obvious UI design ever :D

Yeah, the change is that BZ now makes attachments private by default, at least for Red Hatters (I'm not sure if it's the same way for non-redhat.com accounts). I assume too many RHers were forgetting to set the flag for attachments that really *should* be private, but it's a bit annoying when you mostly work on Fedora...

Comment 9 Lukas Ruzicka 2021-10-08 08:31:30 UTC
(In reply to Adam Williamson from comment #8)
> You can click 'Details' next to an existing attachment and change it there.
> It's not the *most* obvious UI design ever :D

Then the editing link is actually placed next to the title ... i was looking for it everywhere else and was perplexed I could not find it.

Comment 10 Jakub Čajka 2021-10-08 12:41:43 UTC
I have seen the details tab, but there hasn't been any option to make it public(non-private). I assume it is not hidden under some more pop out menus or a like "show advanced fields".

Comment 11 Adam Williamson 2021-10-08 17:41:29 UTC
You have to click "Details", then "edit details", then uncheck the "private" checkbox.

Comment 12 Adam Williamson 2021-10-08 17:54:44 UTC
Paul, did you reproduce this? I'd be inclined to block on it, but it'd be good to check if it's reproducible I guess.

Comment 13 Geoffrey Marr 2021-10-11 19:00:43 UTC
Discussed during the 2021-10-11 blocker review meeting: [0]

The decision to delay the classification of this as a blocker bug was made as this is definitely a worrying bug but so far we have only a single reporter and not much detail or feedback from kernel/arm/cloud groups. We will try to get more input on this issue this week and take further action.

[0] https://meetbot.fedoraproject.org/fedora-blocker-review/2021-10-11/f35-blocker-review.2021-10-11-16.00.txt

Comment 14 Dusty Mabe 2021-10-11 22:31:47 UTC
Note that as mentioned in https://github.com/coreos/fedora-coreos-tracker/issues/965 issues like this should be easily reproducible on VexxHost (OpenStack) cloud. If anyone who wants to debug would like temporary access, let me know.

Comment 16 Chris Murphy 2021-10-12 01:10:46 UTC
In bug 2006295 there's a similar call trace -> Workqueue: btrfs-delalloc btrfs_work_helper, that also directly mentions zstd compression, whereas I don't see a ZSTD related portion in this bug. However this bug does have submit_compressed_extents in the call trace, and btrfs in cloud images has zstd compression enabled. Ergo, these two reports might share the same instigator.

Comment 17 Chris Murphy 2021-10-12 14:32:44 UTC
@jcajka what can you tell us about the host? x86_64 or aarch64? kernel and qemu version?

Comment 18 Paul Whalen 2021-10-12 16:05:41 UTC
(In reply to Adam Williamson from comment #12)
> Paul, did you reproduce this? I'd be inclined to block on it, but it'd be
> good to check if it's reproducible I guess.

Not yet, trying to reproduce with AWS today.

Comment 19 Jakub Čajka 2021-10-12 17:58:21 UTC
It is aarch64 host, should be qemu based on console logs. lscpu reports cavium's ThunderX 88XX(should the uarch have any impact). Unfortunately I don't know more about that environment. I have attached console dump, if I have missed anything. @dustymabe do you think it might be possible to ask more about the host environment, if that would seem relevant?

Comment 20 Dusty Mabe 2021-10-12 18:02:47 UTC
(In reply to Paul Whalen from comment #18)
> (In reply to Adam Williamson from comment #12)
> > Paul, did you reproduce this? I'd be inclined to block on it, but it'd be
> > good to check if it's reproducible I guess.
> 
> Not yet, trying to reproduce with AWS today.

I doubt you'll find anything. We're running aarch64 AWS tests in our FCOS CI against `next-devel` (based on F35) and those are working fine. Where we're seeing issues is OpenStack.

Comment 21 Dusty Mabe 2021-10-12 18:07:09 UTC
From earlier in the support ticket thread (from 9/22) I can see the kernel is (was):

```
4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:42:54 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux
```

I asked in the ticket for more information about the hardware/kernel/qemu. So we'll see what I get back.

Comment 22 Peter Robinson 2021-10-12 18:11:47 UTC
> cavium's ThunderX 88XX(should the uarch have any impact).

So for reference that's a Cavium ThunderX1 platform (CN99xx is ThunderX2). The uarch may contribute but I suspect it's more my memory, I'll see if I can find any details in my inbox/notes.

Comment 23 Peter Robinson 2021-10-12 18:13:50 UTC
No, my memory was the QCom Amberwing platform which was also a btrfs problem, but completely unrelated uarch. For reference https://bugzilla.redhat.com/show_bug.cgi?id=1949334

Comment 24 Peter Robinson 2021-10-12 18:17:51 UTC
And the other bug's core trace looks very similar to this one:

Other bug:
Apr 14 00:53:20 fedora kernel: Call trace:
Apr 14 00:53:20 fedora kernel:  __list_del_entry_valid+0x30/0xb0
Apr 14 00:53:20 fedora kernel:  submit_compressed_extents+0x70/0x3d0
Apr 14 00:53:20 fedora kernel:  async_cow_submit+0x6c/0xc0
Apr 14 00:53:20 fedora kernel:  run_ordered_work+0xc4/0x2b0
Apr 14 00:53:20 fedora kernel:  btrfs_work_helper+0x98/0x270
Apr 14 00:53:20 fedora kernel:  process_one_work+0x1f0/0x4cc
Apr 14 00:53:20 fedora kernel:  worker_thread+0x184/0x500
Apr 14 00:53:20 fedora kernel:  kthread+0x120/0x124
Apr 14 00:53:20 fedora kernel:  ret_from_fork+0x10/0x18
Apr 14 00:53:20 fedora kernel: Code: d2802443 f2fbd5a3 eb03003f 54000340 (f9400021) 
Apr 14 00:53:20 fedora kernel: ---[ end trace 4824d837c25d31e8 ]---

This bug:
[ 2164.734445] Call trace:
[ 2164.739675]  submit_compressed_extents+0x38/0x3d0
[ 2164.746728]  async_cow_submit+0x50/0xd0
[ 2164.752980]  run_ordered_work+0xc8/0x280
[ 2164.759248]  btrfs_work_helper+0x98/0x250
[ 2164.765449]  process_one_work+0x1f0/0x4ac
[ 2164.771558]  worker_thread+0x188/0x504
[ 2164.777395]  kthread+0x110/0x114
[ 2164.782791]  ret_from_fork+0x10/0x18
[ 2164.788343] Code: a9056bf9 f8428437 f9401400 d108c2fa (f9400356)
[ 2164.795833] ---[ end trace e44350b86ce16830 ]---

Comment 25 Chris Murphy 2021-10-13 13:03:22 UTC
Can someone on aarch64 install 5.14.9 kernel-debuginfo kernel-debuginfo-common-aarch64, and then run:

$ /usr/src/kernels/5.14.9-300.fc35.aarch64/scripts/faddr2line /usr/lib/debug/lib/modules/5.14.9-300.fc35.aarch64/vmlinux submit_compressed_extents+0x38

Note that submit_compressed_extents+0x38 is the function we're after, and comes from the attached "collection of backtraces" here I see:

[ 2164.583368] CPU: 2 PID: 8910 Comm: kworker/u8:3 Not tainted 5.14.9-300.fc35.aarch64 #1
...
[ 2164.739675]  submit_compressed_extents+0x38/0x3d0

I already did this on x86_64 but I'm skeptical that it's a valid result.

Comment 26 Adam Williamson 2021-10-13 18:23:43 UTC
Ran it on one of the openQA aarch64 workers. Note the system is F34 running kernel 5.11, don't know if that might affect the result.

[root@openqa-a64-worker03 adamwill][PROD]# /usr/src/kernels/5.14.9-300.fc35.aarch64/scripts/faddr2line /usr/lib/debug/lib/modules/5.14.9-300.fc35.aarch64/vmlinux submit_compressed_extents+0x38
submit_compressed_extents+0x38/0x3d0:
submit_compressed_extents at /usr/src/debug/kernel-5.14.9/linux-5.14.9-300.fc35.aarch64/fs/btrfs/inode.c:845
[root@openqa-a64-worker03 adamwill][PROD]#

Comment 27 František Zatloukal 2021-10-13 20:07:34 UTC
From my rpi4 running on 5.14.9-300:

$ /usr/src/kernels/5.14.9-300.fc35.aarch64/scripts/faddr2line /usr/lib/debug/lib/modules/5.14.9-300.fc35.aarch64/vmlinux submit_compressed_extents+0x38
submit_compressed_extents+0x38/0x3d0:
submit_compressed_extents at /usr/src/debug/kernel-5.14.9/linux-5.14.9-300.fc35.aarch64/fs/btrfs/inode.c:845

Comment 28 Jakub Čajka 2021-10-13 21:21:29 UTC
I have been bit delayed as it seems even installing the debug info and devel packages can sometimes trigger the issue for me.
 
/usr/src/kernels/5.14.10-300.fc35.aarch64/scripts/faddr2line /usr/lib/debug/lib/modules/5.14.10-300.fc35.aarch64/vmlinux submit_compressed_extents+0x38
submit_compressed_extents+0x38/0x3d0:
submit_compressed_extents at /usr/src/debug/kernel-5.14.10/linux-5.14.10-300.fc35.aarch64/fs/btrfs/inode.c:845

I have had the 5.14.10-300.fc35.aarch64 installed and booted. I can retest with the 5.14.9 tomorrow, but result seems the same for the others.

Comment 29 Jakub Čajka 2021-10-13 21:23:47 UTC
For the record BT with the 5.14.10:
[ 2400.479611] Unable to handle kernel paging request at virtual address fffffffffffffdd0
[ 2400.485699] Mem abort info:
[ 2400.487797]   ESR = 0x96000004
[ 2400.490326]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 2400.494289]   SET = 0, FnV = 0
[ 2400.496699]   EA = 0, S1PTW = 0
[ 2400.499117]   FSC = 0x04: level 0 translation fault
[ 2400.502761] Data abort info:
[ 2400.504982]   ISV = 0, ISS = 0x00000004
[ 2400.507913]   CM = 0, WnR = 0
[ 2400.510194] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000091051000
[ 2400.515231] [fffffffffffffdd0] pgd=0000000000000000, p4d=0000000000000000
[ 2400.520496] Internal error: Oops: 96000004 [#1] SMP
[ 2400.524148] Modules linked in: virtio_gpu virtio_dma_buf drm_kms_helper joydev cec fb_sys_fops syscopyarea sysfillrect sysimgblt virtio_net virtio_balloon net_failover failover vfat fat drm fuse zram ip_tables crct10dif_ce ghash_ce virtio_blk qemu_fw_cfg virtio_mmio aes_neon_bs
[ 2400.543182] CPU: 3 PID: 881 Comm: kworker/u8:9 Not tainted 5.14.10-300.fc35.aarch64 #1
[ 2400.549301] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 2400.554687] Workqueue: btrfs-delalloc btrfs_work_helper
[ 2400.558737] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
[ 2400.563514] pc : submit_compressed_extents+0x38/0x3d0
[ 2400.567585] lr : async_cow_submit+0x50/0xd0
[ 2400.570849] sp : ffff800010d83c20
[ 2400.573496] x29: ffff800010d83c30 x28: 0000000000000000 x27: ffffddb466c11000
[ 2400.579127] x26: fffffffffffffdd0 x25: dead000000000100 x24: ffff0001bd91a208
[ 2400.584822] x23: 0000000000000000 x22: 0000000000000001 x21: ffff0000c3644880
[ 2400.590580] x20: ffff0000c0845000 x19: 0000000000000001 x18: ffff0000c4900bd4
[ 2400.596045] x17: ffff0000c4904a9d x16: 0000000000000006 x15: 93b479b5a6b4b1a6
[ 2400.601645] x14: d1e6bd8d113f85ce x13: 0000000000000020 x12: ffff0001fefa78c0
[ 2400.607340] x11: ffffddb466c2b500 x10: 0000000000000000 x9 : ffffddb465401c40
[ 2400.613045] x8 : ffff224d9834f000 x7 : ffff800010d83be0 x6 : ffffddb46715ad40
[ 2400.618724] x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff0000c36448a0
[ 2400.624468] x2 : 0000000000000000 x1 : ffff0001bd91a230 x0 : ffff0001bd91a230
[ 2400.630223] Call trace:
[ 2400.632181]  submit_compressed_extents+0x38/0x3d0
[ 2400.636026]  async_cow_submit+0x50/0xd0
[ 2400.639109]  run_ordered_work+0xc8/0x280
[ 2400.642229]  btrfs_work_helper+0x98/0x250
[ 2400.645480]  process_one_work+0x1f0/0x4ac
[ 2400.648659]  worker_thread+0x188/0x504
[ 2400.651658]  kthread+0x110/0x114
[ 2400.656840]  ret_from_fork+0x10/0x18
[ 2400.662287] Code: a9056bf9 f8428437 f9401400 d108c2fa (f9400356) 
[ 2400.669738] ---[ end trace 9876ac43093cccb3 ]---

Comment 30 Qu Wenruo 2021-10-18 02:46:21 UTC
(In reply to Jakub Čajka from comment #29)
> For the record BT with the 5.14.10:
> [ 2400.479611] Unable to handle kernel paging request at virtual address
> fffffffffffffdd0
> [ 2400.485699] Mem abort info:
> [ 2400.487797]   ESR = 0x96000004
> [ 2400.490326]   EC = 0x25: DABT (current EL), IL = 32 bits
> [ 2400.494289]   SET = 0, FnV = 0
> [ 2400.496699]   EA = 0, S1PTW = 0
> [ 2400.499117]   FSC = 0x04: level 0 translation fault
> [ 2400.502761] Data abort info:
> [ 2400.504982]   ISV = 0, ISS = 0x00000004
> [ 2400.507913]   CM = 0, WnR = 0
> [ 2400.510194] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000091051000
> [ 2400.515231] [fffffffffffffdd0] pgd=0000000000000000, p4d=0000000000000000
> [ 2400.520496] Internal error: Oops: 96000004 [#1] SMP
> [ 2400.524148] Modules linked in: virtio_gpu virtio_dma_buf drm_kms_helper
> joydev cec fb_sys_fops syscopyarea sysfillrect sysimgblt virtio_net
> virtio_balloon net_failover failover vfat fat drm fuse zram ip_tables
> crct10dif_ce ghash_ce virtio_blk qemu_fw_cfg virtio_mmio aes_neon_bs
> [ 2400.543182] CPU: 3 PID: 881 Comm: kworker/u8:9 Not tainted
> 5.14.10-300.fc35.aarch64 #1
> [ 2400.549301] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
> [ 2400.554687] Workqueue: btrfs-delalloc btrfs_work_helper
> [ 2400.558737] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
> [ 2400.563514] pc : submit_compressed_extents+0x38/0x3d0

Is that the pc pointing at the same code line number?

Comment 31 Qu Wenruo 2021-10-18 06:35:33 UTC
OK, for the v5.14.9 one, the offending line is:

static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
{
        struct btrfs_inode *inode = BTRFS_I(async_chunk->inode);
        struct btrfs_fs_info *fs_info = inode->root->fs_info; <<<
        struct async_extent *async_extent;
 

This means, async_chunk is already NULL, which is possible in async_cow_start():
static noinline void async_cow_start(struct btrfs_work *work)
{
        struct async_chunk *async_chunk;
        int compressed_extents;

        async_chunk = container_of(work, struct async_chunk, work);

        compressed_extents = compress_file_range(async_chunk);
        if (compressed_extents == 0) {
                btrfs_add_delayed_iput(async_chunk->inode);
                async_chunk->inode = NULL; <<<
        }
}

I'm not sure why we're hitting such compressed_extents == 0 case only in Arch64, but I guess it may be related to page size (64K?).

Anyway, I'll craft a patch to handle the case more gracefully to avoid the crash.

Thank you all for the report!

Comment 32 Qu Wenruo 2021-10-18 06:37:28 UTC
Wait for a min, we already have the handling for the case.

This means inode should be there, but some other problem causing the bad access...

Comment 33 Peter Robinson 2021-10-18 06:39:08 UTC
> I'm not sure why we're hitting such compressed_extents == 0 case only in
> Arch64, but I guess it may be related to page size (64K?).

Fedora's page size for aarch64 is 4K not 64K.

Comment 34 Qu Wenruo 2021-10-18 07:01:20 UTC
OK, let me try to reproduce too.


Meanwhile can any one also try to test the misc-next branch?
https://github.com/kdave/btrfs-devel/tree/misc-next

This branch has extra compression related refactors (originally for subpage).

Not sure if it may make a difference.

Comment 35 Major Hayden 🤠 2021-10-18 13:27:13 UTC
Testing Fedora-Cloud-Base-35-20211018.n.0.aarch64 on Vexxhost now with 5.14.10-300.fc35.aarch64 and I saw a hang occur while installing packages via dnf. The kernel trace mentioned something with brtfs but I couldn't capture the output in time. I'm trying to see if I can get a copy of the trace again.

Comment 36 Dusty Mabe 2021-10-18 13:28:41 UTC
Just another data point here... I'll point out that we (Fedora CoreOS) were seeing issues with aarch64 on VexxHost when we first added the aarch64 archtecture a month ago. We don't use btrfs at all. Is it possible the btrfs traces are a symptom of a larger problem? 

context: https://github.com/coreos/fedora-coreos-tracker/issues/965

Comment 37 Geoffrey Marr 2021-10-18 20:51:44 UTC
Discussed during the 2021-10-18 blocker review meeting: [0]

The decision to delay the classification of this bug as a blocker and accept this as an "AcceptedFreezeException (Final)" was made as, though it's very late, we're still actively researching this and can't be sure yet if it needs to block the release. Many folks are working on it actively and we hope for developments soon. We agree it's at least serious enough to grant a freeze exception in any case.

[0] https://meetbot.fedoraproject.org/fedora-blocker-review/2021-10-18/f35-blocker-review.2021-10-18-16.00.txt

Comment 38 Chris Murphy 2021-10-19 16:46:17 UTC
In bug 1949334, seeing the same call trace on actual hardware (amberwing), Fedora 34, kernel 5.11.12.

Apr 14 00:53:20 fedora kernel: Call trace:
Apr 14 00:53:20 fedora kernel:  __list_del_entry_valid+0x30/0xb0
Apr 14 00:53:20 fedora kernel:  submit_compressed_extents+0x70/0x3d0
Apr 14 00:53:20 fedora kernel:  async_cow_submit+0x6c/0xc0
Apr 14 00:53:20 fedora kernel:  run_ordered_work+0xc4/0x2b0
Apr 14 00:53:20 fedora kernel:  btrfs_work_helper+0x98/0x270
Apr 14 00:53:20 fedora kernel:  process_one_work+0x1f0/0x4cc
Apr 14 00:53:20 fedora kernel:  worker_thread+0x184/0x500
Apr 14 00:53:20 fedora kernel:  kthread+0x120/0x124
Apr 14 00:53:20 fedora kernel:  ret_from_fork+0x10/0x18
Apr 14 00:53:20 fedora kernel: Code: d2802443 f2fbd5a3 eb03003f 54000340 (f9400021)

Comment 39 Chris Murphy 2021-10-20 02:10:26 UTC
Dusty set me up with a Vexxhost aarch64 VM to try and capture a kernel core dump for when the oops happens. But I can't reproduce it, so far after 6+ hours of kernel compiling and for about 1 hour concurrently I ran a dnf group install for Workstation edition. I'm not sure what to try next to trigger it and capture a vmcore.

Comment 40 Chris Murphy 2021-10-21 23:58:19 UTC
I'm consistently getting a hang, but not an oops. I think the two might be unrelated, or perhaps there are two hangs. The consistent hang I think is due to this:

Oct 21 23:51:22 dusty-354.novalocal sshd[1847]: error: kex_exchange_identification: Connection closed by remote host
...
Oct 21 23:52:18 dusty-354.novalocal systemd[1]: sshd.service: Unit process 1844 (sshd) remains running after unit stopped.
Oct 21 23:52:18 dusty-354.novalocal systemd[1]: Stopped OpenSSH server daemon.
Oct 21 23:52:18 dusty-354.novalocal audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sshd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Oct 21 23:52:18 dusty-354.novalocal systemd[1]: Stopped target sshd-keygen.target.
Oct 21 23:52:18 dusty-354.novalocal systemd[1]: Stopping sshd-keygen.target...
Oct 21 23:52:18 dusty-354.novalocal systemd[1]: Condition check resulted in OpenSSH ecdsa Server Key Generation being skipped.
Oct 21 23:52:18 dusty-354.novalocal systemd[1]: Condition check resulted in OpenSSH ed25519 Server Key Generation being skipped.
Oct 21 23:52:18 dusty-354.novalocal systemd[1]: Condition check resulted in OpenSSH rsa Server Key Generation being skipped.
Oct 21 23:52:18 dusty-354.novalocal systemd[1]: Reached target sshd-keygen.target.
Oct 21 23:52:18 dusty-354.novalocal systemd[1]: sshd.service: Found left-over process 1844 (sshd) in control group while starting unit. Ignoring.

But this isn't the result of, or cause, of any oops or call trace. But it does result in an apparent permanent hang of the ssh session. And at least for me it never results in broken pipe or any such disconnect on the client side so it looks like an indefinite hang. I have to close the GNOME Terminal tab to get rid of it. All of that might be a bug too, but it's not the bug we're looking for.

Comment 41 Chris Murphy 2021-10-22 17:11:26 UTC
Updates from upstream thread:
https://lore.kernel.org/linux-btrfs/ff911f0c-9ea5-43b1-4b8d-e8c392f0718e@suse.com/
https://lore.kernel.org/linux-btrfs/9e746c1c-85e5-c766-26fa-a4d83f1bfd34@suse.com/

How should we meet that request? I'm not adept enough yet at navigating Fedora infrastructure to get a scratch build of the kernel, or maybe it could be done in copr? I can get a 5.15-rc6 tarball to an aarch64 host, with a 'make localmodconfig' .config file that will compile in less than an hour, and then see if I can hit the bug which really is elusive compared to the hang I see in comment 40. Still another option is to get the WARN_ON added to the Rawhide kernel, and use a subsequent F36 cloud image with that kernel in it, and try to reproduce.

Comment 42 Adam Williamson 2021-10-22 17:21:17 UTC
I could run a scratch build, but it will take longer than "less than an hour", so that one might be the easiest option. Plus I'd have to remember any weirdness about how the kernel package does patching.

Justin could do a real build with that patch if he's willing, but today is a Red Hat company holiday so he might not be around.

Comment 43 Peter Robinson 2021-10-22 17:53:34 UTC
I can run a scratch build if someone can provide me a patch.

Comment 44 Dusty Mabe 2021-10-23 15:40:27 UTC
scratch builds with one line patch:

- 5.14.10
    - https://koji.fedoraproject.org/koji/taskinfo?taskID=77712076
- 5.14.14
    - https://koji.fedoraproject.org/koji/taskinfo?taskID=77711885


```
---
 fs/btrfs/inode.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8132d503c83d..334cbcb01827 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -841,6 +841,7 @@ static void free_async_extent_pages(struct async_extent *async_extent)
  */
 static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 {
+       WARN_ON(!async_chunk->inode);
        struct btrfs_inode *inode = BTRFS_I(async_chunk->inode);
        struct btrfs_fs_info *fs_info = inode->root->fs_info;
        struct async_extent *async_extent;
-- 
```

Comment 45 Chris Murphy 2021-10-25 13:14:32 UTC
Created attachment 1836761 [details]
dmesg sysrq+t

Comment 46 Chris Murphy 2021-10-25 14:45:15 UTC
Created attachment 1836845 [details]
dmesg 5.14.10-300.fc35.dusty.aarch64

This kernel has the WARN_ON added. I'm not seeing any additional useful information though.

Comment 47 Justin M. Forbes 2021-10-25 16:10:44 UTC
(In reply to Adam Williamson from comment #42)
> I could run a scratch build, but it will take longer than "less than an
> hour", so that one might be the easiest option. Plus I'd have to remember
> any weirdness about how the kernel package does patching.
> 
> Justin could do a real build with that patch if he's willing, but today is a
> Red Hat company holiday so he might not be around.

I was not around, though I am now. It seems that scratch builds have been done, but we do not have a fix?

Comment 48 Adam Williamson 2021-10-25 16:59:13 UTC
yeah, the scratch build was for further debugging rather than expected to fix things, AIUI.

Comment 49 Geoffrey Marr 2021-10-25 18:23:41 UTC
Discussed during the 2021-10-25 blocker review meeting: [0]

The decision to classify this bug as a "RejectedBlocker (Final)" was made as we agreed on current information to reject this as a blocker and document it as an ongoing situation with vexxhost which we will try to get resolved ASAP. This is a tough call as we cannot be sure if it specific to vexxhost's environment somehow, and if the problem lies there or in Fedora 35 itself.

[0] https://meetbot.fedoraproject.org/fedora-blocker-review/2021-10-25/f35-blocker-review.2021-10-25-16.02.txt

Comment 50 Chris Murphy 2021-10-25 19:34:16 UTC
Now I'm getting a hang while installing kernel debuginfo. top does not show dnf using any CPU; ps aux shows that dnf is running still but doing nothing; top shows high idle and wait, with no meaningful consumption for system or user processes. 

The sysrq+t for this does show some meaningful items, including:
[ 9602.008517] kernel: workqueue events_unbound: flags=0x2
[ 9602.016767] kernel:   pwq 8: cpus=0-3 flags=0x4 nice=0 active=2/512 refcnt=4
[ 9602.026585] kernel:     in-flight: 1296:btrfs_async_reclaim_metadata_space, 1304:btrfs_preempt_reclaim_metadata_space

I'll attach the full dmesg next, but this is btrfs sysfs info at the time I captured the sysrq+t I see we are somehow stuck in async_reclaim_metadata_space? Huh?


-bash-5.1$ grep -R . /sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/disk_used:167313408
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/bytes_pinned:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/bytes_used:83656704
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/dup/used_bytes:83656704
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/dup/total_bytes:268435456
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/disk_total:536870912
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/total_bytes:268435456
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/bytes_reserved:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/bytes_readonly:65536
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/bytes_zone_unusable:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/bytes_may_use:9748480
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/metadata/flags:4
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/disk_used:32768
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/bytes_pinned:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/bytes_used:16384
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/dup/used_bytes:16384
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/dup/total_bytes:8388608
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/disk_total:16777216
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/total_bytes:8388608
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/bytes_reserved:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/bytes_readonly:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/bytes_zone_unusable:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/bytes_may_use:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/system/flags:2
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/global_rsv_reserved:9093120
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/disk_used:3190075392
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/bytes_pinned:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/bytes_used:3190075392
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/single/used_bytes:3190075392
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/single/total_bytes:4202692608
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/disk_total:4202692608
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/total_bytes:4202692608
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/bytes_reserved:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/bytes_readonly:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/bytes_zone_unusable:0
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/bytes_may_use:8192
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/data/flags:1
/sys/fs/btrfs/624e06e4-b43c-47cf-a74d-1b824c81a401/allocation/global_rsv_size:9093120
-bash-5.1$

Comment 51 Chris Murphy 2021-10-25 19:34:55 UTC
Created attachment 1836995 [details]
dmesg sysrq+t, while dnf install kernel-debuginfo

Comment 53 Jakub Čajka 2021-10-27 14:48:14 UTC
Created attachment 1837629 [details]
dmesg 5.14.14-300.fc35.dusty.aarch64 warnon with hang

For the record I have been able to hit it with the warn_on kernel patch that Dusty built, only it took bit more time. dmsg is in the attachment. I haven't noticed any new info in the log, but I might have missed it there.

I will try to reproduce it with the proposed patch, but I will not be able to post updates until next Monday.

Comment 54 Chris Murphy 2021-10-27 18:09:43 UTC
@jcajka yeah i don't see anything new there. If you get a chance, give this kernel a shot:

https://koji.fedoraproject.org/koji/taskinfo?taskID=77858541

That contains this patch from upstream https://lore.kernel.org/linux-btrfs/b03fb30f-3d4b-413c-0227-6655ffeba75d@suse.com/

I've tested it with help from Dusty, and it's 3 for 3 succesfully installing kernel debuginfo and 6 for 6 successfully installing libreoffice.

Comment 55 Jakub Čajka 2021-11-05 10:22:18 UTC
Sorry for longer reply time. For the record with the patched kernel I haven't been able to reproduce this. I have been trying for 3 days now. 

To not when I have been resetting me environment, the original issue this time around actually lead to corrupted root fs after the first oops. Host after the reboot would end up in the emergency shell as rootfs couldn't be mounted.

Comment 56 Chris Murphy 2021-11-05 15:49:14 UTC
The official fix is now in misc-next
https://lore.kernel.org/linux-btrfs/20211102124916.433836-1-nborisov@suse.com/

Comment 57 Justin M. Forbes 2021-11-08 12:34:54 UTC
Thanks, pulled in for 5.14.17 and 5.15.1

Comment 58 Chris Murphy 2021-11-09 03:53:28 UTC
>this time around actually lead to corrupted root fs

Jakub, do you have any of these corrupted file systems still? I'd like to let upstream know more about this particular consequence to the bug, as it might itself be a bug that the kernel and/or btrfs-progs needs to be able to handle better. Can you try to mount such a corrupt file system again and share any btrfs messages that appear in dmesg? Also, 'btrfs check /dev/sdXY' for the corrupted file system would also be helpful.

Comment 59 Fedora Update System 2021-11-09 12:54:53 UTC
FEDORA-2021-fdef34e26f has been submitted as an update to Fedora 35. https://bodhi.fedoraproject.org/updates/FEDORA-2021-fdef34e26f

Comment 60 Fedora Update System 2021-11-09 12:55:03 UTC
FEDORA-2021-7de33b7016 has been submitted as an update to Fedora 34. https://bodhi.fedoraproject.org/updates/FEDORA-2021-7de33b7016

Comment 61 Fedora Update System 2021-11-09 12:55:14 UTC
FEDORA-2021-8364530ebf has been submitted as an update to Fedora 33. https://bodhi.fedoraproject.org/updates/FEDORA-2021-8364530ebf

Comment 62 Fedora Update System 2021-11-10 03:23:25 UTC
FEDORA-2021-8364530ebf has been pushed to the Fedora 33 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-8364530ebf`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-8364530ebf

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 63 Fedora Update System 2021-11-10 04:04:14 UTC
FEDORA-2021-7de33b7016 has been pushed to the Fedora 34 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-7de33b7016`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-7de33b7016

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 64 Fedora Update System 2021-11-10 04:11:15 UTC
FEDORA-2021-fdef34e26f has been pushed to the Fedora 35 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-fdef34e26f`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-fdef34e26f

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 65 Jakub Čajka 2021-11-10 08:28:01 UTC
(In reply to Chris Murphy from comment #58)
> >this time around actually lead to corrupted root fs
> 
> Jakub, do you have any of these corrupted file systems still? I'd like to
> let upstream know more about this particular consequence to the bug, as it
> might itself be a bug that the kernel and/or btrfs-progs needs to be able to
> handle better. Can you try to mount such a corrupt file system again and
> share any btrfs messages that appear in dmesg? Also, 'btrfs check /dev/sdXY'
> for the corrupted file system would also be helpful.

Unfortunately not. To note the patched kernel is holding just fine.

I have been able to hit that(I guess I go unlucky) on the vexxhost provisioning from the I believe Fedora-Cloud-Base-35-20210922.n.0.aarch64 and then updating the kernel. Previously at worst it was the hung. Using later composes I haven't got unlucky and didn't hit the oops during kernel update.

Is btrfs check available in initramfs/emergency shell?

Comment 66 Chris Murphy 2021-11-12 02:53:32 UTC
>Is btrfs check available in initramfs/emergency shell?

yes

Comment 67 Fedora Update System 2021-11-14 03:44:04 UTC
FEDORA-2021-7de33b7016 has been pushed to the Fedora 34 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 68 Fedora Update System 2021-11-14 03:49:57 UTC
FEDORA-2021-8364530ebf has been pushed to the Fedora 33 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 69 Fedora Update System 2021-11-14 04:07:18 UTC
FEDORA-2021-fdef34e26f has been pushed to the Fedora 35 stable repository.
If problem still persists, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.