Bug 2121791
Summary: | kernel regression causing mdraid systems to hang during reboot
---|---
Product: | [Fedora] Fedora
Reporter: | Dusty Mabe <dustymabe>
Component: | kernel
Assignee: | Kernel Maintainer List <kernel-maint>
Status: | CLOSED ERRATA
QA Contact: | Fedora Extras Quality Assurance <extras-qa>
Severity: | unspecified
Priority: | unspecified
Version: | rawhide
CC: | acaringi, adscvr, airlied, alciregi, awilliam, bcotton, bskeggs, eric.eisenhart, hdegoede, hpa, jarodwilson, jforbes, jglisse, jonathan, josef, kernel-maint, lam, lgoncalv, linville, masami256, mchehab, minlei, mironov.ivan, ncroxon, norbert.jurkeit, pgnet.dev, ptalbert, robatino, steved, xni
Target Milestone: | ---
Target Release: | ---
Hardware: | Unspecified
OS: | Unspecified
Whiteboard: | AcceptedBlocker
Fixed In Version: | kernel-5.19.6-300.fc37
Doc Type: | If docs needed, set a value
Story Points: | ---
Last Closed: | 2022-09-02 22:27:35 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
Category: | ---
oVirt Team: | ---
Cloudforms Team: | ---
Bug Blocks: | 2009537
Description
Dusty Mabe
2022-08-26 15:59:21 UTC
Created attachment 1907916: console.txt
Created attachment 1907917: journal.txt
I did a git bisect between v5.19-rc3 and v5.19-rc4. I believe the first bad commit is a09b314005f3:

```
$ git bisect bad
a09b314005f3a0956ebf56e01b3b80339df577cc is the first bad commit
commit a09b314005f3a0956ebf56e01b3b80339df577cc
Author: Christoph Hellwig <hch>
Date:   Tue Jun 14 09:48:27 2022 +0200

    block: freeze the queue earlier in del_gendisk

    Freeze the queue earlier in del_gendisk so that the state does not
    change while we remove debugfs and sysfs files.

    Ming mentioned that being able to observer request in debugfs might
    be useful while the queue is being frozen in del_gendisk, which is
    made possible by this change.

    Signed-off-by: Christoph Hellwig <hch>
    Link: https://lore.kernel.org/r/20220614074827.458955-5-hch@lst.de
    Signed-off-by: Jens Axboe <axboe>

 block/genhd.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
```

Reverting this commit and building on top of latest git master (4c612826b) gave me successful results.

This happens to me on Fedora 36 x86_64 with kernel 5.19.4. I see a lot of "block device autoloading is deprecated and will be removed" during reboot or poweroff on a machine with mdadm RAID6, and reboot/poweroff is not happening. I can confirm that rebuilding 5.19.4 with "block: freeze the queue earlier in del_gendisk" reverted fixes this. Interestingly, Fedora 36 aarch64 with kernel 5.19.4 on an Allwinner H6 SBC with mdadm RAID1 is not affected.

F36 (x86-64), mdadm raid1 on / (no separate /boot, but that doesn't seem important). I've had this issue with 5.19.1 (from https://koji.fedoraproject.org/koji/buildinfo?buildID=2044709) and now after the official 5.19.4 F36 update, more people will see their reboots hang with these messages. I came here to remind future visitors that Alt-SysRq-B helps :)

Same for me on a PC with several mdadm raid1 devices after upgrade from 5.18.19-200.fc36.x86_64 to 5.19.4-200.fc36.x86_64. This didn't happen however during kernel test days running kerneltest-5.19.1.iso on the same hardware.
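For anyone repeating the bisect from the description, the general shape of the workflow can be sketched as below. This is only a sketch under assumptions: `bisect_first_bad` is a hypothetical helper name, and the test command passed to it must exit 0 on a good build and 1 on a bad one. A reboot hang is hard to detect automatically, so the original bisect was most likely done by marking each step by hand with `git bisect good` / `git bisect bad`.

```shell
#!/bin/sh
# Sketch: find the first bad commit between a known-good and a known-bad ref.
# Assumption: run inside a git checkout (e.g. a Linux kernel tree).
# Hypothetical helper, not part of git itself.
#
# usage: bisect_first_bad <good-ref> <bad-ref> <test-command...>
# The test command must exit 0 for "good" and 1 for "bad".
bisect_first_bad() {
    good=$1
    bad=$2
    shift 2

    # git bisect start takes the bad ref first, then the good ref.
    git bisect start "$bad" "$good" >/dev/null || return 1

    # Let git drive the search using the supplied test command.
    git bisect run "$@" >/dev/null 2>&1
    rc=$?

    # When the search finishes, refs/bisect/bad points at the first bad commit.
    first_bad=$(git rev-parse refs/bisect/bad)

    # Return the checkout to where it was before the bisect.
    git bisect reset >/dev/null 2>&1

    [ "$rc" -eq 0 ] && printf '%s\n' "$first_bad"
}
```

For the range in this report the call would look like `bisect_first_bad v5.19-rc3 v5.19-rc4 ./reboot-test.sh`, where `./reboot-test.sh` is a placeholder for whatever builds, boots and checks the kernel. Once the commit is known, `git revert a09b314005f3` on top of the tree under test reproduces the "revert and rebuild" check described above.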
Proposed as a Freeze Exception for 37-beta by Fedora user dustymabe using the blocker tracking app because:

Machines with RAID1 setups should be able to shutdown/reboot without hanging.

fwiw, given source of the logged message,

    [v2] block: deprecate autoloading based on dev_t
    https://patchwork.kernel.org/project/linux-block/patch/20220104071647.164918-1-hch@lst.de/#24842631
    [PATCH] block: deprecate autoloading based on dev_t
    https://patchwork.kernel.org/project/linux-block/patch/20220104071647.164918-1-hch@lst.de/#24842631

here, with

    uname -rm
    5.19.4-200.fc36.x86_64 x86_64

changing, edit /etc/mdadm.conf

    MAILADDR root
    - AUTO +imsm +1.x -all
    + #AUTO +imsm +1.x -all
    - ARRAY /dev/md/0 level=raid1 num-devices=2 UUID=11...bb
    - ARRAY /dev/md/1 level=raid1 num-devices=2 UUID=22...cc
    + ARRAY /dev/md0 level=raid1 num-devices=2 metadata=1.2 UUID=11...bb name=dev003:0
    + ARRAY /dev/md1 level=raid1 num-devices=2 metadata=1.2 UUID=22...cc name=dev003:1

seems to consistently eliminate the message on boot start

before edit,

    mdadm --detail --scan
    ARRAY /dev/md/dev003:0 metadata=1.2 name=dev003:0 UUID=11...bb
    ARRAY /dev/md/dev003:1 metadata=1.2 name=dev003:1 UUID=22...cc

    ls -al /dev/md/dev003\:* /dev/md{0,1}
    brw-rw---- 1 root disk 9, 0 Aug 31 12:42 /dev/md0
    brw-rw---- 1 root disk 9, 1 Aug 31 12:42 /dev/md1
    lrwxrwxrwx 1 root root 6 Aug 31 12:42 /dev/md/dev003:0 -> ../md0
    lrwxrwxrwx 1 root root 6 Aug 31 12:42 /dev/md/dev003:1 -> ../md1

    dmesg | grep deprecated
    [ 7.026798] block device autoloading is deprecated and will be removed.

after edit,

    mdadm --detail --scan
    ARRAY /dev/md0 metadata=1.2 name=dev003:0 UUID=11...bb
    ARRAY /dev/md1 metadata=1.2 name=dev003:1 UUID=22...cc

    ls -al /dev/md/dev003\:* /dev/md{0,1}
    ls: cannot access '/dev/md/dev003:*': No such file or directory
    brw-rw---- 1 root disk 9, 0 Aug 31 12:42 /dev/md0
    brw-rw---- 1 root disk 9, 1 Aug 31 12:42 /dev/md1

    dmesg | grep deprecated
    (empty)

and, on this one test machine, eliminates boot hang/loop on restart; have NOT tested more broadly yet

> and, on this one test machine, eliminates boot hang/loop on restart; have NOT tested more broadly yet
tested these changes on 4 machines.
2 stopped looping on boot, 2 continue to do so.
there's more to this ...
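The before/after checks in the comment above can be wrapped in two small filters, sketched here. Both function names are hypothetical, and both only read text from standard input, so they work on saved logs as well as live output. As the follow-up notes, clearing the warning did not stop the hang on every machine, so treat this as a way to spot the symptom, not as a fix.

```shell
#!/bin/sh
# Sketch: filters for the two checks shown in the comment above.
# Both helper names are hypothetical.

# Count kernel log lines carrying the autoload deprecation warning.
# grep -c exits non-zero when the count is 0, so normalise the status.
count_autoload_warnings() {
    grep -c 'block device autoloading is deprecated' || true
}

# From `mdadm --detail --scan` style output, print the ARRAY device
# names that live under /dev/md/ (the named-symlink form).
named_md_arrays() {
    awk '$1 == "ARRAY" && $2 ~ /^\/dev\/md\// { print $2 }'
}
```

Typical use would be `dmesg | count_autoload_warnings` and `mdadm --detail --scan | named_md_arrays`; both commands may need root on some systems.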
Proposed as a Blocker for 37-beta by Fedora user bcotton using the blocker tracking app because:

Adding an F37 Beta blocker nomination to the existing FE nomination. This seems like a violation of the basic release criterion:

It must be possible to trigger a clean system shutdown using standard console commands.
https://fedoraproject.org/wiki/Basic_Release_Criteria#Shutdown

The commit in question was reverted in 5.19.6-300.fc37, it looks like:

https://koji.fedoraproject.org/koji/buildinfo?buildID=2055666
"- Revert "block: freeze the queue earlier in del_gendisk" (Justin M. Forbes)"

can you try that kernel and see if it helps?

Yes, it was reverted, and discussed. I do not want this bug closed because upstream has made no movement on this yet. It is not reverted from Rawhide and won't be reverted in the 6.0 branch unless upstream chooses to do so. Leaving this open will help me track it without forgetting.

We don't have to close the bug, but for F37 Beta purposes, I need to know if that update addresses it, so we can pull it into Beta.

fwiw, on 2 F36 boxes, with

    uname -rm
    5.19.4-200.fc36.x86_64 x86_64

hanging in loop @ shutdown, upgrading to

    uname -rm
    5.19.6-200.fc36.x86_64 x86_64

, remaking init, after reboot, subsequent reboots are OK. no more loop. no testing beyond that.

FEDORA-2022-ccb0138bb6 has been submitted as an update to Fedora 37. https://bodhi.fedoraproject.org/updates/FEDORA-2022-ccb0138bb6

5.19.6 builds seem to be working for me.

FEDORA-2022-ccb0138bb6 has been pushed to the Fedora 37 testing repository. Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2022-ccb0138bb6`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2022-ccb0138bb6
See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
I haven't installed F37 yet but can confirm that 5.19.6 builds fix the issue for me with F35 and F36.

+5 in https://pagure.io/fedora-qa/blocker-review/issue/882 , marking accepted.

FEDORA-2022-ccb0138bb6 has been pushed to the Fedora 37 stable repository. If problem still persists, please make note of it in this bug report.

The revert for the offending kernel commit landed upstream in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4c66a326b5ab784cddd72de07ac5b6210e9e1b06
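One way to confirm that an installed Fedora kernel carries the revert is to look for the changelog entry quoted earlier in this report. The sketch below assumes the entry appears in the package changelog exactly as quoted; `has_del_gendisk_revert` is a hypothetical helper name, and `kernel-core` as the package to query is also an assumption.

```shell
#!/bin/sh
# Sketch: succeed if changelog text on stdin contains the revert entry
# quoted in this bug. Hypothetical helper; the entry text is taken from
# the koji changelog line cited above.
has_del_gendisk_revert() {
    grep -qF 'Revert "block: freeze the queue earlier in del_gendisk"'
}
```

Usage might look like `rpm -q --changelog kernel-core | has_del_gendisk_revert && echo "revert present"`. Kernels built from trees that already include the upstream revert may not list a separate changelog entry, so a negative result here is not proof that the kernel is affected.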