Bug 2080468
Summary: kdump on aarch64 AWS instances gets stuck
Product: [Fedora] Fedora
Component: kexec-tools
Version: 36
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: high
Hardware: Unspecified
OS: Unspecified
Type: Bug
Reporter: Dusty Mabe <dustymabe>
Assignee: Pingfan Liu <piliu>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: bhe, coxu, davdunc, piliu, ruyang, ryncsn, travier, vkuznets, xiliang
Doc Type: If docs needed, set a value
Last Closed: 2022-11-08 06:32:00 UTC

Description
Dusty Mabe
2022-04-29 17:45:40 UTC
Note here is a video recording of the entire process: https://dustymabe.fedorapeople.org/videos/2022-04-29_kdump-bz2080468.mp4

Thanks to dyoung I tried to remove "irqpoll" from KDUMP_COMMANDLINE_APPEND= in /etc/sysconfig/kdump. This worked for Fedora aarch64 AWS instances. So there's still some issue here to work out, but we know of a workaround.

Add knowledge article link. https://access.redhat.com/articles/6562431

Created attachment 1878498 [details]
/sys/kernel/irq/* information from AWS c6g.xlarge instance
Output of running the following script on the target hardware:
for irq in /sys/kernel/irq/*
do
    # Print the hardware IRQ number, registered handlers and chip for each IRQ
    echo "irq: $irq"
    echo "hwirq: $(cat "$irq/hwirq")"
    echo "actions: $(cat "$irq/actions")"
    echo "chip_name: $(cat "$irq/chip_name")"
    echo " "
done
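
The workaround mentioned in the description (removing "irqpoll" from KDUMP_COMMANDLINE_APPEND= in /etc/sysconfig/kdump) could be scripted roughly as below. This is only a minimal sketch: remove_irqpoll is a hypothetical helper name, not something shipped with kexec-tools, and it assumes GNU sed and a double-quoted KDUMP_COMMANDLINE_APPEND value on a single line.

```shell
#!/bin/sh
# Sketch only: drop the "irqpoll" token from KDUMP_COMMANDLINE_APPEND.
# remove_irqpoll is a hypothetical helper, not part of kexec-tools.
# Assumes GNU sed (for -i) and a value wrapped in double quotes.
remove_irqpoll() {
    conf="$1"    # path to the kdump sysconfig file, e.g. /etc/sysconfig/kdump
    # Restrict the edit to the KDUMP_COMMANDLINE_APPEND line.
    # First expression: "irqpoll" followed by at least one space.
    # Second expression: "irqpoll" as the last token before the closing quote.
    sed -i -e '/^KDUMP_COMMANDLINE_APPEND=/{s/irqpoll  *//;s/ *irqpoll"/"/;}' "$conf"
}

# Example use (as root, after backing up the file):
#   cp /etc/sysconfig/kdump /etc/sysconfig/kdump.bak
#   remove_irqpoll /etc/sysconfig/kdump
```

After such a change the kdump service has to be restarted (for example with kdumpctl restart) so that the crash kernel is reloaded with the new command line. Lines other than KDUMP_COMMANDLINE_APPEND are left untouched.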
cross referencing: https://bugzilla.redhat.com/show_bug.cgi?id=1654962

Could you upload the boot logs for the following combinations?

Group of the 1st kernel:
1. cmdline with ttyS0 for the 1st kernel's boot log
2. cmdline without ttyS0 for the 1st kernel's boot log

Group of the kdump kernel:
3. cmdline with ttyS0 with irqpoll in kdump kernel cmdline (the description is partial and may lose some hints)
4. cmdline without ttyS0 with irqpoll in kdump kernel cmdline

Thanks for your help.

(In reply to Pingfan Liu from comment #6)
> Could you upload the boot logs for the following combinations?
>
> Group of the 1st kernel:
> 1. cmdline with ttyS0 for the 1st kernel's boot log
> 2. cmdline without ttyS0 for the 1st kernel's boot log
>
> Group of the kdump kernel:
> 3. cmdline with ttyS0 with irqpoll in kdump kernel cmdline (the description
> is partial and may lose some hints)
> 4. cmdline without ttyS0 with irqpoll in kdump kernel cmdline
>
> Thanks for your help.

With Frank's help, I can access an AWS instance and begin to debug. And I can collect all the messages by myself now.

@Frank, thanks for your help

A little weird: I have tried upstream kernels 5.18/5.17/5.14 and I cannot reproduce this bug. I have also tried the RHEL kernels 5.14.0-70.el9 and 5.14.0-99.el9. They are free of this bug. But I did hit this issue with 5.14.0-70.13.1.el9_0.aarch64.

For the kdump kernel, the command line is
[ 0.000000] Kernel command line: BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.14.0-70.el9.aarch64 console=ttyS0,115200n8 console=tty0 net.ifnames=0 rd.blacklist=nouveau nvme_core.io_timeout=4294967295 iommu.strict=0 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory udev.children-max=2 panic=10 swiotlb=noforce novmcoredd cma=0 hugetlb_cma=0
...
[ 0.015559] DMI: Amazon EC2 c6g.xlarge/, BIOS 1.0 11/1/2018

I observe this issue on:

- 5.17.9-300.fc36.aarch64 (https://koji.fedoraproject.org/koji/search?terms=kernel-5.17.9-300.fc36&type=build&match=exact)
- 5.18.0-60.fc37.aarch64 (https://koji.fedoraproject.org/koji/search?terms=kernel-5.18.0-60.fc37&type=build&match=exact)

Do you observe the issue when running the Fedora kernels?

(In reply to Dusty Mabe from comment #9)
> I observe this issue on:
>
> - 5.17.9-300.fc36.aarch64
> (https://koji.fedoraproject.org/koji/search?terms=kernel-5.17.9-300.fc36&type=build&match=exact)
> - 5.18.0-60.fc37.aarch64
> (https://koji.fedoraproject.org/koji/search?terms=kernel-5.18.0-60.fc37&type=build&match=exact)
>
> Do you observe the issue when running the Fedora kernels?

I will try later, after tracing down the RHEL kernel 5.14.0-70.13.1.el9_0.aarch64. I picked the RHEL kernel for testing since it is easy to download each minor release for bisecting.

Bisecting 5.14.0-70.x: this bug can be reproduced with 5.14.0-70.13.1.el9_0, but not with 5.14.0-70.12.1.el9_0. But the commits seem unrelated to irqpoll or tty:

2b84b162f9b3 (tag: kernel-5.14.0-70.13.1.el9_0, tag: RHEL-9.0.0) [redhat] kernel-5.14.0-70.13.1.el9_0
a6008c855537 Merge: redhat: disable uncommon media device infrastructure
a2ce164afefb Merge: netfilter: heap out of bounds write in nf_dup_netdev.c since 5.4
7df3c94aa3a1 Merge: netfilter: nf_tables: validate registers coming from userspace.
44a4dd30077c Merge: scsi: iscsi: iSCSI Offload regression fixes
de3103fbfadf scsi: qedi: Fix failed disconnect handling
77fa8a4637da scsi: iscsi: Fix unbound endpoint error handling
a602e37b5547 scsi: iscsi: Fix conn cleanup and stop race during iscsid restart
711af464feaf scsi: iscsi: Fix endpoint reuse regression
c962bb5e8066 scsi: iscsi: Release endpoint ID when its freed
ce711a8d2f3d scsi: iscsi: Fix offload conn cleanup when iscsid restarts
6b7f5e6bd86e Revert "scsi: iscsi: Fix offload conn cleanup when iscsid restarts"
ef4d4002f567 scsi: iscsi: Speed up session unblocking and removal
6d3c125edaca scsi: iscsi: Fix recovery and unblocking race
0bae86ba1c35 scsi: qedi: Fix cmd_cleanup_cmpl counter mismatch issue
e9ff2c8b7487 scsi: iscsi: Unblock session then wake up error handler
623f01150f92 scsi: iscsi: Fix set_param() handling
40de9a34a363 scsi: iscsi: Fix iscsi_task use after free
1255087ae481 scsi: iscsi: Adjust iface sysfs attr detection
a1d592e5729f scsi: qedi: Add support for fastpath doorbell recovery
f276818f0070 redhat: disable uncommon media device infrastructure
e24e48cbf7c1 CI: Drop baseline runs
a1a8ee7551a8 (tag: kernel-5.14.0-70.12.1.el9_0) [redhat] kernel-5.14.0-70.12.1.el9_0

Can you link where you fixed it? We can not find the fix.

Apparently it's fixed by https://src.fedoraproject.org/rpms/kexec-tools/c/d55a0565585aa22db069cf5f5fa1955373be60b3

Yes. And originally it was fixed by https://src.fedoraproject.org/rpms/kexec-tools/c/d593bfa6fc5e2e894798e22fa9c4c433517de4b3

But I don't see this in any code in RHEL (or maybe I'm missing something). Can we get this fixed in RHEL?