Description of problem:

kdump on aarch64 AWS instances (in this case c6g.xlarge) gets stuck. This is somehow related to the serial console of the machine.

When setting up kdump and using sysrq to trigger a crash, we notice that the crash kernel hangs and never completes. It always gets stuck at a particular point:

```
[ 10.506150] printk: console [ttyS0] disabled
```

If I then type some characters into the serial console, the system (or the console) gets unstuck, but it looks like another kexec happens in the background. That kernel eventually bails out (though I do notice this interesting stack trace before it does):

```
[ 79.141909] irq 14: nobody cared (try booting with the "irqpoll" option)
[ 79.141916] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.17.4-300.fc36.aarch64 #1
[ 79.141919] Hardware name: Amazon EC2 c6g.xlarge/, BIOS 1.0 11/1/2018
[ 79.141921] Call trace:
[ 79.141922] dump_backtrace+0xfc/0x134
[ 79.141927] show_stack+0x24/0x6c
[ 79.141929] dump_stack_lvl+0x64/0x80
[ 79.141933] dump_stack+0x18/0x34
[ 79.141935] __report_bad_irq+0x54/0x16c
[ 79.141938] note_interrupt+0x30c/0x40c
[ 79.141942] handle_irq_event+0xec/0x180
[ 79.141944] handle_fasteoi_irq+0xcc/0x200
[ 79.141946] generic_handle_domain_irq+0x48/0x70
[ 79.141948] gic_handle_irq+0xc0/0x140
[ 79.141950] call_on_irq_stack+0x2c/0x38
[ 79.141952] do_interrupt_handler+0x88/0x90
[ 79.141955] el1_interrupt+0x34/0x54
[ 79.141959] el1h_64_irq_handler+0x18/0x24
[ 79.141961] el1h_64_irq+0x7c/0x80
[ 79.141963] arch_cpu_idle+0x18/0x2c
[ 79.141964] default_idle_call+0x4c/0x140
[ 79.141967] cpuidle_idle_call+0x14c/0x1a0
[ 79.141970] do_idle+0xb0/0x100
[ 79.141973] cpu_startup_entry+0x30/0x8c
[ 79.141976] rest_init+0xd0/0xe0
[ 79.141977] arch_call_rest_init+0x1c/0x28
[ 79.141980] start_kernel+0x484/0x4a0
[ 79.141981] __primary_switched+0xc0/0xc8
[ 79.141985] handlers:
[ 79.141986] [<00000000f4a19d33>] serial8250_interrupt
[ 79.141991] Disabling IRQ #14
[ 79.144145] pci 0000:00:01.0: [1d0f:8250] type 00 class 0x070003
[ 79.144261] pci 0000:00:01.0: reg 0x10: [mem 0x80118000-0x80118fff]
[ 79.144655] pci 0000:00:01.0: BAR 0: assigned [mem 0x80000000-0x80000fff]
[ 79.145089] printk: console [ttyS0] disabled
[ 79.145243] 0000:00:01.0: ttyS0 at MMIO 0x80000000 (irq = 14, base_baud = 115200) is a 16550A
[ 94.741159] printk: console [ttyS0] enabled
[ 94.744715] pci 0000:00:04.0: [1d0f:8061] type 00 class 0x010802
[ 94.746219] pci 0000:00:04.0: reg 0x10: [mem 0x80110000-0x80113fff]
[ 94.749505] pci 0000:00:04.0: PME# supported from D0 D1 D2 D3hot D3cold
```

and then the system seems to go through a reboot (i.e. I see grub and a full boot happens). At the end of all this, there are still never any files created in `/var/crash`.

Since the system initially got hung up on a message about the console, I decided to try the test after removing `console=ttyS0,115200n8` from the kernel command line. In this case the test passes, but I have no idea why.

We originally added `console=ttyS0,115200n8` to the kernel command line for these aarch64 instances because they wouldn't boot otherwise (see https://github.com/coreos/fedora-coreos-tracker/issues/920#issuecomment-914334988). It's possible that a separate BZ should be created for that and investigated on its own.

Version-Release number of selected component (if applicable):
kexec-tools-2.0.23-5.fc36
kernel-5.17.4-300.fc36

How reproducible:
always

Steps to Reproduce:
1. Boot AMI ami-09253652082332cd1 in us-east-1
2. Set a password for the `core` user
3. Get serial console access to the machine
   - https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/connect-to-serial-console.html#sc-connect-SSH
4. Set up the crash kernel
   - sudo rpm-ostree kargs --append='crashkernel=256M'
   - sudo systemctl enable kdump
   - sudo reboot
5. After reboot, trigger a crash
   - sudo su -
   - echo 1 > /proc/sys/kernel/sysrq
   - echo c > /proc/sysrq-trigger

Actual results:
Machine gets hung.

Expected results:
Crash dump file is created.

Additional info:
Note: here is a video recording of the entire process: https://dustymabe.fedorapeople.org/videos/2022-04-29_kdump-bz2080468.mp4
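To check the expected result from the description (a crash dump file being created), something like the following could be run once the machine is back up. This is only a minimal sketch: `has_vmcore` is a hypothetical helper name, and it assumes kdump writes files whose names start with `vmcore` somewhere under the dump directory (`/var/crash` in this report).

```shell
#!/bin/sh
# Hypothetical helper (not part of kexec-tools): succeed if any regular file
# whose name starts with "vmcore" exists under the given dump directory.
# Assumption: the dump directory defaults to /var/crash, as in this report.
has_vmcore() {
    dir="${1:-/var/crash}"
    # A missing directory counts as "no dump".
    [ -d "$dir" ] || return 1
    # Print at most one match; a non-empty result means a dump file exists.
    [ -n "$(find "$dir" -type f -name 'vmcore*' 2>/dev/null | head -n 1)" ]
}

# Usage after the machine has rebooted:
#   has_vmcore && echo "dump found" || echo "no dump in /var/crash"
```

In the failing case described above, this would report that no dump was found.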
Thanks to dyoung, I tried removing "irqpoll" from KDUMP_COMMANDLINE_APPEND= in /etc/sysconfig/kdump. This worked for Fedora aarch64 AWS instances. So there's still an issue to work out here, but we now know of a workaround.
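For anyone wanting to script the workaround, here is a rough sketch. `strip_irqpoll` is a hypothetical helper name, and the sketch assumes GNU sed (for `-i` and `\b`) and that the option appears on a line starting with `KDUMP_COMMANDLINE_APPEND=`. It only edits the file passed to it; reloading the crash kernel afterwards (for example by restarting the kdump service) is left to the caller.

```shell
#!/bin/sh
# Hypothetical helper: remove the word "irqpoll" (and any whitespace that
# follows it) from the KDUMP_COMMANDLINE_APPEND= line of a kdump sysconfig
# file. Other lines are left untouched.
# Assumption: GNU sed, as shipped on Fedora.
strip_irqpoll() {
    conf="$1"
    sed -i -E '/^KDUMP_COMMANDLINE_APPEND=/ s/\birqpoll\b[[:space:]]*//' "$conf"
}

# Usage on a real system (needs root); restart kdump afterwards so the
# crash kernel is reloaded with the new command line:
#   strip_irqpoll /etc/sysconfig/kdump
#   systemctl restart kdump
```

It may be worth keeping a copy of the original file first, since the edit is done in place.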
Add knowledge article link. https://access.redhat.com/articles/6562431
Created attachment 1878498
/sys/kernel/irq/* information from AWS c6g.xlarge instance

Output of running the following script on the target hardware:

for irq in /sys/kernel/irq/*
do
    echo "irq: $irq"
    echo "hwirq: $(cat "$irq/hwirq")"
    echo "actions: $(cat "$irq/actions")"
    echo "chip_name: $(cat "$irq/chip_name")"
    echo " "
done
cross referencing: https://bugzilla.redhat.com/show_bug.cgi?id=1654962
Could you upload the boot logs for the following combinations?

Group of the 1st kernel:
1. cmdline with ttyS0 for the 1st kernel's boot log
2. cmdline without ttyS0 for the 1st kernel's boot log

Group of the kdump kernel:
3. cmdline with ttyS0 with irqpoll in kdump kernel cmdline (the log in the description is partial and may be missing some hints)
4. cmdline without ttyS0 with irqpoll in kdump kernel cmdline

Thanks for your help.
(In reply to Pingfan Liu from comment #6)
> Could you upload the boot log for the following combination?
> group of the 1st kernel:
> 1. cmdline with ttyS0 for the 1st kernel's boot log
> 2. cmdline without ttyS0 for the 1st kernel's boot log
>
> group of the kdump kernel:
> 3. cmdline with ttyS0 with irqpoll in kdump kernel cmdline (the description is partial and may lose some hints)
> 4. cmdline without ttyS0 with irqpoll in kdump kernel cmdline
>
> Thanks for your help.

With Frank's help, I can access an AWS instance and begin debugging, so I can now collect all the messages myself.

@Frank, thanks for your help.
This is a little weird: I have tried upstream kernels 5.18, 5.17, and 5.14 and cannot reproduce this bug. I have also tried the RHEL kernels 5.14.0-70.el9 and 5.14.0-99.el9; they are free of this bug. But I did hit this issue with 5.14.0-70.13.1.el9_0.aarch64.

For the kdump kernel, the command line is:

[ 0.000000] Kernel command line: BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.14.0-70.el9.aarch64 console=ttyS0,115200n8 console=tty0 net.ifnames=0 rd.blacklist=nouveau nvme_core.io_timeout=4294967295 iommu.strict=0 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory udev.children-max=2 panic=10 swiotlb=noforce novmcoredd cma=0 hugetlb_cma=0
...
[ 0.015559] DMI: Amazon EC2 c6g.xlarge/, BIOS 1.0 11/1/2018
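When comparing the with/without ttyS0 and irqpoll combinations, it can help to check a command line for a specific argument mechanically. A small sketch follows; `cmdline_has` is a hypothetical helper name, and it matches whole space-separated words only.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the exact word $1 appears in the kernel
# command line given as $2. If $2 is omitted, /proc/cmdline of the running
# kernel is used instead.
cmdline_has() {
    word="$1"
    cmdline="${2:-$(cat /proc/cmdline)}"
    # Pad with spaces so that only whole words match.
    case " $cmdline " in
        *" $word "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   cmdline_has irqpoll && echo "irqpoll is set"
#   cmdline_has console=ttyS0,115200n8 && echo "serial console is set"
```

Run inside the kdump kernel, this looks at that kernel's own `/proc/cmdline`; a line copied from a boot log can be passed as the second argument.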
I observe this issue on:

- 5.17.9-300.fc36.aarch64 (https://koji.fedoraproject.org/koji/search?terms=kernel-5.17.9-300.fc36&type=build&match=exact)
- 5.18.0-60.fc37.aarch64 (https://koji.fedoraproject.org/koji/search?terms=kernel-5.18.0-60.fc37&type=build&match=exact)

Do you observe the issue when running the Fedora kernels?
(In reply to Dusty Mabe from comment #9)
> I observe this issue on:
>
> - 5.17.9-300.fc36.aarch64 (https://koji.fedoraproject.org/koji/search?terms=kernel-5.17.9-300.fc36&type=build&match=exact)
> - 5.18.0-60.fc37.aarch64 (https://koji.fedoraproject.org/koji/search?terms=kernel-5.18.0-60.fc37&type=build&match=exact)
>
> Do you observe the issue when running the Fedora kernels?

I will try later, after tracking this down in RHEL kernel 5.14.0-70.13.1.el9_0.aarch64. I picked the RHEL kernel for testing since it is easy to download each minor release for bisecting.
Bisecting 5.14.0-70.x: this bug can be reproduced with 5.14.0-70.13.1.el9_0 but not with 5.14.0-70.12.1.el9_0. But the commits seem unrelated to irqpoll or tty:

2b84b162f9b3 (tag: kernel-5.14.0-70.13.1.el9_0, tag: RHEL-9.0.0) [redhat] kernel-5.14.0-70.13.1.el9_0
a6008c855537 Merge: redhat: disable uncommon media device infrastructure
a2ce164afefb Merge: netfilter: heap out of bounds write in nf_dup_netdev.c since 5.4
7df3c94aa3a1 Merge: netfilter: nf_tables: validate registers coming from userspace.
44a4dd30077c Merge: scsi: iscsi: iSCSI Offload regression fixes
de3103fbfadf scsi: qedi: Fix failed disconnect handling
77fa8a4637da scsi: iscsi: Fix unbound endpoint error handling
a602e37b5547 scsi: iscsi: Fix conn cleanup and stop race during iscsid restart
711af464feaf scsi: iscsi: Fix endpoint reuse regression
c962bb5e8066 scsi: iscsi: Release endpoint ID when its freed
ce711a8d2f3d scsi: iscsi: Fix offload conn cleanup when iscsid restarts
6b7f5e6bd86e Revert "scsi: iscsi: Fix offload conn cleanup when iscsid restarts"
ef4d4002f567 scsi: iscsi: Speed up session unblocking and removal
6d3c125edaca scsi: iscsi: Fix recovery and unblocking race
0bae86ba1c35 scsi: qedi: Fix cmd_cleanup_cmpl counter mismatch issue
e9ff2c8b7487 scsi: iscsi: Unblock session then wake up error handler
623f01150f92 scsi: iscsi: Fix set_param() handling
40de9a34a363 scsi: iscsi: Fix iscsi_task use after free
1255087ae481 scsi: iscsi: Adjust iface sysfs attr detection
a1d592e5729f scsi: qedi: Add support for fastpath doorbell recovery
f276818f0070 redhat: disable uncommon media device infrastructure
e24e48cbf7c1 CI: Drop baseline runs
a1a8ee7551a8 (tag: kernel-5.14.0-70.12.1.el9_0) [redhat] kernel-5.14.0-70.12.1.el9_0
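A commit range like the one above can be listed with `git log` between the two release tags. A minimal sketch, assuming a checkout of the kernel source tree that carries both tags; `commits_between` is a hypothetical helper name. Note that the `old..new` range excludes the commit the older tag points at, so the output would stop one line short of the list above.

```shell
#!/bin/sh
# Hypothetical helper: list the commits reachable from tag $2 but not from
# tag $1, one per line, in the git repository of the current directory.
commits_between() {
    git log --oneline "$1..$2"
}

# Usage inside a kernel source checkout that has both tags:
#   commits_between kernel-5.14.0-70.12.1.el9_0 kernel-5.14.0-70.13.1.el9_0
```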
Can you link to where you fixed it? We cannot find the fix.
Apparently it's fixed by https://src.fedoraproject.org/rpms/kexec-tools/c/d55a0565585aa22db069cf5f5fa1955373be60b3
Yes. And originally it was fixed by https://src.fedoraproject.org/rpms/kexec-tools/c/d593bfa6fc5e2e894798e22fa9c4c433517de4b3 But I don't see this in any code in RHEL (or maybe I'm missing something). Can we get this fixed in RHEL?