Bug 2080468 - kdump on aarch64 AWS instances gets stuck
Summary: kdump on aarch64 AWS instances gets stuck
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kexec-tools
Version: 36
Hardware: Unspecified
OS: Unspecified
high
unspecified
Target Milestone: ---
Assignee: Pingfan Liu
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-04-29 17:45 UTC by Dusty Mabe
Modified: 2023-09-07 16:45 UTC

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2022-11-08 06:32:00 UTC
Type: Bug
Embargoed:


Attachments
/sys/kernel/irq/* information from AWS c6g.xlarge instance (4.46 KB, text/plain)
2022-05-11 01:51 UTC, Dusty Mabe


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker FC-439 0 None None None 2022-04-29 17:59:12 UTC
Red Hat Knowledge Base (Article) 6562431 0 None None None 2022-06-10 02:04:01 UTC

Description Dusty Mabe 2022-04-29 17:45:40 UTC
Description of problem:

kdump on aarch64 AWS instances (in this case c6g.xlarge) gets stuck. This is somehow related to the serial console of the machine.

After setting up kdump and using sysrq to trigger a crash, we notice that the crash kernel hangs and never completes. It always gets stuck at the same point:


```
[   10.506150] printk: console [ttyS0] disabled
```

If I then type some characters into the serial console, the system (or the console) gets unstuck, but it looks like another kexec happens in the background. That kernel eventually bails out, though I do notice this interesting stack trace before it does:

```
[   79.141909] irq 14: nobody cared (try booting with the "irqpoll" option)                                                                                   
[   79.141916] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.17.4-300.fc36.aarch64 #1 
[   79.141919] Hardware name: Amazon EC2 c6g.xlarge/, BIOS 1.0 11/1/2018       
[   79.141921] Call trace:                                                         
[   79.141922]  dump_backtrace+0xfc/0x134                          
[   79.141927]  show_stack+0x24/0x6c                                               
[   79.141929]  dump_stack_lvl+0x64/0x80                                           
[   79.141933]  dump_stack+0x18/0x34                                               
[   79.141935]  __report_bad_irq+0x54/0x16c            
[   79.141938]  note_interrupt+0x30c/0x40c          
[   79.141942]  handle_irq_event+0xec/0x180                               
[   79.141944]  handle_fasteoi_irq+0xcc/0x200            
[   79.141946]  generic_handle_domain_irq+0x48/0x70                              
[   79.141948]  gic_handle_irq+0xc0/0x140                                   
[   79.141950]  call_on_irq_stack+0x2c/0x38                                 
[   79.141952]  do_interrupt_handler+0x88/0x90         
[   79.141955]  el1_interrupt+0x34/0x54                                            
[   79.141959]  el1h_64_irq_handler+0x18/0x24                                                                                                                          
[   79.141961]  el1h_64_irq+0x7c/0x80
[   79.141963]  arch_cpu_idle+0x18/0x2c                                                                                                                                
[   79.141964]  default_idle_call+0x4c/0x140                                    
[   79.141967]  cpuidle_idle_call+0x14c/0x1a0          
[   79.141970]  do_idle+0xb0/0x100                                                 
[   79.141973]  cpu_startup_entry+0x30/0x8c    
[   79.141976]  rest_init+0xd0/0xe0
[   79.141977]  arch_call_rest_init+0x1c/0x28
[   79.141980]  start_kernel+0x484/0x4a0
[   79.141981]  __primary_switched+0xc0/0xc8
[   79.141985] handlers:
[   79.141986] [<00000000f4a19d33>] serial8250_interrupt
[   79.141991] Disabling IRQ #14
[   79.144145] pci 0000:00:01.0: [1d0f:8250] type 00 class 0x070003
[   79.144261] pci 0000:00:01.0: reg 0x10: [mem 0x80118000-0x80118fff]
[   79.144655] pci 0000:00:01.0: BAR 0: assigned [mem 0x80000000-0x80000fff]
[   79.145089] printk: console [ttyS0] disabled
[   79.145243] 0000:00:01.0: ttyS0 at MMIO 0x80000000 (irq = 14, base_baud = 115200) is a 16550A
[   94.741159] printk: console [ttyS0] enabled
[   94.744715] pci 0000:00:04.0: [1d0f:8061] type 00 class 0x010802
[   94.746219] pci 0000:00:04.0: reg 0x10: [mem 0x80110000-0x80113fff]
[   94.749505] pci 0000:00:04.0: PME# supported from D0 D1 D2 D3hot D3cold
```


The system then seems to go through a reboot (i.e. I see GRUB and a full boot happens). At the end of all this, no files are ever created in `/var/crash`.

Since the system initially got hung up on a message about the console, I decided to try the test after removing `console=ttyS0,115200n8` from the kernel command line. In this case the test passes, but I have no idea why.

We originally added `console=ttyS0,115200n8` to the kernel command line for these aarch64 instances because they wouldn't boot otherwise (see https://github.com/coreos/fedora-coreos-tracker/issues/920#issuecomment-914334988). It's possible that a separate BZ should be created for that problem and investigated on its own.


Version-Release number of selected component (if applicable):


kexec-tools-2.0.23-5.fc36
kernel-5.17.4-300.fc36

How reproducible:

always

Steps to Reproduce:
1. Boot AMI ami-09253652082332cd1 in us-east-1
2. Set a password for the `core` user
3. Get serial console access to the machine - https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/connect-to-serial-console.html#sc-connect-SSH
4. Set up the crash kernel
    - sudo rpm-ostree kargs --append='crashkernel=256M'
    - sudo systemctl enable kdump
    - sudo reboot
5. After the reboot, trigger a crash
    - sudo su -
    - echo 1 > /proc/sys/kernel/sysrq
    - echo c > /proc/sysrq-trigger
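
Before triggering the crash in step 5, it may be worth confirming that the reservation from step 4 took effect. The following is only a sketch: `kdump_ready` is a hypothetical helper (not part of kexec-tools), and its optional root-prefix argument exists purely so it can be tried against a fake directory tree. It relies on `/proc/cmdline`, `/sys/kernel/kexec_crash_size` and `/sys/kernel/kexec_crash_loaded`.

```shell
# Hypothetical helper: succeed only if the running kernel looks ready for kdump.
# $1 is an optional root prefix (assumption, for testing); empty means the real system.
kdump_ready() {
    root="${1:-}"
    # crashkernel= must be present on the running kernel's command line.
    tr ' ' '\n' < "$root/proc/cmdline" | grep -q '^crashkernel=' || return 1
    # The kernel must have reserved a non-zero amount of crash memory.
    [ "$(cat "$root/sys/kernel/kexec_crash_size")" -gt 0 ] || return 1
    # The kdump service must have loaded a crash kernel into that memory.
    [ "$(cat "$root/sys/kernel/kexec_crash_loaded")" = 1 ]
}
```

For example, `kdump_ready && echo "ok to trigger the crash"` could be run right before `echo c > /proc/sysrq-trigger`.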

Actual results:

The machine hangs.

Expected results:

Crash dump file is created.
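
One way to check the expected result after the machine comes back up is sketched below. It assumes the default layout, in which kdump writes a file named `vmcore` into a per-crash subdirectory of `/var/crash`; `has_vmcore` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if a file named "vmcore" exists under the dump directory.
# Assumes the default target /var/crash/<per-crash-dir>/vmcore; pass another path as $1.
has_vmcore() {
    dir="${1:-/var/crash}"
    [ -d "$dir" ] || return 1
    # find prints matching paths; any non-empty output means a dump was written.
    [ -n "$(find "$dir" -type f -name vmcore 2>/dev/null | head -n 1)" ]
}
```

Running `has_vmcore || echo "no crash dump found"` as root would then report the failure described in this bug.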


Additional info:

Comment 1 Dusty Mabe 2022-04-29 17:49:16 UTC
Note: here is a video recording of the entire process: https://dustymabe.fedorapeople.org/videos/2022-04-29_kdump-bz2080468.mp4

Comment 2 Dusty Mabe 2022-05-02 02:29:38 UTC
Thanks to dyoung, I tried removing "irqpoll" from KDUMP_COMMANDLINE_APPEND= in /etc/sysconfig/kdump. With that change, kdump works on Fedora aarch64 AWS instances.

So there's still some issue here to work out, but we know of a workaround.
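
For anyone who wants to script the workaround, here is a minimal sketch. It assumes `KDUMP_COMMANDLINE_APPEND` is defined on a single double-quoted line in the file; `strip_irqpoll` is a hypothetical helper name, not something shipped by kexec-tools.

```shell
# Hypothetical helper: remove the "irqpoll" word from KDUMP_COMMANDLINE_APPEND.
# Assumes a single line of the form KDUMP_COMMANDLINE_APPEND="...".
# A backup of the original file is kept next to it as <file>.bak.
strip_irqpoll() {
    conf="$1"
    tmp="$(mktemp)" || return 1
    # Only touch the KDUMP_COMMANDLINE_APPEND line; match irqpoll as a whole word
    # that follows the opening quote or whitespace.
    sed -e '/^KDUMP_COMMANDLINE_APPEND=/ {' \
        -e 's/\([[:space:]"]\)irqpoll[[:space:]]\{1,\}/\1/' \
        -e 's/\([[:space:]"]\)irqpoll"/\1"/' \
        -e '}' "$conf" > "$tmp" &&
        cp -- "$conf" "$conf.bak" &&
        cat -- "$tmp" > "$conf"
    status=$?
    rm -f -- "$tmp"
    return "$status"
}
```

On a real system this would be something like `sudo sh -c '. ./strip_irqpoll.sh; strip_irqpoll /etc/sysconfig/kdump'` (the script file name is made up), followed by `sudo systemctl restart kdump` so the crash kernel is reloaded with the new command line.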

Comment 3 Frank Liang 2022-05-09 02:59:15 UTC
Adding the knowledge base article link:
https://access.redhat.com/articles/6562431

Comment 4 Dusty Mabe 2022-05-11 01:51:21 UTC
Created attachment 1878498
/sys/kernel/irq/* information from AWS c6g.xlarge instance

Output of running the following script on the target hardware:

# Print the hardware IRQ number, registered handlers and irqchip name of every IRQ.
for irq in /sys/kernel/irq/*
do
        echo "irq: $irq"
        echo "hwirq: $(cat "$irq/hwirq")"
        echo "actions: $(cat "$irq/actions")"
        echo "chip_name: $(cat "$irq/chip_name")"
        echo " "
done
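
To go straight from a handler name to its IRQ number (for example to see which IRQ the serial console uses), a narrower variant of the script above could look like the sketch below. `irqs_for_action` is a hypothetical helper; it relies on the `actions` file being a comma-separated list of names, and the action name `ttyS0` used in the example is an assumption about this instance.

```shell
# Hypothetical helper: print the IRQ numbers whose "actions" list contains $1.
# $2 optionally overrides the sysfs directory (assumption, for testing).
irqs_for_action() {
    name="$1"
    root="${2:-/sys/kernel/irq}"
    for irq in "$root"/*
    do
        [ -r "$irq/actions" ] || continue
        # actions is a comma-separated list; compare each entry as a whole string.
        if tr ',' '\n' < "$irq/actions" | grep -qxF -- "$name"; then
            echo "${irq##*/}"
        fi
    done
}
```

For instance, `irqs_for_action ttyS0` would be expected to print the IRQ that `serial8250_interrupt` is registered on.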

Comment 5 Dusty Mabe 2022-05-11 03:18:55 UTC
cross referencing: https://bugzilla.redhat.com/show_bug.cgi?id=1654962

Comment 6 Pingfan Liu 2022-05-17 08:37:14 UTC
Could you upload the boot logs for the following combinations?

Group 1, the 1st kernel:
1. cmdline with ttyS0 (boot log of the 1st kernel)
2. cmdline without ttyS0 (boot log of the 1st kernel)

Group 2, the kdump kernel:
3. cmdline with ttyS0, with irqpoll in the kdump kernel cmdline (the log in the description is partial and may be missing some hints)
4. cmdline without ttyS0, with irqpoll in the kdump kernel cmdline


Thanks for your help.

Comment 7 Pingfan Liu 2022-05-25 01:00:17 UTC
(In reply to Pingfan Liu from comment #6)
> Could you upload the boot log for the following combination?
> group of the 1st kernel:
> 1. cmdline with ttyS0 for the 1st kernel's boot log
> 2. cmdline without ttyS0 for the 1st kernel's boot log
> 
> 
> group of the kdump kernel:
> 3. cmdline with ttyS0 with irqpoll in kdump kernel cmdline (the description
> is partial and may lose some hints)
> 4. cmdline without ttyS0 with irqpoll in kdump kernel cmdline 
> 
> 
> Thanks for your help.

With Frank's help, I now have access to an AWS instance and can begin debugging. I can collect all the messages myself now.

@Frank, thanks for your help.

Comment 8 Pingfan Liu 2022-05-26 09:08:39 UTC
This is a little weird: I have tried upstream kernels 5.18, 5.17 and 5.14, and I cannot reproduce this bug.

I have also tried the RHEL kernels 5.14.0-70.el9 and 5.14.0-99.el9. They are free of this bug.
But I did hit this issue with 5.14.0-70.13.1.el9_0.aarch64.

For the kdump kernel, the command line is:
[    0.000000] Kernel command line: BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.14.0-70.el9.aarch64 console=ttyS0,115200n8 console=tty0 net.ifnames=0 rd.blacklist=nouveau nvme_core.io_timeout=4294967295 iommu.strict=0  irqpoll nr_cpus=1 reset_devices cgroup_disable=memory udev.children-max=2 panic=10 swiotlb=noforce novmcoredd cma=0 hugetlb_cma=0
...
[    0.015559] DMI: Amazon EC2 c6g.xlarge/, BIOS 1.0 11/1/2018
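
When comparing logs from different kernels, it can help to confirm from the captured console output that a given parameter such as `irqpoll` really reached the kdump kernel. A small sketch follows; `log_has_karg` is a hypothetical helper, and it assumes the log contains a `Kernel command line:` line like the one above.

```shell
# Hypothetical helper: succeed if the first "Kernel command line:" entry in a saved
# boot log ($1) contains the exact parameter $2 (e.g. irqpoll).
log_has_karg() {
    log="$1"
    word="$2"
    # Cut everything up to the marker, split the rest into one parameter per line,
    # then look for an exact whole-line match.
    sed -n 's/^.*Kernel command line: //p' "$log" | head -n 1 |
        tr -s ' ' '\n' | grep -qxF -- "$word"
}
```

For example, `log_has_karg kdump-console.log irqpoll && echo "irqpoll is set"`, where `kdump-console.log` is a made-up file name for the saved serial console output.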

Comment 9 Dusty Mabe 2022-05-26 14:04:57 UTC
I observe this issue on:

- 5.17.9-300.fc36.aarch64 (https://koji.fedoraproject.org/koji/search?terms=kernel-5.17.9-300.fc36&type=build&match=exact)
- 5.18.0-60.fc37.aarch64 (https://koji.fedoraproject.org/koji/search?terms=kernel-5.18.0-60.fc37&type=build&match=exact)

Do you observe the issue when running the Fedora kernels?

Comment 10 Pingfan Liu 2022-05-27 01:18:02 UTC
(In reply to Dusty Mabe from comment #9)
> I observe this issue on:
> 
> - 5.17.9-300.fc36.aarch64
> (https://koji.fedoraproject.org/koji/search?terms=kernel-5.17.9-300.fc36&type=build&match=exact)
> - 5.18.0-60.fc37.aarch64
> (https://koji.fedoraproject.org/koji/search?terms=kernel-5.18.0-60.fc37&type=build&match=exact)
> 
> Do you observe the issue when running the Fedora kernels?

I will try that later, after tracking down the problem in RHEL kernel 5.14.0-70.13.1.el9_0.aarch64.

I picked the RHEL kernel for testing because it is easy to download each minor release for bisecting.

Comment 11 Pingfan Liu 2022-05-27 04:46:52 UTC
Bisecting 5.14.0-70.x: this bug can be reproduced with 5.14.0-70.13.1.el9_0 but not with 5.14.0-70.12.1.el9_0.

However, the commits between the two releases seem unrelated to irqpoll or tty:

2b84b162f9b3 (tag: kernel-5.14.0-70.13.1.el9_0, tag: RHEL-9.0.0) [redhat] kernel-5.14.0-70.13.1.el9_0
a6008c855537 Merge: redhat: disable uncommon media device infrastructure
a2ce164afefb Merge: netfilter: heap out of bounds write in nf_dup_netdev.c since 5.4
7df3c94aa3a1 Merge: netfilter: nf_tables: validate registers coming from userspace.
44a4dd30077c Merge: scsi: iscsi: iSCSI Offload regression fixes
de3103fbfadf scsi: qedi: Fix failed disconnect handling
77fa8a4637da scsi: iscsi: Fix unbound endpoint error handling
a602e37b5547 scsi: iscsi: Fix conn cleanup and stop race during iscsid restart
711af464feaf scsi: iscsi: Fix endpoint reuse regression
c962bb5e8066 scsi: iscsi: Release endpoint ID when its freed
ce711a8d2f3d scsi: iscsi: Fix offload conn cleanup when iscsid restarts
6b7f5e6bd86e Revert "scsi: iscsi: Fix offload conn cleanup when iscsid restarts"
ef4d4002f567 scsi: iscsi: Speed up session unblocking and removal
6d3c125edaca scsi: iscsi: Fix recovery and unblocking race
0bae86ba1c35 scsi: qedi: Fix cmd_cleanup_cmpl counter mismatch issue
e9ff2c8b7487 scsi: iscsi: Unblock session then wake up error handler
623f01150f92 scsi: iscsi: Fix set_param() handling
40de9a34a363 scsi: iscsi: Fix iscsi_task use after free
1255087ae481 scsi: iscsi: Adjust iface sysfs attr detection
a1d592e5729f scsi: qedi: Add support for fastpath doorbell recovery
f276818f0070 redhat: disable uncommon media device infrastructure
e24e48cbf7c1 CI: Drop baseline runs
a1a8ee7551a8 (tag: kernel-5.14.0-70.12.1.el9_0) [redhat] kernel-5.14.0-70.12.1.el9_0
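
A quick way to double-check a list like this is to filter the subjects for interrupt or serial-console keywords. The sketch below is only a crude text filter with a keyword list chosen for illustration (an assumption); an empty result does not prove that a commit is unrelated. `suspect_commits` is a hypothetical helper name.

```shell
# Hypothetical helper: read "<hash> <subject>" lines on stdin and print those whose
# subject mentions interrupt or serial-console keywords (keyword list is an assumption).
suspect_commits() {
    awk '{
        subject = tolower($0)
        sub(/^[0-9a-f]+ /, "", subject)   # ignore the abbreviated hash itself
        if (subject ~ /irq|tty|serial|8250|console|uart/) print
    }'
}
```

Assuming a checkout that has both tags, the input could come from `git log --oneline kernel-5.14.0-70.12.1.el9_0..kernel-5.14.0-70.13.1.el9_0 | suspect_commits`.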

Comment 25 Timothée Ravier 2023-09-07 16:24:46 UTC
Can you link to where this was fixed? We cannot find the fix.

Comment 26 Timothée Ravier 2023-09-07 16:36:02 UTC
Apparently it's fixed by https://src.fedoraproject.org/rpms/kexec-tools/c/d55a0565585aa22db069cf5f5fa1955373be60b3

Comment 27 Dusty Mabe 2023-09-07 16:45:21 UTC
Yes. And originally it was fixed by https://src.fedoraproject.org/rpms/kexec-tools/c/d593bfa6fc5e2e894798e22fa9c4c433517de4b3

But I don't see this in any code in RHEL (or maybe I'm missing something). Can we get this fixed in RHEL?

