Bug 2037005

Summary: [Azure]2 simultaneous crash kernel requests cause system hang in D2s_v4 size
Product: Red Hat Enterprise Linux 8 Reporter: Yuxin Sun <yuxisun>
Component: kernel Assignee: Ani Sinha <anisinha>
kernel sub component: Hyper-V QA Contact: xxiong
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: andavis, artem, decui, eterrell, johan.burati, litian, mikhail.velikikh, misha.bykov, ruyang, svpcom, xuli, xxiong, yacao, yuxisun
Version: 8.6 Keywords: Triaged
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: kernel-4.18.0-485.el8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-11-14 15:37:55 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Yuxin Sun 2022-01-04 16:01:22 UTC
Description of problem:

2 simultaneous crash kernel requests cause system hang in D2s_v4 VM on Azure.

Version-Release number of selected component (if applicable):
kernel-4.18.0-353.el8.x86_64
kernel-4.18.0-354.el8.x86_64


How reproducible:
100%


Steps to Reproduce:
1. Create a D2s_v4 VM on Azure
2. systemctl disable kdump
echo "kernel.panic = 1" >> /etc/sysctl.conf
reboot
3. Create a shell script, e.g. crash.sh:
```
echo c > /proc/sysrq-trigger &
echo c > /proc/sysrq-trigger &
```
Run the script as root in the serial console.
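
The file edits in steps 2 and 3 could be collected into helpers like the ones below. This is only a sketch: the function names and the optional path arguments are hypothetical and not part of the original report. The helpers only write files; they never trigger a crash themselves.

```shell
#!/bin/sh
# Hypothetical helpers: they only write files, they never trigger a crash.

# Append the panic timeout from step 2 to a sysctl file
# (default: /etc/sysctl.conf).
add_panic_sysctl() {
    conf="${1:-/etc/sysctl.conf}"
    # Skip the append if the exact line is already present.
    grep -qx 'kernel.panic = 1' "$conf" 2>/dev/null ||
        echo 'kernel.panic = 1' >> "$conf"
}

# Write the two-line reproducer from step 3 to the given path
# (default: ./crash.sh) and make it executable.
write_crash_script() {
    out="${1:-./crash.sh}"
    cat > "$out" <<'EOF'
echo c > /proc/sysrq-trigger &
echo c > /proc/sysrq-trigger &
EOF
    chmod +x "$out"
}
```

"systemctl disable kdump", the reboot, and the actual run of crash.sh still have to be done by hand as described in the steps.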

Actual results:
The system hangs forever

[root@wala86d2sv412150239-vm1 ~]# ./crash.sh
[ 1595.360425] sysrq: SysRq : 
[ 1595.360425] sysrq: SysRq : 
[ 1595.360429] Trigger a crash
[ 1595.362352] Trigger a crash
[ 1595.362358] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 1595.362360] PGD 0 P4D 0 
[ 1595.373334] Oops: 0002 [#1] SMP NOPTI
[ 1595.375292] CPU: 1 PID: 1623 Comm: bash Not tainted 4.18.0-353.el8.x86_64 #1
[ 1595.379167] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008  12/07/2018
[ 1595.384921] RIP: 0010:sysrq_handle_crash+0x12/0x20
[ 1595.389325] Code: 44 1f c0 ff 48 89 df e8 7c fb ff ff e9 9c fe ff ff 90 90 90 90 90 90 90 0f 1f 44 00 00 c7 05 1d 13 29 01 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 c3 0f 1f 44 00 00 0f 1f 44 00 00 fb 66 0f
[ 1595.399723] RSP: 0018:ffffab69812efe78 EFLAGS: 00010246
[ 1595.402470] RAX: ffffffffab1bb920 RBX: 0000000000000063 RCX: 0000000000000007
[ 1595.406350] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000063
[ 1595.410362] RBP: 0000000000000007 R08: ffffffffac4616e0 R09: 00080000000000ff
[ 1595.414191] R10: 6873617263206120 R11: 2072656767697254 R12: 0000000000000000
[ 1595.418106] R13: 0000000000000000 R14: ffffffffabab0280 R15: 0000000000000000
[ 1595.421857] FS:  00007fb1712ad740(0000) GS:ffff9349b7d00000(0000) knlGS:0000000000000000
[ 1595.426059] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1595.429197] CR2: 0000000000000000 CR3: 00000001089fe005 CR4: 00000000003706e0
[ 1595.432892] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1595.436746] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1595.440468] Call Trace:
[ 1595.441775]  __handle_sysrq.cold.11+0x48/0xfb
[ 1595.444074]  write_sysrq_trigger+0x2b/0x30
[ 1595.446218]  proc_reg_write+0x39/0x60
[ 1595.448530]  vfs_write+0xa5/0x1a0
[ 1595.450592]  ksys_write+0x4f/0xb0
[ 1595.452609]  do_syscall_64+0x5b/0x1a0
[ 1595.454717]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 1595.457469] RIP: 0033:0x7fb1709b85c8
[ 1595.459353] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 d5 3f 2a 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[ 1595.469617] RSP: 002b:00007ffcc3a3fb58 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 1595.474148] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fb1709b85c8
[ 1595.478128] RDX: 0000000000000002 RSI: 000055aec1399ac0 RDI: 0000000000000001
[ 1595.481830] RBP: 000055aec1399ac0 R08: 000000000000000a R09: 0000000000000002
[ 1595.485969] R10: 000000000000000a R11: 0000000000000246 R12: 00007fb170c586e0
[ 1595.496965] R13: 0000000000000002 R14: 00007fb170c53880 R15: 0000000000000002
[ 1595.500709] Modules linked in: intel_rapl_msr intel_rapl_common isst_if_mbox_msr isst_if_common nfit libnvdimm xt_conntrack kvm_intel nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 kvm nft_counter xt_owner irqbypass nft_compat nf_tables rapl nfnetlink pcspkr i2c_piix4 hyperv_fb hv_balloon hv_utils joydev vfat fat ipmi_devintf ipmi_msghandler xfs libcrc32c ata_generic crct10dif_pclmul sd_mod t10_pi sg crc32_pclmul hv_netvsc hid_hyperv hyperv_keyboard hv_storvsc crc32c_intel scsi_transport_fc ata_piix ghash_clmulni_intel serio_raw hv_vmbus libata sunrpc dm_mirror dm_region_hash dm_log dm_mod
[ 1595.533131] CR2: 0000000000000000
[ 1595.535923] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 1595.535925] ---[ end trace 5f81f0a8565c2888 ]---
[ 1595.535928] RIP: 0010:sysrq_handle_crash+0x12/0x20
[ 1595.541653] PGD 0 
[ 1595.545272] Code: 44 1f c0 ff 48 89 df e8 7c fb ff ff e9 9c fe ff ff 90 90 90 90 90 90 90 0f 1f 44 00 00 c7 05 1d 13 29 01 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 c3 0f 1f 44 00 00 0f 1f 44 00 00 fb 66 0f
[ 1595.548987] P4D 0 
[ 1595.551116] RSP: 0018:ffffab69812efe78 EFLAGS: 00010246
[ 1595.562861] 
[ 1595.565276] 
[ 1595.569024] Oops: 0002 [#2] SMP NOPTI
[ 1595.571014] RAX: ffffffffab1bb920 RBX: 0000000000000063 RCX: 0000000000000007
[ 1595.572770] CPU: 0 PID: 1622 Comm: bash Tainted: G      D          --------- -  - 4.18.0-353.el8.x86_64 #1
[ 1595.575998] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000063
[ 1595.580789] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008  12/07/2018
[ 1595.587856] RBP: 0000000000000007 R08: ffffffffac4616e0 R09: 00080000000000ff
[ 1595.594763] RIP: 0010:sysrq_handle_crash+0x12/0x20
[ 1595.601099] R10: 6873617263206120 R11: 2072656767697254 R12: 0000000000000000
[ 1595.606032] Code: 44 1f c0 ff 48 89 df e8 7c fb ff ff e9 9c fe ff ff 90 90 90 90 90 90 90 0f 1f 44 00 00 c7 05 1d 13 29 01 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 c3 0f 1f 44 00 00 0f 1f 44 00 00 fb 66 0f
[ 1595.609686] R13: 0000000000000000 R14: ffffffffabab0280 R15: 0000000000000000
[ 1595.613657] RSP: 0018:ffffab6982c7be78 EFLAGS: 00010246
[ 1595.625738] FS:  00007fb1712ad740(0000) GS:ffff9349b7d00000(0000) knlGS:0000000000000000
[ 1595.630611] 
[ 1595.634433] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1595.639875] RAX: ffffffffab1bb920 RBX: 0000000000000063 RCX: 0000000000000000
[ 1595.641831] CR2: 0000000000000000 CR3: 00000001089fe005 CR4: 00000000003706e0
[ 1595.645825] RDX: 0000000000000000 RSI: ffff9349b7c16858 RDI: 0000000000000063
[ 1595.651260] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1595.656239] RBP: 0000000000000007 R08: 0000000000000000 R09: 0000000000000096
[ 1595.661249] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1595.666187] R10: 00000000ff000000 R11: ffffab69838a0020 R12: 0000000000000000

Expected results:
The system should panic and reboot
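
When comparing saved serial-console logs, a rough check for the expected result is whether the kernel printed its "Rebooting in N seconds" message. The helper below is a sketch under that assumption; its name and output strings are hypothetical. A missing message is only a hint, not proof, that the VM hung.

```shell
#!/bin/sh
# Hypothetical helper: classify a saved serial-console log.
# Prints "reboot-started" if the kernel announced a reboot,
# otherwise "no-reboot-message".
classify_console_log() {
    if grep -q 'Rebooting in [0-9]* seconds' "$1"; then
        echo reboot-started
    else
        echo no-reboot-message
    fi
}
```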

Additional info:
Related BZ#2033214

Comment 2 Yuxin Sun 2022-01-05 15:47:54 UTC
(In reply to Dave Young from comment #1)
> Hi,
> 
> Could you try to disable panic notifiers as below before doing the crash
> test. see if it helps? 
> echo 0 > /sys/module/kernel/parameters/crash_kexec_post_notifiers
> 
> Thanks
> Dave

Hi Dave,

A quick test: after running 'echo 0 > /sys/module/kernel/parameters/crash_kexec_post_notifiers' and then './crash.sh', the VM prints several "[  151.029536] hv_vmbus: Waiting for VMBus UNLOAD to complete" messages for ~100s and then reboots successfully.

Thanks!
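
To record which setting a given test run used, the parameter can be read back before triggering the crash. The sketch below assumes the kernel shows this boolean parameter as Y or N; the function name and the optional path argument (useful for testing against a plain file) are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report the state of crash_kexec_post_notifiers.
# The optional argument overrides the sysfs path, e.g. for testing.
post_notifiers_state() {
    p="${1:-/sys/module/kernel/parameters/crash_kexec_post_notifiers}"
    # The file is absent or unreadable on some systems.
    [ -r "$p" ] || { echo unavailable; return 0; }
    case "$(cat "$p")" in
        Y|1) echo enabled ;;
        N|0) echo disabled ;;
        *)   echo unknown ;;
    esac
}
```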

Comment 4 Yuxin Sun 2022-01-06 13:26:30 UTC
(In reply to Dave Young from comment #3)
> Hi, thanks. So it could be same issue reported in
> https://bugzilla.redhat.com/show_bug.cgi?id=1865745, I suggest to open the
> bug to MS to have a look

Thanks Dave! CC Dexuan.

Hi Dexuan,

Could you please take a look? Is this the same issue as BZ#1865745? Thanks!

Comment 5 Dexuan Cui 2022-01-06 22:06:26 UTC
The symptom looks a little different as kdump is disabled here ("systemctl disable kdump").

I'm not sure why the VM is able to reboot upon panic, if we run 'echo 0 > /sys/module/kernel/parameters/crash_kexec_post_notifiers'. Since kdump is disabled here, crash_kexec_post_notifiers=0 or 1 should not make any difference to me (?)

It's unclear to me how 2 simultaneous crash kernel requests cause a system hang.

Comment 6 Dave Young 2022-01-07 02:06:17 UTC
(In reply to Dexuan Cui from comment #5)
> The symptom looks a little different as kdump is disabled here ("systemctl
> disable kdump").
> 
> I'm not sure why the VM is able to reboot upon panic, if we run 'echo 0 >
> /sys/module/kernel/parameters/crash_kexec_post_notifiers'. Since kdump is
> disabled here, crash_kexec_post_notifiers=0 or 1 should not make any
> difference to me (?)

From kernel/panic.c, whether kdump is loaded or not is checked in the *crash_kexec* functions, but the upper-level code still differs depending on the value of crash_kexec_post_notifiers.

It seems the crash_smp_send_stop callbacks are different. It is probably worth checking whether anything Hyper-V specific needs to be considered in the CPU shootdown path, especially in kdump_nmi_callback(). I'm not an x86 expert, so this is just a wild guess.

Comment 7 Dexuan Cui 2022-01-07 02:50:07 UTC
@Yuxin, do you happen to know if RHEL 9 has the bug or not? It's also very helpful to give the mainline kernel (kernel-ml) a try: https://elrepo.org/linux/kernel/el8/x86_64/RPMS/ (e.g. kernel-ml-*5.15.13*).

Comment 8 Yuxin Sun 2022-01-07 14:33:52 UTC
(In reply to Dexuan Cui from comment #7)
> @Yuxin, do you happen to know if RHEL 9 has the bug or not? It's also very
> helpful to give the mainline kernel (kernel-ml) a try:
> https://elrepo.org/linux/kernel/el8/x86_64/RPMS/ (e.g. kernel-ml-*5.15.13*).

Hi Dexuan,

This issue also exists in RHEL-9 kernel-5.14.0-39.el9.x86_64, but doesn't exist in kernel-ml-5.15.13-1.el8.elrepo.x86_64. Thanks!

Comment 24 xxiong 2023-04-11 05:31:36 UTC
Checked with compose RHEL-8.9.0-20230410.21 (4.18.0-485.el8.x86_64) on a Standard D2s v4 VM; the result for this issue is PASS.


[root@LISAv2-xxq-rhel8 azureuser]# sh crash.sh 
[  892.917227] sysrq: SysRq : Trigger a crash
[  892.917467] sysrq: SysRq : Trigger a crash
[  892.919302] Kernel panic - not syncing: sysrq triggered crash
[  892.919302] 
[  892.925027] CPU: 0 PID: 1712 Comm: sh Tainted: G               X --------- -  - 4.18.0-485.el8.x86_64 #1
[  892.930222] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 05/09/2022
[  892.935856] Call Trace:
[  892.937179]  dump_stack+0x41/0x60
[  892.939016]  panic+0xe7/0x2ac
[  892.940574]  ? printk+0x58/0x73
[  892.942286]  sysrq_handle_crash+0x11/0x20
[  892.944390]  __handle_sysrq.cold.13+0x48/0xff
[  892.946576]  write_sysrq_trigger+0x2b/0x40
[  892.948616]  proc_reg_write+0x39/0x60
[  892.950541]  vfs_write+0xa5/0x1b0
[  892.952237]  ksys_write+0x4f/0xb0
[  892.953924]  do_syscall_64+0x5b/0x1b0
[  892.955823]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[  892.958486] RIP: 0033:0x7fdb38b8ea28
[  892.960334] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 15 4d 2a 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[  892.969548] RSP: 002b:00007ffd112cf5b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  892.973057] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fdb38b8ea28
[  892.976510] RDX: 0000000000000002 RSI: 00005577df4573b0 RDI: 0000000000000001
[  892.980000] RBP: 00005577df4573b0 R08: 000000000000000a R09: 00005577df43c08e
[  892.983337] R10: 000000000000000a R11: 0000000000000246 R12: 00007fdb38e2f6e0
[  892.986634] R13: 0000000000000002 R14: 00007fdb38e2a860 R15: 0000000000000002
[  892.991226] Kernel Offset: 0x10c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  893.017433] Rebooting in 1 seconds..
[  894.019232] WARNING: CPU: 0 PID: 1712 at arch/x86/kernel/nmi.c:164 __register_nmi_handler+0x1e/0x130
[  894.024054] Modules linked in: ext4 mbcache jbd2 intel_rapl_msr intel_rapl_common nfit libnvdimm kvm_intel xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 kvm xt_owner nft_counter nft_compat nf_tables nfnetlink irqbypass rapl pcspkr hyperv_fb joydev hv_balloon hv_utils vfat fat xfs libcrc32c nvme_tcp(X) nvme_fabrics nvme sr_mod nvme_core sd_mod cdrom t10_pi sg crct10dif_pclmul crc32_pclmul hv_storvsc serio_raw hv_netvsc scsi_transport_fc hid_hyperv hyperv_keyboard crc32c_intel hv_vmbus ghash_clmulni_intel sunrpc dm_mirror dm_region_hash dm_log dm_mod
[  894.047363] CPU: 0 PID: 1712 Comm: sh Tainted: G               X --------- -  - 4.18.0-485.el8.x86_64 #1
[  894.051827] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 05/09/2022
[  894.057122] RIP: 0010:__register_nmi_handler+0x1e/0x130
[  894.059602] Code: 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 83 7e 10 00 74 08 48 8b 06 48 39 c6 74 0c <0f> 0b b8 ea ff ff ff e9 ab 00 00 00 41 89 ff 48 c7 46 20 00 00 00
[  894.068320] RSP: 0018:ffffbb47019dbd50 EFLAGS: 00010087
[  894.071613] RAX: ffffffff93637a60 RBX: 00000000000003e8 RCX: 00000000feda3223
[  894.075947] RDX: 000000001f8bfbff RSI: ffffffff9362f1c0 RDI: 0000000000000000
[  894.080134] RBP: ffffbb47019dbe68 R08: ffffbb47019dbdac R09: ffffbb47019dbdb0
[  894.084369] R10: 0000000000000001 R11: ffffbb47019dbc18 R12: 0000000000000001
[  894.089023] R13: 00000000000003e8 R14: 0000000000000061 R15: 0000000000000000
[  894.092938] FS:  00007fdb39486740(0000) GS:ffff8f6177c00000(0000) knlGS:0000000000000000
[  894.097245] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  894.100570] CR2: 00007fdb38beee33 CR3: 0000000108194001 CR4: 0000000000370ef0
[  894.104596] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  894.108491] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  894.112353] Call Trace:
[  894.114127]  nmi_shootdown_cpus+0x3f/0xa0
[  894.116593]  native_machine_emergency_restart+0x224/0x280
[  894.119759]  panic+0x242/0x2ac
[  894.121839]  ? printk+0x58/0x73
[  894.123948]  sysrq_handle_crash+0x11/0x20
[  894.126402]  __handle_sysrq.cold.13+0x48/0xff
[  894.128998]  write_sysrq_trigger+0x2b/0x40
[  894.131559]  proc_reg_write+0x39/0x60
[  894.134020]  vfs_write+0xa5/0x1b0
[  894.136363]  ksys_write+0x4f/0xb0
[  894.138604]  do_syscall_64+0x5b/0x1b0
[  894.140967]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[  894.143918] RIP: 0033:0x7fdb38b8ea28
[  894.146215] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 15 4d 2a 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[  894.156432] RSP: 002b:00007ffd112cf5b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  894.160658] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fdb38b8ea28
[  894.164518] RDX: 0000000000000002 RSI: 00005577df4573b0 RDI: 0000000000000001
[  894.168445] RBP: 00005577df4573b0 R08: 000000000000000a R09: 00005577df43c08e
[  894.172279] R10: 000000000000000a R11: 0000000000000246 R12: 00007fdb38e2f6e0
[  894.176065] R13: 0000000000000002 R14: 00007fdb38e2a860 R15: 0000000000000002
[  894.179912] ---[ end trace ef4e37249f09b81d ]---

Comment 28 xxiong 2023-04-15 08:51:49 UTC
As per comment 24, changing to verified.
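
To check whether a RHEL 8 host already runs the build listed under "Fixed In Version" (kernel-4.18.0-485.el8), the build number can be extracted from the kernel release string. This is a sketch only: the function name and output strings are hypothetical, it handles only release strings of the form 4.18.0-<build>.el8..., and "older-than-fix" says nothing about whether a z-stream kernel received a backport.

```shell
#!/bin/sh
# Hypothetical helper: compare a RHEL 8 kernel release string against
# build 485, the build named in "Fixed In Version".
# Any other release format prints "unknown".
el8_fix_status() {
    rel="${1:-$(uname -r)}"
    case "$rel" in
        4.18.0-*.el8*) ;;
        *) echo unknown; return 0 ;;
    esac
    build="${rel#4.18.0-}"   # e.g. "353.el8.x86_64"
    build="${build%%.*}"     # e.g. "353"
    case "$build" in
        ''|*[!0-9]*) echo unknown; return 0 ;;
    esac
    if [ "$build" -ge 485 ]; then
        echo fixed
    else
        echo older-than-fix
    fi
}
```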

Comment 30 errata-xmlrpc 2023-11-14 15:37:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:7077