Bug 2037005
Summary: | [Azure]2 simultaneous crash kernel requests cause system hang in D2s_v4 size | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Yuxin Sun <yuxisun> |
Component: | kernel | Assignee: | Ani Sinha <anisinha> |
kernel sub component: | Hyper-V | QA Contact: | xxiong |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | medium | ||
Priority: | medium | CC: | andavis, artem, decui, eterrell, johan.burati, litian, mikhail.velikikh, misha.bykov, ruyang, svpcom, xuli, xxiong, yacao, yuxisun |
Version: | 8.6 | Keywords: | Triaged |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | kernel-4.18.0-485.el8 | Doc Type: | If docs needed, set a value |
Last Closed: | 2023-11-14 15:37:55 UTC | Type: | Bug |
Description
Yuxin Sun
2022-01-04 16:01:22 UTC
(In reply to Dave Young from comment #1)
> Hi,
>
> Could you try to disable panic notifiers as below before doing the crash
> test. see if it helps?
> echo 0 > /sys/module/kernel/parameters/crash_kexec_post_notifiers
>
> Thanks
> Dave

Hi Dave,

A quick test: after running 'echo 0 > /sys/module/kernel/parameters/crash_kexec_post_notifiers' and then './crash.sh', the VM prints several "[ 151.029536] hv_vmbus: Waiting for VMBus UNLOAD to complete" messages for ~100s and then reboots successfully.

Thanks!

(In reply to Dave Young from comment #3)
> Hi, thanks. So it could be same issue reported in
> https://bugzilla.redhat.com/show_bug.cgi?id=1865745, I suggest to open the
> bug to MS to have a look

Thanks Dave! CC Dexuan.

Hi Dexuan,

Could you please have a look? Is this the same issue as BZ#1865745? Thanks!

The symptom looks a little different as kdump is disabled here ("systemctl disable kdump").

I'm not sure why the VM is able to reboot upon panic if we run 'echo 0 > /sys/module/kernel/parameters/crash_kexec_post_notifiers'. Since kdump is disabled here, crash_kexec_post_notifiers=0 or 1 should not make any difference to me (?)

It's unclear to me how 2 simultaneous crash kernel requests cause system hang.

(In reply to Dexuan Cui from comment #5)
> The symptom looks a little different as kdump is disabled here ("systemctl
> disable kdump").
>
> I'm not sure why the VM is able to reboot upon panic, if we run 'echo 0 >
> /sys/module/kernel/parameters/crash_kexec_post_notifiers'. Since kdump is
> disabled here, crash_kexec_post_notifiers=0 or 1 should not make any
> difference to me (?)

From kernel/panic.c, whether kdump is loaded or not is checked in the *crash_kexec* functions, but the upper-level code still differs depending on the value of crash_kexec_post_notifiers.
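To make the point about kernel/panic.c concrete, here is a rough Python model of the ordering in panic() as I understand it from the upstream source. It is only an illustration: the step strings mirror kernel symbol names (__crash_kexec, smp_send_stop, crash_smp_send_stop), the final "reboot or halt" step is a simplification, and the RHEL 8 code may differ in detail.

```python
from typing import List


def panic_steps(crash_kexec_post_notifiers: bool, kdump_loaded: bool) -> List[str]:
    """Return the order of major steps panic() would take (illustrative model only)."""
    steps: List[str] = []

    def crash_kexec() -> bool:
        # __crash_kexec() only jumps to the crash kernel if one is loaded.
        if kdump_loaded:
            steps.append("__crash_kexec: jump to crash kernel")
            return True
        steps.append("__crash_kexec: no crash kernel loaded, return")
        return False

    if not crash_kexec_post_notifiers:
        if crash_kexec():
            return steps  # control never comes back to the panicking kernel
        steps.append("smp_send_stop")
    else:
        # Different CPU-stop path: on x86 this goes through the NMI shootdown code.
        steps.append("crash_smp_send_stop")

    steps.append("panic notifiers")
    steps.append("kmsg_dump")

    if crash_kexec_post_notifiers and crash_kexec():
        return steps

    steps.append("reboot or halt (simplified)")
    return steps


if __name__ == "__main__":
    # kdump disabled, as in this report: only the CPU-stop step differs.
    for value in (False, True):
        print(value, panic_steps(value, kdump_loaded=False))
```

With kdump disabled, the two settings differ only in which CPU-stop function runs before the panic notifiers, which is why the CPU shootdown path is the natural place to look.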
It seems the crash_smp_send_stop callbacks are different. It is probably worth checking whether any Hyper-V specific things should be considered in the CPU shootdown path, especially in kdump_nmi_callback(). I'm not an x86 expert, so this is just a wild guess.

@Yuxin, do you happen to know if RHEL 9 has the bug or not? It's also very helpful to give the mainline kernel (kernel-ml) a try: https://elrepo.org/linux/kernel/el8/x86_64/RPMS/ (e.g. kernel-ml-*5.15.13*).

(In reply to Dexuan Cui from comment #7)
> @Yuxin, do you happen to know if RHEL 9 has the bug or not? It's also very
> helpful to give the mainline kernel (kernel-ml) a try:
> https://elrepo.org/linux/kernel/el8/x86_64/RPMS/ (e.g. kernel-ml-*5.15.13*).

Hi Dexuan,

This issue also exists in RHEL-9 kernel-5.14.0-39.el9.x86_64, but doesn't exist in kernel-ml-5.15.13-1.el8.elrepo.x86_64.

Thanks!

Checked with compose RHEL-8.9.0-20230410.21 (4.18.0-485.el8.x86_64) on a Standard D2s v4 VM; the result for this issue is PASS.

[root@LISAv2-xxq-rhel8 azureuser]# sh crash.sh
[ 892.917227] sysrq: SysRq : Trigger a crash
[ 892.917467] sysrq: SysRq : Trigger a crash
[ 892.919302] Kernel panic - not syncing: sysrq triggered crash
[ 892.919302]
[ 892.925027] CPU: 0 PID: 1712 Comm: sh Tainted: G X --------- - - 4.18.0-485.el8.x86_64 #1
[ 892.930222] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 05/09/2022
[ 892.935856] Call Trace:
[ 892.937179] dump_stack+0x41/0x60
[ 892.939016] panic+0xe7/0x2ac
[ 892.940574] ? printk+0x58/0x73
[ 892.942286] sysrq_handle_crash+0x11/0x20
[ 892.944390] __handle_sysrq.cold.13+0x48/0xff
[ 892.946576] write_sysrq_trigger+0x2b/0x40
[ 892.948616] proc_reg_write+0x39/0x60
[ 892.950541] vfs_write+0xa5/0x1b0
[ 892.952237] ksys_write+0x4f/0xb0
[ 892.953924] do_syscall_64+0x5b/0x1b0
[ 892.955823] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 892.958486] RIP: 0033:0x7fdb38b8ea28
[ 892.960334] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 15 4d 2a 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[ 892.969548] RSP: 002b:00007ffd112cf5b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 892.973057] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fdb38b8ea28
[ 892.976510] RDX: 0000000000000002 RSI: 00005577df4573b0 RDI: 0000000000000001
[ 892.980000] RBP: 00005577df4573b0 R08: 000000000000000a R09: 00005577df43c08e
[ 892.983337] R10: 000000000000000a R11: 0000000000000246 R12: 00007fdb38e2f6e0
[ 892.986634] R13: 0000000000000002 R14: 00007fdb38e2a860 R15: 0000000000000002
[ 892.991226] Kernel Offset: 0x10c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 893.017433] Rebooting in 1 seconds..
[ 894.019232] WARNING: CPU: 0 PID: 1712 at arch/x86/kernel/nmi.c:164 __register_nmi_handler+0x1e/0x130
[ 894.024054] Modules linked in: ext4 mbcache jbd2 intel_rapl_msr intel_rapl_common nfit libnvdimm kvm_intel xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 kvm xt_owner nft_counter nft_compat nf_tables nfnetlink irqbypass rapl pcspkr hyperv_fb joydev hv_balloon hv_utils vfat fat xfs libcrc32c nvme_tcp(X) nvme_fabrics nvme sr_mod nvme_core sd_mod cdrom t10_pi sg crct10dif_pclmul crc32_pclmul hv_storvsc serio_raw hv_netvsc scsi_transport_fc hid_hyperv hyperv_keyboard crc32c_intel hv_vmbus ghash_clmulni_intel sunrpc dm_mirror dm_region_hash dm_log dm_mod
[ 894.047363] CPU: 0 PID: 1712 Comm: sh Tainted: G X --------- - - 4.18.0-485.el8.x86_64 #1
[ 894.051827] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 05/09/2022
[ 894.057122] RIP: 0010:__register_nmi_handler+0x1e/0x130
[ 894.059602] Code: 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 83 7e 10 00 74 08 48 8b 06 48 39 c6 74 0c <0f> 0b b8 ea ff ff ff e9 ab 00 00 00 41 89 ff 48 c7 46 20 00 00 00
[ 894.068320] RSP: 0018:ffffbb47019dbd50 EFLAGS: 00010087
[ 894.071613] RAX: ffffffff93637a60 RBX: 00000000000003e8 RCX: 00000000feda3223
[ 894.075947] RDX: 000000001f8bfbff RSI: ffffffff9362f1c0 RDI: 0000000000000000
[ 894.080134] RBP: ffffbb47019dbe68 R08: ffffbb47019dbdac R09: ffffbb47019dbdb0
[ 894.084369] R10: 0000000000000001 R11: ffffbb47019dbc18 R12: 0000000000000001
[ 894.089023] R13: 00000000000003e8 R14: 0000000000000061 R15: 0000000000000000
[ 894.092938] FS: 00007fdb39486740(0000) GS:ffff8f6177c00000(0000) knlGS:0000000000000000
[ 894.097245] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 894.100570] CR2: 00007fdb38beee33 CR3: 0000000108194001 CR4: 0000000000370ef0
[ 894.104596] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 894.108491] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 894.112353] Call Trace:
[ 894.114127] nmi_shootdown_cpus+0x3f/0xa0
[ 894.116593] native_machine_emergency_restart+0x224/0x280
[ 894.119759] panic+0x242/0x2ac
[ 894.121839] ? printk+0x58/0x73
[ 894.123948] sysrq_handle_crash+0x11/0x20
[ 894.126402] __handle_sysrq.cold.13+0x48/0xff
[ 894.128998] write_sysrq_trigger+0x2b/0x40
[ 894.131559] proc_reg_write+0x39/0x60
[ 894.134020] vfs_write+0xa5/0x1b0
[ 894.136363] ksys_write+0x4f/0xb0
[ 894.138604] do_syscall_64+0x5b/0x1b0
[ 894.140967] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 894.143918] RIP: 0033:0x7fdb38b8ea28
[ 894.146215] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 15 4d 2a 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[ 894.156432] RSP: 002b:00007ffd112cf5b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 894.160658] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fdb38b8ea28
[ 894.164518] RDX: 0000000000000002 RSI: 00005577df4573b0 RDI: 0000000000000001
[ 894.168445] RBP: 00005577df4573b0 R08: 000000000000000a R09: 00005577df43c08e
[ 894.172279] R10: 000000000000000a R11: 0000000000000246 R12: 00007fdb38e2f6e0
[ 894.176065] R13: 0000000000000002 R14: 00007fdb38e2a860 R15: 0000000000000002
[ 894.179912] ---[ end trace ef4e37249f09b81d ]---

As per comment 24, changing the status to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:7077
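
For anyone re-running the verification, a console log like the one in the PASS comment can be checked mechanically. The following is a minimal Python sketch, assuming the log is available as plain text; the function name summarize and the SAMPLE excerpt are my own for illustration and are not part of crash.sh, whose contents are not shown in this report.

```python
import re

# Patterns for the two messages of interest in the console log.
TRIGGER = re.compile(r"\[\s*(\d+\.\d+)\]\s+sysrq: SysRq : Trigger a crash")
REBOOT = re.compile(r"\[\s*(\d+\.\d+)\]\s+Rebooting in \d+ seconds")


def summarize(log: str) -> dict:
    """Count SysRq crash triggers and report whether the reboot message appeared."""
    triggers = [float(t) for t in TRIGGER.findall(log)]
    reboot = REBOOT.search(log)
    return {
        "trigger_count": len(triggers),
        # Time between the earliest and latest trigger, in seconds.
        "trigger_spread_s": (max(triggers) - min(triggers)) if triggers else None,
        "reboot_at_s": float(reboot.group(1)) if reboot else None,
    }


# Excerpt copied from the verification log in this report.
SAMPLE = """[ 892.917227] sysrq: SysRq : Trigger a crash
[ 892.917467] sysrq: SysRq : Trigger a crash
[ 892.919302] Kernel panic - not syncing: sysrq triggered crash
[ 893.017433] Rebooting in 1 seconds..
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Two triggers a fraction of a millisecond apart followed by a "Rebooting" line match the passing behaviour reported for kernel-4.18.0-485.el8; a log with two triggers and no reboot line would correspond to the original hang.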