Bug 2143438
Summary: | kernel 3.10.0-1160.80.1.el7.x86_64 on Xeon E55xx crashes upon KVM startup [rhel-7.9.z] | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Petko Alov <petko> |
Component: | kernel | Assignee: | Maxim Levitsky <mlevitsk> |
kernel sub component: | KVM | QA Contact: | liunana <nanliu> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | high | ||
Priority: | unspecified | CC: | ailan, bhoefer, chayang, coli, gosert, jan.vanmullem, jinzhao, john.sincock, juzhang, Kiryanov_AK, kpfleming, mlevitsk, nathan, nmurray, orion, own3mall, quent.haas, redhatbug, redhat-bugzilla, rvrbovsk, sa-redhat, vchepkov, virt-maint, vquemener, xiaohli, ymankad |
Version: | 7.9 | Keywords: | Regression, Triaged, ZStream |
Target Milestone: | rc | Flags: | jinzhao: needinfo- |
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | kernel-3.10.0-1160.87.1.el7 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2023-03-07 09:54:02 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 2160371 |
Description
Petko Alov
2022-11-16 23:14:45 UTC
This is also happening on AlmaLinux 8.6, vmlinuz-4.18.0-372.32.1.el8_6.x86_64.
Same behavior: Xeon E56XX processors seemed fine, but a Xeon E5504 resulted in the following when doing a virsh start.
This works when starting up using vmlinuz-4.18.0-372.26.1.el8_6.x86_64 and vmlinuz-4.18.0-372.16.1.el8_6.x86_64.
Note: Have not tested with the AlmaLinux 8.7 kernel, and have since upgraded the CPU to a Xeon E5620 to work around this issue.
<pre>
[ 1916.592496] perf: interrupt took too long (2509 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 3039.858655] perf: interrupt took too long (3462 > 3136), lowering kernel.perf_event_max_sample_rate to 57000
[ 6595.276702] perf: interrupt took too long (4509 > 4327), lowering kernel.perf_event_max_sample_rate to 44000
[18421.984056] perf: interrupt took too long (5679 > 5636), lowering kernel.perf_event_max_sample_rate to 35000
[28152.353609] perf: interrupt took too long (7386 > 7098), lowering kernel.perf_event_max_sample_rate to 27000
[62686.481557] [drm] fb mappable at 0xB0363000
[62686.481567] [drm] vram apper at 0xB0000000
[62686.481570] [drm] size 3145728
[62686.481573] [drm] fb depth is 24
[62686.481576] [drm] pitch is 4096
[62686.482139] fbcon: radeondrmfb (fb0) is primary device
[62686.508939] Console: switching to colour frame buffer device 128x48
[62686.511319] radeon 0000:02:00.0: [drm] fb0: radeondrmfb frame buffer device
[62871.281318] tun: Universal TUN/TAP device driver, 1.6
[62871.282942] bb.br: port 2(vnet0) entered blocking state
[62871.282981] bb.br: port 2(vnet0) entered disabled state
[62871.283076] device vnet0 entered promiscuous mode
[62871.283310] bb.br: port 2(vnet0) entered blocking state
[62871.283345] bb.br: port 2(vnet0) entered forwarding state
[62871.453289] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
[62872.743528] int3: 0000 [#1] SMP PTI
[62872.743531] CPU: 0 PID: 75559 Comm: CPU 0/KVM Kdump: loaded Not tainted 4.18.0-372.32.1.el8_6.x86_64 #1
[62872.743532] Hardware name: , BIOS S5500.86B.01.00.0064.050520141428 05/05/2014
[62872.743533] RIP: 0010:setno+0x9/0x10 [kvm]
[62872.743534] Code: e5 dd 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 0f 90 c0 e9 b8 de e5 dd cc cc cc cc cc cc cc cc 0f 91 c0 e9 a8 de e5 dd cc <cc> cc cc cc cc cc cc 0f 92 c0 e9 98 de e5 dd cc cc cc cc cc cc cc
[62872.743536] RSP: 0018:ffffaccc03de3c18 EFLAGS: 00000286
[62872.743538] RAX: 00000000ffffffff RBX: ffff88b6472e9f50 RCX: 0000000000000000
[62872.743539] RDX: ffffffffc09a3594 RSI: 0000000000000000 RDI: ffff88b6472e9f50
[62872.743540] RBP: 0000000000000006 R08: ffff88b6f3360000 R09: 0000000000000000
[62872.743540] R10: 0000000000000230 R11: 0000000000000005 R12: ffffffffc09dae20
[62872.743541] R13: 0000000000000000 R14: ffff88b6472e9f50 R15: ffff88b6f3360000
[62872.743542] FS: 00007f6406a37700(0000) GS:ffff88bb97c00000(0000) knlGS:0000000000000000
[62872.743543] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[62872.743544] CR2: 0000000000000000 CR3: 0000000170378000 CR4: 00000000000026f0
[62872.743544] Call Trace:
[62872.743545] ? x86_emulate_insn+0x1c7/0x1050 [kvm]
[62872.743546] ? x86_decode_emulated_instruction+0x5a/0x210 [kvm]
[62872.743546] ? x86_emulate_instruction+0x2f2/0x560 [kvm]
[62872.743547] ? emulator_pio_in+0x30/0x70 [kvm]
[62872.743548] ? vmx_handle_exit+0x36d/0x7a0 [kvm_intel]
[62872.743549] ? vcpu_enter_guest+0xabb/0x1730 [kvm]
[62872.743549] ? vmx_set_rflags+0xb3/0x240 [kvm_intel]
[62872.743550] ? x86_emulate_instruction+0x47b/0x560 [kvm]
[62872.743551] ? vmx_vcpu_load+0x27/0x40 [kvm_intel]
[62872.743551] ? kvm_arch_vcpu_ioctl_run+0xff/0x5f0 [kvm]
[62872.743552] ? kvm_vcpu_ioctl+0x2cc/0x640 [kvm]
[62872.743553] ? do_vfs_ioctl+0xa4/0x690
[62872.743553] ? ksys_ioctl+0x64/0xa0
[62872.743554] ? __x64_sys_ioctl+0x16/0x20
[62872.743555] ? do_syscall_64+0x5b/0x1b0
[62872.743555] ? entry_SYSCALL_64_after_hwframe+0x61/0xc6
[62872.743556] Modules linked in: vhost_net vhost vhost_iotlb tap tun rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ipmi_ssif binfmt_misc scsi_transport_iscsi 8021q garp mrp bridge stp llc nft_chain_nat nf_nat nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter nf_tables_set nf_tables nfnetlink sunrpc iTCO_wdt gpio_ich iTCO_vendor_support intel_powerclamp coretemp kvm_intel kvm irqbypass intel_cstate snd_hda_codec_hdmi intel_uncore snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec pcspkr snd_hda_core snd_hwdep snd_seq snd_seq_device ioatdma snd_pcm snd_timer i2c_i801 snd lpc_ich soundcore acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler i7core_edac i5500_temp acpi_cpufreq vfat fat xfs libcrc32c ext4 mbcache jbd2 raid1 sd_mod t10_pi sg radeon mgag200 drm_ttm_helper ttm ahci libahci drm_kms_helper libata syscopyarea sysfillrect sysimgblt crc32c_intel fb_sys_fops drm igb dca i2c_algo_bit dm_mirror dm_region_hash dm_log dm_mod
</pre>

I am seeing this as well.

I'm having this issue as well on my Dell C1100 server running KVM virtual machines. Is there any update on this issue? It completely broke my server, and I wasn't able to boot until I used the last kernel version (3.10.0-1160.76.1.el7.x86_64). This is a serious issue!

Same issue on a Colfax CXT5000 server with said Xeon X5550 CPU and kernel version 3.10.0-1160.80.1.el7 in EL7.9.

Can reproduce this issue by booting qemu easily on Intel(R) Xeon(R) CPU E5540 @ 2.53GHz.
[438213.290834] Modules linked in: bridge stp llc intel_powerclamp coretemp kvm_intel gpio_ich iTCO_wdt kvm iTCO_vendor_support ipmi_ssif irqbypass pcspkr lpc_ich hpilo sg hpwdt i7core_edac ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_cpufreq ip_tables xfs libcrc32c sr_mod cdrom ata_generic pata_acpi radeon sd_mod crc_t10dif crct10dif_generic i2c_algo_bit drm_kms_helper crct10dif_common syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm drm_panel_orientation_quirks ata_piix qlcnic libata crc32c_intel serio_raw hpsa netxen_nic scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
[438213.301942] CPU: 5 PID: 6403 Comm: qemu-kvm Kdump: loaded Tainted: G I ------------ 3.10.0-1160.80.1.el7.x86_64 #1
[438213.304465] Hardware name: HP ProLiant ML370 G6, BIOS P63 08/16/2015
[438213.305821] task: ffff9aadf048b180 ti: ffff9aad72754000 task.ti: ffff9aad72754000
[438213.307397] RIP: 0010:[<ffffffffc0906f55>] [<ffffffffc0906f55>] setno+0x5/0x10 [kvm]
[438213.309121] RSP: 0018:ffff9aad72757c18 EFLAGS: 00000202
[438213.310247] RAX: 0000000000000200 RBX: ffff9aaec15f18c0 RCX: 000301001a242000
[438213.311759] RDX: ffffffffc0906f54 RSI: 0000000000000000 RDI: ffff9aaec15f18c0
[438213.313263] RBP: ffff9aad72757c48 R08: 0000000000000000 R09: 0000000000000000
[438213.314775] R10: 0000000000001ae0 R11: ffff9aaef2630008 R12: ffffffffc0927240
[438213.316280] R13: 0000000000000006 R14: ffff9aaec15f0000 R15: 0000000000000000
[438213.317792] FS: 00007f6ed071f700(0000) GS:ffff9aaef7880000(0000) knlGS:0000000000000000
[438213.319502] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[438213.320798] CR2: 0000000000000000 CR3: 00000000be524000 CR4: 00000000000027e0
[438213.322309] Call Trace:
[438213.322852] [<ffffffffc0912f35>] ? x86_emulate_insn+0x635/0xd30 [kvm]
[438213.324277] [<ffffffffc08f19d4>] x86_emulate_instruction+0x1e4/0x720 [kvm]
[438213.325795] [<ffffffffc0982689>] vmx_handle_exit+0x1f9/0xc00 [kvm_intel]
[438213.327277] [<ffffffffc0972c4d>] ? vmx_set_cr3+0xbd/0x190 [kvm_intel]
[438213.328737] [<ffffffffc09182a5>] ? kvm_apic_has_interrupt+0x45/0xa0 [kvm]
[438213.330252] [<ffffffffc08ed910>] vcpu_enter_guest+0x770/0x1470 [kvm]
[438213.331669] [<ffffffffc08f1cb4>] ? x86_emulate_instruction+0x4c4/0x720 [kvm]
[438213.333218] [<ffffffffba44bb02>] ? __mem_cgroup_commit_charge+0xe2/0x2f0
[438213.334755] [<ffffffffc08f5a18>] kvm_arch_vcpu_ioctl_run+0x358/0x480 [kvm]
[438213.336271] [<ffffffffc08d6239>] kvm_vcpu_ioctl+0x2d9/0x700 [kvm]
[438213.337630] [<ffffffffba471418>] do_vfs_ioctl+0x3a8/0x5c0
[438213.338855] [<ffffffffba4716b1>] SyS_ioctl+0x81/0xa0
[438213.339927] [<ffffffffba3467a6>] ? __audit_syscall_exit+0x1f6/0x2b0
[438213.341319] [<ffffffffba9c539a>] system_call_fastpath+0x25/0x2a
[438213.342643] Code: 00 00 48 85 ff 74 0a 55 48 89 e5 e8 06 48 9c f9 5d c3 cc cc cc cc 0f 90 c0 c3 cc cc cc cc cc cc cc cc cc cc cc cc 0f 91 c0 c3 cc <cc> cc cc cc cc cc cc cc cc cc cc 0f 92 c0 c3 cc cc cc cc cc cc
[438213.346655] RIP [<ffffffffc0906f55>] setno+0x5/0x10 [kvm]
[438213.347869] RSP <ffff9aad72757c18>
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Initializing cgroup subsys cpuacct
[ 0.000000] Linux version 3.10.0-1160.80.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Sat Oct 8 18:13:21 UTC 2022

Also can reproduce this issue with 3.10.0-1160.81.1.el7.x86_64.

Reproduce steps:
Boot qemu: /usr/libexec/qemu-kvm

Test Env:
# virsh domcapabilities | grep model
<mode name='host-model' supported='yes'>
<model fallback='forbid'>Nehalem-IBRS</model>
Model name: Intel(R) Xeon(R) CPU E5540 @ 2.53GHz
qemu-kvm-rhev-2.12.0-48.el7_9.4.x86_64

I will upload the full vmcore-dmesg.txt later.

Hi Amnon,
Would you please help to find the right developer to help to check this issue? Thanks a lot!
Best regards
Nana

Does 3.10.0-1160.81.1.el7 do anything to fix this? I'm guessing no, but thought I'd ask.
* Thu Nov 24 2022 Rado Vrbovsky <rvrbovsk> [3.10.0-1160.81.1.el7]
- [netdrv] bnxt: don't lock the tx queue from napi poll (Jamie Bainbridge) [2110869]
- [netdrv] bnxt_en: reverse order of TX disable and carrier off (Jamie Bainbridge) [2110869]
- [netdrv] qede: confirm skb is allocated before using (Jamie Bainbridge) [2131145]

So, no

I'm also seeing this on 3.10.0-1160.80.1.el7.x86_64, with Intel(R) Xeon(R) CPU E5520 @ 2.27GHz.
As soon as I attempt to start a VM, the host hangs and reboots.
This happened to me on a server never used for KVM before, when setting up the first VM, and with autostart NOT enabled.
If this kernel is installed on a hypervisor with VMs already set up and set to autostart on reboot, then the result will likely be an endless loop of hang/reboot, which will likely continue until someone notices, boots into rescue mode or an alternate kernel, and sorts things out.
Or if no one notices, then it will likely continue until the hard reboots corrupt something important on the FS and leave the system in a real mess. Brilliant.
This kernel should be removed from the repository; it is a menace.
Older kernel 3.10.0-1160.66.1.el7.x86_64 works fine.

Hi Maxim,
Would you please help to check the test results in Comment 20? The fix works well on my side.
What's the next step for this bug?
Thanks.
Nana

I'm unfortunately able to reproduce the issue with Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (on HP ProLiant ML350 G6), too. Reverting to 3.10.0-1160.71.1.el7.x86_64 keeps the system working.

We also see this issue on one of our KVM servers. But reverting back to 3.10.0-1160.76.1.el7.x86_64.img is no longer an option due to vulnerabilities in this old kernel version. When could we expect a fixed kernel version, or is there a workaround available to prevent the crash?

Experiencing the same issue with a couple of old HVs with Intel Xeon E5420 and E5620 CPUs. 3.10.0-1160.76.1 seems to be the last working kernel version.
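
A minimal sketch, based only on the versions named in this report, for checking whether a host is running one of the EL7 builds reported as crashing. The first bad build (3.10.0-1160.80.1.el7) comes from the comments above and the fixed build (3.10.0-1160.87.1.el7) from the "Fixed In Version" field. Treating every build in between as affected is an assumption, and the function names are hypothetical. It does not cover the EL8 kernels mentioned in the description.

```python
import platform
import re

# Versions taken from this bug report.
FIRST_BAD = (1160, 80, 1)    # first EL7 build reported as crashing
FIRST_FIXED = (1160, 87, 1)  # "Fixed In Version" from the bug header


def el7_build(release):
    """Return e.g. (1160, 80, 1) for '3.10.0-1160.80.1.el7.x86_64'.

    Returns None for anything that is not an EL7 3.10.0 z-stream kernel.
    """
    m = re.match(r"3\.10\.0-(\d+)\.(\d+)\.(\d+)\.el7", release)
    return tuple(int(x) for x in m.groups()) if m else None


def is_affected(release):
    """True if the release falls in [FIRST_BAD, FIRST_FIXED).

    Assumption: builds between the two bounds that nobody tested in
    this thread are treated as affected.
    """
    build = el7_build(release)
    if build is None:
        return False
    return FIRST_BAD <= build < FIRST_FIXED


if __name__ == "__main__":
    running = platform.release()
    print(running, "affected" if is_affected(running) else "not in the reported range")
```

Run on the hypervisor itself, this only compares version strings; it does not look at the CPU model, so a match means "in the reported range", not "will crash".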
Same issue, Intel Xeon E5520

*** Bug 2167465 has been marked as a duplicate of this bug. ***

Verified this bug with 3.10.0-1160.87.1.el7.x86_64. Test PASS.

Test Env:
3.10.0-1160.87.1.el7.x86_64
qemu-kvm-rhev-2.12.0-48.el7_9.4.x86_64
Model name: Intel(R) Xeon(R) CPU E5506 @ 2.13GHz

(1/8) Host_RHEL.m7.u9.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.7.9.x86_64.io-github-autotest-qemu.unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads: PASS (3384.42 s)
(2/8) Host_RHEL.m7.u9.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.7.9.x86_64.io-github-autotest-qemu.rh_kernel_update: PASS (311.20 s)
(3/8) Host_RHEL.m7.u9.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.7.9.x86_64.io-github-autotest-qemu.x86_cpu_model.host: PASS (158.35 s)
(4/8) Host_RHEL.m7.u9.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.7.9.x86_64.io-github-autotest-qemu.system_reset_bootable: PASS (393.83 s)
(5/8) Host_RHEL.m7.u9.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.7.9.x86_64.io-github-autotest-qemu.system_powerdown: PASS (91.47 s)
(6/8) Host_RHEL.m7.u9.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.7.9.x86_64.io-github-autotest-qemu.system_reset_during_boot: PASS (986.99 s)
(7/8) Host_RHEL.m7.u9.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.7.9.x86_64.io-github-autotest-qemu.boot: PASS (74.82 s)
(8/8) Host_RHEL.m7.u9.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.7.9.x86_64.io-github-autotest-qemu.reboot: PASS (1571.62 s)

Move this bug to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Important: kernel security and bug fix update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2023:1091
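
Both traces in this report show a byte in angle brackets on the Code: line. In kernel oops output that marker is the byte at the reported RIP, and 0xcc is the x86 int3 opcode, which matches the "int3" header of the first trace. The sketch below pulls that byte, and the few bytes before it, out of a pasted Code: line when comparing traces from different hosts. The helper names are hypothetical and nothing here is part of any kernel tooling.

```python
import re


def trapping_byte(code_line):
    """Return the byte marked with <..> on an oops 'Code:' line, or None."""
    m = re.search(r"<([0-9a-fA-F]{2})>", code_line)
    return int(m.group(1), 16) if m else None


def bytes_before(code_line, count=4):
    """Return up to `count` bytes that immediately precede the marked byte."""
    # Drop any timestamp and the 'Code:' label, then keep text before the marker.
    body = code_line.split("Code:", 1)[-1]
    head = body.split("<", 1)[0]
    hexes = re.findall(r"\b[0-9a-fA-F]{2}\b", head)
    return [int(h, 16) for h in hexes[-count:]]


if __name__ == "__main__":
    # Tail of the Code: line from the 3.10.0-1160.80.1.el7 trace in this report.
    line = "Code: 0f 90 c0 c3 cc cc cc cc 0f 91 c0 c3 cc <cc> cc cc cc"
    print(hex(trapping_byte(line)), [hex(b) for b in bytes_before(line, 5)])
```

For the EL7 trace the five bytes before the marker are 0f 91 c0 c3 cc, that is, setno %al, ret, and one int3 padding byte, which is consistent with the reported RIP of setno+0x5.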