Bug 2143438

Summary: kernel 3.10.0-1160.80.1.el7.x86_64 on Xeon E55xx crashes upon KVM startup [rhel-7.9.z]
Product: Red Hat Enterprise Linux 7 Reporter: Petko Alov <petko>
Component: kernel Assignee: Maxim Levitsky <mlevitsk>
kernel sub component: KVM QA Contact: liunana <nanliu>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: ailan, bhoefer, chayang, coli, gosert, jan.vanmullem, jinzhao, john.sincock, juzhang, Kiryanov_AK, kpfleming, mlevitsk, nathan, nmurray, orion, own3mall, quent.haas, redhatbug, redhat-bugzilla, rvrbovsk, sa-redhat, vchepkov, virt-maint, vquemener, xiaohli, ymankad
Version: 7.9 Keywords: Regression, Triaged, ZStream
Target Milestone: rc Flags: jinzhao: needinfo-
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: kernel-3.10.0-1160.87.1.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-03-07 09:54:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2160371    

Description Petko Alov 2022-11-16 23:14:45 UTC
Description of problem:

Attempting to start a qemu-kvm VM under kernel-3.10.0-1160.80.1.el7.x86_64 freezes each of 5 workstations with dual E5507 CPUs (all work OK under kernel-3.10.0-1160.76.1.el7.x86_64 or any previous version). Booting the previous kernel (3.10.0-1160.76.1.el7.x86_64) resolves the issue.

The workstations with E5-2609, E5-2650 or E5-2630 are not affected - all of them run qemu-kvm VM under kernel-3.10.0-1160.80.1.el7.x86_64 without problems. 



Version-Release number of selected component (if applicable):

3.10.0-1160.80.1.el7.x86_64


How reproducible:

Always


Steps to Reproduce:

1. Install kernel-3.10.0-1160.80.1.el7.x86_64 on a Xeon E5507 system
2. Reboot to new kernel
3. Run qemu-kvm


Actual results:

System freezes, power-off restart required


Expected results:

VM started


Additional info:

Some more information can be found on the CentOS mailing list, in the thread "[CentOS] Trouble with kernel-3.10.0-1160.80.1.el7.x86_64", including a similar observation on CentOS 8 with kernel-4.18.0-372.32.1.el8_6.x86_64.
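
For anyone triaging a fleet, the version information in this report can be turned into a quick check. The following is only a minimal sketch, not an official tool: the names `release_key` and `in_reported_range` are made up for illustration, and it assumes that every RHEL 7 kernel from 3.10.0-1160.80.1.el7 up to (but not including) the "Fixed In Version" 3.10.0-1160.87.1.el7 behaves like the builds reported here. Only 80.1 and 81.1 are actually confirmed in this thread.

```python
import re

# Bounds taken from this report; everything in between is an ASSUMPTION.
FIRST_REPORTED_BAD = (3, 10, 0, 1160, 80, 1)   # 3.10.0-1160.80.1.el7
FIXED_IN = (3, 10, 0, 1160, 87, 1)             # 3.10.0-1160.87.1.el7


def release_key(release):
    """Turn '3.10.0-1160.80.1.el7.x86_64' into (3, 10, 0, 1160, 80, 1)."""
    if ".el7" not in release:
        # RHEL 8 style releases use a different numbering and are not handled.
        raise ValueError("only RHEL 7 kernel releases are handled: %r" % release)
    head = release.split(".el7", 1)[0]
    return tuple(int(n) for n in re.findall(r"\d+", head))


def in_reported_range(release):
    """True if the release lies between the first bad and the fixed build."""
    return FIRST_REPORTED_BAD <= release_key(release) < FIXED_IN
```

On a live host the argument would typically be the output of `uname -r` (or `platform.release()` in Python). A `True` result only means the kernel is in the suspect range; whether the host actually freezes also depends on the CPU, as described above.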

Comment 3 Nathan Coulson 2022-11-17 19:03:24 UTC
This is also happening on AlmaLinux 8.6, vmlinuz-4.18.0-372.32.1.el8_6.x86_64

Same behavior, Xeon E56XX processors seemed fine, but a Xeon E5504 resulted in the following when doing a virsh start

This works when starting up using vmlinuz-4.18.0-372.26.1.el8_6.x86_64 and vmlinuz-4.18.0-372.16.1.el8_6.x86_64

Note: Have not tested with the AlmaLinux 8.7 kernel, and have since upgraded the CPU to a Xeon E5620 to work around this issue.

<pre>
[ 1916.592496] perf: interrupt took too long (2509 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 3039.858655] perf: interrupt took too long (3462 > 3136), lowering kernel.perf_event_max_sample_rate to 57000
[ 6595.276702] perf: interrupt took too long (4509 > 4327), lowering kernel.perf_event_max_sample_rate to 44000
[18421.984056] perf: interrupt took too long (5679 > 5636), lowering kernel.perf_event_max_sample_rate to 35000
[28152.353609] perf: interrupt took too long (7386 > 7098), lowering kernel.perf_event_max_sample_rate to 27000
[62686.481557] [drm] fb mappable at 0xB0363000
[62686.481567] [drm] vram apper at 0xB0000000
[62686.481570] [drm] size 3145728
[62686.481573] [drm] fb depth is 24
[62686.481576] [drm]    pitch is 4096
[62686.482139] fbcon: radeondrmfb (fb0) is primary device
[62686.508939] Console: switching to colour frame buffer device 128x48
[62686.511319] radeon 0000:02:00.0: [drm] fb0: radeondrmfb frame buffer device
[62871.281318] tun: Universal TUN/TAP device driver, 1.6
[62871.282942] bb.br: port 2(vnet0) entered blocking state
[62871.282981] bb.br: port 2(vnet0) entered disabled state
[62871.283076] device vnet0 entered promiscuous mode
[62871.283310] bb.br: port 2(vnet0) entered blocking state
[62871.283345] bb.br: port 2(vnet0) entered forwarding state
[62871.453289] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
[62872.743528] int3: 0000 [#1] SMP PTI
[62872.743531] CPU: 0 PID: 75559 Comm: CPU 0/KVM Kdump: loaded Not tainted 4.18.0-372.32.1.el8_6.x86_64 #1
[62872.743532] Hardware name:  , BIOS S5500.86B.01.00.0064.050520141428 05/05/2014
[62872.743533] RIP: 0010:setno+0x9/0x10 [kvm]
[62872.743534] Code: e5 dd 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 0f 90 c0 e9 b8 de e5 dd cc cc cc cc cc cc cc cc 0f 91 c0 e9 a8 de e5 dd cc <cc> cc cc cc cc cc cc 0f 92 c0 e9 98 de e5 dd cc cc cc cc cc cc cc
[62872.743536] RSP: 0018:ffffaccc03de3c18 EFLAGS: 00000286
[62872.743538] RAX: 00000000ffffffff RBX: ffff88b6472e9f50 RCX: 0000000000000000
[62872.743539] RDX: ffffffffc09a3594 RSI: 0000000000000000 RDI: ffff88b6472e9f50
[62872.743540] RBP: 0000000000000006 R08: ffff88b6f3360000 R09: 0000000000000000
[62872.743540] R10: 0000000000000230 R11: 0000000000000005 R12: ffffffffc09dae20
[62872.743541] R13: 0000000000000000 R14: ffff88b6472e9f50 R15: ffff88b6f3360000
[62872.743542] FS:  00007f6406a37700(0000) GS:ffff88bb97c00000(0000) knlGS:0000000000000000
[62872.743543] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[62872.743544] CR2: 0000000000000000 CR3: 0000000170378000 CR4: 00000000000026f0
[62872.743544] Call Trace:
[62872.743545]  ? x86_emulate_insn+0x1c7/0x1050 [kvm]
[62872.743546]  ? x86_decode_emulated_instruction+0x5a/0x210 [kvm]
[62872.743546]  ? x86_emulate_instruction+0x2f2/0x560 [kvm]
[62872.743547]  ? emulator_pio_in+0x30/0x70 [kvm]
[62872.743548]  ? vmx_handle_exit+0x36d/0x7a0 [kvm_intel]
[62872.743549]  ? vcpu_enter_guest+0xabb/0x1730 [kvm]
[62872.743549]  ? vmx_set_rflags+0xb3/0x240 [kvm_intel]
[62872.743550]  ? x86_emulate_instruction+0x47b/0x560 [kvm]
[62872.743551]  ? vmx_vcpu_load+0x27/0x40 [kvm_intel]
[62872.743551]  ? kvm_arch_vcpu_ioctl_run+0xff/0x5f0 [kvm]
[62872.743552]  ? kvm_vcpu_ioctl+0x2cc/0x640 [kvm]
[62872.743553]  ? do_vfs_ioctl+0xa4/0x690
[62872.743553]  ? ksys_ioctl+0x64/0xa0
[62872.743554]  ? __x64_sys_ioctl+0x16/0x20
[62872.743555]  ? do_syscall_64+0x5b/0x1b0
[62872.743555]  ? entry_SYSCALL_64_after_hwframe+0x61/0xc6
[62872.743556] Modules linked in: vhost_net vhost vhost_iotlb tap tun rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ipmi_ssif binfmt_misc scsi_transport_iscsi 8021q garp mrp bridge stp llc nft_chain_nat nf_nat nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter nf_tables_set nf_tables nfnetlink sunrpc iTCO_wdt gpio_ich iTCO_vendor_support intel_powerclamp coretemp kvm_intel kvm irqbypass intel_cstate snd_hda_codec_hdmi intel_uncore snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec pcspkr snd_hda_core snd_hwdep snd_seq snd_seq_device ioatdma snd_pcm snd_timer i2c_i801 snd lpc_ich soundcore acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler i7core_edac i5500_temp acpi_cpufreq vfat fat xfs libcrc32c ext4 mbcache jbd2 raid1 sd_mod t10_pi sg radeon mgag200 drm_ttm_helper ttm ahci libahci drm_kms_helper libata syscopyarea sysfillrect sysimgblt crc32c_intel fb_sys_fops drm igb dca i2c_algo_bit dm_mirror dm_region_hash dm_log dm_mod
</pre>

Comment 4 Orion Poplawski 2022-11-19 02:37:24 UTC
I am seeing this as well.

Comment 5 OwN 2022-11-23 21:17:17 UTC
I'm having this issue as well on my Dell C1100 server running KVM virtual machines.

Is there any update on this issue?  It completely broke my server, and I wasn't able to boot until I used the last kernel version (3.10.0-1160.76.1.el7.x86_64).  

This is a serious issue!

Comment 6 Quentin Haas 2022-11-26 13:07:20 UTC
Same issue on a Colfax CXT5000 server with said Xeon X5550 CPU and kernel version 3.10.0-1160.80.1.el7 in EL7.9

Comment 7 liunana 2022-11-29 09:39:59 UTC
Can reproduce this issue by booting qemu easily on Intel(R) Xeon(R) CPU E5540 @ 2.53GHz.


[438213.290834] Modules linked in: bridge stp llc intel_powerclamp coretemp kvm_intel gpio_ich iTCO_wdt kvm iTCO_vendor_support ipmi_ssif irqbypass pcspkr lpc_ich hpilo sg hpwdt i7core_edac ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_cpufreq ip_tables xfs libcrc32c sr_mod cdrom ata_generic pata_acpi radeon sd_mod crc_t10dif crct10dif_generic i2c_algo_bit drm_kms_helper crct10dif_common syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm drm_panel_orientation_quirks ata_piix qlcnic libata crc32c_intel serio_raw hpsa netxen_nic scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
[438213.301942] CPU: 5 PID: 6403 Comm: qemu-kvm Kdump: loaded Tainted: G          I    ------------   3.10.0-1160.80.1.el7.x86_64 #1
[438213.304465] Hardware name: HP ProLiant ML370 G6, BIOS P63 08/16/2015
[438213.305821] task: ffff9aadf048b180 ti: ffff9aad72754000 task.ti: ffff9aad72754000
[438213.307397] RIP: 0010:[<ffffffffc0906f55>]  [<ffffffffc0906f55>] setno+0x5/0x10 [kvm]
[438213.309121] RSP: 0018:ffff9aad72757c18  EFLAGS: 00000202
[438213.310247] RAX: 0000000000000200 RBX: ffff9aaec15f18c0 RCX: 000301001a242000
[438213.311759] RDX: ffffffffc0906f54 RSI: 0000000000000000 RDI: ffff9aaec15f18c0
[438213.313263] RBP: ffff9aad72757c48 R08: 0000000000000000 R09: 0000000000000000
[438213.314775] R10: 0000000000001ae0 R11: ffff9aaef2630008 R12: ffffffffc0927240
[438213.316280] R13: 0000000000000006 R14: ffff9aaec15f0000 R15: 0000000000000000
[438213.317792] FS:  00007f6ed071f700(0000) GS:ffff9aaef7880000(0000) knlGS:0000000000000000
[438213.319502] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[438213.320798] CR2: 0000000000000000 CR3: 00000000be524000 CR4: 00000000000027e0
[438213.322309] Call Trace:
[438213.322852]  [<ffffffffc0912f35>] ? x86_emulate_insn+0x635/0xd30 [kvm]
[438213.324277]  [<ffffffffc08f19d4>] x86_emulate_instruction+0x1e4/0x720 [kvm]
[438213.325795]  [<ffffffffc0982689>] vmx_handle_exit+0x1f9/0xc00 [kvm_intel]
[438213.327277]  [<ffffffffc0972c4d>] ? vmx_set_cr3+0xbd/0x190 [kvm_intel]
[438213.328737]  [<ffffffffc09182a5>] ? kvm_apic_has_interrupt+0x45/0xa0 [kvm]
[438213.330252]  [<ffffffffc08ed910>] vcpu_enter_guest+0x770/0x1470 [kvm]
[438213.331669]  [<ffffffffc08f1cb4>] ? x86_emulate_instruction+0x4c4/0x720 [kvm]
[438213.333218]  [<ffffffffba44bb02>] ? __mem_cgroup_commit_charge+0xe2/0x2f0
[438213.334755]  [<ffffffffc08f5a18>] kvm_arch_vcpu_ioctl_run+0x358/0x480 [kvm]
[438213.336271]  [<ffffffffc08d6239>] kvm_vcpu_ioctl+0x2d9/0x700 [kvm]
[438213.337630]  [<ffffffffba471418>] do_vfs_ioctl+0x3a8/0x5c0
[438213.338855]  [<ffffffffba4716b1>] SyS_ioctl+0x81/0xa0
[438213.339927]  [<ffffffffba3467a6>] ? __audit_syscall_exit+0x1f6/0x2b0
[438213.341319]  [<ffffffffba9c539a>] system_call_fastpath+0x25/0x2a
[438213.342643] Code: 00 00 48 85 ff 74 0a 55 48 89 e5 e8 06 48 9c f9 5d c3 cc cc cc cc 0f 90 c0 c3 cc cc cc cc cc cc cc cc cc cc cc cc 0f 91 c0 c3 cc <cc> cc cc cc cc cc cc cc cc cc cc 0f 92 c0 c3 cc cc cc cc cc cc 
[438213.346655] RIP  [<ffffffffc0906f55>] setno+0x5/0x10 [kvm]
[438213.347869]  RSP <ffff9aad72757c18>
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 3.10.0-1160.80.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Sat Oct 8 18:13:21 UTC 2022
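
As a reading aid for the trace above: the "Code:" line of an oops wraps the byte at RIP in angle brackets. The sketch below (the helper name `faulting_byte` is hypothetical) pulls out that byte and its position. It makes no claim about the root cause; it only shows that the marked byte is 0xcc, which is the x86 int3 opcode.

```python
import re


def faulting_byte(code_line):
    """Return (index, value) of the <..>-marked byte on an oops 'Code:' line."""
    hex_part = code_line.split("Code:", 1)[1]
    for index, token in enumerate(hex_part.split()):
        match = re.fullmatch(r"<([0-9a-f]{2})>", token)
        if match:
            return index, int(match.group(1), 16)
    raise ValueError("no <..> marked byte found")


# Excerpt of the Code line above, starting at the bytes 0f 91 c0
# (the encoding of 'setno %al'); c3 is 'ret' and cc is 'int3'.
excerpt = "Code: 0f 91 c0 c3 cc <cc> cc cc"
index, value = faulting_byte(excerpt)
print(index, hex(value))
```

For this excerpt the marked byte sits at index 5 with value 0xcc, which lines up with the reported RIP of setno+0x5, assuming the excerpt really begins at the start of setno.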



Can also reproduce this issue with 3.10.0-1160.81.1.el7.x86_64.

Reproduce steps:
Boot qemu: /usr/libexec/qemu-kvm

Test Env:
    # virsh domcapabilities | grep model
        <mode name='host-model' supported='yes'>
          <model fallback='forbid'>Nehalem-IBRS</model>
    Model name:            Intel(R) Xeon(R) CPU           E5540  @ 2.53GHz
    qemu-kvm-rhev-2.12.0-48.el7_9.4.x86_64
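
To compare a host against the CPUs mentioned in this bug, the "Model name" line can be parsed as sketched below. This is an illustration only: the helper names are invented, the regular expression assumes the model-name layout shown in the Test Env above, and the set contains just the models that reporters here named as failing, so it is not an exhaustive list of affected CPUs.

```python
import re

# Models named as failing by reporters in this bug (NOT exhaustive).
REPORTED_FAILING = {"E5504", "E5507", "E5520", "E5540", "X5550"}


def xeon_model(model_name):
    """Extract a model token such as 'E5540' from a 'Model name' string."""
    match = re.search(r"Xeon\(R\) CPU\s+([A-Z]\d{4})\b", model_name)
    return match.group(1) if match else None


def named_in_thread(model_name):
    """True if the CPU model is one that reporters here saw failing."""
    return xeon_model(model_name) in REPORTED_FAILING


line = "Model name:            Intel(R) Xeon(R) CPU           E5540  @ 2.53GHz"
print(xeon_model(line), named_in_thread(line))
```

A `False` result does not prove a CPU is safe; it only means nobody in this bug reported that exact model.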



I will upload the full vmcore-dmesg.txt later.




Hi Amnon,


Would you please help find the right developer to check this issue? Thanks a lot!


Best regards
Nana

Comment 15 Scott M 2022-12-23 19:06:12 UTC
Does 3.10.0-1160.81.1.el7 do anything to fix this?

I'm guessing no, but thought I'd ask.

Comment 16 Orion Poplawski 2022-12-28 19:58:31 UTC
* Thu Nov 24 2022 Rado Vrbovsky <rvrbovsk> [3.10.0-1160.81.1.el7]
- [netdrv] bnxt: don't lock the tx queue from napi poll (Jamie Bainbridge) [2110869]
- [netdrv] bnxt_en: reverse order of TX disable and carrier off (Jamie Bainbridge) [2110869]
- [netdrv] qede: confirm skb is allocated before using (Jamie Bainbridge) [2131145]

So, no

Comment 18 John 2023-01-10 06:12:43 UTC
I'm also seeing this on 3.10.0-1160.80.1.el7.x86_64, with Intel(R) Xeon(R) CPU E5520  @ 2.27GHz
As soon as I attempt to start a VM, the host hangs and reboots.

This happened to me on a server never used for KVM before, when setting up first VM, and with autostart NOT enabled.

If this kernel is installed on a hypervisor with VMs already set up and set to autostart on reboot, then the result will likely be an endless loop of hang/reboot.
This will likely continue until someone notices, boots into rescue mode or an alternate kernel, and sorts things out.
Or, if no one notices, it will likely continue until the hard reboots corrupt something important on the FS and leave the system in a real mess.

Brilliant.

This kernel should be removed from repository, it is a menace.

Older kernel 3.10.0-1160.66.1.el7.x86_64 works fine.

Comment 21 liunana 2023-01-11 10:21:13 UTC
Hi Maxim,

Would you please help to check the test results in Comment 20? The fix works well on my side.
What's the next step for this bug?


Thanks.
Nana

Comment 22 Robert Scheck 2023-01-12 23:10:30 UTC
I'm unfortunately able to reproduce the issue with Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (on HP ProLiant ML350 G6), too. Reverting to 3.10.0-1160.71.1.el7.x86_64 keeps the system working.

Comment 23 Thorsten Gosert 2023-01-31 10:36:33 UTC
We also see this issue on one of our KVM servers. But reverting to 3.10.0-1160.76.1.el7.x86_64.img is no longer an option due to vulnerabilities in this old kernel version. When can we expect a fixed kernel version, or is there a workaround available to prevent the crash?

Comment 24 Andrey Kiryanov 2023-02-01 09:51:03 UTC
Experiencing the same issue with a couple of old HVs with Intel Xeon E5420 and E5620 CPUs.
3.10.0-1160.76.1 seems to be the last working kernel version.

Comment 28 Vadym Chepkov 2023-02-04 18:01:27 UTC
Same issue, Intel Xeon E5520

Comment 30 Bernie Hoefer 2023-02-07 15:02:13 UTC
*** Bug 2167465 has been marked as a duplicate of this bug. ***

Comment 36 liunana 2023-02-15 13:34:10 UTC
Verified this bug with 3.10.0-1160.87.1.el7.x86_64. Test PASS.

Test Env:
    3.10.0-1160.87.1.el7.x86_64
    qemu-kvm-rhev-2.12.0-48.el7_9.4.x86_64
    Model name:            Intel(R) Xeon(R) CPU           E5506  @ 2.13GHz


 (1/8) Host_RHEL.m7.u9.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.7.9.x86_64.io-github-autotest-qemu.unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads: PASS (3384.42 s)
 (2/8) Host_RHEL.m7.u9.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.7.9.x86_64.io-github-autotest-qemu.rh_kernel_update: PASS (311.20 s)
 (3/8) Host_RHEL.m7.u9.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.7.9.x86_64.io-github-autotest-qemu.x86_cpu_model.host: PASS (158.35 s)
 (4/8) Host_RHEL.m7.u9.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.7.9.x86_64.io-github-autotest-qemu.system_reset_bootable: PASS (393.83 s)
 (5/8) Host_RHEL.m7.u9.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.7.9.x86_64.io-github-autotest-qemu.system_powerdown: PASS (91.47 s)
 (6/8) Host_RHEL.m7.u9.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.7.9.x86_64.io-github-autotest-qemu.system_reset_during_boot: PASS (986.99 s)
 (7/8) Host_RHEL.m7.u9.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.7.9.x86_64.io-github-autotest-qemu.boot: PASS (74.82 s)
 (8/8) Host_RHEL.m7.u9.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.7.9.x86_64.io-github-autotest-qemu.reboot: PASS (1571.62 s)


Move this bug to verified.

Comment 40 errata-xmlrpc 2023-03-07 09:54:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:1091