Description of problem:
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Start a virtual machine under virt-manager (windows 7 or XP)
2. See it running slowly
3. Check dmesg and see zillions of warnings
VM runs slow
zillions of errors in dmesg
[35238.492760] vmwrite error: reg 401e value 2021 (err 12)
[35238.492765] CPU: 0 PID: 21396 Comm: CPU 0/KVM Tainted: P W OE 4.16.1-300.fc28.x86_64 #1
[35238.492766] Hardware name: System manufacturer System Product Name/P5B-Deluxe, BIOS 1238 09/30/2008
[35238.492767] Call Trace:
[35238.492780] vmx_set_cr4+0x1dc/0x200 [kvm_intel]
[35238.492803] kvm_set_cr4+0x18b/0x2e0 [kvm]
[35238.492823] rsm_enter_protected_mode+0x7f/0x110 [kvm]
[35238.492844] rsm_load_state_64+0x6cb/0x8b0 [kvm]
[35238.492863] ? emulator_task_switch+0x290/0x4a0 [kvm]
[35238.492883] em_rsm+0x258/0x2a0 [kvm]
[35238.492903] x86_emulate_insn+0x15a/0x10a0 [kvm]
[35238.492908] ? vmx_get_cs_db_l_bits+0x17/0x30 [kvm_intel]
[35238.492927] x86_emulate_instruction+0x14d/0x670 [kvm]
[35238.492932] handle_exception+0x112/0x330 [kvm_intel]
[35238.492937] vmx_handle_exit+0xab/0xe10 [kvm_intel]
[35238.492941] ? vmx_vcpu_run+0x7b7/0x990 [kvm_intel]
[35238.492961] ? kvm_arch_vcpu_ioctl_run+0x504/0x1720 [kvm]
[35238.492964] ? __schedule+0x23f/0x850
[35238.492967] ? __update_load_avg_se.isra.33+0x1c5/0x1d0
[35238.492984] ? kvm_vcpu_ioctl+0x2b8/0x5e0 [kvm]
[35238.492986] ? set_next_entity+0x96/0x1c0
[35238.492988] ? pick_next_task_fair+0x2f7/0x5a0
[35238.492991] ? do_vfs_ioctl+0xa4/0x610
[35238.492993] ? SyS_ioctl+0x74/0x80
[35238.492995] ? do_syscall_64+0x74/0x180
[35238.492997] ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
same errors also with nouveau. nvidia taint is not the cause
Runs fast with no warnings
Old processor, same one hit previously by kernel regression
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
v4.16 is doing an unconditional RMW on SECONDARY_VM_EXEC_CONTROL in the vmx_set_cr4 path. The SECONDARY_VM_EXEC_CONTROL VMCS field is not guaranteed to exist on older processors, e.g. most Core 2 SKUs, causing VMWRITE to fail. KVM's VMREAD function doesn't warn on failure and so only the VMWRITE failure is logged. Because the VMREAD fails in the RMW sequence, the value reported in the VMWRITE error path is garbage, i.e. whatever happened to be in the GPR prior to VMREAD. In this case it's more than likely the value that is actually destined to be written to CR4.
This error is benign in the sense that it won't actually affect KVM functionality, though it undoubtedly slows down the system due to flooding the kernel log.
I confirm that since 4.16 I cannot use VMs with kvm anymore, again.
I have to switch back to using VirtualBox, that while nice enough, is not my preferred solution.
If removing the RMW for that set of old processors won't affect functionality, then I would ask to at least make it conditional or roll the change back as it has created a regression.
Also happens with kernel-4.16.3-200.fc27. In my setup kernel stalls then completely (e.g. after bootup 3 pings, afterwards no ping replay from the hosts). Around 10 VMs are started.
See https://bodhi.fedoraproject.org/updates/FEDORA-2018-b6be90818d for details.
Possible fix: https://patchwork.kernel.org/patch/10353377/
Testbuild at: https://koji.fedoraproject.org/koji/taskinfo?taskID=26538797
I have installed and tested on my PC the testbuild kernel from:
It works fine in my PC. No issues to report. Good job everyone!
Same problem and same fix with the testbuild kernel https://koji.fedoraproject.org/koji/taskinfo?taskID=26538797
Works fine again. Good job!
I have seen various comments about the correctness of the patch, just wondering if there's a consensus about it and therefore when the patch would be applied upstream and made available in an official fedora build. Right now I have to choose between using the fixed but stale koji build and a broken fresh release, that is not the ideal situation.
There was a second version posted on Monday https://patchwork.kernel.org/patch/10372151/ and tagged for stable. I haven't seen it applied to the kvm tree yet but I would expect it to happen sometime this coming week. If you see the maintainer ack it or mention it is applied feel free to let us know so we can pick it up.
*** Bug 1575625 has been marked as a duplicate of this bug. ***
I don't know if this is related or not... but I haven't found a better bug match yet...
I have 16GB of RAM and 4GB of swap. I start a KVM VM that is assigned 2GB of RAM. I have plenty of free RAM available (although a significant amount of it might be used as buffer/cache) but the system goes into a swap storm. It is like getting a request to 2GB of RAM for the VM somehow made the system overreact and decide to start swapping all my desktop applications to swap. swap goes from 0 to 2GB+ in a couple of minutes, the hard drive light is constantly on... and switching between applications is impossible... unless I'm extra patient and wait several minutes. Looking at free -m I have 10GB+ of RAM available but it is in a heavy swap situation.
Just rebooted tot he 4.16.7-300 kernel and am hoping not to run into that again but I started having it on 4.15.x and all of the 4.16.x releases.
kernel-4.16.8-300.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2018-ac3b4c7605
*** Bug 1577283 has been marked as a duplicate of this bug. ***
kernel-4.16.8-300.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-ac3b4c7605
kernel-4.16.8-300.fc28 has been pushed to the Fedora 28 stable repository. If problems still persist, please make note of it in this bug report.
Please issue an fc27 update, since it too is affected.
Kernel 4.17.19-200.fc28.x86_64 is ok
but Kernel 4.18.9-200.fc28.x86_64 is again NOT ok.
Maybe the patch is missing.
Is the upstream kernel integration maybe missing, too?
Indeed problem is still present with kernel 4.18.9-200.fc28.x86_64.
Errors in dmesg:
[207786.536493] vmwrite error: reg 401e value 0 (err 12)
[207786.546141] CPU: 6 PID: 7719 Comm: CPU 0/KVM Not tainted 4.18.9-200.fc28.x86_64 #1
[207786.556149] Hardware name: Dell Inc. PowerEdge 2900/0TM757, BIOS 2.7.0 10/30/2010
[207786.566105] Call Trace:
[207786.585904] vmx_set_virtual_apic_mode+0x1c5/0x270 [kvm_intel]
[207786.595926] ? kvm_make_all_cpus_request+0x4d/0x70 [kvm]
[207786.605794] kvm_lapic_set_base+0x7b/0x190 [kvm]
[207786.615583] kvm_set_apic_base+0xac/0xd0 [kvm]
[207786.625218] kvm_set_msr_common+0x829/0xc30 [kvm]
[207786.635258] ? vmx_vcpu_run+0x404/0xb80 [kvm_intel]
[207786.644796] ? vmx_vcpu_run+0x410/0xb80 [kvm_intel]
[207786.654695] ? vmx_vcpu_run+0x404/0xb80 [kvm_intel]
[207786.664250] ? vmx_vcpu_run+0x410/0xb80 [kvm_intel]
[207786.673532] ? vmx_vcpu_run+0x404/0xb80 [kvm_intel]
[207786.683044] handle_wrmsr+0x55/0x140 [kvm_intel]
[207786.693016] kvm_arch_vcpu_ioctl_run+0x56e/0x1a10 [kvm]
[207786.702666] kvm_vcpu_ioctl+0x2b8/0x5e0 [kvm]
[207786.722416] ? apic_timer_interrupt+0xa/0x20
[207786.771391] RIP: 0033:0x7ff61abf4c57
[207786.781515] Code: 00 00 90 48 8b 05 49 82 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 19 82 2c 00 f7 d8 64 89 01 48
[207786.802966] RSP: 002b:00007ff60ba37648 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[207786.813995] RAX: ffffffffffffffda RBX: 00007ff624ac7002 RCX: 00007ff61abf4c57
[207786.824446] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000017
[207786.834947] RBP: 0000000000000001 R08: 000055b035f73730 R09: 000000000000ffff
[207786.844994] R10: 000055b035f5d490 R11: 0000000000000246 R12: 000055b035f59760
[207786.855174] R13: 0000000000000000 R14: 00007ff624ac6000 R15: 000055b0386308b0
Confirmed there's something fishy again with kernel-4.18.9-200.fc28.x86_64
[ +0.163969] vmwrite error: reg 401e value 0 (err 12)
[ +0.000008] CPU: 0 PID: 8329 Comm: CPU 1/KVM Tainted: P OE 4.18.9-200.fc28.x86_64 #1
[ +0.000001] Hardware name: System manufacturer System Product Name/P5B-Deluxe, BIOS 1238 09/30/2008
[ +0.000001] Call Trace:
[ +0.000011] dump_stack+0x5c/0x80
[ +0.000010] vmx_set_virtual_apic_mode+0x1c5/0x270 [kvm_intel]
[ +0.000039] ? kvm_make_all_cpus_request+0x4d/0x70 [kvm]
[ +0.000019] kvm_lapic_set_base+0x7b/0x190 [kvm]
[ +0.000019] kvm_set_apic_base+0xac/0xd0 [kvm]
[ +0.000018] kvm_set_msr_common+0x829/0xc30 [kvm]
[ +0.000005] ? vmx_vcpu_run+0x404/0xb80 [kvm_intel]
[ +0.000005] ? vmx_vcpu_run+0x410/0xb80 [kvm_intel]
[ +0.000004] ? vmx_vcpu_run+0x404/0xb80 [kvm_intel]
[ +0.000004] ? vmx_vcpu_run+0x410/0xb80 [kvm_intel]
[ +0.000004] ? vmx_vcpu_run+0x404/0xb80 [kvm_intel]
[ +0.000004] handle_wrmsr+0x55/0x140 [kvm_intel]
[ +0.000020] kvm_arch_vcpu_ioctl_run+0x56e/0x1a10 [kvm]
[ +0.000016] kvm_vcpu_ioctl+0x2b8/0x5e0 [kvm]
[ +0.000004] ? do_signal+0x1a8/0x610
[ +0.000003] ? __fpu__restore_sig+0x75/0x440
[ +0.000002] do_vfs_ioctl+0xa4/0x620
[ +0.000003] ksys_ioctl+0x60/0x90
[ +0.000001] __x64_sys_ioctl+0x16/0x20
[ +0.000003] do_syscall_64+0x5b/0x160
[ +0.000003] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ +0.000003] RIP: 0033:0x7f1fcf961c57
[ +0.000000] Code: 00 00 90 48 8b 05 49 82 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 19 82 2c 00 f7 d8 64 89 01 48
[ +0.000035] RSP: 002b:00007f1fbb7fd648 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ +0.000002] RAX: ffffffffffffffda RBX: 0000562f67310920 RCX: 00007f1fcf961c57
[ +0.000001] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000017
[ +0.000001] RBP: 0000562f6568aa60 R08: 0000562f650ac730 R09: 0000562f656bb640
[ +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000562f65092760
[ +0.000001] R13: 00007ffe00301fbf R14: 00007f1fd96a6000 R15: 0000562f67310920
However the behavior seems a bit different than before:
My windows 7 VM goes into suspended state after a while and I cannot resume it, while my fedora 28 VM throws some of these errors but not the zillions we had previously.
This is clearly a bummer, can someone take a look at it?
Let's track this on the new bug (https://bugzilla.redhat.com/show_bug.cgi?id=1632426) - it's not a regression of the fix for this bug as far as I can tell, just a new unconditional write to a field that presumably doesn't exist.