Bug 1036792

Summary: PXE boot 5-10x slower in RHEL due to invalid guest state emulation
Product: Red Hat Enterprise Linux 7 Reporter: Paolo Bonzini <pbonzini>
Component: kernelAssignee: Bandan Das <bdas>
kernel sub component: KVM QA Contact: Virtualization Bugs <virt-bugs>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: amit.shah, bcl, bdas, berrange, cfergeau, clalancette, dwmw2, gansalmon, hhuang, itamar, jforbes, jonathan, juzhang, kernel-maint, knoel, laine, lersek, libvirt-maint, madhu.chinakonda, michen, mkolman, pbonzini, qiguo, rjones, scottt.tw, veillard, virt-maint, vpodzime, xwei
Version: 7.0Keywords: TestOnly
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: kernel-3.10.0-258.el7 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 980882 Environment:
Last Closed: 2015-11-19 20:01:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 980882, 1209995    
Bug Blocks: 1205790    

Description Paolo Bonzini 2013-12-02 15:49:59 UTC
+++ This bug was initially created as a clone of Bug #980882 +++

Description of problem:
After update to the recent version of libvirt etc. in F18, booting virtual machine over PXE is unusably slow. Upgrading to F19 doesn't solve the issue. It was okay with the older versions.

This seems to be a kernel issue -- works well with kernel-3.8.13-100.fc17 -- reassigning. Did something change in the TFTP stack recently?

--- Additional comment from Paolo Bonzini on 2013-11-30 04:05:27 EST ---

Vratislav, can you please run

grep . /sys/module/kvm_intel/parameters/*

I suspect this is caused by iPXE's usage of big real mode.

--- Additional comment from Paolo Bonzini on 2013-12-02 05:29:05 EST ---

Indeed:

/sys/module/kvm_intel/parameters/emulate_invalid_guest_state:Y
/sys/module/kvm_intel/parameters/unrestricted_guest:N

--- Additional comment from Vratislav Podzimek on 2013-12-02 05:35:20 EST ---

Is there any boot option or something like that I could use to disable the problematic feature?

--- Additional comment from Paolo Bonzini on 2013-12-02 10:41:01 EST ---

Yes, you can load KVM with "emulate_invalid_guest_state=N" to work around the problem.

I reproduced your scenario with a 2MB vmlinuz and 12MB initrd.img.  Here is my pxelinux.cfg/default file:

default anaconda
prompt 1
timeout 10
label local
  localboot 1
label anaconda
  kernel kernels/vmlinuz
  initrd kernels/initrd.img

I have:
unrestricted_guest=Y => ~1s
emulate_invalid_guest_state=N unrestricted_guest=N => ~1s
emulate_invalid_guest_state=Y unrestricted_guest=N => 10s

So it's not _unbearably slow_, but it is pretty slow indeed.

Comment 3 Paolo Bonzini 2014-03-27 11:24:31 UTC
I managed to get a 1.6-3x speedup in invalid guest state microbenchmarks.  We could consider moving this to kernel if that speedup is enough.

Comment 4 Vratislav Podzimek 2014-03-27 11:40:10 UTC
(In reply to Paolo Bonzini from comment #3)
> I managed to get a 1.6-3x speedup in invalid guest state microbenchmarks. 
> We could consider moving this to kernel if that speedup is enough.
That would be nice.

Comment 6 Vratislav Podzimek 2014-06-18 12:36:40 UTC
Any progress? Btw, I still see no reason for the PXE boot in a VM being drastically slower than on a real HW with trivial PXE stack firmware. What's the limitation?

Comment 7 Laszlo Ersek 2014-06-18 17:03:08 UTC
- iPXE is anything but trivial

- when you have suitable hardware that supports the "unrestricted guest" feature,
  the big real mode code of iPXE runs transparently on the VCPU

- when you have old hardware without this feature, then KVM has to emulate
  such code manually (implement the instructions in software) -- hence
  "emulate_invalid_guest_state". It is slow, but Paolo and Bandan are
  optimizing it.

- when you have old hardware (no unrestricted guest host CPU feature), and
  *also* turn off emulation, then... iPXE shouldn't actually run (because the
  code shouldn't be possible to execute in any way). Qemu should bomb out with
  an emulation failure from KVM. Hpwever, according to comment 0, under this
  setup iPXE runs in fact. I don't understand how that's possible, so I'll
  leave its explanation to Paolo & Bandan.

Comment 8 Paolo Bonzini 2014-06-19 15:28:49 UTC
In non-emulate_invalid_guest_state mode, KVM tries to make invalid state work by approximating it within the constraints of the processor.  If iPXE works, it means that it doesn't *really* need big real mode; the usage is only there to comply with the protected mode memory allocator spec (PMM).

IIUC iPXE has to do a lot of mode transitions because it has to work together with the BIOS.  Each time you go out of iPXE and back you get 4 transition (because SeaBIOS also goes to protected mode and back).

On a real processor (or a Westmere or more recent) this costs only a few hundred cycles times 4.  On older processors the cost, incurred a lot of times per second, is multiplied by 3000 or more, until the point that not only you saturate the CPU, you also start getting slower.

See bug 1088784 too, exposed by the many transitions.

It's actually Stefan who probably knows more about it than everyone else.

Comment 12 Rafael Aquini 2015-06-02 12:25:29 UTC
Patch(es) available on kernel-3.10.0-258.el7

Comment 14 Xiaoqing Wei 2015-07-03 08:13:54 UTC
on system with packages:
[root@dhcp-11-118 ~]# rpm -qa | grep -i ipxe
ipxe-bootimgs-20130517-6.gitc4bce43.el7.noarch
ipxe-roms-20130517-6.gitc4bce43.el7.noarch
ipxe-roms-qemu-20130517-6.gitc4bce43.el7.noarch
[root@dhcp-11-118 ~]# rpm -q qemu-kvm-rhev
qemu-kvm-rhev-2.3.0-1.el7.x86_64
[root@dhcp-11-118 ~]# 

define a VM with net boot as 1st priority,
measuring time from start loading kernel, to the anaconda interface appearing


[root@dhcp-11-118 ~]# uname -r
3.10.0-287.el7.x86_64

emulate_invalid_guest_state=Y unrestricted_guest=N => 43
emulate_invalid_guest_state=N unrestricted_guest=N => 24

==========


[root@dhcp-11-118 ~]# uname -r
3.10.0-247.el7.x86_64

emulate_invalid_guest_state=Y unrestricted_guest=N => 57
emulate_invalid_guest_state=N unrestricted_guest=N => 24







[root@dhcp-11-118 ~]# lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 42
Model name:            Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz
Stepping:              7
CPU MHz:               3374.035
BogoMIPS:              6185.72
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K
NUMA node0 CPU(s):     0-3

hardware features exposed by default(modprobe kvm_intel)
[root@dhcp-11-118 ~]# grep . /sys/module/kvm_intel/parameters/*
/sys/module/kvm_intel/parameters/emulate_invalid_guest_state:Y
/sys/module/kvm_intel/parameters/enable_apicv:N
/sys/module/kvm_intel/parameters/enable_shadow_vmcs:N
/sys/module/kvm_intel/parameters/ept:Y
/sys/module/kvm_intel/parameters/eptad:N
/sys/module/kvm_intel/parameters/fasteoi:Y
/sys/module/kvm_intel/parameters/flexpriority:Y
/sys/module/kvm_intel/parameters/nested:Y
/sys/module/kvm_intel/parameters/ple_gap:0
/sys/module/kvm_intel/parameters/ple_window:4096
/sys/module/kvm_intel/parameters/ple_window_grow:2
/sys/module/kvm_intel/parameters/ple_window_max:1073741823
/sys/module/kvm_intel/parameters/ple_window_shrink:0
/sys/module/kvm_intel/parameters/unrestricted_guest:Y
/sys/module/kvm_intel/parameters/vmm_exclusive:Y
/sys/module/kvm_intel/parameters/vpid:Y

Comment 15 Xiaoqing Wei 2015-07-03 08:19:18 UTC
so on fixed version:
the buggy configuration could save 14 seconds
emulate_invalid_guest_state=Y unrestricted_guest=N

while the un-buggy one didn't impacted.


Hi Bandan,

could you pls have a check whether above test sufficient to verify this bug ?

Thanks,
Xiaoqing.

Comment 16 Bandan Das 2015-07-06 22:57:17 UTC
(In reply to Xiaoqing Wei from comment #15)
> so on fixed version:
> the buggy configuration could save 14 seconds
> emulate_invalid_guest_state=Y unrestricted_guest=N
> 
> while the un-buggy one didn't impacted.
> 
> 
> Hi Bandan,
> 
> could you pls have a check whether above test sufficient to verify this bug ?

Looks good, Additionally, you could run realmode.flat from kvm-unit-tests as another test point. Make sure to set emulate_invalid_guest_state.

> Thanks,
> Xiaoqing.

Comment 17 Xiaoqing Wei 2015-07-07 08:56:32 UTC
git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git

[root@dhcp-11-118 kvm-unit-tests]# uname -r
3.10.0-287.el7.x86_64
[root@dhcp-11-118 kvm-unit-tests]# grep . /sys/module/kvm_intel/parameters/*
/sys/module/kvm_intel/parameters/emulate_invalid_guest_state:Y
/sys/module/kvm_intel/parameters/enable_apicv:N
/sys/module/kvm_intel/parameters/enable_shadow_vmcs:N
/sys/module/kvm_intel/parameters/ept:Y
/sys/module/kvm_intel/parameters/eptad:N
/sys/module/kvm_intel/parameters/fasteoi:Y
/sys/module/kvm_intel/parameters/flexpriority:Y
/sys/module/kvm_intel/parameters/nested:Y
/sys/module/kvm_intel/parameters/ple_gap:0
/sys/module/kvm_intel/parameters/ple_window:4096
/sys/module/kvm_intel/parameters/ple_window_grow:2
/sys/module/kvm_intel/parameters/ple_window_max:1073741823
/sys/module/kvm_intel/parameters/ple_window_shrink:0
/sys/module/kvm_intel/parameters/pml:N
/sys/module/kvm_intel/parameters/unrestricted_guest:N
/sys/module/kvm_intel/parameters/vmm_exclusive:Y
/sys/module/kvm_intel/parameters/vpid:Y

=============================

[root@dhcp-11-118 kvm-unit-tests]# ./run_tests.sh x86/realmode.flat
PASS apic
PASS ioapic
PASS smptest
PASS smptest3
PASS vmexit_cpuid
PASS vmexit_vmcall
PASS vmexit_mov_from_cr8
PASS vmexit_mov_to_cr8
PASS vmexit_inl_pmtimer
PASS vmexit_ipi
PASS vmexit_ipi_halt
PASS vmexit_ple_round_robin
PASS access
PASS emulator
PASS eventinj
PASS hypercall
PASS idt_test
PASS msr
skip pmu (/proc/sys/kernel/nmi_watchdog not equal to 0)
PASS port80
PASS realmode


^^^^^ the test stops output since this line, for hours.




-------------------------------------------


[root@dhcp-11-118 kvm-unit-tests]# tail test.log 
1540 cycles/emulated memory RMW instruction
Return value from qemu: 1
qemu-kvm -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/s3.flat -smp 1
enabling apic
RSDP is at f1e70
RSDT is at 7fe1657
FADT is at 7fe0bda
FACS is at 7fe0000
resume vector addr is 7fe000c
copy resume code from 400348


^^^^ test log also stop output for hours.



Bandan,
any idea ?

Comment 18 Bandan Das 2015-07-07 18:49:35 UTC
(In reply to Xiaoqing Wei from comment #17)
> git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git
...
> 
> [root@dhcp-11-118 kvm-unit-tests]# tail test.log 
> 1540 cycles/emulated memory RMW instruction
> Return value from qemu: 1
> qemu-kvm -enable-kvm -device pc-testdev -device
> isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device
> pci-testdev -kernel x86/s3.flat -smp 1
> enabling apic
> RSDP is at f1e70
> RSDT is at 7fe1657
> FADT is at 7fe0bda
> FACS is at 7fe0000
> resume vector addr is 7fe000c
> copy resume code from 400348
runtests.sh doesn't take an argument, it just runs all the tests. This behavior in the resume test is expected.

You can just run realmode.flat test by executing:
./x86-run x86/realmode.flat

Then, check the number of cycles per emulated instruction. This should be smaller for build 287.

> 
> ^^^^ test log also stop output for hours.
> 
> 
> 
> Bandan,
> any idea ?

Comment 19 Xiaoqing Wei 2015-07-08 02:58:20 UTC
[root@dhcp-11-118 kvm-unit-tests]# tail -n 20 247 287 
==> 247 <==
PASS: movsx ah
PASS: movzx ah
PASS: bswap
PASS: aad
PASS: aam
PASS: xlat
PASS: salc (1)
PASS: salc (2)
PASS: fninit
PASS: mov dr with mod bits
FAIL: smsw
PASS: nopl
FAIL: xadd
2129 cycles/emulated jump instruction
2301 cycles/emulated move instruction
2215 cycles/emulated arithmetic instruction
2496 cycles/emulated memory load instruction
2607 cycles/emulated memory store instruction
3040 cycles/emulated memory RMW instruction
Return value from qemu: 3

==> 287 <==
PASS: movsx ah
PASS: movzx ah
PASS: bswap
PASS: aad
PASS: aam
PASS: xlat
PASS: salc (1)
PASS: salc (2)
PASS: fninit
PASS: mov dr with mod bits
PASS: smsw
PASS: nopl
PASS: xadd
774 cycles/emulated jump instruction
870 cycles/emulated move instruction
835 cycles/emulated arithmetic instruction
1076 cycles/emulated memory load instruction
1155 cycles/emulated memory store instruction
1539 cycles/emulated memory RMW instruction
Return value from qemu: 1
[root@dhcp-11-118 kvm-unit-tests]# 


the cycles per instruction for build 287 is smaller then 247, as expected.
but the return value from qemu both non-zero on both kernel versions,


# tail -n5 x86/README 
Legacy notes:
 The exit status of the binary (and the script) is inconsistent: with
 qemu-system, after the unit-test is done, the exit status of qemu is 1,
 different from the 'old style' qemu-kvm, whose exit status in successful
 completion is 0.


so I think the value of 1 on 287(the patched kernel) should be OK,

Bandan,

could you please confirm ?
or correct me if I'm wrong.

Thanks,
Xiaoqing.

Comment 20 Bandan Das 2015-07-08 04:59:45 UTC
(In reply to Xiaoqing Wei from comment #19)
> [root@dhcp-11-118 kvm-unit-tests]# tail -n 20 247 287 
> ==> 247 <==
> PASS: movsx ah
> PASS: movzx ah
> PASS: bswap
> PASS: aad
> PASS: aam
> PASS: xlat
> PASS: salc (1)
> PASS: salc (2)
> PASS: fninit
> PASS: mov dr with mod bits
> FAIL: smsw
> PASS: nopl
> FAIL: xadd
> 2129 cycles/emulated jump instruction
> 2301 cycles/emulated move instruction
> 2215 cycles/emulated arithmetic instruction
> 2496 cycles/emulated memory load instruction
> 2607 cycles/emulated memory store instruction
> 3040 cycles/emulated memory RMW instruction
> Return value from qemu: 3
> 
> ==> 287 <==
> PASS: movsx ah
> PASS: movzx ah
> PASS: bswap
> PASS: aad
> PASS: aam
> PASS: xlat
> PASS: salc (1)
> PASS: salc (2)
> PASS: fninit
> PASS: mov dr with mod bits
> PASS: smsw
> PASS: nopl
> PASS: xadd
> 774 cycles/emulated jump instruction
> 870 cycles/emulated move instruction
> 835 cycles/emulated arithmetic instruction
> 1076 cycles/emulated memory load instruction
> 1155 cycles/emulated memory store instruction
> 1539 cycles/emulated memory RMW instruction
> Return value from qemu: 1
> [root@dhcp-11-118 kvm-unit-tests]# 
> 
> 
> the cycles per instruction for build 287 is smaller then 247, as expected.
> but the return value from qemu both non-zero on both kernel versions,
> 
> 
> # tail -n5 x86/README 
> Legacy notes:
>  The exit status of the binary (and the script) is inconsistent: with
>  qemu-system, after the unit-test is done, the exit status of qemu is 1,
>  different from the 'old style' qemu-kvm, whose exit status in successful
>  completion is 0.
> 
> 
> so I think the value of 1 on 287(the patched kernel) should be OK,
> 
> Bandan,
> 
> could you please confirm ?
> or correct me if I'm wrong.

Yes, this looks good and the results confirm that the bug has been fixed. You are right about the return value, you can safely ignore it. Thank you for the thorough testing!

Bandan

> Thanks,
> Xiaoqing.

Comment 21 Xiaoqing Wei 2015-07-08 05:27:45 UTC
> Yes, this looks good and the results confirm that the bug has been fixed.
> You are right about the return value, you can safely ignore it. Thank you
> for the thorough testing!
> 
> Bandan


thx for the confirm,
moving to VERIFIED.

Comment 23 errata-xmlrpc 2015-11-19 20:01:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-2152.html