Bug 505695
| Summary: | Poor KVM guest performance doing kernel builds (100+% overhead, w/ 8 vcpu and virtio) | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | erikj |
| Component: | qemu | Assignee: | Glauber Costa <gcosta> |
| Status: | CLOSED WORKSFORME | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 11 | CC: | chellwig, dwmw2, gcosta, itamar, jaswinder, markmc, mishu, virt-maint |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2009-08-07 12:46:55 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 498969 | | |
Description
erikj
2009-06-12 23:22:16 UTC
Avi Kivity has suggested re-testing with Nehalem CPUs, as there are known lock contention issues for build-style loads. I'll be trying to locate a system to run some tests in that scenario.

Still bad results, even with the 5500 processor series on a multi-socket system. However, a series of problems with Fedora relating to this type of processor in a multi-core socket rendered the results somewhat suspect. Most of the issues in the bullets below have BZs already.

Date: Thu, 18 Jun 2009 18:07:44 -0500
From: Erik Jacobson <erikj>
To: Avi Kivity <avi>
Cc: Erik Jacobson <erikj>, kvm.org
Subject: Re: slow guest performance with build load, looking for ideas

Hello. I'll top-post since the quoted text is just for reference.

Sorry the follow-up testing took so long. We're very low on 5500/Nehalem resources at the moment and I had to track down lots of stuff before getting to the test.

I ran some tests on a 2-socket, 8-core system. I wasn't pleased with the results for a couple of reasons. One, the issue of it being twice as slow as the host with no guest was still present. However, in trying to make use of this system using Fedora 11, I ran into several issues not directly related to virtualization, so these test runs come with that grain of salt. Example issues:

* Node ordering is not sequential (i.e. /sys/devices/system/node/node0 and node2 exist, but no node1). This caused tools based on libvirt and friends to be unhappy. I worked around this by using qemu-kvm by hand directly. We found an LKML posting to address this issue; I didn't check if it made it in yet.
* All cores show up as being associated with the first node (node0) even though half should be associated with the 2nd node (still researching that some).
* In some of the timing runs on this system, the "real time" reported by the time command was off by 10 to 11 times. Issues were found in the messages file that seemed to relate to this, including HUGE time adjustments by NTP and kernel hrtimer 'interrupt too slow' messages.
This specific problem seems to be intermittent.

* None of the above problems were observed on 8-core/2-socket non-5500/Nehalem systems. Of course, 2-socket non-Nehalem systems do not have multiple nodes listed under /sys.
* I lose access to the resource today but can try to beg and plead again some time next week if folks have ideas to try. Let me know.

So those are the grains of salt. I've found that, when doing the timing by hand instead of using the time command, the build time seems to be around 10 to 12 minutes. I'm not sure how trustworthy the output from the time command is in these trials. In any event, that's still more than double the time for the host alone with no guests.

System: SGI XE270, 8-core, Xeon X5570 (Nehalem), Hyperthreading turned off
Supermicro model: X8DTN
Disk1: root disk 147GB ST3146855SS 15K 16MB cache SAS
Disk2: work area disk 500GB HDS725050KLA360 7200rpm 16MB cache SATA
Distro: Everything Fedora 11 + released updates
Memory: 8 GB in 2048 MB DDR3 1066 MHz 18JSF25672PY-1G1D1 DIMMs

Only Fedora 11 was used (host and guest, where applicable). The first timing weirdness was on an F11 guest with no updates applied. I later applied the updates and the timings seemed to get worse, although I don't trust the values any more. F11 + released updates has these versions:

kernel-2.6.29.4-167.fc11.x86_64
qemu-kvm-0.10.5-2.fc11.x86_64

The test, as before, was simply this for a kernel build (the .config file has plenty of modules configured):

time (make -j12 && make -j12 modules)

host only, no guest, baseline
-----------------------------
trial 1:
real 5m44.823s
user 28m45.725s
sys 5m46.633s

trial 2:
real 5m34.438s
user 28m14.347s
sys 5m41.597s

guest, 8 vcpu, 4096 mem, virtio, no cache param, disk device supplied in full
-----------------------------------------------------------------------------
trial 1:
real 125m5.995s
user 31m23.790s
sys 9m17.602s

trial 2 (changed to 7168 MB memory for the guest):
real 120m48.431s
user 14m38.967s
sys 6m12.437s

That's real strange...
The 'time' command is showing whacked-out results. I then watched a run by hand and counted it at about 10 minutes. However, this third run had the proper time! So whatever the weirdness is, it doesn't happen every time:

real 9m49.802s
user 24m46.009s
sys 8m10.349s

I decided this could be related to ntp running, as I saw this in messages:

Jun 18 16:34:23 localhost ntpd[1916]: time reset -0.229209 s
Jun 18 16:34:23 localhost ntpd[1916]: kernel time sync status change 0001
Jun 18 16:40:17 localhost ntpd[1916]: synchronized to 128.162.244.1, stratum 2

and earlier:

Jun 18 16:19:09 localhost ntpd[1916]: synchronized to 128.162.244.1, stratum 2
Jun 18 16:19:09 localhost ntpd[1916]: time reset +6609.851122 s
Jun 18 16:23:39 localhost ntpd[1916]: synchronized to 128.162.244.1, stratum 2
Jun 18 16:24:04 localhost kernel: hrtimer: interrupt too slow, forcing clock min delta to 62725995 ns

I then installed all F11 updates in the guest and tried again (the host had updates all along). I got these strange results, strange because of the timing difference. I didn't "watch a non-computer clock" for these. Timing from that was:

trial 1:
real 16m10.337s
user 28m27.604s
sys 9m12.772s

trial 2:
real 11m45.934s
user 25m4.432s
sys 8m2.189s

Here is the qemu-kvm command line used. The -m for the first run was 4096; it was 7168 for the other runs.

# /usr/bin/qemu-kvm -M pc -m 4096 -smp 8 -name f11-test -uuid b7b4b7e4-9c07-22aa-0c95-d5c8a24176c5 -monitor pty -pidfile /var/run/libvirt/qemu//f11-test.pid -drive file=/foo/f11/Fedora-11-x86_64-DVD.iso,if=virtio,media=cdrom,index=2 -drive file=/var/lib/libvirt/images/f11-test.img,if=virtio,index=0,boot=on -drive file=/dev/sdb,if=virtio,index=1 -net nic,macaddr=54:52:00:46:48:0e,model=virtio -net user -serial pty -parallel none -usb -usbdevice tablet -vnc cct201:1 -soundhw es1370 -redir tcp:5555::22

/proc/cpuinfo is pasted after the test results.
# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
stepping        : 5
cpu MHz         : 1600.000
cache size      : 8192 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 lahf_lm ida tpr_shadow vnmi flexpriority ept vpid
bogomips        : 5865.69
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

[processors 1-7 report the same fields and flags, differing only in processor number, core id, apicid (2, 4, 6, 16, 18, 20, 22), and bogomips (5823.99-5865.80); processors 0-3 report physical id 0 and processors 4-7 report physical id 1, matching the two sockets]

On Sun, Jun 14, 2009 at 12:33:06PM +0300, Avi Kivity wrote:
> Erik Jacobson wrote:
>> We have been trying to test qemu-kvm virtual machines under an IO load.
>> The IO load is quite simple: a timed build of the Linux kernel and modules.
>> I have found that virtual machines take more than twice as long to do this
>> build as the host. It doesn't seem to matter if I use virtio or not. Using
>> the same device and same filesystem, the host is more than twice as fast.
>>
>> We're hoping that we can get some advice on how to address this issue. If
>> there are any options I should add for our testing, we'd appreciate it. I'm
>> also game to try development bits to see if they make a difference. If it
>> turns out "that is just the way it is right now", we'd like to know that
>> too.
>>
>> For these tests, I used Fedora 11 as the virtualization server. I did this
>> because it has recent bits. I experimented with SLES11 and Fedora 11 guests.
>>
>> In general, I used virt-manager to do the setup and launching. So the
>> qemu-kvm command lines are based on that (and this explains why they are
>> a bit long). I then modified the qemu-kvm command line to perform other
>> variations of the test. Example command lines can be found at the end of
>> this message.
>>
>> I performed tests on two different systems to be sure it isn't related to
>> specific hardware.
>
> What is the host CPU type? On pre-Nehalem/Barcelona processors kvm has
> poor scalability in mmu-intensive workloads like kernel builds.
>
> --
> error compiling committee.c: too many arguments to function

--
Erik Jacobson - Linux System Software - SGI - Eagan, Minnesota

2.6.30-git14 on the F11 host, 2.6.29.4-167.fc11 on the guest... No improvements. FYI.
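Given how suspect the time(1) output above was, a quick independent cross-check is possible (this is a sketch of my own, not part of the original report; `sleep 2` stands in for the actual kernel build):

```shell
# Bracket the workload with an independent wall clock so a bogus
# `real` figure from time(1) stands out immediately.
start=$(date +%s)
time sleep 2       # stand-in here for: make -j12 && make -j12 modules
end=$(date +%s)
wall=$((end - start))
echo "independent wall clock: ${wall}s"
# Inside the guest, also check which clocksource the kernel picked;
# on KVM guests, kvm-clock is the expected answer.
cat /sys/devices/system/clocksource/clocksource0/current_clocksource 2>/dev/null
```

If the `date`-based figure and time(1)'s `real` disagree by an order of magnitude, as in the 125-minute runs above, the guest clock itself (NTP steps, hrtimer trouble) is the suspect rather than the workload.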
Thanks for all the data Erik - IMHO, the most useful bits so far are:

fedora11 host, no guest (baseline)
-----------------------
-> real 10m38.116s

fedora11 host, fedora11 guest
-----------------------------
virtio devices, device fully imported to guest for workarea, cache=none
-> real 23m28.397s

Best to keep plugging away on kvm@vger, seeing if we can gather more data to help us identify where the bottlenecks are. (Note: if you're testing 2.6.30 kernels from rawhide, they have lots of debugging configured, so they're probably not too useful for performance comparisons.)

Erik, as in bug #509383, please try virtio-blk in rotational mode.

Same test system/details (but some newer F11 updates since the last post). I'd say enabling block queue rotation made a difference, but not nearly as big a difference as it did with the mkfs.ext3 operations. There were other suggestions in the KVM thread that I haven't attempted yet.

Timing with the rotational stuff set to 1...

real 14m13.015s
user 29m42.162s
sys 8m37.416s

To confirm this was really better, I halted the virtual machine and restarted it without setting the rotational values to 1. I got this timing:

real 16m50.829s
user 29m33.933s
sys 9m4.905s

And finally, to confirm the numbers on the host with no guest running...
The same disk/filesystem, now mounted on the host instead of the guest, gave this timing:

real 6m13.398s
user 26m56.061s
sys 5m34.477s

qemu-kvm command line, both guest runs:

/usr/bin/qemu-kvm -M pc -m 4096 -smp 8 -name f11-test -uuid b7b4b7e4-9c07-22aa-0c95-d5c8a24176c5 -monitor pty -pidfile /var/run/libvirt/qemu//f11-test.pid -drive file=/var/lib/libvirt/images/f11-test.img,if=virtio,index=0,boot=on -drive file=/dev/sdb,if=virtio,index=1 -drive file=/var/lib/libvirt/images/test.img,if=virtio,index=2 -net nic,macaddr=54:52:00:46:48:0e,model=virtio -net user -serial pty -parallel none -usb -usbdevice tablet -vnc cct201:1 -soundhw es1370 -redir tcp:5555::22

Expanding on the above, here is an email I sent to the kvm list with some other adjustments that made a difference.

Date: Thu, 9 Jul 2009 13:01:10 -0500
From: Erik Jacobson <erikj>
To: Avi Kivity <avi>
Cc: Erik Jacobson <erikj>, Mark McLoughlin <markmc>, kvm.org, Jes Sorensen <jes>
Subject: Re: slow guest performance with build load, looking for ideas

>> Timing with the rotational stuff set to 1...
>>
>> real 14m13.015s
>> user 29m42.162s
>> sys 8m37.416s
>
> (user + sys) / real = 2.7
>
>> And finally, to confirm the numbers on the host with no guest running...
>> The same disk/filesystem, now mounted on the host instead of the guest, gave
>> this timing:
>>
>> real 6m13.398s
>> user 26m56.061s
>> sys 5m34.477s
>
> (user + sys) / real = 5.2
>
> I got 6.something in a guest!
>
> Please drop -usbdevice tablet and set the host I/O scheduler to
> deadline. Add cache=none to the -drive options.

Yes, these changes make a difference.
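The (user + sys) / real figure Avi quotes is a quick CPU-utilization check: a value near the CPU count means the build is compute-bound, while a value well below it means time is lost waiting (on I/O, locks, or a broken clock). As a sketch, recomputing it for the rotational-mode guest run above:

```shell
# (user+sys)/real for the rotational-mode guest run reported earlier:
# real 14m13.015s, user 29m42.162s, sys 8m37.416s
ratio=$(awk 'BEGIN {
    real = 14*60 + 13.015
    user = 29*60 + 42.162
    sys  =  8*60 + 37.416
    printf "%.1f", (user + sys) / real
}')
echo "utilization ratio: $ratio"   # prints 2.7, matching the figure in the email
```

The host baseline run gives 5.2 by the same arithmetic, so with 8 cores the guest was leaving well over half the machine idle.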
Before starting qemu-kvm, I did this to change the IO scheduler:

BEFORE:
# for f in /sys/block/sd*/queue/scheduler; do cat $f; done
noop anticipatory deadline [cfq]
noop anticipatory deadline [cfq]

SET:
# for f in /sys/block/sd*/queue/scheduler; do echo "deadline" > $f; done

CONFIRM:
# for f in /sys/block/sd*/queue/scheduler; do cat $f; done
noop anticipatory [deadline] cfq
noop anticipatory [deadline] cfq

qemu command line. Note that the USB tablet is off and cache=none is used in the drive options:

/usr/bin/qemu-kvm -M pc -m 4096 -smp 8 -name f11-test -uuid b7b4b7e4-9c07-22aa-0c95-d5c8a24176c5 -monitor pty -pidfile /var/run/libvirt/qemu//f11-test.pid -drive file=/var/lib/libvirt/images/f11-test.img,if=virtio,index=0,boot=on,cache=none -drive file=/dev/sdb,if=virtio,index=1,cache=none -drive file=/var/lib/libvirt/images/test.img,if=virtio,index=2,cache=none -net nic,macaddr=54:52:00:46:48:0e,model=virtio -net user -serial pty -parallel none -usb -vnc cct201:1 -soundhw es1370 -redir tcp:5555::22

Rotation was enabled this way in the guest, once the guest was started:

for f in /sys/block/vd*/queue/rotational; do echo 1 > $f; done

Test runs after make clean...

time (make -j12 && make -j12 modules)

real 10m25.585s
user 26m36.450s
sys 8m14.776s

2nd trial (make clean followed by the same test again):

real 9m21.626s
user 26m42.144s
sys 8m14.532s

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Okay, I think the summary of all this is some tuning recommendations:

1) Put virtio-blk devices in rotational mode:
   for f in /sys/block/vd*/queue/rotational; do echo 1 > $f; done
2) Use -drive cache=none
3) Set the host I/O scheduler to deadline:
   for f in /sys/block/sd*/queue/scheduler; do echo "deadline" > $f; done

I'm going to close this as WORKSFORME, since there's not much we can do in Fedora apart from recommending people use these settings.
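The recommendations above can be rolled into one small helper script. This is a sketch, not part of the bug: the `sd*`/`vd*` globs assume the disk layout in this report and should be adjusted, and it only touches queue files it can actually write (i.e. run as root on the right machine).

```shell
#!/bin/sh
# Host side: switch physical-disk queues to the deadline elevator.
# (sd* matches this report's SAS/SATA disks; adjust for your devices.)
for f in /sys/block/sd*/queue/scheduler; do
    [ -w "$f" ] && echo deadline > "$f"
done
# Guest side: mark the virtio disks as rotational.
for f in /sys/block/vd*/queue/rotational; do
    [ -w "$f" ] && echo 1 > "$f"
done
# cache=none cannot be set here at runtime; it goes on the qemu-kvm
# command line itself, e.g.:
#   -drive file=/var/lib/libvirt/images/f11-test.img,if=virtio,cache=none
done_msg="tuning pass complete"
echo "$done_msg"
```

Note the asymmetry: the scheduler change belongs on the host (where the real disks live), while the rotational flag belongs in the guest (on the virtio devices), so in practice the two loops run on different machines.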