Bug 1002621

Summary: qemu-kvm slow Fedora-19 guest
Product: Fedora
Reporter: Heinz Mauelshagen <heinzm>
Component: qemu
Assignee: Fedora Virtualization Maintainers <virt-maint>
Status: CLOSED DEFERRED
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Docs Contact:
Priority: unspecified
Version: 19
CC: amit.shah, berrange, bmr, cfergeau, crobinso, dwmw2, gleb, itamar, mkarg, mst, mtosatti, pbonzini, rjones, scottt.tw, virt-maint
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1026808 (view as bug list)
Environment:
Last Closed: 2013-11-17 20:27:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1026808    
Attachments (all flags: none):
- Fedora-19 guest configuration
- Guest vmstat output every 3s for 2m and kvm_stat output of host
- "trace-cmd report test.dat" ~600ms
- host /proc/interrupts
- guest /proc/interrupts
- guest lspci -vv
- New guest /proc/interrupts after switching to virtio NIC
- host vmstat + kvm_stat
- guest vmstat + /proc/interrupts
- host trace-cmd -b 20000 -e kvm
- guest "lspci -vv" output

Description Heinz Mauelshagen 2013-08-29 15:09:33 UTC
Description of problem:
Running a "make -j25" upstream kernel compilation takes about factor 10 more time in a Fedora-19 guest than on its Fedora-19 host on an Intel Hexacore i7-3930K.
Running 12 processes with busy loops (ie. no IO) saturates the guests VCPUs but
not the hosts coresm which are still idling at ~50%.

Version-Release number of selected component (if applicable):
qemu-1.4.2-7.fc19.x86_64

How reproducible:
Create a Fedora-19 guest on an Intel hexacore with 12 VCPUs pinned to 12 hyperthreads (mapped to 6 cores in one socket), 16G memory for the guest (32G physical RAM on the host), with a qcow2 backing file as the IDE system disk for the guest.

Steps to Reproduce:
1. Make the upstream kernel source tree accessible to the guest
   (NFS or copy to /tmp doesn't matter in my case!)
2. In the kernel source topdir: make distclean; make localmodconfig; make -j 25
3. Check with top in the guest and on the host

Actual results:
The build is slow, with the guest idling at ~80% and the host at ~77%.

Expected results:
Guest VCPUs _and_ host cores all saturated. The build only a single-digit percentage slower than on the host.

Additional info:
See guest config attached.

Comment 1 Heinz Mauelshagen 2013-08-29 15:11:50 UTC
Created attachment 791876 [details]
Fedora-19 guest configuration

Comment 2 Cole Robinson 2013-08-31 14:23:02 UTC
Thanks for the report. The main missing bit is that your guest isn't using virtio. If you are creating a VM with virt-install or virt-manager, make sure you specify OS as Fedora 19 so you get the optimal defaults.

Use virt-manager to change the disk bus to virtio and the network model to virtio.
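
For reference, here is roughly what those changes end up looking like in the libvirt domain XML ("virsh edit <guest>"); this is just a sketch, and the device and source details are placeholders:

  <disk type='file' device='disk'>
    <driver name='qemu' type='qcow2'/>
    <source file='/path/to/guest.qcow2'/>
    <target dev='vda' bus='virtio'/>      <!-- was dev='hda' bus='ide' -->
  </disk>
  <interface type='network'>
    <source network='default'/>
    <model type='virtio'/>                <!-- was an emulated NIC model -->
  </interface>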

Most people aren't assigning 12 VCPUs to their VM and doing 25-way make jobs, so you'll likely benefit from more tuning; see: http://www.redhat.com/summit/2011/presentations/summit/decoding_the_code/wednesday/wagner_w_420_kvm_performance_improvements_and_optimizations.pdf

Please reopen if you're still seeing largely degraded performance after making those tweaks.

Comment 3 Heinz Mauelshagen 2013-09-02 17:39:21 UTC
(In reply to Cole Robinson from comment #2)
> Thanks for the report. The main missing bit is that your guest isn't using
> virtio. If you are creating a VM with virt-install or virt-manager, make
> sure you specify OS as Fedora 19 so you get the optimal defaults.

OS variant is set to fedora19.
I changed the aforementioned main missing bit (both disk and network to virtio), even though I was wondering how it could help an upstream kernel build (make -j 25), given that building in tmpfs doesn't help and the build tool chain is cached anyway.

None of those changes helped: according to top, my guest is still idling at ~80% as before and the host at > 70%, so the kernel is not building any faster than before.

> 
> Use virt-manager to change the disk bus to virtio and the network model to
> virtio.

Done, see above.

> 
> Most people aren't assigning 12 VCPUs to their VM and doing 25 way make
> jobs, so you'll likely benefit for more tuning. see:
> http://www.redhat.com/summit/2011/presentations/summit/decoding_the_code/
> wednesday/wagner_w_420_kvm_performance_improvements_and_optimizations.pdf

I read that, and the only helpful item seemed to be huge pages, which I tried to no avail. Maybe there's more?
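
For anyone following along, a rough sketch of hugepage backing for a guest; the page count assumes 2 MiB pages for a 16 GiB guest and may need adjusting:

  # on the host, reserve 8192 * 2 MiB = 16 GiB of hugepages and verify:
  echo 8192 > /proc/sys/vm/nr_hugepages
  grep Huge /proc/meminfo

  and in the domain XML (virsh edit <guest>):

  <memoryBacking>
    <hugepages/>
  </memoryBacking>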

> 
> Please reopen if you still seeing largely degraded performance after making
> those tweaks.

Reopened for further advice.

Comment 4 Cole Robinson 2013-09-02 20:20:10 UTC
Not really sure how to go about debugging this TBH. CCing Paolo, Gleb, Marcelo; please chime in if you have ideas.

Comment 5 Heinz Mauelshagen 2013-09-02 21:50:19 UTC
What perf and/or other output would be useful?

Comment 6 Gleb Natapov 2013-09-09 07:05:09 UTC
What is your command line? If the guest CPU is idling, it means IO is going on (assuming there is work to do, of course), and I would expect IO during compilation. Another thing to note is that a guest with 12 VCPUs is not equivalent to a host with 12 hyperthreads. A hyperthread is not a real CPU; the host kernel knows that, but the guest's does not, so unless you provide the correct topology to the guest and do your pinning correctly, the guest is at a disadvantage. And of course you have twice as much memory on the host; check how much memory the compilation takes.
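
A sketch of what "correct topology plus pinning" means in the libvirt XML for a 1-socket / 6-core / 2-thread box; the host CPU numbering below is illustrative and machine-specific:

  <vcpu placement='static'>12</vcpu>
  <cpu mode='host-model'>
    <topology sockets='1' cores='6' threads='2'/>
  </cpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='6'/>   <!-- hyperthread sibling of CPU 0 on many Intel boxes -->
    <vcpupin vcpu='2' cpuset='1'/>
    <vcpupin vcpu='3' cpuset='7'/>
    <!-- ... and so on for the remaining VCPUs -->
  </cputune>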

Comment 7 Heinz Mauelshagen 2013-09-09 11:28:32 UTC
(In reply to Gleb Natapov from comment #6)
> What is your command line? If guest cpu is idling it means IO is going on
> (asumming there is work to do of course) and I would expect IO during
> compilation.

Sure, it does IO. _But_ on the host it doesn't idle any cores while building an upstream kernel.
In the guest it drastically does, which is $subject.

> Another thing to note is that guest with 12 vcpus is not
> equivalent to host with 12 hyperthreads. Hyperthread is not real cpu, host
> kernel knows that, guest's does not, so unless you provide correct topology
> to a guest and do your pinning correctly the guest is in disadvantage. And
> of course you have twice as much memory in a host, check how much memory
> compilation takes.

To install?

virt-install --name RHEL-7.0-20130909.n.0 --ram 4096 --vcpus 4 --cpu host --disk path=/home/mauelsha/.virt/disks/RHEL-7.0-20130909.n.0.qcow2,bus=virtio,format=qcow2 --location http://download.devel.redhat.com/nightly/RHEL-7.0-20130909.n.0/compose/Server/x86_64/os --os-type=linux --description RHEL 7.0-20130909.n.0 Test System --extra-args=ks=ftp://192.168.1.10/ks-RHEL-7.0-20130909.n.0.cfg --force


To build the upstream kernel?

make -j 25


VCPUs are pairwise pinned to host CPU cores.

RSS on build is < 2GiB RAM, so it even compiles way faster on a 4 core/4 GiB RAM physical machine.

Comment 8 Gleb Natapov 2013-09-09 11:46:37 UTC
(In reply to Heinz Mauelshagen from comment #7)
> (In reply to Gleb Natapov from comment #6)
> > What is your command line? If guest cpu is idling it means IO is going on
> > (asumming there is work to do of course) and I would expect IO during
> > compilation.
> 
> Sure it does IO. _But_ on the host it doesn't idle any cores whilts building
> an upstream kernel.
> In the guest it drastically does, which is $subject.
IO is much slower in a guest, but you still should not see a 10x slowdown. Can you try with the raw file format?

> 
> > Another thing to note is that guest with 12 vcpus is not
> > equivalent to host with 12 hyperthreads. Hyperthread is not real cpu, host
> > kernel knows that, guest's does not, so unless you provide correct topology
> > to a guest and do your pinning correctly the guest is in disadvantage. And
> > of course you have twice as much memory in a host, check how much memory
> > compilation takes.
> 
> To install?
I need the qemu command line.

> 
> 
> VCPUs are pairwise pinned to host CPU cores.
This means nothing if you do not specify the correct topology to the guest.

Comment 9 Heinz Mauelshagen 2013-09-09 11:49:24 UTC
(In reply to Gleb Natapov from comment #8)
> (In reply to Heinz Mauelshagen from comment #7)
> > (In reply to Gleb Natapov from comment #6)
> > > What is your command line? If guest cpu is idling it means IO is going on
> > > (asumming there is work to do of course) and I would expect IO during
> > > compilation.
> > 
> > Sure it does IO. _But_ on the host it doesn't idle any cores whilts building
> > an upstream kernel.
> > In the guest it drastically does, which is $subject.
> IO is much slower in a guest. You still should not see 10 times slow down.
> Can you try with raw file format?

It's _not_ the IO.

With the working set in tmpfs, it's just as slow.

> 
> > 
> > > Another thing to note is that guest with 12 vcpus is not
> > > equivalent to host with 12 hyperthreads. Hyperthread is not real cpu, host
> > > kernel knows that, guest's does not, so unless you provide correct topology
> > > to a guest and do your pinning correctly the guest is in disadvantage. And
> > > of course you have twice as much memory in a host, check how much memory
> > > compilation takes.
> > 
> > To install?
> I need qemu comand line.

/usr/bin/qemu-system-x86_64 -machine accel=kvm -name RHEL-7.0-20130909.n.0 -S -machine pc-i440fx-1.4,accel=kvm,usb=off -cpu SandyBridge,+pdpe1gb,+osxsave,+dca,+pcid,+pdcm,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -m 4096 -smp 12,sockets=12,cores=1,threads=1 -uuid 9c35660e-e05b-5b43-60a7-a6b80b2aa38a -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/RHEL-7.0-20130909.n.0.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/home/mauelsha/.virt/disks/RHEL-7.0-20130909.n.0.qcow2,if=none,id=drive-virtio-disk0,format=qcow2 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=24,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:13:d4:98,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

> 
> > 
> > 
> > VCPUs are pairwise pinned to host CPU cores.
> This means nothing if you do not specify correct topology to a guest.

Comment 10 Gleb Natapov 2013-09-09 12:16:32 UTC
(In reply to Heinz Mauelshagen from comment #9)
> (In reply to Gleb Natapov from comment #8)
> > (In reply to Heinz Mauelshagen from comment #7)
> > > (In reply to Gleb Natapov from comment #6)
> > > > What is your command line? If guest cpu is idling it means IO is going on
> > > > (asumming there is work to do of course) and I would expect IO during
> > > > compilation.
> > > 
> > > Sure it does IO. _But_ on the host it doesn't idle any cores whilts building
> > > an upstream kernel.
> > > In the guest it drastically does, which is $subject.
> > IO is much slower in a guest. You still should not see 10 times slow down.
> > Can you try with raw file format?
> 
> It's _not_ the IO.
Please run vmstat in the guest and kvm_stat on the host and provide the output here.

Comment 11 Heinz Mauelshagen 2013-09-09 12:36:25 UTC
Created attachment 795602 [details]
Guest vmstat output every 3s for 2m and kvm_stat output of host

As requested in comment #10

Comment 12 Gleb Natapov 2013-09-09 12:45:49 UTC
The number of interrupts is far too high. Can you trace it according to the instructions here: http://www.linux-kvm.org/page/Tracing? Also, what is the output of "cat /proc/interrupts"?
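
The recipe from that page boils down to roughly the following (matching the trace-cmd invocation used later in this bug); this is a sketch, adjust the recording window as needed:

  # on the host, while the guest is busy building:
  trace-cmd record -b 20000 -e kvm -o test.dat sleep 30
  trace-cmd report test.dat > trace.txt
  # in the guest:
  cat /proc/interrupts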

Comment 13 Heinz Mauelshagen 2013-09-09 13:38:40 UTC
Created attachment 795617 [details]
"trace-cmd report test.dat"  ~600ms

As requested in comment #12.

Comment 14 Heinz Mauelshagen 2013-09-09 14:12:58 UTC
Created attachment 795638 [details]
host /proc/interrupts

As requested in comment #12

Comment 15 Gleb Natapov 2013-09-09 15:03:50 UTC
(In reply to Heinz Mauelshagen from comment #14)
> Created attachment 795638 [details]
> host /proc/interrupts
> 
> As requested in comment #12

Forgot to mention that this should be taken in the guest.

Comment 16 Gleb Natapov 2013-09-09 15:10:48 UTC
(In reply to Gleb Natapov from comment #15)
> (In reply to Heinz Mauelshagen from comment #14)
> > Created attachment 795638 [details]
> > host /proc/interrupts
> > 
> > As requested in comment #12
> 
> Forget to mention that this should be taken in the guest.

Also "lspci -vv" from the guest.

Comment 17 Heinz Mauelshagen 2013-09-09 15:21:05 UTC
Created attachment 795664 [details]
guest /proc/interrupts

As requested in comment #16

Comment 18 Heinz Mauelshagen 2013-09-09 15:21:43 UTC
Created attachment 795665 [details]
guest lspci -vv

As requested in comment #16

Comment 19 Gleb Natapov 2013-09-09 16:05:03 UTC
(In reply to Heinz Mauelshagen from comment #18)
> Created attachment 795665 [details]
> guest lspci -vv
> 
> As requested in comment #16

Is the command line you provided in comment #9 the complete one? You have many more devices here than are specified on the command line. The LSI Logic / Symbios Logic 53c895a, for instance: there are ~1500 accesses to it in the trace, and it seems the guest doesn't even have a driver for it.

Second, why are you using an rtl8139 NIC instead of virtio? In comment #3 you said you changed to virtio, but you didn't. Are you running make from an ssh session? If yes, redirect all output to a file; your NIC generates a huge number of interrupts.

Comment 20 Heinz Mauelshagen 2013-09-09 16:27:29 UTC
(In reply to Gleb Natapov from comment #19)
> (In reply to Heinz Mauelshagen from comment #18)
> > Created attachment 795665 [details]
> > guest lspci -vv
> > 
> > As requested in comment #16
> 
> Is the command line you provided in comment #9 complte one?

Hmm, it was selected from "ps aux", so that likely cut it off...

> You have much
> more devices here then specified on the command line. LSI Logic / Symbios
> Logic 53c895a for instance, there are ~1500 accessed to it in the trace and
> it seams the guest doesn't even have a driver for it.

A result of the cutoff.

> 
> Second why are you using rlt nic instead of virtio?


I meanwhile created a new guest to see if that helped the issue,
and the virtio NIC got lost in the transition. Well spotted.

Changed to virtio again. No difference WRT the slowness of the guest.

Rescheduling interrupts on the host are extremely high, as before.

>  In the comment #3 you've
> said you changed to virtio, but you didn't. Are you running make from ssh
> session? If yes redirect all output to a file, your NIC generates huge
> amount of interrupts.

No, make runs from console with stdout redirected to /dev/null.

Comment 21 Gleb Natapov 2013-09-09 16:46:30 UTC
(In reply to Heinz Mauelshagen from comment #20)
> (In reply to Gleb Natapov from comment #19)
> > (In reply to Heinz Mauelshagen from comment #18)
> > > Created attachment 795665 [details]
> > > guest lspci -vv
> > > 
> > > As requested in comment #16
> > 
> > Is the command line you provided in comment #9 complte one?
> 
> Hmm, selected from "ps aux" so that likely cut it off...
> 
> > You have much
> > more devices here then specified on the command line. LSI Logic / Symbios
> > Logic 53c895a for instance, there are ~1500 accessed to it in the trace and
> > it seams the guest doesn't even have a driver for it.
> 
> A result of the cutoff.
So what is the full command line?

> 
> > 
> > Second why are you using rlt nic instead of virtio?
> 
> 
> I meanwhile created a new guest in order to try if that helped the issue
> and virtio NIC got lost in transition. Well spotted.
> 
> Changed to virio again. No difference WRT slowness of the guest.
> 
> Rescheduling interrupts on the host extremely high as before.
> 
Are you oversubscribing the host?

> >  In the comment #3 you've
> > said you changed to virtio, but you didn't. Are you running make from ssh
> > session? If yes redirect all output to a file, your NIC generates huge
> > amount of interrupts.
> 
> No, make runs from console with stdout redirected to /dev/null.
How does /proc/interrupts look with virtio? Now USB and the NIC should use different vectors, so it will be possible to tell which one generates the interrupts.
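
A quick way to see which vector is actually climbing is to diff two samples of /proc/interrupts taken a few seconds apart in the guest, e.g.:

  cat /proc/interrupts > /tmp/irq.before
  sleep 10
  cat /proc/interrupts > /tmp/irq.after
  diff /tmp/irq.before /tmp/irq.after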

Comment 22 Heinz Mauelshagen 2013-09-09 16:58:41 UTC
(In reply to Gleb Natapov from comment #21)
> (In reply to Heinz Mauelshagen from comment #20)
> > (In reply to Gleb Natapov from comment #19)
> > > (In reply to Heinz Mauelshagen from comment #18)
> > > > Created attachment 795665 [details]
> > > > guest lspci -vv
> > > > 
> > > > As requested in comment #16
> > > 
> > > Is the command line you provided in comment #9 complte one?
> > 
> > Hmm, selected from "ps aux" so that likely cut it off...
> > 
> > > You have much
> > > more devices here then specified on the command line. LSI Logic / Symbios
> > > Logic 53c895a for instance, there are ~1500 accessed to it in the trace and
> > > it seams the guest doesn't even have a driver for it.
> > 
> > A result of the cutoff.
> So what is the full command line?
> 
> > 
> > > 
> > > Second why are you using rlt nic instead of virtio?
> > 
> > 
> > I meanwhile created a new guest in order to try if that helped the issue
> > and virtio NIC got lost in transition. Well spotted.
> > 
> > Changed to virio again. No difference WRT slowness of the guest.
> > 
> > Rescheduling interrupts on the host extremely high as before.
> > 
> Are you oversubscribing the host?

Intel Hexacore with HT (1 socket, 6 cores, 12 threads).
Max 12 VCPUs tested.
Even with 4 VCPUs, the build in the guest is way too slow.

> 
> > >  In the comment #3 you've
> > > said you changed to virtio, but you didn't. Are you running make from ssh
> > > session? If yes redirect all output to a file, your NIC generates huge
> > > amount of interrupts.
> > 
> > No, make runs from console with stdout redirected to /dev/null.
> How /proc/interrupts looks with virtio? Now usb and nic should use different
> vectors, so it will be possible to say which one generates interrupts.

Please see below.

Comment 23 Heinz Mauelshagen 2013-09-09 17:01:42 UTC
Created attachment 795710 [details]
New guest /proc/interrupts after switching to virtio NIC

As requested in comment #22

Comment 24 Gleb Natapov 2013-09-09 17:55:34 UTC
(In reply to Heinz Mauelshagen from comment #23)
> Created attachment 795710 [details]
> New guest /proc/interrupts after switching to virtio NIC
> 
> As requested in comment #22

If you do not use the network, can you bring it down and try again? You have a lot of incoming traffic.

If it is still slow, do the tracing again and attach the .dat file. Provide "lspci -vv" and the full command line for the new guest.

Comment 25 Gleb Natapov 2013-09-09 17:57:45 UTC
(In reply to Gleb Natapov from comment #24)
> (In reply to Heinz Mauelshagen from comment #23)
> > Created attachment 795710 [details]
> > New guest /proc/interrupts after switching to virtio NIC
> > 
> > As requested in comment #22
> 
> If you do not use network can you bring it down and try again, you have a
> lot of incomming trafic.
I mean do it in the guest, not on the host.

Comment 26 Heinz Mauelshagen 2013-09-10 08:55:16 UTC
(In reply to Gleb Natapov from comment #25)
> (In reply to Gleb Natapov from comment #24)
> > (In reply to Heinz Mauelshagen from comment #23)
> > > Created attachment 795710 [details]
> > > New guest /proc/interrupts after switching to virtio NIC
> > > 
> > > As requested in comment #22
> > 
> > If you do not use network can you bring it down and try again, you have a
> > lot of incomming trafic.
> Do it in a guest I mean, not host.

I actually do use the network to access my kernel source tree via NFS on the host.

Bringing the NIC down saves me only the ~30s in the build that I see when building from kernel source in /tmp vs. NFS; i.e. it saves essentially nothing.

The best "make -j 21" result I achieved building in a guest is 2m30s (kernel source in /tmp) vs. 1m1s (kernel source on ext4) on the host for the very same upstream kernel build, which is still too slow.

Results building in the guest in /tmp are obviously the same with the NIC up or down.


Overview of the results of "make -j21" of my upstream kernel:

host/ext4 -> 1m1s
guest/tmpfs -> 2m30s (with NIC up or down)
guest/nfs -> ~3m.


Please advise whether a trace still helps based on these results.

Comment 27 Gleb Natapov 2013-09-10 11:43:13 UTC
> Overview of the results of "make -j21" of my upstream kernel:
> 
> host/ext4 -> 1m1s
> guest/tmpfs -> 2m30s (with NIC up or down)
OK, so we are not talking a factor of 10 here. Is this the first compilation after guest start, or have you done a warm-up build?

> 
> Please advice, if a trace still helps based on these reults.
Yes, I need everything I asked for: the full command line, a trace while building with the network disabled in the guest (I see a lot of network interrupts and they will make the trace less readable), kvm_stat output while building, top in the guest and on the host while building, and "lspci -vv" from the guest (needed to interpret the trace).

Do you have KSM enabled on the host?

Comment 28 Heinz Mauelshagen 2013-09-10 12:05:35 UTC
The one-minute build included sparse; without it the host build time goes down to 30s, which is a factor of 6, but yes, that's not 10 ;-)

All builds are warm.

qemu command line:
/usr/bin/qemu-system-x86_64 -machine accel=kvm -name RHEL-7.0-20130909.n.0-clone -S -machine pc-i440fx-1.4,accel=kvm,usb=off -cpu SandyBridge,+pdpe1gb,+osxsave,+dca,+pcid,+pdcm,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -m 16384 -smp 10,maxcpus=12,sockets=12,cores=1,threads=1 -uuid 3bf7f3da-8b6f-eb49-6086-f63952b29ed1 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/RHEL-7.0-20130909.n.0-clone.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/home/mauelsha/.virt/disks/RHEL-7.0-20130909.n.0-clone.qcow2,if=none,id=drive-virtio-disk0,format=qcow2 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=24,id=hostnet0,vhost=on,vhostfd=25 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:19:af:71,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

trace etc. following.

Yes, KSM is enabled on Fedora 19 host:

$ uname -r
3.10.10-200.fc19.x86_64
$ grep -i ksm config-3.10.10-200.fc19.x86_64
CONFIG_KSM=y

Comment 29 Gleb Natapov 2013-09-10 12:10:28 UTC
(In reply to Heinz Mauelshagen from comment #28)
> The one minute included sparse, w/o it the host build time goes down to 30s,
> which is factor 6, but yes, that's not 10 ;-)
The guest build does not?

> Yes, KSM is enabled on Fedora 19 host:
> 
> $ uname -r
> 3.10.10-200.fc19.x86_64
> $ grep -i ksm config-3.10.10-200.fc19.x86_64
> CONFIG_KSM=y
It is compiled in, but not necessarily enabled. Can you check in /sys/kernel/mm/ksm?

Comment 30 Heinz Mauelshagen 2013-09-10 12:56:38 UTC
It is; I checked, /sys/kernel/mm/ksm/run = 1.

Comment 31 Heinz Mauelshagen 2013-09-10 13:01:19 UTC
Created attachment 795994 [details]
host vmstat + kvm_stat

As requested in comment #27

Comment 32 Heinz Mauelshagen 2013-09-10 13:03:02 UTC
Created attachment 795995 [details]
guest vmstat + /proc/interrupts

"lspci -vv" -> empty (all virtio)

Comment 33 Heinz Mauelshagen 2013-09-10 13:07:25 UTC
Created attachment 795997 [details]
host trace-cmd -b 20000 -e kvm

First ~200K lines of trace-cmd report as requested in comment #27.

Comment 34 Gleb Natapov 2013-09-10 13:14:26 UTC
(In reply to Heinz Mauelshagen from comment #30)
> It is, checked /sys/kernel/mm/kvm/run = 1

Disable it and retry. This is not a performance setting; it is for VM density.
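
To disable it for the test, something along these lines on the host should do; the ksm/ksmtuned services are only present if the corresponding packages are installed:

  cat /sys/kernel/mm/ksm/run        # 1 means KSM is actively scanning
  echo 0 > /sys/kernel/mm/ksm/run   # stop scanning for this test
  systemctl stop ksmtuned ksm       # keep ksmtuned from turning it back on, if installed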

Comment 35 Gleb Natapov 2013-09-10 13:17:15 UTC
(In reply to Heinz Mauelshagen from comment #32)
> Created attachment 795995 [details]
> guest vmstat + /proc/interrupts
> 
> "lspci -vv" -> empty (all virtio)

Virtio devices are all PCI (look at the lspci output from comment #18), and there are always other PCI devices besides virtio, so the output should not be empty.

Comment 36 Heinz Mauelshagen 2013-09-10 14:00:39 UTC
What could be the reason for it not listing anything, then?

Comment 37 Heinz Mauelshagen 2013-09-10 14:04:58 UTC
(In reply to Gleb Natapov from comment #34)
> (In reply to Heinz Mauelshagen from comment #30)
> > It is, checked /sys/kernel/mm/kvm/run = 1
> 
> Disable and retry. This is not performance setting. It is for VM density.

No visible difference performance-wise.

Comment 38 Gleb Natapov 2013-09-10 14:12:02 UTC
(In reply to Heinz Mauelshagen from comment #36)
> What can be the reason for it not to list anything then?

Something is wrong with your install; your guess is as good as mine :) Is this the same guest as in comment #18? Is /sys mounted? strace lspci and see what it is doing.

Comment 39 Heinz Mauelshagen 2013-09-10 14:37:40 UTC
(In reply to Gleb Natapov from comment #38)
> (In reply to Heinz Mauelshagen from comment #36)
> > What can be the reason for it not to list anything then?
> 
> Something wrong with your install. Your guess is as good as mine :) Is this
> the same one as from comment #18?

No, as mentioned in comment #20.

> Is /sys mounted? strace lspci and see what
> it is doing.

/sys mounted.

Odd, had to reinstall pciutils to get it to work!?

"lspci -vv" output follows...

Comment 40 Heinz Mauelshagen 2013-09-10 14:38:29 UTC
Created attachment 796036 [details]
guest "lspci -vv" output

As requested in comment #27

Comment 41 Gleb Natapov 2013-09-11 08:56:47 UTC
Both traces show that all VCPUs are bound to the same PCPU 0. Also, in the last ftrace I see only 10 VCPUs, and checking the command line from comment #28, that is indeed what is configured.
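
If the pinning is the culprit, here is a sketch of spreading the VCPUs over distinct host CPUs with virsh, using the guest name from comment #28; the CPU numbering is illustrative:

  for v in $(seq 0 9); do
      virsh vcpupin RHEL-7.0-20130909.n.0-clone $v $v
  done
  virsh vcpuinfo RHEL-7.0-20130909.n.0-clone   # verify the resulting CPU affinity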

Comment 42 Heinz Mauelshagen 2013-09-12 11:58:27 UTC
(In reply to Gleb Natapov from comment #41)
> Both traces shows that all vcpus are bound to the same pcpu 0. Also in the
> last ftrace I see only 10 vcpus and checking command line from comment #28
> this is indeed what is configured.

Gleb and I had a longer session yesterday trying to get to the root cause of
the performance issue on the given i7-3930K processor based system.

We utilized and tried:

- vmstat on guest+host
- perf on host
- VCPU pinning
- hyperthreading on+off
- without hyperthreading: "make -j6"
- with hyperthreading: "make -j12" and "make -j25"

without getting the kernel build in the guest close to the one for the same working set on the host (warm; the fastest was 2m7s vs. 33s, typically it took ~3m).

So we're still seeing a factor of 6 typically, or 4 at best.

perf shows that the guest is utilizing close to 100% CPU, which is not what you would expect given such performance degradation.

Gleb, please feel free to extend, thanks.

Analysis proceeds...

Comment 43 Heinz Mauelshagen 2013-09-18 10:17:00 UTC
After removing ccache on host and guest at Gleb's request, the host vs. guest performance difference is negligible. Even with CCACHE_DIR=/tmp/ccache (i.e. tmpfs), no performance gain is achievable in the guest.

Analyzing further ccache tweaks.

Comment 44 Heinz Mauelshagen 2013-09-19 15:53:58 UTC
The new evidence is that networking constraints are the core of the issue:

I disabled ccache completely in a test series:

1. on the host, with the kernel build tree on local ext4
2. build tree shared via NFS locally (i.e. NFS client and server both on the host)
3. build tree shared via NFS to the VM (NFS server on the host as before)

The NFSv4 mount options were the same in cases 2 and 3.

Results of a "make clean ; make -j 25":

1. 2m8s
2. 2m38s
3. 6m39s

BTW: putting the workset in tmpfs on the VM: 2m24s (good vs. 1 above)

So the virtio networking (with 64K jumbo frames) seems to be the bottleneck, with a factor of _10_ difference in ops (as per nfsiostat) between case 2 and case 3.

How can the virtio bottleneck be analyzed further and eased?
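
One way to narrow it down further, as a sketch: compare per-op rates and latency for cases 2 and 3 with nfsiostat, and watch the virtio-net interrupt load in the guest while the build runs; /mnt/build stands in for the actual mount point:

  nfsiostat 5 /mnt/build            # ops/s and average RTT per NFS operation
  mount | grep /mnt/build           # confirm rsize/wsize/vers match between the two cases
  grep virtio /proc/interrupts      # in the guest: how fast the virtio-net vectors grow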

Comment 45 Gleb Natapov 2013-09-20 15:43:06 UTC
(In reply to Heinz Mauelshagen from comment #44)
> New evidence is, that networking constraints are the core of the issue:
> 
> I disabled ccache completely in a test series:
> 
> 1. on the host with a kernel build tree on local ext4
> 2. fs shared build tree via NFS locally (ie. nfs client and server on the
> host)
> 3. fs shared build tree via NFS on the VM (NFS server on the host as before)
> 
> The NFSv4 mount options have been the same in case 2 and 3:
> 
> Results of a "make clean ; make -j 25":
> 
> 1. 2m8s
> 2. 2m38s
> 3. 6m39s
> 
> BTW: putting the workset in tmpfs on the VM: 2m24s (good vs. 1 above)
> 
> So the virtio newtworking (with 64K jumbo frames) seems to be the bottleneck
> with a factor _10_ difference in ops as per nfsiostat in case 2 compared to
> 3.
> 
> How can the virtio bottleneck be analyzed further and eased?
I added a virtio networking expert to the BZ. It may be a limitation of virtio + loopback. Can you try the test with a remote NFS server?

Comment 46 Heinz Mauelshagen 2013-09-20 17:04:45 UTC
(In reply to Gleb Natapov from comment #45)
> (In reply to Heinz Mauelshagen from comment #44)
> > New evidence is, that networking constraints are the core of the issue:
> > 
> > I disabled ccache completely in a test series:
> > 
> > 1. on the host with a kernel build tree on local ext4
> > 2. fs shared build tree via NFS locally (ie. nfs client and server on the
> > host)
> > 3. fs shared build tree via NFS on the VM (NFS server on the host as before)
> > 
> > The NFSv4 mount options have been the same in case 2 and 3:
> > 
> > Results of a "make clean ; make -j 25":
> > 
> > 1. 2m8s
> > 2. 2m38s
> > 3. 6m39s
> > 
> > BTW: putting the workset in tmpfs on the VM: 2m24s (good vs. 1 above)
> > 
> > So the virtio newtworking (with 64K jumbo frames) seems to be the bottleneck
> > with a factor _10_ difference in ops as per nfsiostat in case 2 compared to
> > 3.
> > 
> > How can the virtio bottleneck be analyzed further and eased?
> I added virtio networking expert to the BZ. It may be limitation of virtio +
> loopback. Can you try test with remote NFS?

Sure, though I wonder if this makes sense, because the traffic crosses the same virtio network? Eventually I could pass a physical GigE NIC through...

BTW: I've tested with NFSv3 recently, because NFSv4 throttles even worse
(the 6m39s in comment #44 were with NFSv4; see NFSv3 below).

Remote NFS server (v3, same options as the host-local one, host attached via GigE) results in:

(no ccache)    "make clean ; make -j25": 4m4s
(ccache, warm) "make clean ; make -j25": 2m27s

(no ccache on host local NFS server) "make clean ; make -j25": 4m43s

So even with the GigE limitation to the remote NFS server, it's quicker (4m4s) than the local one (4m43s).


Addition:
I did a dummy stream transfer from host to guest via "ssh host "dd if=/dev/zero bs=8192k count=16" > /dev/null", resulting in ~290MB/s, which is not dramatically slower than the same test run on the host against itself, resulting in ~380MB/s.

Comment 47 Gleb Natapov 2013-09-22 06:51:09 UTC
(In reply to Heinz Mauelshagen from comment #46)
>
> Addition:
> I did a dummy stream transfer from host to guest via "ssh host dd
> if=/dev/zero bs=8192k count=16" > /dev/null", resulting in ~290MB/s, which
> is not extremely slower than the same test on the host against self
> resulting in ~380MB/s.
This transfers big chunks, which is much more virtualization-friendly. My guess is that an NFS transaction consists of a lot of small transfers.
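
If that guess is right, a request/response benchmark should show the gap much more clearly than a stream test; a sketch assuming netperf is installed on both sides and <host-ip> is the host's address:

  # on the host:
  netserver
  # in the guest (1-byte request/response for 30s, closer to NFS RPC traffic than a bulk dd):
  netperf -H <host-ip> -t TCP_RR -l 30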

Comment 48 Michael S. Tsirkin 2013-09-22 15:56:46 UTC
This could be the case if power management is being too aggressive.
One quick thing to try is adding idle=poll to the kernel command line on the host.
Does performance get better if you do this?
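
A sketch of trying that on a Fedora 19 host with BIOS grub2 (EFI setups use a different grub.cfg path); remember to revert it afterwards, since idle=poll keeps the CPUs busy-polling:

  # append idle=poll to GRUB_CMDLINE_LINUX in /etc/default/grub, then:
  grub2-mkconfig -o /boot/grub2/grub.cfg
  reboot
  cat /proc/cmdline   # confirm idle=poll is active before re-testing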

Comment 49 Heinz Mauelshagen 2013-09-23 11:34:59 UTC
(In reply to Michael S. Tsirkin from comment #48)
> this could be the case if power magenent being too aggressive
> one  quick thing to try is adding idle=poll on command line in host.
> does performance get better if you do this?

Core temperature and power consumption increased as expected.

With "idle=poll", Fedora 19 VM performance is worse
(with both ccache and buiuld tree on host NFSv3):

                With  / Without
                ---------------
(ccache, cold): 7m34s / 6m52s (variation vs. comment #49 unclear!)
(ccache, warm): 3m44s / 3m20s

Comment 50 Heinz Mauelshagen 2013-09-23 11:37:38 UTC
(In reply to Gleb Natapov from comment #47)
> (In reply to Heinz Mauelshagen from comment #46)
> >
> > Addition:
> > I did a dummy stream transfer from host to guest via "ssh host dd
> > if=/dev/zero bs=8192k count=16" > /dev/null", resulting in ~290MB/s, which
> > is not extremely slower than the same test on the host against self
> > resulting in ~380MB/s.
> This transfers big chunks which is much more virtuslization friendly. My
> guess that NFS transaction consists of a lot of small transfers.

How does that explain that a local NFS server is slower than a remote one over a slower GigE link?

Comment 51 Cole Robinson 2013-10-31 22:35:20 UTC
This has stalled. mst, is there anything left to do here?

Heinz, if you're motivated, it might be interesting to see if you can reproduce this on RHEL 7; filing a bug there will ensure it stays on someone's watch list :)

Comment 52 Heinz Mauelshagen 2013-11-05 12:03:08 UTC
(In reply to Cole Robinson from comment #51)
> This stalled. mst, is there anything left todo here?
> 
> Heinz, if you're motivated, might be interesting to see if you can reproduce
> on RHEL7, filing a bug there will ensure it stays on someones watch list :)

Reproduced in RHEL-7.0-20131018.n.0.

Filing RHEL-7 bug...

Comment 53 Cole Robinson 2013-11-17 20:27:53 UTC
Since there's a RHEL7 bug now where this will be appropriately scoped, there's not much more use in tracking this for Fedora IMO, so closing.