Bug 745405

Summary: macvtap-enabled KVM guests recurrently slow down to extremely low rates (<40 KB/s)
Product: Red Hat Enterprise Linux 6
Component: qemu-kvm
Version: 6.2
Hardware: x86_64
OS: Linux
Status: CLOSED WORKSFORME
Severity: medium
Priority: unspecified
Reporter: Kai Mosebach <redhat-bugzilla>
Assignee: Michael S. Tsirkin <mst>
QA Contact: Virtualization Bugs <virt-bugs>
CC: acathrow, juzhang, michen, mkenneth, rhod, tburke, virt-maint, vyasevic, wquan, xfu
Target Milestone: rc
Doc Type: Bug Fix
Last Closed: 2012-07-21 20:11:58 UTC

Description Kai Mosebach 2011-10-12 09:40:41 UTC
Description of problem:

We run several hundred KVM guests on 12 Dell servers (Dell R710 / Dell R815); the physical network hardware is always a Broadcom NetXtreme II BCM5709 Gigabit NIC connected to Cisco switches.

The KVM host servers are set up with eth0 bridged to a macvhost0 interface, which gets the IP of the KVM host.

Each KVM guest (mostly EL6, some Windows XP/7) gets a macvtap interface of type 'bridge' with source 'eth0' and the virtio driver, e.g. in libvirt:

    <interface type='direct'>
      <mac address='xxxxxxxxx'/>
      <source dev='eth0' mode='bridge'/>
      <target dev='macvtap1'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
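
(For reference, a rough host-side sketch of the macvhost0 arrangement described above; the address is a documentation placeholder, not from this report:)

    # host-facing macvlan in bridge mode on top of eth0, carrying the host's IP
    ip link add link eth0 name macvhost0 type macvlan mode bridge
    ip addr add 192.0.2.10/24 dev macvhost0   # hypothetical host address
    ip link set macvhost0 up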

Now, some of the machines' network connectivity becomes very slow from time to time (not reproducible by hand), dropping to 40 KB/s where they normally reach 40 MB/s. This is true for TCP traffic, but if I run a flood ping, for instance, I do not see any lost packets (even with a payload of >60 KB per ping).
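
(The two checks side by side, as a hedged sketch; <guest-ip> is a placeholder, and iperf stands in for the transfers actually measured:)

    # ICMP: flood ping with a large payload shows no loss (run as root)
    ping -f -s 60000 <guest-ip>
    # TCP: throughput collapses; run 'iperf -s' in the guest first
    iperf -c <guest-ip> -t 30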

A reboot of the VM does not help (a reboot of the KVM host is not possible at the moment).

To fix it:
- migrating the machine to another KVM host usually helps
- sometimes the problem disappears the same way it came along

Version-Release number of selected component (if applicable):

kernels tested :
- 2.6.32-131.17.1.el6.x86_64
- 2.6.32-131.12.1.el6.x86_64
- 2.6.32-131.6.1.el6.x86_64

qemu :

- 0.12.1.2-2.160.el6_1.8
- 0.12.1.2-2.160.el6_1.6

seabios :
- seabios-0.6.1.2-3.el6
- seabios-0.6.1.2-3.el6_1.1
- seabios-0.6.3-0 (public release) 

How reproducible:
recurrently, but sporadically

Steps to Reproduce:
(no deterministic steps; the slowdown appears sporadically on running guests)
  
Actual results:
network performance (presumably TCP only) drops below 40 KB/s

Expected results:
stable network performance (normally ~ 40 MB/s)

Additional info:

Comment 1 Kai Mosebach 2011-10-12 09:46:32 UTC
- vhost_net is enabled on the KVM host
- ip link output looks similar for all VMs

the server interface:

6: macvhost0@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP 
    link/ether XXX brd ff:ff:ff:ff:ff:ff

one of the guest interfaces:
8: macvtap1@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 500
    link/ether XXX brd ff:ff:ff:ff:ff:ff

On one Windows guest a "connectivity problem" was shown once on the (virtual) network interface, but the connection was still working (slowly).

Running "nload" on one KVM host against the Windows guest's macvtap interface repeatedly showed this traffic pattern while a file was downloading in the corresponding guest: 100 KB/s -> 40 KB/s -> 0 KB/s -> 100 KB/s -> 40 KB/s -> 0 KB/s

Comment 3 Kai Mosebach 2011-10-13 09:28:46 UTC
Wireshark reveals more info; between a KVM guest (an HTTP server) and a client we see a lot of

HTTP	[TCP Retransmission] Continuation or non-HTTP traffic
HTTP	[TCP Fast Retransmission] Continuation or non-HTTP traffic
HTTP	[TCP Out-Of-Order] Continuation or non-HTTP traffic

which explains why ICMP has no problems but TCP does... will go and try the EL6.2 beta kernel next.
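
(A capture of the kind analyzed here can be taken on the host side; the interface and port follow this report, the filename is arbitrary:)

    # capture the guest's HTTP traffic at its macvtap device for offline
    # inspection of retransmissions and out-of-order segments in wireshark
    tcpdump -i macvtap1 -s 0 -w guest-slow.pcap 'tcp port 80'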

Comment 4 Michael S. Tsirkin 2011-10-16 19:28:06 UTC
So to clarify:

is it a single stream that is affected?
do other streams keep going?

Would a certain number of packets getting reordered
before being transmitted out of the host
explain the observed behaviour?

Comment 5 Kai Mosebach 2011-10-24 14:19:57 UTC
all streams are slow (and there is no "keep going", since they are all just slow).
It's the whole traffic of a machine that is affected, as soon as TCP is involved; could not try UDP yet (see the sketch below).
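
(For the untried UDP check, one hedged possibility using classic iperf; the target address and bandwidth are assumptions, not from this report:)

    # in the guest: run 'iperf -s -u' first
    # then push UDP at a fixed rate from a client and watch loss/jitter,
    # which would separate TCP-specific trouble from general packet loss
    iperf -c <guest-ip> -u -b 50M -t 30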

I don't get the 2nd question; you mean that an initial disorder would/could keep the whole stream "disordered" by some packets?

Comment 6 Michael S. Tsirkin 2011-10-24 14:32:55 UTC
Could it be that restarting a VM causes the problem?
Does the problem happen if you don't restart any VMs?

Could the problem be the host being out of memory?
Could you try scripting it so we get /proc/meminfo and
/proc/slabinfo on the host around the time traffic slows?

Comment 7 Kai Mosebach 2011-10-24 15:01:39 UTC
Note: the EL6.2 beta kernel does not change the behaviour.

Memory: that is highly improbable; this machine has 72 GB, others where we saw the problem have 256 GB, and there is only one 32 GB machine.

Restart of the VM: no, even a shutdown and clean start brought the problem back. Migrating the VM to another KVM host solved the issue, though.

It also seems that under load machines sometimes "lose" their macvtap adapter, but that's another story; we will investigate it first and file a bug report once we have enough details. Maybe it's related (i.e. the "disconnected network interface" in Windows could be related).

What values of slabinfo / meminfo should I monitor? Anything else I could look into? Bridge states etc.?

Comment 8 Michael S. Tsirkin 2011-10-24 16:58:55 UTC
MemFree and kmalloc-4096, I guess.

Yes, could be related. What do you mean by "lose"?
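
(A hedged sketch of the polling requested above; the log path and interval are arbitrary choices, and on RHEL 6's SLAB allocator the slab cache is named size-4096 rather than kmalloc-4096:)

    #!/bin/sh
    # poll host memory state so it can be correlated with slowdown timestamps
    while :; do
        date >> /var/tmp/memwatch.log
        grep MemFree /proc/meminfo >> /var/tmp/memwatch.log
        # cache name differs by allocator: kmalloc-4096 (SLUB) / size-4096 (SLAB)
        grep -E 'kmalloc-4096|size-4096' /proc/slabinfo >> /var/tmp/memwatch.log
        sleep 10
    done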

Comment 9 Kai Mosebach 2011-10-24 18:35:35 UTC
- guest is not pingable anymore
- macvtap adapter is still visible on the KVM server
- login on the serial console fails (accounts are LDAP-connected and time out, even for root)
- log server does not reveal anything (presumably due to the network failure)
- /var/log/libvirt/qemu/<host>.log does not show any anomalies
- VM can be shut down cleanly with virsh shutdown <vmname>

There is no MemFree or kmalloc entry in either /proc/slabinfo or /proc/meminfo...

Comment 10 Kai Mosebach 2011-10-27 14:16:22 UTC
Additional info on the "lost" connection:

- virsh save on the one host, and then
- virsh restore on another host, restores the network connectivity (see the sketch below)
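
(In commands, roughly; the state-file path is a placeholder, not from this report:)

    # on the affected host: save the guest state and stop it
    virsh save <vmname> /var/tmp/<vmname>.sav
    # copy the state file to another host, then bring the guest back there
    virsh restore /var/tmp/<vmname>.sav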

Comment 14 Michael S. Tsirkin 2011-12-12 19:50:05 UTC
To Comment 10:
What about a restore on the same host after this?

Comment 17 RHEL Program Management 2012-07-10 08:17:46 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 18 RHEL Program Management 2012-07-11 02:01:12 UTC
This request was erroneously removed from consideration in Red Hat Enterprise Linux 6.4, which is currently under development.  This request will be evaluated for inclusion in Red Hat Enterprise Linux 6.4.

Comment 19 Michael S. Tsirkin 2012-07-15 15:08:57 UTC
restoring needinfo: comment 14

Comment 20 Michael S. Tsirkin 2012-07-15 15:11:04 UTC
Does this get QA ack?
Can QA reproduce?

Comment 21 Ronen Hod 2012-07-15 15:12:27 UTC
Hi Kai,

Thank you for taking the time to enter a bug report with us. We appreciate the feedback and look to use reports such as this to guide our efforts at improving our products. That being said, this bug tracking system is not a mechanism for requesting support, and we are not able to guarantee the timeliness or suitability of a resolution.
 
If this issue is critical or in any way time sensitive, please raise a ticket through your regular Red Hat support channels to make certain it receives the proper attention and prioritization to assure a timely resolution.
 
For information on how to contact the Red Hat production support team, please visit:
https://www.redhat.com/support/process/production/#howto

Thanks, Ronen.

Comment 22 Kai Mosebach 2012-07-15 15:49:27 UTC
Hi Michael,

Sorry for the long delays; at the moment we do not see the problem too often.

I strongly assume, though, that it is related to this:

http://comments.gmane.org/gmane.comp.emulators.libvirt.user/2706

and to vhost checksum problems, since we saw dropped packets on the corresponding devices, i.e.

RX packets:526 errors:84511 dropped:84511 overruns:0 frame:0
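
(Counters like these are visible per interface on the host; for example, assuming the macvtap1 device named earlier in this report:)

    # per-interface RX/TX error and drop counters
    ip -s link show macvtap1
    # or the older ifconfig form, which prints lines like the one above
    ifconfig macvtap1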

Next time we see the problem, I will try to save and restore the machine.

best Kai

Comment 23 FuXiangChun 2012-07-17 10:47:13 UTC
(In reply to comment #20)
> does this get qa ack?
> can qa reproduce?

Can't reproduce this issue.

Below are the steps of my testing.
1. Set up 20 macvtap interfaces on top of eth0:

i=10
while [ $i -lt 30 ]
do
    # note: without an explicit 'mode bridge', macvtap defaults to vepa mode
    ip link add link eth0 dev macvtap$i type macvtap
    ip link set macvtap$i address 00:24:E8:81:14:$i up
    i=$(($i+1))
done
 
2. Boot 20 guests on the same host, using a command line like:

/usr/libexec/qemu-kvm -cpu host -enable-kvm -smp 2 -m 2G \
    -usb -device usb-tablet,id=input0 \
    -name test -uuid `uuidgen` \
    -drive file=/root/vm-images/rhel6.1,if=none,id=hd,format=qcow2,aio=native,cache=none,werror=stop,rerror=stop \
    -device ide-drive,drive=hd,id=blk_image,bootindex=1 \
    -netdev tap,id=netdev0,fd=42 \
    -device virtio-net-pci,netdev=netdev0,id=device-net0,mac=00:24:e8:81:14:19 \
    42<>/dev/tap42 \
    -vnc :20 -balloon none -device sga \
    -chardev socket,id=serial0,path=/var/test1,server,nowait \
    -device isa-serial,chardev=serial0 &

(the fd number, tap device, MAC, and VNC display vary per guest)

3. Check each guest and transfer a 2 GB file to another machine by scp (see the sketch below).
4. Repeat multiple times; the transfer rate is normal.
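
(For step 3, a hedged sketch; the file size follows the report, host names are placeholders:)

    # create a 2 GB test file in the guest, then copy it out while
    # watching the reported transfer rate
    dd if=/dev/urandom of=/tmp/testfile-2g bs=1M count=2048
    scp /tmp/testfile-2g user@<other-host>:/tmp/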

If I have missed anything in the above steps, please correct me.

Comment 24 Ronen Hod 2012-07-21 20:04:47 UTC
Closing.
For now, QE cannot reproduce it, and the reporter also no longer encounters it (or at least much less often).
Let's reopen if we have new data.
Added Vlad to the CC list, since this reminds me a little of Bug 795314 (GRO).

Thanks, Ronen.
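
(For anyone revisiting the GRO theory later: the offload state on the physical NIC can be inspected and toggled, e.g. on the eth0 from this report; a sketch, not a confirmed fix:)

    # show current offload settings, including generic-receive-offload
    ethtool -k eth0
    # temporarily disable GRO to see whether the slowdown pattern changes
    ethtool -K eth0 gro off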