Bug 1210221

Summary: Netperf UDP_STREAM loses most of the packets on spapr-vlan device
Product: Red Hat Enterprise Linux 7 Reporter: Zhengtong <zhengtli>
Component: qemu-kvm-rhev Assignee: Thomas Huth <thuth>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.2 CC: dgibson, gklein, hannsj_uhl, knoel, lvivier, michen, mrezanin, qzhang, thuth, virt-maint, zhengtli
Target Milestone: rc   
Target Release: ---   
Hardware: ppc64le   
OS: Linux   
Whiteboard:
Fixed In Version: qemu-kvm-rhev-2.6.0-5.el7 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-11-07 20:22:39 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1359843    

Description Zhengtong 2015-04-09 08:19:29 UTC
Description of problem:

With the guest running netserver and netperf run from the host, nearly 100% of the packets are lost in a UDP_STREAM test.

Version-Release number of selected component (if applicable):

Host kernel: 3.10.0-234.el7.ppc64
Guest kernel: 3.10.0-229.el7.ppc64
qemu-kvm-rhev-2.2.0-8.el7
Netperf version 2.6.0

How reproducible:

100%

Steps to Reproduce:
1. Start up the guest with a spapr-vlan device.
#/usr/libexec/qemu-kvm -name liuzt-RHEL-7.1-20150219.1 -machine pseries-rhel7.1.0,accel=kvm,usb=off -m 32768 -realtime mlock=off -smp 64,sockets=1,cores=16,threads=4 \
-uuid 95346a10-1828-403a-a610-ac5a52a29483  \
-monitor stdio \
-rtc base=localtime,clock=host \
-no-shutdown \
-boot strict=on \
-device usb-ehci,id=usb,bus=pci.0,addr=0x2 \
-device pci-ohci,id=usb1,bus=pci.0,addr=0x1 \
-device spapr-vscsi,id=scsi0,reg=0x1000 \
-drive file=/var/lib/libvirt/images/liuzt-RHEL-7.1-20150219.1-Server-ppc64.img,if=none,id=drive-scsi0-0-0-0,format=raw,cache=none \
-device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 \
-serial pty \
-device usb-kbd,id=input0 \
-device usb-mouse,id=input1 \
-device usb-tablet,id=input2 \
-vnc 0:19 -device VGA,id=video0,vgamem_mb=16,bus=pci.0,addr=0x4 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 \
-msg timestamp=on \
-netdev tap,id=hostnet0,script=/etc/qemu-liuzt-ifup,downscript=/etc/qemu-liuzt-ifdown \
-device spapr-vlan,netdev=hostnet0,id=net0,mac=52:54:00:c4:e7:83,reg=0x2000 \

2. Start netserver in the guest (192.168.200.100 is the guest IP):
  # netserver -p 44444 -L 192.168.200.100 -f -D -4 &

3. Start netperf on the host and collect the results:
  #netperf   -p 44444 -L 192.168.200.1  -H 192.168.200.100 -t UDP_STREAM -l 60 -- -m 16384

Actual results:

Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

229376   16384   60.00     1569894      0    3429.47
229376           60.00          24              0.05

Expected results:

Packet loss shouldn't be ~100%. In a pure host-to-guest network environment, the packet loss should be less than 5%.

Additional info:

Comment 2 David Gibson 2015-04-10 00:27:55 UTC
Sorry, I'm not very familiar with netperf.  What part of the output is showing the packet loss?

Comment 3 David Gibson 2015-04-10 00:29:41 UTC
Also, where can I get netperf to try to reproduce this?

Comment 4 Zhengtong 2015-04-10 01:55:31 UTC
(In reply to David Gibson from comment #3)
> Also, where can I get netperf to try to reproduce this?

Hi David,

You can get netperf from  here. 
"ftp://ftp.netperf.org/netperf/netperf-2.6.0.tar.gz"

To clarify the packet loss, here is an example without issues, tested between two x86 hosts:

 Server: [root@dhcp-10-201 ~]# netserver -L 10.66.9.95  -f -D -4 &
 Client: [root@localhost liuzt]# netperf -H 10.66.9.95  -t UDP_STREAM -l 45 -- -m 16384

Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   16384   45.00      326513      0     951.03
212992           45.00      326510            951.02

Here, 951.03 Mb/s is the client's sending bandwidth and 951.02 Mb/s is the server's receiving bandwidth. They are nearly the same, which means there is almost no packet loss.

Compared with this, in our scenario the client's sending bandwidth is 3429.47 Mb/s while the server's receiving bandwidth is 0.05 Mb/s, so the data loss is almost 100%.
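
For reference, the loss ratio follows directly from the two netperf throughput lines; the tiny program below is only a back-of-the-envelope check and is not part of the original report:

/* Back-of-the-envelope check (not from the original report): derive the
 * loss ratio from the two netperf throughput lines quoted above. */
#include <stdio.h>

int main(void)
{
    double sent = 3429.47, received = 0.05;   /* 10^6 bits/sec */
    printf("loss = %.3f %%\n", 100.0 * (1.0 - received / sent));   /* ~99.999 % */
    return 0;
}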

Comment 5 Laurent Vivier 2015-07-20 10:29:56 UTC
I can reproduce the bug.

But I have to stop the firewalld service to allow the server and the client to connect.

Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

229376   16384   60.00     2013685      0    4398.93
229376           60.00          10              0.02

Comment 6 Laurent Vivier 2015-07-24 16:52:27 UTC
Packets seem to be lost between the ibmveth driver and the application.

All packets are received by the driver and sent to the NAPI framework via napi_gro_receive() but are never received by recvfrom().

Comment 7 Laurent Vivier 2015-07-27 17:48:19 UTC
It seems packets are received as long as there are available buffers at the hypervisor level, and then reception stops...

The hypervisor stops because it doesn't have any valid buffer left.

static ssize_t spapr_vlan_receive()
...

    if (!(bd & VLAN_BD_VALID) || (VLAN_BD_LEN(bd) < (size + 8))) {
        /* Failed to find a suitable buffer */
...
My understanding of the spapr-vlan protocol is that the driver doesn't issue the H_ADD_LOGICAL_LAN_BUFFER hypercall often enough.

Comment 8 Laurent Vivier 2015-07-29 08:26:24 UTC
To determine the maximum bandwidth, I use iperf3 instead of netperf.

https://iperf.fr/

server side on guest: iperf3 -s
client side on host:  iperf3 --udp -c 192.168.122.117 -l 16384 -b 4G

'-b' is the bandwidth in bits/s; 4G means "4 Gb/s", and in this case we get the same result as with netperf.

Bandwidth       Jitter       Lost/Total Datagrams
 996 Kbits/sec  7113.980 ms     0/76      (0%)
1.99 Mbits/sec  64.239 ms       3/152     (2%)
2.98 Mbits/sec  1.287 ms       16/227     (7%)
3.97 Mbits/sec  0.181 ms       57/302     (19%)
4.95 Mbits/sec  0.084 ms       92/378     (24%)
9.91 Mbits/sec  0.615 ms      365/750     (49%) 
99.0 Mbits/sec  0.196 ms     7167/7481    (96%) 
 992 Mbits/sec  25.710 ms   72112/72665   (99%)
1.99 Gbits/sec  79.970 ms  150088/150288  (100%)
3.00 Gbits/sec  50.301 ms  222135/222310  (100%)
3.99 Gbits/sec  17588.700 ms 213767/213829 (100%)

The highest bandwidth with less than 5% packet loss is 2 Mb/s.

Comment 9 Laurent Vivier 2015-07-29 16:57:12 UTC
More information:

After a while, the driver fails to give buffers back to QEMU with h_add_logical_lan_buffer(); the error is H_RESOURCE.

At the QEMU level, the buffers can't be taken back because the buffer list is considered full (dev->rx_bufs >= VLAN_MAX_BUFS (509)) -> H_RESOURCE.

The only way to decrease dev->rx_bufs is to receive a packet from the network (QEMU consumes a buffer then). But the receive function can't take a buffer because all (509) buffers are invalid. So QEMU can't consume buffers to receive packets, yet it can't accept new buffers either.

The question is: who marks the buffers as VALID ?

Comment 10 David Gibson 2015-07-30 01:44:49 UTC
Ok, here's what I can figure out from a combination of memory and reading the code.

 * The interface uses the concept of a "buffer descriptor": a 64-bit quantity including the buffer address, length and some flags (including VALID); a layout sketch follows at the end of this comment.

 * IIUC, buffer descriptors handed in via h_add_logical_lan_buffer should be VALID.

 * When a buffer is consumed by the Rx path, its descriptor is cleared, which also clears VALID.  This is the:
        vio_stq(sdev, dev->buf_list + dev->use_buf_ptr, 0);
in spapr_vlan_receive().

 * dev->rx_bufs is *supposed* to be the number of VALID buffers in the pool.  Since you're getting a situation with no valid buffers but rx_bufs too large to allow h_add_logical_lan_buffer(), it looks like these are getting out of sync somehow.

I'd add a check in h_add_logical_lan_buffer() to make sure that the guest is not passing in invalid buffer descriptors - that would certainly cause those to get out of sync.  I'm guessing that when I wrote it, I thought check_bd() would ensure that, but it looks like it doesn't.

Although if we're leaking buffers like this, I would have expected it to come to a complete halt and not let any data through.  You could check if the guest is getting timeouts and resetting the device though, which would fix it temporarily.
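
For readers unfamiliar with the device, the buffer descriptor layout referred to above looks roughly like the following. It is reconstructed from memory of hw/net/spapr_llan.c, so verify the exact bit positions against the source before relying on them:

/* Rough sketch of the spapr-vlan buffer descriptor layout (reconstructed
 * from memory of hw/net/spapr_llan.c; check against the real source): */
#define VLAN_BD_VALID        0x8000000000000000ULL  /* descriptor is usable    */
#define VLAN_BD_LEN_MASK     0x00ffffff00000000ULL  /* buffer length in bytes  */
#define VLAN_BD_LEN(bd)      (((bd) & VLAN_BD_LEN_MASK) >> 32)
#define VLAN_BD_ADDR_MASK    0x00000000ffffffffULL  /* guest physical address  */
#define VLAN_BD_ADDR(bd)     ((bd) & VLAN_BD_ADDR_MASK)
#define VLAN_VALID_BD(addr, len)  (VLAN_BD_VALID | (((uint64_t)(len)) << 32) | (addr))

Clearing the descriptor with vio_stq(..., 0), as quoted above, therefore clears VALID, the length and the address in one store.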

Comment 11 Laurent Vivier 2015-07-30 18:46:21 UTC
What happens:

1- The buf_list is only 509 entries long, so under high pressure it quickly runs empty and QEMU must wait for the driver to refill it with h_add_logical_lan_buffer(); during this waiting period all incoming packets are dropped. Moreover, once the 509 entries are filled, the driver must wait for the next incoming packets before it can try to add the remaining available buffers.

2- The size of incoming packets is ~1500 bytes, so after a while the buf_list ends up filled with 512-byte buffers, since those are never used. This is why the buf_list is not empty even though we can't use the buffers it contains: they are too small for the incoming packets. This has two possible consequences: in the worst case, no bigger buffer is available and the packets are discarded; in the better case, "some" (= "a lot" under high pressure) packets are discarded because the receive function becomes slow, as it has to scan the buf_list to find a buffer of a suitable size while at least half of the buf_list is filled with 512-byte buffers.

We can solve (1) by adding a "cache" in QEMU spapr-vlan to store released buffers while the buf_list is full (I think we can't increase the size of buf_list).

To solve (2) we can either modify the driver to not use buffers smaller than 2048 bytes, or modify the previous "cache" to always add bigger buffers first (but this will not solve the slowness of the receive function when the buf_list is half filled with 512-byte buffers).

But the root problem seems to be in the way spapr-vlan works; the proper solution could be to use virtio-net instead...
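
To make the interplay of rx_bufs, the VALID flag and the buffer sizes described in comments 9 and 11 concrete, here is a toy, self-contained C model. It is not QEMU code; the constants, names and the H_RESOURCE value merely mirror the discussion above, and the slot management is deliberately simplified:

/* Toy model (not QEMU code) of the rx_bufs bookkeeping discussed above. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define VLAN_MAX_BUFS 509
#define H_SUCCESS       0
#define H_RESOURCE    (-3)        /* arbitrary error value for this model */

#define BD_VALID   (1ULL << 63)
#define BD_LEN(bd) ((uint32_t)((bd) >> 32) & 0xffffff)

static uint64_t buf_list[VLAN_MAX_BUFS];   /* fake buffer descriptors        */
static int rx_bufs;                        /* QEMU's count of usable buffers */

/* models h_add_logical_lan_buffer(): refuses new buffers once the counter
 * says the list is full, even if none of the stored buffers is usable */
static int add_buffer(uint64_t bd)
{
    if (rx_bufs >= VLAN_MAX_BUFS) {
        return H_RESOURCE;
    }
    buf_list[rx_bufs] = bd;                /* slot management simplified */
    rx_bufs++;
    return H_SUCCESS;
}

/* models spapr_vlan_receive(): needs a VALID descriptor large enough for
 * the packet; only then does rx_bufs decrease again */
static bool receive(uint32_t pkt_len)
{
    for (int i = 0; i < VLAN_MAX_BUFS; i++) {
        uint64_t bd = buf_list[i];
        if ((bd & BD_VALID) && BD_LEN(bd) >= pkt_len + 8) {
            buf_list[i] = 0;               /* descriptor consumed and cleared */
            rx_bufs--;
            return true;
        }
    }
    return false;                          /* packet dropped */
}

int main(void)
{
    /* Fill the whole list with 512-byte buffers ... */
    for (int i = 0; i < VLAN_MAX_BUFS; i++) {
        add_buffer(BD_VALID | (512ULL << 32));
    }
    /* ... then a ~1500-byte frame finds no suitable buffer (prints 0),
     * and the driver can't add a bigger one either (prints -3): */
    printf("receive(1500)            = %d\n", receive(1500));
    printf("add_buffer(2048-byte bd) = %d\n", add_buffer(BD_VALID | (2048ULL << 32)));
    return 0;
}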

Comment 12 David Gibson 2015-07-31 05:51:31 UTC
Adding an extra queue of packets in qemu is a bad idea.  a) it allows guest activity to cause qemu to allocate memory without a clear bound and b) it's likely to mess up application level flow control algorithms (flow control works best when intermediate queues are kept to a minimum).

Remember that blasting UDP packets with no flow control can be an interesting benchmark, but every real application must include some kind of flow control to be usable.

What's unclear to me at this point is this: is it simply that the driver isn't keeping up with the flow of packets, or is something more going on?  2Mbps seems terribly, terribly slow - even with qemu having to frequently scan the receive buffer list.

 * Changing the driver not to allocate the 512 byte buffers might help - you should be able to experiment without recompiling: it looks like the driver lets you set the number of buffers in each pool size from sysfs

 * I can think of a couple of possible optimizations in the qemu side code which might help, though I can't be certain.  1. In the loop across the buffer list in spapr_vlan_receive(), iterate until you've seen at most dev->rx_bufs unsuitable buffers.  That might cut the loop short when no buffers in the pool are large enough (a rough sketch of this idea follows at the end of this comment).  2. More complex: instead of just dev->rx_bufs, keep multiple counters for buffers of different sizes; check those before looping through every buffer.

 * The complex buffer pool stuff in the guest driver seems real overkill - I suspect it's there mostly for the benefit of guests running under PowerVM and may not be well tested with the qemu implementation.

 * I thought you were right in that it wasn't possible to increase the number of buffers (because I thought the buffer list was limited to 1 2k page -> 512 buffer descriptors minus a few reserved ones -> 509 buffers).  However on the driver side, I see IBMVETH_MAX_POOL_COUNT == 4096, which suggests otherwise.  Again, this could be only with PowerVM extensions we haven't implemented in qemu.
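
The following is an untested, standalone illustration of optimization (1) above, referenced in the list. The helper find_rx_bd() and the macro names are made up for this sketch and only mirror the QEMU code discussed in this bug; it is not a patch:

/* Scan a circular list of buffer descriptors, but give up once rx_bufs
 * valid-yet-unsuitable descriptors have been seen, since no further valid
 * descriptors can exist.  Illustration only, not QEMU code. */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define BD_VALID   (1ULL << 63)
#define BD_LEN(bd) ((uint32_t)((bd) >> 32) & 0xffffff)

/* Returns the index of a descriptor able to hold size + 8 bytes, or -1. */
static int find_rx_bd(const uint64_t *buf_list, int n_entries,
                      int start, int rx_bufs, size_t size)
{
    int seen_valid = 0;

    for (int i = 0, ptr = start; i < n_entries; i++) {
        ptr = (ptr + 1) % n_entries;              /* circular scan */
        uint64_t bd = buf_list[ptr];
        if (!(bd & BD_VALID)) {
            continue;                             /* empty slot, keep looking */
        }
        if (BD_LEN(bd) >= size + 8) {
            return ptr;                           /* suitable buffer found */
        }
        if (++seen_valid >= rx_bufs) {
            return -1;      /* all valid descriptors inspected, none fits */
        }
    }
    return -1;
}

int main(void)
{
    uint64_t bds[4] = { 0, BD_VALID | (512ULL << 32), 0, BD_VALID | (2048ULL << 32) };

    /* rx_bufs = 2 valid descriptors; a 1500-byte packet fits in slot 3 */
    printf("%d\n", find_rx_bd(bds, 4, 0, 2, 1500));   /* prints 3 */
    return 0;
}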

Comment 13 Thomas Huth 2015-07-31 08:56:53 UTC
(In reply to David Gibson from comment #12)
> What's unclear to me at this point is this: is it simply that the driver
> isn't keeping up with the flow of packets, or is something more going on? 
> 2Mbps seems terribly, terribly slow - even with qemu having to frequently
> scan the receive buffer list.

Maybe it's worth a try to use a profiler like gprof or callgrind on QEMU, to see where it spends most of its time in this case? That could help to optimize the hot spot.

Comment 14 Laurent Vivier 2015-07-31 18:16:43 UTC
(In reply to David Gibson from comment #12)
...
>  * Changing the driver not to allocate the 512 byte buffers might help - you
> should be able to experiment without recompiling: it looks like the driver
> lets you set the numbers of buffers in each pool size from sysfs

I've been able to disable pool0 (512-byte buffers) and add buffers to pool1 (2048 bytes); there is no longer any loop searching for a well-sized buffer, but as the internal buffer list of spapr-vlan is still only 509 entries long, it is very often empty. So many packets are discarded at the QEMU level (spapr_vlan_can_receive() is false because rx_bufs is 0).

>  * I can think of a couple of possible optimizations in the qemu side code
> which might help, though I can't be certain.  1. In the loop across the
> buffer list in spapr_vlan_receive(), iterate until you've seen as most
> dev->rx_bufs unsuitable buffers.  That might cut the loop short when no
> buffers in the pool are large enough.  2. More complex, instead of just
> dev->rx_bufs, keep multiple counters for buffers of different sizes; check
> those before looping through every buffer.

Perhaps useless work if we can manage the issue by disabling 512-byte buffers?

>  * I thought you were right in that it wasn't possible to increase the
> number of buffers (because I thought the buffer list was limited to 1 2k
> page -> 512 buffer descriptors minus a few reserved ones -> 509 buffers). 
> However on the driver side, I see IBMVETH_MAX_POOL_COUNT == 4096, which
> suggests otherwise.  Again, this could be only with PowerVM extensions we
> haven't implemented in qemu.

IBMVETH_MAX_POOL_COUNT is for the driver's internal pools (pool0, pool1, ...).
I think the driver doesn't know the size of rx_bufs, as it tries to add buffers until it receives an H_RESOURCE error.

(In reply to Thomas Huth from comment #13)
> (In reply to David Gibson from comment #12)
> > What's unclear to me at this point is this: is it simply that the driver
> > isn't keeping up with the flow of packets, or is something more going on? 
> > 2Mbps seems terribly, terribly slow - even with qemu having to frequently
> > scan the receive buffer list.
> 
> Maybe it's worth a try to use a profiler like gprof or callgrind on QEMU, to
> see where it spends most of its time in this case? That could help to
> optimize the hot spot.

More bugs on PPC64LE:
- gprof is broken (in glibc). I have filed a BZ.
  https://bugzilla.redhat.com/show_bug.cgi?id=1249102
- valgrind is broken too:
  qemu-system-ppc64: cannot set up guest memory 'ppc_spapr.ram': Invalid argument
  The bug seems to be in QEMU; I'll file a BZ if needed.
- oprofile seems broken too. I'm not able to use "operf --pid", only
  "operf --system-wide" and the "--callgraph" parameter doesn't work.
  It is not usable. I'll check if a BZ is needed...

Comment 16 David Gibson 2015-08-03 01:14:53 UTC
Right, but did changing the packet pool sizes change the maximum effective throughput at all?  What I'm trying to determine here is whether looping through the buffers in qemu is a cause of the slowness, or if it's simply that the guest is unable to keep up supplying buffers.

Another thing to check: KVM exits are very expensive on Power - even more than on x86.  Is the qemu IO thread running on the same host CPU as the guest vcpu threads?

Comment 17 Laurent Vivier 2015-08-05 12:12:00 UTC
(In reply to David Gibson from comment #16)

For 100 Mb/s of iperf3 UDP traffic, with threads on distinct CPUs.

If we change the pool size, the number of lost packets decreases and the jitter is lower.

In the 512-byte buffer case, most of the time in spapr_vlan_receive() (98%) is spent finding a suitable buffer (>= 1504 bytes), 98% of the packets are lost, and 75% of the CPU time goes to the QEMU main loop (I/O).

In the 2048-byte buffer case, spapr_vlan_can_receive() is called 10x more often than in the previous case, and only 12% of the time in spapr_vlan_receive() goes to the buffer loop. We can guess QEMU is waiting for free buffer space before delivering new packets. 88% of the packets are lost and 16% of the CPU time goes to the QEMU main loop.

Details:
-----------------------------------------------------------------------------
512-byte buffer case:

[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-120.00 sec  1.40 GBytes  99.9 Mbits/sec  0.152 ms  88918/91405 (97%)
[  4] Sent 91405 datagrams

2,967,183  hw/net/spapr_llan.c:spapr_vlan_can_receive
   800,885  {
   320,354      VIOsPAPRVLANDevice *dev = qemu_get_nic_opaque(nc);
 1,281,416  => net/net.c:qemu_get_nic_opaque (160177x)
         .
 1,205,236      return (dev->isopen && dev->rx_bufs > 0);
   640,708  }


244,552,891  hw/net/spapr_llan.c:spapr_vlan_receive
...
         .      do {
35,167,012          buf_ptr += 8;
52,750,518          if (buf_ptr >= (VLAN_RX_BDS_LEN + VLAN_RX_BDS_OFF)) {
    34,545              buf_ptr = VLAN_RX_BDS_OFF;
         .          }
         .
52,785,063          bd = vio_ldq(sdev, dev->buf_list + buf_ptr);
         .          DPRINTF("use_buf_ptr=%d bd=0x%016llx\n",
         .                  buf_ptr, (unsigned long long)bd);
11,570,334      } while ((!(bd & VLAN_BD_VALID) || (VLAN_BD_LEN(bd) < (size + 8)))
87,865,114               && (buf_ptr != dev->use_buf_ptr));
...
240,172,586  TOTAL LOOP = 98% of the time of the function

CPU%
75% main_loop
10% kvm_cpu_exec
-------------------------------------------------------------------------------

2048-byte buffer case:

[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-120.00 sec  1.40 GBytes  99.9 Mbits/sec  0.082 ms  80648/91411 (88%)
[  4] Sent 91411 datagrams

 25,116,799  hw/net/spapr_llan.c:spapr_vlan_can_receive
 6,629,865  {
 2,651,946      VIOsPAPRVLANDevice *dev = qemu_get_nic_opaque(nc);
10,607,784  => net/net.c:qemu_get_nic_opaque (1325973x)
         .
10,531,096      return (dev->isopen && dev->rx_bufs > 0);
 5,303,892  }


 45,377,776  hw/net/spapr_llan.c:spapr_vlan_receive
         .      do {
   725,944          buf_ptr += 8;
 1,088,916          if (buf_ptr >= (VLAN_RX_BDS_LEN + VLAN_RX_BDS_OFF)) {
       712              buf_ptr = VLAN_RX_BDS_OFF;
         .          }
         .
 1,089,628          bd = vio_ldq(sdev, dev->buf_list + buf_ptr);
         .          DPRINTF("use_buf_ptr=%d bd=0x%016llx\n",
         .                  buf_ptr, (unsigned long long)bd);
 1,814,860      } while ((!(bd & VLAN_BD_VALID) || (VLAN_BD_LEN(bd) < (size + 8)))
   725,944               && (buf_ptr != dev->use_buf_ptr));

 5,446,004 TOTAL LOOP = 12% of the time of the function

CPU%
16% main_loop
16% kvm_cpu_exec

Comment 19 David Gibson 2015-12-11 00:15:28 UTC
Hmm, does limited performance on a non-recommended NIC really warrant high priority?

Comment 20 Thomas Huth 2015-12-11 07:42:46 UTC
By the way, it could be that bug 1271496 ("Guest loses packets with 65507 packet size ...") contributes to the bad performance. It's a different bug, but it also causes packets to be dropped (but this time the problem is in the ibmveth driver of the guest). So it might be worth a try to do a "echo 0 > /sys/devices/vio/30000004/pool4/active" or so to see whether this increases the performance at least a little bit.

Comment 21 David Gibson 2015-12-14 02:11:51 UTC
I think it's unlikely that bug 1271496 is relevant.  AFAICT in the case tested here the message size is 16384 bytes.  Even with various header overheads that will be much less than the 65507 packet size which triggers the other bug.

Comment 22 Thomas Huth 2015-12-14 07:14:56 UTC
(In reply to David Gibson from comment #21)
> I think it's unlikely that bug 1271496 is relevant.  AFAICT in the case
> tested here the message size is 16384 bytes.  Even with various header
> overheads that will be much less than the 65507 packet size which triggers
> the other bug.

The MTU size in bug 1271496 is still 1500, so the big ICMP packets are fragmented into smaller IP packets there - QEMU still steps through all RX buffer types anyway, since it does not distinguish different RX buffer pools the way PowerVM does. So assuming that QEMU also uses the big RX buffers here, it's likely that some packets are lost due to bug 1271496 here, too.

Comment 23 Gil Klein 2016-01-06 13:59:48 UTC
(In reply to David Gibson from comment #19)
> Hmm, does limited performance on a non-recommended NIC really warrant high
> priority?
No, looks like a mistake. sorry about that.

Comment 24 Thomas Huth 2016-03-07 08:45:17 UTC
This bug might be related to BZ 1210221, so I'll take a look at this here, too, while working on that other bug.

Comment 25 Thomas Huth 2016-03-09 15:27:02 UTC
(In reply to Thomas Huth from comment #24)
> This bug might be related to BZ 1210221

I meant BZ 1271496 ("Guest loses packets with 65507 packet size for spapr-vlan emulated NIC card"), of course.

Comment 27 Thomas Huth 2016-03-14 16:27:53 UTC
Another interesting observation: The values look very different when running netserver on the KVM host instead of running the netserver in the guest:

MIGRATED UDP STREAM TEST from 192.168.122.214 () port 0 AF_INET to 192.168.122.1 () port 0 AF_INET
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

229376   16384   10.00       54534      0     714.77
229376           10.00       54534            714.77

Not sure yet, but this could be an indication that the bad values Zhengtong reported are due to the big QEMU I/O lock, i.e. the spapr-vlan device in QEMU cannot send and receive packets simultaneously because of that lock. The receiving I/O thread in QEMU is then busy with the huge amount of received packets and rarely drops the lock, so the sending thread in the guest can hardly send out any packets.

Comment 28 Thomas Huth 2016-03-18 12:40:49 UTC
Ok, forget about the conclusion in my last comment; I simply had wrong assumptions about what that netperf test is doing.

So some more observations:

1) When running the test between two guests instead of host-to-guest, the numbers get a little bit better, but are still far from good:

MIGRATED UDP STREAM TEST from 192.168.122.39 () port 0 AF_INET to 192.168.122.170 () port 0 AF_INET
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

229376   16384   20.00       89326      0     585.39
229376           20.00         443              2.90

2) When checking the "RX packets" counters via ifconfig in the guest, I can see that the driver in the guest apparently received ~660 MiB during the 20 seconds the test was running ... that should theoretically give a throughput of 33 MB/s = 264 Mbit/s - but the netserver only saw less than 1 Mbit/s. Hmm, I just saw that Laurent already mentioned that in comment 6 ... we might need to find out why the IP layer discards so many packets here...

Comment 29 Thomas Huth 2016-03-18 12:42:23 UTC
BTW, the problem also occurs on little endian, so I'm setting the hardware field to ppc64le now (since we only officially support ppc64le on the host).

Comment 30 Thomas Huth 2016-03-22 09:00:42 UTC
For the record: When using the "e1000" NIC instead of "spapr-vlan", I get values like this:

Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

229376    1440   60.00     10852664      0    2083.71
229376           60.00      345587             66.35

So ideally, spapr-vlan should get results similar to the e1000.

Comment 31 Thomas Huth 2016-03-22 09:11:41 UTC
Oops, that test in comment 30 was already with a different block size ... but with block size = 16384, I get similar or even better results with the e1000 NIC:

Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

229376   16384   60.00     1765430      0    3856.64
229376           60.00       45784            100.02

Comment 32 Thomas Huth 2016-03-22 09:38:26 UTC
When using smaller block sizes (less than the MTU), the results also look much better with the spapr-vlan NIC:

Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

229376    1450   60.00     11124413      0    2150.72
229376           60.00      137636             26.61

Comment 33 Thomas Huth 2016-03-22 09:50:36 UTC
I think I now basically understand what is going wrong here with the spapr-vlan NIC: Since the test uses a message size of 16384 bytes, which is much bigger than the MTU of 1500 bytes, the netperf tool sends fragmented UDP packets to the netserver. Since the spapr-vlan interface is really not a high-performance interface, the receiver side in the guest sooner or later runs out of RX buffers, so some of the packets are dropped. Because the test uses fragmented IP packets, all other fragments of a UDP packet also have to be dropped by the receiver once one of its fragments got lost. And apparently, for almost all UDP packets at least one fragment is lost, so that almost all UDP packets are discarded in the guest kernel's IP layer. That's why the RX counters of "ifconfig" in the guest still show quite a lot of received data (see comment 28), while the netserver application hardly sees any packets at all - most of the data is dropped during fragment reassembly in the IP layer.
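
To put rough numbers on that, the small program below estimates the fragment count and the effect of per-fragment loss on whole datagrams. It assumes a plain 20-byte IPv4 header and a 90% per-fragment survival rate purely for illustration; it is not measured data:

/* Rough estimate: fragments per 16384-byte UDP message at MTU 1500, and
 * how per-fragment loss compounds into whole-datagram loss.
 * Build with: gcc -o frag frag.c -lm */
#include <stdio.h>
#include <math.h>

int main(void)
{
    int payload  = 16384 + 8;                    /* UDP data + UDP header   */
    int per_frag = 1500 - 20;                    /* MTU minus IPv4 header   */
    int frags    = (payload + per_frag - 1) / per_frag;   /* = 12           */
    double p     = 0.90;                         /* per-fragment survival   */

    printf("fragments per datagram: %d\n", frags);
    printf("datagram survival at p=%.2f: %.2f\n", p, pow(p, frags));  /* ~0.28 */
    return 0;
}

So even a modest per-fragment drop rate wipes out most of the reassembled datagrams, which matches the huge gap between the interface RX counters and what the netserver sees.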

Comment 34 Thomas Huth 2016-03-22 11:45:19 UTC
So the question is: why does, for example, the e1000 NIC show much better results with the 16384-byte packet size than the spapr-vlan NIC? ... I think I also got an answer for that one: the problem is the way the receive buffers are supplied by the guest to QEMU.

The e1000 driver in the guest can supply multiple receive buffers to the emulated NIC in the host at once, by writing an appropriate value to the RDT register (Receive Descriptor Tail). Once QEMU detects the write to the RDT register in set_rdt() of hw/net/e1000.c, it calls qemu_flush_queued_packets() - and this flushes the receive queue. So if the receiver ran out of buffers and the receive queue got stalled, the guest can add multiple new receive buffers at once, and QEMU then immediately tries to flush the queued received packets into these new buffers. If fragmented IP packets are waiting, the guest has likely added enough receive buffers at once so that all fragments can be passed to the guest, and the IP layer in the guest kernel can then reassemble the fragmented UDP packet successfully.

Now the interface of the spapr-vlan driver works differently: the guest can only supply one receive buffer at a time with the H_ADD_LOGICAL_LAN_BUFFER hypercall, and QEMU then tries to flush the queued packets with qemu_flush_queued_packets() after each single added buffer. So if the receive buffer queue ran empty at some point in time, QEMU delivers packets to the guest one by one, each time the guest adds a single buffer with the H_ADD_LOGICAL_LAN_BUFFER hypercall. This of course causes quite a bit of processing overhead, and chances are high that one fragment of a fragmented UDP packet is lost or arrives too late in the guest, so that the whole UDP datagram is dropped during reassembly in the guest.

One possible solution could be to not flush the RX queue after each single RX buffer submitted by the guest, but to wait a little until there are enough buffers available to hold a whole fragmented UDP packet.
Here is a quick-n-dirty proof-of-concept patch that waits until more than 200 buffers have accumulated in an RX buffer pool before trying to flush the received packets to the guest:

diff --git a/hw/net/spapr_llan.c b/hw/net/spapr_llan.c
index 9359f37..ce63b0f 100644
--- a/hw/net/spapr_llan.c
+++ b/hw/net/spapr_llan.c
@@ -203,7 +203,7 @@ static ssize_t spapr_vlan_receive(NetClientState *nc, const uint8_t *buf,
     }
 
     if (!dev->rx_bufs) {
-        return -1;
+        return 0;
     }
 
     if (dev->compat_flags & SPAPRVLAN_FLAG_RX_BUF_POOLS) {
@@ -212,7 +212,7 @@ static ssize_t spapr_vlan_receive(NetClientState *nc, const uint8_t *buf,
         bd = spapr_vlan_get_rx_bd_from_page(dev, size);
     }
     if (!bd) {
-        return -1;
+        return 0;
     }
 
     dev->rx_bufs--;
@@ -554,6 +554,9 @@ static target_long spapr_vlan_add_rxbuf_to_pool(VIOsPAPRVLANDevice *dev,
 
     dev->rx_pool[pool]->bds[dev->rx_pool[pool]->count++] = buf;
 
+    if (dev->rx_pool[pool]->count > 200)
+        qemu_flush_queued_packets(qemu_get_queue(dev->nic));
+
     return 0;
 }
 
@@ -627,7 +630,7 @@ static target_ulong h_add_logical_lan_buffer(PowerPCCPU *cpu,
 
     dev->rx_bufs++;
 
-    qemu_flush_queued_packets(qemu_get_queue(dev->nic));
+    //qemu_flush_queued_packets(qemu_get_queue(dev->nic));
 
     return H_SUCCESS;
 }

With this patch applied, I get much better results with netperf:

MIGRATED UDP STREAM TEST from 192.168.122.1 () port 0 AF_INET to 192.168.122.214 () port 0 AF_INET
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

229376   16384   60.00     1835897      0    4010.57
229376           60.00       20615             45.03

The final solution would of course require some kind of timer for flushing the queue, in case the guest never adds that many buffers.
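
Purely as an illustration of that last point, such a safety timer could look roughly like the sketch below, using QEMU's generic timer API. This is an unreviewed sketch on top of the proof-of-concept hack above, not the patch that was eventually merged; dev->flush_timer is a hypothetical field that would have to be added to VIOsPAPRVLANDevice, and the 5 ms period is an arbitrary choice:

/* Unreviewed sketch only, not the merged fix: periodically flush queued
 * RX packets even if the guest never reaches the 200-buffer threshold
 * used in the hack above.  "flush_timer" is a hypothetical new field. */
static void spapr_vlan_flush_timer_cb(void *opaque)
{
    VIOsPAPRVLANDevice *dev = opaque;

    qemu_flush_queued_packets(qemu_get_queue(dev->nic));
    timer_mod(dev->flush_timer,
              qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + 5);
}

/* ... and somewhere in the device init/realize code: */
dev->flush_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
                                spapr_vlan_flush_timer_cb, dev);
timer_mod(dev->flush_timer, qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + 5);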

Comment 35 Thomas Huth 2016-04-01 14:01:57 UTC
Suggested patch upstream: https://patchwork.ozlabs.org/patch/604094/

Comment 37 Miroslav Rezanina 2016-06-06 10:50:01 UTC
Fix included in qemu-kvm-rhev-2.6.0-5.el7

Comment 39 Zhengtong 2016-06-13 09:26:01 UTC
I tested again with the fixed version: qemu-kvm-rhev-2.6.0-5.el7

in Guest:
# netserver -p 44444 -L 192.168.122.32 -f -D -4 &

in Host:
#netperf -p 44444 -L 192.168.122.1 -H 192.168.122.32 -t UDP_STREAM -l 20 -- -m 16384

and the result is :
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

229376   16384   20.00      583811      0    3826.05
229376           20.00       25627            167.95


The performance is obviously much better than before, but still not as good as in comment #4.

Hi Thomas,
Is this performance as expected for spapr-vlan?

Comment 40 Thomas Huth 2016-06-13 11:06:41 UTC
Yes, performance is expected to be much worse than for virtio-net with vhost enabled, so IMHO you can mark this bug as verified.

Rationale: virtio-net with vhost can do much faster networking since it bypasses the bottlenecks in QEMU via the vhost kernel module. So if you need something to compare spapr-vlan with, you should rather compare it to the performance of an emulated e1000 card or virtio-net *without* vhost acceleration. (And even then the performance of spapr-vlan will likely be worse, since the interface of this NIC is really defined in a bad way: the guest can only supply one buffer at a time via a hypercall, so this will always be a bottleneck.)

Comment 41 Zhengtong 2016-06-13 12:18:02 UTC
Thanks, Thomas.

Comment 43 errata-xmlrpc 2016-11-07 20:22:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2673.html