Bug 1280040

Summary: Difficulty consistently processing more than 1 packet per burst with vhostuser
Product: Red Hat Enterprise Linux 7
Reporter: Andrew Theurer <atheurer>
Component: openvswitch-dpdk
Assignee: Flavio Leitner <fleitner>
Status: CLOSED WORKSFORME
QA Contact: Jean-Tsung Hsiao <jhsiao>
Severity: urgent
Priority: urgent
Version: 7.3
CC: aloughla, atragler, david.marchand, fbaudin, jean-mickael.guerin, kzhang, mleitner, rkhan, sukulkar, thibaut.collet, vincent.jardin
Target Milestone: rc
Target Release: 7.3
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-09-16 17:43:06 UTC
Bug Blocks: 1301628, 1313485

Description Andrew Theurer 2015-11-10 19:40:15 UTC
Description of problem:

The DPDK application, testpmd, running in a KVM VM, can typically only dequeue (rx) 1 packet per burst when using openvswitch-dpdk with vhostuser interfaces.


Version-Release number of selected component (if applicable):

openvswitch-dpdk-2.4.0-0.10346.git97bab959.1.el7.x86_64

How reproducible:


Steps to Reproduce:
1. Configure openvswitch-dpdk with the following (a sketch of the ovs-vsctl commands follows these steps):
    Bridge "ovsbr1"
        Port "dpdk1"
            Interface "dpdk1"
                type: dpdk
        Port "ovsbr1"
            Interface "ovsbr1"
                type: internal
        Port "vhost-user2"
            Interface "vhost-user2"
                type: dpdkvhostuser
    Bridge "ovsbr0"
        Port "ovsbr0"
            Interface "ovsbr0"
                type: internal
        Port "dpdk0"
            Interface "dpdk0"
                type: dpdk
        Port "vhost-user1"
            Interface "vhost-user1"
                type: dpdkvhostuser

(port type "dpdk" is a 10Gb Intel Ethernet physical function)

2. Create/start a VM with hugepage-backed memory and 2 vhost-user interfaces (a qemu/testpmd sketch for steps 2-3 also follows these steps)
3. Run testpmd with burst stats enabled and fwd mode set to "rxonly"
4. Run a packet generator on another host with 64-byte frames at 14.88 Mpps (10GbE line rate)
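
For reference, a minimal sketch of the ovs-vsctl commands that produce the step-1 configuration above. It assumes ovs-vswitchd was already started with DPDK support initialized; the exact DPDK initialization options differ between openvswitch-dpdk versions and are not shown here.

    # host: create the two bridges and attach the physical DPDK ports
    # and vhost-user ports named in the configuration above
    ovs-vsctl add-br ovsbr0 -- set bridge ovsbr0 datapath_type=netdev
    ovs-vsctl add-port ovsbr0 dpdk0 -- set Interface dpdk0 type=dpdk
    ovs-vsctl add-port ovsbr0 vhost-user1 -- set Interface vhost-user1 type=dpdkvhostuser
    ovs-vsctl add-br ovsbr1 -- set bridge ovsbr1 datapath_type=netdev
    ovs-vsctl add-port ovsbr1 dpdk1 -- set Interface dpdk1 type=dpdk
    ovs-vsctl add-port ovsbr1 vhost-user2 -- set Interface vhost-user2 type=dpdkvhostuser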
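
Likewise, a sketch of steps 2-3. The memory size, socket paths, core mask, MACs, and disk path are placeholders, not values taken from this report; vhost-user requires the guest memory to be hugepage-backed and shared. RX burst stats in testpmd of this generation are a DPDK compile-time option (CONFIG_RTE_TEST_PMD_RECORD_BURST_STATS=y).

    # step 2: guest with hugepage-backed, shared memory and two vhost-user NICs
    qemu-kvm -enable-kvm -cpu host -m 4096 -smp 3 \
        -object memory-backend-file,id=mem0,size=4096M,mem-path=/dev/hugepages,share=on \
        -numa node,memdev=mem0 -mem-prealloc \
        -chardev socket,id=char1,path=/var/run/openvswitch/vhost-user1 \
        -netdev type=vhost-user,id=net1,chardev=char1,vhostforce \
        -device virtio-net-pci,netdev=net1,mac=52:54:00:00:00:01 \
        -chardev socket,id=char2,path=/var/run/openvswitch/vhost-user2 \
        -netdev type=vhost-user,id=net2,chardev=char2,vhostforce \
        -device virtio-net-pci,netdev=net2,mac=52:54:00:00:00:02 \
        -drive file=/path/to/guest.img

    # step 3: inside the guest, receive-only forwarding with a 32-packet burst size
    testpmd -c 0x7 -n 4 --socket-mem 1024 -- --burst=32 -i
    testpmd> set fwd rxonly
    testpmd> start
    testpmd> show fwd stats all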

Actual results:

Packet rate into the VM is limited to about 4 million pps.  The bottleneck appears to be the VM's ability to dequeue packets off the rx ring.  Dequeues are done in "bursts", where ideally more than one packet is dequeued at a time (up to 32).  However, the bursts typically contain only 1 packet:

These are the rx burst stats from testpmd:

burst-size: 0 count: 0 percent: 0.00%
burst-size: 1 count: 245734373 percent: 99.27%
burst-size: 2 count: 1429921 percent: 0.58%
burst-size: 3 count: 46440 percent: 0.02%
burst-size: 4 count: 28377 percent: 0.01%
burst-size: 5 count: 16156 percent: 0.01%
burst-size: 6 count: 16747 percent: 0.01%
burst-size: 7 count: 18883 percent: 0.01%
burst-size: 8 count: 13881 percent: 0.01%
burst-size: 9 count: 8977 percent: 0.00%
burst-size: 10 count: 5895 percent: 0.00%
burst-size: 11 count: 4785 percent: 0.00%
burst-size: 12 count: 5081 percent: 0.00%
burst-size: 13 count: 5333 percent: 0.00%
burst-size: 14 count: 5541 percent: 0.00%
burst-size: 15 count: 6093 percent: 0.00%
burst-size: 16 count: 6911 percent: 0.00%
burst-size: 17 count: 7175 percent: 0.00%
burst-size: 18 count: 7742 percent: 0.00%
burst-size: 19 count: 6827 percent: 0.00%
burst-size: 20 count: 5297 percent: 0.00%
burst-size: 21 count: 4289 percent: 0.00%
burst-size: 22 count: 3757 percent: 0.00%
burst-size: 23 count: 3442 percent: 0.00%
burst-size: 24 count: 2870 percent: 0.00%
burst-size: 25 count: 2508 percent: 0.00%
burst-size: 26 count: 2617 percent: 0.00%
burst-size: 27 count: 2457 percent: 0.00%
burst-size: 28 count: 2267 percent: 0.00%
burst-size: 29 count: 2036 percent: 0.00%
burst-size: 30 count: 2027 percent: 0.00%
burst-size: 31 count: 2548 percent: 0.00%
burst-size: 32 count: 135681 percent: 0.05%

Expected results:

The majority of bursts should have a size of 32.

Additional info:

Analysis of openvswitch-dpdk on the host shows that it is enqueueing the packets in bursts of size 32, so this may not be a problem with openvswitch.  It's possible this is a problem with DPDK's virtio PMD.

Comment 6 Flavio Leitner 2016-08-14 02:22:12 UTC
Hi,

I could reproduce this with DPDK 16.04.  Interestingly, the stats are now a bit better, spread from 1 to 7 packets per batch, but still far from the upper limit of 64.

If I add a simple log to the virtio RX burst path that prints the burst size once every 10k received packets, then both the logs and the stats show 100% of bursts at the upper limit of 64.  So I believe the guest is polling too fast and finds only a few packets available to fetch each time.  Adding the debug output slows it down just enough for packets to accumulate in the virtio queue, allowing full batch sizes.

The problem remains, though, and the question is: if the host can't push more because the ring is full, why is the guest finding only a few packets?  To answer this, I disabled mergeable buffers in testpmd.  The result is a full batch size all the time:

Rx-bursts: 1002289 [99% of 64 pkts + 1% of others]

So it seems that mergeable buffers are causing the issue, but I don't yet know the reason.
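
(The comment does not show how mergeable buffers were disabled.  One common way, given here only as an illustration and not necessarily what was done in this test, is to turn the feature off on the guest's virtio NIC at the QEMU level so VIRTIO_NET_F_MRG_RXBUF is never negotiated:

    # illustration only: disable mergeable RX buffers on the vhost-user NIC
    qemu-kvm ... -device virtio-net-pci,netdev=net1,mac=52:54:00:00:00:01,mrg_rxbuf=off
)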

Comment 7 Flavio Leitner 2016-09-16 17:43:06 UTC
Hi,

I've looked into this more, this time using our current versions (OVS 2.5 + DPDK 2.2 on the host and DPDK 16.04 in the guest).

I could not reproduce the issue except in two situations: a low traffic rate, or the qemu vcpu thread running on another socket.

In any case, I changed the OVS code to record the last 64 batch sizes sent by the PMD, and changed the testpmd code to show the batch size and the number of used entries in the ring after a given number of packets.  I also enabled the RX burst stats in testpmd.

What I see is that with a low throughput rate the batches are obviously smaller, so the NIC delivers small batches each time and that is what the guest gets as a consequence.  This is normal and expected.

With higher throughput, close to a 0% drop rate, the number of used buffers in the guest bounces between 1/3 and 2/3 of the ring's total entries (255), so the guest can get batches with sizes varying from 1 to 64 (depending on the testpmd configuration).  testpmd can always read all the buffers that are available in the ring.  For instance, with 222 busy entries, testpmd will get at least 3 batches of 64 in sequence (222/64 > 3).

In the third case, qemu (the vcpu thread) is running on another socket, so the memory operations have the known additional cost.  Here, testpmd gets mostly one or very few packets in each batch.  Looking at the host, it is pushing batches of 32, but at a very low rate, because the time spent copying causes the Ethernet device's queue to overflow.  So vhost-user is actually forwarding everything (no drops), but the NIC is dropping most of the packets.  As a result, the guest gets mostly one packet per batch.  The throughput rate is about 1.6~1.8 Mpps, so it doesn't look like the case reported here.
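
As an aside, the cross-socket case can usually be avoided by keeping the OVS PMD threads and the guest's vcpus on the NUMA node that owns the NIC.  A sketch follows; the PCI address, CPU mask, and core numbers are placeholders, not values from this report:

    # which NUMA node owns the 10GbE NIC
    cat /sys/bus/pci/devices/0000:04:00.0/numa_node
    # keep the OVS PMD threads on cores of that node
    ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0xc
    # pin the guest's vcpus to cores on the same node (libvirt guest)
    virsh vcpupin <domain> 0 4
    virsh vcpupin <domain> 1 5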

I can revisit this, but based on the above and on experience with other related tests, I'd say that the current versions are okay and batching is working as expected.

Having said that, I will close this bug.  If you disagree, please re-open it, providing more details on how to reproduce the issue so I can take a second, more focused look.

Thanks!
fbl