Bug 1460690 - write() from PMD thread introduces delays to packet processing, causing packet loss
Status: NEW
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: openvswitch
Version: 7.3
Hardware: x86_64 Linux
Priority: high    Severity: medium
Target Milestone: rc
Assigned To: Kevin Traynor
QA Contact: ovs-qe@redhat.com
Depends On:
Blocks:

Reported: 2017-06-12 08:16 EDT by Andrew Theurer
Modified: 2017-08-04 16:56 EDT
CC: 11 users

Type: Bug


Attachments: None
Description Andrew Theurer 2017-06-12 08:16:25 EDT
Description of problem:

openvswitch's calls to write() from the PMD thread introduce delays in packet processing, which can cause packet loss.


Version-Release number of selected component (if applicable):
openvswitch-2.6.1-10 (-4 and -20 were also tested and show the same problem)

How reproducible: Easily; just use openvswitch with DPDK.


Steps to Reproduce:
1. configure OVS with at least 2 physical interfaces and 2 vhostuser interfaces (a setup sketch follows these steps)
2. start a VM with 2 vhostuser interfaces, with testpmd forwarding between the two
3. run a packet generator for 10 minutes at 3 Mpps on both interfaces, verify that packets are received at 3 Mpps per interface, and note any dropped packets
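
For reference, a minimal sketch of step 1 for one interface pair, driving ovs-vsctl from Python. The bridge and port names (br0, dpdk0, vhost-user0) are placeholders, not taken from this setup, and the same commands would be repeated for the second pair:

#!/usr/bin/env python3
# Sketch: create a userspace (netdev) bridge with one DPDK physical port
# and one vhost-user port.  All names below are placeholders.
import subprocess

def vsctl(*args):
    # Thin wrapper around ovs-vsctl; raises if the command fails.
    subprocess.run(["ovs-vsctl"] + list(args), check=True)

# Userspace datapath so the PMD threads handle the ports.
vsctl("add-br", "br0", "--", "set", "bridge", "br0", "datapath_type=netdev")

# Physical DPDK port (in OVS 2.6 the name dpdkN selects the Nth DPDK NIC).
vsctl("add-port", "br0", "dpdk0", "--", "set", "Interface", "dpdk0", "type=dpdk")

# vhost-user port for the VM that runs testpmd.
vsctl("add-port", "br0", "vhost-user0", "--",
      "set", "Interface", "vhost-user0", "type=dpdkvhostuser")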

Actual results:
The packet generator will show dropped packets in the hundreds to thousands range.

If you run ftrace (function tracer) on the CPUs that run the PMD threads, you can see a significant number of vfs_write() calls in the kernel.  These are the result of write() calls from the PMD thread.  You can also monitor I/O from the PMD process with strace.

Some of these write() calls can take up to 91 ms (91,000 usec) to complete, which is far too long for packet processing to stall when packets are arriving at a 3 Mpps rate.


Expected results:
Ideally, the PMD threads should not do any I/O.  I believe this I/O is for a pipe, but even that can cause issues.  The PMD threads should rarely make any system call in their processing loop.

Additional info:
Comment 2 Andrew Theurer 2017-06-12 08:46:32 EDT
Example of ftrace output for a write() from the PMD thread:

149996 [028] 86247.522310: function:  vfs_write
149996 [028] 86247.522311: function:     rw_verify_area
149996 [028] 86247.522311: function:        security_file_permission
149996 [028] 86247.522311: function:           selinux_file_permission
149996 [028] 86247.522311: function:              avc_policy_seqno
149996 [028] 86247.522311: function:     do_sync_write
149996 [028] 86247.522311: function:        tun_chr_aio_write
149996 [028] 86247.522311: function:           tun_get_user
149996 [028] 86247.522312: function:              sock_alloc_send_pskb
149996 [028] 86247.522312: function:                 __alloc_skb
149996 [028] 86247.522312: function:                    kmem_cache_alloc_node
149996 [028] 86247.522312: function:                       _cond_resched
149996 [028] 86247.522312: function:                    __kmalloc_reserve.isra.31
149996 [028] 86247.522312: function:                       __kmalloc_node_track_caller
149996 [028] 86247.522312: function:                          kmalloc_slab
149996 [028] 86247.522313: function:                          _cond_resched
149996 [028] 86247.522313: function:                    ksize
149996 [028] 86247.522313: function:              skb_put
149996 [028] 86247.522313: function:              skb_copy_datagram_from_iovec
149996 [028] 86247.522313: function:                 memcpy_fromiovecend
149996 [028] 86247.522313: function:              eth_type_trans
149996 [028] 86247.522313: function:              __skb_flow_dissect
149996 [028] 86247.522314: function:                 __skb_flow_get_ports
149996 [028] 86247.522314: function:              __skb_get_hash
149996 [028] 86247.522314: function:                 __skb_flow_dissect
149996 [028] 86247.522314: function:                    __skb_flow_get_ports
149996 [028] 86247.522314: function:              netif_rx_ni
149996 [028] 86247.522314: function:                 netif_rx_internal
149996 [028] 86247.522315: function:                    ktime_get_real
149996 [028] 86247.522315: function:                       getnstimeofday64
149996 [028] 86247.522315: function:                          __getnstimeofday64
149996 [028] 86247.522315: function:                    enqueue_to_backlog
149996 [028] 86247.522315: function:                       _raw_spin_lock
149996 [028] 86247.522316: function:                       _raw_spin_unlock
149996 [028] 86247.522316: function:                       kfree_skb
149996 [028] 86247.614081: function:              __skb_get_hash
149996 [028] 86247.614081: function:                 __skb_flow_dissect
149996 [028] 86247.614081: function:                    __skb_flow_get_ports
149996 [028] 86247.614082: function:              netif_rx_ni
149996 [028] 86247.614082: function:                 netif_rx_internal
149996 [028] 86247.614082: function:                    ktime_get_real
149996 [028] 86247.614082: function:                       getnstimeofday64
149996 [028] 86247.614082: function:                          __getnstimeofday64
149996 [028] 86247.614082: function:                    enqueue_to_backlog
149996 [028] 86247.614082: function:                       _raw_spin_lock
149996 [028] 86247.614083: function:                       _raw_spin_unlock
149996 [028] 86247.614083: function:                       kfree_skb
149996 [028] 86247.614083: function:                          skb_release_all
149996 [028] 86247.614083: function:                             skb_release_head_state
149996 [028] 86247.614083: function:                                sock_wfree
149996 [028] 86247.614083: function:                                   tun_sock_write_space
149996 [028] 86247.614083: function:                             skb_release_data
149996 [028] 86247.614084: function:                                kfree
149996 [028] 86247.614084: function:                          kfree_skbmem
149996 [028] 86247.614084: function:                             kmem_cache_free
149996 [028] 86247.614084: function:     __fsnotify_parent
149996 [028] 86247.614084: function:     fsnotify
149996 [028] 86247.614084: function:        __srcu_read_lock
149996 [028] 86247.614085: function:        __srcu_read_unlock
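
The roughly 91 ms stall is visible in the trace above as the jump from timestamp 86247.522316 to 86247.614081.  A minimal sketch for flagging such gaps automatically, assuming the "PID [CPU] TIMESTAMP: function: NAME" layout shown above and reading the trace on stdin:

#!/usr/bin/env python3
# Sketch: scan a function-trace dump and flag gaps larger than a threshold
# between consecutive events on the same CPU.
import re
import sys

GAP_US = 1000  # flag anything over 1 ms; threshold is arbitrary

pattern = re.compile(r"^\s*(\d+)\s+\[(\d+)\]\s+([\d.]+): function:\s+(\S+)")
last = {}  # cpu -> (timestamp, function)

for line in sys.stdin:
    m = pattern.match(line)
    if not m:
        continue
    cpu, ts, func = m.group(2), float(m.group(3)), m.group(4)
    if cpu in last:
        prev_ts, prev_func = last[cpu]
        gap_us = (ts - prev_ts) * 1e6
        if gap_us > GAP_US:
            print(f"cpu {cpu}: {gap_us:.0f} us gap after {prev_func} (next: {func})")
    last[cpu] = (ts, func)

Run against the trace above, this reports the ~91,765 us gap after kfree_skb.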
Comment 3 Flavio Leitner 2017-06-12 09:34:48 EDT
That happens when the packet is being sent to kernel space.
Can you confirm?
Thanks!
Comment 4 Andrew Theurer 2017-06-12 11:37:39 EDT
Since all of the ports created are of type netdev, what activity would cause packets to be sent to kernel space?  There is a LOCAL port in each bridge.  If there is any Tx or Rx on those LOCAL ports, does that count as packets being sent to kernel space?  And if so, when would this have to happen?
Comment 5 Andrew Theurer 2017-06-12 14:59:39 EDT
During a 20-minute test, bridge br-int's LOCAL port transmitted 14 packets (and received 0 new packets).  br-p2p1 also transmitted 24 packets on its LOCAL port.  None of the other OVS bridges' LOCAL ports showed any new activity during this time.  Would these transmissions cause packets to be sent to kernel space?
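
A small sketch of how such LOCAL-port counters can be watched over the test window by polling ovs-ofctl dump-ports.  The bridge names come from this comment, the 10-second interval is arbitrary, and the counters are assumed to be numeric (not "?"):

#!/usr/bin/env python3
# Sketch: print rx/tx packet deltas for the LOCAL port of each bridge.
import re
import subprocess
import time

BRIDGES = ["br-int", "br-p2p1"]  # names taken from this comment

def local_counters(bridge):
    out = subprocess.run(["ovs-ofctl", "dump-ports", bridge, "LOCAL"],
                         capture_output=True, text=True, check=True).stdout
    rx = int(re.search(r"rx pkts=(\d+)", out).group(1))
    tx = int(re.search(r"tx pkts=(\d+)", out).group(1))
    return rx, tx

prev = {b: local_counters(b) for b in BRIDGES}
while True:
    time.sleep(10)
    for b in BRIDGES:
        rx, tx = local_counters(b)
        drx, dtx = rx - prev[b][0], tx - prev[b][1]
        if drx or dtx:
            print(f"{b} LOCAL: +{drx} rx, +{dtx} tx packets")
        prev[b] = (rx, tx)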
Comment 6 Andrew Theurer 2017-06-13 20:29:16 EDT
Flavio, can you confirm that for any incoming broadcast-type packet, such as an ARP request or DHCP request, openvswitch will forward the packet to the LOCAL port (unless flow rules explicitly prohibit it)?  If so, and this causes a packet to be sent through the kernel, would that not delay packet processing (relative to DPDK speeds) and thus cause packet drops?
Comment 7 Flavio Leitner 2017-06-14 08:42:51 EDT
(In reply to Andrew Theurer from comment #6)
> Flavio, can you confirm, that for any incoming broadcast-type packet, like
> an ARP request or DHCP request, openvswitch will forward this request to
> port LOCAL (unless flow rules explicitly prohibit)?  If so, and this causes
> a packet to be sent through kernel, would this not cause a delay in packet
> processing (compared to DPDK speeds) and thus cause packet drops?

The OVS bridge port (LOCAL) is a TAP device, and when the NORMAL action is used (i.e. for broadcasts) or a flow sends packets to that port, the packet is sent to the kernel datapath, causing the traces you have found.

We are working to fix OVS to not send packets to an interface that is DOWN.  However, OVS is broken with regard to network namespace support, so if we fix that, we break other environments that move internal ports into other namespaces, for instance.

The correct solution is to have minimal network namespace support in OVS and then fix the interface status.  To fix OVS we also need to fix the kernel, and that work is in progress.

However, please confirm that the traces you found are related to the traffic on the LOCAL ports.  I think that's correct, but maybe there is something else going on.
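
One way to check which OpenFlow rules can push packets toward that kernel TAP device is to look for flows whose actions flood (NORMAL) or output to LOCAL.  A minimal sketch; br-int is a placeholder bridge name:

#!/usr/bin/env python3
# Sketch: list flows on a bridge whose actions include NORMAL or LOCAL.
import subprocess

BRIDGE = "br-int"  # placeholder

out = subprocess.run(["ovs-ofctl", "dump-flows", BRIDGE],
                     capture_output=True, text=True, check=True).stdout

for line in out.splitlines():
    if "actions=" in line and ("NORMAL" in line or "LOCAL" in line):
        print(line.strip())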
Comment 8 Andrew Theurer 2017-06-16 14:11:47 EDT
I can confirm there are at least some packets on the local ports.

For example: 

17:33:25.659577 IP6 :: > ff02::1:ff45:b797: ICMP6, neighbor solicitation, who has compute-0.localdomain, length 24

17:33:25.871589 IP6 :: > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28

17:33:26.661577 IP6 compute-0.localdomain > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28

17:33:27.491563 IP6 compute-0.localdomain > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28


(the ones below are from the traffic generator, and these no longer go to the LOCAL port once the destinations are learned.)


17:33:46.214604 IP 192.168.0.100.search-agent > 192.168.1.100.search-agent: UDP, length 18

17:33:46.217909 IP 192.168.0.101.search-agent > 192.168.1.100.search-agent: UDP, length 18

17:33:46.217928 IP 192.168.0.102.search-agent > 192.168.1.100.search-agent: UDP, length 18


I have also seen some ARP packets coming in from the router in our VMs.
Comment 9 Flavio Leitner 2017-07-26 11:42:10 EDT
(In reply to Andrew Theurer from comment #8)
> I can confirm there are at least some packets on the local ports.

Do you think we can correlate the write() calls to those packets?
I wonder if that is the root cause or if we have something else going on.

fbl
Comment 10 Andrew Theurer 2017-07-27 10:13:32 EDT
The easiest way to correlate this would be to inject those types of packets on an otherwise idle network and see if we get writes.  Using something like scapy to send IPv6 neighbor discovery packets should work.
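
A minimal sketch of that injection with scapy.  The interface name and solicited address are placeholders; write() activity on the PMD cores can be watched in parallel with strace or the ftrace setup from the description:

#!/usr/bin/env python3
# Sketch: emit IPv6 neighbor solicitations on an otherwise idle network.
from scapy.all import Ether, IPv6, ICMPv6ND_NS, sendp

IFACE = "p2p1"       # NIC facing the OVS DPDK port (placeholder)
TARGET = "fe80::1"   # address being solicited (placeholder)

# Neighbor solicitations go to the solicited-node multicast group, so the
# NORMAL action should flood them, including to the LOCAL port.
pkt = (Ether(dst="33:33:ff:00:00:01") /
       IPv6(dst="ff02::1:ff00:1") /
       ICMPv6ND_NS(tgt=TARGET))

sendp(pkt, iface=IFACE, count=10, inter=1)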
