Bug 1460690 - write() from PMD thread introduces delays to packet processing, causing packet loss
Status: NEW
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: openvswitch
Version: 7.3
Hardware: x86_64   OS: Linux
Priority: high   Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Kevin Traynor
QA Contact: ovs-qe@redhat.com
Depends On:
Blocks:

Reported: 2017-06-12 08:16 EDT by Andrew Theurer
Modified: 2017-12-06 20:24 EST
CC: 13 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Andrew Theurer 2017-06-12 08:16:25 EDT
Description of problem:

Open vSwitch's use of write() in the PMD threads introduces delays in packet processing that can cause packet loss.


Version-Release number of selected component (if applicable):
openvswitch-2.6.1-10 (-4 and -20 also tested; same problem observed)

How reproducible: easily; just use Open vSwitch with DPDK


Steps to Reproduce:
1. Configure OVS (with DPDK) with at least 2 physical interfaces and 2 vhostuser interfaces.
2. Start a VM with the 2 vhostuser interfaces and run testpmd in the guest, forwarding between them.
3. Run a packet generator for 10 minutes at 3 Mpps on both physical interfaces, check that 3 Mpps is received back per interface, and note any dropped packets. (A configuration sketch follows.)
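A rough sketch of this kind of OVS/testpmd setup (core mask, socket memory, bridge and interface names are examples only, and only one physical/vhostuser pair is shown; with OVS 2.6 the DPDK port is bound by its dpdk<N> name):

  # host: enable DPDK in OVS and create a netdev bridge with one dpdk and one vhostuser port
  ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true \
      other_config:dpdk-socket-mem="1024,1024" other_config:pmd-cpu-mask=0x3c
  ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
  ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
  ovs-vsctl add-port br0 vhost-user0 -- set Interface vhost-user0 type=dpdkvhostuser

  # guest: forward between the two virtio/vhostuser interfaces with testpmd
  testpmd -l 0,1,2 -n 4 -- --forward-mode=io --auto-start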

Actual results:
The packet generator will show dropped packets in the hundreds to thousands range.

If you run ftrace (mode: function) on the CPUs that run PMD threads, you can see a significant number of vfs_write() calls in the kernel.  These are the result of write() calls from the PMD thread.  You can also monitor I/O from the PMD process with strace.

Some of these write() calls take up to 91 ms (91,000 usec) to complete, which is far too long for the PMD thread to stop processing packets when they are arriving at a 3 Mpps rate.
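To put those numbers in perspective (a rough calculation; the 2048-entry rx ring is just a typical example size):

  3,000,000 packets/sec x 0.091 sec = ~273,000 packets arriving per interface during one 91 ms stall

That is two orders of magnitude more than a 2048-descriptor rx ring can buffer; even a ~1 ms stall (~3,000 packets) is already enough to overflow such a ring, so drops are expected.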


Expected results:
Ideally, the PMD threads should not do any I/O.  I believe this I/O is to a pipe, but even that can cause problems.  A PMD thread should rarely make any system call in its processing loop.

Additional info:
Comment 2 Andrew Theurer 2017-06-12 08:46:32 EDT
Example of ftrace output for write() from PMD

149996 [028] 86247.522310: function:  vfs_write
149996 [028] 86247.522311: function:     rw_verify_area
149996 [028] 86247.522311: function:        security_file_permission
149996 [028] 86247.522311: function:           selinux_file_permission
149996 [028] 86247.522311: function:              avc_policy_seqno
149996 [028] 86247.522311: function:     do_sync_write
149996 [028] 86247.522311: function:        tun_chr_aio_write
149996 [028] 86247.522311: function:           tun_get_user
149996 [028] 86247.522312: function:              sock_alloc_send_pskb
149996 [028] 86247.522312: function:                 __alloc_skb
149996 [028] 86247.522312: function:                    kmem_cache_alloc_node
149996 [028] 86247.522312: function:                       _cond_resched
149996 [028] 86247.522312: function:                    __kmalloc_reserve.isra.31
149996 [028] 86247.522312: function:                       __kmalloc_node_track_caller
149996 [028] 86247.522312: function:                          kmalloc_slab
149996 [028] 86247.522313: function:                          _cond_resched
149996 [028] 86247.522313: function:                    ksize
149996 [028] 86247.522313: function:              skb_put
149996 [028] 86247.522313: function:              skb_copy_datagram_from_iovec
149996 [028] 86247.522313: function:                 memcpy_fromiovecend
149996 [028] 86247.522313: function:              eth_type_trans
149996 [028] 86247.522313: function:              __skb_flow_dissect
149996 [028] 86247.522314: function:                 __skb_flow_get_ports
149996 [028] 86247.522314: function:              __skb_get_hash
149996 [028] 86247.522314: function:                 __skb_flow_dissect
149996 [028] 86247.522314: function:                    __skb_flow_get_ports
149996 [028] 86247.522314: function:              netif_rx_ni
149996 [028] 86247.522314: function:                 netif_rx_internal
149996 [028] 86247.522315: function:                    ktime_get_real
149996 [028] 86247.522315: function:                       getnstimeofday64
149996 [028] 86247.522315: function:                          __getnstimeofday64
149996 [028] 86247.522315: function:                    enqueue_to_backlog
149996 [028] 86247.522315: function:                       _raw_spin_lock
149996 [028] 86247.522316: function:                       _raw_spin_unlock
149996 [028] 86247.522316: function:                       kfree_skb
149996 [028] 86247.614081: function:              __skb_get_hash
149996 [028] 86247.614081: function:                 __skb_flow_dissect
149996 [028] 86247.614081: function:                    __skb_flow_get_ports
149996 [028] 86247.614082: function:              netif_rx_ni
149996 [028] 86247.614082: function:                 netif_rx_internal
149996 [028] 86247.614082: function:                    ktime_get_real
149996 [028] 86247.614082: function:                       getnstimeofday64
149996 [028] 86247.614082: function:                          __getnstimeofday64
149996 [028] 86247.614082: function:                    enqueue_to_backlog
149996 [028] 86247.614082: function:                       _raw_spin_lock
149996 [028] 86247.614083: function:                       _raw_spin_unlock
149996 [028] 86247.614083: function:                       kfree_skb
149996 [028] 86247.614083: function:                          skb_release_all
149996 [028] 86247.614083: function:                             skb_release_head_state
149996 [028] 86247.614083: function:                                sock_wfree
149996 [028] 86247.614083: function:                                   tun_sock_write_space
149996 [028] 86247.614083: function:                             skb_release_data
149996 [028] 86247.614084: function:                                kfree
149996 [028] 86247.614084: function:                          kfree_skbmem
149996 [028] 86247.614084: function:                             kmem_cache_free
149996 [028] 86247.614084: function:     __fsnotify_parent
149996 [028] 86247.614084: function:     fsnotify
149996 [028] 86247.614084: function:        __srcu_read_lock
149996 [028] 86247.614085: function:        __srcu_read_unlock
Comment 3 Flavio Leitner 2017-06-12 09:34:48 EDT
That happens when the packet is being sent to kernel space.
Can you confirm?
Thanks!
Comment 4 Andrew Theurer 2017-06-12 11:37:39 EDT
Since all of the ports created are of type netdev, what activity would cause packets to be sent to kernel space?  There is a LOCAL port in each bridge.  If there is any TX or RX on those LOCAL ports, does that count as packets being sent to kernel space?  And if so, when would this have to happen?
Comment 5 Andrew Theurer 2017-06-12 14:59:39 EDT
During a 20-minute test, bridge br-int port LOCAL transmitted 14 packets (but received 0 new packets).  br-p2p1 also transmitted 24 packets on port LOCAL.  None of the other OVS bridges' LOCAL ports had any new activity during this time.  Would these transmitted packets cause packets to be sent to kernel space?
Comment 6 Andrew Theurer 2017-06-13 20:29:16 EDT
Flavio, can you confirm that for any incoming broadcast-type packet, like an ARP request or DHCP request, Open vSwitch will forward the packet to port LOCAL (unless flow rules explicitly prohibit it)?  If so, and this causes a packet to be sent through the kernel, would this not cause a delay in packet processing (compared to DPDK speeds) and thus cause packet drops?
Comment 7 Flavio Leitner 2017-06-14 08:42:51 EDT
(In reply to Andrew Theurer from comment #6)
> Flavio, can you confirm that for any incoming broadcast-type packet, like
> an ARP request or DHCP request, Open vSwitch will forward the packet to
> port LOCAL (unless flow rules explicitly prohibit it)?  If so, and this
> causes a packet to be sent through the kernel, would this not cause a delay
> in packet processing (compared to DPDK speeds) and thus cause packet drops?

The OVS bridge's LOCAL port is a TAP device, and when the NORMAL action is used (e.g. for broadcasts) or a flow sends packets to that port, the packet is sent to the kernel, causing the traces you have found.
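For reference, a couple of commands that should show whether traffic is actually reaching the LOCAL port (bridge name taken from this report; adjust as needed):

  # per-port OpenFlow counters for the LOCAL port of br-int
  ovs-ofctl dump-ports br-int LOCAL

  # the LOCAL port is backed by a tap device named after the bridge, so tcpdump on it
  # shows what is being written to the kernel
  tcpdump -n -e -i br-int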

We are working on fixing OVS to not send packets to an interface that is DOWN. However, OVS is broken with regard to network namespace support, so if we fix that, we break other environments that move internal ports into other namespaces, for instance.

The correct solution is to add minimal network namespace support to OVS and then fix the interface status handling. To fix OVS, we also need to fix the kernel, and that is work in progress.

However, please confirm that the traces you found are related to the traffic on the LOCAL ports.  I think that's correct, but maybe there is something else going on.
Comment 8 Andrew Theurer 2017-06-16 14:11:47 EDT
I can confirm there are at least some packets on the local ports.

For example: 

17:33:25.659577 IP6 :: > ff02::1:ff45:b797: ICMP6, neighbor solicitation, who has compute-0.localdomain, length 24

17:33:25.871589 IP6 :: > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28

17:33:26.661577 IP6 compute-0.localdomain > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28

17:33:27.491563 IP6 compute-0.localdomain > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28


(the ones below are from the traffic generator, and these no longer go to the LOCAL port once the destinations are learned.)


17:33:46.214604 IP 192.168.0.100.search-agent > 192.168.1.100.search-agent: UDP, length 18

17:33:46.217909 IP 192.168.0.101.search-agent > 192.168.1.100.search-agent: UDP, length 18

17:33:46.217928 IP 192.168.0.102.search-agent > 192.168.1.100.search-agent: UDP, length 18


I have also seen some ARP packets coming in from the router in our VMs.
Comment 9 Flavio Leitner 2017-07-26 11:42:10 EDT
(In reply to Andrew Theurer from comment #8)
> I can confirm there are at least some packets on the local ports.

Do you think we can correlate the write() calls to those packets?
I wonder if that is the root cause or if we have something else going on.

fbl
Comment 10 Andrew Theurer 2017-07-27 10:13:32 EDT
The easiest way to correlate this would be to inject those types of packets on an otherwise idle network and see whether we get writes.  Using something like scapy to send out IPv6 neighbor discovery should work, for example:
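A sketch of such an injection with scapy (the addresses and interface name are placeholders; run it on the traffic-generator side while watching strace/ftrace on the PMD cores):

  python - <<'EOF'
  from scapy.all import Ether, IPv6, ICMPv6ND_NS, sendp
  # placeholder solicited-node multicast destination and target address
  pkt = (Ether(dst="33:33:ff:00:00:01") /
         IPv6(src="::", dst="ff02::1:ff00:1") /
         ICMPv6ND_NS(tgt="fe80::1"))
  sendp(pkt, iface="p2p1")   # "p2p1" is a placeholder NIC name
  EOF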
Comment 11 Andrew Theurer 2017-09-22 11:02:50 EDT
I have run more recent tests where we do not have any transmitted packets to the LOCAL port, but we still see write() calls from the PMD thread.  Using strace, I see a couple of writes per minute, like the following:

futex(0x7fe17225f620, FUTEX_WAKE_PRIVATE, 1) = 0
write(40, "\0", 1)                      = 1
futex(0x7fe17225f620, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fe17225f620, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fe17225f620, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fe17225f620, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fe17225f620, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fe17225f620, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fe17225f620, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fe17225f620, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fe17225f620, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fe17225f620, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fe17225f620, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fe17225f620, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fe17225f620, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fe17225f620, FUTEX_WAKE_PRIVATE, 1) = 0
write(40, "\0", 1)                      = 1
futex(0x7fe17225f620, FUTEX_WAKE_PRIVATE, 1) = 1
write(40, "\0", 1)                      = 1
futex(0x7fe17225f620, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fe17225f620, FUTEX_WAKE_PRIVATE, 1) = 0
write(40, "\0", 1)                      = 1
futex(0x7fe17225f620, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fe17225f620, FUTEX_WAKE_PRIVATE, 1) = 0


The writes are always to file descriptor 40:
# ls -l /proc/1529/fd

...
lr-x------. 1 root qemu 64 Sep 22 13:38 39 -> pipe:[305]
lrwx------. 1 root qemu 64 Sep 22 13:38 4 -> socket:[47550]
l-wx------. 1 root qemu 64 Sep 22 13:38 40 -> pipe:[305]
lr-x------. 1 root qemu 64 Sep 22 13:38 41 -> pipe:[306]
l-wx------. 1 root qemu 64 Sep 22 13:38 42 -> pipe:[306]
...

Can we find out what pipe:[305] is for?
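One way to find the other end(s) of that pipe (a sketch; 1529 is the ovs-vswitchd PID from the listing above, and lsof column positions may differ slightly between versions):

  # every fd in the system that points at pipe inode 305 (both read and write ends)
  find /proc/*/fd /proc/*/task/*/fd -lname 'pipe:\[305\]' 2>/dev/null

  # or via lsof: FIFO entries of this process whose NODE column is 305
  lsof -p 1529 | awk '$5 == "FIFO" && $8 == 305'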


Based on the ftrace data, it looks like this write ends up waking another thread:

<...>-312128 6622558002061.63 sys_write
<...>-312128 6622558002062.03 fget_light
<...>-312128 6622558002063.17 vfs_write
<...>-312128 6622558002063.53    rw_verify_area
<...>-312128 6622558002063.93       security_file_permission
<...>-312128 6622558002064.43          selinux_file_permission
<...>-312128 6622558002065.17             file_has_perm
<...>-312128 6622558002065.58                inode_has_perm.isra.33.constprop.63
<...>-312128 6622558002065.97                   avc_has_perm_flags
<...>-312128 6622558002066.56    do_sync_write
<...>-312128 6622558002066.96       pipe_write
<...>-312128 6622558002067.33          mutex_lock
<...>-312128 6622558002067.66             _cond_resched
<...>-312128 6622558002068.22          iov_fault_in_pages_read
<...>-312128 6622558002068.68          pipe_iov_copy_from_user
<...>-312128 6622558002069.43          mutex_unlock
<...>-312128 6622558002069.9          __wake_up_sync_key
<...>-312128 6622558002070.25             _raw_spin_lock_irqsave
<...>-312128 6622558002070.71             __wake_up_common
<...>-312128 6622558002071.26                pollwake
<...>-312128 6622558002071.76                   default_wake_function
<...>-312128 6622558002072.14                      try_to_wake_up
<...>-312128 6622558002072.45                         _raw_spin_lock_irqsave
<...>-312128 6622558002073.4                         task_waking_fair
<...>-312128 6622558002074.06                         select_task_rq_fair
<...>-312128 6622558002074.69                         _raw_spin_lock
<...>-312128 6622558002075.13                         ttwu_do_activate.constprop.91
<...>-312128 6622558002075.52                            activate_task
<...>-312128 6622558002075.83                               enqueue_task
<...>-312128 6622558002076.35                                  update_rq_clock.part.79
<...>-312128 6622558002077.07                                  enqueue_task_fair
<...>-312128 6622558002077.47                                     enqueue_entity
<...>-312128 6622558002077.92                                        update_curr
<...>-312128 6622558002078.73                                        update_cfs_rq_blocked_load
<...>-312128 6622558002079.15                                        account_entity_enqueue
<...>-312128 6622558002079.67                                        update_cfs_shares
<...>-312128 6622558002080.07                                        place_entity
<...>-312128 6622558002080.54                                        __enqueue_entity
<...>-312128 6622558002081.12                                     hrtick_update
<...>-312128 6622558002081.55                            ttwu_do_wakeup
<...>-312128 6622558002081.87                               check_preempt_curr
<...>-312128 6622558002082.42                                  resched_curr
<...>-312128 6622558002083.06                                     native_smp_send_reschedule
<...>-312128 6622558002083.42                                        x2apic_send_IPI_mask
<...>-312128 6622558002083.78                                           __x2apic_send_IPI_mask
<...>-312128 6622558002085.93                         _raw_spin_unlock
<...>-312128 6622558002086.35                         _raw_spin_unlock_irqrestore
<...>-312128 6622558002086.86             _raw_spin_unlock_irqrestore
<...>-312128 6622558002087.37          kill_fasync
<...>-312128 6622558002087.78          __sb_start_write
<...>-312128 6622558002088.46          file_update_time
<...>-312128 6622558002088.81             current_fs_time
<...>-312128 6622558002089.13                current_kernel_time
<...>-312128 6622558002089.61          __sb_end_write


The futex calls are a lot more frequent, but usually take less time.  However, some take as long as 16 usec, for example:

<...>-312128 6622609846556.18 do_futex
<...>-312128 6622609846556.57    futex_wake
<...>-312128 6622609846556.94       get_futex_key
<...>-312128 6622609846557.33          get_futex_key_refs.isra.12
<...>-312128 6622609846557.71       hash_futex
<...>-312128 6622609846558.26       _raw_spin_lock
<...>-312128 6622609846558.93       wake_futex
<...>-312128 6622609846559.44          __unqueue_futex
<...>-312128 6622609846559.94          wake_up_state
<...>-312128 6622609846560.26             try_to_wake_up
<...>-312128 6622609846560.57                _raw_spin_lock_irqsave
<...>-312128 6622609846561.08                task_waking_fair
<...>-312128 6622609846561.66                select_task_rq_fair
<...>-312128 6622609846562.08                _raw_spin_lock
<...>-312128 6622609846562.52                ttwu_do_activate.constprop.91
<...>-312128 6622609846562.84                   activate_task
<...>-312128 6622609846563.15                      enqueue_task
<...>-312128 6622609846563.54                         update_rq_clock.part.79
<...>-312128 6622609846564.2                         enqueue_task_fair
<...>-312128 6622609846564.58                            enqueue_entity
<...>-312128 6622609846564.92                               update_curr
<...>-312128 6622609846565.48                               update_cfs_rq_blocked_load
<...>-312128 6622609846565.79                               account_entity_enqueue
<...>-312128 6622609846566.28                               update_cfs_shares
<...>-312128 6622609846566.6                               place_entity
<...>-312128 6622609846566.94                               __enqueue_entity
<...>-312128 6622609846567.47                            hrtick_update
<...>-312128 6622609846567.88                   ttwu_do_wakeup
<...>-312128 6622609846568.2                      check_preempt_curr
<...>-312128 6622609846568.61                         resched_curr
<...>-312128 6622609846570.03                            native_smp_send_reschedule
<...>-312128 6622609846570.44                               x2apic_send_IPI_mask
<...>-312128 6622609846570.88                                  __x2apic_send_IPI_mask
<...>-312128 6622609846571.93                _raw_spin_unlock
<...>-312128 6622609846572.28                _raw_spin_unlock_irqrestore
<...>-312128 6622609846572.68       _raw_spin_unlock
<...>-312128 6622609846573.04       drop_futex_key_refs.isra.13

A write followed very soon after by a futex call can easily interrupt the PMD thread for long enough, combined, that we may drop packets.
Comment 12 Aaron Conole 2017-12-06 10:54:18 EST
This looks like the latch / sequence writes.  These shouldn't be a regression, AFAICT - they only happen on threads where reconfigurations occur and on the revalidator threads, and they have been in the system for a *long* time.

see latch-unix.c

Perhaps it would be good to attach gdb to the system while these writes occur to see exactly which tasks are issuing the writes (using a conditional breakpoint on write() when the thread is a specific PMD thread). Perhaps there's a revalidator thread being scheduled on those cores?
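Two possible ways to catch the exact caller (a sketch; the PID and fd are taken from comment 11, the register name assumes x86_64, and strace -k needs a strace built with stack-unwinding support):

  # strace can print a user-space stack trace after each matching syscall
  strace -f -k -e trace=write -p 1529 2>&1 | grep -A 20 'write(40'

  # or attach gdb and break on write() only when the fd argument is 40
  gdb -p 1529
  (gdb) break write if $rdi == 40
  (gdb) commands
  > bt
  > continue
  > end
  (gdb) continue

If these are the latch writes, the backtrace should land somewhere in lib/latch-unix.c, which would be consistent with the single-byte write(40, "\0", 1) pattern seen in comment 11.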
