Description of problem:
There appears to be packet loss when using IPv6 with OpenShift 4.9.

Running iPerf on an IPv6 cluster with OVN-Kubernetes and OpenShift 4.9 using the following flags results in packet loss:

```
iperf3 -V -f m -l 265 -b 200m -c {SVC_IP} -u -i 1 -t 600
```

Version-Release number of selected component (if applicable):
4.9.12

How reproducible:
Every time

Steps to Reproduce:
1. Create 2 Pods with iPerf installed in an IPv6 cluster, with a Service IP defined for one of the Pods
2. Start the iPerf server to accept connections
3. Start the iPerf client to connect to the Service IP using the command below:
```
iperf3 -V -f m -l 265 -b 200m -c {SVC_IP} -u -i 1 -t 600
```

Packet loss appears both when the Pods are located on the same Node and on different Nodes. The issue has only been observed in IPv6 deployments.

Actual results:
Running a systemtap script during the tests, we can see that many packets are dropped (kfree_skb()) during the Netlink OVS upcall.

~~~
Sun Apr 24 11:43:54 2022
184 packets dropped at ovs_vport_cmd_get
85 packets dropped at netlink_attachskb
85 packets dropped at ovs_dp_process_packet
.
.
Sun Apr 24 11:43:59 2022
12 packets dropped at queue_userspace_packet
.
.
Sun Apr 24 11:48:44 2022
3207 packets dropped at netlink_attachskb
3207 packets dropped at ovs_dp_process_packet <----- OVS upcall here
61 packets dropped at __udp4_lib_rcv
53 packets dropped at ipv6_rcv
51 packets dropped at ovs_vport_cmd_get
~~~

Expected results:
Packets arrive at their destination Pod.

Additional info:
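For reference, per-function drop counts like the ones above (the "N packets dropped at <func>" lines) can also be collected without the systemtap script, which is not attached here. A minimal sketch, assuming dropwatch and perf are installed on the node:

```
# Option 1: dropwatch resolves kfree_skb() call sites to kernel symbols and
# prints lines such as "N drops at ovs_dp_process_packet" (type "start" at its prompt).
dropwatch -l kas

# Option 2: sample the skb:kfree_skb tracepoint with perf and inspect the call sites.
perf record -a -g -e skb:kfree_skb -- sleep 10
perf report --stdio --no-children | head -n 60
```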
(In reply to Michael Washer from comment #0)
> Description of problem:
> There appears to be packet loss when using IPv6 with OpenShift 4.9

Hi Michael W,

Thanks for the bug and the info. Is the packet loss only seen with IPv6, or have we also seen any evidence of loss with IPv4? Can you please also ask JMaxx when he comes back from PTO.

Thanks
Iperf over IPv6 on OVN Dual Stack - No Packet Loss

- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-600.00 sec  29.1 GBytes   417 Mbits/sec  0.000 ms  0/122223633 (0%)  sender
[  5]   0.00-600.04 sec  29.1 GBytes   417 Mbits/sec  0.002 ms  0/122223633 (0%)  receiver
CPU Utilization: local/sender 46.9% (3.1%u/43.8%s), remote/receiver 69.4% (6.0%u/63.3%s)
(In reply to Shane Ronan from comment #4)
> Iperf over IPv6 on OVN Dual Stack - No Packet Loss

You mean iperf used IPv6 while the overlay was using IPv4? Or was IPv4 just around, and everything for the test was using IPv6?
OVN was configured as dual stack, with Pods having both IPv4 and IPv6 addresses. The iperf test was run over IPv6.
(In reply to Michael Washer from comment #0)
> Description of problem:
> There appears to be packet loss when using IPv6 with OpenShift 4.9
>
> Running iPerf on an IPv6 Cluster with OVN-Kubernetes and OpenShift 4.9 using
> the following flags, results in packet-loss.

Was any tuning applied to the hosts here? I was seeing drops in my test environment, but after applying the tuning available in https://bugzilla.redhat.com/show_bug.cgi?id=2085089#c3 (and other places), the drops are mostly gone. Now I have drops only at the start of the test, and those are in the network itself here: the sending node sent them, while the receiver never got them. After that I can leave the test running and there are no drops.
Tuning is applied to the node where the iperf3 pod is running, as shown below.

[core@worker05 ~]$ sudo sysctl -a | egrep 'net.core.rmem_max|net.core.wmem_max|net.core.rmem_default|net.core.wmem_default|net.core.optmem_max|net.core.netdev_max_backlog|net.ipv4.udp_rmem_min'
net.core.netdev_max_backlog = 100000
net.core.optmem_max = 40960
net.core.rmem_default = 33554432
net.core.rmem_max = 33554432
net.core.wmem_default = 33554432
net.core.wmem_max = 33554432
net.ipv4.udp_rmem_min = 8192
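As a side check, it can be worth confirming whether the receive-side UDP socket still overflows despite this tuning. A minimal sketch, run against the iperf3 server pod; the pod name is a placeholder and this assumes the pod image provides a shell:

```
# /proc/net/snmp6 is per network namespace, so read it inside the receiving pod.
# Udp6RcvbufErrors increments when the UDP socket receive buffer overflows.
oc exec <iperf3-server-pod> -- sh -c \
  'while true; do date; grep -E "Udp6InDatagrams|Udp6InErrors|Udp6RcvbufErrors" /proc/net/snmp6; sleep 5; done'
```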
Tested with the following OVN-only script:

systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:127.0.0.1:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=127.0.0.1
systemctl restart ovn-controller

ovn-nbctl ls-add ls1
ovn-nbctl lsp-add ls1 ls1p1
ovn-nbctl lsp-set-addresses ls1p1 "00:00:00:00:01:01 192.168.1.1 2001::1"
ovn-nbctl lsp-add ls1 ls1p2
ovn-nbctl lsp-set-addresses ls1p2 "00:00:00:00:01:02 192.168.1.2 2001::2"
ovn-nbctl lr-add lr1
ovn-nbctl lrp-add lr1 lr1-ls1 00:00:00:0f:01:01 192.168.1.254/24 2001::a/64
ovn-nbctl lsp-add ls1 ls1-lr1 \
    -- lsp-set-type ls1-lr1 router \
    -- lsp-set-addresses ls1-lr1 router \
    -- lsp-set-options ls1-lr1 router-port=lr1-ls1

ovn-nbctl lb-add lb0 1111::100 2001::2
ovn-nbctl ls-lb-add ls1 lb0

ovs-vsctl add-port br-int ls1p1 -- set interface ls1p1 type=internal external_ids:iface-id=ls1p1
ip netns add ls1p1
ip link set ls1p1 netns ls1p1
ip netns exec ls1p1 ip link set ls1p1 address 00:00:00:00:01:01
ip netns exec ls1p1 ip link set ls1p1 up
ip netns exec ls1p1 ip addr add 192.168.1.1/24 dev ls1p1
ip netns exec ls1p1 ip route add default via 192.168.1.254
ip netns exec ls1p1 ip addr add 2001::1/64 dev ls1p1
ip netns exec ls1p1 ip -6 route add default via 2001::a

ovs-vsctl add-port br-int ls1p2 -- set interface ls1p2 type=internal external_ids:iface-id=ls1p2
ip netns add ls1p2
ip link set ls1p2 netns ls1p2
ip netns exec ls1p2 ip link set ls1p2 address 00:00:00:00:01:02
ip netns exec ls1p2 ip link set ls1p2 up
ip netns exec ls1p2 ip addr add 192.168.1.2/24 dev ls1p2
ip netns exec ls1p2 ip addr add 2001::2/64 dev ls1p2
ip netns exec ls1p2 ip route add default via 192.168.1.254
ip netns exec ls1p2 ip -6 route add default via 2001::a

ip netns exec ls1p2 iperf3 -s -p 5201 -D
sleep 2
ip netns exec ls1p1 iperf3 -V -f m -l 256 -b 200m -u -i 1 -t 600 -c 1111::100

sysctl -w net.core.netdev_max_backlog=100000
sysctl -w net.core.optmem_max=81920
sysctl -w net.core.rmem_default=33554432
sysctl -w net.core.rmem_max=33554432
sysctl -w net.core.wmem_default=33554432
sysctl -w net.core.wmem_max=33554432
sysctl -w net.ipv4.udp_rmem_min=8192

ip netns exec ls1p1 iperf3 -V -f m -l 256 -b 200m -u -i 1 -t 600 -c 1111::100

Test result with ovn21.09-21.09.0-20 + openvswitch2.16-2.16.0-15 (OCP 4.9.12):

+ ip netns exec ls1p1 iperf3 -V -f m -l 256 -b 200m -u -i 1 -t 600 -c 1111::100
iperf 3.5
Linux wsfd-advnetlab20.anl.lab.eng.bos.redhat.com 4.18.0-372.11.1.el8_6.x86_64 #1 SMP Wed May 25 11:35:25 EDT 2022 x86_64
Control connection MSS 1428
Time: Tue, 14 Jun 2022 07:45:17 GMT
Connecting to host 1111::100, port 5201
      Cookie: g5u7ckowkmk2r2agmhamnn36lzg2q4p2yeyv
[  5] local 2001::1 port 48386 connected to 1111::100 port 5201
Starting Test: protocol: UDP, 1 streams, 256 byte blocks, omitting 0 seconds, 600 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[  5]   0.00-1.00   sec  23.8 MBytes   200 Mbits/sec  97592
......
[  5] 597.00-598.00 sec  23.8 MBytes   200 Mbits/sec  97641
[  5] 598.00-599.00 sec  23.8 MBytes   200 Mbits/sec  97667
[  5] 599.00-600.00 sec  23.9 MBytes   200 Mbits/sec  97704
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete.
Summary Results:
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.000 ms  0/58593742 (0%)  sender
[  5]   0.00-600.04 sec  14.0 GBytes   200 Mbits/sec  0.001 ms  2806/58593742 (0.0048%)  receiver   <== 0.0048% loss
CPU Utilization: local/sender 60.1% (7.3%u/52.9%s), remote/receiver 60.6% (12.1%u/48.5%s)

iperf Done.

+ sysctl -w net.core.netdev_max_backlog=100000
net.core.netdev_max_backlog = 100000
+ sysctl -w net.core.optmem_max=81920
net.core.optmem_max = 81920
+ sysctl -w net.core.rmem_default=33554432
net.core.rmem_default = 33554432
+ sysctl -w net.core.rmem_max=33554432
net.core.rmem_max = 33554432
+ sysctl -w net.core.wmem_default=33554432
net.core.wmem_default = 33554432
+ sysctl -w net.core.wmem_max=33554432
net.core.wmem_max = 33554432
+ sysctl -w net.ipv4.udp_rmem_min=8192
net.ipv4.udp_rmem_min = 8192
+ ip netns exec ls1p1 iperf3 -V -f m -l 256 -b 200m -u -i 1 -t 600 -c 1111::100
iperf 3.5
Linux wsfd-advnetlab20.anl.lab.eng.bos.redhat.com 4.18.0-372.11.1.el8_6.x86_64 #1 SMP Wed May 25 11:35:25 EDT 2022 x86_64
Control connection MSS 1428
Time: Tue, 14 Jun 2022 07:55:17 GMT
Connecting to host 1111::100, port 5201
      Cookie: ukpti2ebu2p26lty7rpitji5ytzvd635aspm
[  5] local 2001::1 port 47661 connected to 1111::100 port 5201
Starting Test: protocol: UDP, 1 streams, 256 byte blocks, omitting 0 seconds, 600 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[  5]   0.00-1.00   sec  23.8 MBytes   200 Mbits/sec  97556
[  5]   1.00-2.00   sec  23.9 MBytes   200 Mbits/sec  97715
[  5]   2.00-3.00   sec  23.8 MBytes   200 Mbits/sec  97657
......
[  5] 597.00-598.00 sec  23.9 MBytes   200 Mbits/sec  97731
[  5] 598.00-599.00 sec  23.8 MBytes   200 Mbits/sec  97641
[  5] 599.00-600.00 sec  23.8 MBytes   200 Mbits/sec  97663
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.000 ms  0/58593666 (0%)  sender
[  5]   0.00-600.04 sec  14.0 GBytes   200 Mbits/sec  0.001 ms  0/58593666 (0%)  receiver   <=== 0% loss
CPU Utilization: local/sender 60.3% (8.1%u/52.3%s), remote/receiver 60.9% (12.0%u/48.9%s)

iperf Done.

Note that the OVN-only setup is not exactly the same as OCP.
(In reply to Jianlin Shi from comment #26)
> tested with following ovn only script:

Interesting. AFAICT this is all within the same host, which is easier to reproduce as it doesn't depend much on the HW.

...
> ip netns exec ls1p1 iperf3 -V -f m -l 256 -b 200m -u -i 1 -t 600 -c 1111::100
...
> + ip netns exec ls1p1 iperf3 -V -f m -l 256 -b 200m -u -i 1 -t 600 -c 1111::100

Is it just me, or did one of the two iterations have drops while the other did not?
Can you please share the full output of the server side? It should help show which points during the test had drops.
(In reply to Marcelo Ricardo Leitner from comment #28)
> Is it me or in the 2 iterations, one had drops and the other had not?

Packets dropped when the sysctl settings were not tuned, and there was 0 loss after tuning.

> Can you please share the full output of the server side? It should help show
> which points during the test had drops.
(In reply to Marcelo Ricardo Leitner from comment #28)
> Can you please share the full output of the server side? It should help show
> which points during the test had drops.

The log for the iperf3 server is attached.
(In reply to Jianlin Shi from comment #29)
> packet drop when sysctl setting is not tuned, and 0 loss after tuned.

Ahh, I see. I had missed that the first part is the script itself, and then its execution.

> > Can you please share the full output of the server side? It should help show
> > which points during the test had drops.

The log and results match ours, then. Thanks Jianlin.
Hi,

We need to try to reproduce this in-house. I tried without OCP and so far I cannot see the issue. Can someone try to reproduce it on a Red Hat OCP cluster?

Regarding https://bugzilla.redhat.com/show_bug.cgi?id=2087604#c41:

  63319 packets dropped at netlink_attachskb
  63319 packets dropped at ovs_dp_process_packet <----

That is the known bug of reporting drops twice, and we know it is because the netlink socket is overflowing. ovs 2.16 already has the buffer size increased to 4M. Basically, the OVS userspace is not getting enough CPU time.

This sequence is a bit interesting:

  0xffffffff90180510 : kfree_skb+0x0/0xa0 [kernel]
  0xffffffffc0d0d09d : ovs_dp_process_packet+0x1bd/0x1d0 [openvswitch]
  0xffffffffc0d09073 : clone_execute+0x1c3/0x1f0 [openvswitch]

The packet had to recirculate, so it is entering the datapath again, but it got dropped. Shouldn't we see netlink_attachskb() failing too if that is a netlink buffer issue? Do we see a message like the one below in the log? It would help to explain the issue.

  "ovs: action execution error on datapath"

Regarding comment #56: I was expecting some improvements, at least.

So, in the case of process_backlog(), OVS doesn't control when it is going to be called; the kernel does that. After the packet is queued into the CPU backlog [enqueue_to_backlog()], the kernel schedules NAPI for that CPU. Then, when the CPU is preempted, ksoftirqd checks the NAPI list and polls the packets, calling process_backlog(). When process_backlog() runs, it dequeues packets from the queue and pushes them through the networking stack [__netif_receive_skb()]. Then __netif_receive_skb_core() finds that the packet has a dev with rx_handler set to netdev_frame_hook(), which is when the OVS module starts to do something for the first time. netdev_frame_hook() does some sanity checks and calls ovs_vport_receive(), which prepares the skb to finally enter the OVS datapath at ovs_dp_process_packet().

In summary:

<pod networking TX path>
 +--- tx packet
   +--- enqueue_to_backlog()
<preemption happens>
ksoftirqd runs
 +--- __do_softirq
   +--- net_rx_action()
     +--- loop over NAPI
       +--- poll [process_backlog()]
         +--- __netif_receive_skb()            <-- Net stack begins
           +--- __netif_receive_skb_core()
             +--- netdev_frame_hook()          <-- OVS begins
               +--- ovs_vport_receive()        <-- OVS port receives skb
                 +--- ovs_dp_process_packet()  <-- DP processing begins

We need to first identify whether the problem is in user space or kernel space. That is not clear to me yet. For example, if the flows are being updated and user space is struggling to update the datapath, then packets might be dropped and we need to debug user space. However, if the flows are okay and there are packet drops, then it is likely a kernel issue and we need to focus on the kernel.

With that said, because it is iperf3 I don't think the flows are expiring, but they might get expunged when new containers are coming up. In that case, we will see upcalls, and then we should monitor the miss and lost counters:

  # ovs-appctl dpctl/show | head -n 3
  system@ovs-system:
    lookups: hit:113211077 missed:13103 lost:0

The missed counter is the number of packets going through the upcall (kernel cache miss) and the lost counter is the number of packets dropped in the process. Please try to correlate those counters with the iperf3 issue.

In parallel, we can take flow dumps every second with timestamps, so that we can identify the moment before and after the iperf3 issue (a sketch of such a sampling loop follows below):

  ovs-appctl dpctl/dump-flows

The idea is to check whether the flows remain in the kernel or not; if they are present, the issue is very likely to be in the kernel itself.
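A minimal sketch of that sampling loop; the interval, duration, and output paths are arbitrary choices, not values prescribed here:

```
#!/bin/bash
# Sample the datapath hit/missed/lost counters and the kernel flow table
# once per second, with timestamps, while the iperf3 test runs.
outdir=/tmp/ovs-samples            # arbitrary output location
mkdir -p "$outdir"
while true; do
    ts=$(date -u +%Y%m%dT%H%M%S)
    {
        echo "=== $ts ==="
        ovs-appctl dpctl/show | head -n 3
    } >> "$outdir/dpctl-show.log"
    ovs-appctl dpctl/dump-flows > "$outdir/flows-$ts.txt"
    sleep 1
done
```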
I suspect the issue is in the kernel and that something is preventing ksoftirqd from running to drain the CPU's backlog, which is why the veth devices are dropping packets. In that case, we need to monitor the stats in /proc/net/softnet_stat while reproducing the issue to confirm it (a sketch of one way to watch them follows at the end of this comment).

If that is confirmed, the next step is to find out what is keeping the CPU busy. Maybe we can pin the iperf3 threads to specific CPUs, which means we will know the CPUs' backlogs, and then use 'perf top' on those CPUs to see what is running on them. We could also use 'mpstat -P ALL 1' to monitor CPU stats.

Do you think you can reproduce the issue creating/destroying fewer containers? I think you're doing 30; maybe we can try with 5 or 10 and see how it goes. If the drop window reduces or stops, we know it's load related too.

Thanks,
fbl
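A minimal sketch of watching the per-CPU backlog counters in /proc/net/softnet_stat (values in that file are hex, one row per CPU; the second column is the drop count, the third is time_squeeze). This assumes GNU awk on the node:

```
# Print per-CPU backlog drops and time_squeeze once per second.
while true; do
    date -u
    awk '{ printf "cpu%-3d dropped=%d time_squeeze=%d\n",
           NR-1, strtonum("0x" $2), strtonum("0x" $3) }' /proc/net/softnet_stat
    sleep 1
done
```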
Hi,

>> > The idea is to check if the flows remain in the kernel or
>> > not. If they are present, the issue is very likely to be
>> > in the kernel itself.
>> Okay. Sure it could be in the kernel or in the OVS module. I think that is likely due to [1].
> Should read:
> Okay. Sure it could be in the kernel or in the OVS module. As there are no upcalls for the
> iperf3 tuple during the test as per [1], iperf3 upcalls during the test should not be causing this.

OK, then we know OVS user space is not the issue, at least not directly. Let's focus on the kernel side.

> I see a lot of nf_conntrack calls in the 5 second perf sample during the packet drop.
> Around 968 calls from process_backlog().
> But it didn't help. Is there a way to notrack a port in OVS conntrack?

The kernel mode is run-to-completion, so each packet will be fully processed by the stack. That means the conntrack cost is included. Although the OVS module uses the kernel's conntrack, you can't use iptables here, as it is separate; the only way would be to modify the OVS flow table to not use ct() calls. So conntrack is expected to show up high in the profiling, but it should not be preventing the CPU from continuing to process packets. I mean, we would see drops spread over time and not really a blackout. A blackout means that the CPU really got stuck, not processing the backlog queue for seconds.

> Wed Aug 3 02:05:09 UTC 2022
>  system@ovs-system:
>    lookups: hit:3448792004 missed:11390247 lost:133
>    flows: 2075
> Wed Aug 3 02:05:19 UTC 2022   <---- interestingly it took 10 seconds instead of 5 seconds for this
>                                     sample, which may have stalled due to the same reason as what is
>                                     causing the delay in process_backlog().
>  system@ovs-system:
>    lookups: hit:3450246898 missed:11391284 lost:133

Yes, that's why I suspect there is a generic kernel issue. This is concerning me at the moment:

1180594 ovs-vswitchd 362576 [011] 88807.339531:     838148 cycles:
1180595     ffffffff8253ab17 mutex_spin_on_owner+0x27 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
1180596     ffffffff82d4c55d __mutex_lock.isra.6+0x33d (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
1180597     ffffffff82ba9603 rtnetlink_rcv_msg+0x1c3 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
1180598     ffffffff82c0832c netlink_rcv_skb+0x4c (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
1180599     ffffffff82c07b4e netlink_unicast+0x19e (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
1180600     ffffffff82c07e14 netlink_sendmsg+0x204 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
1180601     ffffffff82b741fc sock_sendmsg+0x4c (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
1180602     ffffffff82b7451b ____sys_sendmsg+0x1eb (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
1180603     ffffffff82b75b9c ___sys_sendmsg+0x7c (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
1180604     ffffffff82b75c67 __sys_sendmsg+0x57 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
1180605     ffffffff8240420b do_syscall_64+0x5b (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
1180606     ffffffff82e000ad entry_SYSCALL_64_after_hwframe+0x65 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)

There are at least two spin_locks involved in getting that mutex: one is the optimistic spin and the other is to add the waiter.
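One possible way to quantify how long callers spend inside rtnetlink_rcv_msg() while waiting on that mutex, assuming bpftrace is installed on the node and the symbol is probe-able on this kernel (this was not run in this bug, it is only a sketch):

```
# Histogram of time spent inside rtnetlink_rcv_msg(), in microseconds.
# A long tail here would point at rtnl_mutex contention.
bpftrace -e '
kprobe:rtnetlink_rcv_msg { @start[tid] = nsecs; }
kretprobe:rtnetlink_rcv_msg /@start[tid]/ {
    @rtnl_usecs = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'
```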
Maybe there is a bottleneck there, because in this other case we even have the watchdog firing:
https://bugzilla.redhat.com/show_bug.cgi?id=2107020#c25

> Jul 25 09:03:50 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jul 25 09:03:50 kernel: ovs-vswitchd    D    0 2534542      1 0x00000080
> Jul 25 09:03:50 kernel: Call Trace:
> Jul 25 09:03:50 kernel:  __schedule+0x2bd/0x760
> Jul 25 09:03:50 kernel:  schedule+0x38/0xa0
> Jul 25 09:03:50 kernel:  schedule_preempt_disabled+0xa/0x10
> Jul 25 09:03:50 kernel:  __mutex_lock.isra.6+0x2d0/0x4a0
> Jul 25 09:03:50 kernel:  ? do_sys_poll+0x313/0x520
> Jul 25 09:03:50 kernel:  rtnetlink_rcv_msg+0x1c3/0x2f0

My suggestion is to run 'perf sched record' on all CPUs, then pin the iperf3 client and server on separate CPUs, single threads. Once we see the issue, capture the sysrq+t (stack traces) output every 2 seconds a few times. When we pin the iperf3 doing TX, we know the CPU backlog, and then we can check in perf sched whether that CPU was preempting (or not), what tasks were running, etc. The stack traces (sysrq+t) will tell at which points the CPU most probably was stuck. If we see those there, we might be able to see in the other CPUs' stack traces which one was holding the lock and why it wasn't moving. (A rough sketch of this collection follows at the end of this comment.)

If we see no scheduling events during the problem and the stack trace looks like the one above, we know there is no issue with the datapath itself or with process_backlog(); it is the netlink socket holding the CPUs.

Another option is to lower the kernel watchdog to a couple of seconds and see what we get from that. We should be able to see the stack holding the CPU, but not what the other CPUs were doing.

I am still checking whether the dp flows provided are okay.

fbl
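A rough sketch of that collection; the CPU numbers and durations are arbitrary placeholders, SVC_IP is a placeholder, and the commands should be run wherever the iperf3 processes are actually launched (host network namespaces in the OVN-only reproducer, or inside the pods):

```
# Pin the iperf3 server and client to known CPUs (example CPUs only).
taskset -c 8 iperf3 -s -p 5201 -D
taskset -c 9 iperf3 -V -f m -l 265 -b 200m -u -i 1 -t 600 -c "${SVC_IP}" &

# Record scheduler events on all CPUs while the test runs (writes perf.data).
perf sched record -a -- sleep 120 &

# Dump task stack traces (sysrq+t) to dmesg every 2 seconds for a while
# (requires kernel.sysrq to allow the 't' trigger).
for i in $(seq 1 10); do
    echo t > /proc/sysrq-trigger
    sleep 2
done
```

The scheduler trace can later be inspected with, for example, 'perf sched timehist' to see whether the pinned CPU was preempted during the drop window.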
Hi Jon,

I think test02 reproduced another issue with similar end results.

Reproducer start at: 02:03:15
+117s = 02:05:12
[  5] 116.00-117.00 sec  23.8 MBytes   200 Mbits/sec  0.002 ms  0/94362 (0%)
[  5] 117.00-118.00 sec  8.78 MBytes  73.6 Mbits/sec  0.003 ms  59581/94315 (63%)
[  5] 118.00-119.00 sec  23.8 MBytes   199 Mbits/sec  0.004 ms  295/94332 (0.31%)

That is not the full traffic hang we see in the other samples.

Using grep, iperf3 seems to be running as below:
Server: iperf3 2206146 [060]
Client: iperf3 3112652 [022]

ksoftirqd is active on these CPUs:

# grep 'ksoftirqd/' perf.script.02:05:05 | awk '{ print $3 }' | tr -d '][' | sort -n | uniq -c | sort -nr
    473 009
     59 041
     57 002
     54 045
     47 003
     44 049
     44 047
     41 048
     38 008
     36 007
     33 043
     33 006
     28 044
     28 005
     24 040
     18 004
     18 001
     16 042
     11 000
      7 046

# grep 'ksoftirqd/' perf.script.02:05:10 | awk '{ print $3 }' | tr -d '][' | sort -n | uniq -c | sort -nr
    470 009
     67 003
     41 008
     34 001
     30 044
     30 006
     29 007
     25 041
     22 004
     19 043
     19 005
     18 040
     17 000
     15 049
     12 002
     11 046
      5 045
      4 048

# grep 'ksoftirqd/' perf.script.02:05:16 | awk '{ print $3 }' | tr -d '][' | sort -n | uniq -c | sort -nr
    354 009
     25 002
     23 042
     22 005
     17 040
     16 047
     15 045
     13 046
     12 007
     11 049
     10 004
      9 000
      7 043
      6 048
      6 041
      6 001
      5 044
      5 003
      4 008
      3 006

So I agree with Jon that it looks like RPS is being used.

Focusing on CPU#9 and measuring the time between each call to process_backlog(), I found a window with 100x the usual interval:
https://docs.google.com/spreadsheets/d/17Csv4MKCY0WGDEdxMO00K3ClESQR8ZqnrbLBr_oNWP8/edit?usp=sharing

Column meanings:
  ts(rel)   timestamp since the beginning of the trace
  diff      time since the last call (usecs)
  ts        timestamp of the call trace
  prev ts   timestamp of the previous call trace
  ts (abs)  absolute timestamp (ts converted to usecs)

Relevant part:

  ts(rel)    diff      ts            prev ts       ts (abs)
  [...]
  644076     174       88806.71339   88806.71322   88806713394
  644236     160       88806.71355   88806.71339   88806713554
  772795     128559    88806.84211   88806.71355   88806842113
  1096032    323237    88807.16535   88806.84211   88807165350
  1146116    50084     88807.21543   88807.16535   88807215434
  1170815    24699     88807.24013   88807.21543   88807240133
  1271092    100277    88807.34041   88807.24013   88807340410
  1278077    6985      88807.3474    88807.34041   88807347395
  1278164    87        88807.34748   88807.3474    88807347482
  1278240    76        88807.34756   88807.34748   88807347558
  [...]

The interval between process_backlog() calls spikes to ~300 msecs, and it does align with the iperf3 reproduction time. I see many traces like the ones below in that time window:

runc 3285892 [009] 88806.846232:     224278 cycles:
            5580021a6a03 [unknown] (/usr/bin/runc)
            5580021a600c [unknown] (/usr/bin/runc)
            5580021a7599 [unknown] (/usr/bin/runc)
            5580021a3b8f [unknown] (/usr/bin/runc)
            5580021a3a99 [unknown] (/usr/bin/runc)
            5580021ccce5 [unknown] (/usr/bin/runc)
            7f58bf7fde70 [unknown] ([unknown])

Several of the one below in sequence, from 88806.855111 to 88806.898144, which is ~43 ms:
crio 214214 [009] 88806.855267:     512866 cycles:
            56383270ec6f [unknown] (/usr/bin/crio)
            56383270f869 [unknown] (/usr/bin/crio)
            563832711e69 [unknown] (/usr/bin/crio)
            563832719fd2 [unknown] (/usr/bin/crio)
            563832c78a69 [unknown] (/usr/bin/crio)
            563832c77d0b [unknown] (/usr/bin/crio)
            5638331d8465 [unknown] (/usr/bin/crio)
            5638331d9e9d [unknown] (/usr/bin/crio)
            563833b96263 [unknown] (/usr/bin/crio)
            563833b9867e [unknown] (/usr/bin/crio)
            563833c15656 [unknown] (/usr/bin/crio)

But they don't explain the issue, because I also see:

swapper 0 [009] 88806.898391:     746610 cycles:
            ffffffff82d4ea54 intel_idle+0x74 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b1ed07 cpuidle_enter_state+0x87 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b1f08c cpuidle_enter+0x2c (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82516464 do_idle+0x234 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff8251665f cpu_startup_entry+0x6f (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82455b79 start_secondary+0x199 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82400107 secondary_startup_64_no_verify+0xc2 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)

That means the CPU went idle! Also, there were several calls to __do_softirq(), which means that the network bottom half could have run. Maybe the traffic moved to another CPU? But it is weird that it came back later. It seems more likely that a neighboring entry expired, because there are many traces like these in the time window:

iperf3 3112652 [022] 88806.894374:     415451 cycles:
            ffffffff82b784bd sock_def_write_space+0x2d (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b7a43a sock_wfree+0x3a (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b80428 skb_release_head_state+0x68 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b804ce skb_release_all+0xe (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b80542 kfree_skb+0x32 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82ba5591 __neigh_event_send+0x1e1 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82ba586f neigh_resolve_output+0x17f (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82cb5d09 ip6_finish_output2+0x1b9 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82cb75ac ip6_output+0x6c (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82cb7c3e ip6_send_skb+0x1e (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82cd8d5d udp_v6_send_skb.isra.37+0x19d (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82cd9da9 udpv6_sendmsg+0xbd9 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b741fc sock_sendmsg+0x4c (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b74297 sock_write_iter+0x97 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)

iperf3 3112652 [022] 88806.910217:     430227 cycles:
            ffffffff82ba04b6 dst_release+0x16 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b804aa skb_release_head_state+0xea (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b804ce skb_release_all+0xe (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b80542 kfree_skb+0x32 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82ba5591 __neigh_event_send+0x1e1 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82ba586f neigh_resolve_output+0x17f (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82cb5d09 ip6_finish_output2+0x1b9 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82cb75ac ip6_output+0x6c (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82cb7c3e ip6_send_skb+0x1e (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)

iperf3 3112652 [022] 88806.925388:     414424 cycles:
            ffffffff82b784bd sock_def_write_space+0x2d (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b7a43a sock_wfree+0x3a (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b80428 skb_release_head_state+0x68 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b804ce skb_release_all+0xe (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b80542 kfree_skb+0x32 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82ba5591 __neigh_event_send+0x1e1 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82ba586f neigh_resolve_output+0x17f (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82cb5d09 ip6_finish_output2+0x1b9 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)

That would open a traffic black hole until the IPv6 ND is done. We should monitor the 'unresolved_discards' counter in /proc/net/stat/ndisc_cache, and perhaps see if increasing the unresolved queue length helps, though it is per interface and that will be a pain with multiple containers:

  /proc/sys/net/ipv6/neigh/<iface>/unres_qlen

So, the above explains why iperf3 continues to run and we see its output. It explains why process_backlog() was not called (no packets in the queue) during that time. It explains why the veth TX drops increased. It also explains Jon's previous comment:

  Dropwatch:
  Fri Jul 29 19:32:10 2022
  337033 packets dropped at __neigh_event_send   <------- drops as the neigh entry is unreachable

It doesn't explain why the CPU backlog queue overflows, but I am not sure if that happened during test02.

I think iperf3 starts sending traffic in one direction and Linux doesn't see replies, so eventually the neighboring entry expires and we notice the issue. We need to monitor the IPv6 ND cache while reproducing the issue as well.

Thanks,
fbl
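A minimal sketch of monitoring those ND cache counters while reproducing; the interval is an arbitrary choice, and /proc/net/stat/ndisc_cache holds hex values with one row per CPU plus a header row:

```
# Watch unresolved_discards in the IPv6 ND cache while the iperf3 test runs.
while true; do
    date -u
    # Find the unresolved_discards column from the header, then sum it across CPUs.
    awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "unresolved_discards") col = i; next }
         { sum += strtonum("0x" $col) }
         END { print "unresolved_discards:", sum }' /proc/net/stat/ndisc_cache
    sleep 2
done

# Current per-interface unresolved queue limits:
grep -H . /proc/sys/net/ipv6/neigh/*/unres_qlen /proc/sys/net/ipv6/neigh/*/unres_qlen_bytes
```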
Case closed. If any additional follow-up is needed, I'll update.