Description of problem:
There appears to be packet loss when using IPv6 with OpenShift 4.9.

Running iPerf on an IPv6 cluster with OVN-Kubernetes and OpenShift 4.9 using the following flags results in packet loss:

```
iperf3 -V -f m -l 265 -b 200m -c {SVC_IP} -u -i 1 -t 600
```

Version-Release number of selected component (if applicable):
4.9.12

How reproducible:
Every time

Steps to Reproduce:
1. Create 2 Pods with iPerf installed in an IPv6 cluster, with a Service IP defined for one of the Pods
2. Start the iPerf server to accept connections
3. Start the iPerf client to connect to the Service IP using the command below:
```
iperf3 -V -f m -l 265 -b 200m -c {SVC_IP} -u -i 1 -t 600
```

Packet loss appears both when the Pods are located on the same Node and on different Nodes. The issue has only been observed in IPv6 deployments.

Actual results:
Running a systemtap script during the tests, we can see that many packets are dropped (kfree_skb()) during the Netlink OVS upcall.

~~~
Sun Apr 24 11:43:54 2022
184 packets dropped at ovs_vport_cmd_get
85 packets dropped at netlink_attachskb
85 packets dropped at ovs_dp_process_packet
.
.
Sun Apr 24 11:43:59 2022
12 packets dropped at queue_userspace_packet
.
.
Sun Apr 24 11:48:44 2022
3207 packets dropped at netlink_attachskb
3207 packets dropped at ovs_dp_process_packet <----- OVS upcall here
61 packets dropped at __udp4_lib_rcv
53 packets dropped at ipv6_rcv
51 packets dropped at ovs_vport_cmd_get
~~~

Expected results:
Packets arrive at their destination Pod.

Additional info:
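For reference, per-function drop counts like the ones above (the "N packets dropped at <func>" lines) can also be collected without the systemtap script, which is not attached here. A minimal sketch, assuming dropwatch and perf are installed on the node:

```
# Option 1: dropwatch resolves kfree_skb() call sites to kernel symbols and
# prints lines such as "N drops at ovs_dp_process_packet" (type "start" at its prompt).
dropwatch -l kas

# Option 2: sample the skb:kfree_skb tracepoint with perf and inspect the call sites.
perf record -a -g -e skb:kfree_skb -- sleep 10
perf report --stdio --no-children | head -n 60
```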
(In reply to Michael Washer from comment #0)
> Description of problem:
> There appears to be packet loss when using IPv6 with OpenShift 4.9

Hi Michael W,

Thanks for the bug and the info. Is the packet loss only seen with IPv6, or have we also seen any evidence of loss with IPv4? Can you please also ask JMaxx when he comes back from PTO.

Thanks
Iperf over IPv6 on OVN Dual Stack - No Packet Loss

- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-600.00 sec  29.1 GBytes   417 Mbits/sec  0.000 ms  0/122223633 (0%)  sender
[  5]   0.00-600.04 sec  29.1 GBytes   417 Mbits/sec  0.002 ms  0/122223633 (0%)  receiver
CPU Utilization: local/sender 46.9% (3.1%u/43.8%s), remote/receiver 69.4% (6.0%u/63.3%s)
(In reply to Shane Ronan from comment #4)
> Iperf over IPv6 on OVN Dual Stack - No Packet Loss

You mean iperf used IPv6 while the overlay was using IPv4? Or was IPv4 just around, and everything for the test was using IPv6?
OVN was configured as dual stack, with Pods having both IPv4 and IPv6 addresses. The iperf test was run over IPv6.
(In reply to Michael Washer from comment #0)
> Description of problem:
> There appears to be packet loss when using IPv6 with OpenShift 4.9
>
> Running iPerf on an IPv6 Cluster with OVN-Kubernetes and OpenShift 4.9 using
> the following flags, results in packet-loss.

Was any tuning applied to the hosts here? I was seeing drops in my test environment, but after applying the tuning available in https://bugzilla.redhat.com/show_bug.cgi?id=2085089#c3 (and other places), the drops are mostly gone. Now I have drops only at the start of the test, and those are in the network itself here: the sending node sent them, while the receiver never got them. After that I can leave the test running and there are no drops.
Tuning is applied to the node where the iperf3 pod is running, as shown below.

[core@worker05 ~]$ sudo sysctl -a | egrep 'net.core.rmem_max|net.core.wmem_max|net.core.rmem_default|net.core.wmem_default|net.core.optmem_max|net.core.netdev_max_backlog|net.ipv4.udp_rmem_min'
net.core.netdev_max_backlog = 100000
net.core.optmem_max = 40960
net.core.rmem_default = 33554432
net.core.rmem_max = 33554432
net.core.wmem_default = 33554432
net.core.wmem_max = 33554432
net.ipv4.udp_rmem_min = 8192
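As a side check, it can be worth confirming whether the receive-side UDP socket still overflows despite this tuning. A minimal sketch, run against the iperf3 server pod; the pod name is a placeholder and this assumes the pod image provides a shell:

```
# /proc/net/snmp6 is per network namespace, so read it inside the receiving pod.
# Udp6RcvbufErrors increments when the UDP socket receive buffer overflows.
oc exec <iperf3-server-pod> -- sh -c \
  'while true; do date; grep -E "Udp6InDatagrams|Udp6InErrors|Udp6RcvbufErrors" /proc/net/snmp6; sleep 5; done'
```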
Tested with the following OVN-only script:

systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:127.0.0.1:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=127.0.0.1
systemctl restart ovn-controller

ovn-nbctl ls-add ls1
ovn-nbctl lsp-add ls1 ls1p1
ovn-nbctl lsp-set-addresses ls1p1 "00:00:00:00:01:01 192.168.1.1 2001::1"
ovn-nbctl lsp-add ls1 ls1p2
ovn-nbctl lsp-set-addresses ls1p2 "00:00:00:00:01:02 192.168.1.2 2001::2"
ovn-nbctl lr-add lr1
ovn-nbctl lrp-add lr1 lr1-ls1 00:00:00:0f:01:01 192.168.1.254/24 2001::a/64
ovn-nbctl lsp-add ls1 ls1-lr1 \
    -- lsp-set-type ls1-lr1 router \
    -- lsp-set-addresses ls1-lr1 router \
    -- lsp-set-options ls1-lr1 router-port=lr1-ls1

ovn-nbctl lb-add lb0 1111::100 2001::2
ovn-nbctl ls-lb-add ls1 lb0

ovs-vsctl add-port br-int ls1p1 -- set interface ls1p1 type=internal external_ids:iface-id=ls1p1
ip netns add ls1p1
ip link set ls1p1 netns ls1p1
ip netns exec ls1p1 ip link set ls1p1 address 00:00:00:00:01:01
ip netns exec ls1p1 ip link set ls1p1 up
ip netns exec ls1p1 ip addr add 192.168.1.1/24 dev ls1p1
ip netns exec ls1p1 ip route add default via 192.168.1.254
ip netns exec ls1p1 ip addr add 2001::1/64 dev ls1p1
ip netns exec ls1p1 ip -6 route add default via 2001::a

ovs-vsctl add-port br-int ls1p2 -- set interface ls1p2 type=internal external_ids:iface-id=ls1p2
ip netns add ls1p2
ip link set ls1p2 netns ls1p2
ip netns exec ls1p2 ip link set ls1p2 address 00:00:00:00:01:02
ip netns exec ls1p2 ip link set ls1p2 up
ip netns exec ls1p2 ip addr add 192.168.1.2/24 dev ls1p2
ip netns exec ls1p2 ip addr add 2001::2/64 dev ls1p2
ip netns exec ls1p2 ip route add default via 192.168.1.254
ip netns exec ls1p2 ip -6 route add default via 2001::a

ip netns exec ls1p2 iperf3 -s -p 5201 -D
sleep 2
ip netns exec ls1p1 iperf3 -V -f m -l 256 -b 200m -u -i 1 -t 600 -c 1111::100

sysctl -w net.core.netdev_max_backlog=100000
sysctl -w net.core.optmem_max=81920
sysctl -w net.core.rmem_default=33554432
sysctl -w net.core.rmem_max=33554432
sysctl -w net.core.wmem_default=33554432
sysctl -w net.core.wmem_max=33554432
sysctl -w net.ipv4.udp_rmem_min=8192

ip netns exec ls1p1 iperf3 -V -f m -l 256 -b 200m -u -i 1 -t 600 -c 1111::100

Test result with ovn21.09-21.09.0-20 + openvswitch2.16-2.16.0-15 (OCP 4.9.12):

+ ip netns exec ls1p1 iperf3 -V -f m -l 256 -b 200m -u -i 1 -t 600 -c 1111::100
iperf 3.5
Linux wsfd-advnetlab20.anl.lab.eng.bos.redhat.com 4.18.0-372.11.1.el8_6.x86_64 #1 SMP Wed May 25 11:35:25 EDT 2022 x86_64
Control connection MSS 1428
Time: Tue, 14 Jun 2022 07:45:17 GMT
Connecting to host 1111::100, port 5201
      Cookie: g5u7ckowkmk2r2agmhamnn36lzg2q4p2yeyv
[  5] local 2001::1 port 48386 connected to 1111::100 port 5201
Starting Test: protocol: UDP, 1 streams, 256 byte blocks, omitting 0 seconds, 600 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[  5]   0.00-1.00   sec  23.8 MBytes   200 Mbits/sec  97592
......
[  5] 597.00-598.00 sec  23.8 MBytes   200 Mbits/sec  97641
[  5] 598.00-599.00 sec  23.8 MBytes   200 Mbits/sec  97667
[  5] 599.00-600.00 sec  23.9 MBytes   200 Mbits/sec  97704
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete.
Summary Results:
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.000 ms  0/58593742 (0%)  sender
[  5]   0.00-600.04 sec  14.0 GBytes   200 Mbits/sec  0.001 ms  2806/58593742 (0.0048%)  receiver   <== 0.0048% loss
CPU Utilization: local/sender 60.1% (7.3%u/52.9%s), remote/receiver 60.6% (12.1%u/48.5%s)

iperf Done.

+ sysctl -w net.core.netdev_max_backlog=100000
net.core.netdev_max_backlog = 100000
+ sysctl -w net.core.optmem_max=81920
net.core.optmem_max = 81920
+ sysctl -w net.core.rmem_default=33554432
net.core.rmem_default = 33554432
+ sysctl -w net.core.rmem_max=33554432
net.core.rmem_max = 33554432
+ sysctl -w net.core.wmem_default=33554432
net.core.wmem_default = 33554432
+ sysctl -w net.core.wmem_max=33554432
net.core.wmem_max = 33554432
+ sysctl -w net.ipv4.udp_rmem_min=8192
net.ipv4.udp_rmem_min = 8192
+ ip netns exec ls1p1 iperf3 -V -f m -l 256 -b 200m -u -i 1 -t 600 -c 1111::100
iperf 3.5
Linux wsfd-advnetlab20.anl.lab.eng.bos.redhat.com 4.18.0-372.11.1.el8_6.x86_64 #1 SMP Wed May 25 11:35:25 EDT 2022 x86_64
Control connection MSS 1428
Time: Tue, 14 Jun 2022 07:55:17 GMT
Connecting to host 1111::100, port 5201
      Cookie: ukpti2ebu2p26lty7rpitji5ytzvd635aspm
[  5] local 2001::1 port 47661 connected to 1111::100 port 5201
Starting Test: protocol: UDP, 1 streams, 256 byte blocks, omitting 0 seconds, 600 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[  5]   0.00-1.00   sec  23.8 MBytes   200 Mbits/sec  97556
[  5]   1.00-2.00   sec  23.9 MBytes   200 Mbits/sec  97715
[  5]   2.00-3.00   sec  23.8 MBytes   200 Mbits/sec  97657
......
[  5] 597.00-598.00 sec  23.9 MBytes   200 Mbits/sec  97731
[  5] 598.00-599.00 sec  23.8 MBytes   200 Mbits/sec  97641
[  5] 599.00-600.00 sec  23.8 MBytes   200 Mbits/sec  97663
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.000 ms  0/58593666 (0%)  sender
[  5]   0.00-600.04 sec  14.0 GBytes   200 Mbits/sec  0.001 ms  0/58593666 (0%)  receiver   <=== 0% loss
CPU Utilization: local/sender 60.3% (8.1%u/52.3%s), remote/receiver 60.9% (12.0%u/48.9%s)

iperf Done.

Note that the OVN-only setup is not exactly the same as OCP.
(In reply to Jianlin Shi from comment #26)
> tested with following ovn only script:

Interesting. AFAICT this is all within the same host, which is easier to reproduce as it doesn't depend much on the HW.

...
> ip netns exec ls1p1 iperf3 -V -f m -l 256 -b 200m -u -i 1 -t 600 -c 1111::100
...
> + ip netns exec ls1p1 iperf3 -V -f m -l 256 -b 200m -u -i 1 -t 600 -c 1111::100

Is it just me, or did one of the two iterations have drops while the other did not?
Can you please share the full output of the server side? It should help show which points during the test had drops.
(In reply to Marcelo Ricardo Leitner from comment #28)
> Is it me or in the 2 iterations, one had drops and the other had not?

Packets dropped when the sysctl settings were not tuned, and there was 0 loss after tuning.

> Can you please share the full output of the server side? It should help show
> which points during the test had drops.
(In reply to Marcelo Ricardo Leitner from comment #28)
> Can you please share the full output of the server side? It should help show
> which points during the test had drops.

The log for the iperf3 server is attached.
(In reply to Jianlin Shi from comment #29)
> packet drop when sysctl setting is not tuned, and 0 loss after tuned.

Ahh, I see. I had missed that the first part is the script itself, and then its execution.

> > Can you please share the full output of the server side? It should help show
> > which points during the test had drops.

The log and results match ours, then. Thanks Jianlin.
Hi,

We need to try to reproduce this in-house. I tried without OCP and so far I cannot see the issue. Can someone try to reproduce it on a Red Hat OCP cluster?

Regarding https://bugzilla.redhat.com/show_bug.cgi?id=2087604#c41:

  63319 packets dropped at netlink_attachskb
  63319 packets dropped at ovs_dp_process_packet <----

That is the known bug of reporting drops twice, and we know it is because the netlink socket is overflowing. ovs 2.16 already has the buffer size increased to 4M. Basically, the OVS userspace is not getting enough CPU time.

This sequence is a bit interesting:

  0xffffffff90180510 : kfree_skb+0x0/0xa0 [kernel]
  0xffffffffc0d0d09d : ovs_dp_process_packet+0x1bd/0x1d0 [openvswitch]
  0xffffffffc0d09073 : clone_execute+0x1c3/0x1f0 [openvswitch]

The packet had to recirculate, so it is entering the datapath again, but it got dropped. Shouldn't we see netlink_attachskb() failing too if that is a netlink buffer issue? Do we see a message like the one below in the log? It would help to explain the issue.

  "ovs: action execution error on datapath"

Regarding comment #56: I was expecting some improvements, at least.

So, in the case of process_backlog(), OVS doesn't control when it is going to be called; the kernel does that. After the packet is queued into the CPU backlog [enqueue_to_backlog()], the kernel schedules NAPI for that CPU. Then, when the CPU is preempted, ksoftirqd checks the NAPI list and polls the packets, calling process_backlog(). When process_backlog() runs, it dequeues packets from the queue and pushes them through the networking stack [__netif_receive_skb()]. Then __netif_receive_skb_core() finds that the packet has a dev with rx_handler set to netdev_frame_hook(), which is when the OVS module starts to do something for the first time. netdev_frame_hook() does some sanity checks and calls ovs_vport_receive(), which prepares the skb to finally enter the OVS datapath at ovs_dp_process_packet().

In summary:

<pod networking TX path>
 +--- tx packet
   +--- enqueue_to_backlog()
<preemption happens>
ksoftirqd runs
 +--- __do_softirq
   +--- net_rx_action()
     +--- loop over NAPI
       +--- poll [process_backlog()]
         +--- __netif_receive_skb()            <-- Net stack begins
           +--- __netif_receive_skb_core()
             +--- netdev_frame_hook()          <-- OVS begins
               +--- ovs_vport_receive()        <-- OVS port receives skb
                 +--- ovs_dp_process_packet()  <-- DP processing begins

We need to first identify whether the problem is in user space or kernel space. That is not clear to me yet. For example, if the flows are being updated and user space is struggling to update the datapath, then packets might be dropped and we need to debug user space. However, if the flows are okay and there are packet drops, then it is likely a kernel issue and we need to focus on the kernel.

With that said, because it is iperf3 I don't think the flows are expiring, but they might get expunged when new containers are coming up. In that case, we will see upcalls, and then we should monitor the miss and lost counters:

  # ovs-appctl dpctl/show | head -n 3
  system@ovs-system:
    lookups: hit:113211077 missed:13103 lost:0

The missed counter is the number of packets going through the upcall (kernel cache miss) and the lost counter is the number of packets dropped in the process. Please try to correlate those counters with the iperf3 issue.

In parallel, we can take flow dumps every second with timestamps, so that we can identify the moment before and after the iperf3 issue (a sketch of such a sampling loop follows below):

  ovs-appctl dpctl/dump-flows

The idea is to check whether the flows remain in the kernel or not; if they are present, the issue is very likely to be in the kernel itself.
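A minimal sketch of that sampling loop; the interval, duration, and output paths are arbitrary choices, not values prescribed here:

```
#!/bin/bash
# Sample the datapath hit/missed/lost counters and the kernel flow table
# once per second, with timestamps, while the iperf3 test runs.
outdir=/tmp/ovs-samples            # arbitrary output location
mkdir -p "$outdir"
while true; do
    ts=$(date -u +%Y%m%dT%H%M%S)
    {
        echo "=== $ts ==="
        ovs-appctl dpctl/show | head -n 3
    } >> "$outdir/dpctl-show.log"
    ovs-appctl dpctl/dump-flows > "$outdir/flows-$ts.txt"
    sleep 1
done
```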
I suspect the issue is in the kernel and that something is preventing ksoftirqd from running to drain the CPU's backlog, which is why the veth devices are dropping packets. In that case, we need to monitor the stats in /proc/net/softnet_stat while reproducing the issue to confirm it (a sketch of one way to watch them follows at the end of this comment).

If that is confirmed, the next step is to find out what is keeping the CPU busy. Maybe we can pin the iperf3 threads to specific CPUs, which means we will know the CPUs' backlogs, and then use 'perf top' on those CPUs to see what is running on them. We could also use 'mpstat -P ALL 1' to monitor CPU stats.

Do you think you can reproduce the issue creating/destroying fewer containers? I think you're doing 30; maybe we can try with 5 or 10 and see how it goes. If the drop window reduces or stops, we know it's load related too.

Thanks,
fbl
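A minimal sketch of watching the per-CPU backlog counters in /proc/net/softnet_stat (values in that file are hex, one row per CPU; the second column is the drop count, the third is time_squeeze). This assumes GNU awk on the node:

```
# Print per-CPU backlog drops and time_squeeze once per second.
while true; do
    date -u
    awk '{ printf "cpu%-3d dropped=%d time_squeeze=%d\n",
           NR-1, strtonum("0x" $2), strtonum("0x" $3) }' /proc/net/softnet_stat
    sleep 1
done
```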
Hi,

>> > The idea is to check if the flows remain in the kernel or
>> > not. If they are present, the issue is very likely to be
>> > in the kernel itself.
>> Okay. Sure it could be in the kernel or in the OVS module. I think that is likely due to [1].
> Should read:
> Okay. Sure it could be in the kernel or in the OVS module. As there are no upcalls for the
> iperf3 tuple during the test as per [1], iperf3 upcalls during the test should not be causing this.

OK, then we know OVS user space is not the issue, at least not directly. Let's focus on the kernel side.

> I see a lot of nf_conntrack calls in the 5 second perf sample during the packet drop.
> Around 968 calls from process_backlog().
> But it didn't help. Is there a way to notrack a port in OVS conntrack?

The kernel mode is run-to-completion, so each packet will be fully processed by the stack. That means the conntrack cost is included. Although the OVS module uses the kernel's conntrack, you can't use iptables here, as it is separate; the only way would be to modify the OVS flow table to not use ct() calls. So conntrack is expected to show up high in the profiling, but it should not be preventing the CPU from continuing to process packets. I mean, we would see drops spread over time and not really a blackout. A blackout means that the CPU really got stuck, not processing the backlog queue for seconds.

> Wed Aug 3 02:05:09 UTC 2022
>  system@ovs-system:
>    lookups: hit:3448792004 missed:11390247 lost:133
>    flows: 2075
> Wed Aug 3 02:05:19 UTC 2022   <---- interestingly it took 10 seconds instead of 5 seconds for this
>                                     sample, which may have stalled due to the same reason as what is
>                                     causing the delay in process_backlog().
>  system@ovs-system:
>    lookups: hit:3450246898 missed:11391284 lost:133

Yes, that's why I suspect there is a generic kernel issue. This is concerning me at the moment:

1180594 ovs-vswitchd 362576 [011] 88807.339531:     838148 cycles:
1180595     ffffffff8253ab17 mutex_spin_on_owner+0x27 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
1180596     ffffffff82d4c55d __mutex_lock.isra.6+0x33d (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
1180597     ffffffff82ba9603 rtnetlink_rcv_msg+0x1c3 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
1180598     ffffffff82c0832c netlink_rcv_skb+0x4c (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
1180599     ffffffff82c07b4e netlink_unicast+0x19e (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
1180600     ffffffff82c07e14 netlink_sendmsg+0x204 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
1180601     ffffffff82b741fc sock_sendmsg+0x4c (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
1180602     ffffffff82b7451b ____sys_sendmsg+0x1eb (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
1180603     ffffffff82b75b9c ___sys_sendmsg+0x7c (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
1180604     ffffffff82b75c67 __sys_sendmsg+0x57 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
1180605     ffffffff8240420b do_syscall_64+0x5b (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
1180606     ffffffff82e000ad entry_SYSCALL_64_after_hwframe+0x65 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)

There are at least two spin_locks involved in getting that mutex: one is the optimistic spin and the other is to add the waiter.
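One possible way to quantify how long callers spend inside rtnetlink_rcv_msg() while waiting on that mutex, assuming bpftrace is installed on the node and the symbol is probe-able on this kernel (this was not run in this bug, it is only a sketch):

```
# Histogram of time spent inside rtnetlink_rcv_msg(), in microseconds.
# A long tail here would point at rtnl_mutex contention.
bpftrace -e '
kprobe:rtnetlink_rcv_msg { @start[tid] = nsecs; }
kretprobe:rtnetlink_rcv_msg /@start[tid]/ {
    @rtnl_usecs = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'
```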
Maybe there is a bottleneck there, because in this other case we even have the watchdog firing:
https://bugzilla.redhat.com/show_bug.cgi?id=2107020#c25

> Jul 25 09:03:50 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jul 25 09:03:50 kernel: ovs-vswitchd    D    0 2534542      1 0x00000080
> Jul 25 09:03:50 kernel: Call Trace:
> Jul 25 09:03:50 kernel:  __schedule+0x2bd/0x760
> Jul 25 09:03:50 kernel:  schedule+0x38/0xa0
> Jul 25 09:03:50 kernel:  schedule_preempt_disabled+0xa/0x10
> Jul 25 09:03:50 kernel:  __mutex_lock.isra.6+0x2d0/0x4a0
> Jul 25 09:03:50 kernel:  ? do_sys_poll+0x313/0x520
> Jul 25 09:03:50 kernel:  rtnetlink_rcv_msg+0x1c3/0x2f0

My suggestion is to run 'perf sched record' on all CPUs, then pin the iperf3 client and server on separate CPUs, single threads. Once we see the issue, capture the sysrq+t (stack traces) output every 2 seconds a few times. When we pin the iperf3 doing TX, we know the CPU backlog, and then we can check in perf sched whether that CPU was preempting (or not), what tasks were running, etc. The stack traces (sysrq+t) will tell at which points the CPU most probably was stuck. If we see those there, we might be able to see in the other CPUs' stack traces which one was holding the lock and why it wasn't moving. (A rough sketch of this collection follows at the end of this comment.)

If we see no scheduling events during the problem and the stack trace looks like the one above, we know there is no issue with the datapath itself or with process_backlog(); it is the netlink socket holding the CPUs.

Another option is to lower the kernel watchdog to a couple of seconds and see what we get from that. We should be able to see the stack holding the CPU, but not what the other CPUs were doing.

I am still checking whether the dp flows provided are okay.

fbl
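A rough sketch of that collection; the CPU numbers and durations are arbitrary placeholders, SVC_IP is a placeholder, and the commands should be run wherever the iperf3 processes are actually launched (host network namespaces in the OVN-only reproducer, or inside the pods):

```
# Pin the iperf3 server and client to known CPUs (example CPUs only).
taskset -c 8 iperf3 -s -p 5201 -D
taskset -c 9 iperf3 -V -f m -l 265 -b 200m -u -i 1 -t 600 -c "${SVC_IP}" &

# Record scheduler events on all CPUs while the test runs (writes perf.data).
perf sched record -a -- sleep 120 &

# Dump task stack traces (sysrq+t) to dmesg every 2 seconds for a while
# (requires kernel.sysrq to allow the 't' trigger).
for i in $(seq 1 10); do
    echo t > /proc/sysrq-trigger
    sleep 2
done
```

The scheduler trace can later be inspected with, for example, 'perf sched timehist' to see whether the pinned CPU was preempted during the drop window.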
Hi Jon,

I think test02 reproduced another issue with similar end results.

Reproducer start at: 02:03:15
+117s = 02:05:12
[  5] 116.00-117.00 sec  23.8 MBytes   200 Mbits/sec  0.002 ms  0/94362 (0%)
[  5] 117.00-118.00 sec  8.78 MBytes  73.6 Mbits/sec  0.003 ms  59581/94315 (63%)
[  5] 118.00-119.00 sec  23.8 MBytes   199 Mbits/sec  0.004 ms  295/94332 (0.31%)

That is not the full traffic hang we see in the other samples.

Using grep, iperf3 seems to be running as below:
Server: iperf3 2206146 [060]
Client: iperf3 3112652 [022]

ksoftirqd is active on these CPUs:

# grep 'ksoftirqd/' perf.script.02:05:05 | awk '{ print $3 }' | tr -d '][' | sort -n | uniq -c | sort -nr
    473 009
     59 041
     57 002
     54 045
     47 003
     44 049
     44 047
     41 048
     38 008
     36 007
     33 043
     33 006
     28 044
     28 005
     24 040
     18 004
     18 001
     16 042
     11 000
      7 046

# grep 'ksoftirqd/' perf.script.02:05:10 | awk '{ print $3 }' | tr -d '][' | sort -n | uniq -c | sort -nr
    470 009
     67 003
     41 008
     34 001
     30 044
     30 006
     29 007
     25 041
     22 004
     19 043
     19 005
     18 040
     17 000
     15 049
     12 002
     11 046
      5 045
      4 048

# grep 'ksoftirqd/' perf.script.02:05:16 | awk '{ print $3 }' | tr -d '][' | sort -n | uniq -c | sort -nr
    354 009
     25 002
     23 042
     22 005
     17 040
     16 047
     15 045
     13 046
     12 007
     11 049
     10 004
      9 000
      7 043
      6 048
      6 041
      6 001
      5 044
      5 003
      4 008
      3 006

So I agree with Jon that it looks like RPS is being used.

Focusing on CPU#9 and measuring the time between each call to process_backlog(), I found a window with 100x the usual interval:
https://docs.google.com/spreadsheets/d/17Csv4MKCY0WGDEdxMO00K3ClESQR8ZqnrbLBr_oNWP8/edit?usp=sharing

Column meanings:
  ts(rel)   timestamp since the beginning of the trace
  diff      time since the last call (usecs)
  ts        timestamp of the call trace
  prev ts   timestamp of the previous call trace
  ts (abs)  absolute timestamp (ts converted to usecs)

Relevant part:

  ts(rel)    diff      ts            prev ts       ts (abs)
  [...]
  644076     174       88806.71339   88806.71322   88806713394
  644236     160       88806.71355   88806.71339   88806713554
  772795     128559    88806.84211   88806.71355   88806842113
  1096032    323237    88807.16535   88806.84211   88807165350
  1146116    50084     88807.21543   88807.16535   88807215434
  1170815    24699     88807.24013   88807.21543   88807240133
  1271092    100277    88807.34041   88807.24013   88807340410
  1278077    6985      88807.3474    88807.34041   88807347395
  1278164    87        88807.34748   88807.3474    88807347482
  1278240    76        88807.34756   88807.34748   88807347558
  [...]

The interval between process_backlog() calls spikes to ~300 msecs, and it does align with the iperf3 reproduction time. I see many traces like the ones below in that time window:

runc 3285892 [009] 88806.846232:     224278 cycles:
            5580021a6a03 [unknown] (/usr/bin/runc)
            5580021a600c [unknown] (/usr/bin/runc)
            5580021a7599 [unknown] (/usr/bin/runc)
            5580021a3b8f [unknown] (/usr/bin/runc)
            5580021a3a99 [unknown] (/usr/bin/runc)
            5580021ccce5 [unknown] (/usr/bin/runc)
            7f58bf7fde70 [unknown] ([unknown])

Several of the one below in sequence, from 88806.855111 to 88806.898144, which is ~43 ms:
crio 214214 [009] 88806.855267:     512866 cycles:
            56383270ec6f [unknown] (/usr/bin/crio)
            56383270f869 [unknown] (/usr/bin/crio)
            563832711e69 [unknown] (/usr/bin/crio)
            563832719fd2 [unknown] (/usr/bin/crio)
            563832c78a69 [unknown] (/usr/bin/crio)
            563832c77d0b [unknown] (/usr/bin/crio)
            5638331d8465 [unknown] (/usr/bin/crio)
            5638331d9e9d [unknown] (/usr/bin/crio)
            563833b96263 [unknown] (/usr/bin/crio)
            563833b9867e [unknown] (/usr/bin/crio)
            563833c15656 [unknown] (/usr/bin/crio)

But they don't explain the issue, because I also see:

swapper 0 [009] 88806.898391:     746610 cycles:
            ffffffff82d4ea54 intel_idle+0x74 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b1ed07 cpuidle_enter_state+0x87 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b1f08c cpuidle_enter+0x2c (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82516464 do_idle+0x234 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff8251665f cpu_startup_entry+0x6f (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82455b79 start_secondary+0x199 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82400107 secondary_startup_64_no_verify+0xc2 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)

That means the CPU went idle! Also, there were several calls to __do_softirq(), which means that the network bottom half could have run. Maybe the traffic moved to another CPU? But it is weird that it came back later. It seems more likely that a neighboring entry expired, because there are many traces like these in the time window:

iperf3 3112652 [022] 88806.894374:     415451 cycles:
            ffffffff82b784bd sock_def_write_space+0x2d (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b7a43a sock_wfree+0x3a (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b80428 skb_release_head_state+0x68 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b804ce skb_release_all+0xe (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b80542 kfree_skb+0x32 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82ba5591 __neigh_event_send+0x1e1 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82ba586f neigh_resolve_output+0x17f (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82cb5d09 ip6_finish_output2+0x1b9 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82cb75ac ip6_output+0x6c (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82cb7c3e ip6_send_skb+0x1e (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82cd8d5d udp_v6_send_skb.isra.37+0x19d (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82cd9da9 udpv6_sendmsg+0xbd9 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b741fc sock_sendmsg+0x4c (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b74297 sock_write_iter+0x97 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)

iperf3 3112652 [022] 88806.910217:     430227 cycles:
            ffffffff82ba04b6 dst_release+0x16 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b804aa skb_release_head_state+0xea (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b804ce skb_release_all+0xe (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b80542 kfree_skb+0x32 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82ba5591 __neigh_event_send+0x1e1 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82ba586f neigh_resolve_output+0x17f (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82cb5d09 ip6_finish_output2+0x1b9 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82cb75ac ip6_output+0x6c (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82cb7c3e ip6_send_skb+0x1e (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)

iperf3 3112652 [022] 88806.925388:     414424 cycles:
            ffffffff82b784bd sock_def_write_space+0x2d (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b7a43a sock_wfree+0x3a (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b80428 skb_release_head_state+0x68 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b804ce skb_release_all+0xe (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82b80542 kfree_skb+0x32 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82ba5591 __neigh_event_send+0x1e1 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82ba586f neigh_resolve_output+0x17f (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)
            ffffffff82cb5d09 ip6_finish_output2+0x1b9 (/usr/lib/debug/lib/modules/4.18.0-305.49.1.el8_4.x86_64/vmlinux)

That would open a traffic black hole until the IPv6 ND is done. We should monitor the 'unresolved_discards' counter in /proc/net/stat/ndisc_cache, and perhaps see if increasing the unresolved queue length helps, though it is per interface and that will be a pain with multiple containers:

  /proc/sys/net/ipv6/neigh/<iface>/unres_qlen

So, the above explains why iperf3 continues to run and we see its output. It explains why process_backlog() was not called (no packets in the queue) during that time. It explains why the veth TX drops increased. It also explains Jon's previous comment:

  Dropwatch:
  Fri Jul 29 19:32:10 2022
  337033 packets dropped at __neigh_event_send   <------- drops as the neigh entry is unreachable

It doesn't explain why the CPU backlog queue overflows, but I am not sure if that happened during test02.

I think iperf3 starts sending traffic in one direction and Linux doesn't see replies, so eventually the neighboring entry expires and we notice the issue. We need to monitor the IPv6 ND cache while reproducing the issue as well.

Thanks,
fbl
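A minimal sketch of monitoring those ND cache counters while reproducing; the interval is an arbitrary choice, and /proc/net/stat/ndisc_cache holds hex values with one row per CPU plus a header row:

```
# Watch unresolved_discards in the IPv6 ND cache while the iperf3 test runs.
while true; do
    date -u
    # Find the unresolved_discards column from the header, then sum it across CPUs.
    awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "unresolved_discards") col = i; next }
         { sum += strtonum("0x" $col) }
         END { print "unresolved_discards:", sum }' /proc/net/stat/ndisc_cache
    sleep 2
done

# Current per-interface unresolved queue limits:
grep -H . /proc/sys/net/ipv6/neigh/*/unres_qlen /proc/sys/net/ipv6/neigh/*/unres_qlen_bytes
```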
Case closed. If any additional follow-up is needed, I'll update.