Description of problem:
iperf3 to a NAT address fails when hw offload is enabled.

Version-Release number of selected component (if applicable):
4.18.0-315.el8
ovn-2021-21.03.0-40.el8fdp.x86_64
openvswitch2.15-2.15.0-24.el8fdp.x86_64

How reproducible:
Always

Steps to Reproduce:

1. setup vf:

echo 4 > /sys/bus/pci/devices/0000:3b:00.0/sriov_numvfs
echo 0000:3b:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind
echo 0000:3b:00.3 > /sys/bus/pci/drivers/mlx5_core/unbind
echo 0000:3b:00.4 > /sys/bus/pci/drivers/mlx5_core/unbind
echo 0000:3b:00.5 > /sys/bus/pci/drivers/mlx5_core/unbind
devlink dev eswitch set pci/0000:3b:00.0 mode switchdev

set the MTU of the PF to 9000

2. create the guests and attach a VF to each guest:

virt-install --name g0 --vcpus=2 --ram=2048 --disk path=/var/lib/libvirt/images/g0.qcow2,device=disk,bus=virtio,format=qcow2 --network bridge=virbr0,model=virtio --boot hd --accelerate --force --graphics none --noautoconsole
virt-install --name g2 --vcpus=2 --ram=2048 --disk path=/var/lib/libvirt/images/g2.qcow2,device=disk,bus=virtio,format=qcow2 --network bridge=virbr0,model=virtio --boot hd --accelerate --force --graphics none --noautoconsole

cat vf.xml
<interface type='hostdev' managed='yes'>
  <source>
    <address type='pci' domain='0x0000' bus='0x3b' slot='0x00' function='0x2'/>
  </source>
  <mac address='00:00:00:01:01:11'/>
</interface>

virsh attach-device g0 vf.xml

cat vf.xml
<interface type='hostdev' managed='yes'>
  <source>
    <address type='pci' domain='0x0000' bus='0x3b' slot='0x00' function='0x4'/>
  </source>
  <mac address='00:00:00:01:02:11'/>
</interface>

virsh attach-device g2 vf.xml

3. add the representors into ovn:

ip link set eth0 down
ip link set eth0 name s_pf0vf0
ovs-vsctl add-port br-int s_pf0vf0 -- set interface s_pf0vf0 external_ids:iface-id=s_pf0vf0
ip link set s_pf0vf0 up
ip link set eth2 down
ip link set eth2 name s_pf0vf2
ovs-vsctl add-port br-int s_pf0vf2 -- set interface s_pf0vf2 external_ids:iface-id=s_pf0vf2
ip link set s_pf0vf2 up

4. repeat steps 1 to 3 on the client.

5. add ovn configuration:

ovn-nbctl ls-add ls1
ovn-nbctl ls-add ls2
ovn-nbctl lr-add lr1
ovn-nbctl lrp-add lr1 lr1-ls1 00:00:00:00:00:01 192.168.1.254/24 2001::a/64
ovn-nbctl lsp-add ls1 ls1-lr1
ovn-nbctl lsp-set-type ls1-lr1 router
ovn-nbctl lsp-set-options ls1-lr1 router-port=lr1-ls1
ovn-nbctl lsp-set-addresses ls1-lr1 router
ovn-nbctl lrp-add lr1 lr1-ls2 00:00:00:00:00:02 172.17.$ip_subnet.254/24 7777:$ip_subnet::a/64
ovn-nbctl lsp-add ls2 ls2-lr1
ovn-nbctl lsp-set-type ls2-lr1 router
ovn-nbctl lsp-set-options ls2-lr1 router-port=lr1-ls2
ovn-nbctl lsp-set-addresses ls2-lr1 router
ovn-nbctl lsp-add ls1 s_pf0vf0
ovn-nbctl lsp-set-addresses s_pf0vf0 "00:00:00:01:01:11 192.168.1.11 2001::11"
ovn-nbctl lsp-add ls2 s_pf0vf2
ovn-nbctl lsp-set-addresses s_pf0vf2 "00:00:00:01:02:11 172.17.172.11 7777:172::11"
ovn-nbctl lsp-add ls1 c_pf0vf0
ovn-nbctl lsp-set-addresses c_pf0vf0 '00:00:00:01:01:13 192.168.1.13 2001::13'
ovn-nbctl lsp-add ls1 c_pf0vf1
ovn-nbctl lsp-set-addresses c_pf0vf1 '00:00:00:01:01:14 192.168.1.14 2001::14'
ovn-nbctl set logical_router lr1 options:chassis=$chassis_id
ovn-nbctl lr-nat-add lr1 dnat_and_snat 3010::13 2001::13
ovn-nbctl lr-nat-add lr1 dnat 10.10.1.14 192.168.1.14

6. enable offload on client and server:

ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
ovs-vsctl set Open_vSwitch . other_config:tc-policy=none
systemctl restart openvswitch
run "iperf3 -u -c 3010::13 -t 1; iperf3 -u -c 10.10.1.14 -t 1" in vm g2 (172.17.172.11) Actual results: + iperf3 -u -c 10.10.1.14 -t 1 Connecting to host 10.10.1.14, port 5201 [ 5] local 172.17.172.11 port 41734 connected to 10.10.1.14 port 5201 iperf3: error - control socket has closed unexpectedly from server side (192.168.1.14): [root@localhost ~]# iperf3 -s -d ----------------------------------------------------------- Server listening on 5201 ----------------------------------------------------------- get_parameters: { "udp": true, "omit": 0, "time": 1, "parallel": 1, "len": 1448, "bandwidth": 1048576, "pacing_timer": 1000, "client_version": "3.5" } Accepted connection from 172.17.172.11, port 36242 SNDBUF is 212992, expecting 0 RCVBUF is 212992, expecting 0 Setting application pacing to 131072 [ 5] local 192.168.1.14 port 5201 connected to 172.17.172.11 port 34883 interval_len 1.001170 bytes_transferred 0 interval forces keep [ ID] Interval Transfer Bitrate Jitter Lost/Total Datagrams [ 5] 0.00-1.00 sec 0.00 Bytes 0.00 bits/sec 0.000 ms 0/0 (0%) interval_len 1.000020 bytes_transferred 0 interval forces keep [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 0.000 ms 0/0 (0%) interval_len 0.999985 bytes_transferred 0 interval forces keep [ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0.000 ms 0/0 (0%) interval_len 1.000011 bytes_transferred 0 interval forces keep [ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 0.000 ms 0/0 (0%) interval_len 0.999997 bytes_transferred 0 interval forces keep [ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0.000 ms 0/0 (0%) iperf3: error - select failed: Bad file descriptor Expected results: iperf3 should pass Additional info: iperf3 WON'T fail if hw offload is disabled with hw-offload=false server: [root@wsfd-advnetlab16 offload_func]# ovn-nbctl show switch 19d0d05e-ffeb-44e9-a605-7287b7019024 (ls1) port c_pf0vf0 addresses: ["00:00:00:01:01:13 192.168.1.13 2001::13"] port ls1-lr1 type: router router-port: lr1-ls1 port s_pf1vf1 addresses: ["00:00:00:01:01:16 192.168.1.16 2001::16"] port s_pf0vf1 addresses: ["00:00:00:01:01:12 192.168.1.12 2001::12"] port s_pf1vf0 addresses: ["00:00:00:01:01:15 192.168.1.15 2001::15"] port s_pf0vf0 addresses: ["00:00:00:01:01:11 192.168.1.11 2001::11"] port c_pf0vf1 addresses: ["00:00:00:01:01:14 192.168.1.14 2001::14"] port ls1_lp type: localport addresses: ["00:00:00:01:01:02 192.168.1.2 2001::2"] switch 48e8446e-df9f-4b04-bba0-709909d1fa06 (ls2) port s_pf0vf2 addresses: ["00:00:00:01:02:11 172.17.172.11 7777:172::11"] port ls2_lp type: localport addresses: ["00:00:00:01:01:02 172.17.172.2 7777:172::2"] port c_pf0vf3 addresses: ["00:00:00:01:02:14 172.17.172.14 7777:172::14"] port s_pf0vf3 addresses: ["00:00:00:01:02:12 172.17.172.12 7777:172::12"] port ls2-lr1 type: router router-port: lr1-ls2 port c_pf0vf2 addresses: ["00:00:00:01:02:13 172.17.172.13 7777:172::13"] router 1afe2007-798a-41b9-a293-4899d6a1083d (lr1) port lr1-ls2 mac: "00:00:00:00:00:02" networks: ["172.17.172.254/24", "7777:172::a/64"] port lr1-ls1 mac: "00:00:00:00:00:01" networks: ["192.168.1.254/24", "2001::a/64"] nat 4b15999e-6e16-4813-bb67-8b07b5d19eae external ip: "3010::13" logical ip: "2001::13" type: "dnat_and_snat" nat 7744ac11-af05-40cf-8e46-ca4f60382c8b external ip: "10.10.1.14" logical ip: "192.168.1.14" type: "dnat" [root@wsfd-advnetlab16 offload_func]# ovs-vsctl show fb55eea2-d71c-480b-b52c-dda71890e775 Bridge br-int fail_mode: secure Port s_pf1vf1 Interface s_pf1vf1 Port ovn-48aaf6-0 Interface ovn-48aaf6-0 type: geneve options: {csum="true", key=flow, 
remote_ip="20.0.172.26"} Port s_pf0vf1 Interface s_pf0vf1 Port s_pf0vf2 Interface s_pf0vf2 Port s_pf1vf0 Interface s_pf1vf0 Port s_pf0vf0 Interface s_pf0vf0 Port s_pf0vf3 Interface s_pf0vf3 Port br-int Interface br-int type: internal ovs_version: "2.15.1" [root@wsfd-advnetlab16 offload_func]# ovn-nbctl lr-nat-list lr1 TYPE EXTERNAL_IP EXTERNAL_PORT LOGICAL_IP EXTERNAL_MAC LOGICAL_PORT dnat 10.10.1.14 192.168.1.14 dnat_and_snat 3010::13 2001::13 [root@wsfd-advnetlab16 offload_func]# uname -a Linux wsfd-advnetlab16.anl.lab.eng.bos.redhat.com 4.18.0-315.el8.x86_64 #1 SMP Thu Jun 17 14:56:40 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux [root@wsfd-advnetlab16 offload_func]# rpm -qa | grep -E "openvswitch2.15|ovn-2021" ovn-2021-21.03.0-40.el8fdp.x86_64 ovn-2021-host-21.03.0-40.el8fdp.x86_64 ovn-2021-central-21.03.0-40.el8fdp.x86_64 python3-openvswitch2.15-2.15.0-24.el8fdp.x86_64 eopenvswitch2.15-2.15.0-24.el8fdp.x86_64 [root@wsfd-advnetlab16 offload_func]# ethtool -i ens1f0 driver: mlx5e_rep version: 4.18.0-315.el8.x86_64 firmware-version: 16.30.1004 (MT_0000000013) expansion-rom-version: bus-info: 0000:3b:00.0 supports-statistics: yes supports-test: no supports-eeprom-access: no supports-register-dump: no supports-priv-flags: no [root@wsfd-advnetlab16 offload_func]# ovs-vsctl list open _uuid : fb55eea2-d71c-480b-b52c-dda71890e775 bridges : [6616fe95-7244-4a70-99bf-3f7b08abe995] cur_cfg : 14 datapath_types : [netdev, system] datapaths : {} db_version : "8.2.0" dpdk_initialized : false dpdk_version : "DPDK 20.11.1" external_ids : {hostname=wsfd-advnetlab16.anl.lab.eng.bos.redhat.com, ovn-encap-ip="20.0.172.25", ovn-encap-type=geneve, ovn-remote="tcp:20.0.172.25:6642", rundir="/var/run/openvswitch", system-id="98d8b1e3-e015-4cd1-a02f-c4a5a5a3b34b"} iface_types : [bareudp, erspan, geneve, gre, gtpu, internal, ip6erspan, ip6gre, lisp, patch, stt, system, tap, vxlan] manager_options : [] next_cfg : 14 other_config : {hw-offload="true", tc-policy=none} ovs_version : "2.15.1" ssl : [] statistics : {} system_type : rhel system_version : "8.5" client: [root@wsfd-advnetlab19 offload_func]# ovs-vsctl show 85e93de3-ed46-4222-a00f-a7a94fb10c41 Bridge br-int fail_mode: secure Port c_pf0vf3 Interface c_pf0vf3 Port c_pf0vf0 Interface c_pf0vf0 Port ovn-98d8b1-0 Interface ovn-98d8b1-0 type: geneve options: {csum="true", key=flow, remote_ip="20.0.172.25"} Port c_pf0vf2 Interface c_pf0vf2 Port br-int Interface br-int type: internal Port c_pf0vf1 Interface c_pf0vf1 ovs_version: "2.15.1" [root@wsfd-advnetlab19 offload_func]# ethtool -i ens1f0 driver: mlx5e_rep version: 4.18.0-315.el8.x86_64 firmware-version: 16.30.1004 (MT_0000000013) expansion-rom-version: bus-info: 0000:3b:00.0 supports-statistics: yes supports-test: no supports-eeprom-access: no supports-register-dump: no supports-priv-flags: no
This bug, as is, is not ringing a bell here.
My only suspicion is around the MTU mention. Only the PF had its MTU changed to 9000?

If the issue persists without MTU changes, then we will need:
# ovs-appctl dpctl/dump-flows -m
# tc -s filter show dev <representors> ingress

and traffic captures, either on the representors or on the VFs.
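For reference, a minimal collection sketch along those lines (assuming the representor names used in this report; adjust interface names and output paths as needed):

# on the host under test, while reproducing the issue
ovs-appctl dpctl/dump-flows -m > /tmp/dpctl_flows.txt
for rep in s_pf0vf0 s_pf0vf2; do
    tc -s filter show dev "$rep" ingress > "/tmp/${rep}_tc_filter.txt"
    tcpdump -nn -i "$rep" -w "/tmp/${rep}.pcap" &    # capture on the representor
done
# ... run the iperf3 test, then stop the captures
kill $(jobs -p)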
(In reply to Marcelo Ricardo Leitner from comment #1)
> This bug, as is, is not ringing a bell here.
> My only suspicion is around the MTU mention. Only the PF had MTU changed to 9000?

Yes, and the PF is also the geneve lower device.

> If the issue persists without MTU changes, then we will need:
> # ovs-appctl dpctl/dump-flows -m
> # tc -s filter show dev <representors> ingress
>
> traffic captures, either on representors or on VFs.

With kernel-4.18.0-320 and openvswitch2.15-26 it's hard to reproduce the issue, but "iperf3 -u -c 10.10.1.14 -t 1" takes a long time to finish. Logs will be attached.
The tc rules and ovs flows look fine to me.

One thing to note here is that CT HWOL only offloads entries when the conntrack entry is in Established state. As the test is using UDP packets, that's when conntrack sees UDP packets from both sides.

From server:

Encap flows: (server -> client)

ufid:bdebb6f9-3e2b-4393-ad4f-2ef8e1e6804d, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(s_pf0vf2),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:01:02:11,dst=00:00:00:00:00:02),eth_type(0x0800),ipv4(src=172.17.171.0/255.255.255.128,dst=10.10.1.14,proto=17,tos=0/0,ttl=64,frag=no),udp(src=0/0,dst=0/0), packets:91, bytes:135534, used:0.760s, offloaded:yes, dp:tc, actions:ct_clear,ct(commit,zone=16,nat(dst=192.168.1.14)),recirc(0xb)

ufid:eee1d006-f4db-4cd3-9ca0-25768997b86a, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0xb),dp_hash(0/0),in_port(s_pf0vf2),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:01:02:11,dst=00:00:00:00:00:02),eth_type(0x0800),ipv4(src=128.0.0.0/192.0.0.0,dst=192.168.1.14,proto=17,tos=0/0x3,ttl=64,frag=no),udp(src=0/0,dst=0/0), packets:91, bytes:140580, used:0.760s, offloaded:yes, dp:tc, actions:ct_clear,set(tunnel(tun_id=0x1,dst=20.0.171.26,ttl=64,tp_dst=6081,key6(bad key length 1, expected 0)(01)geneve({class=0x102,type=0x80,len=4,0x20004}),flags(key))),set(eth(src=00:00:00:00:00:01,dst=00:00:00:01:01:14)),set(ipv4(ttl=63)),genev_sys_6081

Both have packets, and the same amount.
AFAICT, '7. run "iperf3 -u ...' is being run at the server. So the server seems to be pushing packets.

Decap: (client -> server)

ufid:1e381193-52ab-4f03-beae-d8b6721a9a3b, skb_priority(0/0),tunnel(tun_id=0x1,src=20.0.171.26,dst=20.0.171.25,ttl=0/0,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x40002/0x7fffffff}),flags(+key)),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(genev_sys_6081),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:01:01:14,dst=00:00:00:00:00:01),eth_type(0x0800),ipv4(src=192.168.1.0/255.255.255.128,dst=172.17.171.0/255.255.255.128,proto=0/0,tos=0/0,ttl=64,frag=no), packets:9, bytes:526, used:0.690s, offloaded:yes, dp:tc, actions:ct_clear,ct(zone=16,nat),recirc(0xca)

ufid:6ab5e97c-fbe7-4db7-8b22-8215a9ed64a0, skb_priority(0/0),tunnel(tun_id=0x1,src=20.0.171.26,dst=20.0.171.25,ttl=0/0,tp_dst=6081,geneve({class=0x102/0,type=0x80/0,len=4,0x40002/0}),flags(+key)),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0xca),dp_hash(0/0),in_port(genev_sys_6081),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:01:01:14,dst=00:00:00:00:00:01),eth_type(0x0800),ipv4(src=0.0.0.0/128.0.0.0,dst=172.17.171.11,proto=17,tos=0/0,ttl=64,frag=no),udp(src=0/0,dst=0/0), packets:0, bytes:0, used:1.840s, offloaded:yes, dp:tc, actions:ct_clear,set(eth(src=00:00:00:00:00:02,dst=00:00:00:01:02:11)),set(ipv4(ttl=63)),s_pf0vf2

0 packets on UDP, 9 packets in the first flow (which doesn't match on proto), and there were 8 packets on the TCP flow. So I would think the 9th packet here is actually a UDP one that got handled by upcall, and triggered the insertion of this last flow.

Checking s_pf0vf2.pcap, this extra UDP packet is #17 there. Sounds like iperf3 does send an initial packet server->client (which I didn't know).
Then the client pushes 4 packets and that's all we see (in the representor?).

Please clarify on where the captures were taken.
I don't understand why we see the whole TCP control connection on the server but just SYNs and some rtx on the client, and not even the FIN.

I am thinking that by adjusting just the PF MTU to 9000, it is not generating an ICMP Frag Needed back to the UDP socket, which then is not fragmenting the packets. This packet gets encapsulated by the server but dropped by the switch or the client, as the UDP packets are 1490 bytes already, before geneve encapsulation.

I couldn't find a parameter on iperf3 to reduce the packet size. If you do, that's a good test. Otherwise, please test with the -R option. It will make the server push the traffic to the client.

Also, please try netperf. There, we can use the UDP_RR test and use "-r 1000,1000" to make it use smaller UDP packets.
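A sketch of the smaller-packet variants suggested above (the addresses are the NAT addresses from this report; netperf needs netserver running in the target VM):

# reverse mode (-R): the iperf3 server side sends the traffic
iperf3 -u -R -c 10.10.1.14 -t 1
# UDP request/response with 1000-byte payloads in both directions
netperf -4 -H 10.10.1.14 -t UDP_RR -l 1 -- -r 1000,1000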
(In reply to Marcelo Ricardo Leitner from comment #4)
> I am thinking that by adjusting just the PF MTU to 9000, it is not generating an ICMP Frag Needed back to the UDP socket, which then is not fragmenting the packets. This packet gets encapsulated by the server but dropped by the switch or the client, as the UDP packets are 1490 bytes already, before geneve encapsulation.

I don't know why it works without HWOL, btw. Maybe ovs kernel does something on this that I'm not seeing at the moment.
(In reply to Marcelo Ricardo Leitner from comment #4)
> The tc rules and ovs flows look fine to me.
>
> One thing to note here is that CT HWOL only offloads entries when the conntrack entry is in Established state. As the test is using UDP packets, that's when conntrack sees UDP packets from both sides.
>
> From server:
>
> Encap flows: (server -> client)
>
> ufid:bdebb6f9-3e2b-4393-ad4f-2ef8e1e6804d, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(s_pf0vf2),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:01:02:11,dst=00:00:00:00:00:02),eth_type(0x0800),ipv4(src=172.17.171.0/255.255.255.128,dst=10.10.1.14,proto=17,tos=0/0,ttl=64,frag=no),udp(src=0/0,dst=0/0), packets:91, bytes:135534, used:0.760s, offloaded:yes, dp:tc, actions:ct_clear,ct(commit,zone=16,nat(dst=192.168.1.14)),recirc(0xb)
>
> ufid:eee1d006-f4db-4cd3-9ca0-25768997b86a, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0xb),dp_hash(0/0),in_port(s_pf0vf2),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:01:02:11,dst=00:00:00:00:00:02),eth_type(0x0800),ipv4(src=128.0.0.0/192.0.0.0,dst=192.168.1.14,proto=17,tos=0/0x3,ttl=64,frag=no),udp(src=0/0,dst=0/0), packets:91, bytes:140580, used:0.760s, offloaded:yes, dp:tc, actions:ct_clear,set(tunnel(tun_id=0x1,dst=20.0.171.26,ttl=64,tp_dst=6081,key6(bad key length 1, expected 0)(01)geneve({class=0x102,type=0x80,len=4,0x20004}),flags(key))),set(eth(src=00:00:00:00:00:01,dst=00:00:00:01:01:14)),set(ipv4(ttl=63)),genev_sys_6081
>
> Both have packets, and same amount.
> AFAICT, '7. run "iperf3 -u ...' is being run at the server. So the server seems to be pushing packets.
>
> Decap: (client -> server)
>
> ufid:1e381193-52ab-4f03-beae-d8b6721a9a3b, skb_priority(0/0),tunnel(tun_id=0x1,src=20.0.171.26,dst=20.0.171.25,ttl=0/0,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x40002/0x7fffffff}),flags(+key)),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(genev_sys_6081),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:01:01:14,dst=00:00:00:00:00:01),eth_type(0x0800),ipv4(src=192.168.1.0/255.255.255.128,dst=172.17.171.0/255.255.255.128,proto=0/0,tos=0/0,ttl=64,frag=no), packets:9, bytes:526, used:0.690s, offloaded:yes, dp:tc, actions:ct_clear,ct(zone=16,nat),recirc(0xca)
>
> ufid:6ab5e97c-fbe7-4db7-8b22-8215a9ed64a0, skb_priority(0/0),tunnel(tun_id=0x1,src=20.0.171.26,dst=20.0.171.25,ttl=0/0,tp_dst=6081,geneve({class=0x102/0,type=0x80/0,len=4,0x40002/0}),flags(+key)),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0xca),dp_hash(0/0),in_port(genev_sys_6081),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:01:01:14,dst=00:00:00:00:00:01),eth_type(0x0800),ipv4(src=0.0.0.0/128.0.0.0,dst=172.17.171.11,proto=17,tos=0/0,ttl=64,frag=no),udp(src=0/0,dst=0/0), packets:0, bytes:0, used:1.840s, offloaded:yes, dp:tc, actions:ct_clear,set(eth(src=00:00:00:00:00:02,dst=00:00:00:01:02:11)),set(ipv4(ttl=63)),s_pf0vf2
>
> 0 packets on UDP, 9 packets in the first flow (which doesn't match on proto) and there were 8 packets on the TCP flow. So I would think the 9th packet here is actually an UDP one that got handled by upcall, and triggered the insertion of this last flow.
>
> Checking s_pf0vf2.pcap, this extra UDP packet is #17 there. Sounds like iperf3 does send an initial packet server->client (which I didn't know).
> Then the client pushes 4 packets and that's all we see (in the representor?)

All packets are captured on the representors: c_pf0vf1 and s_pf0vf2.

> Please clarify on where the captures were taken. I don't understand why we see the whole TCP control connection on server but just SYNs and some rtx on client, and not even the FIN.
>
> I am thinking that by adjusting just the PF MTU to 9000, it is not generating an ICMP Frag Needed back to the UDP socket, which then is not fragmenting the packets. This packet gets encapsulated by the server but dropped by the switch or the client, as the UDP packets are 1490 bytes already, before geneve encapsulation.

MTU on switch and client are both 9000, so there should be no problem receiving the packets.

> I couldn't find a parameter on iperf3 to reduce the packet size. If you do, that's a good test. Otherwise, please test with -R option. It will make the server push the traffic to client.
>
> Also, please try netperf. There, we can use the UDP_RR test and use "-r 1000,1000" to make it use smaller UDP packets.

netperf also takes a long time:
netperf -6 -H 3010::13 -t UDP_RR -l 1; netperf -4 -H 10.10.1.14 -t UDP_RR -l 1 -- -r 1000,1000
If we only run netperf -4 -H 10.10.1.14 -t UDP_RR -l 1 -- -r 1000,1000, the result looks good.

tcpdump results will be attached.
(In reply to Jianlin Shi from comment #6)
> mtu on switch and client are both 9000. so there should be no problem for receiving the packets.

Ok.

> > I couldn't find a parameter on iperf3 to reduce the packet size. If you do, that's a good test. Otherwise, please test with -R option. It will make the server push the traffic to client.
> >
> > Also, please try netperf. There, we can use the UDP_RR test and use "-r 1000,1000" to make it use smaller UDP packets.
>
> netperf would also take a long time: netperf -6 -H 3010::13 -t UDP_RR -l 1;netperf -4 -H 10.10.1.14 -t UDP_RR -l 1 -- -r 1000,1000
> and if we only run netperf -4 -H 10.10.1.14 -t UDP_RR -l 1 -- -r 1000,1000, the result looks good.

Does that mean that
# netperf -6 -H 3010::13 -t UDP_RR -l 1
is hogging it, while
# netperf -4 -H 10.10.1.14 -t UDP_RR -l 1 -- -r 1000,1000
works?

If that's it, then it seems there's either an issue with ipv6 or MTU somehow.

I don't see any ipv6 on the captures.

Please try these commands separately, and also specify "-r 1000,1000" to the ipv6 one (and -l 1000 for iperf3 on ipv6 as well). Thanks.
(In reply to Marcelo Ricardo Leitner from comment #8)
> (In reply to Jianlin Shi from comment #6)
> > mtu on switch and client are both 9000. so there should be no problem for receiving the packets.
>
> Ok.
>
> > > I couldn't find a parameter on iperf3 to reduce the packet size. If you do, that's a good test. Otherwise, please test with -R option. It will make the server push the traffic to client.
> > >
> > > Also, please try netperf. There, we can use the UDP_RR test and use "-r 1000,1000" to make it use smaller UDP packets.
> >
> > netperf would also take a long time: netperf -6 -H 3010::13 -t UDP_RR -l 1;netperf -4 -H 10.10.1.14 -t UDP_RR -l 1 -- -r 1000,1000
> > and if we only run netperf -4 -H 10.10.1.14 -t UDP_RR -l 1 -- -r 1000,1000, the result looks good.
>
> Does that mean that
> # netperf -6 -H 3010::13 -t UDP_RR -l 1
> is hogging it, while
> # netperf -4 -H 10.10.1.14 -t UDP_RR -l 1 -- -r 1000,1000
> works?

No. Running netperf -6 -H 3010::13 -t UDP_RR -l 1 and netperf -4 -H 10.10.1.14 -t UDP_RR -l 1 -- -r 1000,1000 separately both pass.
But if netperf -6 -H 3010::13 -t UDP_RR -l 1 is run first and netperf -4 -H 10.10.1.14 -t UDP_RR -l 1 -- -r 1000,1000 is run right after it, the netperf -4 run takes a long time.

> If that's it, then it seems there's either an issue with ipv6 or MTU somehow.
>
> I don't see any ipv6 on the captures.
>
> Please try these commands separately, and also specify "-r 1000,1000" to the ipv6 one (and -l 1000 for iperf3 on ipv6 as well). Thanks.
(In reply to Jianlin Shi from comment #9)
> > Does that mean that
> > # netperf -6 -H 3010::13 -t UDP_RR -l 1
> > is hogging it, while
> > # netperf -4 -H 10.10.1.14 -t UDP_RR -l 1 -- -r 1000,1000
> > works?
>
> no.
> run netperf -6 -H 3010::13 -t UDP_RR -l 1 and netperf -4 -H 10.10.1.14 -t UDP_RR -l 1 -- -r 1000,1000 separately would both pass.
> but if run netperf -6 -H 3010::13 -t UDP_RR -l 1 first, then run netperf -4 -H 10.10.1.14 -t UDP_RR -l 1 -- -r 1000,1000 right after the last command, netperf -4 -H 10.10.1.14 -t UDP_RR -l 1 -- -r 1000,1000 would take a long time.

Ok. Can you please share captures on the VFs, VF representors and PF ports?
Please capture it so it includes the ipv6 test as well.
Also, while the issue is happening, please capture a snapshot of /proc/net/nf_conntrack.

I can see a 30s pause in s_pf0vf2_iperf.cap. A similar gap is in c_pf0vf1_iperf.pcap, but that's only a partial view of the traffic and I can't see whether NAT or something else is misbehaving.

In s_pf0vf2_netperf, it seems there is a retransmission that took 34s to happen. But if the rest of the traffic was offloaded, it's not visible there.
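One possible way to gather all of that in a single run (a sketch; the interface names are the ones used on the server host in this report, and the output file names are just examples):

# on the host: capture on the PF uplink and on the representor
tcpdump -nn -i ens1f0 -w /tmp/ens1f0.pcap &
tcpdump -nn -i s_pf0vf2 -w /tmp/s_pf0vf2_rep.pcap &
# inside the guest: tcpdump -nn -i ens6 -w /tmp/s_pf0vf2_invm.pcap &
# ... run the ipv6 test followed by the ipv4 test ...
cat /proc/net/nf_conntrack > /tmp/nf_conntrack.snapshot   # snapshot while the issue is happening
kill $(jobs -p)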
(In reply to Marcelo Ricardo Leitner from comment #10)
> (In reply to Jianlin Shi from comment #9)
> > > Does that mean that
> > > # netperf -6 -H 3010::13 -t UDP_RR -l 1
> > > is hogging it, while
> > > # netperf -4 -H 10.10.1.14 -t UDP_RR -l 1 -- -r 1000,1000
> > > works?
> >
> > no.
> > run netperf -6 -H 3010::13 -t UDP_RR -l 1 and netperf -4 -H 10.10.1.14 -t UDP_RR -l 1 -- -r 1000,1000 separately would both pass.
> > but if run netperf -6 -H 3010::13 -t UDP_RR -l 1 first, then run netperf -4 -H 10.10.1.14 -t UDP_RR -l 1 -- -r 1000,1000 right after the last command, netperf -4 -H 10.10.1.14 -t UDP_RR -l 1 -- -r 1000,1000 would take a long time.
>
> Ok. Can you please share captures on the VFs, VF representors and PF ports?
> Please capture it so it includes the ipv6 test as well.
> Also, while the issue is happening, please capture a snapshot of /proc/net/nf_conntrack.
>
> I can see a 30s pause on s_pf0vf2_iperf.cap. Similar gap is in c_pf0vf1_iperf.pcap, but that's only a partial view of the traffic and I can't see if NAT or something is misbehaving.
>
> In s_pf0vf2_netperf, seems there is a retransmission that took 34s to happen. But if the rest of the traffic was offloaded, it's not visible there.

Log uploaded here:
http://netqe-bj.usersys.redhat.com/share/jishi/bz1975085/log_072101/
Disclaimer: I'm feeling humorous and I tried to make this a bit funny to read. :-)

(In reply to Jianlin Shi from comment #11)
> log uploaded here:
> http://netqe-bj.usersys.redhat.com/share/jishi/bz1975085/log_072101/

Ok. It seems I found the issue within these captures, but I'll need to discuss with the bigger team why this is happening.

The issue actually lies with the TCP control connections.

$ tshark -r server/s_pf0vf2_invm_netperf.pcap 'frame.number > 24762 and ip'
24764 1.741460 10.10.1.14 → 172.17.169.11 TCP 74 12865 → 48829 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=2727099150 TSecr=889162376 WS=128
24765 1.741560 172.17.169.11 → 10.10.1.14 TCP 66 48829 → 12865 [ACK] Seq=1 Ack=1 Win=29312 [TCP CHECKSUM INCORRECT] Len=0 TSval=889162638 TSecr=2727099150
24766 1.747100 172.17.169.11 → 10.10.1.14 TCP 722 48829 → 12865 [PSH, ACK] Seq=1 Ack=1 Win=29312 [TCP CHECKSUM INCORRECT] Len=656 TSval=889162644 TSecr=2727099150
24767 1.783649 10.10.1.14 → 172.17.169.11 TCP 722 12865 → 48829 [PSH, ACK] Seq=1 Ack=657 Win=30336 Len=656 TSval=2727099296 TSecr=889162644
24768 1.783706 172.17.169.11 → 10.10.1.14 TCP 66 48829 → 12865 [ACK] Seq=657 Ack=657 Win=30592 [TCP CHECKSUM INCORRECT] Len=0 TSval=889162680 TSecr=2727099296
24769 1.783961 172.17.169.11 → 10.10.1.14 UDP 1042 60155 → 37477 Len=1000 [UDP CHECKSUM INCORRECT]
24770 1.819068 10.10.1.14 → 172.17.169.11 TCP 66 12865 → 48829 [ACK] Seq=1 Ack=657 Win=30336 Len=0 TSval=2727099294 TSecr=889162644
24771 1.819096 172.17.169.11 → 10.10.1.14 TCP 66 [TCP Dup ACK 24768#1] 48829 → 12865 [ACK] Seq=657 Ack=657 Win=30592 [TCP CHECKSUM INCORRECT] Len=0 TSval=889162716 TSecr=2727099296

This is the first indication of the problem. 24771 here is a dup of the ack at 24768, which is an ack to 24767. It was triggered by 24770.

Note how there is a re-ordering here:

24767 1.783649 10.10.1.14 → 172.17.169.11 TCP 722 ... TSval=2727099296 TSecr=889162644
                                                       ^^^ [A]
24768 1.783706 172.17.169.11 → 10.10.1.14 TCP 66 ... Len=0 TSval=889162680 TSecr=2727099296
24770 1.819068 10.10.1.14 → 172.17.169.11 TCP 66 ... Len=0 TSval=2727099294 TSecr=889162644
                                                       ^^^ [B], older than [A] above
24771 1.819096 172.17.169.11 → 10.10.1.14 TCP 66 ... Len=0 TSval=889162716 TSecr=2727099296
                                                       ^^^ confirmed here

This can happen because, if the flow got offloaded, 24767 could have been handled entirely by the NIC and arrived before 24770, which could have been handled by sw. And it was: it's packet 12465 on s_pf0vf2_rep_netperf.pcap.

The reordering is fine, TCP can recover from this just fine, but this is an important confirmation that the connection got offloaded.
Despite that, the UDP flow starts (it actually started at packet 24769 above):

24772 1.826712 10.10.1.14 → 172.17.169.11 UDP 1042 37477 → 60155 Len=1000
24773 1.826919 172.17.169.11 → 10.10.1.14 UDP 1042 60155 → 37477 Len=1000 [UDP CHECKSUM INCORRECT]
24774 1.827508 10.10.1.14 → 172.17.169.11 UDP 1042 37477 → 60155 Len=1000
24775 1.827699 172.17.169.11 → 10.10.1.14 UDP 1042 60155 → 37477 Len=1000 [UDP CHECKSUM INCORRECT]
24776 1.828082 10.10.1.14 → 172.17.169.11 UDP 1042 37477 → 60155 Len=1000
24777 1.828153 172.17.169.11 → 10.10.1.14 UDP 1042 60155 → 37477 Len=1000 [UDP CHECKSUM INCORRECT]
24778 1.828353 10.10.1.14 → 172.17.169.11 UDP 1042 37477 → 60155 Len=1000
24779 1.828437 172.17.169.11 → 10.10.1.14 UDP 1042 60155 → 37477 Len=1000 [UDP CHECKSUM INCORRECT]
24780 1.828614 10.10.1.14 → 172.17.169.11 UDP 1042 37477 → 60155 Len=1000
24781 1.828676 172.17.169.11 → 10.10.1.14 UDP 1042 60155 → 37477 Len=1000 [UDP CHECKSUM INCORRECT]
24782 1.828878 10.10.1.14 → 172.17.169.11 UDP 1042 37477 → 60155 Len=1000
24783 1.828937 172.17.169.11 → 10.10.1.14 UDP 1042 60155 → 37477 Len=1000 [UDP CHECKSUM INCORRECT]
24784 1.829089 10.10.1.14 → 172.17.169.11 UDP 1042 37477 → 60155 Len=1000
24785 1.829146 172.17.169.11 → 10.10.1.14 UDP 1042 60155 → 37477 Len=1000 [UDP CHECKSUM INCORRECT]
24786 2.227145 10.10.1.14 → 172.17.169.11 TCP 722 [TCP Spurious Retransmission] 12865 → 48829 [PSH, ACK] Seq=1 Ack=657 Win=30336 Len=656 TSval=2727099774 TSecr=889162644

Heeey, what is this doing here!? Why is the client retransmitting this? We told it on frames 24768 and 24771 above that we got this!

24787 2.227175 172.17.169.11 → 10.10.1.14 TCP 78 [TCP Dup ACK 24768#2] 48829 → 12865 [ACK] Seq=657 Ack=657 Win=30592 [TCP CHECKSUM INCORRECT] Len=0 TSval=889163124 TSecr=2727099774 SLE=1 SRE=657

No worries, we tell it again.

24788 2.784131 172.17.169.11 → 10.10.1.14 UDP 42 60155 → 37477 Len=0 [UDP CHECKSUM INCORRECT]
24789 4.092176 10.10.1.14 → 172.17.169.11 TCP 722 [TCP Spurious Retransmission] 12865 → 48829 [PSH, ACK] Seq=1 Ack=657 Win=30336 Len=656 TSval=2727101638 TSecr=889162644

"Oh come on, now 10.10.1.14 has got to be kidding us, right? Are you not receiving our acks!?" (hint hint)

24790 4.092226 172.17.169.11 → 10.10.1.14 TCP 78 [TCP Dup ACK 24768#3] 48829 → 12865 [ACK] Seq=657 Ack=657 Win=30592 [TCP CHECKSUM INCORRECT] Len=0 TSval=889164989 TSecr=2727101638 SLE=1 SRE=657
24791 7.803142 10.10.1.14 → 172.17.169.11 TCP 722 [TCP Spurious Retransmission] 12865 → 48829 [PSH, ACK] Seq=1 Ack=657 Win=30336 Len=656 TSval=2727105350 TSecr=889162644
24792 7.803184 172.17.169.11 → 10.10.1.14 TCP 78 [TCP Dup ACK 24768#4] 48829 → 12865 [ACK] Seq=657 Ack=657 Win=30592 [TCP CHECKSUM INCORRECT] Len=0 TSval=889168700 TSecr=2727105350 SLE=1 SRE=657
24793 15.291198 10.10.1.14 → 172.17.169.11 TCP 722 [TCP Spurious Retransmission] 12865 → 48829 [PSH, ACK] Seq=1 Ack=657 Win=30336 Len=656 TSval=2727112838 TSecr=889162644
24794 15.291243 172.17.169.11 → 10.10.1.14 TCP 78 [TCP Dup ACK 24768#5] 48829 → 12865 [ACK] Seq=657 Ack=657 Win=30592 [TCP CHECKSUM INCORRECT] Len=0 TSval=889176188 TSecr=2727112838 SLE=1 SRE=657

Here there is a delay that seems big enough for OVS to expire the datapath flows. With that, it kills act_ct and the conntrack entries get evicted from hw.
24795 30.403017 10.10.1.14 → 172.17.169.11 TCP 722 [TCP Spurious Retransmission] 12865 → 48829 [PSH, ACK] Seq=1 Ack=657 Win=30336 Len=656 TSval=2727127686 TSecr=889162644
24796 30.403064 172.17.169.11 → 10.10.1.14 TCP 78 [TCP Dup ACK 24768#6] 48829 → 12865 [ACK] Seq=657 Ack=657 Win=30592 [TCP CHECKSUM INCORRECT] Len=0 TSval=889191300 TSecr=2727127686 SLE=1 SRE=657

"OMG 10.10.1.14! Ok, we got this info already!" These are now handled in sw (likely ovs upcall).

24797 30.503506 10.10.1.14 → 172.17.169.11 TCP 722 12865 → 48829 [PSH, ACK] Seq=657 Ack=657 Win=30336 Len=656 TSval=2727128050 TSecr=889191300

"Ohhh 10.10.1.14, you finally moved on!"

24798 30.503546 172.17.169.11 → 10.10.1.14 TCP 66 48829 → 12865 [ACK] Seq=657 Ack=1313 Win=31872 [TCP CHECKSUM INCORRECT] Len=0 TSval=889191400 TSecr=2727128050
24799 30.513866 172.17.169.11 → 10.10.1.14 TCP 66 48829 → 12865 [FIN, ACK] Seq=657 Ack=1313 Win=31872 [TCP CHECKSUM INCORRECT] Len=0 TSval=889191411 TSecr=2727128050
24800 30.514781 10.10.1.14 → 172.17.169.11 TCP 66 12865 → 48829 [FIN, ACK] Seq=1313 Ack=658 Win=30336 Len=0 TSval=2727128061 TSecr=889191411
24801 30.514823 172.17.169.11 → 10.10.1.14 TCP 66 48829 → 12865 [ACK] Seq=658 Ack=1314 Win=31872 [TCP CHECKSUM INCORRECT] Len=0 TSval=889191411 TSecr=2727128061

And the test finishes.

Now, why was the client retransmitting? Checking the captures on the client side. We have 2 VFs in there. One handled the ipv6 test and the other the ipv4 one.

$ tshark -r c_pf0vf0_invm_netperf.pcap ipv6 | wc -l
24762
$ tshark -r c_pf0vf0_invm_netperf.pcap ip | wc -l
8
$ tshark -r c_pf0vf1_invm_netperf.pcap ipv6 | wc -l
0
$ tshark -r c_pf0vf1_invm_netperf.pcap ip | wc -l
31

So let's take a peek at c_pf0vf1_invm_netperf.pcap. Mind the different IP addresses.. we can use the pkt sizes and timestamps to match with the above.

$ tshark -r c_pf0vf1_invm_netperf.pcap
1 0.000000 172.17.169.11 → 192.168.1.14 TCP 74 48829 → 12865 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=889162376 TSecr=0 WS=128
2 0.000085 192.168.1.14 → 172.17.169.11 TCP 74 12865 → 48829 [SYN, ACK] Seq=0 Ack=1 Win=28960 [TCP CHECKSUM INCORRECT] Len=0 MSS=1460 SACK_PERM=1 TSval=2727099150 TSecr=889162376 WS=128
3 0.138191 172.17.169.11 → 192.168.1.14 TCP 66 48829 → 12865 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=889162638 TSecr=2727099150
4 0.143662 172.17.169.11 → 192.168.1.14 TCP 722 48829 → 12865 [PSH, ACK] Seq=1 Ack=1 Win=29312 Len=656 TSval=889162644 TSecr=2727099150
5 0.143713 192.168.1.14 → 172.17.169.11 TCP 66 12865 → 48829 [ACK] Seq=1 Ack=657 Win=30336 [TCP CHECKSUM INCORRECT] Len=0 TSval=2727099294 TSecr=889162644
    TSval=2727099294 ^^^ matches [B] above.
6 0.145234 192.168.1.14 → 172.17.169.11 TCP 722 12865 → 48829 [PSH, ACK] Seq=1 Ack=657 Win=30336 [TCP CHECKSUM INCORRECT] Len=656 TSval=2727099296 TSecr=889162644
    TSval=2727099296 ^^^ matches [A] above. 1st xmit. Reordering confirmed.
7 0.220491 172.17.169.11 → 192.168.1.14 UDP 1042 60155 → 37477 Len=1000
8 0.220719 192.168.1.14 → 172.17.169.11 UDP 1042 37477 → 60155 Len=1000 [UDP CHECKSUM INCORRECT]
9 0.223539 172.17.169.11 → 192.168.1.14 UDP 1042 60155 → 37477 Len=1000
10 0.223747 192.168.1.14 → 172.17.169.11 UDP 1042 37477 → 60155 Len=1000 [UDP CHECKSUM INCORRECT]
11 0.224156 172.17.169.11 → 192.168.1.14 UDP 1042 60155 → 37477 Len=1000
12 0.224349 192.168.1.14 → 172.17.169.11 UDP 1042 37477 → 60155 Len=1000 [UDP CHECKSUM INCORRECT]
13 0.224549 172.17.169.11 → 192.168.1.14 UDP 1042 60155 → 37477 Len=1000
14 0.224621 192.168.1.14 → 172.17.169.11 UDP 1042 37477 → 60155 Len=1000 [UDP CHECKSUM INCORRECT]
15 0.224831 172.17.169.11 → 192.168.1.14 UDP 1042 60155 → 37477 Len=1000
16 0.224895 192.168.1.14 → 172.17.169.11 UDP 1042 37477 → 60155 Len=1000 [UDP CHECKSUM INCORRECT]
17 0.225069 172.17.169.11 → 192.168.1.14 UDP 1042 60155 → 37477 Len=1000
18 0.225160 192.168.1.14 → 172.17.169.11 UDP 1042 37477 → 60155 Len=1000 [UDP CHECKSUM INCORRECT]
19 0.225318 172.17.169.11 → 192.168.1.14 UDP 1042 60155 → 37477 Len=1000
20 0.225381 192.168.1.14 → 172.17.169.11 UDP 1042 37477 → 60155 Len=1000 [UDP CHECKSUM INCORRECT]
21 0.623280 192.168.1.14 → 172.17.169.11 TCP 722 [TCP Retransmission] 12865 → 48829 [PSH, ACK] Seq=1 Ack=657 Win=30336 [TCP CHECKSUM INCORRECT] Len=656 TSval=2727099774 TSecr=889162644
22 2.487277 192.168.1.14 → 172.17.169.11 TCP 722 [TCP Retransmission] 12865 → 48829 [PSH, ACK] Seq=1 Ack=657 Win=30336 [TCP CHECKSUM INCORRECT] Len=656 TSval=2727101638 TSecr=889162644
23 5.495297 00:00:00_01:01:14 → 00:00:00_00:00:01 ARP 42 Who has 192.168.1.254? Tell 192.168.1.14
24 5.500082 00:00:00_00:00:01 → 00:00:00_01:01:14 ARP 60 192.168.1.254 is at 00:00:00:00:00:01
25 6.199289 192.168.1.14 → 172.17.169.11 TCP 722 [TCP Retransmission] 12865 → 48829 [PSH, ACK] Seq=1 Ack=657 Win=30336 [TCP CHECKSUM INCORRECT] Len=656 TSval=2727105350 TSecr=889162644
26 13.687277 192.168.1.14 → 172.17.169.11 TCP 722 [TCP Retransmission] 12865 → 48829 [PSH, ACK] Seq=1 Ack=657 Win=30336 [TCP CHECKSUM INCORRECT] Len=656 TSval=2727112838 TSecr=889162644

Up to here, no acks! The client wasn't crazy, after all!

Then we have the 15s delay, and:

27 28.535347 192.168.1.14 → 172.17.169.11 TCP 722 [TCP Retransmission] 12865 → 48829 [PSH, ACK] Seq=1 Ack=657 Win=30336 [TCP CHECKSUM INCORRECT] Len=656 TSval=2727127686 TSecr=889162644
28 28.899312 172.17.169.11 → 192.168.1.14 TCP 78 48829 → 12865 [ACK] Seq=657 Ack=657 Win=30592 Len=0 TSval=889191300 TSecr=2727127686 SLE=1 SRE=657

Finally, an ack got through.

29 28.899365 192.168.1.14 → 172.17.169.11 TCP 722 12865 → 48829 [PSH, ACK] Seq=657 Ack=657 Win=30336 [TCP CHECKSUM INCORRECT] Len=656 TSval=2727128050 TSecr=889191300
30 28.899948 172.17.169.11 → 192.168.1.14 TCP 66 48829 → 12865 [ACK] Seq=657 Ack=1313 Win=31872 Len=0 TSval=889191400 TSecr=2727128050
31 28.910431 172.17.169.11 → 192.168.1.14 TCP 66 48829 → 12865 [FIN, ACK] Seq=657 Ack=1313 Win=31872 Len=0 TSval=889191411 TSecr=2727128050
32 28.910732 192.168.1.14 → 172.17.169.11 TCP 66 12865 → 48829 [FIN, ACK] Seq=1313 Ack=658 Win=30336 [TCP CHECKSUM INCORRECT] Len=0 TSval=2727128061 TSecr=889191411
33 28.911179 172.17.169.11 → 192.168.1.14 TCP 66 48829 → 12865 [ACK] Seq=658 Ack=1314 Win=31872 Len=0 TSval=889191411 TSecr=2727128061

And the test finishes.

Where did those acks go then!? Well, remember this part?
I didn't mark it above to not leave a spoiler there ;-)
vf0 handled ipv6, but:

$ tshark -r c_pf0vf0_invm_netperf.pcap ipv6 | wc -l
24762
$ tshark -r c_pf0vf0_invm_netperf.pcap ip | wc -l
8        <---- hmmmm!

Yes, there are the acks, and 2 UDPs as well:

$ tshark -r c_pf0vf0_invm_netperf.pcap ip
24763 1.528737 172.17.169.11 → 192.168.1.14 TCP 66 48829 → 12865 [ACK] Seq=1 Ack=1 Win=239 Len=0 TSval=889162680 TSecr=2727099296
24764 1.559703 172.17.169.11 → 192.168.1.14 TCP 66 [TCP Dup ACK 24763#1] 48829 → 12865 [ACK] Seq=1 Ack=1 Win=239 Len=0 TSval=889162716 TSecr=2727099296
24765 1.569744 172.17.169.11 → 192.168.1.14 UDP 1042 60155 → 37477 Len=1000
24766 1.967766 172.17.169.11 → 192.168.1.14 TCP 78 [TCP Dup ACK 24763#2] 48829 → 12865 [ACK] Seq=1 Ack=1 Win=239 Len=0 TSval=889163124 TSecr=2727099774 SLE=4294966641 SRE=1
24767 2.524712 172.17.169.11 → 192.168.1.14 UDP 60 60155 → 37477 Len=0
24768 3.832788 172.17.169.11 → 192.168.1.14 TCP 78 [TCP Dup ACK 24763#3] 48829 → 12865 [ACK] Seq=1 Ack=1 Win=239 Len=0 TSval=889164989 TSecr=2727101638 SLE=4294966641 SRE=1
24769 7.543747 172.17.169.11 → 192.168.1.14 TCP 78 [TCP Dup ACK 24763#4] 48829 → 12865 [ACK] Seq=1 Ack=1 Win=239 Len=0 TSval=889168700 TSecr=2727105350 SLE=4294966641 SRE=1
24770 15.031795 172.17.169.11 → 192.168.1.14 TCP 78 [TCP Dup ACK 24763#5] 48829 → 12865 [ACK] Seq=1 Ack=1 Win=239 Len=0 TSval=889176188 TSecr=2727112838 SLE=4294966641 SRE=1

I have no idea what could have caused this.

@Jianlin, please try with tc-policy of skip-hw on both hosts. Captures are welcomed too.

Anyway, I need to discuss this with Nvidia already.

@Alaa, please share these captures with the team within Nvidia.
To ease reading the above: the 3 captures I mentioned in there are the "invm" ones, done in the guests.
> @Alaa, please share these captures with the team within Nvidia.

Forwarded to the team under internal RM ticket 2736692.
tcpdump file when skip-hw is set: http://netqe-bj.usersys.redhat.com/share/jishi/bz1975085/log_072601_skip_hw/
(In reply to Jianlin Shi from comment #15)
> tcpdump file when skip-hw is set:
> http://netqe-bj.usersys.redhat.com/share/jishi/bz1975085/log_072601_skip_hw/

I'm seeing the same behavior here. Hmm.
Jianlin, can you please capture:
# ovs-appctl dpctl/dump-flows -m
# tc -s filter show dev <representors> ingress

for the client, while the ipv4 test is stuck? That would be 1s after starting the test.

As this is happening with skip_hw as well, it is more likely that either ovs is programming the tc flows wrongly or tc is misunderstanding ovs.
As the issue happens with both datapaths (tc sw and tc hw), it's more likely to be an issue on the control path. But let's see.
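One way to time that capture (a sketch using the commands above; c_pf0vf0/c_pf0vf1 are the client-side representors from this report):

# in the guest: start the test, e.g. iperf3 -u -c 10.10.1.14 -t 10
# on the client host, about 1 second later:
ovs-appctl dpctl/dump-flows -m > /tmp/dpflows_stuck.txt
for rep in c_pf0vf0 c_pf0vf1; do
    tc -s filter show dev "$rep" ingress > "/tmp/${rep}_tc_stuck.txt"
done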
(In reply to Marcelo Ricardo Leitner from comment #17)
> Jianlin, can you please capture:
> # ovs-appctl dpctl/dump-flows -m
> # tc -s filter show dev <representors> ingress
>
> for the client, while the ipv4 test is stuck? That would be 1s after starting the test.
>
> As this is happening with skip_hw as well, it is more likely that either ovs is programming tc flows wrongly or tc is misunderstanding ovs.
> As the issue happens with both datapaths (tc sw and tc hw), it's more likely to be an issue on control path. But lets see.

New log with tc and flow results uploaded:
http://netqe-bj.usersys.redhat.com/share/jishi/bz1975085/log_0727_skip_hw_withflow/
http://netqe-bj.usersys.redhat.com/share/jishi/bz1975085/log_0727_skip_hw_withflow/server/ens1f0_iperf.pcap

Filtering with "tcp.stream eq 2", I can see different geneve opts being used for this connection, for packets server->client (i.e., with ip.dst == 172.17.169.11).

Then here, on the client,
http://netqe-bj.usersys.redhat.com/share/jishi/bz1975085/log_0727_skip_hw_withflow/client/netperf_flow.log
there are these two flows:

ufid:92276a3c-7fdd-4587-9fe5-6f3169cc7e70, skb_priority(0/0),tunnel(tun_id=0x1,src=20.0.169.25,dst=20.0.169.26,ttl=0/0,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x20004/0x7fffffff}),flags(+key)),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(genev_sys_6081),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:00:00:01,dst=00:00:00:00:00:00/01:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:11, bytes:10166, used:4.640s, offloaded:yes, dp:tc, actions:c_pf0vf1

ufid:a82edb63-ca7e-4d72-b6bd-20af1f842e4c, skb_priority(0/0),tunnel(tun_id=0x1,src=20.0.169.25,dst=20.0.169.26,ttl=0/0,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x20003/0x7fffffff}),flags(+key)),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(genev_sys_6081),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:00:00:01,dst=00:00:00:00:00:00/01:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:5, bytes:1324, used:2.590s, offloaded:yes, dp:tc, actions:c_pf0vf0

They are mostly the same and differ only on the geneve opts. The two above do basically: 0x20004->c_pf0vf1 and 0x20003->c_pf0vf0.

Sounds like it's the server that is generating packets with the wrong options.
Then on the server side:
http://netqe-bj.usersys.redhat.com/share/jishi/bz1975085/log_0727_skip_hw_withflow/server/netperf_flow.log

2 for ipv4:

ufid:3e4e52ed-c555-4fc6-bb1f-0abc14f3ec2b, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0xe4b),dp_hash(0/0),in_port(s_pf0vf2),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:01:02:11,dst=00:00:00:00:00:02),eth_type(0x0800),ipv4(src=128.0.0.0/192.0.0.0,dst=192.168.1.14,proto=6,tos=0/0x3,ttl=64,frag=no),tcp(src=0/0,dst=0/0), packets:6, bytes:1280, used:0.920s, offloaded:yes, dp:tc, actions:ct_clear,set(tunnel(tun_id=0x1,dst=20.0.169.26,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x20004}),flags(csum|key))),set(eth(src=00:00:00:00:00:01,dst=00:00:00:01:01:14)),set(ipv4(ttl=63)),genev_sys_6081

ufid:c832d8b4-597e-4e91-9b76-ffa0da62f3c5, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0xe4b),dp_hash(0/0),in_port(s_pf0vf2),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:01:02:11,dst=00:00:00:00:00:02),eth_type(0x0800),ipv4(src=128.0.0.0/192.0.0.0,dst=192.168.1.14,proto=17,tos=0/0x3,ttl=64,frag=no),udp(src=0/0,dst=0/0), packets:10, bytes:9442, used:2.970s, offloaded:yes, dp:tc, actions:ct_clear,set(tunnel(tun_id=0x1,dst=20.0.169.26,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x20004}),flags(csum|key))),set(eth(src=00:00:00:00:00:01,dst=00:00:00:01:01:14)),set(ipv4(ttl=63)),genev_sys_6081

2 for ipv6:

ufid:d5279f13-71cd-4909-851a-786eeb5585d6, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0xe4b),dp_hash(0/0),in_port(s_pf0vf2),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:01:02:11,dst=00:00:00:00:00:02),eth_type(0x86dd),ipv6(src=7777:169::11,dst=2001::13,label=0/0,proto=6,tclass=0/0x3,hlimit=64,frag=no),tcp(src=0/0,dst=0/0), packets:6, bytes:1088, used:4.740s, offloaded:yes, dp:tc, actions:ct_clear,set(tunnel(tun_id=0x1,dst=20.0.169.26,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x20003}),flags(csum|key))),set(eth(src=00:00:00:00:00:01,dst=00:00:00:01:01:13)),set(ipv6(hlimit=63)),genev_sys_6081

ufid:2b68c46b-0c10-426c-a6ad-1fc63b04d102, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0xe4b),dp_hash(0/0),in_port(s_pf0vf2),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:01:02:11,dst=00:00:00:00:00:02),eth_type(0x86dd),ipv6(src=7777:169::11,dst=2001::13,label=0/0,proto=17,tclass=0/0x3,hlimit=64,frag=no),udp(src=0/0,dst=0/0), packets:12745, bytes:624504, used:4.740s, offloaded:yes, dp:tc, actions:ct_clear,set(tunnel(tun_id=0x1,dst=20.0.169.26,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x20003}),flags(csum|key))),set(eth(src=00:00:00:00:00:01,dst=00:00:00:01:01:13)),set(ipv6(hlimit=63)),genev_sys_6081

ipv4 is using 20004 while ipv6 is using 20003. So I can't understand why packets 88 and 99 on the capture mentioned above are using 0x20003 while the others for this connection use 20004. I need to understand more how geneve options are being used here.

Note that the last captures were using HWOL, so not all traffic is visible on the PF and the representor.

One other weird thing worth noting is that here:
http://netqe-bj.usersys.redhat.com/share/jishi/bz1975085/log_0727_skip_hw_withflow/server/ens1f0_iperf.pcap
packets 129 and 132 are likely a reordering within the server. The TSvals for these are not in the right order, while they show up in the right order on the client side, c_pf0vf1_invm_iperf.pcap. No idea how this happened on the server side.
(In reply to Jianlin Shi from comment #15)
> tcpdump file when skip-hw is set:
> http://netqe-bj.usersys.redhat.com/share/jishi/bz1975085/log_072601_skip_hw/

I was taking a deeper look at this one now and it seems it used skip_hw only at the server side. There are gaps on the representors for the client side that I cannot explain otherwise. Also, the s_pf0vf2_rep-iperf3.pcap is empty, which is weird with or without skip_hw, and I see only 2 IPv6 UDP packets on the server uplink representor (which indicates they got offloaded, because I see many more on the invm capture).

Today I realized I may have been looking at the captures from the wrong side.

(In reply to Jianlin Shi from comment #0)
...
> ovn-nbctl lsp-set-addresses s_pf0vf2 "00:00:00:01:02:11 172.17.172.11 7777:172::11"
...
> 7. run "iperf3 -u -c 3010::13 -t 1; iperf3 -u -c 10.10.1.14 -t 1" in vm g2 (172.17.172.11)

I think I assumed the interfaces "s_" and "c_" would match the iperf/netperf sides, but it seems it's the other way around? Please confirm.

With that, from the captures on log_072601_skip_hw:
http://netqe-bj.usersys.redhat.com/share/jishi/bz1975085/log_072601_skip_hw/server/ens1f0_iperf.pcap
- The first packet that got routed to the wrong VF on the client is packet #65.
- The 2 packets that show up in the representor and have weird Geneve opts are packets #36 and #37, from the client.
  They have an absolute seq of tcp.seq == 286805185 and tsval tcp.options.timestamp.tsval == 1032773167.
- These 2 packets DON'T show up in the representors on the client side.
- The client side DOES have 2 other packets with weird Geneve opts, packets #20 and #21. They have TSecr matching the TSval above and they ack the byte sent by packet #37 mentioned above.

In short, the misrouting happens after the Geneve opts get funky, and it seems the tunnels end up swapped somehow.

Jianlin, I need another test:
- please make sure to use skip_hw on both sides.
- no need to test netperf anymore, thanks.
- capture the output of 'tc -ts monitor' on the hosts during the whole test.

The tc monitor output will help understand why packets #36 and #37 had different geneve opts. If there were datapath changes in between, we'll see them.

Thanks!
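A sketch of how that could be collected (tc monitor prints qdisc/filter/action add, change and delete events with timestamps while it runs; the log file name is just an example):

tc -ts monitor > /tmp/tc_monitor_$(hostname -s).log 2>&1 &
TCMON=$!
# ... run the whole iperf3 test on both hosts ...
kill $TCMON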
On hw-offload and tc-policy: enabling hw-offload doesn't need an ovs restart, but tc-policy is only initialized when hw-offload is enabled. So:

setting hw-offload=true
setting tc-policy=skip_hw

yields wrong results, as skip_hw will only be effective after a restart. It needs to be the other way around:

setting tc-policy=skip_hw
setting hw-offload=true

or restart ovs after adjusting everything.
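In other words, something like this (a sketch of the ordering described above, using the same ovs-vsctl knobs already used in this bug):

# either set the policy before enabling offload...
ovs-vsctl set Open_vSwitch . other_config:tc-policy=skip_hw
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
# ...or, if hw-offload was already enabled, restart after changing tc-policy
systemctl restart openvswitch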
(In reply to Marcelo Ricardo Leitner from comment #20)
> I was taking a deeper look at this one now and it seems it used skip_hw only at the server side. There are gaps on the representors for the client side that I cannot explain otherwise. Also, the s_pf0vf2_rep-iperf3.pcap is empty, which is weird with or without skip_hw, and I see only 2 IPv6 UDP packets on the server uplink representor (which indicates they got offloaded, because I see many more on invm capture).
>
> Today I realized I may have been looking at the captures from the wrong side.
>
> I think I assumed the interfaces "s_" and "c_" would match the iperf/netperf sides, but seems it's the other way around? Please confirm.

Yes, s_ is the server side of the test case; it runs iperf3 -c. c_ runs iperf3 -s. Sorry for the misunderstanding.

> With that, from the captures on log_072601_skip_hw:
> http://netqe-bj.usersys.redhat.com/share/jishi/bz1975085/log_072601_skip_hw/server/ens1f0_iperf.pcap
> - The first packet that got routed to the wrong VF on client is packet #65.
> - The 2 packets that show up in the representor and have weird Geneve opts, are packets #36 and #37, from client.
> - These 2 packets DON'T show up in the representors on the client side.
> - The client side DO have 2 other packets with weird Geneve opts, packets #20 and #21.
>
> In short, the misrouting happens after the Geneve opts get funky and seems the tunnels end up swapped somehow.
>
> Jianlin, I need another test:
> - please make sure to use skip_hw on both sides.
> - no need to test netperf anymore, thanks.
> - capture the output of: 'tc -ts monitor' on the hosts during the whole test.

In http://netqe-bj.usersys.redhat.com/share/jishi/bz1975085/log_072601_skip_hw/ I had set tc-policy as ski-hw rather than skip_hw, and confirmed that the right configuration is skip_hw.

And with 4.18.0-323, after setting tc-policy to skip_hw, the issue doesn't happen.

[root@wsfd-advnetlab16 offload_func]# rpm -qa | grep -E "openvswitch2.15|ovn-2021"
ovn-2021-central-21.06.0-12.el8fdp.x86_64
openvswitch2.15-2.15.0-27.el8fdp.x86_64
ovn-2021-21.06.0-12.el8fdp.x86_64
ovn-2021-host-21.06.0-12.el8fdp.x86_64
python3-openvswitch2.15-2.15.0-27.el8fdp.x86_64

[root@wsfd-advnetlab16 offload_func]# ovs-vsctl list open
_uuid               : 239269dc-024a-4f75-8a7b-1b413bbb2969
bridges             : [57531cc0-286d-4d8b-8ef1-8e30c9c1720a]
cur_cfg             : 13
datapath_types      : [netdev, system]
datapaths           : {}
db_version          : "8.2.0"
dpdk_initialized    : false
dpdk_version        : "DPDK 20.11.1"
external_ids        : {hostname=wsfd-advnetlab16.anl.lab.eng.bos.redhat.com, ovn-encap-ip="20.0.170.25", ovn-encap-type=geneve, ovn-remote="tcp:20.0.170.25:6642", rundir="/var/run/openvswitch", system-id="9c0c427c-617e-4fff-ac34-eb8c9bacb87a"}
iface_types         : [bareudp, erspan, geneve, gre, gtpu, internal, ip6erspan, ip6gre, lisp, patch, stt, system, tap, vxlan]
manager_options     : []
next_cfg            : 13
other_config        : {hw-offload="true", tc-policy=skip_hw}
ovs_version         : "2.15.2"
ssl                 : []
statistics          : {}
system_type         : rhel
system_version      : "8.5"

> The tc monitor output will help understand why packets #36 and #37 had different geneve opts. If there were datapath changes in between, we'll see it.
>
> Thanks!
log when set tc-policy=skip_hw: http://netqe-bj.usersys.redhat.com/share/jishi/bz1975085/log_080201_skip_hw/
After upgrading the kernel to 4.18.0-324, the issue doesn't happen.

[root@wsfd-advnetlab16 offload_func]# uname -a
Linux wsfd-advnetlab16.anl.lab.eng.bos.redhat.com 4.18.0-324.el8.x86_64 #1 SMP Wed Jul 21 22:11:34 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
You mean that even without skip_hw it works with -324.el8?

That's good news, Jianlin. Thanks. So we probably already have a fix. It's in the middle of a haystack, but well.. :D

(In reply to Jianlin Shi from comment #22)
> and with 4.18.0-323. after set tc-policy as skip_hw. the issue doesn't happen.

(In reply to Jianlin Shi from comment #24)
> after upgrade kernel to 4.18.0-324. the issue doesn't happen.

In there, there's
6a8b79a09de1 Merge: mlx5: Update to v5.12

If I understood Jianlin right, Alaa, the fix is likely within the driver rebase in there.
(In reply to Jianlin Shi from comment #22)
> in http://netqe-bj.usersys.redhat.com/share/jishi/bz1975085/log_072601_skip_hw/.
> I set tc-policy as ski-hw rather than skip_hw.
> and confirmed that the right configuration is skip_hw.

Ouch. I've done this myself and I'm sorry that this hit you. I seized the moment and reported:
https://bugzilla.redhat.com/show_bug.cgi?id=1990130
Removing the UDP from summary as this affected TCP packets of the control connection as well.
(In reply to Marcelo Ricardo Leitner from comment #25)
> You mean that even without skip_hw it works with -324.el8?

On both 323 and 324, when skip_hw is set (not the wrong setting skip-hw), the issue doesn't occur.

> That's good news, Jianlin. Thanks. So we probably have a fix yet. In the middle of a haystack, but well.. :D
>
> (In reply to Jianlin Shi from comment #22)
> > and with 4.18.0-323. after set tc-policy as skip_hw. the issue doesn't happen.
>
> (In reply to Jianlin Shi from comment #24)
> > after upgrade kernel to 4.18.0-324. the issue doesn't happen.
>
> In there, there's
> 6a8b79a09de1 Merge: mlx5: Update to v5.12
>
> If I understood Jianlin right, Alaa, the fix is likely within the driver rebase in there.

And on 324, even without skip_hw, the issue doesn't occur, which means the issue is fixed in 324.
I missed reassigning this one, as we mentioned on the mtg last week.
Hi, Jianlin.

Marcelo said that we need to find the patch(es) that resolved this bug so that we can add them to RHEL-8.4.z if possible.
Between kernels 4.18.0-323.el8 and 4.18.0-324.el8 a lot of changes were made to the mlx5 driver, so we need to bisect it...

Can you help with providing a setup and steps for reproducing the bug?

Thanks,
Alaa
(In reply to Alaa Hleihel (NVIDIA Mellanox) from comment #30)
> Hi, Jianlin.
> 
> Marcelo said that we need to find the patch(es) that resolved this bug so
> that we can add them to RHEL-8.4.z if possible.
> Between kernels 4.18.0-323.el8 and 4.18.0-324.el8 a lot of changes were made
> to the mlx5 driver, so we need to bisect it...
> 
> Can you help with providing a setup and steps for reproducing the bug?
> 
> Thanks,
> Alaa

1. setup ovn (ovn and ovs should be installed), vm and sriov on server:

cat > /etc/udev/rules.d/80-persistent-ens1f0.rules <<-EOF
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", KERNELS=="0000:3b:00.0", NAME="ens1f0"
EOF
echo 4 > /sys/bus/pci/devices/0000:3b:00.0/sriov_numvfs
echo 0000:3b:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind
echo 0000:3b:00.3 > /sys/bus/pci/drivers/mlx5_core/unbind
echo 0000:3b:00.4 > /sys/bus/pci/drivers/mlx5_core/unbind
echo 0000:3b:00.5 > /sys/bus/pci/drivers/mlx5_core/unbind
devlink dev eswitch set pci/0000:3b:00.0 mode switchdev
ip link set ens1f0 mtu 9000
ip link set ens1f0 up
ip addr add 1.1.171.25/24 dev ens1f0
rm /var/lib/libvirt/images/g2.qcow2
cp /var/lib/libvirt/images/rhel8.3.qcow2 /var/lib/libvirt/images/g2.qcow2
virsh net-define /usr/share/libvirt/networks/default.xml; virsh net-start default
virt-install --name g2 --vcpus=2 --ram=2048 --disk path=/var/lib/libvirt/images/g2.qcow2,device=disk,bus=virtio,format=qcow2 --network bridge=virbr0,model=virtio --boot hd --accelerate --force --graphics none --noautoconsole
sleep 90
cat > vf.xml << EOF
<interface type='hostdev' managed='yes'>
  <source>
    <address type='pci' domain='0x0000' bus='0x3b' slot='0x00' function='0x4'/>
  </source>
  <mac address='00:00:00:01:02:11'/>
</interface>
EOF
virsh attach-device g2 vf.xml
systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:1.1.171.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=1.1.171.25
systemctl restart ovn-controller
ovn-nbctl ls-add ls1
ovn-nbctl ls-add ls2
ovn-nbctl lr-add lr1
ovn-nbctl lrp-add lr1 lr1-ls1 00:00:00:00:00:01 192.168.1.254/24 2001::a/64
ovn-nbctl lsp-add ls1 ls1-lr1
ovn-nbctl lsp-set-type ls1-lr1 router
ovn-nbctl lsp-set-options ls1-lr1 router-port=lr1-ls1
ovn-nbctl lsp-set-addresses ls1-lr1 router
ovn-nbctl lrp-add lr1 lr1-ls2 00:00:00:00:00:02 172.17.171.254/24 7777:171::a/64
ovn-nbctl lsp-add ls2 ls2-lr1
ovn-nbctl lsp-set-type ls2-lr1 router
ovn-nbctl lsp-set-options ls2-lr1 router-port=lr1-ls2
ovn-nbctl lsp-set-addresses ls2-lr1 router
ovn-nbctl lsp-add ls1 s_pf0vf0
ovn-nbctl lsp-set-addresses s_pf0vf0 "00:00:00:01:01:11 192.168.1.11 2001::11"
ovn-nbctl lsp-add ls2 s_pf0vf2
ovn-nbctl lsp-set-addresses s_pf0vf2 "00:00:00:01:02:11 172.17.171.11 7777:171::11"
ovn-nbctl lsp-add ls1 c_pf0vf0
ovn-nbctl lsp-set-addresses c_pf0vf0 '00:00:00:01:01:13 192.168.1.13 2001::13'
ovn-nbctl lsp-add ls1 c_pf0vf1
ovn-nbctl lsp-set-addresses c_pf0vf1 '00:00:00:01:01:14 192.168.1.14 2001::14'
ovn-nbctl lr-nat-add lr1 dnat_and_snat 3010::13 2001::13
ovn-nbctl lr-nat-add lr1 dnat 10.10.1.14 192.168.1.14
ip link set eth2 down
ip link set eth2 name s_pf0vf2
ovs-vsctl add-port br-int s_pf0vf2 -- set interface s_pf0vf2 external_ids:iface-id=s_pf0vf2
ip link set s_pf0vf2 up
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
ovs-vsctl set Open_vSwitch . other_config:tc-policy=none
systemctl restart openvswitch
sleep 2
chassis_id=$(ovn-sbctl find chassis hostname=$(hostname) | awk '/^name/{print $3}' | sed 's/"//g')
ovn-nbctl set logical_router lr1 options:chassis=$chassis_id

2. setup ovn, vm and sriov on client:

cat > /etc/udev/rules.d/80-persistent-ens1f0.rules <<-EOF
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", KERNELS=="0000:3b:00.0", NAME="ens1f0"
EOF
echo 4 > /sys/bus/pci/devices/0000:3b:00.0/sriov_numvfs
echo 0000:3b:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind
echo 0000:3b:00.3 > /sys/bus/pci/drivers/mlx5_core/unbind
echo 0000:3b:00.4 > /sys/bus/pci/drivers/mlx5_core/unbind
echo 0000:3b:00.5 > /sys/bus/pci/drivers/mlx5_core/unbind
devlink dev eswitch set pci/0000:3b:00.0 mode switchdev
ip link set ens1f0 mtu 9000
ip link set ens1f0 up
ip addr add 1.1.171.26/24 dev ens1f0
rm /var/lib/libvirt/images/g0.qcow2
cp /var/lib/libvirt/images/rhel8.3.qcow2 /var/lib/libvirt/images/g0.qcow2
virsh net-define /usr/share/libvirt/networks/default.xml; virsh net-start default
virt-install --name g0 --vcpus=2 --ram=2048 --disk path=/var/lib/libvirt/images/g0.qcow2,device=disk,bus=virtio,format=qcow2 --network bridge=virbr0,model=virtio --boot hd --accelerate --force --graphics none --noautoconsole
rm /var/lib/libvirt/images/g1.qcow2
cp /var/lib/libvirt/images/rhel8.3.qcow2 /var/lib/libvirt/images/g1.qcow2
virt-install --name g1 --vcpus=2 --ram=2048 --disk path=/var/lib/libvirt/images/g1.qcow2,device=disk,bus=virtio,format=qcow2 --network bridge=virbr0,model=virtio --boot hd --accelerate --force --graphics none --noautoconsole
sleep 90
cat > vf.xml << EOF
<interface type='hostdev' managed='yes'>
  <source>
    <address type='pci' domain='0x0000' bus='0x3b' slot='0x00' function='0x2'/>
  </source>
  <mac address='00:00:00:01:01:13'/>
</interface>
EOF
virsh attach-device g0 vf.xml
cat > vf.xml << EOF
<interface type='hostdev' managed='yes'>
  <source>
    <address type='pci' domain='0x0000' bus='0x3b' slot='0x00' function='0x3'/>
  </source>
  <mac address='00:00:00:01:01:14'/>
</interface>
EOF
virsh attach-device g1 vf.xml
systemctl start openvswitch
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:1.1.171.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=1.1.171.26
systemctl restart ovn-controller
ip link set eth0 down
ip link set eth0 name c_pf0vf0
ovs-vsctl add-port br-int c_pf0vf0 -- set interface c_pf0vf0 external_ids:iface-id=c_pf0vf0
ip link set c_pf0vf0 up
ip link set eth1 down
ip link set eth1 name c_pf0vf1
ovs-vsctl add-port br-int c_pf0vf1 -- set interface c_pf0vf1 external_ids:iface-id=c_pf0vf1
ip link set c_pf0vf1 up
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
ovs-vsctl set Open_vSwitch . other_config:tc-policy=none
systemctl restart openvswitch

3. setup ip and start iperf3 in vm g2 on server:

yum -y install wget unzip tcpdump iperf3 nc
nmcli d set ens6 managed no
ip addr add 172.17.171.11/24 dev ens6
ip route add default via 172.17.171.254 dev ens6
ip addr add 7777:171::11/64 dev ens6
ip -6 route add default via 7777:171::a dev ens6
setenforce 0
systemctl stop firewalld

4. setup ip and start iperf3 in vm g0 on client:

yum -y install wget unzip tcpdump iperf3 nc
nmcli d set ens6 managed no
ip addr add 192.168.1.13/24 dev ens6
ip route add default via 192.168.1.254 dev ens6
ip addr add 2001::13/64 dev ens6
ip -6 route add default via 2001::a dev ens6
iperf3 -s -D &
setenforce 0
systemctl stop firewalld

5. setup ip and start iperf3 in vm g1 on client:

yum -y install wget unzip tcpdump iperf3 nc
reboot
nmcli d set ens6 managed no
ip addr add 192.168.1.14/24 dev ens6
ip route add default via 192.168.1.254 dev ens6
ip addr add 2001::14/64 dev ens6
ip -6 route add default via 2001::a dev ens6
setenforce 0
systemctl stop firewalld
iperf3 -s -D &

6. run iperf3 in vm g2 on server:

iperf3 -u -c 3010::13 -t 1; iperf3 -u -c 10.10.1.14 -t 1
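When reusing these steps, a quick way to confirm that the NATed flows are actually being offloaded (rather than silently falling back to the software path) is to dump the offloaded datapath flows and the TC rules on the VF representor while the iperf3 traffic is running. This is a minimal sketch, not part of the original reproducer; it assumes the representor names used above (s_pf0vf2 on the server-side hypervisor):

# on the hypervisor, while iperf3 is running in the VM
# datapath flows that OVS reports as offloaded to hardware
ovs-appctl dpctl/dump-flows type=offloaded
# TC flower rules and stats on the VF representor
tc -s filter show dev s_pf0vf2 ingress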
Hi Jianlin,

Can you please provide a setup where the issue reproduces?

Thanks,
Amir
Please refer to comment 31.
Hi Marcelo, can you please help with providing a setup?
Hi Amir, I'm afraid we can't at the moment. We're very close to 8.5 GA and HW is scarce. Hopefully Jianlin's instructions in comment #31 are enough? Thanks for understanding.
I reproduced the issue on wsfd-advnetlab16.anl.lab.eng.bos.redhat.com; you can log in with root/redhat and run "tmux attach -t test" to view the result. The kernel is 4.18.0-323, and when running "iperf3 -u -c 3010::13 -t 1; iperf3 -u -c 10.10.1.14 -t 1", the "iperf3 -u -c 10.10.1.14 -t 1" command hangs for several seconds.
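A rough way to quantify the hang and to see whether the NATed datagrams reach the destination VM at all (the 15-second timeout and the interface name are assumptions for illustration, not taken from the report). Note that once a flow is fully offloaded, packets bypass the kernel and no longer appear on the representor, so tcpdump mainly shows the first packets of each connection:

# in vm g2: measure how long the NATed IPv4 run takes
time timeout 15 iperf3 -u -c 10.10.1.14 -t 1
# on the client hypervisor: watch the representor of the DNAT target VM
# (c_pf0vf1 in comment 31's naming) for the iperf3 UDP traffic
tcpdump -nn -i c_pf0vf1 udp port 5201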
Hi Amir,

Have you finished using the systems? If you are not using them, I need to re-provision the machines to run some other tests.
(In reply to Jianlin Shi from comment #37)
> Hi Amir,
> 
> Have you finished using the systems? if you are not using them, I need to
> re-provision the machines to run some other tests.

Hi, unfortunately I didn't have the opportunity to start.
If you need the systems, you can have them; please inform me when they are available again for bisecting.
Thanks!
(In reply to Amir Tzin from comment #38)
> (In reply to Jianlin Shi from comment #37)
> > Hi Amir,
> > 
> > Have you finished using the systems? if you are not using them, I need to
> > re-provision the machines to run some other tests.
> 
> Hi, unfortunately I didn't have the opportunity to start.
> If you need it, You can have it and inform me when it is available again for
> bisecting.
> thanks!

Can you start using them now and inform me after you finish?
(In reply to Jianlin Shi from comment #39)
> (In reply to Amir Tzin from comment #38)
> > (In reply to Jianlin Shi from comment #37)
> > > Hi Amir,
> > > 
> > > Have you finished using the systems? if you are not using them, I need to
> > > re-provision the machines to run some other tests.
> > 
> > Hi, unfortunately I didn't have the opportunity to start.
> > If you need it, You can have it and inform me when it is available again for
> > bisecting.
> > thanks!
> 
> can you start to use it now and inform me after you finish.

Yes, I can start working on it today.
Do you have a script for configuration after system boot?
(In reply to Amir Tzin from comment #40)
> 
> yes,
> I can start working on it today.
> Do you have a script for configuration after system boot ?

on both systems:

cd /mnt/tests/kernel/networking/openvswitch/ovn/offload_func
source ~/env.sh
export TEST_ITEMS=nic_nat_test
make run
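For a per-boot pass/fail check while bisecting, a minimal sketch that could be run inside vm g2 once both hosts are configured. The criterion ("does the NATed IPv4 run finish within 15 seconds") is an assumption based on the hang described above, not part of the nic_nat_test harness itself:

#!/bin/bash
# hypothetical check run inside vm g2 after both hosts complete the steps above
iperf3 -u -c 3010::13 -t 1                         # IPv6 dnat_and_snat path
if timeout 15 iperf3 -u -c 10.10.1.14 -t 1; then   # IPv4 dnat path
    echo "good: NATed iperf3 completed"
else
    echo "bad: NATed iperf3 hung or errored"
    exit 1
fi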
(In reply to Jianlin Shi from comment #41)
> (In reply to Amir Tzin from comment #40)
> > 
> > yes,
> > I can start working on it today.
> > Do you have a script for configuration after system boot ?
> 
> on both systems:
> cd /mnt/tests/kernel/networking/openvswitch/ovn/offload_func
> source ~/env.sh
> export TEST_ITEMS=nic_nat_test
> make run

Hi Jianlin,

What about the second setup for the client side? You only gave me one
(wsfd-advnetlab16.anl.lab.eng.bos.redhat.com). Is it
wsfd-advnetlab19.anl.lab.eng.bos.redhat.com? That one seems to be down.
(In reply to Amir Tzin from comment #42)
> 
> what about the second setup for client side ?
> you only gave me one (wsfd-advnetlab16.anl.lab.eng.bos.redhat.com)
> is it wsfd-advnetlab19.anl.lab.eng.bos.redhat.com ? this one seems to be
> down.

It is wsfd-advnetlab17.anl.lab.eng.bos.redhat.com, with the same username and
password. You can see that 17 is logged in on the right side of the tmux
session.
(In reply to Jianlin Shi from comment #43)
> (In reply to Amir Tzin from comment #42)
> > 
> > what about the second setup for client side ?
> > you only gave me one (wsfd-advnetlab16.anl.lab.eng.bos.redhat.com)
> > is it wsfd-advnetlab19.anl.lab.eng.bos.redhat.com ? this one seems to be
> > down.
> 
> it is wsfd-advnetlab17.anl.lab.eng.bos.redhat.com. the same username and
> password. and you can see 17 is logged in on the right side of the tmux
> session.

Thanks!
(In reply to Jianlin Shi from comment #43)
> (In reply to Amir Tzin from comment #42)
> > 
> > what about the second setup for client side ?
> > you only gave me one (wsfd-advnetlab16.anl.lab.eng.bos.redhat.com)
> > is it wsfd-advnetlab19.anl.lab.eng.bos.redhat.com ? this one seems to be
> > down.
> 
> it is wsfd-advnetlab17.anl.lab.eng.bos.redhat.com. the same username and
> password. and you can see 17 is logged in on the right side of the tmux
> session.

After rebooting, I cannot log in to wsfd-advnetlab17.anl.lab.eng.bos.redhat.com
with root/redhat. Are you sure this is the right combination?
(In reply to Amir Tzin from comment #45)
> (In reply to Jianlin Shi from comment #43)
> > it is wsfd-advnetlab17.anl.lab.eng.bos.redhat.com. the same username and
> > password. and you can see 17 is logged in on the right side of the tmux
> > session.
> 
> after rebooting I cannot log to wsfd-advnetlab17.anl.lab.eng.bos.redhat.com
> with root/redhat are you sure this is the combination ?

I've reset the password; please try again.
Bisect result: ff8bc4d34935 ("net/mlx5e: Consider geneve_opts for encap contexts") is the first "good" commit.
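For reference, a bisect like this is typically driven with git bisect on a kernel tree containing both builds, rebuilding and installing the kernel at each step and rerunning the reproducer. A rough sketch only: the revisions are placeholders, the custom terms are just a convenience for hunting a fix rather than a regression, and none of this is taken from the actual bisect done here:

git bisect start --term-old=broken --term-new=fixed
git bisect broken <commit-of-4.18.0-323.el8>   # kernel that still hangs
git bisect fixed  <commit-of-4.18.0-324.el8>   # kernel that already passes
# at each step: build and install the checked-out kernel, reboot both hosts,
# rerun the reproducer from comment 31, then mark the result:
#   git bisect fixed    (NATed iperf3 completed)
#   git bisect broken   (still hangs)
# repeat until git prints the first "fixed" commit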
A RHEL-8.4 test kernel with ("net/mlx5e: Consider geneve_opts for encap contexts") backported passed the test: kernel 4.18.0_jenev_opts on wsfd-advnetlab16.anl.lab.eng.bos.redhat.com, or https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=41057170
Super. That's now reason enough to request a z-stream of that commit.

Amir, please:
- considering it wasn't a trivial backport, make available a copy of the backported patch here or in the next bz
- add a z-stream request for it in the original bz that included this patch in y-stream, https://bugzilla.redhat.com/show_bug.cgi?id=1915308

We need the request to be made there (and not here), per the kernel sustaining team process.

Thanks!
(In reply to Marcelo Ricardo Leitner from comment #49)
> Super. That's now reason enough to request a z-stream of that commit.
> 
> Amir, please:
> - considering it wasn't a trivial backport, make available a copy of the
> backported patch here or in the next bz:
> - add a z-stream request for it in the original bz that included this patch
> in y-stream, https://bugzilla.redhat.com/show_bug.cgi?id=1915308
> 
> We need the request to be made there (and not here), per the kernel
> sustaining team process.
> 
> Thanks!

Hi,

I made the z-stream request in https://bugzilla.redhat.com/show_bug.cgi?id=1915308
and also added the 8.4 version of the patch there.
Thanks Amir.
Jianlin, you may re-test this one with kernel-4.18.0-305.31.1.el8_4 or the latest 8.y kernel. The fix was backported via https://bugzilla.redhat.com/show_bug.cgi?id=2023918 for 8.4.z.
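Before re-testing, a quick sanity check (not from this bz) that the installed kernel build actually carries the backport; assuming the build's RPM changelog lists individual backported patches, as RHEL kernels typically do, the patch subject should show up:

uname -r
rpm -q --changelog kernel | grep -i "geneve_opts"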
(In reply to Marcelo Ricardo Leitner from comment #52)
> Jianlin, you may re-test this one with kernel-4.18.0-305.31.1.el8_4 or
> latest 8.y kernel.
> The fix was backported via
> https://bugzilla.redhat.com/show_bug.cgi?id=2023918 for 8.4.z.

the problem doesn't exist on 4.18.0-305.31.1.el8_4.x86_64:

[root@wsfd-advnetlab16 offload_func]# uname -a
Linux wsfd-advnetlab16.anl.lab.eng.bos.redhat.com 4.18.0-305.31.1.el8_4.x86_64 #1 SMP Mon Dec 6 06:35:24 EST 2021 x86_64 x86_64 x86_64 GNU/Linux

iperf3 -u -c 3010::13 -t 1; time iperf3 -u -c 10.10.1.14 -t 1:

Connecting to host 3010::13, port 5201
[ 5] local 7777:183::11 port 52951 connected to 3010::13 port 5201
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[ 5]   0.00-1.00   sec   128 KBytes  1.05 Mbits/sec  92
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[ 5]   0.00-1.00   sec   128 KBytes  1.05 Mbits/sec  0.000 ms  0/92 (0%)  sender
[ 5]   0.00-1.12   sec   128 KBytes  935 Kbits/sec   0.032 ms  0/92 (0%)  receiver

iperf Done.
Connecting to host 10.10.1.14, port 5201
[ 5] local 172.17.183.11 port 39777 connected to 10.10.1.14 port 5201
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[ 5]   0.00-1.00   sec   129 KBytes  1.05 Mbits/sec  91
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[ 5]   0.00-1.00   sec   129 KBytes  1.05 Mbits/sec  0.000 ms  0/91 (0%)  sender
[ 5]   0.00-1.03   sec   129 KBytes  1.02 Mbits/sec  0.157 ms  0/91 (0%)  receiver

iperf Done.

real    0m1.442s
user    0m0.026s
sys     0m0.039s
Cool. Thanks Jianlin. Can we close this bz then? I think we're good here now.
(In reply to Marcelo Ricardo Leitner from comment #54)
> Cool. Thanks Jianlin.
> Can we close this bz then? I think we're good here now.

Yes, I agree.