Bug 2108213

Summary: [OSP17] Traffic (UDP, TCP, ICMP) is not offloaded when using a VLAN provider network
Product: Red Hat OpenStack
Reporter: Miguel Angel Nieto <mnietoji>
Component: openvswitch
Assignee: Haresh Khandelwal <hakhande>
Status: VERIFIED
QA Contact: Miguel Angel Nieto <mnietoji>
Severity: high
Priority: high
Version: 17.0 (Wallaby)
CC: apevec, chrisw, fleitner, hakhande, lariel, mleitner, oblaut, pgrist, ralonsoh, vkhitrin
Target Milestone: ga
Keywords: Regression, TestOnly, Triaged
Target Release: 17.1
Hardware: Unspecified
OS: Unspecified
Type: Bug

Description Miguel Angel Nieto 2022-07-18 15:35:05 UTC
Description of problem:

TCP traffic is not offloaded when using a VLAN provider network.


Version-Release number of selected component (if applicable):
RHOS-17.0-RHEL-9-20220701.n.1


How reproducible:
1. Deploy an OVS hardware-offload setup (templates attached).
2. Run the test case nfv_tempest_plugin.tests.scenario.test_nfv_offload.TestNfvOffload.test_offload_tcp
This test case creates 2 VMs, one on each compute node. Each VM has 2 interfaces (Geneve and VLAN), and traffic on both should be offloaded. I have checked that Geneve traffic is offloaded, but VLAN traffic is not. The test case runs the following commands in the VMs:
nohup iperf -s -B 30.30.220.191 -p 8272 -t 10
nohup iperf -c 30.30.220.191 -T s2 -p 8272 -t  10 

It collects the flows on the computes with the following command: sudo ovs-appctl dpctl/dump-flows -m type=offloaded

compute1:
ufid:12231e66-fd0f-45a9-a9ea-0d75b892cbd8, skb_priority(0/0),tunnel(tun_id=0x4,src=10.10.141.174,dst=10.10.141.126,ttl=0/0,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x40002/0x7fffffff}),flags(+key)),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(genev_sys_6081),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:95:5e:11,dst=00:00:00:00:00:00/01:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=17,tos=0/0,ttl=0/0,frag=no),udp(src=0/0,dst=32768/0x8000), packets:2, bytes:180, used:3.630s, offloaded:yes, dp:tc, actions:enp4s0f0np0_8
ufid:483cc00d-6030-450a-bb73-8ed8634fe34d, skb_priority(0/0),tunnel(tun_id=0x4,src=10.10.141.174,dst=10.10.141.126,ttl=0/0,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x40002/0x7fffffff}),flags(+key)),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(genev_sys_6081),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:95:5e:11,dst=00:00:00:00:00:00/01:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=6,tos=0/0,ttl=0/0,frag=no),tcp(src=0/0,dst=0/0), packets:21, bytes:3507, used:9.900s, offloaded:yes, dp:tc, actions:enp4s0f0np0_8
ufid:1d6af21d-766c-4e34-a7ee-38dd662e383f, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(enp4s0f0np0_8),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:ec:d9:45,dst=fa:16:3e:95:5e:11),eth_type(0x0800),ipv4(src=20.20.220.128/255.255.255.192,dst=128.0.0.0/192.0.0.0,proto=17,tos=0/0x3,ttl=64,frag=no),udp(src=0/0,dst=0/0x800), packets:0, bytes:0, used:4.690s, offloaded:yes, dp:tc, actions:set(tunnel(tun_id=0x2,dst=10.10.141.174,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x40002}),flags(csum|key))),set(eth(src=fa:16:3e:d1:b8:15,dst=9c:cc:83:58:1c:60)),set(ipv4(ttl=63)),genev_sys_6081
ufid:afcda996-412d-4527-b407-aeb873d7b26b, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(enp4s0f0np0_8),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:ec:d9:45,dst=fa:16:3e:95:5e:11),eth_type(0x0800),ipv4(src=20.20.220.128/255.255.255.192,dst=10.35.0.0/255.255.128.0,proto=6,tos=0/0x3,ttl=64,frag=no),tcp(src=0/0,dst=0/0), packets:17, bytes:4773, used:9.900s, offloaded:yes, dp:tc, actions:set(tunnel(tun_id=0x2,dst=10.10.141.174,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x40002}),flags(csum|key))),set(eth(src=fa:16:3e:d1:b8:15,dst=9c:cc:83:58:1c:60)),set(ipv4(ttl=63)),genev_sys_6081
ufid:09326dbc-d9de-4792-a6f7-2f6d205b3925, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(enp4s0f0np0_8),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:ec:d9:45,dst=fa:16:3e:95:5e:11),eth_type(0x0800),ipv4(src=20.20.220.128/255.255.255.192,dst=208.0.0.0/240.0.0.0,proto=17,tos=0/0x3,ttl=64,frag=no),udp(src=0/0,dst=0/0x800), packets:0, bytes:0, used:9.590s, offloaded:yes, dp:tc, actions:set(tunnel(tun_id=0x2,dst=10.10.141.174,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x40002}),flags(csum|key))),set(eth(src=fa:16:3e:d1:b8:15,dst=9c:cc:83:58:1c:60)),set(ipv4(ttl=63)),genev_sys_6081
ufid:a0775850-cc15-45cf-8d44-77ff3c565e12, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(mx-bond),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:6f:2c:38,dst=fa:16:3e:d0:5a:c5),eth_type(0x8100),vlan(vid=148,pcp=0),encap(eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no)), packets:2802527, bytes:25256047746, used:1.010s, offloaded:yes, dp:tc, actions:pop_vlan,enp4s0f1np1_0
ufid:3565313d-98ab-4f98-8ecf-3ba5cfb5db4f, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(enp4s0f1np1_0),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:d0:5a:c5,dst=fa:16:3e:6f:2c:38),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=30.30.220.128/255.255.255.192,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:245028, bytes:17171012, used:1.010s, offloaded:yes, dp:tc, actions:push_vlan(vid=148,pcp=0),mx-bond

compute2:
ufid:dfa6a2e4-6957-4ce2-b4cc-df0e87dd976b, skb_priority(0/0),tunnel(tun_id=0x4,src=10.10.141.174,dst=10.10.141.137,ttl=0/0,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x40003/0x7fffffff}),flags(+key)),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(genev_sys_6081),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:95:5e:11,dst=00:00:00:00:00:00/01:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=6,tos=0/0,ttl=0/0,frag=no),tcp(src=0/0,dst=0/0), packets:21, bytes:3507, used:10.620s, offloaded:yes, dp:tc, actions:ens6f0np0_0
ufid:23d1761f-6fa0-42ff-9696-1dbedd218f48, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ens6f1np1_3),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:6f:2c:38,dst=fa:16:3e:d0:5a:c5),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=30.30.220.128/255.255.255.192,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:3460223, bytes:31204156766, used:0.250s, offloaded:yes, dp:tc, actions:push_vlan(vid=148,pcp=0),mx-bond
ufid:bbdf2718-600c-4954-8273-e5b067941e4c, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ens6f0np0_0),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:e8:b9:45,dst=fa:16:3e:95:5e:11),eth_type(0x0800),ipv4(src=20.20.220.0/255.255.255.128,dst=10.35.0.0/255.255.128.0,proto=6,tos=0/0x3,ttl=64,frag=no),tcp(src=0/0,dst=0/0), packets:17, bytes:4773, used:10.620s, offloaded:yes, dp:tc, actions:set(tunnel(tun_id=0x2,dst=10.10.141.174,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x40002}),flags(csum|key))),set(eth(src=fa:16:3e:d1:b8:15,dst=9c:cc:83:58:1c:60)),set(ipv4(ttl=63)),genev_sys_6081
ufid:a0f02637-438c-4aa6-b18b-55edb8250a71, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(mx-bond),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:d0:5a:c5,dst=fa:16:3e:6f:2c:38),eth_type(0x8100),vlan(vid=148,pcp=0),encap(eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no)), packets:305081, bytes:20135884, used:0.250s, offloaded:yes, dp:tc, actions:pop_vlan,ens6f1np1_3
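To compare the flows above, it helps to pull out the in_port, datapath, offload flag and counters of each `dpctl/dump-flows -m` line and compute the average bytes per packet. The sketch below is a minimal parser written for this report, not an OVS tool; it assumes the one-flow-per-line text format shown above and relies only on the `in_port(...)`, `packets:`, `bytes:`, `offloaded:yes`, `dp:` and `actions:` tokens. The names `Flow` and `parse_flow` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Flow:
    in_port: str
    packets: int
    n_bytes: int
    offloaded: bool  # True when the line carries "offloaded:yes"
    dp: str          # datapath reported by OVS: "tc" or "ovs" in the dumps above
    actions: str

    @property
    def avg_packet_size(self) -> float:
        """Average bytes per packet (0.0 for flows that matched nothing)."""
        return self.n_bytes / self.packets if self.packets else 0.0


def _field(pattern: str, line: str) -> str:
    """Return the first capture group of `pattern`, or fail loudly."""
    match = re.search(pattern, line)
    if match is None:
        raise ValueError(f"pattern {pattern!r} not found in flow line")
    return match.group(1)


def parse_flow(line: str) -> Flow:
    """Parse one line of `ovs-appctl dpctl/dump-flows -m` output (sketch)."""
    return Flow(
        in_port=_field(r"in_port\(([^)]+)\)", line),
        packets=int(_field(r"packets:(\d+)", line)),
        n_bytes=int(_field(r"bytes:(\d+)", line)),
        offloaded="offloaded:yes" in line,
        dp=_field(r", dp:(\w+)", line),  # leading ", " avoids matching "dp_hash("
        actions=_field(r"actions:(.*)$", line),
    )
```

For example, on compute1 the VLAN flow from mx-bond to enp4s0f1np1_0 averages just under 9012 bytes per packet (25256047746 / 2802527), while the reverse flow from enp4s0f1np1_0 to mx-bond averages about 70 bytes (17171012 / 245028). That is consistent with the jumbo data frames and the small ACKs seen in the tcpdump captures.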

It also runs tcpdump on the representor port:
compute1
sudo nohup timeout 10 tcpdump -i enp4s0f1np1_0 -nne  ether host fa:16:3e:d0:5a:c5 and  ether host fa:16:3e:6f:2c:38

.................
 4294825019 ecr 4294828310], length 8948
15:14:34.983925 fa:16:3e:6f:2c:38 > fa:16:3e:d0:5a:c5, ethertype IPv4 (0x0800), length 9014: 30.30.220.185.38970 > 30.30.220.191.8272: Flags [.], seq 3457325889:3457334837, ack 1, win 210, options [nop,nop,TS val 4294825019 ecr 4294828310], length 8948
15:14:34.983927 fa:16:3e:6f:2c:38 > fa:16:3e:d0:5a:c5, ethertype IPv4 (0x0800), length 9014: 30.30.220.185.38970 > 30.30.220.191.8272: Flags [.], seq 3457334837:3457343785, ack 1, win 210, options [nop,nop,TS val 4294825019 ecr 4294828310], length 8948
15:14:34.983928 fa:16:3e:6f:2c:38 > fa:16:3e:d0:5a:c5, ethertype IPv4 (0x0800), length 9014: 30.30.220.185.38970 > 30.30.220.191.8272: Flags [.], seq 3457343785:3457352733, ack 1, win 210, options [nop,nop,TS val 4294825019 ecr 4294828310], length 8948

62648 packets captured
100311 packets received by filter
37638 packets dropped by kernel


compute2
sudo nohup timeout 10 tcpdump -i ens6f1np1_3 -nne  ether host fa:16:3e:d0:5a:c5 and  ether host fa:16:3e:6f:2c:38

4816899,nop,nop,sack 1 {88227281:88379397}], length 0
15:14:26.864019 fa:16:3e:d0:5a:c5 > fa:16:3e:6f:2c:38, ethertype IPv4 (0x0800), length 78: 30.30.220.191.8272 > 30.30.220.185.38970: Flags [.], ack 88075165, win 23947, options [nop,nop,TS val 4294820191 ecr 4294816899,nop,nop,sack 1 {88227281:88388345}], length 0
15:14:26.864025 fa:16:3e:d0:5a:c5 > fa:16:3e:6f:2c:38, ethertype IPv4 (0x0800), length 78: 30.30.220.191.8272 > 30.30.220.185.38970: Flags [.], ack 88084113, win 23909, options [nop,nop,TS val 4294820191 ecr 4294816900,nop,nop,sack 1 {88227281:88388345}], length 0
15:14:26.864140 fa:16:3e:d0:5a:c5 > fa:16:3e:6f:2c:38, ethertype IPv4 (0x0800), length 78: 30.30.220.191.8272 > 30.30.220.185.38970: Flags [.], ack 88119905, win 23909, options [nop,nop,TS val 4294820191 ecr 4294816900,nop,nop,sack 1 {88227281:88388345}], length 0

652 packets captured
1461 packets received by filter
0 packets dropped by kernel


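The packets captured on this compute's representor port have a TCP payload length of 0: they are pure ACKs from the iperf server (30.30.220.191:8272) back to the client.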
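To separate bulk data segments from ACKs in these captures, a small parser over the `tcpdump -nne` lines is enough. This is a sketch assuming the IPv4/TCP line format shown above; the regex and the names `Segment` and `parse_segment` are hypothetical. In the compute1 capture, each 9014-byte frame carries 8948 bytes of payload (9014 = 8948 + 14 Ethernet + 20 IPv4 + 32 TCP header bytes with the timestamp option), while every compute2 line shown is a pure ACK.

```python
import re
from typing import NamedTuple, Optional

# One tcpdump -nne line for an IPv4/TCP packet, as captured in this report.
TCPDUMP_RE = re.compile(
    r"^(?P<ts>\S+) (?P<smac>\S+) > (?P<dmac>\S+), ethertype IPv4 \(0x0800\), "
    r"length (?P<frame_len>\d+): (?P<src>\S+) > (?P<dst>\S+): "
    r"Flags \[(?P<flags>[^\]]*)\].*, length (?P<payload_len>\d+)$"
)


class Segment(NamedTuple):
    ts: str
    src: str          # "ip.port" as printed by tcpdump
    dst: str
    frame_len: int    # Ethernet frame length
    payload_len: int  # TCP payload length
    flags: str

    @property
    def pure_ack(self) -> bool:
        """ACK-only segment ("." flag) carrying no TCP payload."""
        return self.flags == "." and self.payload_len == 0


def parse_segment(line: str) -> Optional[Segment]:
    """Parse one capture line; return None for truncated or non-TCP lines."""
    m = TCPDUMP_RE.match(line.strip())
    if m is None:
        return None
    return Segment(m["ts"], m["src"], m["dst"], int(m["frame_len"]),
                   int(m["payload_len"]), m["flags"])
```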





Actual results:
Packets are not offloaded (packets are seen on the representor port).



Expected results:
Traffic should be offloaded (no packets seen on the representor port).
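The expected result can be turned into a mechanical check on the summary that tcpdump prints when it exits. A minimal sketch, assuming the three summary lines shown in this report; `capture_stats` and `offload_ok` are hypothetical names. It also looks at "packets received by filter", because when the kernel drops packets (37638 on compute1) the "captured" count can understate how many packets matched the filter.

```python
import re

SUMMARY_PATTERNS = {
    "captured": r"(\d+) packets captured",
    "received": r"(\d+) packets received by filter",
    "dropped": r"(\d+) packets dropped by kernel",
}


def capture_stats(text: str) -> dict:
    """Extract tcpdump's exit counters; a missing counter is reported as None."""
    stats = {}
    for key, pattern in SUMMARY_PATTERNS.items():
        m = re.search(pattern, text)
        stats[key] = int(m.group(1)) if m else None
    return stats


def offload_ok(stats: dict) -> bool:
    """Expected result: nothing of the VM-to-VM flow reaches the representor."""
    return stats.get("captured") == 0 and stats.get("received") == 0
```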



Additional info:
I will attach the sosreports for the computes and the templates.

Comment 1 Miguel Angel Nieto 2022-07-18 16:06:49 UTC
Haresh asked me to run "sudo ovs-appctl dpctl/dump-flows -m" instead of "sudo ovs-appctl dpctl/dump-flows -m type=offloaded". I ran the test case again, so the IP and MAC addresses are different.

(overcloud) [stack@undercloud-0 ~]$ openstack server list --all-projects
/usr/lib/python3.9/site-packages/ansible/_vendor/__init__.py:42: UserWarning: One or more Python packages bundled by this ansible-core distribution were already loaded (pyparsing). This may result in undefined behavior.
  warnings.warn('One or more Python packages bundled by this ansible-core distribution were already '
+--------------------------------------+------------------------------------------+--------+--------------------------------------------------------------------------------------------+---------------------------------------+--------------------+
| ID                                   | Name                                     | Status | Networks                                                                                   | Image                                 | Flavor             |
+--------------------------------------+------------------------------------------+--------+--------------------------------------------------------------------------------------------+---------------------------------------+--------------------+
| 9f1ede19-0296-4efe-a8a3-60f5a751ba92 | tempest-TestNfvOffload-server-1841197126 | ACTIVE | mellanox-geneve-provider=10.35.228.44, 20.20.220.110; mellanox-vlan-provider=30.30.220.146 | rhel-guest-image-7-6-210-x86-64-qcow2 | nfv_qe_base_flavor |
| 23ad7819-39ac-4dd8-b9d1-ea210829783e | tempest-TestNfvOffload-server-305132433  | ACTIVE | mellanox-geneve-provider=10.35.228.41, 20.20.220.124; mellanox-vlan-provider=30.30.220.151 | rhel-guest-image-7-6-210-x86-64-qcow2 | nfv_qe_base_flavor |
+--------------------------------------+------------------------------------------+--------+--------------------------------------------------------------------------------------------+---------------------------------------+--------------------+
(overcloud) [stack@undercloud-0 ~]$ openstack port list | egrep "20.20.220.110|20.20.220.124|30.30.220.146|30.30.220.151"
| 59c85595-ba11-48af-8490-c6cdbbf06dc3 | tempest-port-smoke-732069959  | fa:16:3e:64:85:84 | ip_address='30.30.220.151', subnet_id='392d313a-da87-4368-a408-c40b617c6faa' | ACTIVE |
| 74efbd2f-b580-48cc-8ea1-826961b3554b | tempest-port-smoke-1921923533 | fa:16:3e:8c:e4:11 | ip_address='30.30.220.146', subnet_id='392d313a-da87-4368-a408-c40b617c6faa' | ACTIVE |
| c7ce91cf-deeb-4046-8bbf-d96e1a5af1cd | tempest-port-smoke-140043879  | fa:16:3e:1d:41:08 | ip_address='20.20.220.110', subnet_id='9e83be3c-e290-4949-96af-ee7911713397' | ACTIVE |
| f9f09e90-bab6-4da0-849c-1d51c911f79a | tempest-port-smoke-1159573315 | fa:16:3e:4e:f8:a8 | ip_address='20.20.220.124', subnet_id='9e83be3c-e290-4949-96af-ee7911713397' | ACTIVE |

(overcloud) [stack@undercloud-0 ~]$ ssh cloud-user@10.35.228.44 "sudo ip a"
Warning: Permanently added '10.35.228.44' (ED25519) to the list of known hosts.
cloud-user@10.35.228.44's password: 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8942 qdisc mq state UP group default qlen 1000
    link/ether fa:16:3e:1d:41:08 brd ff:ff:ff:ff:ff:ff
    inet 20.20.220.110/24 brd 20.20.220.255 scope global noprefixroute dynamic eth0
       valid_lft 42862sec preferred_lft 42862sec
    inet6 fe80::f816:3eff:fe1d:4108/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether fa:16:3e:8c:e4:11 brd ff:ff:ff:ff:ff:ff
    inet 30.30.220.146/24 brd 30.30.220.255 scope global noprefixroute dynamic eth1
       valid_lft 42862sec preferred_lft 42862sec
    inet6 fe80::f816:3eff:fe8c:e411/64 scope link 
       valid_lft forever preferred_lft forever
(overcloud) [stack@undercloud-0 ~]$ ssh cloud-user@10.35.228.41 "sudo ip a"
Warning: Permanently added '10.35.228.41' (ED25519) to the list of known hosts.
cloud-user@10.35.228.41's password: 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8942 qdisc mq state UP group default qlen 1000
    link/ether fa:16:3e:4e:f8:a8 brd ff:ff:ff:ff:ff:ff
    inet 20.20.220.124/24 brd 20.20.220.255 scope global noprefixroute dynamic eth0
       valid_lft 42835sec preferred_lft 42835sec
    inet6 fe80::f816:3eff:fe4e:f8a8/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether fa:16:3e:64:85:84 brd ff:ff:ff:ff:ff:ff
    inet 30.30.220.151/24 brd 30.30.220.255 scope global noprefixroute dynamic eth1
       valid_lft 42835sec preferred_lft 42835sec
    inet6 fe80::f816:3eff:fe64:8584/64 scope link 
       valid_lft forever preferred_lft forever
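The tcpdump filters in this comment use the VLAN-port MAC addresses, so mapping those MACs back to fixed IPs with the `openstack port list` rows above makes the direction of each captured packet easier to follow. A minimal sketch, assuming the default five-column table layout (ID, Name, MAC Address, Fixed IP Addresses, Status); `ports_by_mac` is a hypothetical helper.

```python
import re
from typing import Dict, Tuple


def ports_by_mac(table: str) -> Dict[str, Tuple[str, str]]:
    """Map MAC -> (port name, first fixed IP) from `openstack port list` rows."""
    result: Dict[str, Tuple[str, str]] = {}
    for row in table.splitlines():
        cells = [c.strip() for c in row.strip().strip("|").split("|")]
        if len(cells) < 4:
            continue  # table borders, warnings and other non-row lines
        ip = re.search(r"ip_address='([^']+)'", cells[3])
        if ip:  # the header row has no ip_address= and is skipped here
            result[cells[2].lower()] = (cells[1], ip.group(1))
    return result
```

With the rows above, fa:16:3e:8c:e4:11 maps to 30.30.220.146 (eth1 of the VM at 10.35.228.44) and fa:16:3e:64:85:84 maps to 30.30.220.151 (eth1 of the VM at 10.35.228.41).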


nohup  iperf -s -B 30.30.220.151 -p 8293 -t 10
nohup  iperf -c 30.30.220.151 -T s2 -p 8293 -t

Executing command: sudo ovs-appctl dpctl/dump-flows -m
ufid:142fdac7-5b73-418a-b9aa-3aeb1a18bbd6, skb_priority(0/0),tunnel(tun_id=0x4,src=10.10.141.174,dst=10.10.141.126,ttl=0/0,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x40002/0x7fffffff}),flags(+key)),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(genev_sys_6081),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:e0:0a:bf,dst=00:00:00:00:00:00/01:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=6,tos=0/0,ttl=0/0,frag=no),tcp(src=0/0,dst=0/0), packets:169, bytes:27322, used:9.510s, offloaded:yes, dp:tc, actions:enp4s0f0np0_8
ufid:87d1b972-cd86-4990-a785-0b3ed829b244, skb_priority(0/0),tunnel(tun_id=0x4,src=10.10.141.174,dst=10.10.141.126,ttl=0/0,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x40002/0x7fffffff}),flags(+key)),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(genev_sys_6081),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:e0:0a:bf,dst=00:00:00:00:00:00/01:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=17,tos=0/0,ttl=0/0,frag=no),udp(src=0/0,dst=32768/0x8000), packets:109, bytes:35363, used:4.840s, offloaded:yes, dp:tc, actions:enp4s0f0np0_8
ufid:d8a3ce0a-ef6d-4210-8056-eb80c9743a5b, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(enp4s0f0np0_8),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:4e:f8:a8,dst=fa:16:3e:e0:0a:bf),eth_type(0x0800),ipv4(src=20.20.220.0/255.255.255.128,dst=192.168.0.0/255.255.128.0,proto=17,tos=0/0x3,ttl=64,frag=no),udp(src=0/0,dst=0/0x800), packets:88, bytes:13142, used:10.540s, offloaded:yes, dp:tc, actions:set(tunnel(tun_id=0x2,dst=10.10.141.174,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x30002}),flags(csum|key))),set(eth(src=fa:16:3e:d1:b8:15,dst=9c:cc:83:58:1c:60)),set(ipv4(ttl=63)),genev_sys_6081
ufid:1e856f2e-0080-412b-bef6-a19661747fe4, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(enp4s0f0np0_8),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:4e:f8:a8,dst=fa:16:3e:e0:0a:bf),eth_type(0x0800),ipv4(src=20.20.220.0/255.255.255.128,dst=10.35.0.0/255.255.128.0,proto=6,tos=0/0x3,ttl=64,frag=no),tcp(src=0/0,dst=0/0), packets:138, bytes:37992, used:9.510s, offloaded:yes, dp:tc, actions:set(tunnel(tun_id=0x2,dst=10.10.141.174,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x30002}),flags(csum|key))),set(eth(src=fa:16:3e:d1:b8:15,dst=9c:cc:83:58:1c:60)),set(ipv4(ttl=63)),genev_sys_6081
ufid:9901ee4a-832f-492b-969f-810eafd4ef2d, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(enp4s0f0np0_8),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:4e:f8:a8,dst=fa:16:3e:e0:0a:bf),eth_type(0x0800),ipv4(src=20.20.220.0/255.255.255.128,dst=10.40.0.0/255.248.0.0,proto=17,tos=0/0x3,ttl=64,frag=no),udp(src=0/0,dst=0/0x800), packets:88, bytes:13142, used:4.840s, offloaded:yes, dp:tc, actions:set(tunnel(tun_id=0x2,dst=10.10.141.174,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x30002}),flags(csum|key))),set(eth(src=fa:16:3e:d1:b8:15,dst=9c:cc:83:58:1c:60)),set(ipv4(ttl=63)),genev_sys_6081
ufid:2100ff16-9a46-4e0a-be57-21353ba8a8fb, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(mx-bond),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:8c:e4:11,dst=ff:ff:ff:ff:ff:ff),eth_type(0x8100),vlan(vid=148,pcp=0),encap(eth_type(0x0806),arp(sip=0.0.0.0/0.0.0.0,tip=0.0.0.0/0.0.0.0,op=0/0,sha=00:00:00:00:00:00/00:00:00:00:00:00,tha=00:00:00:00:00:00/00:00:00:00:00:00)), packets:0, bytes:0, used:10.140s, dp:tc, actions:br-link0,pop_vlan,tap2a3e2f82-90,enp4s0f1np1_0
ufid:465287c3-7b06-4b15-af19-6464cf85b155, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(mx-bond),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:8c:e4:11,dst=fa:16:3e:64:85:84),eth_type(0x8100),vlan(vid=148,pcp=0),encap(eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no)), packets:3363635, bytes:30318112454, used:0.680s, offloaded:yes, dp:tc, actions:pop_vlan,enp4s0f1np1_0
ufid:ca20e7c7-5676-46cc-b5a3-221501099080, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(enp4s0f1np1_0),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:64:85:84,dst=fa:16:3e:8c:e4:11),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=30.30.220.128/255.255.255.192,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:311067, bytes:21780606, used:0.680s, offloaded:yes, dp:tc, actions:push_vlan(vid=148,pcp=0),mx-bond
ufid:696412c9-0cb5-4687-b772-4367514d19d9, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(enp4s0f1np1_0),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:64:85:84,dst=fa:16:3e:8c:e4:11),eth_type(0x0806),arp(sip=0.0.0.0/0.0.0.0,tip=30.30.220.146,op=2,sha=00:00:00:00:00:00/00:00:00:00:00:00,tha=00:00:00:00:00:00/00:00:00:00:00:00), packets:0, bytes:0, used:10.140s, dp:tc, actions:push_vlan(vid=148,pcp=0),mx-bond
ufid:e310c615-28cd-4484-b175-c663387ef5b0, recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(mx-bond),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=f4:52:14:25:28:74,dst=01:80:c2:00:00:00),eth_type(0/0xffff), packets:1150, bytes:69000, used:1.861s, dp:ovs, actions:drop
ufid:31287494-84ec-4f81-bef1-2e866f2ffc7d, recirc_id(0),dp_hash(0/0),skb_priority(0/0),tunnel(tun_id=0x0,src=10.10.141.104,dst=10.10.141.126,ttl=0/0,flags(-df+csum+key)),in_port(genev_sys_6081),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=17,tos=0/0,ttl=0/0,frag=no),udp(src=0/0,dst=3784), packets:161, bytes:10626, used:0.781s, dp:ovs, actions:userspace(pid=4294967295,slow_path(bfd))
ufid:abfc4e9b-bc22-402c-8712-becc06d1acb2, recirc_id(0),dp_hash(0/0),skb_priority(0/0),tunnel(tun_id=0x0,src=10.10.141.150,dst=10.10.141.126,ttl=0/0,flags(-df+csum+key)),in_port(genev_sys_6081),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=17,tos=0/0,ttl=0/0,frag=no),udp(src=0/0,dst=3784), packets:160, bytes:10560, used:0.382s, dp:ovs, actions:userspace(pid=4294967295,slow_path(bfd))
ufid:878e63cb-e8f9-4184-baab-ed4aca0ebcab, recirc_id(0),dp_hash(0/0),skb_priority(0/0),tunnel(tun_id=0x0,src=10.10.141.174,dst=10.10.141.126,ttl=0/0,flags(-df+csum+key)),in_port(genev_sys_6081),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=17,tos=0/0,ttl=0/0,frag=no),udp(src=0/0,dst=3784), packets:161, bytes:10626, used:0.173s, dp:ovs, actions:userspace(pid=4294967295,slow_path(bfd))

Executing command: sudo ovs-appctl dpctl/dump-flows -m
ufid:ebe42885-3175-4c70-acc2-c2004d5da5a1, skb_priority(0/0),tunnel(tun_id=0x4,src=10.10.141.174,dst=10.10.141.137,ttl=0/0,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x40003/0x7fffffff}),flags(+key)),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(genev_sys_6081),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:e0:0a:bf,dst=00:00:00:00:00:00/01:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=6,tos=0/0,ttl=0/0,frag=no),tcp(src=0/0,dst=0/0), packets:194, bytes:32502, used:9.500s, offloaded:yes, dp:tc, actions:ens6f0np0_0
ufid:5a38f280-f765-48a5-9911-69af0dbe8b20, skb_priority(0/0),tunnel(tun_id=0x4,src=10.10.141.174,dst=10.10.141.137,ttl=0/0,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x40003/0x7fffffff}),flags(+key)),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(genev_sys_6081),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:e0:0a:bf,dst=00:00:00:00:00:00/01:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=17,tos=0/0,ttl=0/0,frag=no),udp(src=0/0,dst=32768/0x8000), packets:3, bytes:270, used:8.050s, offloaded:yes, dp:tc, actions:ens6f0np0_0
ufid:e77d45b7-4d4e-4d18-9cb7-9efcb3de576a, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ens6f1np1_3),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:8c:e4:11,dst=fa:16:3e:64:85:84),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=30.30.220.128/255.255.255.192,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:3626303, bytes:32701908862, used:0.150s, offloaded:yes, dp:tc, actions:push_vlan(vid=148,pcp=0),mx-bond
ufid:ce293fc0-1f10-4633-ad6a-bcf21c77c27f, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ens6f0np0_0),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:1d:41:08,dst=fa:16:3e:e0:0a:bf),eth_type(0x0800),ipv4(src=20.20.220.0/255.255.255.128,dst=10.35.0.0/255.255.128.0,proto=6,tos=0/0x3,ttl=64,frag=no),tcp(src=0/0,dst=0/0), packets:156, bytes:44266, used:9.500s, offloaded:yes, dp:tc, actions:set(tunnel(tun_id=0x2,dst=10.10.141.174,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x30002}),flags(csum|key))),set(eth(src=fa:16:3e:d1:b8:15,dst=9c:cc:83:58:1c:60)),set(ipv4(ttl=63)),genev_sys_6081
ufid:08b2c32d-4c78-47b2-afde-d49fe7932fcb, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ens6f0np0_0),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:1d:41:08,dst=fa:16:3e:e0:0a:bf),eth_type(0x0800),ipv4(src=20.20.220.0/255.255.255.128,dst=128.0.0.0/192.0.0.0,proto=17,tos=0/0x3,ttl=64,frag=no),udp(src=0/0,dst=0/0x800), packets:2, bytes:304, used:8.050s, offloaded:yes, dp:tc, actions:set(tunnel(tun_id=0x2,dst=10.10.141.174,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x30002}),flags(csum|key))),set(eth(src=fa:16:3e:d1:b8:15,dst=9c:cc:83:58:1c:60)),set(ipv4(ttl=63)),genev_sys_6081
ufid:067f7537-7d48-4e37-b1c9-e17915f8c93c, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(mx-bond),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:64:85:84,dst=fa:16:3e:8c:e4:11),eth_type(0x8100),vlan(vid=148,pcp=0),encap(eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no)), packets:324587, bytes:21378092, used:0.150s, offloaded:yes, dp:tc, actions:pop_vlan,ens6f1np1_3
ufid:4e57c833-474a-40fe-8c97-18cee1df0092, recirc_id(0),dp_hash(0/0),skb_priority(0/0),tunnel(tun_id=0x0,src=10.10.141.150,dst=10.10.141.137,ttl=0/0,flags(-df+csum+key)),in_port(genev_sys_6081),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=17,tos=0/0,ttl=0/0,frag=no),udp(src=0/0,dst=3784), packets:157, bytes:10362, used:0.369s, dp:ovs, actions:userspace(pid=4294967295,slow_path(bfd))
ufid:a9f2dab8-3eb9-44dd-9963-3045537177c7, recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(mx-bond),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=f4:52:14:25:28:7a,dst=01:80:c2:00:00:00),eth_type(0/0xffff), packets:1555, bytes:93300, used:0.672s, dp:ovs, actions:drop
ufid:fa4ec406-aa30-4303-9889-03b2a33ff6cb, recirc_id(0),dp_hash(0/0),skb_priority(0/0),tunnel(tun_id=0x0,src=10.10.141.174,dst=10.10.141.137,ttl=0/0,flags(-df+csum+key)),in_port(genev_sys_6081),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=17,tos=0/0,ttl=0/0,frag=no),udp(src=0/0,dst=3784), packets:158, bytes:10428, used:0.170s, dp:ovs, actions:userspace(pid=4294967295,slow_path(bfd))
ufid:331614a0-fe53-47eb-90cc-ef6877b0d45a, recirc_id(0),dp_hash(0/0),skb_priority(0/0),tunnel(tun_id=0x0,src=10.10.141.104,dst=10.10.141.137,ttl=0/0,flags(-df+csum+key)),in_port(genev_sys_6081),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=17,tos=0/0,ttl=0/0,frag=no),udp(src=0/0,dst=3784), packets:159, bytes:10494, used:0.466s, dp:ovs, actions:userspace(pid=4294967295,slow_path(bfd))
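The full dump (without `type=offloaded`) is what exposes the flows that stay in software. A minimal sketch for bucketing each line, under my assumption about the output format: `dp:tc` with `offloaded:yes` is treated as in NIC hardware, `dp:tc` without it as installed in TC but not in hardware, and `dp:ovs` as handled by the OVS kernel datapath. The function names are hypothetical. On compute1 above, the only `dp:tc` flows without `offloaded:yes` are the two VLAN ARP flows; the remaining software flows are the BFD (UDP port 3784) and 01:80:c2:00:00:00 drop flows on `dp:ovs`.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def classify(line: str) -> str:
    """Bucket a `dpctl/dump-flows -m` line by where it appears to execute (sketch).

    "tc-hw": dp:tc with offloaded:yes (assumed to be in NIC hardware)
    "tc-sw": dp:tc without offloaded:yes (assumed to be in TC, not in hardware)
    otherwise the dp: value itself, e.g. "ovs" for the kernel datapath
    """
    dp = re.search(r", dp:(\w+)", line)
    if dp is None:
        return "unknown"
    if dp.group(1) == "tc":
        return "tc-hw" if "offloaded:yes" in line else "tc-sw"
    return dp.group(1)


def not_offloaded(lines: Iterable[str]) -> List[Tuple[str, str, int]]:
    """Return (in_port, bucket, packets) for every flow entry not in hardware."""
    result = []
    for line in lines:
        if not line.startswith("ufid:"):
            continue  # skip command echoes and blank lines
        bucket = classify(line)
        if bucket == "tc-hw":
            continue
        port = re.search(r"in_port\(([^)]+)\)", line)
        packets = re.search(r"packets:(\d+)", line)
        result.append((port.group(1) if port else "?",
                       bucket,
                       int(packets.group(1)) if packets else 0))
    return result


def summarize(lines: Iterable[str]) -> Counter:
    """Count flow entries per bucket."""
    return Counter(classify(line) for line in lines if line.startswith("ufid:"))
```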

timeout 10 tcpdump -i enp4s0f1np1_0 -nne  ether host fa:16:3e:64:85:84 and  ether host fa:16:3e:8c:e4:11
15:53:34.761787 fa:16:3e:8c:e4:11 > fa:16:3e:64:85:84, ethertype IPv4 (0x0800), length 9014: 30.30.220.146.50802 > 30.30.220.151.8293: Flags [.], seq 249222505:249231453, ack 1, win 210, options [nop,nop,TS val 4294778732 ecr 4294786378], length 8948
15:53:34.761789 fa:16:3e:8c:e4:11 > fa:16:3e:64:85:84, ethertype IPv4 (0x0800), length 9014: 30.30.220.146.50802 > 30.30.220.151.8293: Flags [.], seq 249231453:249240401, ack 1, win 210, options [nop,nop,TS val 4294778732 ecr 4294786378], length 8948
15:53:34.761790 fa:16:3e:8c:e4:11 > fa:16:3e:64:85:84, ethertype IPv4 (0x0800), length 9014: 30.30.220.146.50802 > 30.30.220.151.8293: Flags [.], seq 249240401:249249349, ack 1, win 210, options [nop,nop,TS val 4294778732 ecr 4294786378], length 8948

15943 packets captured
27895 packets received by filter
11952 packets dropped by kernel





timeout 10 tcpdump -i enp4s0f1np1_0 -nne  ether host fa:16:3e:64:85:84 and  ether host fa:16:3e:8c:e4:11
15:53:34.761117 fa:16:3e:64:85:84 > fa:16:3e:8c:e4:11, ethertype IPv4 (0x0800), length 66: 30.30.220.151.8293 > 30.30.220.146.50802: Flags [.], ack 248372445, win 24576, options [nop,nop,TS val 4294786379 ecr 4294778732], length 0
15:53:34.761120 fa:16:3e:64:85:84 > fa:16:3e:8c:e4:11, ethertype IPv4 (0x0800), length 66: 30.30.220.151.8293 > 30.30.220.146.50802: Flags [.], ack 248497717, win 24081, options [nop,nop,TS val 4294786379 ecr 4294778732], length 0
15:53:34.761125 fa:16:3e:64:85:84 > fa:16:3e:8c:e4:11, ethertype IPv4 (0x0800), length 66: 30.30.220.151.8293 > 30.30.220.146.50802: Flags [.], ack 248560353, win 24576, options [nop,nop,TS val 4294786379 ecr 4294778732], length 0

3656 packets captured
3656 packets received by filter
0 packets dropped by kernel

Comment 5 Miguel Angel Nieto 2022-07-19 08:13:38 UTC
I rebooted everything (computes and controllers), ran the test case again, and it passed, so I would say that some process, or the compute itself, is not being restarted during deployment.

Comment 6 Miguel Angel Nieto 2022-07-29 09:55:43 UTC
I investigated it further and this is the behaviour just after deployment:

* ICMP: ping from a VM on one compute to a VM on the other compute over a VLAN provider network.
Running tcpdump on the representor port, we see that every 30 seconds a few packets are not offloaded (they show up on the representor port). This happens in one direction only: on one compute there are only ICMP replies, while on the other there are only ICMP requests.

* UDP: send UDP traffic from a VM on one compute to a VM on the other compute over a VLAN provider network, using iperf to send and receive traffic.
On the compute running the UDP receiver, no packets are offloaded, while on the compute running the UDP sender, all traffic is offloaded.

* TCP: send TCP traffic from a VM on one compute to a VM on the other compute over a VLAN provider network, using iperf to send and receive traffic.
The behaviour is similar to ICMP: every 30 seconds a few packets show up on the representor port, so for some reason those packets are not offloaded.

After rebooting the compute, the problem is solved for the current spawn vms and for new vms too.
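A minimal Python sketch for checking the "every ~30 seconds" pattern more objectively than by eye: it groups the timestamps of packets captured on the representor port into bursts. It assumes tcpdump's default `HH:MM:SS.ffffff` timestamp at the start of each packet line; the 5-second gap that separates bursts is an arbitrary, hypothetical threshold, `burst_starts` is a hypothetical helper, and captures that cross midnight are not handled.

```python
from datetime import datetime, timedelta


def burst_starts(lines, gap=timedelta(seconds=5)):
    """Return (start_time, packet_count) for each burst in a tcpdump capture.

    A new burst begins whenever the time since the previous packet exceeds `gap`.
    Lines that do not start with an HH:MM:SS.ffffff timestamp (decoded
    continuation lines, the "packets captured" summary) are skipped.
    """
    bursts = []
    prev = None
    for line in lines:
        token = line.split(" ", 1)[0]
        try:
            ts = datetime.strptime(token, "%H:%M:%S.%f")
        except ValueError:
            continue  # not a packet line
        if prev is None or ts - prev > gap:
            bursts.append([ts, 0])
        bursts[-1][1] += 1
        prev = ts
    return [(start.strftime("%H:%M:%S"), count) for start, count in bursts]
```

Fed with the ARP-triggered capture from comment 9 (09:01:58, 09:02:26, 09:02:53), the burst start times come out roughly 27-28 s apart, which matches the periodicity reported here.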

Comment 9 Haresh Khandelwal 2022-09-13 17:57:31 UTC
Hi,

I am able to reproduce this issue in my OSP17.0 environment. 

RHEL:9
Kernel: 5.14.0-70.22.1.el9_0
Ovs: openvswitch2.17-2.17.0-32.1.el9fdp.x86_64
Ovn: ovn22.03-22.03.0-69.el9fdp.x86_64
Firmware: 16.31.1014 (DEL0000000015)

So, the issue here is:
- TCP, UDP and ICMP traffic gets offloaded between 2 VMs hosted on 2 different compute nodes.
- The problem starts while ping is running: VM1 sends an ARP request, and after VM2's ARP response, a few packets egressing from VM2 and ingressing to VM1 are seen on the representor port at VM1. No packets in the VM1 to VM2 direction are seen on the RP.
- Packets on the representor port are observed for a brief time, ~2-3 seconds. However, when an ARP exchange starts again, we see packets on the RP again for 2-3 seconds. This repeats with every ARP exchange.
- When static ARP entries are configured for both VMs' interface MAC addresses, I don't see this issue.
- This issue is seen even after rebooting the compute node. Miguel (reporter) says he didn't see this issue after reboot.
- This issue can be reproduced without a VM as well. I configured the VF's kernel interface to reproduce it (RP attached to the OVS bridge, ovn-controller, etc.).
- This issue happens with both SMFS and DMFS.
- This issue happens with connection tracking as well.
- Miguel said he didn't see this issue in 16.2. I couldn't deploy 16.2 to confirm this.

Here is the capture when traffic starts. These are the initial packet exchanges before the flow gets offloaded.

VM1: f8:f2:1e:03:bf:f4/6.6.6.133
VM2: c2:64:5c:2b:8f:75/6.6.6.6

09:01:26.035783 c2:64:5c:2b:8f:75 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 56: Ethernet (len 6), IPv4 (len 4), Request who-has 6.6.6.133 tell 6.6.6.6, length 42
09:01:26.035793 c2:64:5c:2b:8f:75 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 56: Ethernet (len 6), IPv4 (len 4), Request who-has 6.6.6.133 tell 6.6.6.6, length 42
09:01:26.035904 f8:f2:1e:03:bf:f4 > c2:64:5c:2b:8f:75, ethertype ARP (0x0806), length 52: Ethernet (len 6), IPv4 (len 4), Reply 6.6.6.133 is-at f8:f2:1e:03:bf:f4, length 38
09:01:26.035904 f8:f2:1e:03:bf:f4 > c2:64:5c:2b:8f:75, ethertype ARP (0x0806), length 52: Ethernet (len 6), IPv4 (len 4), Reply 6.6.6.133 is-at f8:f2:1e:03:bf:f4, length 38
09:01:26.128038 c2:64:5c:2b:8f:75 > f8:f2:1e:03:bf:f4, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 57357, offset 0, flags [DF], proto ICMP (1), length 84)
    6.6.6.6 > 6.6.6.133: ICMP echo request, id 40965, seq 1, length 64
09:01:26.128096 c2:64:5c:2b:8f:75 > f8:f2:1e:03:bf:f4, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 57357, offset 0, flags [DF], proto ICMP (1), length 84)
    6.6.6.6 > 6.6.6.133: ICMP echo request, id 40965, seq 1, length 64
09:01:26.128096 f8:f2:1e:03:bf:f4 > c2:64:5c:2b:8f:75, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 54606, offset 0, flags [none], proto ICMP (1), length 84)
    6.6.6.133 > 6.6.6.6: ICMP echo reply, id 40965, seq 1, length 64
09:01:26.128119 f8:f2:1e:03:bf:f4 > c2:64:5c:2b:8f:75, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 54607, offset 0, flags [none], proto ICMP (1), length 84)
    6.6.6.133 > 6.6.6.6: ICMP echo reply, id 40965, seq 1, length 64
09:01:31.128985 c2:64:5c:2b:8f:75 > f8:f2:1e:03:bf:f4, ethertype ARP (0x0806), length 56: Ethernet (len 6), IPv4 (len 4), Reply 6.6.6.6 is-at c2:64:5c:2b:8f:75, length 42


From here onwards, whenever there is an ARP exchange, ingress packets are observed on the representor port at VM1.

09:01:58.696284 f8:f2:1e:03:bf:f4 > c2:64:5c:2b:8f:75, ethertype ARP (0x0806), length 52: Ethernet (len 6), IPv4 (len 4), Request who-has 6.6.6.6 tell 6.6.6.133, length 38
09:01:58.697149 c2:64:5c:2b:8f:75 > f8:f2:1e:03:bf:f4, ethertype ARP (0x0806), length 56: Ethernet (len 6), IPv4 (len 4), Reply 6.6.6.6 is-at c2:64:5c:2b:8f:75, length 42
09:01:58.697155 c2:64:5c:2b:8f:75 > f8:f2:1e:03:bf:f4, ethertype ARP (0x0806), length 56: Ethernet (len 6), IPv4 (len 4), Reply 6.6.6.6 is-at c2:64:5c:2b:8f:75, length 42
09:01:58.810577 c2:64:5c:2b:8f:75 > f8:f2:1e:03:bf:f4, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 8670, offset 0, flags [DF], proto ICMP (1), length 84)
    6.6.6.6 > 6.6.6.133: ICMP echo request, id 40965, seq 33, length 64 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
09:02:00.073163 c2:64:5c:2b:8f:75 > f8:f2:1e:03:bf:f4, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 8922, offset 0, flags [DF], proto ICMP (1), length 84)
    6.6.6.6 > 6.6.6.133: ICMP echo request, id 40965, seq 34, length 64 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
09:02:00.073173 c2:64:5c:2b:8f:75 > f8:f2:1e:03:bf:f4, ethertype ARP (0x0806), length 56: Ethernet (len 6), IPv4 (len 4), Request who-has 6.6.6.133 tell 6.6.6.6, length 42
09:02:26.343955 f8:f2:1e:03:bf:f4 > c2:64:5c:2b:8f:75, ethertype ARP (0x0806), length 52: Ethernet (len 6), IPv4 (len 4), Request who-has 6.6.6.6 tell 6.6.6.133, length 38
09:02:26.344808 c2:64:5c:2b:8f:75 > f8:f2:1e:03:bf:f4, ethertype ARP (0x0806), length 56: Ethernet (len 6), IPv4 (len 4), Reply 6.6.6.6 is-at c2:64:5c:2b:8f:75, length 42

09:02:53.991637 f8:f2:1e:03:bf:f4 > c2:64:5c:2b:8f:75, ethertype ARP (0x0806), length 52: Ethernet (len 6), IPv4 (len 4), Request who-has 6.6.6.6 tell 6.6.6.133, length 38
09:02:53.992408 c2:64:5c:2b:8f:75 > f8:f2:1e:03:bf:f4, ethertype ARP (0x0806), length 56: Ethernet (len 6), IPv4 (len 4), Reply 6.6.6.6 is-at c2:64:5c:2b:8f:75, length 42
09:02:53.992414 c2:64:5c:2b:8f:75 > f8:f2:1e:03:bf:f4, ethertype ARP (0x0806), length 56: Ethernet (len 6), IPv4 (len 4), Reply 6.6.6.6 is-at c2:64:5c:2b:8f:75, length 42
09:02:54.106582 c2:64:5c:2b:8f:75 > f8:f2:1e:03:bf:f4, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 37300, offset 0, flags [DF], proto ICMP (1), length 84)
    6.6.6.6 > 6.6.6.133: ICMP echo request, id 40965, seq 87, length 64 <<<<<<<<<<<<<<<<<<<<<<<<<<<<
09:02:55.130585 c2:64:5c:2b:8f:75 > f8:f2:1e:03:bf:f4, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 37379, offset 0, flags [DF], proto ICMP (1), length 84)
    6.6.6.6 > 6.6.6.133: ICMP echo request, id 40965, seq 88, length 64 <<<<<<<<<<<<<<<<<


Below are the flows when traffic is offloaded (no ARP).

ufid:2a7ed797-6a57-447f-9470-d7f93c6530e0, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(mx-bond),packet_type(ns=0/0,id=0/0),eth(src=c2:64:5c:2b:8f:75,dst=f8:f2:1e:03:bf:f4),eth_type(0x8100),vlan(vid=415,pcp=0),encap(eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no)), packets:18, bytes:1764, used:0.270s, offloaded:yes, dp:tc, actions:pop_vlan,enp4s0f0np0_0

ufid:58a8158d-1204-4b74-9a5d-ebd614b3d6e9, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(enp4s0f0np0_0),packet_type(ns=0/0,id=0/0),eth(src=f8:f2:1e:03:bf:f4,dst=c2:64:5c:2b:8f:75),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:18, bytes:1836, used:0.270s, offloaded:yes, dp:tc, actions:push_vlan(vid=415,pcp=0),mx-bond


Below are the flows when an ARP exchange happens and a few packets are steered towards the kernel.

ufid:3acaa7c9-a72c-4027-aa14-4c18970e7ef5, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(mx-bond),packet_type(ns=0/0,id=0/0),eth(src=c2:64:5c:2b:8f:75,dst=f8:f2:1e:03:bf:f4),eth_type(0x8100),vlan(vid=415,pcp=0),encap(eth_type(0x0806),arp(sip=0.0.0.0/0.0.0.0,tip=0.0.0.0/0.0.0.0,op=0/0,sha=00:00:00:00:00:00/00:00:00:00:00:00,tha=00:00:00:00:00:00/00:00:00:00:00:00)), packets:1, bytes:42, used:9.020s, offloaded:yes, dp:tc, actions:pop_vlan,enp4s0f0np0_0

ufid:2a7ed797-6a57-447f-9470-d7f93c6530e0, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(mx-bond),packet_type(ns=0/0,id=0/0),eth(src=c2:64:5c:2b:8f:75,dst=f8:f2:1e:03:bf:f4),eth_type(0x8100),vlan(vid=415,pcp=0),encap(eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no)), packets:42, bytes:4088, used:0.340s, offloaded:yes, dp:tc, actions:pop_vlan,enp4s0f0np0_0

ufid:58a8158d-1204-4b74-9a5d-ebd614b3d6e9, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(enp4s0f0np0_0),packet_type(ns=0/0,id=0/0),eth(src=f8:f2:1e:03:bf:f4,dst=c2:64:5c:2b:8f:75),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:42, bytes:4284, used:0.340s, offloaded:yes, dp:tc, actions:push_vlan(vid=415,pcp=0),mx-bond

ufid:dc404229-4b69-4534-9bb5-0957419bf818, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(enp4s0f0np0_0),packet_type(ns=0/0,id=0/0),eth(src=f8:f2:1e:03:bf:f4,dst=c2:64:5c:2b:8f:75),eth_type(0x0806),arp(sip=0.0.0.0/0.0.0.0,tip=0.0.0.0/0.0.0.0,op=0/0,sha=00:00:00:00:00:00/00:00:00:00:00:00,tha=00:00:00:00:00:00/00:00:00:00:00:00), packets:1, bytes:64, used:8.340s, offloaded:yes, dp:tc, actions:push_vlan(vid=415,pcp=0),mx-bond
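The flow listings above can be summarized with a small Python sketch that extracts `in_port`, the innermost `eth_type` (the one inside `encap(...)` for VLAN-tagged packets), the `offloaded:` flag and the `actions:` list from each `ovs-appctl dpctl/dump-flows -m` line. The field names come from the output shown here; the regular expressions are an assumption about its layout rather than a documented format, and `summarize_flow` / `split_top_level` are hypothetical helpers.

```python
import re


def split_top_level(text):
    """Split on commas that are not inside parentheses."""
    parts, depth, current = [], 0, ""
    for ch in text:
        if ch == "," and depth == 0:
            parts.append(current)
            current = ""
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        current += ch
    if current:
        parts.append(current)
    return parts


def summarize_flow(line):
    """Extract key fields from one `ovs-appctl dpctl/dump-flows -m` line."""
    in_port = re.search(r"in_port\(([^)]*)\)", line)
    # VLAN-tagged flows carry the payload type inside encap(...): keep the last one.
    eth_types = re.findall(r"eth_type\((0x[0-9a-fA-F]+)\)", line)
    offloaded = re.search(r"offloaded:(\w+)", line)
    actions = line.rsplit("actions:", 1)[-1].strip()
    return {
        "in_port": in_port.group(1) if in_port else None,
        "eth_type": eth_types[-1] if eth_types else None,
        "offloaded": bool(offloaded) and offloaded.group(1) == "yes",
        "actions": split_top_level(actions),
    }
```

For the ARP flow arriving on `mx-bond` above, this yields `eth_type` `0x0806` with actions `pop_vlan` and `enp4s0f0np0_0`. Actions are split on top-level commas only, so `push_vlan(vid=415,pcp=0)` stays in one piece.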


From the flows, I see that the ARP request packets from VM1 here are not broadcast; because of the dst MAC and the other matches, the action has a single output port, which is why the flow gets offloaded.
Replies are supposed to be unicast, so the ARP replies from VM2 are offloaded as well.

Is it possible that the HW here un-offloads the packets coming from VM2 after its ARP reply, and offloads them again once things settle down?
I don't see the OVS kernel datapath, TC or OVN playing any role in this issue.

I will speak with the Nvidia folks in the next meeting.

Thanks

Comment 10 Ariel Levkovich 2022-09-15 18:14:42 UTC
You will always see ARP in SW because OVS ages out the ARP rule due to inactivity; when another ARP is sent, it goes to the slow path first (it's a short-lived flow, so offload won't actually get to catch it in HW).

Do you see any packets other than ARP not being offloaded?

It is possible that the OVS offload revalidator invokes cleanup due to the low packet rate, and then the flows are added back again.

Comment 11 Haresh Khandelwal 2022-09-16 09:58:27 UTC
Hi Ariel,

(In reply to Ariel Levkovich from comment #10)
> U will always see ARP in SW because OVS ages out the ARP rule due to no
> activity and when another arp will be sent it will go to the slow path at
> first (it's a short lived flow so offload won't get to catch it in HW
> actually).
> 

Here, traffic is constantly running between the 2 end VMs, so ideally there should be no ARP requests from either side. However, the guest OS (RHEL in this case) sends out ARP requests at random intervals during the traffic. Up to this point every packet is offloaded. Then the peer responds with an ARP reply, and that is where the problem starts: just after its reply, we see a few packets from the peer on the representor port. ~2-3 seconds later, they get offloaded and are no longer seen on the RP.

The guest kernel may be aging out its ARP cache and thus requesting the peer's MAC. So, is it possible that the HW also erased the flows related to traffic coming from the peer?

> Do u see other packets except from ARP no being offloaded?
> 
> It is possible that OVS offload revalidator invokes cleanup due to low
> packet rate and then the flows are added back again.

Thanks

Comment 13 Haresh Khandelwal 2022-09-21 12:53:04 UTC
Hi Ariel,

Do you have any comment on comment #11?

Thanks
-Haresh

Comment 14 Ariel Levkovich 2022-09-21 13:37:11 UTC
Hi Haresh,

HW doesn't erase flows on its own; that has to come from the SW stack. That's why I suggested tracing the OVS datapath to see whether it ages out relevant flows and then re-inserts them (perhaps using the vswitchd log or tc monitor).

Ariel

Comment 15 Haresh Khandelwal 2022-09-21 15:08:47 UTC
Hi Ariel,

I enabled netlink debug logging and I don't see any flow withdrawn for IP traffic; the few that are withdrawn are for ARP. Do you see anything else in the log below?

2022-09-13T14:16:16.835Z|00490|dpif_netlink(handler1)|DBG|added flow
2022-09-13T14:16:16.836Z|00491|dpif_netlink(handler1)|DBG|system@ovs-system: put[create] ufid:7891c381-7b16-49a1-8a72-4ade6a375f77 recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(2),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=c2:64:5c:2b:8f:75,dst=46:a5:06:c2:39:af),eth_type(0x8100),vlan(vid=415,pcp=0),encap(eth_type(0x0800),ipv4(src=6.6.6.6/0.0.0.0,dst=6.6.6.133/0.0.0.0,proto=1/0,tos=0/0,ttl=64/0,frag=no),icmp(type=8/0,code=0/0)), actions:pop_vlan,5

2022-09-13T14:16:16.841Z|00492|dpif_netlink(handler1)|DBG|added flow
2022-09-13T14:16:16.841Z|00493|dpif_netlink(handler1)|DBG|system@ovs-system: put[create] ufid:e39a20c8-e997-47f9-a885-121e5d0a2c62 recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(5),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=46:a5:06:c2:39:af,dst=c2:64:5c:2b:8f:75),eth_type(0x0800),ipv4(src=6.6.6.133/0.0.0.0,dst=6.6.6.6/0.0.0.0,proto=1/0,tos=0/0,ttl=64/0,frag=no),icmp(type=0/0,code=0/0), actions:push_vlan(vid=415,pcp=0),2



2022-09-13T14:16:21.929Z|00498|dpif_netlink(handler1)|DBG|added flow
2022-09-13T14:16:21.929Z|00499|dpif_netlink(handler1)|DBG|system@ovs-system: put[create] ufid:19057674-22d5-4e94-8e90-8827c80c7ea7 recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(5),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=46:a5:06:c2:39:af,dst=c2:64:5c:2b:8f:75),eth_type(0x0806),arp(sip=6.6.6.133/0.0.0.0,tip=6.6.6.6/0.0.0.0,op=1/0,sha=46:a5:06:c2:39:af/00:00:00:00:00:00,tha=00:00:00:00:00:00/00:00:00:00:00:00), actions:push_vlan(vid=415,pcp=0),2

2022-09-13T14:16:21.929Z|00500|dpif_netlink(handler1)|DBG|added flow
2022-09-13T14:16:21.929Z|00501|dpif_netlink(handler1)|DBG|system@ovs-system: put[create] ufid:97873bd6-ce9f-4aaa-aea5-154107a7e49e recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(2),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=c2:64:5c:2b:8f:75,dst=46:a5:06:c2:39:af),eth_type(0x8100),vlan(vid=415,pcp=0),encap(eth_type(0x0806),arp(sip=6.6.6.6/0.0.0.0,tip=6.6.6.133/0.0.0.0,op=2/0,sha=c2:64:5c:2b:8f:75/00:00:00:00:00:00,tha=46:a5:06:c2:39:af/00:00:00:00:00:00)), actions:pop_vlan,5



2022-09-13T14:16:33.272Z|00295|dpif_netlink(revalidator6)|DBG|system@ovs-system: flow_del ufid:97873bd6-ce9f-4aaa-aea5-154107a7e49e recirc_id(0),dp_hash(0),skb_priority(0),in_port(2),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=c2:64:5c:2b:8f:75,dst=46:a5:06:c2:39:af),eth_type(0x8100),vlan(vid=415,pcp=0),encap(eth_type(0x0806),arp(sip=6.6.6.6,tip=6.6.6.133,op=2,sha=c2:64:5c:2b:8f:75,tha=46:a5:06:c2:39:af)), packets:2, bytes:98, used:10.310s
2022-09-13T14:16:33.272Z|00001|dpif_netlink(revalidator7)|DBG|system@ovs-system: flow_del ufid:19057674-22d5-4e94-8e90-8827c80c7ea7 recirc_id(0),dp_hash(0),skb_priority(0),in_port(5),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=46:a5:06:c2:39:af,dst=c2:64:5c:2b:8f:75),eth_type(0x0806),arp(sip=6.6.6.133,tip=6.6.6.6,op=1,sha=46:a5:06:c2:39:af,tha=00:00:00:00:00:00), packets:2, bytes:128, used:10.311s



2022-09-13T14:16:50.601Z|00506|dpif_netlink(handler1)|DBG|added flow
2022-09-13T14:16:50.601Z|00507|dpif_netlink(handler1)|DBG|system@ovs-system: put[create] ufid:19057674-22d5-4e94-8e90-8827c80c7ea7 recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(5),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=46:a5:06:c2:39:af,dst=c2:64:5c:2b:8f:75),eth_type(0x0806),arp(sip=6.6.6.133/0.0.0.0,tip=6.6.6.6/0.0.0.0,op=1/0,sha=46:a5:06:c2:39:af/00:00:00:00:00:00,tha=00:00:00:00:00:00/00:00:00:00:00:00), actions:push_vlan(vid=415,pcp=0),2
2022-09-13T14:16:50.601Z|00508|dpif_netlink(handler1)|DBG|added flow
2022-09-13T14:16:50.601Z|00509|dpif_netlink(handler1)|DBG|system@ovs-system: put[create] ufid:97873bd6-ce9f-4aaa-aea5-154107a7e49e recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(2),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=c2:64:5c:2b:8f:75,dst=46:a5:06:c2:39:af),eth_type(0x8100),vlan(vid=415,pcp=0),encap(eth_type(0x0806),arp(sip=6.6.6.6/0.0.0.0,tip=6.6.6.133/0.0.0.0,op=2/0,sha=c2:64:5c:2b:8f:75/00:00:00:00:00:00,tha=46:a5:06:c2:39:af/00:00:00:00:00:00)), actions:pop_vlan,5


2022-09-13T14:17:02.806Z|00298|dpif_netlink(revalidator6)|DBG|system@ovs-system: flow_del ufid:97873bd6-ce9f-4aaa-aea5-154107a7e49e recirc_id(0),dp_hash(0),skb_priority(0),in_port(2),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=c2:64:5c:2b:8f:75,dst=46:a5:06:c2:39:af),eth_type(0x8100),vlan(vid=415,pcp=0),encap(eth_type(0x0806),arp(sip=6.6.6.6,tip=6.6.6.133,op=2,sha=c2:64:5c:2b:8f:75,tha=46:a5:06:c2:39:af)), packets:1, bytes:56, used:10.150s
2022-09-13T14:17:02.807Z|00299|dpif_netlink(revalidator6)|DBG|system@ovs-system: flow_del ufid:19057674-22d5-4e94-8e90-8827c80c7ea7 recirc_id(0),dp_hash(0),skb_priority(0),in_port(5),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=46:a5:06:c2:39:af,dst=c2:64:5c:2b:8f:75),eth_type(0x0806),arp(sip=6.6.6.133,tip=6.6.6.6,op=1,sha=46:a5:06:c2:39:af,tha=00:00:00:00:00:00), packets:1, bytes:64, used:10.150s


2022-09-13T14:17:19.273Z|00514|dpif_netlink(handler1)|DBG|added flow
2022-09-13T14:17:19.273Z|00515|dpif_netlink(handler1)|DBG|system@ovs-system: put[create] ufid:19057674-22d5-4e94-8e90-8827c80c7ea7 recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(5),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=46:a5:06:c2:39:af,dst=c2:64:5c:2b:8f:75),eth_type(0x0806),arp(sip=6.6.6.133/0.0.0.0,tip=6.6.6.6/0.0.0.0,op=1/0,sha=46:a5:06:c2:39:af/00:00:00:00:00:00,tha=00:00:00:00:00:00/00:00:00:00:00:00), actions:push_vlan(vid=415,pcp=0),2
2022-09-13T14:17:19.273Z|00516|dpif_netlink(handler1)|DBG|added flow
2022-09-13T14:17:19.273Z|00517|dpif_netlink(handler1)|DBG|system@ovs-system: put[create] ufid:97873bd6-ce9f-4aaa-aea5-154107a7e49e recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(2),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=c2:64:5c:2b:8f:75,dst=46:a5:06:c2:39:af),eth_type(0x8100),vlan(vid=415,pcp=0),encap(eth_type(0x0806),arp(sip=6.6.6.6/0.0.0.0,tip=6.6.6.133/0.0.0.0,op=2/0,sha=c2:64:5c:2b:8f:75/00:00:00:00:00:00,tha=46:a5:06:c2:39:af/00:00:00:00:00:00)), actions:pop_vlan,5

Thanks
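The check done by hand in comment 15 can be sketched in Python: extract every `put[create]` and `flow_del` event from the `dpif_netlink` DBG lines, with its ufid, inner payload type and idle time at deletion (`used:`). It only assumes the log line format shown in comment 15; `flow_events` and `deleted_types` are hypothetical helpers. In that excerpt, only the two ARP flows (`0x0806`) are deleted, each after about 10 s idle, which is consistent with OVS's default datapath flow idle timeout (`other_config:max-idle`, 10000 ms) and with Ariel's explanation in comment 10.

```python
import re

# In comment 15's excerpt, put[create] lines come from handler threads and
# flow_del lines from revalidator threads.
EVENT = re.compile(r"(put\[create\]|flow_del) ufid:([0-9a-f-]+) (.*)")


def flow_events(log_lines):
    """Yield (event, ufid, inner_eth_type, used_seconds) for each datapath flow event."""
    for line in log_lines:
        m = EVENT.search(line)
        if m is None:
            continue  # e.g. the bare "added flow" lines
        event, ufid, rest = m.groups()
        eth_types = re.findall(r"eth_type\((0x[0-9a-fA-F]+)\)", rest)
        used = re.search(r"used:([\d.]+)s", rest)
        yield (
            "create" if event.startswith("put") else "delete",
            ufid,
            eth_types[-1] if eth_types else None,   # payload type inside encap(...)
            float(used.group(1)) if used else None,  # only flow_del lines carry it
        )


def deleted_types(log_lines):
    """Set of inner eth_types whose datapath flows were deleted."""
    return {etype for event, _, etype, _ in flow_events(log_lines) if event == "delete"}
```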

Comment 16 Marcelo Ricardo Leitner 2022-09-27 13:18:49 UTC
(In reply to Haresh Khandelwal from comment #9)
> Hi,
> 
> I am able to reproduce this issue in my OSP17.0 environment. 
> 
> RHEL:9
> Kernel: 5.14.0-70.22.1.el9_0
> Ovs: openvswitch2.17-2.17.0-32.1.el9fdp.x86_64
> Ovn: ovn22.03-22.03.0-69.el9fdp.x86_64
> Firwamre: 16.31.1014 (DEL0000000015)
> 
> So, issue here is,
> - TCP, UDP, ICMP traffic get offloaded between 2 VMs hosted on 2 different
> compute nodes
> - Problem starts when ping is on, VM1 sends arp request and after VM2' arp
> response, few packets egressing from VM2 and ingressing to VM1 are seen on
> repsentor port at VM1. No packets seen from VM1 to VM2 direction on RP.
> - Packets on reprenstor port are observed for the brief time, ~2-3 seconds.
> However, when arp exchange starts again, we see packets on RP again for 2-3
> seconds. This continues till arp exchange happens.
> - When static arp configured for both VM's interface mac addresses, I dont
> see this issue.

Folks, this very much resembles the chain 0 update issue.
The kernel mentioned above doesn't have the fix.
The only thing I can't explain is why a reboot solves the issue.

(upstream)
commit e65812fd22eba32f11abe28cb377cbd64cfb1ba0
Author: Marcelo Ricardo Leitner <marcelo.leitner>
Date:   Thu Apr 7 11:29:23 2022 -0300

    net/sched: fix initialization order when updating chain 0 head

    Currently, when inserting a new filter that needs to sit at the head
    of chain 0, it will first update the heads pointer on all devices using
    the (shared) block, and only then complete the initialization of the new
    element so that it has a "next" element.

    This can lead to a situation that the chain 0 head is propagated to
    another CPU before the "next" initialization is done. When this race
    condition is triggered, packets being matched on that CPU will simply
    miss all other filters, and will flow through the stack as if there were
    no other filters installed. If the system is using OVS + TC, such
    packets will get handled by vswitchd via upcall, which results in much
    higher latency and reordering. For other applications it may result in
    packet drops.

    (...)
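
A deliberately simplified, single-threaded Python model of the race described in the commit message above: TC chain 0 is reduced to a linked list of filters, and a `reader` callback stands in for a lookup on another CPU, run at the exact moment between publishing the new head and initializing its `next` pointer. `Filter`, `Chain` and `insert_head` are hypothetical names; this is not the kernel code. Mapping `arp_flow`/`ip_flow` onto this bug (an ARP flow insertion making IP packets miss their offloaded filter and upcall to vswitchd) is an illustrative reading consistent with the symptoms in this thread, not something the thread confirms.

```python
class Filter:
    """One classifier rule; `matches` is a predicate on a packet dict."""
    def __init__(self, name, matches):
        self.name = name
        self.matches = matches
        self.next = None


class Chain:
    """Toy model of TC chain 0 as a singly linked list (not the kernel code)."""
    def __init__(self):
        self.head = None

    def lookup(self, pkt):
        """Return the first matching filter's name, or None (OVS would upcall)."""
        f = self.head
        while f is not None:
            if f.matches(pkt):
                return f.name
            f = f.next
        return None

    def insert_head(self, new, reader=None, buggy=False):
        """Insert `new` at the head; `reader` runs mid-update, like another CPU."""
        old = self.head
        if buggy:
            self.head = new    # head published first...
            if reader:
                reader(self)   # ...a concurrent lookup sees new.next == None
            new.next = old     # ...and "next" is set only afterwards
        else:
            new.next = old     # fixed order: initialize "next" first
            if reader:
                reader(self)   # a concurrent lookup still sees the old head
            self.head = new    # then publish the new head


if __name__ == "__main__":
    seen = []
    probe = lambda c: seen.append(c.lookup({"eth_type": 0x0800}))
    chain = Chain()
    chain.insert_head(Filter("ip_flow", lambda p: p["eth_type"] == 0x0800))
    chain.insert_head(Filter("arp_flow", lambda p: p["eth_type"] == 0x0806),
                      reader=probe, buggy=True)
    print(seen)  # [None]: the IP packet missed every filter during the update
```

With `buggy=False` the same probe returns `ip_flow`, because the new head is only published once its `next` pointer already links to the existing filters.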

Fixed in RHEL8 via https://bugzilla.redhat.com/show_bug.cgi?id=2044711
And RHEL9 via https://bugzilla.redhat.com/show_bug.cgi?id=2090410 , on kernel-5.14.0-116.el9 (9.1).
Can you please try this one?

Btw, we may be missing many fixes in 9.0 due to some misguided information we had on upgrade compatibility between RHEL8 and 9. We will go over the 9.1 fixes and request z-streams to accommodate that.
Ariel, not sure if you remember, but I had mentioned this in our meetings 2 or 3 times now.
Please take 9.0 bugs with a grain of salt until then.
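A quick way to tell whether a compute node runs a kernel at or above the one carrying the fix (`kernel-5.14.0-116.el9`, per BZ 2090410 above) is to compare the numeric parts of the `uname -r` release string. This Python sketch is a simple ordering check under that assumption: it cannot detect z-stream backports, so a 9.0.z kernel that carried the fix would still return False (comment 17 notes it was not backported to 9.0 at this point). `release_key` and `has_chain0_fix` are hypothetical helpers.

```python
import re

# First RHEL 9 kernel carrying the chain 0 fix, according to BZ 2090410.
FIXED_RELEASE = "5.14.0-116.el9"


def release_key(release):
    """Split e.g. '5.14.0-70.22.1.el9_0' into ((5, 14, 0), (70, 22, 1))."""
    version, _, rest = release.partition("-")
    m = re.match(r"[\d.]*\d", rest)  # numeric build part before '.el9...'
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (tuple(int(x) for x in version.split(".")),
            tuple(int(x) for x in m.group(0).split(".")))


def has_chain0_fix(release, fixed=FIXED_RELEASE):
    """True if `release` (e.g. from `uname -r`) is at or above `fixed`."""
    return release_key(release) >= release_key(fixed)
```

For example, `has_chain0_fix("5.14.0-70.22.1.el9_0")` (the kernel from comment 9) is False, while `has_chain0_fix("5.14.0-168.el9.x86_64")` (the kernel verified in comment 20) is True.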

Comment 17 Haresh Khandelwal 2022-09-28 10:15:35 UTC
Thanks Marcelo,

On the reboot part: I don't see a reboot solving this issue. I see the same behavior even after reboot.
I see this fix is not backported to 9.0, and OSP17.0 will remain on 9.0.z. Anyway, OSP17.0 is short-lived, so we don't expect customers to deploy it for NFV; we are fine with a 9.1-only fix.

Hi Miguel, just to confirm whether this patch fixes the issue, can you please try your CI jobs with kernel-5.14.0-116.el9?

Thanks

Comment 18 Miguel Angel Nieto 2022-09-28 22:25:50 UTC
Hi

As I already had a setup deployed based on OSP17.0 with offload and ml2_ovs, I tried it there, because the issue was reproducible there too (the first time I saw this issue was with OVN, which is when I opened the BZ).

I used this kernel: http://download-node-02.eng.bos.redhat.com/rhel-9/composes/RHEL-9/RHEL-9.2.0-20220927.0/compose/BaseOS/x86_64/os/Packages/kernel-5.14.0-168.el9.x86_64.rpm

I deployed 2 VMs (one on each compute) and all testcases passed (TCP, UDP and ICMP flows are offloaded). I will verify on an OVN setup too.

In the ml2-ovs setup I found an extra issue [1] that I thought might be related to this one, but it must be different, because it also fails with the new kernel.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2127167

Comment 19 Haresh Khandelwal 2022-09-29 07:32:54 UTC
Thanks Miguel for the quick validation. Please evaluate on ml2/ovn as well.
For BZ#2127167, let's talk there.

Thanks

Comment 20 Miguel Angel Nieto 2022-10-03 12:47:19 UTC
Used the newer kernel on an offload OVN setup, and the issue is solved:

Linux computehwoffload-r730 5.14.0-168.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Sep 23 09:31:26 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

Comment 21 Flavio Leitner 2022-12-20 20:23:28 UTC
Since this is fixed with a new kernel, can we move the ticket to the correct component?
Thanks
fbl

Comment 26 Miguel Angel Nieto 2023-01-05 10:39:56 UTC
*** Bug 2109452 has been marked as a duplicate of this bug. ***

Comment 27 Miguel Angel Nieto 2023-01-20 09:03:24 UTC
*** Bug 2162489 has been marked as a duplicate of this bug. ***

Comment 28 Miguel Angel Nieto 2023-03-22 11:57:35 UTC
Verified in osp17.1
RHOS-17.1-RHEL-9-20230315.n.1