Bug 2175161

Summary: [17.1][OVN HWOFFLOAD][ConnectX-6] Only vfs from one of the nics in the bond works
Product: Red Hat OpenStack Reporter: Miguel Angel Nieto <mnietoji>
Component: openvswitch   Assignee: Eelco Chaudron <echaudro>
Status: CLOSED NOTABUG QA Contact: Eran Kuris <ekuris>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 17.1 (Wallaby)   CC: amorenoz, apevec, chrisw, fleitner, hakhande, mleitner, mpattric
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-03-09 13:59:27 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Miguel Angel Nieto 2023-03-03 12:08:20 UTC
Description of problem:
I have the following configuration:

OVS hardware-offload bond (active-backup):
[root@compute-0 tripleo-admin]# ovs-vsctl show
    Bridge br-link0
        fail_mode: standalone
        Port patch-provnet-c6ce1e5c-9f15-4a13-a566-302b4f60d981-to-br-int
            Interface patch-provnet-c6ce1e5c-9f15-4a13-a566-302b4f60d981-to-br-int
                type: patch
                options: {peer=patch-br-int-to-provnet-c6ce1e5c-9f15-4a13-a566-302b4f60d981}
        Port br-link0
            Interface br-link0
                type: internal
        Port patch-provnet-cafacc6b-ae66-4398-a790-54532ea7cd59-to-br-int
            Interface patch-provnet-cafacc6b-ae66-4398-a790-54532ea7cd59-to-br-int
                type: patch
                options: {peer=patch-br-int-to-provnet-cafacc6b-ae66-4398-a790-54532ea7cd59}
        Port mx-bond
            Interface mx-bond
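
For context, a minimal sketch of the usual host-side prerequisites for OVS hardware offload (in this deployment the equivalent settings are applied by TripleO; the PCI addresses correspond to the two PFs shown in the lspci output further down):

# Put both PFs into switchdev mode and enable hardware offload in OVS
devlink dev eswitch set pci/0000:17:00.0 mode switchdev
devlink dev eswitch set pci/0000:17:00.1 mode switchdev
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
systemctl restart openvswitch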

[root@compute-0 tripleo-admin]# cat /proc/net/bonding/mx-bond 
Ethernet Channel Bonding Driver: v5.14.0-283.el9.x86_64

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: ens2f0np0
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

Slave Interface: ens2f0np0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: b8:3f:d2:6f:ed:f0
Slave queue ID: 0

Slave Interface: ens2f1np1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: b8:3f:d2:6f:ed:f1
Slave queue ID: 0

VFs:
[root@compute-0 tripleo-admin]# ip link show ens2f0np0
12: ens2f0np0: <BROADCAST,MULTICAST,PROMISC,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master mx-bond state UP mode DEFAULT group default qlen 1000
    link/ether b8:3f:d2:6f:ed:f0 brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    vf 1     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    vf 2     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    vf 3     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    vf 4     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    vf 5     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    vf 6     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    vf 7     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    vf 8     link/ether fa:16:3e:5b:6b:35 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    vf 9     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    altname enp23s0f0np0
[root@compute-0 tripleo-admin]# ip link show ens2f1np1
13: ens2f1np1: <BROADCAST,MULTICAST,PROMISC,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master mx-bond state UP mode DEFAULT group default qlen 1000
    link/ether b8:3f:d2:6f:ed:f0 brd ff:ff:ff:ff:ff:ff permaddr b8:3f:d2:6f:ed:f1
    vf 0     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    vf 1     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    vf 2     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    vf 3     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    vf 4     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    vf 5     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    vf 6     link/ether fa:16:3e:c3:7e:d8 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    vf 7     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    vf 8     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    vf 9     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    altname enp23s0f1np1


VFs on ens2f0np0 are used for Geneve, while VFs on ens2f1np1 are used for VLAN:
    NovaPCIPassthrough:
      # Geneve/VxLAN configuration for Mellanox HW-Offload NIC
      - devname: "ens2f0np0"
        trusted: "true"
        physical_network: null
      # VLAN configuration for Mellanox HW-Offload NIC
      - devname: "ens2f1np1"
        trusted: "true"
        physical_network: mx-network
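
For reference, ports on these networks are the switchdev "direct" ports used for OVS hardware offload; a minimal sketch of how such a port and VM are created (the port and server names are illustrative; the network, image and flavor names are the ones used in this test):

openstack port create --network mellanox-geneve-provider \
    --vnic-type direct \
    --binding-profile '{"capabilities": ["switchdev"]}' \
    offload-port-geneve
openstack server create --flavor nfv_qe_base_flavor \
    --image rhel-guest-image-8.7-1660.x86_64.qcow2 \
    --port offload-port-geneve offload-vm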

I spawn two VMs, each with an interface on both networks:
(overcloud) [stack@undercloud-0 ~]$ openstack server list --all-projects
+--------------------------------------+------------------------------------------+--------+---------------------------------------------------------------------------------------------+----------------------------------------+--------------------+
| ID                                   | Name                                     | Status | Networks                                                                                    | Image                                  | Flavor             |
+--------------------------------------+------------------------------------------+--------+---------------------------------------------------------------------------------------------+----------------------------------------+--------------------+
| f0c30f2d-6cc1-4819-81f4-0ebae06f7e4e | tempest-TestNfvOffload-server-574685884  | ACTIVE | mellanox-geneve-provider=192.168.39.54, 20.20.220.195; mellanox-vlan-provider=30.30.220.143 | rhel-guest-image-8.7-1660.x86_64.qcow2 | nfv_qe_base_flavor |
| f4123c13-f0e9-446b-9789-6c737061e2f5 | tempest-TestNfvOffload-server-2132320008 | ACTIVE | mellanox-geneve-provider=192.168.39.51, 20.20.220.142; mellanox-vlan-provider=30.30.220.149 | rhel-guest-image-8.7-1660.x86_64.qcow2 | nfv_qe_base_flavor |
+--------------------------------------+------------------------------------------+--------+---------------------------------------------------------------------------------------------+-------------------------

Pinging from inside one VM to the other over each of those interfaces, only the Geneve one works:
[cloud-user@tempest-testnfvoffload-server-2132320008 ~]$ ping -c 1 -w 1 20.20.220.195
PING 20.20.220.195 (20.20.220.195) 56(84) bytes of data.
64 bytes from 20.20.220.195: icmp_seq=1 ttl=64 time=10.3 ms

--- 20.20.220.195 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 10.336/10.336/10.336/0.000 ms
[cloud-user@tempest-testnfvoffload-server-2132320008 ~]$ ping -c 1 -w 1 30.30.220.143
PING 30.30.220.143 (30.30.220.143) 56(84) bytes of data.

--- 30.30.220.143 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

If I change the active slave of the bond to the other port:
[root@compute-0 tripleo-admin]# ifenslave -c mx-bond ens2f1np1                                                                                                                                                     
[root@compute-0 tripleo-admin]# cat /proc/net/bonding/mx-bond                                                                                                                                                      
Ethernet Channel Bonding Driver: v5.14.0-283.el9.x86_64                                                                                                                                                            
                                                                                                                                                                                                                   
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: ens2f1np1
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

Slave Interface: ens2f0np0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: b8:3f:d2:6f:ed:f0
Slave queue ID: 0
                                                    
Slave Interface: ens2f1np1
MII Status: up   
Speed: 10000 Mbps
Duplex: full         
Link Failure Count: 0               
Permanent HW addr: b8:3f:d2:6f:ed:f1
Slave queue ID: 0
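
For reference, the same failover can also be forced through the bonding sysfs interface; a sketch equivalent to the ifenslave command above:

# Make ens2f1np1 the active slave of mx-bond and confirm
echo ens2f1np1 > /sys/class/net/mx-bond/bonding/active_slave
cat /sys/class/net/mx-bond/bonding/active_slave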

Now the behavior is reversed: the VLAN interface works and the Geneve one does not:
[cloud-user@tempest-testnfvoffload-server-2132320008 ~]$ ping -c 1 -w 1 20.20.220.195
PING 20.20.220.195 (20.20.220.195) 56(84) bytes of data.

--- 20.20.220.195 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

[cloud-user@tempest-testnfvoffload-server-2132320008 ~]$ ping -c 1 -w 1 30.30.220.143
PING 30.30.220.143 (30.30.220.143) 56(84) bytes of data.
64 bytes from 30.30.220.143: icmp_seq=1 ttl=64 time=340 ms

--- 30.30.220.143 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 340.399/340.399/340.399/0.000 ms




Version-Release number of selected component (if applicable):
RHOS-17.1-RHEL-9-20230301.n.1

How reproducible:
Deploy OVS hardware offload with ConnectX-6 NICs.
Spawn two VMs with VLAN and Geneve interfaces.
Ping from inside one VM over both interfaces; only one interface works.
Change the active port in the bond.
Ping from inside the VM over both interfaces again; the interface that was failing now works, and the one that was working now fails.

Actual results:
Only one interface in the VM works.


Expected results:
Both interfaces in the VM should work.

Additional info:

Comment 1 Eelco Chaudron 2023-03-09 10:20:40 UTC
Hi Miguel,

I have a couple of questions.

- Is this something new, i.e. was it working before? If so, can you bisect which component is the culprit (kernel version, ovs version)?
- Were you able to isolate where/what traffic was lost? Is it the initial ARP, ingress, or egress? 
- Do you see any OpenFlow rules being installed, and are they offloaded or not? (A quick way to check this is sketched below.)
- Does it work when you disable hw offload in OVS?
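
For reference, a quick sketch of how to check this with standard OVS tooling on the compute node:

# Is hardware offload enabled in OVS?
ovs-vsctl get Open_vSwitch . other_config:hw-offload
# Datapath flows that have been offloaded to the NIC (TC)
ovs-appctl dpctl/dump-flows type=offloaded
# Datapath flows still handled in software by the kernel datapath
ovs-appctl dpctl/dump-flows type=ovs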

To speed things up, it might be good to get access to your setup. If possible, I can get a big chunk of time allocated on Tuesday or Thursday.

Comment 2 Eelco Chaudron 2023-03-09 10:27:46 UTC
One more question: are the VFs configured/used through OVS, or are they bypassing OVS?

Comment 5 Miguel Angel Nieto 2023-03-09 13:59:27 UTC
There was a misconfiguration on the ConnectX-6.
We wanted to use the NIC-only mode (not the SmartNIC mode), but some of the parameters were wrong, which caused several problems, including the one described here.
It was solved by configuring the card properly.

[root@compute-0 tripleo-admin]# lspci | grep X-6
17:00.0 Ethernet controller: Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller (rev 01)
17:00.1 Ethernet controller: Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller (rev 01)


# mstconfig and mstfwreset below are provided by the mstflint package
sudo yum install mstflint

# PCI addresses of the two ConnectX-6 Dx PFs (see the lspci output above)
pcis="17:00.0 17:00.1"

# Step 1: set INTERNAL_CPU_MODEL on both PFs (first part of the NIC-only mode configuration)
for pci in $pcis;do
  mstconfig -y -d 0000:${pci} s INTERNAL_CPU_MODEL=1
done

# Step 2: reboot so the change above takes effect
sudo reboot

# Step 3: hand page, eswitch and IB vport0 control to the external host PF, disable the embedded offload engine, then reset the firmware
for pci in $pcis;do
  mstconfig -y -d 0000:${pci} s INTERNAL_CPU_PAGE_SUPPLIER=EXT_HOST_PF
  mstconfig -y -d 0000:${pci} s INTERNAL_CPU_ESWITCH_MANAGER=EXT_HOST_PF
  mstconfig -y -d 0000:${pci} s INTERNAL_CPU_IB_VPORT0=EXT_HOST_PF
  mstconfig -y -d 0000:${pci} s INTERNAL_CPU_OFFLOAD_ENGINE=DISABLED
  mstfwreset -y -d 0000:${pci} reset
done

# Check configuration
for pci in $pcis;do
   mstconfig -d 0000:${pci} query | grep INTERNAL | awk -v pci=$pci '{print pci,$0}'
done
******************************

17:00.0          INTERNAL_CPU_MODEL                          EMBEDDED_CPU(1) 
17:00.0          INTERNAL_CPU_PAGE_SUPPLIER                  EXT_HOST_PF(1)  
17:00.0          INTERNAL_CPU_ESWITCH_MANAGER                EXT_HOST_PF(1)  
17:00.0          INTERNAL_CPU_IB_VPORT0                      EXT_HOST_PF(1)  
17:00.0          INTERNAL_CPU_OFFLOAD_ENGINE                 DISABLED(1)     
17:00.0          INTERNAL_CPU_RSHIM                          ENABLED(0)      
17:00.1          INTERNAL_CPU_MODEL                          EMBEDDED_CPU(1) 
17:00.1          INTERNAL_CPU_PAGE_SUPPLIER                  EXT_HOST_PF(1)  
17:00.1          INTERNAL_CPU_ESWITCH_MANAGER                EXT_HOST_PF(1)  
17:00.1          INTERNAL_CPU_IB_VPORT0                      EXT_HOST_PF(1)  
17:00.1          INTERNAL_CPU_OFFLOAD_ENGINE                 DISABLED(1)     
17:00.1          INTERNAL_CPU_RSHIM                          ENABLED(0)