Bug 2224982 - [ovs dpdk bonding]ovs-acppctl bond/show status is earlier than ovs-vsctl get interface link_state
Summary: [ovs dpdk bonding]ovs-acppctl bond/show status is earlier than ovs-vsctl get ...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: openvswitch
Version: RHEL 9.0
Hardware: x86_64
OS: Linux
unspecified
medium
Target Milestone: ---
: ---
Assignee: Mike Pattrick
QA Contact: mhou
URL:
Whiteboard:
: 2224981 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2023-07-24 07:39 UTC by mhou
Modified: 2023-07-25 02:23 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-25 02:23:32 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker FD-3049 0 None None None 2023-07-24 07:40:53 UTC

Description mhou 2023-07-24 07:39:50 UTC
Description of problem:
create balance-slb port or balance-tcp port with lacp as active and bond-downdelay=60000 bond-updelay=60000. When try down active member, find may_enable status (ovs-appctl bond/show)changed before lisk_state changed(ovs-vsctl get interface)

This issue occurred when ovs side configure lacp as active. 

Version-Release number of selected component (if applicable):
openvswitch3.1-3.1.0-35.el9fdp

test nic:
# ethtool -i ens1f0np0
driver: mlx5_core
version: 5.14.0-341.el9.x86_64
firmware-version: 16.34.1002 (MT_0000000080)
expansion-rom-version: 
bus-info: 0000:12:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

# ethtool -i ens1f1np1
driver: mlx5_core
version: 5.14.0-341.el9.x86_64
firmware-version: 16.34.1002 (MT_0000000080)
expansion-rom-version: 
bus-info: 0000:12:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

How reproducible: always


Steps to Reproduce:
1. create ovs dpdkbonding port.
systemctl restart openvswitch
ovs-appctl vlog/set console:lacp:dbg syslog:lacp:dbg file:lacp:dbg
ovs-appctl vlog/set console:bond:dbg syslog:bond:dbg file:bond:dbg
ovs-appctl vlog/set console:ovs_rcu:dbg syslog:ovs_rcu:dbg file:ovs_rcu:dbg
ovs-appctl vlog/set console:memory:dbg syslog:memory:dbg file:memory:dbg
ovs-vsctl set Open_vSwitch . other_config={}
ovs-vsctl --if-exists del-br bondbridge
ovs-vsctl --may-exist add-br bondbridge
ovs-vsctl set int bondbridge mtu_request=9000
ovs-ofctl mod-port bondbridge bondbridge up
ovs-vsctl add-bond bondbridge balance-slb ens1f1np1 ens1f0np0 lacp=active bond_mode=balance-slb  -- set Interface ens1f1np1 type=dpdk options:dpdk-devargs=class=eth,mac=04:3f:72:b0:35:33 options:n_rxq=4 mtu_request=9000  -- set Interface ens1f0np0 type=dpdk options:dpdk-devargs=class=eth,mac=04:3f:72:b0:35:32 options:n_rxq=4 mtu_request=9000  -- set Port balance-slb  bond-downdelay=60000 bond-updelay=60000

please note: use pci address as devargs won't aviod this issue.
ovs-vsctl add-bond bondbridge balance-slb ens1f1np1 ens1f0np0 lacp=active bond_mode=balance-slb  -- set Interface ens1f1np1 type=dpdk options:dpdk-devargs=0000:12:00.0 options:n_rxq=4 mtu_request=9000  -- set Interface ens1f0np0 type=dpdk options:dpdk-devargs=0000:12:00.1 options:n_rxq=4 mtu_request=9000  -- set Port balance-slb  bond-downdelay=60000 bond-updelay=60000

2. check lacp status and bond status
# ovs-appctl bond/show
---- balance-slb ----
bond_mode: balance-slb
bond may use recirculation: no, Recirc-ID : -1
bond-hash-basis: 0
lb_output action: disabled, bond-id: -1
all members active: false
updelay: 60000 ms
downdelay: 60000 ms
next rebalance: 5495 ms
lacp_status: negotiated
lacp_fallback_ab: false
active-backup primary: <none>
active member mac: 04:3f:72:b0:35:32(ens1f0np0)

member ens1f0np0: enabled
  active member
  may_enable: true

member ens1f1np1: enabled
  may_enable: true

# ovs-appctl lacp/show
---- balance-slb ----
  status: active negotiated
  sys_id: 04:3f:72:b0:35:32
  sys_priority: 65534
  aggregation key: 7
  lacp_time: slow

member: ens1f0np0: current attached
  port_id: 8
  port_priority: 65535
  may_enable: true

  actor sys_id: 04:3f:72:b0:35:32
  actor sys_priority: 65534
  actor port_id: 8
  actor port_priority: 65535
  actor key: 7
  actor state: activity aggregation synchronized collecting distributing

  partner sys_id: b0:c5:3c:f6:36:c7
  partner sys_priority: 32768
  partner port_id: 265
  partner port_priority: 32768
  partner key: 56
  partner state: activity aggregation synchronized collecting distributing

member: ens1f1np1: current attached
  port_id: 7
  port_priority: 65535
  may_enable: true

  actor sys_id: 04:3f:72:b0:35:32
  actor sys_priority: 65534
  actor port_id: 7
  actor port_priority: 65535
  actor key: 7
  actor state: activity aggregation synchronized collecting distributing

  partner sys_id: b0:c5:3c:f6:36:c7
  partner sys_priority: 32768
  partner port_id: 266
  partner port_priority: 32768
  partner key: 56
  partner state: activity aggregation synchronized collecting distributing


3. down current active member on switch side.
# swcfg port_down 9364 "eth1/3/1"
spawn ssh admin.88.13
User Access Verification
(admin.88.13) Password: 

Cisco Nexus Operating System (NX-OS) Software
TAC support: http://www.cisco.com/tac
Copyright (C) 2002-2022, Cisco and/or its affiliates.
All rights reserved.
The copyrights to certain works contained in this software are
owned by other third parties and used and distributed under their own
licenses, such as open source.  This software is provided "as is," and unless
otherwise stated, there is no warranty, express or implied, including but not
limited to warranties of merchantability and fitness for a particular purpose.
Certain components of this software are licensed under
the GNU General Public License (GPL) version 2.0 or 
GNU General Public License (GPL) version 3.0  or the GNU
Lesser General Public License (LGPL) Version 2.1 or 
Lesser General Public License (LGPL) Version 2.0. 
A copy of each such license is available at
http://www.opensource.org/licenses/gpl-2.0.php and
http://opensource.org/licenses/gpl-3.0.html and
http://www.opensource.org/licenses/lgpl-2.1.php and
http://www.gnu.org/licenses/old-licenses/library.txt.
sw-cisco9364# \r
               ^
% Incomplete command at '^' marker.
sw-cisco9364# terminal length 0
sw-cisco9364# configure t
Enter configuration commands, one per line. End with CNTL/Z.
sw-cisco9364(config)# interface eth1/3/1 
sw-cisco9364(config-if)# shutdown
sw-cisco9364(config-if)# end
sw-cisco9364# show interface eth1/3/1 status

--------------------------------------------------------------------------------
Port          Name               Status    Vlan      Duplex  Speed   Type
--------------------------------------------------------------------------------
Eth1/3/1      --                 disabled  trunk     auto    auto    QSFP-4SFP25G-CU3M
sw-cisco9364# end
sw-cisco9364# 

4. check the members status flapping on ovs-switchd.log. wait for 2-3s, the port link status changed.
2023-07-24T06:55:32.678Z|01850|bond|INFO|member ens1f0np0: link state down
2023-07-24T06:55:32.678Z|01851|bond|INFO|member ens1f0np0: will be disabled if it stays down for 60000 ms
2023-07-24T06:55:35.660Z|01852|netdev_dpdk|DBG|Port 0 Link Down

same situation can monitor on ovs-appctl bond/show and ovs-vsctl get interface ens1f0np0 link_state

5. confiure balance-slb port with lacp passive
ovs-vsctl add-bond bondbridge balance-slb ens1f1np1 ens1f0np0 lacp=passive bond_mode=balance-slb  -- set Interface ens1f1np1 type=dpdk options:dpdk-devargs=0000:12:00.0 options:n_rxq=4 mtu_request=9000  -- set Interface ens1f0np0 type=dpdk options:dpdk-devargs=0000:12:00.1 options:n_rxq=4 mtu_request=9000  -- set Port balance-slb  bond-downdelay=60000 bond-updelay=60000

6. down slave interface ens1f0np0 and check ovs-vswitchd.lgo
swcfg port_down 9364 "eth1/3/1"
spawn ssh admin.88.13
User Access Verification
(admin.88.13) Password: 

Cisco Nexus Operating System (NX-OS) Software
TAC support: http://www.cisco.com/tac
Copyright (C) 2002-2022, Cisco and/or its affiliates.
All rights reserved.
The copyrights to certain works contained in this software are
owned by other third parties and used and distributed under their own
licenses, such as open source.  This software is provided "as is," and unless
otherwise stated, there is no warranty, express or implied, including but not
limited to warranties of merchantability and fitness for a particular purpose.
Certain components of this software are licensed under
the GNU General Public License (GPL) version 2.0 or 
GNU General Public License (GPL) version 3.0  or the GNU
Lesser General Public License (LGPL) Version 2.1 or 
Lesser General Public License (LGPL) Version 2.0. 
A copy of each such license is available at
http://www.opensource.org/licenses/gpl-2.0.php and
http://opensource.org/licenses/gpl-3.0.html and
http://www.opensource.org/licenses/lgpl-2.1.php and
http://www.gnu.org/licenses/old-licenses/library.txt.
sw-cisco9364# \r
               ^
% Incomplete command at '^' marker.
sw-cisco9364# terminal length 0
sw-cisco9364# configure t
Enter configuration commands, one per line. End with CNTL/Z.
sw-cisco9364(config)# interface eth1/3/1 
sw-cisco9364(config-if)# shutdown
sw-cisco9364(config-if)# end
sw-cisco9364# show interface eth1/3/1 status

--------------------------------------------------------------------------------
Port          Name               Status    Vlan      Duplex  Speed   Type
--------------------------------------------------------------------------------
Eth1/3/1      --                 disabled  trunk     auto    auto    QSFP-4SFP25G-CU3M
sw-cisco9364# end
sw-cisco9364# 


2023-07-24T07:27:44.443Z|02177|netdev_dpdk|DBG|Port 0 Link Down
2023-07-24T07:27:45.952Z|02178|bond|DBG|bond balance-slb: ens1f0np0 0kB (h88: 0kB), ens1f1np1 0kB
2023-07-24T07:27:54.635Z|02179|bond|INFO|member ens1f0np0: link state down
2023-07-24T07:27:54.635Z|02180|bond|INFO|member ens1f0np0: will be disabled if it stays down for 60000 ms

Actual results:
1. When ovs is configured with lacp active, the member status check of the port is earlier than the link status of the port.

Expected results:
1. link status flapping should earlier than member status check.

Additional info:
On beaker job, I found that the member state is earlier than the port state 30 seconds to flip
https://beaker.engineering.redhat.com/recipes/14281186#task163400891
https://beaker-archive.hosts.prod.psi.bos.redhat.com/beaker-logs/2023/07/80999/8099964/14281186/163400891/ovs-vswitchd.log
2023-07-23T21:27:51.472Z|00292|bond|DBG|bond balance-slb: ens1f0np0 1kB (h40: 0kB + h161: 0kB + h183: 0kB + h140: 0kB + h241: 0kB), ens1f1np1 0kB (h156: 0kB + h108: 0kB + h5: 0kB + h72: 0kB)
2023-07-23T21:28:01.935Z|00293|bond|DBG|bond balance-slb: ens1f0np0 0kB (h40: 0kB + h161: 0kB + h183: 0kB + h140: 0kB + h241: 0kB), ens1f1np1 0kB (h156: 0kB + h108: 0kB + h5: 0kB + h72: 0kB)
2023-07-23T21:28:03.937Z|00294|bond|INFO|member ens1f0np0: link state down
2023-07-23T21:28:03.937Z|00295|bond|INFO|member ens1f0np0: will be disabled if it stays down for 60000 ms
2023-07-23T21:28:11.951Z|00296|bond|DBG|bond balance-slb: ens1f0np0 0kB (h40: 0kB + h161: 0kB + h183: 0kB + h140: 0kB + h241: 0kB), ens1f1np1 0kB (h156: 0kB + h108: 0kB + h5: 0kB + h72: 0kB)
2023-07-23T21:28:21.961Z|00297|bond|DBG|bond balance-slb: ens1f0np0 0kB (h40: 0kB + h161: 0kB + h183: 0kB + h140: 0kB + h241: 0kB), ens1f1np1 0kB (h156: 0kB + h108: 0kB + h5: 0kB + h72: 0kB)
2023-07-23T21:28:31.971Z|00298|bond|DBG|bond balance-slb: ens1f0np0 0kB (h40: 0kB + h161: 0kB + h183: 0kB + h140: 0kB + h241: 0kB), ens1f1np1 0kB (h156: 0kB + h108: 0kB + h5: 0kB + h72: 0kB)
2023-07-23T21:28:33.247Z|00299|netdev_dpdk|DBG|Port 0 Link Down
2023-07-23T21:28:42.266Z|00300|bond|DBG|bond balance-slb: ens1f0np0 0kB (h40: 0kB + h161: 0kB + h183: 0kB + h140: 0kB + h241: 0kB), ens1f1np1 0kB (h156: 0kB + h108: 0kB + h5: 0kB + h72: 0kB)
2023-07-23T21:28:52.277Z|00301|bond|DBG|bond balance-slb: ens1f0np0 0kB (h183: 0kB + h40: 0kB + h161: 0kB + h140: 0kB + h241: 0kB), ens1f1np1 0kB (h156: 0kB + h108: 0kB + h5: 0kB + h72: 0kB)
2023-07-23T21:29:02.287Z|00302|bond|DBG|bond balance-slb: ens1f0np0 0kB (h183: 0kB + h40: 0kB + h161: 0kB), ens1f1np1 0kB (h156: 0kB + h108: 0kB)
2023-07-23T21:29:03.938Z|00303|bond|INFO|member ens1f0np0: disabled
2023-07-23T21:29:03.939Z|00304|bond|INFO|bond balance-slb: active member is now ens1f1np1

Comment 1 mhou 2023-07-24 08:02:32 UTC
*** Bug 2224981 has been marked as a duplicate of this bug. ***

Comment 2 mhou 2023-07-25 02:23:32 UTC
After check with Mike. Mike thinks this phenomenon is expected.

The bond log message happens during the bridge update, so it could be delayed if something else on the bridge is taking too long. So neither are instantaneous.

The DPDK log message is just on a 5 second timer. the bond log message starts off in the main run() function. run() keeps everything up to date and that means calling bond_run(). bond_run() updates which ports are active / inactive, up or down, and decides if flows need to be revalidated.

These run in different threads, and because of that 5 second timer. You're a lot more likely to get the bond log message first. If you ran this test 100 times, I bet 75% of the time, the bond log message would come first. Because run() takes less than 5 seconds usually.

may_enable is not the same as link status,It also includes things like lacp status. If corresponding switch port down, the lacp status will change. Then appctl check the bond change and display it. lacp status changed earlier than link flapping. That's why may_enable: false. ovs-vsctl get interface ens1f0np0 link_state is complex. OVS is setting that value in OVS DB. So there could be a delay there.


Note You need to log in before you can comment on or make changes to this bug.