Bug 2036113 - cluster scaling new nodes ovs-configuration fails on all new nodes
Summary: cluster scaling new nodes ovs-configuration fails on all new nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.10.0
Assignee: Jaime Caamaño Ruiz
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks: 2038249
 
Reported: 2021-12-29 18:51 UTC by Alvaro Soto
Modified: 2023-09-18 04:29 UTC
CC: 18 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-12 04:40:32 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift machine-config-operator pull 2907 (Merged): Bug 2036113: configure-ovs: cleanup leftovers from previous run (last updated 2022-03-01 15:00:36 UTC)
Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-12 04:40:42 UTC)

Description Alvaro Soto 2021-12-29 18:51:57 UTC
After completing an initial cluster build and upgrading to 4.7.37, an attempt was made to scale in additional nodes. These nodes join the cluster successfully, and the workerperf node role label is then added to them. After the subsequent reboot, the ovs-configuration service (configure-ovs.sh) fails and the nodes are stuck in Ready,SchedulingDisabled status.

Here is the end of the journal for ovs-configuration:


Dec 28 13:42:51 worker-03 configure-ovs.sh[2364331]: + nmcli conn up ovs-if-br-ex
Dec 28 13:42:51 worker-03 configure-ovs.sh[2364331]: Error: Connection activation failed: A dependency of the connection failed
Dec 28 13:42:51 worker-03 configure-ovs.sh[2364331]: Hint: use 'journalctl -xe NM_CONNECTION=a473fbb3-54fc-46a2-bdb9-5684ee6c3021 + NM_DEVICE=br-ex' to get more details.
Dec 28 13:42:51 worker-03 configure-ovs.sh[2364331]: + sleep 5
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: + counter=5
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: + '[' 5 -lt 5 ']'
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: + echo 'ERROR: Failed to activate ovs-if-br-ex NM connection'
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: ERROR: Failed to activate ovs-if-br-ex NM connection
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: + set +e
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: + nmcli conn down ovs-if-br-ex
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: Error: 'ovs-if-br-ex' is not an active connection.
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: Error: no active connection provided.
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: + nmcli conn down ovs-if-phys0
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: Error: 'ovs-if-phys0' is not an active connection.
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: Error: no active connection provided.
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: + nmcli conn up 52eecf5a-df5e-30ae-9ca1-6297f0239027
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: Connection successfully activated (master waiting for slaves) (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/105)
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: + exit 1
Dec 28 13:42:56 worker-03 systemd[1]: ovs-configuration.service: Main process exited, code=exited, status=1/FAILURE
Dec 28 13:42:56 worker-03 systemd[1]: ovs-configuration.service: Failed with result 'exit-code'.
Dec 28 13:42:56 worker-03 systemd[1]: Failed to start Configures OVS with proper host networking configuration.
Dec 28 13:42:56 worker-03 systemd[1]: ovs-configuration.service: Consumed 2.250s CPU time
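
The hint printed above can be followed up on the node to see why the br-ex dependency failed. A minimal sketch (the UUID is the one from this journal and will differ per node):

# full picture of the failed run
journalctl -b -u ovs-configuration.service -u NetworkManager

# or narrowed down to the failing connection, as NetworkManager suggests
journalctl -xe NM_CONNECTION=a473fbb3-54fc-46a2-bdb9-5684ee6c3021 + NM_DEVICE=br-ex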

Also, br-ex might be failing to come up because of this NetworkManager warning:

Dec 28 22:35:55 worker-03 NetworkManager[4955]: <warn>  [1640730955.1234] device br-ex could not be added to a ovs port: Error running the transaction: constraint violation: Transaction causes multiple rows in "Bridge" table to have identical values (br-ex) for index on column "name".  First row, with UUID 7e950337-0d75-48a3-aa55-8f305cb90f0c, existed in the database before this transaction and was not modified by the transaction.  Second row, with UUID 2bec689d-50f4-4ad4-b117-57266c11ed90, was inserted by this transaction.

This might be linked to the existing, stale entry for the bridge in OVS.

In the sosreport, sos_commands/networkmanager/nmcli_con_show_id_br-ex shows:
  connection.uuid: e8b55300-d9e8-434b-bcd6-c0bec962516b
but sos_commands/openvswitch/ovs-vsctl_list_bridge_br-ex shows:
  external_ids: {NM.connection.uuid="590c6f76-f177-4427-a614-8b1c6bd719c9"}
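
On a live node the same mismatch can be checked directly. A minimal sketch, assuming only the br-ex name from the sosreport files above:

# UUID NetworkManager holds for the br-ex connection
nmcli -g connection.uuid connection show br-ex

# UUID OVS recorded for the bridge when it was created
ovs-vsctl --if-exists get Bridge br-ex external_ids

If the two UUIDs differ, the br-ex row already in OVS is stale, and NetworkManager ends up trying to insert a second one, which would match the constraint violation above.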

Comment 1 Gabriel Diotte 2022-01-04 18:51:35 UTC
Following a troubleshooting call on case 03114000, we isolated the issue to a stale UUID within OVS.

Running ovs-vsctl del-br br-ex allowed ovs-configuration to succeed.

What we're wondering about at this point is which mechanism exists to ensure the OVS UUID matches the NetworkManager UUID. It seems that a reboot allows the two to fall out of sync.
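
For reference, the manual recovery from the call amounts to something like the sketch below. Restarting the service is an assumption; the comment above only says the next ovs-configuration run succeeded after the delete:

# remove the stale bridge row from OVS
ovs-vsctl --if-exists del-br br-ex

# re-run the configuration step
systemctl restart ovs-configuration.service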

Comment 3 Gabriel Diotte 2022-01-04 20:18:02 UTC
As a more persistent workaround, this is what we're doing at line 123 of the script.


 if ! nmcli connection show br-ex &> /dev/null; then
    nmcli c add type ovs-bridge \
        con-name br-ex \
        conn.interface br-ex \
        802-3-ethernet.mtu ${iface_mtu} \
        802-3-ethernet.cloned-mac-address ${iface_mac} \
        ipv4.route-metric 100 \
        ipv6.route-metric 100 \
        ${extra_brex_args}
  fi

becomes

 if ! nmcli connection show br-ex &> /dev/null; then
    ovs-vsctl --if-exists del-br br-ex
    nmcli c add type ovs-bridge \
        con-name br-ex \
        conn.interface br-ex \
        802-3-ethernet.mtu ${iface_mtu} \
        802-3-ethernet.cloned-mac-address ${iface_mac} \
        ipv4.route-metric 100 \
        ipv6.route-metric 100 \
        ${extra_brex_args}
  fi

Notice that we add the bridge deletion in OVS before creating it again through nmcli. So far this workaround has addressed the UUID mismatch by ensuring the bridge doesn't exist in OVS before recreating it through NetworkManager.

This is, effectively, the previous workaround but automated.

Is there a long-term solution to this that would be safer?
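
In the meantime, a consistency check after the patched block runs could look like the sketch below; it only assumes the same br-ex name used in the script above:

nm_uuid=$(nmcli -g connection.uuid connection show br-ex)
ovs_uuid=$(ovs-vsctl --if-exists get Bridge br-ex external_ids \
            | grep -o 'NM.connection.uuid="[^"]*"' | cut -d'"' -f2)

# the workaround is effective when NetworkManager and OVS agree on the UUID
if [ -n "$nm_uuid" ] && [ "$nm_uuid" = "$ovs_uuid" ]; then
    echo "OK: br-ex UUIDs match ($nm_uuid)"
else
    echo "MISMATCH: NetworkManager=$nm_uuid OVS=$ovs_uuid"
fi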

Comment 23 Darren Carpenter 2022-03-02 13:27:41 UTC
Hi Jcaamano,

When they scale out/in, the nodes are re-labeled and rebooted, which re-applies the ovs-configuration, and it gets stuck due to the lack of persistence. It would seem that either an error occurs that prevents these files from being cleaned up on start, or they aren't cleaned up because, I assume, the naming convention changes.
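
A quick way to look for those leftovers on an affected node might be the following sketch; the profile names are the ones configure-ovs references in the journal above (br-ex, ovs-if-br-ex, ovs-if-phys0), and the grep pattern is an assumption that may need adjusting:

# NetworkManager profiles left behind by a previous configure-ovs run
nmcli -g NAME,UUID,DEVICE connection show | grep -E 'br-ex|phys0'

# bridges still present in OVS from that run
ovs-vsctl list-br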

Comment 26 errata-xmlrpc 2022-03-12 04:40:32 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

Comment 32 Red Hat Bugzilla 2023-09-18 04:29:47 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.

