Bug 2038249 - [4.9.z backport] cluster scaling new nodes ovs-configuration fails on all new nodes
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.9.z
Assignee: Jaime Caamaño Ruiz
QA Contact: Anurag saxena
Duplicates: 2061641
Depends On: 2036113 2053431
Blocks: 2062655
Reported: 2022-01-07 16:03 UTC by Jaime Caamaño Ruiz
Modified: 2022-03-16 11:39 UTC
CC: 5 users

Clones: 2062655
Last Closed: 2022-03-16 11:38:43 UTC




Links
GitHub openshift/machine-config-operator pull 2901 (open): [release-4.9] Bug 2038249: Improvements for configure-ovs script (last updated 2022-03-01 14:58:49 UTC)
Red Hat Product Errata RHBA-2022:0798 (last updated 2022-03-16 11:39:09 UTC)

Description Jaime Caamaño Ruiz 2022-01-07 16:03:11 UTC
This bug was initially created as a copy of Bug #2036113

After completing an initial cluster build and upgrading to 4.7.37, an attempt was made to scale out additional nodes. These nodes join the cluster successfully, and the workerperf node role label is then added to them. After reboot, the configure-ovs.sh script (run by ovs-configuration.service) fails and the nodes are stuck in Ready,SchedulingDisabled status.

Here is the end of the journal for ovs-configuration:


Dec 28 13:42:51 worker-03 configure-ovs.sh[2364331]: + nmcli conn up ovs-if-br-ex
Dec 28 13:42:51 worker-03 configure-ovs.sh[2364331]: Error: Connection activation failed: A dependency of the connection failed
Dec 28 13:42:51 worker-03 configure-ovs.sh[2364331]: Hint: use 'journalctl -xe NM_CONNECTION=a473fbb3-54fc-46a2-bdb9-5684ee6c3021 + NM_DEVICE=br-ex' to get more details.
Dec 28 13:42:51 worker-03 configure-ovs.sh[2364331]: + sleep 5
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: + counter=5
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: + '[' 5 -lt 5 ']'
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: + echo 'ERROR: Failed to activate ovs-if-br-ex NM connection'
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: ERROR: Failed to activate ovs-if-br-ex NM connection
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: + set +e
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: + nmcli conn down ovs-if-br-ex
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: Error: 'ovs-if-br-ex' is not an active connection.
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: Error: no active connection provided.
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: + nmcli conn down ovs-if-phys0
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: Error: 'ovs-if-phys0' is not an active connection.
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: Error: no active connection provided.
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: + nmcli conn up 52eecf5a-df5e-30ae-9ca1-6297f0239027
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: Connection successfully activated (master waiting for slaves) (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/105)
Dec 28 13:42:56 worker-03 configure-ovs.sh[2364331]: + exit 1
Dec 28 13:42:56 worker-03 systemd[1]: ovs-configuration.service: Main process exited, code=exited, status=1/FAILURE
Dec 28 13:42:56 worker-03 systemd[1]: ovs-configuration.service: Failed with result 'exit-code'.
Dec 28 13:42:56 worker-03 systemd[1]: Failed to start Configures OVS with proper host networking configuration.
Dec 28 13:42:56 worker-03 systemd[1]: ovs-configuration.service: Consumed 2.250s CPU time
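The journal above shows the retry pattern configure-ovs.sh uses before giving up: attempt the activation, sleep, bump a counter, and emit the ERROR line once five attempts have failed. A minimal sketch of that pattern follows; the real script differs in details, and `nmcli` is stubbed here (always failing, as br-ex activation does in this bug) so the failure path can be exercised standalone:

```shell
# Stub of nmcli: activation always fails, mirroring br-ex in this bug.
nmcli() {
  echo "Error: Connection activation failed: A dependency of the connection failed" >&2
  return 1
}

# Sketch of the retry loop visible in the log (assumption: illustrative
# simplification of configure-ovs.sh, not the exact upstream code).
activate_with_retries() {
  local conn=$1 counter=0
  while [ "$counter" -lt 5 ]; do
    if nmcli conn up "$conn" 2>/dev/null; then
      echo "activated $conn"
      return 0
    fi
    counter=$((counter + 1))   # the real script also sleeps 5s between attempts
  done
  echo "ERROR: Failed to activate $conn NM connection"
  return 1
}

activate_with_retries ovs-if-br-ex
```

With the stub always failing, the function prints the same ERROR line seen in the journal and returns 1, which is what makes ovs-configuration.service fail with exit-code.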

Also, br-ex might be failing to come up due to this warning:

Dec 28 22:35:55 worker-03 NetworkManager[4955]: <warn>  [1640730955.1234] device br-ex could not be added to a ovs port: Error running the transaction: constraint violation: Transaction causes multiple rows in "Bridge" table to have identical values (br-ex) for index on column "name".  First row, with UUID 7e950337-0d75-48a3-aa55-8f305cb90f0c, existed in the database before this transaction and was not modified by the transaction.  Second row, with UUID 2bec689d-50f4-4ad4-b117-57266c11ed90, was inserted by this transaction.

This might be linked to a stale entry for the bridge in OVS.

sos_commands/networkmanager/nmcli_con_show_id_br-ex shows:
  connection.uuid: e8b55300-d9e8-434b-bcd6-c0bec962516b
but sos_commands/openvswitch/ovs-vsctl_list_bridge_br-ex shows:
  external_ids: {NM.connection.uuid="590c6f76-f177-4427-a614-8b1c6bd719c9"}
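The mismatch can be checked directly; a minimal sketch using the two UUIDs from this sosreport (the comparison itself is illustrative, not part of configure-ovs.sh; on a live node the values would come from nmcli and ovs-vsctl as noted in the comments):

```shell
# UUIDs as reported in this bug. On a live node they would come from:
#   nmcli -g connection.uuid conn show br-ex
#   ovs-vsctl get bridge br-ex external_ids:NM.connection.uuid
nm_uuid="e8b55300-d9e8-434b-bcd6-c0bec962516b"
ovs_uuid="590c6f76-f177-4427-a614-8b1c6bd719c9"

# If the OVSDB bridge row references a different NM connection than the one
# NetworkManager knows about, the bridge entry is stale.
if [ "$nm_uuid" != "$ovs_uuid" ]; then
  echo "stale br-ex: OVSDB references a different NM connection"
fi
```

A stale row like this is consistent with the "multiple rows in Bridge table" constraint violation above: the pre-existing row (from the earlier boot) collides with the one the new transaction inserts.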

Comment 4 Ross Brattain 2022-02-24 01:57:16 UTC
jq on worker depends on https://github.com/openshift/openshift-ansible/pull/12376

Comment 5 Ross Brattain 2022-02-24 15:00:02 UTC
Tested RHEL worker scaling with openshift/machine-config-operator#2901 and openshift/openshift-ansible#12376:

Feb 24 14:37:32.630836 ip-10.compute.internal configure-ovs.sh[1557]: ++ jq '.[0].addr_info | map(. | select(.family == "inet")) | length'
Feb 24 14:37:32.685834 ip-10.compute.internal configure-ovs.sh[1557]: ++ jq '.[0].addr_info | map(. | select(.family == "inet6" and .scope != "link")) | length'
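These two filters count the interface's IPv4 addresses and its non-link-local IPv6 addresses from `ip -j addr show` output. Fed an abridged sample of that JSON (the sample itself is illustrative, not taken from this cluster), they behave like this:

```shell
# Abridged, illustrative sample of `ip -j addr show <iface>` JSON output.
addr_json='[{"addr_info":[
  {"family":"inet","local":"10.0.52.124","scope":"global"},
  {"family":"inet6","local":"fe80::1","scope":"link"},
  {"family":"inet6","local":"2600::1","scope":"global"}]}]'

# Same filters as in the log: IPv4 count, then non-link-local IPv6 count.
echo "$addr_json" | jq '.[0].addr_info | map(. | select(.family == "inet")) | length'
echo "$addr_json" | jq '.[0].addr_info | map(. | select(.family == "inet6" and .scope != "link")) | length'
```

Both filters print 1 for this sample: one IPv4 address, and one IPv6 address once the link-scoped fe80:: entry is excluded. This is why jq must be present on RHEL workers, hence the openshift-ansible dependency above.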


Source RPM  : jq-1.6-2.el8.src.rpm

ip-10.compute.internal   Ready    worker   19m    v1.22.3+b93fd35   10.0.52.124   <none>        Red Hat Enterprise Linux 8.4 (Ootpa)                           4.18.0-348.12.2.el8_5.x86_64   cri-o://1.22.2-2.rhaos4.9.gitb030be8.el8

Comment 10 Victor Voronkov 2022-03-15 17:26:05 UTC
*** Bug 2061641 has been marked as a duplicate of this bug. ***

Comment 12 errata-xmlrpc 2022-03-16 11:38:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.24 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0798

