2110550 – overcloud redeploy fails on: "Set connection" of ovn_cluster_north/south_db task

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	tripleo-ansible
Sub Component:
Version:	17.0 (Wallaby)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	ga
Target Release:	17.0
Assignee:	Terry Wilson
QA Contact:	Joe H. Rahme
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	2123168 (view as bug list)
Depends On:
Blocks:	ovsdbclustering

Reported:	2022-07-25 15:17 UTC by Marian Krcmarik
Modified:	2023-01-11 20:27 UTC
CC List:	11 users
Fixed In Version:	tripleo-ansible-3.3.1-0.20220720020866.fa5422f.el9ost openstack-tripleo-heat-templates-14.3.1-0.20220719171723.feca772.el9ost
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-09-21 12:24:19 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
OpenStack gerrit	851450	None	master: MERGED	tripleo-heat-templates: Set OVSDB Connection.probe_interval (I7d8d0530c367708215437c9ac11a6fc17235e784)	2022-08-24 13:33:32 UTC
OpenStack gerrit	851452	None	master: MERGED	tripleo-ansible: Leave connection setup to THT (I690c32a2ed2142e4be6518ea27fa153c318c94b8)	2022-08-24 13:33:38 UTC
OpenStack gerrit	851888	None	stable/wallaby: MERGED	tripleo-ansible: Leave connection setup to THT (I690c32a2ed2142e4be6518ea27fa153c318c94b8)	2022-08-24 13:33:43 UTC
OpenStack gerrit	851889	None	stable/wallaby: MERGED	tripleo-heat-templates: Set OVSDB Connection.probe_interval (I7d8d0530c367708215437c9ac11a6fc17235e784)	2022-08-24 13:33:49 UTC
OpenStack gerrit	853101	None	stable/wallaby: MERGED	tripleo-heat-templates: Fix ovsdb-server for IPv6 listening addresses (I1d04eedeb7290408f612933427a763288e4ba10b)	2022-08-24 13:33:54 UTC
Red Hat Issue Tracker	OSP-17833	None	None	None	2022-07-25 16:11:39 UTC
Red Hat Product Errata	RHEA-2022:6543	None	None	None	2022-09-21 12:24:40 UTC

Description Marian Krcmarik 2022-07-25 15:17:09 UTC

Description of problem:
A redeploy/update of an overcloud stack (in order to change some overcloud settings) sometimes fails on the "Set connection" task with the following error:
2022-07-25 08:19:26.429837 | 525400cc-c740-9f23-b52a-00000000c37b |      FATAL | Set connection | central-controller-0 | error={"changed": true, "cmd": "podman exec ovn_cluster_north_db_server bash -c \"ovn-nbctl -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6641\"\npodman exec ovn_cluster_south_db_server bash -c \"ovn-sbctl -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6642\"\n", "delta": "0:00:00.560672", "end": "2022-07-25 08:19:26.390425", "msg": "non-zero return code", "rc": 1, "start": "2022-07-25 08:19:25.829753", "stderr": "time=\"2022-07-25T08:19:25Z\" level=warning msg=\" binary not found, container dns will not be enabled\"\ntime=\"2022-07-25T08:19:26Z\" level=warning msg=\" binary not found, container dns will not be enabled\"\novn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed ()", "stderr_lines": ["time=\"2022-07-25T08:19:25Z\" level=warning msg=\" binary not found, container dns will not be enabled\"", "time=\"2022-07-25T08:19:26Z\" level=warning msg=\" binary not found, container dns will not be enabled\"", "ovn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed ()"], "stdout": "", "stdout_lines": []}

The task tries to set the connection parameter and is executed on the "bootstrap" node, which is central-controller-0 in this case.
If I run the command manually on that node, I get the following:
[root@central-controller-0 /]# ovn-sbctl -v -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6642
2022-07-25T15:01:27Z|00001|reconnect|DBG|unix:/var/run/ovn/ovnsb_db.sock: entering BACKOFF
2022-07-25T15:01:27Z|00002|ovn_dbctl|INFO|Called as ovn-sbctl -v -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6642
2022-07-25T15:01:27Z|00003|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting...
2022-07-25T15:01:27Z|00004|reconnect|DBG|unix:/var/run/ovn/ovnsb_db.sock: entering CONNECTING
2022-07-25T15:01:27Z|00005|ovsdb_cs|DBG|unix:/var/run/ovn/ovnsb_db.sock: SERVER_SCHEMA_REQUESTED -> SERVER_SCHEMA_REQUESTED at lib/ovsdb-cs.c:419
2022-07-25T15:01:27Z|00006|poll_loop|DBG|wakeup due to [POLLOUT] on fd 4 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:153
2022-07-25T15:01:27Z|00007|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connected
2022-07-25T15:01:27Z|00008|reconnect|DBG|unix:/var/run/ovn/ovnsb_db.sock: entering ACTIVE
2022-07-25T15:01:27Z|00009|jsonrpc|DBG|unix:/var/run/ovn/ovnsb_db.sock: send request, method="get_schema", params=["_Server"], id=1
2022-07-25T15:01:27Z|00010|ovsdb_cs|DBG|unix:/var/run/ovn/ovnsb_db.sock: SERVER_SCHEMA_REQUESTED -> SERVER_SCHEMA_REQUESTED at lib/ovsdb-cs.c:419
2022-07-25T15:01:27Z|00011|poll_loop|DBG|wakeup due to [POLLIN] on fd 4 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:157
2022-07-25T15:01:27Z|00012|jsonrpc|DBG|unix:/var/run/ovn/ovnsb_db.sock: received reply, result=

...stripped some output..

2022-07-25T15:01:27Z|00017|ovsdb_cs|INFO|unix:/var/run/ovn/ovnsb_db.sock: clustered database server is not cluster leader; trying another server
2022-07-25T15:01:27Z|00018|reconnect|DBG|unix:/var/run/ovn/ovnsb_db.sock: entering RECONNECT
2022-07-25T15:01:27Z|00019|ovsdb_cs|DBG|unix:/var/run/ovn/ovnsb_db.sock: SERVER_MONITOR_REQUESTED -> RETRY at lib/ovsdb-cs.c:2011
2022-07-25T15:01:27Z|00020|poll_loop|DBG|wakeup due to 0-ms timeout at lib/reconnect.c:677
2022-07-25T15:01:27Z|00021|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connection attempt timed out
2022-07-25T15:01:27Z|00022|reconnect|DBG|unix:/var/run/ovn/ovnsb_db.sock: entering BACKOFF
2022-07-25T15:01:27Z|00023|ovsdb_cs|DBG|unix:/var/run/ovn/ovnsb_db.sock: RETRY -> SERVER_SCHEMA_REQUESTED at lib/ovsdb-cs.c:419
ovn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed ()

It seems that central-controller-0 is no longer the leader of the cluster, although it probably was at the time of the initial deployment, when it was the bootstrap node. The command tries to apply the settings on the leader node but fails to connect to it.

If I run the command on the leader node (in this case it's central-controller-1) the command is successful:
[root@central-controller-1 /]# ovs-appctl -t /var/lib/openvswitch/ovnsb_db.ctl cluster/status OVN_Southbound | grep Role
Role: leader

[root@central-controller-1 /]# ovn-sbctl -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6642
[root@central-controller-1 /]#

The leader node for OVN_Northbound is still central-controller-0 so it does not fail there in this case.
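
A minimal sketch of the leader check above, for anyone who wants to confirm which controller holds leadership before retrying by hand. The function names is_leader and sb_is_leader are hypothetical helpers, not part of tripleo-ansible; the control socket path and database name are the ones used in this report, and the "Role: leader" line format is taken from the cluster/status output shown above.

```shell
# Hypothetical helper: reads `cluster/status` output on stdin and
# succeeds only if this server reports itself as the Raft leader.
is_leader() {
    grep -Eq '^Role:[[:space:]]*leader'
}

# Hypothetical wrapper: asks the local Southbound DB server for its
# cluster status (path and DB name as used in this report) and
# returns 0 only when this node is the OVN_Southbound leader.
sb_is_leader() {
    ovs-appctl -t /var/lib/openvswitch/ovnsb_db.ctl \
        cluster/status OVN_Southbound | is_leader
}
```

Run inside the ovn_cluster_south_db_server container, sb_is_leader would be expected to succeed on central-controller-1 and fail on central-controller-0 in the state described here.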

Version-Release number of selected component (if applicable):
ansible-tripleo-ipsec-11.0.1-0.20210910011424.b5559c8.el9ost.noarch
ansible-tripleo-ipa-0.2.3-0.20220301190449.6b0ed82.el9ost.noarch
ansible-role-tripleo-modify-image-1.3.1-0.20220216001439.30d23d5.el9ost.noarch
tripleo-ansible-3.3.1-0.20220720020859.fa5422f.el9ost.noarch

How reproducible:
Sometimes

Steps to Reproduce:
1. Deploy an overcloud of OSP17.0 with RAFT enabled for OVN
2. Perform some overcloud actions (or otherwise move the leader of OVN_North/Southbound off the bootstrap node)
3. Redeploy overcloud

Actual results:
2022-07-25 08:19:26.429837 | 525400cc-c740-9f23-b52a-00000000c37b |      FATAL | Set connection | central-controller-0 | error={"changed": true, "cmd": "podman exec ovn_cluster_north_db_server bash -c \"ovn-nbctl -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6641\"\npodman exec ovn_cluster_south_db_server bash -c \"ovn-sbctl -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6642\"\n", "delta": "0:00:00.560672", "end": "2022-07-25 08:19:26.390425", "msg": "non-zero return code", "rc": 1, "start": "2022-07-25 08:19:25.829753", "stderr": "time=\"2022-07-25T08:19:25Z\" level=warning msg=\" binary not found, container dns will not be enabled\"\ntime=\"2022-07-25T08:19:26Z\" level=warning msg=\" binary not found, container dns will not be enabled\"\novn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed ()", "stderr_lines": ["time=\"2022-07-25T08:19:25Z\" level=warning msg=\" binary not found, container dns will not be enabled\"", "time=\"2022-07-25T08:19:26Z\" level=warning msg=\" binary not found, container dns will not be enabled\"", "ovn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed ()"], "stdout": "", "stdout_lines": []}

Expected results:
Successful redeploy

Additional info:

Comment 1 Terry Wilson 2022-07-25 17:23:54 UTC

I can see how if the leader changed this would fail. A simple solution would be to add --no-leader-only to the ovn-sbctl calls. With that said, depending on how we solve bz2101588, these calls (which are a workaround) may go away. Ultimately, they're kind of broken in that they are adding a connection object to the DB to listen on *all interfaces*. We'd need to check the iptables rules on the controllers to verify that those ports are blocked on non-ctlplane interfaces.
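
For illustration only, a sketch of what the --no-leader-only variant suggested above might look like for the southbound call. The certificate paths and the pssl:6642 target are copied from the failing task; the SBCTL variable and the set_sb_connection function are hypothetical and exist only to make the command easy to dry-run. Per the gerrit changes listed under Links, the merged fix moved the connection setup out of tripleo-ansible and into THT, so this is not the change that shipped.

```shell
# SBCTL is a hypothetical override for dry runs; it defaults to the
# real client when unset.
SBCTL="${SBCTL:-ovn-sbctl}"

# Hypothetical wrapper around the "Set connection" southbound command.
# --no-leader-only lets the client talk to whichever cluster member it
# reaches, instead of insisting on the Raft leader.
set_sb_connection() {
    "$SBCTL" --no-leader-only \
        -p /etc/pki/tls/private/ovn_dbs.key \
        -c /etc/pki/tls/certs/ovn_dbs.crt \
        -C /etc/ipa/ca.crt \
        set-connection pssl:6642
}
```

Setting SBCTL=echo before calling set_sb_connection prints the assembled arguments without touching the database, which is a cheap way to check the command line before running it in the container.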

Comment 19 errata-xmlrpc 2022-09-21 12:24:19 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543

Comment 20 Jakub Libosvar 2023-01-11 20:27:20 UTC

*** Bug 2123168 has been marked as a duplicate of this bug. ***
