Bug 2110550 - overcloud redeploy fails on: "Set connection" of ovn_cluster_north/south_db task
Summary: overcloud redeploy fails on: "Set connection" of ovn_cluster_north/south_db task
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: tripleo-ansible
Version: 17.0 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ga
Target Release: 17.0
Assignee: Terry Wilson
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Duplicates: 2123168
Depends On:
Blocks: ovsdbclustering
Reported: 2022-07-25 15:17 UTC by Marian Krcmarik
Modified: 2023-01-11 20:27 UTC (History)

Fixed In Version: tripleo-ansible-3.3.1-0.20220720020866.fa5422f.el9ost openstack-tripleo-heat-templates-14.3.1-0.20220719171723.feca772.el9ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-21 12:24:19 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 851450 0 None master: MERGED tripleo-heat-templates: Set OVSDB Connection.probe_interval (I7d8d0530c367708215437c9ac11a6fc17235e784) 2022-08-24 13:33:32 UTC
OpenStack gerrit 851452 0 None master: MERGED tripleo-ansible: Leave connection setup to THT (I690c32a2ed2142e4be6518ea27fa153c318c94b8) 2022-08-24 13:33:38 UTC
OpenStack gerrit 851888 0 None stable/wallaby: MERGED tripleo-ansible: Leave connection setup to THT (I690c32a2ed2142e4be6518ea27fa153c318c94b8) 2022-08-24 13:33:43 UTC
OpenStack gerrit 851889 0 None stable/wallaby: MERGED tripleo-heat-templates: Set OVSDB Connection.probe_interval (I7d8d0530c367708215437c9ac11a6fc17235e784) 2022-08-24 13:33:49 UTC
OpenStack gerrit 853101 0 None stable/wallaby: MERGED tripleo-heat-templates: Fix ovsdb-server for IPv6 listening addresses (I1d04eedeb7290408f612933427a763288e4ba10b) 2022-08-24 13:33:54 UTC
Red Hat Issue Tracker OSP-17833 0 None None None 2022-07-25 16:11:39 UTC
Red Hat Product Errata RHEA-2022:6543 0 None None None 2022-09-21 12:24:40 UTC

Description Marian Krcmarik 2022-07-25 15:17:09 UTC
Description of problem:
A redeploy/update of an overcloud stack (run in order to change some overcloud settings) sometimes fails on the "Set connection" task with the following error:
2022-07-25 08:19:26.429837 | 525400cc-c740-9f23-b52a-00000000c37b |      FATAL | Set connection | central-controller-0 | error={"changed": true, "cmd": "podman exec ovn_cluster_north_db_server bash -c \"ovn-nbctl -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6641\"\npodman exec ovn_cluster_south_db_server bash -c \"ovn-sbctl -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6642\"\n", "delta": "0:00:00.560672", "end": "2022-07-25 08:19:26.390425", "msg": "non-zero return code", "rc": 1, "start": "2022-07-25 08:19:25.829753", "stderr": "time=\"2022-07-25T08:19:25Z\" level=warning msg=\" binary not found, container dns will not be enabled\"\ntime=\"2022-07-25T08:19:26Z\" level=warning msg=\" binary not found, container dns will not be enabled\"\novn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed ()", "stderr_lines": ["time=\"2022-07-25T08:19:25Z\" level=warning msg=\" binary not found, container dns will not be enabled\"", "time=\"2022-07-25T08:19:26Z\" level=warning msg=\" binary not found, container dns will not be enabled\"", "ovn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed ()"], "stdout": "", "stdout_lines": []}

The task tries to set the connection parameter and is executed on the "bootstrap" node, which is central-controller-0 in this case.
If I run the command manually on that node, I get the following:
[root@central-controller-0 /]# ovn-sbctl -v -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6642
2022-07-25T15:01:27Z|00001|reconnect|DBG|unix:/var/run/ovn/ovnsb_db.sock: entering BACKOFF
2022-07-25T15:01:27Z|00002|ovn_dbctl|INFO|Called as ovn-sbctl -v -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6642
2022-07-25T15:01:27Z|00003|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting...
2022-07-25T15:01:27Z|00004|reconnect|DBG|unix:/var/run/ovn/ovnsb_db.sock: entering CONNECTING
2022-07-25T15:01:27Z|00005|ovsdb_cs|DBG|unix:/var/run/ovn/ovnsb_db.sock: SERVER_SCHEMA_REQUESTED -> SERVER_SCHEMA_REQUESTED at lib/ovsdb-cs.c:419
2022-07-25T15:01:27Z|00006|poll_loop|DBG|wakeup due to [POLLOUT] on fd 4 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:153
2022-07-25T15:01:27Z|00007|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connected
2022-07-25T15:01:27Z|00008|reconnect|DBG|unix:/var/run/ovn/ovnsb_db.sock: entering ACTIVE
2022-07-25T15:01:27Z|00009|jsonrpc|DBG|unix:/var/run/ovn/ovnsb_db.sock: send request, method="get_schema", params=["_Server"], id=1
2022-07-25T15:01:27Z|00010|ovsdb_cs|DBG|unix:/var/run/ovn/ovnsb_db.sock: SERVER_SCHEMA_REQUESTED -> SERVER_SCHEMA_REQUESTED at lib/ovsdb-cs.c:419
2022-07-25T15:01:27Z|00011|poll_loop|DBG|wakeup due to [POLLIN] on fd 4 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:157
2022-07-25T15:01:27Z|00012|jsonrpc|DBG|unix:/var/run/ovn/ovnsb_db.sock: received reply, result=

...stripped some output..

2022-07-25T15:01:27Z|00017|ovsdb_cs|INFO|unix:/var/run/ovn/ovnsb_db.sock: clustered database server is not cluster leader; trying another server
2022-07-25T15:01:27Z|00018|reconnect|DBG|unix:/var/run/ovn/ovnsb_db.sock: entering RECONNECT
2022-07-25T15:01:27Z|00019|ovsdb_cs|DBG|unix:/var/run/ovn/ovnsb_db.sock: SERVER_MONITOR_REQUESTED -> RETRY at lib/ovsdb-cs.c:2011
2022-07-25T15:01:27Z|00020|poll_loop|DBG|wakeup due to 0-ms timeout at lib/reconnect.c:677
2022-07-25T15:01:27Z|00021|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connection attempt timed out
2022-07-25T15:01:27Z|00022|reconnect|DBG|unix:/var/run/ovn/ovnsb_db.sock: entering BACKOFF
2022-07-25T15:01:27Z|00023|ovsdb_cs|DBG|unix:/var/run/ovn/ovnsb_db.sock: RETRY -> SERVER_SCHEMA_REQUESTED at lib/ovsdb-cs.c:419
ovn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed ()

It seems that central-controller-0 is no longer the leader of the cluster, although it probably was at the time of the initial deployment, when it was the bootstrap node. The command tries to apply the settings on the leader node but fails to connect to it.

If I run the command on the leader node (in this case it's central-controller-1) the command is successful:
[root@central-controller-1 /]# ovs-appctl -t /var/lib/openvswitch/ovnsb_db.ctl cluster/status OVN_Southbound | grep Role
Role: leader

[root@central-controller-1 /]# ovn-sbctl -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6642
[root@central-controller-1 /]#

The leader node for OVN_Northbound is still central-controller-0, so the ovn-nbctl call does not fail in this case.
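
As a rough illustration of the leader check above, the Role line can be pulled out of the cluster/status output with a small helper. The function name raft_role is hypothetical and is not part of OVN or tripleo-ansible; the sketch assumes the output contains a "Role: <value>" line, as in the transcript above.

```shell
# Hypothetical helper: print the RAFT role found in `cluster/status`
# output read from stdin (assumes a line of the form "Role: <value>").
raft_role() {
    awk -F': *' '$1 == "Role" { print $2; exit }'
}

# Example use on a controller (control socket path and database name
# are the ones shown in this report):
#   ovs-appctl -t /var/lib/openvswitch/ovnsb_db.ctl cluster/status OVN_Southbound | raft_role
```

Running this on each controller would show which node currently holds the leader role for a given database.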

Version-Release number of selected component (if applicable):
ansible-tripleo-ipsec-11.0.1-0.20210910011424.b5559c8.el9ost.noarch
ansible-tripleo-ipa-0.2.3-0.20220301190449.6b0ed82.el9ost.noarch
ansible-role-tripleo-modify-image-1.3.1-0.20220216001439.30d23d5.el9ost.noarch
tripleo-ansible-3.3.1-0.20220720020859.fa5422f.el9ost.noarch

How reproducible:
Sometimes

Steps to Reproduce:
1. Deploy an overcloud of OSP17.0 with RAFT enabled for OVN
2. Perform some overcloud actions (or otherwise move the leader of OVN_Northbound/OVN_Southbound away from the bootstrap node)
3. Redeploy overcloud

Actual results:
2022-07-25 08:19:26.429837 | 525400cc-c740-9f23-b52a-00000000c37b |      FATAL | Set connection | central-controller-0 | error={"changed": true, "cmd": "podman exec ovn_cluster_north_db_server bash -c \"ovn-nbctl -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6641\"\npodman exec ovn_cluster_south_db_server bash -c \"ovn-sbctl -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6642\"\n", "delta": "0:00:00.560672", "end": "2022-07-25 08:19:26.390425", "msg": "non-zero return code", "rc": 1, "start": "2022-07-25 08:19:25.829753", "stderr": "time=\"2022-07-25T08:19:25Z\" level=warning msg=\" binary not found, container dns will not be enabled\"\ntime=\"2022-07-25T08:19:26Z\" level=warning msg=\" binary not found, container dns will not be enabled\"\novn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed ()", "stderr_lines": ["time=\"2022-07-25T08:19:25Z\" level=warning msg=\" binary not found, container dns will not be enabled\"", "time=\"2022-07-25T08:19:26Z\" level=warning msg=\" binary not found, container dns will not be enabled\"", "ovn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed ()"], "stdout": "", "stdout_lines": []}

Expected results:
Successful redeploy

Additional info:

Comment 1 Terry Wilson 2022-07-25 17:23:54 UTC
I can see how this would fail if the leader changed. A simple solution would be to add --no-leader-only to the ovn-sbctl calls. With that said, depending on how we solve bz2101588, these calls (which are a workaround) may go away. Ultimately, they're kind of broken in that they add a connection object to the DB that listens on *all interfaces*. We'd need to check the iptables rules on the controllers to verify that those ports are blocked on non-ctlplane interfaces.
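
A minimal sketch of the --no-leader-only suggestion, reusing the certificate paths and ports from the failing task. The wrapper name ovn_set_connection and the RUN variable are hypothetical and exist only for this illustration; the fix that was eventually merged (see Links) instead left the connection setup to tripleo-heat-templates.

```shell
# Hypothetical wrapper around the "Set connection" commands with
# --no-leader-only added, so the client does not insist on talking
# to the RAFT leader.
# $1: ovn-nbctl or ovn-sbctl, $2: listening port (6641 or 6642 in this report).
# Set RUN=echo to print the command instead of executing it.
ovn_set_connection() {
    ${RUN:-} "$1" --no-leader-only \
        -p /etc/pki/tls/private/ovn_dbs.key \
        -c /etc/pki/tls/certs/ovn_dbs.crt \
        -C /etc/ipa/ca.crt \
        set-connection "pssl:$2"
}

# Example use inside the DB server containers:
#   ovn_set_connection ovn-nbctl 6641
#   ovn_set_connection ovn-sbctl 6642
```

This keeps the pssl listener bound to all interfaces, so the concern about non-ctlplane exposure raised in the comment would still apply.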

Comment 19 errata-xmlrpc 2022-09-21 12:24:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543

Comment 20 Jakub Libosvar 2023-01-11 20:27:20 UTC
*** Bug 2123168 has been marked as a duplicate of this bug. ***

