Bug 2110550

Summary: overcloud redeploy fails on: "Set connection" of ovn_cluster_north/south_db task
Product: Red Hat OpenStack
Component: tripleo-ansible
Version: 17.0 (Wallaby)
Reporter: Marian Krcmarik <mkrcmari>
Assignee: Terry Wilson <twilson>
QA Contact: Joe H. Rahme <jhakimra>
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Milestone: ga
Target Release: 17.0
Keywords: Triaged
Hardware: Unspecified
OS: Unspecified
CC: akatz, bcafarel, ekuris, jpretori, jschluet, mlavalle, pgrist, ramishra, skaplons, spower, twilson
Fixed In Version: tripleo-ansible-3.3.1-0.20220720020866.fa5422f.el9ost openstack-tripleo-heat-templates-14.3.1-0.20220719171723.feca772.el9ost
Last Closed: 2022-09-21 12:24:19 UTC
Type: Bug
Bug Blocks: 1503518

Description Marian Krcmarik 2022-07-25 15:17:09 UTC
Description of problem:
Redeploying/updating an overcloud stack (in order to change some overcloud settings) sometimes fails on the "Set connection" task with the following error:
2022-07-25 08:19:26.429837 | 525400cc-c740-9f23-b52a-00000000c37b |      FATAL | Set connection | central-controller-0 | error={"changed": true, "cmd": "podman exec ovn_cluster_north_db_server bash -c \"ovn-nbctl -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6641\"\npodman exec ovn_cluster_south_db_server bash -c \"ovn-sbctl -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6642\"\n", "delta": "0:00:00.560672", "end": "2022-07-25 08:19:26.390425", "msg": "non-zero return code", "rc": 1, "start": "2022-07-25 08:19:25.829753", "stderr": "time=\"2022-07-25T08:19:25Z\" level=warning msg=\" binary not found, container dns will not be enabled\"\ntime=\"2022-07-25T08:19:26Z\" level=warning msg=\" binary not found, container dns will not be enabled\"\novn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed ()", "stderr_lines": ["time=\"2022-07-25T08:19:25Z\" level=warning msg=\" binary not found, container dns will not be enabled\"", "time=\"2022-07-25T08:19:26Z\" level=warning msg=\" binary not found, container dns will not be enabled\"", "ovn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed ()"], "stdout": "", "stdout_lines": []}

The task tries to set the connection parameter and runs on the "bootstrap" node, which in this case is central-controller-0.
If I run the command manually on that node, I get the following:
[root@central-controller-0 /]# ovn-sbctl -v -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6642
2022-07-25T15:01:27Z|00001|reconnect|DBG|unix:/var/run/ovn/ovnsb_db.sock: entering BACKOFF
2022-07-25T15:01:27Z|00002|ovn_dbctl|INFO|Called as ovn-sbctl -v -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6642
2022-07-25T15:01:27Z|00003|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting...
2022-07-25T15:01:27Z|00004|reconnect|DBG|unix:/var/run/ovn/ovnsb_db.sock: entering CONNECTING
2022-07-25T15:01:27Z|00005|ovsdb_cs|DBG|unix:/var/run/ovn/ovnsb_db.sock: SERVER_SCHEMA_REQUESTED -> SERVER_SCHEMA_REQUESTED at lib/ovsdb-cs.c:419
2022-07-25T15:01:27Z|00006|poll_loop|DBG|wakeup due to [POLLOUT] on fd 4 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:153
2022-07-25T15:01:27Z|00007|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connected
2022-07-25T15:01:27Z|00008|reconnect|DBG|unix:/var/run/ovn/ovnsb_db.sock: entering ACTIVE
2022-07-25T15:01:27Z|00009|jsonrpc|DBG|unix:/var/run/ovn/ovnsb_db.sock: send request, method="get_schema", params=["_Server"], id=1
2022-07-25T15:01:27Z|00010|ovsdb_cs|DBG|unix:/var/run/ovn/ovnsb_db.sock: SERVER_SCHEMA_REQUESTED -> SERVER_SCHEMA_REQUESTED at lib/ovsdb-cs.c:419
2022-07-25T15:01:27Z|00011|poll_loop|DBG|wakeup due to [POLLIN] on fd 4 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:157
2022-07-25T15:01:27Z|00012|jsonrpc|DBG|unix:/var/run/ovn/ovnsb_db.sock: received reply, result=

...stripped some output..

2022-07-25T15:01:27Z|00017|ovsdb_cs|INFO|unix:/var/run/ovn/ovnsb_db.sock: clustered database server is not cluster leader; trying another server
2022-07-25T15:01:27Z|00018|reconnect|DBG|unix:/var/run/ovn/ovnsb_db.sock: entering RECONNECT
2022-07-25T15:01:27Z|00019|ovsdb_cs|DBG|unix:/var/run/ovn/ovnsb_db.sock: SERVER_MONITOR_REQUESTED -> RETRY at lib/ovsdb-cs.c:2011
2022-07-25T15:01:27Z|00020|poll_loop|DBG|wakeup due to 0-ms timeout at lib/reconnect.c:677
2022-07-25T15:01:27Z|00021|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connection attempt timed out
2022-07-25T15:01:27Z|00022|reconnect|DBG|unix:/var/run/ovn/ovnsb_db.sock: entering BACKOFF
2022-07-25T15:01:27Z|00023|ovsdb_cs|DBG|unix:/var/run/ovn/ovnsb_db.sock: RETRY -> SERVER_SCHEMA_REQUESTED at lib/ovsdb-cs.c:419
ovn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed ()

It seems that central-controller-0 is no longer the leader of the cluster, although it probably was at the time of the initial deployment, when it acted as the bootstrap node. The command tries to apply the setting on the leader (ovn-sbctl defaults to leader-only mode), but since it is pointed at the local unix socket of a follower, it only keeps reconnecting to the same non-leader server and eventually fails.

If I run the command on the leader node (in this case central-controller-1), it succeeds:
[root@central-controller-1 /]# ovs-appctl -t /var/lib/openvswitch/ovnsb_db.ctl cluster/status OVN_Southbound | grep Role
Role: leader

[root@central-controller-1 /]# ovn-sbctl -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6642
[root@central-controller-1 /]#

The leader node for OVN_Northbound is still central-controller-0, so the equivalent ovn-nbctl call does not fail in this case.
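
For reference, the same check works for both databases on any controller (a sketch assuming the Northbound control socket follows the same naming as the Southbound one above); on central-controller-0 in the state described this should show:

[root@central-controller-0 /]# ovs-appctl -t /var/lib/openvswitch/ovnnb_db.ctl cluster/status OVN_Northbound | grep Role
Role: leader
[root@central-controller-0 /]# ovs-appctl -t /var/lib/openvswitch/ovnsb_db.ctl cluster/status OVN_Southbound | grep Role
Role: follower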

Version-Release number of selected component (if applicable):
ansible-tripleo-ipsec-11.0.1-0.20210910011424.b5559c8.el9ost.noarch
ansible-tripleo-ipa-0.2.3-0.20220301190449.6b0ed82.el9ost.noarch
ansible-role-tripleo-modify-image-1.3.1-0.20220216001439.30d23d5.el9ost.noarch
tripleo-ansible-3.3.1-0.20220720020859.fa5422f.el9ost.noarch

How reproducible:
Sometimes

Steps to Reproduce:
1. Deploy an OSP 17.0 overcloud with RAFT enabled for OVN
2. Perform some overcloud actions (or otherwise move the OVN_Northbound/OVN_Southbound leader off the bootstrap node; see the sketch after these steps)
3. Redeploy the overcloud
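
One way to force the leadership move in step 2 (a sketch; the assumption is that restarting the leader's DB server container makes the remaining RAFT members elect a new leader):

[root@central-controller-0 /]# ovs-appctl -t /var/lib/openvswitch/ovnsb_db.ctl cluster/status OVN_Southbound | grep Role
Role: leader
[root@central-controller-0 /]# podman restart ovn_cluster_south_db_server

After the restart, central-controller-0 should come back as a follower, matching the failing state above.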

Actual results:
2022-07-25 08:19:26.429837 | 525400cc-c740-9f23-b52a-00000000c37b |      FATAL | Set connection | central-controller-0 | error={"changed": true, "cmd": "podman exec ovn_cluster_north_db_server bash -c \"ovn-nbctl -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6641\"\npodman exec ovn_cluster_south_db_server bash -c \"ovn-sbctl -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6642\"\n", "delta": "0:00:00.560672", "end": "2022-07-25 08:19:26.390425", "msg": "non-zero return code", "rc": 1, "start": "2022-07-25 08:19:25.829753", "stderr": "time=\"2022-07-25T08:19:25Z\" level=warning msg=\" binary not found, container dns will not be enabled\"\ntime=\"2022-07-25T08:19:26Z\" level=warning msg=\" binary not found, container dns will not be enabled\"\novn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed ()", "stderr_lines": ["time=\"2022-07-25T08:19:25Z\" level=warning msg=\" binary not found, container dns will not be enabled\"", "time=\"2022-07-25T08:19:26Z\" level=warning msg=\" binary not found, container dns will not be enabled\"", "ovn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed ()"], "stdout": "", "stdout_lines": []}

Expected results:
Successful redeploy

Additional info:

Comment 1 Terry Wilson 2022-07-25 17:23:54 UTC
I can see how if the leader changed this would fail. A simple solution would be to add --no-leader-only to the ovn-sbctl calls. With that said, depending on how we solve bz2101588, these calls (which are a workaround) may go away. Ultimately, they're kind of broken in that they are adding a connection object to the DB to listen on *all interfaces*. We'd need to check the iptables rules on the controllers to verify that those ports are blocked on non-ctlplane interfaces.
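
For illustration, the suggested workaround would amount to adding --no-leader-only to the two calls, e.g. for the southbound DB (a sketch based on the task's command above, not the merged fix):

podman exec ovn_cluster_south_db_server bash -c "ovn-sbctl --no-leader-only -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6642"

And a hedged check for the all-interfaces concern on a controller (rule layout varies by deployment):

iptables -L -n -v | grep -E '6641|6642'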

Comment 19 errata-xmlrpc 2022-09-21 12:24:19 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543

Comment 20 Jakub Libosvar 2023-01-11 20:27:20 UTC
*** Bug 2123168 has been marked as a duplicate of this bug. ***