Description of problem:
The redeploy/update of an overcloud stack (in order to update some overcloud settings) sometimes fails on the "Set connection" task with the following error:

2022-07-25 08:19:26.429837 | 525400cc-c740-9f23-b52a-00000000c37b | FATAL | Set connection | central-controller-0 | error={"changed": true, "cmd": "podman exec ovn_cluster_north_db_server bash -c \"ovn-nbctl -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6641\"\npodman exec ovn_cluster_south_db_server bash -c \"ovn-sbctl -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6642\"\n", "delta": "0:00:00.560672", "end": "2022-07-25 08:19:26.390425", "msg": "non-zero return code", "rc": 1, "start": "2022-07-25 08:19:25.829753", "stderr": "time=\"2022-07-25T08:19:25Z\" level=warning msg=\" binary not found, container dns will not be enabled\"\ntime=\"2022-07-25T08:19:26Z\" level=warning msg=\" binary not found, container dns will not be enabled\"\novn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed ()", "stderr_lines": ["time=\"2022-07-25T08:19:25Z\" level=warning msg=\" binary not found, container dns will not be enabled\"", "time=\"2022-07-25T08:19:26Z\" level=warning msg=\" binary not found, container dns will not be enabled\"", "ovn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed ()"], "stdout": "", "stdout_lines": []}

The task tries to set the connection parameter and is executed on the "bootstrap" node, which is central-controller-0 in this case.
If I run the command manually on the node, I get the following:

[root@central-controller-0 /]# ovn-sbctl -v -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6642
2022-07-25T15:01:27Z|00001|reconnect|DBG|unix:/var/run/ovn/ovnsb_db.sock: entering BACKOFF
2022-07-25T15:01:27Z|00002|ovn_dbctl|INFO|Called as ovn-sbctl -v -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6642
2022-07-25T15:01:27Z|00003|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting...
2022-07-25T15:01:27Z|00004|reconnect|DBG|unix:/var/run/ovn/ovnsb_db.sock: entering CONNECTING
2022-07-25T15:01:27Z|00005|ovsdb_cs|DBG|unix:/var/run/ovn/ovnsb_db.sock: SERVER_SCHEMA_REQUESTED -> SERVER_SCHEMA_REQUESTED at lib/ovsdb-cs.c:419
2022-07-25T15:01:27Z|00006|poll_loop|DBG|wakeup due to [POLLOUT] on fd 4 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:153
2022-07-25T15:01:27Z|00007|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connected
2022-07-25T15:01:27Z|00008|reconnect|DBG|unix:/var/run/ovn/ovnsb_db.sock: entering ACTIVE
2022-07-25T15:01:27Z|00009|jsonrpc|DBG|unix:/var/run/ovn/ovnsb_db.sock: send request, method="get_schema", params=["_Server"], id=1
2022-07-25T15:01:27Z|00010|ovsdb_cs|DBG|unix:/var/run/ovn/ovnsb_db.sock: SERVER_SCHEMA_REQUESTED -> SERVER_SCHEMA_REQUESTED at lib/ovsdb-cs.c:419
2022-07-25T15:01:27Z|00011|poll_loop|DBG|wakeup due to [POLLIN] on fd 4 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:157
2022-07-25T15:01:27Z|00012|jsonrpc|DBG|unix:/var/run/ovn/ovnsb_db.sock: received reply, result=
...stripped some output..
2022-07-25T15:01:27Z|00017|ovsdb_cs|INFO|unix:/var/run/ovn/ovnsb_db.sock: clustered database server is not cluster leader; trying another server
2022-07-25T15:01:27Z|00018|reconnect|DBG|unix:/var/run/ovn/ovnsb_db.sock: entering RECONNECT
2022-07-25T15:01:27Z|00019|ovsdb_cs|DBG|unix:/var/run/ovn/ovnsb_db.sock: SERVER_MONITOR_REQUESTED -> RETRY at lib/ovsdb-cs.c:2011
2022-07-25T15:01:27Z|00020|poll_loop|DBG|wakeup due to 0-ms timeout at lib/reconnect.c:677
2022-07-25T15:01:27Z|00021|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connection attempt timed out
2022-07-25T15:01:27Z|00022|reconnect|DBG|unix:/var/run/ovn/ovnsb_db.sock: entering BACKOFF
2022-07-25T15:01:27Z|00023|ovsdb_cs|DBG|unix:/var/run/ovn/ovnsb_db.sock: RETRY -> SERVER_SCHEMA_REQUESTED at lib/ovsdb-cs.c:419
ovn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed ()

It seems that the central-controller-0 node is no longer the leader of the cluster, although it probably was at the time of the initial deployment, when it was the bootstrap node. The command tries to apply the settings on the leader node but fails to connect to it.

If I run the command on the leader node (in this case central-controller-1), the command succeeds:

[root@central-controller-1 /]# ovs-appctl -t /var/lib/openvswitch/ovnsb_db.ctl cluster/status OVN_Southbound | grep Role
Role: leader
[root@central-controller-1 /]# ovn-sbctl -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6642
[root@central-controller-1 /]#

The leader node for OVN_Northbound is still central-controller-0, so it does not fail there in this case.
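The manual leader check above can be wrapped in a small helper when several controllers need to be inspected. This is a minimal sketch, assuming the `Role:` line format shown in the cluster/status output in this report; the function name `raft_role` is hypothetical, and the sample input in the last line is made up apart from the `Role: leader` line.

```shell
#!/usr/bin/env bash
# Sketch only: extract the Raft role from `cluster/status` output read on stdin.
# raft_role is a hypothetical helper name, not part of OVN or OVS.
raft_role() {
    # Split each line on ": " and print the value of the first "Role" field.
    awk -F': *' '$1 == "Role" { print $2; exit }'
}

# Intended use on a controller (path and database name taken from the report):
#   ovs-appctl -t /var/lib/openvswitch/ovnsb_db.ctl cluster/status OVN_Southbound | raft_role

# Offline demonstration with sample text instead of a live database.
printf 'Name: OVN_Southbound\nRole: leader\n' | raft_role
```

Running the commented command on each controller in turn shows which node currently holds the leader role for the Southbound database.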
Version-Release number of selected component (if applicable):
ansible-tripleo-ipsec-11.0.1-0.20210910011424.b5559c8.el9ost.noarch
ansible-tripleo-ipa-0.2.3-0.20220301190449.6b0ed82.el9ost.noarch
ansible-role-tripleo-modify-image-1.3.1-0.20220216001439.30d23d5.el9ost.noarch
tripleo-ansible-3.3.1-0.20220720020859.fa5422f.el9ost.noarch

How reproducible:
Sometimes

Steps to Reproduce:
1. Deploy an overcloud of OSP17.0 with RAFT enabled for OVN
2. Perform some overcloud actions (or eventually move the leader node of OVN_North/Southbound off the bootstrap node)
3. Redeploy the overcloud

Actual results:
2022-07-25 08:19:26.429837 | 525400cc-c740-9f23-b52a-00000000c37b | FATAL | Set connection | central-controller-0 | error={"changed": true, "cmd": "podman exec ovn_cluster_north_db_server bash -c \"ovn-nbctl -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6641\"\npodman exec ovn_cluster_south_db_server bash -c \"ovn-sbctl -p /etc/pki/tls/private/ovn_dbs.key -c /etc/pki/tls/certs/ovn_dbs.crt -C /etc/ipa/ca.crt set-connection pssl:6642\"\n", "delta": "0:00:00.560672", "end": "2022-07-25 08:19:26.390425", "msg": "non-zero return code", "rc": 1, "start": "2022-07-25 08:19:25.829753", "stderr": "time=\"2022-07-25T08:19:25Z\" level=warning msg=\" binary not found, container dns will not be enabled\"\ntime=\"2022-07-25T08:19:26Z\" level=warning msg=\" binary not found, container dns will not be enabled\"\novn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed ()", "stderr_lines": ["time=\"2022-07-25T08:19:25Z\" level=warning msg=\" binary not found, container dns will not be enabled\"", "time=\"2022-07-25T08:19:26Z\" level=warning msg=\" binary not found, container dns will not be enabled\"", "ovn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed ()"], "stdout": "", "stdout_lines": []}

Expected results:
Successful redeploy

Additional info:
I can see how this would fail if the leader changed. A simple solution would be to add --no-leader-only to the ovn-sbctl calls. That said, depending on how we solve bz2101588, these calls (which are a workaround) may go away. Ultimately, they're kind of broken in that they add a connection object to the DB that listens on *all interfaces*. We'd need to check the iptables rules on the controllers to verify that those ports are blocked on non-ctlplane interfaces.
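To make the suggestion concrete, here is a sketch of what the two set-connection calls could look like with --no-leader-only added. It only builds and prints the command lines; the key, certificate, and CA paths and the ports are the ones quoted in the failing task, while the helper name `set_connection_cmd` is hypothetical. It is an illustration of the idea, not necessarily the change that was shipped.

```shell
#!/usr/bin/env bash
# Sketch only: build the set-connection command lines with --no-leader-only,
# so the client does not insist on talking to the current Raft leader.
# Paths and ports are copied from the failing "Set connection" task.
KEY=/etc/pki/tls/private/ovn_dbs.key
CERT=/etc/pki/tls/certs/ovn_dbs.crt
CA=/etc/ipa/ca.crt

# set_connection_cmd is a hypothetical helper name.
# $1: client binary (ovn-nbctl or ovn-sbctl), $2: listening port.
set_connection_cmd() {
    local ctl="$1" port="$2"
    printf '%s --no-leader-only -p %s -c %s -C %s set-connection pssl:%s\n' \
        "$ctl" "$KEY" "$CERT" "$CA" "$port"
}

# Print the Northbound and Southbound variants.
set_connection_cmd ovn-nbctl 6641
set_connection_cmd ovn-sbctl 6642
```

Each printed line could then be run inside the matching container with `podman exec ... bash -c "..."`, the same way the original task runs the commands.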
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2022:6543
*** Bug 2123168 has been marked as a duplicate of this bug. ***