Description of problem:
Updating from 16.1/any-z to 16.1/RHOS-16.1-RHEL-8-20210824.n.2 fails either during tempest or during the reboot. The visible error is always the same:

WARNING oslo_db.sqlalchemy.engines [req-7da071eb-c6dd-406c-85ae-0fb658219ea0 - - - - -] SQL connection failed. -5653 attempts left.: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')

filling up the ctl-X/var/log/containers/keystone/keystone.log file. Keystone is unable to communicate with mysql. This happens after everything has been "successfully" updated.

Version-Release number of selected component (if applicable):
The previous puddle, RHOS-16.1-RHEL-8-20210818.n.0, was working fine; things start to break with RHOS-16.1-RHEL-8-20210824.n.2.

How reproducible:
The problem is consistent, sometimes happening during reboot, sometimes during the last tempest test; it all depends on when the first call to the overcloud keystone happens during the update process.
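As a quick way to confirm the symptom on an affected controller, grepping the keystone log for the connection errors shows them piling up (a minimal sketch; the log path and message come from this report, the exact formatting of the message may vary between releases):

    # on the affected controller (ctl-X above), count the connection
    # errors accumulating in keystone.log
    $ sudo grep -c 'Lost connection to MySQL server during query' \
        /var/log/containers/keystone/keystone.log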
During a debug session with @dciabrini we got close to the source of the issue. A Keystone request to the galera cluster follows this path: keystone -> haproxy -> mysql. haproxy relies on the clustercheck podman process to determine whether mysql is available, by checking port 9200:

listen mysql
  bind 172.17.1.149:3306 transparent
  option tcpka
  option httpchk
  option tcplog
  stick on dst
  stick-table type ip size 1000
  timeout client 90m
  timeout server 90m
  server controller-0.internalapi.redhat.local 172.17.1.87:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200
  server controller-1.internalapi.redhat.local 172.17.1.97:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200
  server controller-2.internalapi.redhat.local 172.17.1.46:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200

but it appears that nothing is listening on port 9200. Inside the clustercheck container, xinetd is used to trigger clustercheck, but it doesn't start correctly. As nothing listens on 9200, haproxy assumes that galera is down and doesn't forward the request to galera.

When running xinetd -d manually inside the container we get:

bad address given for: galera-monitor

Adding:

flags = IPv4

to the /var/lib/config-data/clustercheck/etc/xinetd.d/galera-monitor configuration "solves" the issue and xinetd starts properly. We still need to find what changed in the latest puddle to make xinetd fail by default.
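For reference, a minimal sketch of the workaround and how to verify it (illustrative only: the service name, port, file path and backend IP come from this report; the clustercheck server path is an assumption and the remaining attributes of the stanza are elided):

    # /var/lib/config-data/clustercheck/etc/xinetd.d/galera-monitor
    service galera-monitor
    {
            port        = 9200
            flags       = IPv4                   # workaround: force the IPv4 address family
            server      = /usr/bin/clustercheck  # assumed location of the check script
            ...                                  # remaining attributes unchanged
    }

    # once xinetd starts, the haproxy health-check target answers again
    # (IP taken from the controller-0 backend in the haproxy config above):
    $ curl -v http://172.17.1.87:9200

With the check answering on 9200 again, haproxy marks the galera backends as up and keystone's connections to mysql go through.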
This is the 16.1 version of https://bugzilla.redhat.com/show_bug.cgi?id=1971001

The root cause of the breakage is: https://review.opendev.org/c/openstack/tripleo-ansible/+/788100

The fix is to backport https://review.opendev.org/c/openstack/tripleo-ansible/+/796044/ to 16.1.
*** Bug 2005849 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.7 (Train) bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3762