Description of problem:
Updating from 16.1/any-z to 16.1/RHOS-16.1-RHEL-8-20210824.n.2 fails either during tempest or during the reboot. The visible error is always the same:

WARNING oslo_db.sqlalchemy.engines [req-7da071eb-c6dd-406c-85ae-0fb658219ea0 - - - - -] SQL connection failed. -5653 attempts left.: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')

filling up the ctl-X/var/log/containers/keystone/keystone.log file. Keystone is unable to communicate with mysql. This happens after everything has been "successfully" updated.

Version-Release number of selected component (if applicable):
The previous puddle, RHOS-16.1-RHEL-8-20210818.n.0, was working fine; things start to break with RHOS-16.1-RHEL-8-20210824.n.2.

How reproducible:
The problem is consistent, sometimes happening during reboot, sometimes during the last tempest test; it all depends on when the first call to the overcloud keystone happens during the update process.
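As a quick way to confirm the symptom on an affected controller, grepping the keystone log for the connection errors shows them piling up (a minimal sketch; the log path and message come from this report, the exact formatting of the message may vary between releases):

    # on the affected controller (ctl-X above), count the connection
    # errors accumulating in keystone.log
    $ sudo grep -c 'Lost connection to MySQL server during query' \
        /var/log/containers/keystone/keystone.log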
During a debug session with @dciabrini we got close to the source of the issue. A Keystone request to the galera cluster follows this path: keystone -> haproxy -> mysql. haproxy relies on the clustercheck podman process to determine whether mysql is available, by checking port 9200:

listen mysql
  bind 172.17.1.149:3306 transparent
  option tcpka
  option httpchk
  option tcplog
  stick on dst
  stick-table type ip size 1000
  timeout client 90m
  timeout server 90m
  server controller-0.internalapi.redhat.local 172.17.1.87:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200
  server controller-1.internalapi.redhat.local 172.17.1.97:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200
  server controller-2.internalapi.redhat.local 172.17.1.46:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200

but it appears that nothing is listening on port 9200. Inside the clustercheck container, xinetd is used to trigger clustercheck, but it doesn't start correctly. As nothing listens on 9200, haproxy assumes that galera is down and doesn't forward the request to galera.

When running xinetd -d manually inside the container we get:

bad address given for: galera-monitor

Adding:

flags = IPv4

to the /var/lib/config-data/clustercheck/etc/xinetd.d/galera-monitor configuration "solves" the issue and xinetd starts properly. We still need to find what changed in the latest puddle to make xinetd fail by default.
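For reference, a minimal sketch of the workaround and how to verify it (illustrative only: the service name, port, file path and backend IP come from this report; the clustercheck server path is an assumption and the remaining attributes of the stanza are elided):

    # /var/lib/config-data/clustercheck/etc/xinetd.d/galera-monitor
    service galera-monitor
    {
            port        = 9200
            flags       = IPv4                   # workaround: force the IPv4 address family
            server      = /usr/bin/clustercheck  # assumed location of the check script
            ...                                  # remaining attributes unchanged
    }

    # once xinetd starts, the haproxy health-check target answers again
    # (IP taken from the controller-0 backend in the haproxy config above):
    $ curl -v http://172.17.1.87:9200

With the check answering on 9200 again, haproxy marks the galera backends as up and keystone's connections to mysql go through.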
This is the 16.1 version of https://bugzilla.redhat.com/show_bug.cgi?id=1971001

The root cause of the breakage is: https://review.opendev.org/c/openstack/tripleo-ansible/+/788100

The fix is to backport https://review.opendev.org/c/openstack/tripleo-ansible/+/796044/ to 16.1.
*** Bug 2005849 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.7 (Train) bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3762