Bug 2000088 - [update] 16.1 update to latest 16.1 failed after reboot as keystone cannot communicate with mariadb.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: tripleo-ansible
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z7
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Sofer Athlan-Guyot
QA Contact: Jason Grosso
URL:
Whiteboard:
Duplicates: 2005849
Depends On:
Blocks:
Reported: 2021-09-01 11:19 UTC by Sofer Athlan-Guyot
Modified: 2021-12-09 20:20 UTC (History)

Fixed In Version: tripleo-ansible-0.5.1-1.20210713143309.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-12-09 20:20:46 UTC
Target Upstream Version:
Embargoed:


Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker   OSP-8076        0        None      None    None     2021-11-18 11:34:44 UTC
Red Hat Issue Tracker   UPG-3277        0        None      None    None     2021-09-01 11:20:57 UTC
Red Hat Product Errata  RHBA-2021:3762  0        None      None    None     2021-12-09 20:20:59 UTC

Description Sofer Athlan-Guyot 2021-09-01 11:19:09 UTC
Description of problem:

Updating from 16.1/any-z to 16.1/RHOS-16.1-RHEL-8-20210824.n.2 fails either during tempest or during reboot.

The visible error is always the same:

 - WARNING oslo_db.sqlalchemy.engines [req-7da071eb-c6dd-406c-85ae-0fb658219ea0 - - - - -] SQL connection failed. -5653 attempts left.: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')

This message fills up the /var/log/containers/keystone/keystone.log file on ctl-X.

Keystone is unable to communicate with mysql.

This happens after everything has been "successfully" updated.

Version-Release number of selected component (if applicable):

The previous puddle, RHOS-16.1-RHEL-8-20210818.n.0, was working fine. The breakage starts with RHOS-16.1-RHEL-8-20210824.n.2.

How reproducible: The problem is consistent. It sometimes happens during reboot and sometimes during the last tempest test, depending on when the first call to the overcloud keystone happens in the update process.

Comment 3 Sofer Athlan-Guyot 2021-09-01 12:53:26 UTC
During a debug session with @dciabrini we got close to the source of the issue.


A Keystone request to the galera cluster follows this path:

 keystone -> haproxy -> mysql

haproxy uses the clustercheck podman container, checked on port 9200, to determine whether mysql is available:

listen mysql
  bind 172.17.1.149:3306 transparent
  option tcpka
  option httpchk
  option tcplog
  stick on dst
  stick-table type ip size 1000
  timeout client 90m
  timeout server 90m
  server controller-0.internalapi.redhat.local 172.17.1.87:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200
  server controller-1.internalapi.redhat.local 172.17.1.97:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200
  server controller-2.internalapi.redhat.local 172.17.1.46:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200

but it appears that nothing is listening on port 9200.
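
To make the split between the traffic port and the health-check port explicit, here is a minimal sketch that parses the `server` lines from the haproxy configuration above. The `Backend` type and the `parse_server_line` helper are hypothetical names used only for illustration; the sketch assumes the simple whitespace-separated layout shown in this comment and does not handle the full haproxy syntax.

```python
from typing import NamedTuple, Optional


class Backend(NamedTuple):
    """One haproxy backend as declared on a `server` line (illustrative type)."""
    name: str
    host: str
    port: int                  # port that receives the forwarded mysql traffic
    check_port: Optional[int]  # port used for health checks, None if not overridden


def parse_server_line(line: str) -> Backend:
    """Parse a line such as `server NAME HOST:PORT ... port CHECK_PORT`."""
    tokens = line.split()
    if len(tokens) < 3 or tokens[0] != "server":
        raise ValueError(f"not a haproxy server line: {line!r}")
    name, address = tokens[1], tokens[2]
    host, _, port = address.rpartition(":")
    options = tokens[3:]
    check_port = None
    # The `port` keyword on a server line overrides the port used by `check`.
    for i, tok in enumerate(options[:-1]):
        if tok == "port":
            check_port = int(options[i + 1])
    return Backend(name, host, int(port), check_port)


if __name__ == "__main__":
    line = ("server controller-0.internalapi.redhat.local 172.17.1.87:3306 "
            "backup check inter 1s on-marked-down shutdown-sessions port 9200")
    print(parse_server_line(line))
```

For the three controllers above this shows traffic going to port 3306 while the checks go to port 9200, so a dead listener on 9200 is enough for haproxy to mark a healthy galera node as down.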

Inside the clustercheck container, xinetd is used to trigger clustercheck, but it doesn't start correctly.

As nothing listens on port 9200, haproxy assumes that galera is down and doesn't forward the request to galera.
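
A quick way to confirm this from a controller is to probe port 9200 on each backend. The sketch below is only an approximation of what haproxy does: it attempts a plain TCP connect, whereas with `option httpchk` haproxy also sends an HTTP request and looks at the response. The `is_port_open` helper is a hypothetical name, and the addresses are the ones from the haproxy configuration quoted above.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable hosts, and timeouts.
        return False


if __name__ == "__main__":
    # Internal API addresses of the three controllers in this environment.
    for host in ("172.17.1.87", "172.17.1.97", "172.17.1.46"):
        state = "open" if is_port_open(host, 9200) else "closed"
        print(f"{host}:9200 is {state}")
```

If every backend reports the port as closed while mysql itself is reachable on 3306, the problem is on the clustercheck side rather than in galera.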

When running xinetd -d manually inside the container we get:

bad address given for: galera-monitor

Adding:

 flag = IPv4

to the /var/lib/config-data/clustercheck/etc/xinetd.d/galera-monitor configuration "solves" the issue and xinetd starts properly.

We still need to find what changed in the latest puddle to make xinetd fail by default.

Comment 4 Michele Baldessari 2021-09-01 13:11:51 UTC
This is the 16.1 version of https://bugzilla.redhat.com/show_bug.cgi?id=1971001

The root cause is: https://review.opendev.org/c/openstack/tripleo-ansible/+/788100

The fix is to backport https://review.opendev.org/c/openstack/tripleo-ansible/+/796044/ to 16.1

Comment 15 Luca Miccini 2021-11-30 07:51:31 UTC
*** Bug 2005849 has been marked as a duplicate of this bug. ***

Comment 21 errata-xmlrpc 2021-12-09 20:20:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.7 (Train) bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3762

