Bug 2075472

Summary:	FFU can get stuck when the mariadb upgrade fails to execute
Product:	Red Hat OpenStack	Reporter:	Damien Ciabrini <dciabrin>
Component:	openstack-tripleo-heat-templates	Assignee:	mbollo
Status:	CLOSED ERRATA	QA Contact:	Khomesh Thakre <kthakre>
Severity:	urgent	Docs Contact:
Priority:	urgent
Version:	16.2 (Train)	CC:	arcsingh, jpretori, lbezdick, mbollo, mburns, sgolovat, sukar
Target Milestone:	z4	Keywords:	Triaged
Target Release:	16.2 (Train on RHEL 8.4)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	openstack-tripleo-heat-templates-11.6.1-2.20220622004851.40471e6.el8ost	Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-12-07 19:22:17 UTC	Type:	Bug

Description Damien Ciabrini 2022-04-14 10:22:15 UTC

Description of problem:

When performing a FFU to Train, the mariadb version changes from 10.1 to 10.3, so in addition to upgrading the container image to mariadb 10.3, the various mariadb tables must be upgraded on disk.

The mariadb upgrade steps are executed during the upgrade tasks, and they are run in a transient container [1]. The expectation when the upgrade is run for the first time is that 1) no mariadb server is running, and that 2) mariadb has been stopped cleanly by the previous upgrade tasks.

The clean stop is currently very important because otherwise mariadb may leave a spurious transaction in the transaction coordinator log (tc.log). Such a transaction is never dropped automatically by the upgrade tasks, to ensure that we never drop data silently. The drawback is that manual intervention is required to restart mariadb when the stop was unclean; otherwise the FFU will fail because the mariadb upgrade cannot be performed.
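To make the failure condition above concrete, here is a minimal sketch of how one might check a data directory for a leftover tc.log before attempting the upgrade. It rests on assumptions that are not stated in this report: that the presence of a tc.log file after the server has stopped indicates an unclean stop, and that the data directory is /var/lib/mysql. The helper name has_leftover_tc_log is hypothetical and is not part of tripleo-heat-templates.

```python
from pathlib import Path


def has_leftover_tc_log(datadir: str) -> bool:
    """Return True if a tc.log file is present in the given data directory.

    Assumption: a tc.log that is still present once the server is stopped
    hints at an unclean stop. This helper only reports the file's presence;
    it never removes or modifies it.
    """
    return (Path(datadir) / "tc.log").is_file()


if __name__ == "__main__":
    # /var/lib/mysql is an assumed location, adjust for the actual deployment.
    print(has_leftover_tc_log("/var/lib/mysql"))
```

The check is deliberately read-only, in line with the cautious approach described above: deciding what to do with a pending transaction is left to the operator or to the galera resource agent.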

This cautious approach can cause trouble when the FFU workflow is run multiple times (e.g. in an attempt to fix unrelated errors): re-running the FFU from an unclean state might cause the mariadb server to be stopped unexpectedly, leave a transaction behind in tc.log, and cause subsequent FFU runs to fail.

However, once the database upgrade has been performed, it is not necessary to re-run it, and any tc.log left after the upgrade is automatically rolled back by the galera resource agent (when it is safe to do so). So the FFU shouldn't break once the upgrade has been performed successfully at least once.
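The idea of skipping an upgrade that has already succeeded can be sketched as follows. This is an illustration only, not the change shipped in the erratum. It assumes that mysql_upgrade has recorded the server version in a mysql_upgrade_info file in the data directory, and that comparing the major.minor part of that version against 10.3 is a sufficient test. The function names recorded_version and upgrade_needed are hypothetical.

```python
import re
from pathlib import Path
from typing import Optional, Tuple


def recorded_version(datadir: str) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from mysql_upgrade_info, or None if unavailable."""
    info = Path(datadir) / "mysql_upgrade_info"
    if not info.is_file():
        return None
    # The file is assumed to start with a version string such as "10.3.32-MariaDB".
    match = re.match(r"(\d+)\.(\d+)", info.read_text(errors="replace").strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def upgrade_needed(datadir: str, target: Tuple[int, int] = (10, 3)) -> bool:
    """Return True unless the data directory already records the target version.

    A missing or unreadable marker is treated as "upgrade needed", so the
    sketch never skips an upgrade that may not have happened.
    """
    version = recorded_version(datadir)
    return version is None or version < target
```

With a check of this kind, a re-run of the upgrade tasks on an already upgraded node would not start the transient upgrade container, so a tc.log left by an unclean stop would no longer block the FFU.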

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/database/mysql-pacemaker-puppet.yaml#L585

Version-Release number of selected component (if applicable):

How reproducible:

When the mariadb container has been stopped improperly

Steps to Reproduce:
1. Perform a FFU of OSP13 to OSP16, or deploy a stock OSP16 env
2. Initiate a pcs cluster stop on a controller node and kill the galera container at the same time
3. Re-run the upgrade-tasks on that node

Actual results:
The upgrade task may fail due to a stuck transaction in the redo log

Expected results:
The mysql upgrade task shouldn't be re-run, as the database is already running the latest mysql version

Additional info:

Comment 4 Luca Miccini 2022-06-21 09:22:08 UTC

*** Bug 2096202 has been marked as a duplicate of this bug. ***

Comment 16 errata-xmlrpc 2022-12-07 19:22:17 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.4), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8794