Bug 1398349

Summary: Galera cluster in failed status after failed update
Summary: Galera cluster in failed status after failed update
Product: Red Hat OpenStack
Component: rhosp-director
Version: 8.0 (Liberty)
Reporter: Ondrej <ochalups>
Assignee: Damien Ciabrini <dciabrin>
QA Contact: Omri Hochman <ohochman>
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: high
Target Milestone: async
Target Release: ---
Keywords: ZStream
Flags: dciabrin: needinfo? (ochalups)
Hardware: Unspecified
OS: Unspecified
CC: dbecker, fdinitto, mbayer, mburns, morazi, nsantos, ochalups, rhel-osp-director-maint, ushkalim
Last Closed: 2016-12-13 15:36:30 UTC
Type: Bug

Description Ondrej 2016-11-24 14:00:59 UTC
Description of problem:
After a failed stack-update, the Galera cluster also fails and cannot be bootstrapped: the primary node always fails the initial monitor action when the resource agent polls its status.
The database was upgraded, but /etc/sysconfig/clustercheck was not updated as it should have been to use the clustercheck user and its password instead of root.

A Pacemaker resource cleanup leaves the primary node in a failed state, with a "Failed initial monitor action" error message:
# pcs status
.. 
 Master/Slave Set: galera-master [galera]
      galera     (ocf::heartbeat:galera):        FAILED Master controller0 (unmanaged)
      Slaves: [ controller1 controller2 ]
Failed Actions:
 * galera_promote_0 on controller0 'unknown error' (1): call=1880, status=complete, exitreason='Failed initial monitor action',
     last-rc-change='Thu Nov 24 11:06:06 2016', queued=0ms, exec=6621ms

Pacemaker log excerpt (lrmd):
...
Nov 23 16:19:14 [19733] controller0.localdomain       lrmd:   notice: operation_finished:        galera_promote_0:22147:stderr [ ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO) ]
Nov 23 16:19:14 [19733] controller0.localdomain       lrmd:   notice: operation_finished:        galera_promote_0:22147:stderr [ ocf-exit-reason:Unable to retrieve wsrep_cluster_status, verify check_user 'root' has permissions to view status ]
Nov 23 16:19:14 [19733] controller0.localdomain       lrmd:   notice: operation_finished:        galera_promote_0:22147:stderr [ ocf-exit-reason:local node  is started, but not in primary mode. Unknown state. ]
Nov 23 16:19:14 [19733] controller0.localdomain       lrmd:   notice: operation_finished:        galera_promote_0:22147:stderr [ ocf-exit-reason:Failed initial monitor action ]

The database does have the clustercheck user:
MariaDB [(none)]> select host,user,password from mysql.user;
 +------------------------+--------------+-------------------------------------------+
 | host                   | user         | password                                  |
 +------------------------+--------------+-------------------------------------------+
...
 | localhost              | clustercheck | *password |

# /usr/bin/clustercheck
 HTTP/1.1 503 Service Unavailable
 Content-Type: text/plain
 Connection: close
 Content-Length: 36
 Galera cluster node is not synced.

# cat /etc/sysconfig/clustercheck
MYSQL_USERNAME=root
MYSQL_PASSWORD=''

The fix is to update /etc/sysconfig/clustercheck to use the clustercheck user and its password from undercloud:/home/stack/tripleo-overcloud-passwords, then clean up the galera-master resource:

[undercloud] # grep clustercheck /home/stack/tripleo-overcloud-passwords
OVERCLOUD_MYSQL_CLUSTERCHECK_PASSWORD=password
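Assuming the file paths shown above, the fix could be scripted roughly as follows (fix_clustercheck is a hypothetical helper name, not an existing tool):

```shell
# Sketch of the remediation, assuming the paths from this report.
# Arg 1: tripleo-overcloud-passwords file (copied from the undercloud)
# Arg 2: clustercheck sysconfig file to rewrite (on each controller)
fix_clustercheck() {
    passwords_file=$1
    conf=$2
    # Pull the password TripleO generated for the clustercheck DB user
    pass=$(grep '^OVERCLOUD_MYSQL_CLUSTERCHECK_PASSWORD=' "$passwords_file" | cut -d= -f2-)
    # Rewrite the config to use the dedicated user instead of root
    printf 'MYSQL_USERNAME=clustercheck\nMYSQL_PASSWORD=%s\n' "$pass" > "$conf"
}

# Once the file is corrected on each controller, clear the failed state
# so pacemaker can retry promotion:
#   fix_clustercheck /home/stack/tripleo-overcloud-passwords /etc/sysconfig/clustercheck
#   pcs resource cleanup galera-master
```

After the cleanup, /usr/bin/clustercheck should return "HTTP/1.1 200 OK" on a synced node instead of the 503 shown above.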
