| Summary: | Galera cluster in failed status after failed update | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Ondrej <ochalups> |
| Component: | rhosp-director | Assignee: | Damien Ciabrini <dciabrin> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Omri Hochman <ohochman> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | ||
| Version: | 8.0 (Liberty) | CC: | dbecker, fdinitto, mbayer, mburns, morazi, nsantos, ochalups, rhel-osp-director-maint, ushkalim |
| Target Milestone: | async | Keywords: | ZStream |
| Target Release: | --- | Flags: | dciabrin: needinfo? (ochalups) |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-12-13 15:36:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description of problem:

After a failed stack update, the Galera cluster fails as well and cannot be bootstrapped: the primary node always fails the initial monitor action when the resource agent polls the database status. The database was upgraded, but /etc/sysconfig/clustercheck was not updated to use the clustercheck user and its corresponding password; it still uses root.

A Pacemaker resource cleanup leaves the primary node in a failed state, with a "Failed initial monitor action" error message:

    # pcs status
    ..
    Master/Slave Set: galera-master [galera]
        galera (ocf::heartbeat:galera): FAILED Master controller0 (unmanaged)
        Slaves: [ controller1 controller2 ]

    Failed Actions:
    * galera_promote_0 on controller0 'unknown error' (1): call=1880, status=complete, exitreason='Failed initial monitor action',
        last-rc-change='Thu Nov 24 11:06:06 2016', queued=0ms, exec=6621ms

/var/log/mysqld.log

    ...
    Nov 23 16:19:14 [19733] controller0.localdomain lrmd: notice: operation_finished: galera_promote_0:22147:stderr [ ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO) ]
    Nov 23 16:19:14 [19733] controller0.localdomain lrmd: notice: operation_finished: galera_promote_0:22147:stderr [ ocf-exit-reason:Unable to retrieve wsrep_cluster_status, verify check_user 'root' has permissions to view status ]
    Nov 23 16:19:14 [19733] controller0.localdomain lrmd: notice: operation_finished: galera_promote_0:22147:stderr [ ocf-exit-reason:local node is started, but not in primary mode. Unknown state. ]
    Nov 23 16:19:14 [19733] controller0.localdomain lrmd: notice: operation_finished: galera_promote_0:22147:stderr [ ocf-exit-reason:Failed initial monitor action ]

The database uses the clustercheck user:

    MariaDB [(none)]> select host,user,password from mysql.user;
    +------------------------+--------------+-------------------------------------------+
    | host                   | user         | password                                  |
    +------------------------+--------------+-------------------------------------------+
    ...
    | localhost              | clustercheck | *password                                 |

    # /usr/bin/clustercheck
    HTTP/1.1 503 Service Unavailable
    Content-Type: text/plain
    Connection: close
    Content-Length: 36

    Galera cluster node is not synced.

    # cat /etc/sysconfig/clustercheck
    MYSQL_USERNAME=root
    MYSQL_PASSWORD=''

Version-Release number of selected component (if applicable):

The fix is to update /etc/sysconfig/clustercheck to use the clustercheck user and its password from undercloud:/home/stack/tripleo-overcloud-passwords, and then clean up the galera-master resource:

    [undercloud] # grep clustercheck /home/stack/tripleo-overcloud-passwords
    OVERCLOUD_MYSQL_CLUSTERCHECK_PASSWORD=password

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
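
The fix described in this report (pointing /etc/sysconfig/clustercheck at the clustercheck user, then cleaning up the resource) could be scripted roughly as follows. This is only a sketch: the function name `update_clustercheck_config` is hypothetical, and it assumes the file contains shell-style `NAME=value` assignments, as the `cat` output in this report suggests, and that the password contains no single quote.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with any package: rewrite the credential
# lines of a clustercheck sysconfig file, leaving all other lines untouched.
# Assumes shell-style NAME=value assignments and a password without a
# single quote in it.
update_clustercheck_config() {
    cc_conf="$1"; cc_user="$2"; cc_pass="$3"
    [ -f "$cc_conf" ] || { echo "no such file: $cc_conf" >&2; return 1; }
    cc_tmp="$(mktemp)" || return 1
    # Keep every line except the two credential settings. grep exits 1 when
    # nothing is left to print, which is fine here.
    grep -v -E '^MYSQL_(USERNAME|PASSWORD)=' "$cc_conf" > "$cc_tmp" || true
    # Append the new credentials.
    printf 'MYSQL_USERNAME=%s\n' "$cc_user" >> "$cc_tmp"
    printf "MYSQL_PASSWORD='%s'\n" "$cc_pass" >> "$cc_tmp"
    # Overwrite in place so the owner and mode of the original file survive.
    cat "$cc_tmp" > "$cc_conf"
    rm -f "$cc_tmp"
}

# Possible usage on the affected controller (as root), with PASSWORD set to
# the OVERCLOUD_MYSQL_CLUSTERCHECK_PASSWORD value found on the undercloud:
#   update_clustercheck_config /etc/sysconfig/clustercheck clustercheck "$PASSWORD"
#   pcs resource cleanup galera-master
```

Taking the file path as a parameter makes it possible to try the function on a copy of the file first; the `pcs resource cleanup galera-master` step is left as a manual command so that it runs only after the new credentials have been checked, for example by re-running /usr/bin/clustercheck.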