Bug 1398349

Summary: Galera cluster in failed status after failed update
Summary: Galera cluster in failed status after failed update
Product: Red Hat OpenStack
Component: rhosp-director
Version: 8.0 (Liberty)
Reporter: Ondrej <ochalups>
Assignee: Damien Ciabrini <dciabrin>
QA Contact: Omri Hochman <ohochman>
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: high
Target Milestone: async
Target Release: ---
Keywords: ZStream
Flags: dciabrin: needinfo? (ochalups)
Hardware: Unspecified
OS: Unspecified
CC: dbecker, fdinitto, mbayer, mburns, morazi, nsantos, ochalups, rhel-osp-director-maint, ushkalim
Last Closed: 2016-12-13 15:36:30 UTC
Type: Bug

Description Ondrej 2016-11-24 14:00:59 UTC
Description of problem:
After a failed stack-update, the Galera cluster also fails and cannot be bootstrapped: the primary node always fails the initial monitor action when the resource agent polls its status.
The database was upgraded, but /etc/sysconfig/clustercheck was not updated as it should have been to use the clustercheck user and its password instead of root.

A Pacemaker resource cleanup leaves the primary node in a failed state, with a "Failed initial monitor action" error message:
# pcs status
.. 
 Master/Slave Set: galera-master [galera]
      galera     (ocf::heartbeat:galera):        FAILED Master controller0 (unmanaged)
      Slaves: [ controller1 controller2 ]
Failed Actions:
 * galera_promote_0 on controller0 'unknown error' (1): call=1880, status=complete, exitreason='Failed initial monitor action',
     last-rc-change='Thu Nov 24 11:06:06 2016', queued=0ms, exec=6621ms

Pacemaker log excerpt (lrmd):
...
Nov 23 16:19:14 [19733] controller0.localdomain       lrmd:   notice: operation_finished:        galera_promote_0:22147:stderr [ ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO) ]
Nov 23 16:19:14 [19733] controller0.localdomain       lrmd:   notice: operation_finished:        galera_promote_0:22147:stderr [ ocf-exit-reason:Unable to retrieve wsrep_cluster_status, verify check_user 'root' has permissions to view status ]
Nov 23 16:19:14 [19733] controller0.localdomain       lrmd:   notice: operation_finished:        galera_promote_0:22147:stderr [ ocf-exit-reason:local node  is started, but not in primary mode. Unknown state. ]
Nov 23 16:19:14 [19733] controller0.localdomain       lrmd:   notice: operation_finished:        galera_promote_0:22147:stderr [ ocf-exit-reason:Failed initial monitor action ]

The database does have the clustercheck user:
MariaDB [(none)]> select host,user,password from mysql.user;
 +------------------------+--------------+-------------------------------------------+
 | host                   | user         | password                                  |
 +------------------------+--------------+-------------------------------------------+
...
 | localhost              | clustercheck | *password |

# /usr/bin/clustercheck
 HTTP/1.1 503 Service Unavailable
 Content-Type: text/plain
 Connection: close
 Content-Length: 36
 Galera cluster node is not synced.

# cat /etc/sysconfig/clustercheck
MYSQL_USERNAME=root
MYSQL_PASSWORD=''

The fix is to update /etc/sysconfig/clustercheck to use the clustercheck user and its password from undercloud:/home/stack/tripleo-overcloud-passwords, then clean up the galera-master resource:

[undercloud] # grep clustercheck /home/stack/tripleo-overcloud-passwords
OVERCLOUD_MYSQL_CLUSTERCHECK_PASSWORD=password
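Assuming the file paths shown above, the fix could be scripted roughly as follows (fix_clustercheck is a hypothetical helper name, not an existing tool):

```shell
# Sketch of the remediation, assuming the paths from this report.
# Arg 1: tripleo-overcloud-passwords file (copied from the undercloud)
# Arg 2: clustercheck sysconfig file to rewrite (on each controller)
fix_clustercheck() {
    passwords_file=$1
    conf=$2
    # Pull the password TripleO generated for the clustercheck DB user
    pass=$(grep '^OVERCLOUD_MYSQL_CLUSTERCHECK_PASSWORD=' "$passwords_file" | cut -d= -f2-)
    # Rewrite the config to use the dedicated user instead of root
    printf 'MYSQL_USERNAME=clustercheck\nMYSQL_PASSWORD=%s\n' "$pass" > "$conf"
}

# Once the file is corrected on each controller, clear the failed state
# so pacemaker can retry promotion:
#   fix_clustercheck /home/stack/tripleo-overcloud-passwords /etc/sysconfig/clustercheck
#   pcs resource cleanup galera-master
```

After the cleanup, /usr/bin/clustercheck should return "HTTP/1.1 200 OK" on a synced node instead of the 503 shown above.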
