Bug 1398349 - Galera cluster in failed status after failed update [NEEDINFO]
Summary: Galera cluster in failed status after failed update
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: async
Target Release: ---
Assignee: Damien Ciabrini
QA Contact: Omri Hochman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-11-24 14:00 UTC by Ondrej
Modified: 2019-12-16 07:27 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-12-13 15:36:30 UTC
Target Upstream Version:
dciabrin: needinfo? (ochalups)


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2779221 0 None None None 2016-11-24 14:01:21 UTC

Description Ondrej 2016-11-24 14:00:59 UTC
Description of problem:
After a failed stack-update, the Galera cluster fails as well, and it is not possible to bootstrap it: the primary node always fails the initial monitor action when the resource agent tries to poll the status.
The database was upgraded, but /etc/sysconfig/clustercheck was not updated as it should have been to use the clustercheck user and its corresponding password instead of root.

A Pacemaker resource cleanup leaves the primary node in a failed state, with a "Failed initial monitor action" error message:
# pcs status
.. 
 Master/Slave Set: galera-master [galera]
      galera     (ocf::heartbeat:galera):        FAILED Master controller0 (unmanaged)
      Slaves: [ controller1 controller2 ]
Failed Actions:
 * galera_promote_0 on controller0 'unknown error' (1): call=1880, status=complete, exitreason='Failed initial monitor action',
     last-rc-change='Thu Nov 24 11:06:06 2016', queued=0ms, exec=6621ms

/var/log/mysqld.log
...
Nov 23 16:19:14 [19733] controller0.localdomain       lrmd:   notice: operation_finished:        galera_promote_0:22147:stderr [ ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO) ]
Nov 23 16:19:14 [19733] controller0.localdomain       lrmd:   notice: operation_finished:        galera_promote_0:22147:stderr [ ocf-exit-reason:Unable to retrieve wsrep_cluster_status, verify check_user 'root' has permissions to view status ]
Nov 23 16:19:14 [19733] controller0.localdomain       lrmd:   notice: operation_finished:        galera_promote_0:22147:stderr [ ocf-exit-reason:local node  is started, but not in primary mode. Unknown state. ]
Nov 23 16:19:14 [19733] controller0.localdomain       lrmd:   notice: operation_finished:        galera_promote_0:22147:stderr [ ocf-exit-reason:Failed initial monitor action ]

The database does have the clustercheck user:
MariaDB [(none)]> select host,user,password from mysql.user;
 +------------------------+--------------+-------------------------------------------+
 | host                   | user         | password                                  |
 +------------------------+--------------+-------------------------------------------+
...
 | localhost              | clustercheck | *password |

# /usr/bin/clustercheck
 HTTP/1.1 503 Service Unavailable
 Content-Type: text/plain
 Connection: close
 Content-Length: 36
 Galera cluster node is not synced.

# cat /etc/sysconfig/clustercheck
MYSQL_USERNAME=root
MYSQL_PASSWORD=''

Version-Release number of selected component (if applicable):

The fix is to update /etc/sysconfig/clustercheck to use the clustercheck user and its password from undercloud:/home/stack/tripleo-overcloud-passwords, and then clean up the galera-master resource:

[undercloud] # grep clustercheck /home/stack/tripleo-overcloud-passwords
OVERCLOUD_MYSQL_CLUSTERCHECK_PASSWORD=password
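The fix above can be sketched as follows. This is a minimal sketch, not from the report itself: the value "password" is the placeholder shown in this report, so substitute the real OVERCLOUD_MYSQL_CLUSTERCHECK_PASSWORD from the undercloud.

```shell
# Desired contents of /etc/sysconfig/clustercheck on each controller
# ("password" is the placeholder value from this report):
CLUSTERCHECK_CONF='MYSQL_USERNAME=clustercheck
MYSQL_PASSWORD=password'

printf '%s\n' "$CLUSTERCHECK_CONF"

# As root on each controller, install the file and clear the failed state
# so Pacemaker retries the promote:
#   printf '%s\n' "$CLUSTERCHECK_CONF" > /etc/sysconfig/clustercheck
#   pcs resource cleanup galera-master
```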

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

