Created attachment 1196048 [details]
sosreport-controller-3

Description of problem:
I'm following the controller replacement docs[1] and the procedure fails during step 14 while trying to start galera on the replaced controller node:

 Master/Slave Set: galera-master [galera]
     galera (ocf::heartbeat:galera): FAILED Master overcloud-controller-3 (unmanaged)

* galera_promote_0 on overcloud-controller-3 'unknown error' (1): call=223, status=complete, exitreason='Failed initial monitor action', last-rc-change='Tue Aug 30 14:21:31 2016', queued=0ms, exec=10828ms

in /var/log/messages:

Aug 30 10:21:41 localhost galera(galera)[1802]: ERROR: Unable to retrieve wsrep_cluster_status, verify check_user '' has permissions to view status
Aug 30 10:21:41 localhost galera(galera)[1802]: ERROR: local node <overcloud-controller-3> is started, but not in primary mode. Unknown state.
Aug 30 10:21:41 localhost galera(galera)[1802]: ERROR: Failed initial monitor action
Aug 30 10:21:41 localhost lrmd[30559]: notice: galera_promote_0:1802:stderr [ ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO) ]
Aug 30 10:21:41 localhost lrmd[30559]: notice: galera_promote_0:1802:stderr [ ocf-exit-reason:Unable to retrieve wsrep_cluster_status, verify check_user '' has permissions to view status ]
Aug 30 10:21:41 localhost lrmd[30559]: notice: galera_promote_0:1802:stderr [ ocf-exit-reason:local node <overcloud-controller-3> is started, but not in primary mode. Unknown state. ]
Aug 30 10:21:41 localhost lrmd[30559]: notice: galera_promote_0:1802:stderr [ ocf-exit-reason:Failed initial monitor action ]
Aug 30 10:21:41 localhost crmd[30562]: notice: Operation galera_promote_0: unknown error (node=overcloud-controller-3, call=223, rc=1, cib-update=77, confirmed=true)
Aug 30 10:21:41 localhost crmd[30562]: notice: overcloud-controller-3-galera_promote_0:223 [ ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO)\nocf-exit-reason:Unable to retrieve wsrep_cluster_status, verify check_user '' has permissions to view status\nocf-exit-reason:local node <overcloud-controller-3> is started, but not in primary mode. Unknown state.\nocf-exit-reason:Failed initial monitor action\n ]

[1] https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/paged/director-installation-and-usage/94-replacing-controller-nodes

Version-Release number of selected component (if applicable):
resource-agents-3.9.5-54.el7_2.16.x86_64
openstack-tripleo-heat-templates-0.8.14-18.el7ost.noarch

How reproducible:
2/2

Steps to Reproduce:
1. Follow the docs to replace overcloud-controller-1

Actual results:
Procedure fails while waiting for Galera to start on all nodes (step 14)

Expected results:
Galera gets started on all nodes.

Additional info:
Attaching the sosreport. Please let me know if a reproducing system is needed for investigation.
The same error happens with OSP9. I see that we introduced password authentication for mysql and there's no /root/.my.cnf file on the replaced controller. Even if I add it and run 'pcs resource cleanup galera overcloud-controller-3', I end up with the same error:

* galera_promote_0 on overcloud-controller-3 'unknown error' (1): call=565, status=complete, exitreason='Failed initial monitor action', last-rc-change='Wed Aug 31 06:40:54 2016', queued=0ms, exec=8358ms

Aug 31 06:41:02 overcloud-controller-3 galera(galera)[28927]: ERROR: Unable to retrieve wsrep_cluster_status, verify check_user '' has permissions to view status
Aug 31 06:41:02 overcloud-controller-3 galera(galera)[28927]: ERROR: local node <overcloud-controller-3> is started, but not in primary mode. Unknown state.
Aug 31 06:41:02 overcloud-controller-3 galera(galera)[28927]: ERROR: Failed initial monitor action
Aug 31 06:41:02 overcloud-controller-3 lrmd[3410]: notice: galera_promote_0:28927:stderr [ ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO) ]
Aug 31 06:41:02 overcloud-controller-3 lrmd[3410]: notice: galera_promote_0:28927:stderr [ ocf-exit-reason:Unable to retrieve wsrep_cluster_status, verify check_user '' has permissions to view status ]
Aug 31 06:41:02 overcloud-controller-3 lrmd[3410]: notice: galera_promote_0:28927:stderr [ ocf-exit-reason:local node <overcloud-controller-3> is started, but not in primary mode. Unknown state. ]
Aug 31 06:41:02 overcloud-controller-3 lrmd[3410]: notice: galera_promote_0:28927:stderr [ ocf-exit-reason:Failed initial monitor action ]
Aug 31 06:41:02 overcloud-controller-3 crmd[3413]: notice: Operation galera_promote_0: unknown error (node=overcloud-controller-3, call=565, rc=1, cib-update=220, confirmed=true)
Aug 31 06:41:02 overcloud-controller-3 crmd[3413]: notice: overcloud-controller-3-galera_promote_0:565 [ ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO)\nocf-exit-reason:Unable to retrieve wsrep_cluster_status, verify check_user '' has permissions to view status\nocf-exit-reason:local node <overcloud-controller-3> is started, but not in primary mode. Unknown state.\nocf-exit-reason:Failed initial monitor action\n ]
Aug 31 06:41:05 overcloud-controller-3 os-collect-config: /var/lib/os-collect-config/local-data not found. Skipping
Update: the missing file on the replaced controller was /etc/sysconfig/clustercheck. I'm going to rerun the procedure and copy it before step 14.
OK, so we need both /root/.my.cnf and /etc/sysconfig/clustercheck copied from one of the existing controllers to the replaced controller before running step 13, which brings the cluster out of maintenance. Moving this to the docs component.
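
For anyone hitting this before the docs are updated, here is a rough sketch of the copy described above. It only prints the commands so they can be reviewed first. The node names, the heat-admin SSH user with sudo, and the 0600 mode are assumptions made for illustration, not something taken from the docs; adjust them to your environment.

```shell
# Assumed values (not from the docs): source node, new node, SSH user.
SRC_NODE="${SRC_NODE:-overcloud-controller-0}"
NEW_NODE="${NEW_NODE:-overcloud-controller-3}"
SSH_USER="${SSH_USER:-heat-admin}"

# copy_cmd PATH
# Print (do not run) a pipeline that streams PATH from the existing
# controller to the same location on the new one, then restricts it to
# mode 0600 (an assumption, chosen because both files hold credentials).
copy_cmd() {
    printf 'ssh %s@%s sudo cat %s | ssh %s@%s "sudo tee %s >/dev/null && sudo chmod 600 %s"\n' \
        "$SSH_USER" "$SRC_NODE" "$1" "$SSH_USER" "$NEW_NODE" "$1" "$1"
}

# The two files identified in this report.
for f in /root/.my.cnf /etc/sysconfig/clustercheck; do
    copy_cmd "$f"
done
```

Run the printed commands from a host that can SSH to both controllers (the undercloud, typically) before taking the cluster out of maintenance.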
Dan, do you think we can add these steps to the docs please? Thank you.
Hi Marius,

Sorry for the long wait on this BZ. I originally modified the OSP10 docs to include these steps with an intention to backport to OSP 9 and 8.

I've now pushed an update to the OSP9 and OSP8 docs to include the following two steps (step 8 and step 9) as part of the process:

8. Configure the Galera cluster check on the new node. Copy the /etc/sysconfig/clustercheck from the existing node to the same location on the new node.

9. Configure the root user’s Galera access on the new node. Copy the /root/.my.cnf from the existing node to the same location on the new node.

OSP8 version:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/8/html-single/director_installation_and_usage/#sect-Replacing_Controller_Nodes-Manual_Intervention

OSP9 version:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/9/html-single/director_installation_and_usage/#sect-Replacing_Controller_Nodes-Manual_Intervention

Was there anything further required for this BZ?
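
A quick way to confirm that both files are in place on the new node before continuing might look like the sketch below. The function name and the optional root-prefix argument are hypothetical and only there for illustration; run it as root on the new controller.

```shell
# check_galera_files [ROOT]
# Report whether each of the two files is present and non-empty.
# ROOT is an optional path prefix (hypothetical, handy for dry runs against
# a copied tree); leave it empty to check the live filesystem.
# Returns 0 if both files are present, 1 otherwise.
check_galera_files() {
    root="${1:-}"
    missing=0
    for f in /etc/sysconfig/clustercheck /root/.my.cnf; do
        if [ -s "${root}${f}" ]; then
            echo "ok: ${f}"
        else
            echo "missing: ${f}"
            missing=1
        fi
    done
    return "$missing"
}

# Usage on the new controller, as root:
#   check_galera_files || echo "copy the missing files before continuing"
```

This only checks that the files exist and are non-empty; it does not validate the credentials inside them.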
(In reply to Dan Macpherson from comment #8)
> Was there anything further required for this BZ?

That is all. Thank you, Dan!
Thanks, Marius. And again sorry about the long wait.