Bug 1371665

Summary: [documentation] Controller replacement fails during step 14. Wait until the Galera service starts on all nodes.
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: documentation
Assignee: Dan Macpherson <dmacpher>
Status: CLOSED CURRENTRELEASE
QA Contact: RHOS Documentation Team <rhos-docs>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 8.0 (Liberty)
CC: dbecker, dciabrin, djuran, dmacpher, jschluet, jslagle, mburns, mcornea, morazi, ochalups, rhel-osp-director-maint, srevivo
Target Milestone: ga
Keywords: Documentation, Regression
Target Release: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-03-08 12:41:29 UTC Type: Bug
Attachments:
Description Flags
sosreport-controller-3 none

Description Marius Cornea 2016-08-30 18:29:11 UTC
Created attachment 1196048 [details]
sosreport-controller-3

Description of problem:

I'm following the controller replacement docs[1] and the procedure fails during step 14 while trying to start galera on the replaced controller node:

 Master/Slave Set: galera-master [galera]
     galera	(ocf::heartbeat:galera):	FAILED Master overcloud-controller-3 (unmanaged)

* galera_promote_0 on overcloud-controller-3 'unknown error' (1): call=223, status=complete, exitreason='Failed initial monitor action',
    last-rc-change='Tue Aug 30 14:21:31 2016', queued=0ms, exec=10828ms

in /var/log/messages:

Aug 30 10:21:41 localhost galera(galera)[1802]: ERROR: Unable to retrieve wsrep_cluster_status, verify check_user '' has permissions to view status
Aug 30 10:21:41 localhost galera(galera)[1802]: ERROR: local node <overcloud-controller-3> is started, but not in primary mode. Unknown state.
Aug 30 10:21:41 localhost galera(galera)[1802]: ERROR: Failed initial monitor action
Aug 30 10:21:41 localhost lrmd[30559]:  notice: galera_promote_0:1802:stderr [ ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO) ]
Aug 30 10:21:41 localhost lrmd[30559]:  notice: galera_promote_0:1802:stderr [ ocf-exit-reason:Unable to retrieve wsrep_cluster_status, verify check_user '' has permissions to view status ]
Aug 30 10:21:41 localhost lrmd[30559]:  notice: galera_promote_0:1802:stderr [ ocf-exit-reason:local node <overcloud-controller-3> is started, but not in primary mode. Unknown state. ]
Aug 30 10:21:41 localhost lrmd[30559]:  notice: galera_promote_0:1802:stderr [ ocf-exit-reason:Failed initial monitor action ]
Aug 30 10:21:41 localhost crmd[30562]:  notice: Operation galera_promote_0: unknown error (node=overcloud-controller-3, call=223, rc=1, cib-update=77, confirmed=true)
Aug 30 10:21:41 localhost crmd[30562]:  notice: overcloud-controller-3-galera_promote_0:223 [ ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO)\nocf-exit-reason:Unable to retrieve wsrep_cluster_status, verify check_user '' has permissions to view status\nocf-exit-reason:local node <overcloud-controller-3> is started, but not in primary mode. Unknown state.\nocf-exit-reason:Failed initial monitor action\n ]

[1] https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/paged/director-installation-and-usage/94-replacing-controller-nodes
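
The failure signature in the log excerpt above can be picked out mechanically. The following is a minimal sketch, not part of the original report: the function names `classify_galera_errors` and `looks_like_missing_credentials` are hypothetical, and the patterns are copied from the messages quoted in this bug, so other resource-agents versions may word them differently.

```python
import re

# Patterns copied from the /var/log/messages excerpt in this report.
SIGNATURES = {
    "access_denied": re.compile(r"ERROR 1045 \(28000\): Access denied for user"),
    "wsrep_status": re.compile(r"Unable to retrieve wsrep_cluster_status"),
    "monitor_failed": re.compile(r"Failed initial monitor action"),
}


def classify_galera_errors(lines):
    """Return the set of signature names found in an iterable of log lines."""
    found = set()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                found.add(name)
    return found


def looks_like_missing_credentials(found):
    """Heuristic (an assumption): MySQL refused the login and the agent
    could not read wsrep_cluster_status during the same promote attempt."""
    return {"access_denied", "wsrep_status"} <= found
```

Feeding the lines of /var/log/messages from the replaced controller to `classify_galera_errors` and passing the result to `looks_like_missing_credentials` gives a quick hint that the promote failure is an authentication problem rather than a cluster state problem.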


Version-Release number of selected component (if applicable):
resource-agents-3.9.5-54.el7_2.16.x86_64
openstack-tripleo-heat-templates-0.8.14-18.el7ost.noarch

How reproducible:
2/2

Steps to Reproduce:
1. Follow the docs to replace overcloud-controller-1

Actual results:
Procedure fails while waiting for Galera to start on all nodes (step 14)

Expected results:
Galera gets started on all nodes.

Additional info:
Attaching the sosreport. Please let me know if a reproducing system is needed for investigation.

Comment 2 Marius Cornea 2016-08-31 06:43:49 UTC
The same error happens with OSP9. I see that we introduced password authentication for MySQL, and there is no /root/.my.cnf file on the replaced controller. Even after I add it and run 'pcs resource cleanup galera overcloud-controller-3', I end up with the same error:

* galera_promote_0 on overcloud-controller-3 'unknown error' (1): call=565, status=complete, exitreason='Failed initial monitor action',
    last-rc-change='Wed Aug 31 06:40:54 2016', queued=0ms, exec=8358ms

Aug 31 06:41:02 overcloud-controller-3 galera(galera)[28927]: ERROR: Unable to retrieve wsrep_cluster_status, verify check_user '' has permissions to view status
Aug 31 06:41:02 overcloud-controller-3 galera(galera)[28927]: ERROR: local node <overcloud-controller-3> is started, but not in primary mode. Unknown state.
Aug 31 06:41:02 overcloud-controller-3 galera(galera)[28927]: ERROR: Failed initial monitor action
Aug 31 06:41:02 overcloud-controller-3 lrmd[3410]:  notice: galera_promote_0:28927:stderr [ ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO) ]
Aug 31 06:41:02 overcloud-controller-3 lrmd[3410]:  notice: galera_promote_0:28927:stderr [ ocf-exit-reason:Unable to retrieve wsrep_cluster_status, verify check_user '' has permissions to view status ]
Aug 31 06:41:02 overcloud-controller-3 lrmd[3410]:  notice: galera_promote_0:28927:stderr [ ocf-exit-reason:local node <overcloud-controller-3> is started, but not in primary mode. Unknown state. ]
Aug 31 06:41:02 overcloud-controller-3 lrmd[3410]:  notice: galera_promote_0:28927:stderr [ ocf-exit-reason:Failed initial monitor action ]
Aug 31 06:41:02 overcloud-controller-3 crmd[3413]:  notice: Operation galera_promote_0: unknown error (node=overcloud-controller-3, call=565, rc=1, cib-update=220, confirmed=true)
Aug 31 06:41:02 overcloud-controller-3 crmd[3413]:  notice: overcloud-controller-3-galera_promote_0:565 [ ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO)\nocf-exit-reason:Unable to retrieve wsrep_cluster_status, verify check_user '' has permissions to view status\nocf-exit-reason:local node <overcloud-controller-3> is started, but not in primary mode. Unknown state.\nocf-exit-reason:Failed initial monitor action\n ]
Aug 31 06:41:05 overcloud-controller-3 os-collect-config: /var/lib/os-collect-config/local-data not found. Skipping

Comment 3 Marius Cornea 2016-08-31 08:24:25 UTC
Update: the missing file on the replaced controller was /etc/sysconfig/clustercheck. I'm going to rerun the procedure and copy it before step 14.

Comment 4 Marius Cornea 2016-08-31 14:10:10 UTC
OK, so both /root/.my.cnf and /etc/sysconfig/clustercheck need to be copied from one of the existing controllers to the replaced controller before running step 13, which brings the cluster out of maintenance. Moving this to the docs component.
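
The copy described above could be scripted roughly as follows. This is a sketch under stated assumptions, not the documented procedure: the `heat-admin` login user, passwordless sudo on both nodes, and the node names in the usage example are assumptions, and `build_copy_commands` is a hypothetical helper. It only builds the shell pipelines so they can be reviewed before anything is run.

```python
import shlex

# The two files identified in this bug as missing on the replaced controller.
GALERA_FILES = ("/etc/sysconfig/clustercheck", "/root/.my.cnf")


def build_copy_commands(source, target, files=GALERA_FILES, user="heat-admin"):
    """Return one shell pipeline per file that streams it from source to target.

    `user` and passwordless sudo on both nodes are assumptions, not taken
    from the bug report.
    """
    commands = []
    for path in files:
        quoted = shlex.quote(path)
        # Read the file as root on the existing controller.
        read = f"ssh {user}@{source} sudo cat {quoted}"
        # Write it as root to the same location on the replaced controller.
        remote_write = shlex.quote(f"sudo tee {quoted} > /dev/null")
        write = f"ssh {user}@{target} {remote_write}"
        commands.append(f"{read} | {write}")
    return commands


if __name__ == "__main__":
    # Hypothetical node names; the commands are printed, not executed.
    for command in build_copy_commands("overcloud-controller-0", "overcloud-controller-3"):
        print(command)
```

Files written through `tee` take the default mode of the remote root user, so the permissions of /root/.my.cnf on the new node may need to be tightened afterwards to match the existing controller.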

Comment 5 Marius Cornea 2016-08-31 14:10:45 UTC
Dan, do you think we can add these steps to the docs please? Thank you.

Comment 8 Dan Macpherson 2017-03-08 03:11:00 UTC
Hi Marius,

Sorry for the long wait on this BZ. I originally modified the OSP10 docs to include these steps, with the intention of backporting them to OSP 9 and 8.

I've now pushed an update to the OSP9 and OSP8 docs to include the following two steps (step 8 and step 9) as part of the process:

8. Configure the Galera cluster check on the new node. Copy the /etc/sysconfig/clustercheck from the existing node to the same location on the new node.

9. Configure the root user’s Galera access on the new node. Copy the /root/.my.cnf from the existing node to the same location on the new node.
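
Before taking the cluster out of maintenance, the result of these two steps can be checked on the new node. The snippet below is a minimal sketch and not part of the published procedure: `missing_galera_files` is a hypothetical helper, and the `root` parameter exists only so the check can be pointed at a test directory.

```python
from pathlib import Path

# Paths from steps 8 and 9, relative to the filesystem root.
REQUIRED_FILES = ("etc/sysconfig/clustercheck", "root/.my.cnf")


def missing_galera_files(root="/"):
    """Return the required files that are absent under `root`.

    Run as root on the new node: /root is normally not readable by
    other users, so the check is unreliable otherwise.
    """
    base = Path(root)
    return [str(base / rel) for rel in REQUIRED_FILES if not (base / rel).is_file()]
```

An empty list from `missing_galera_files()` means both files are in place; it says nothing about whether their contents match the existing controllers.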

OSP8 version:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/8/html-single/director_installation_and_usage/#sect-Replacing_Controller_Nodes-Manual_Intervention

OSP9 version:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/9/html-single/director_installation_and_usage/#sect-Replacing_Controller_Nodes-Manual_Intervention

Was there anything further required for this BZ?

Comment 9 Marius Cornea 2017-03-08 08:31:15 UTC
(In reply to Dan Macpherson from comment #8)

That is all. Thank you, Dan!

Comment 10 Dan Macpherson 2017-03-08 12:41:29 UTC
Thanks, Marius. And again sorry about the long wait.