Bug 1326507 - rhel-osp-director: failed to replace controller on 8.0: Error: /usr/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1 returned 1 instead of one of [0]
Summary: rhel-osp-director: failed to replace controller on 8.0: Error: /usr/sbin/pcs ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: documentation
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: async
Target Release: 8.0 (Liberty)
Assignee: Dan Macpherson
QA Contact: Alexander Chuzhoy
URL:
Whiteboard:
Duplicates: 1336468
Depends On:
Blocks: 1286302 1338623
 
Reported: 2016-04-12 20:37 UTC by Alexander Chuzhoy
Modified: 2016-06-16 04:41 UTC (History)
CC List: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1338623
Environment:
Last Closed: 2016-06-16 04:41:16 UTC
Target Upstream Version:
Embargoed:



Description Alexander Chuzhoy 2016-04-12 20:37:45 UTC
rhel-osp-director: failed to replace controller on 8.0: Error: /usr/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1 returned 1 instead of one of [0]


Environment:
openstack-tripleo-heat-templates-0.8.14-5.el7ost.noarch
openstack-puppet-modules-7.0.17-1.el7ost.noarch
instack-undercloud-2.2.7-2.el7ost.noarch
openstack-tripleo-heat-templates-kilo-0.8.14-5.el7ost.noarch


Steps to reproduce:
1. Deploy OC 7.3
2. Upgrade to 8.0
3. Start a procedure to replace a controller:
"After identifying the node index, redeploy the Overcloud and include the remove-node.yaml environment file"

Here's the replacement command:
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates $THT \
-e $THT/environments/storage-environment.yaml \
-e $THT/environments/network-isolation.yaml \
-e /home/stack/ssl-heat-templates/environments/puppet-ceph-external.yaml \
-e /home/stack/network-environment.yaml \
-e /home/stack/ssl-heat-templates/environments/enable-tls.yaml \
-e /home/stack/ssl-heat-templates/environments/inject-trust-anchor.yaml \
-e /home/stack/post.yaml \
--control-scale 3 \
--compute-scale 1 \
--compute-flavor compute --control-flavor control --ceph-storage-flavor ceph-storage  \
--neutron-tunnel-types vxlan,gre --neutron-network-type vxlan,gre \
--ntp-server clock.redhat.com \
-e /home/stack/remove-node.yaml \
--timeout 180


Here's the deployment command:
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates $THT \
-e $THT/environments/storage-environment.yaml \
-e $THT/environments/network-isolation.yaml \
-e /home/stack/ssl-heat-templates/environments/puppet-ceph-external.yaml \
-e /home/stack/network-environment.yaml \
-e /home/stack/ssl-heat-templates/environments/enable-tls.yaml \
-e /home/stack/ssl-heat-templates/environments/inject-trust-anchor.yaml \
-e /home/stack/post.yaml \
--control-scale 3 \
--compute-scale 1 \
--compute-flavor compute --control-flavor control --ceph-storage-flavor ceph-storage  \
--neutron-tunnel-types vxlan,gre --neutron-network-type vxlan,gre \
--ntp-server clock.redhat.com \
--timeout 180



Result:
2016-04-12 20:26:34 [ControllerDeployment]: SIGNAL_COMPLETE  Unknown                                                                         Stack overcloud UPDATE_FAILED                                                                   Heat Stack update failed.    


Running 'heat deployment-show' on the failed deployment revealed:

Notice: /File[/etc/haproxy/haproxy.cfg]/seluser: seluser changed 'unconfined_u' to 'system_u'
Notice: Finished catalog run in 3739.17 seconds
", "deploy_stderr": "Could not retrieve fact='apache_version', resolution='<anonymous>': undefined method `[]' for nil:NilClass
Could not retrieve fact='apache_version', resolution='<anonymous>': undefined method `[]' for nil:NilClass
Warning: Scope(Class[Mongodb::Server]): Replset specified, but no replset_members or replset_config provided.
Warning: Scope(Haproxy::Config[haproxy]): haproxy: The $merge_options parameter will default to true in the next major release. Please review the documentation regarding the implications.
Error: /usr/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1 returned 1 instead of one of [0]
Error: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: change from notrun to 0 failed: /usr/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1 returned 1 instead of one of [0]
Warning: /Stage[main]/Pacemaker::Corosync/Notify[pacemaker settled]: Skipping because of failed dependencies
Warning: /Stage[main]/Pacemaker::Stonith/Exec[Disable STONITH]: Skipping because of failed dependencies
", "deploy_status_code": 6 }, "creation_time": "2016-04-12T19:23:37", "updated_time": "2016-04-12T20:26:28", "input_values": {}, "action": "CREATE", "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 6", "id": "6c9eb7cb-c57b-421e-b179-366001731070" }
[stack@undercloud ~]$
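
(For reference, output like the above can usually be reached by drilling into the failed nested resources with the Liberty-era heat client; the commands below are a generic sketch with a placeholder ID, not the exact ones used here:)

heat resource-list --nested-depth 5 overcloud | grep -i failed    # locate the failed nested deployment resource
heat deployment-show <deployment-id>                              # print its deploy_stdout/deploy_stderr, as pasted above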


Expected result:
Replace a controller with no issues

Comment 4 Angus Thomas 2016-04-13 11:59:47 UTC
Eng are working on a documentation update to resolve this issue.

Comment 5 Alexander Chuzhoy 2016-04-14 13:25:20 UTC
Reproduced the issue on a clean 8.0 deployment (not after upgrade).
This blocks verification of https://bugzilla.redhat.com/show_bug.cgi?id=1286302.

Comment 6 Michele Baldessari 2016-04-14 14:24:05 UTC
So controller-1 was replaced by a new node, controller-3. The reason for the
failure is that on controller-3 Puppet gets the following from pcs status:
Error: /usr/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1 returned 1 instead of one of [0]

The above is just a symptom. The real cause is that neither corosync
nor pacemaker has even been started on this node:
~sosreport-overcloud-controller-3.localdomain-20160412204841  
╰─$ grep -Eir "corosync|pacemaker" sos_commands/systemd/systemctl_list-units_--all
  corosync.service                                                                         loaded    inactive dead      Corosync Cluster Engine
  pacemaker.service                                                                        loaded    inactive dead      Pacemaker High Availability Cluster Manager

On the other two nodes (0 and 2), pacemaker runs fine and pcs status shows
the expected "partition with quorum" output.

So we need to understand why puppet has not managed to spin up pacemaker
on controller-3. It correctly set up pcsd:
pcsd.service loaded    active   running   PCS GUI and remote configuration interface

But it seems it gave up right after pcsd was set up and the hacluster password
was set:
Apr 12 16:26:26 localhost os-collect-config: #033[mNotice: /Stage[main]/Pacemaker::Service/Service[pcsd]/ensure: ensure changed 'stopped' to 'running'#033[0m
Apr 12 16:26:26 localhost os-collect-config: #033[mNotice: /Stage[main]/Pacemaker::Corosync/Exec[enable-not-start-tripleo_cluster]/returns: executed successfully#033[0m
Apr 12 16:26:26 localhost os-collect-config: #033[mNotice: /Stage[main]/Pacemaker::Corosync/Exec[Set password for hacluster user on tripleo_cluster]/returns: executed successfully#033[0m
Apr 12 16:26:26 localhost os-collect-config: #033[mNotice: /Stage[main]/Pacemaker::Corosync/Exec[auth-successful-across-all-nodes]/returns: executed successfully#033[0m
Apr 12 16:26:26 localhost os-collect-config: #033[mNotice: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: Error: cluster is not currently running on this node#033[0m
Apr 12 16:26:26 localhost os-collect-config: #033[mNotice: /Stage[main]/Pacemaker::Corosync/Notify[pacemaker settled]: Dependency Exec[wait-for-settle] has failures: true#033[0m
Apr 12 16:26:26 localhost os-collect-config: #033[mNotice: /Stage[main]/Pacemaker::Stonith/Exec[Disable STONITH]: Dependency Exec[wait-for-settle] has failures: true#033[0m

Comment 7 Hugh Brock 2016-04-14 15:25:20 UTC
Dan Macpherson, just assigned this to you -- can you work with Michele to sort out what (if anything) is missing in the controller replacement docs for OSP 8?

Comment 8 Dan Macpherson 2016-04-20 12:53:21 UTC
This error is normal behavior (at least in terms of our current process for replacing controller nodes). As Michele said in comment #6, this error occurs because the node hasn't joined the cluster yet. After this failure occurs you need to manually remove the details for the old node (which at this stage has been deleted) and add the new node to the cluster. You also need to update the keystone files on the new node.
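
(For illustration only: a rough sketch of the pcs side of that manual fix-up, using standard pcs commands and example node names; follow the documented procedure for the exact steps. Run from one of the surviving controllers:)

sudo pcs status                                        # the deleted controller still shows up as offline here
sudo pcs cluster node remove overcloud-controller-1    # drop the deleted node from corosync/pacemaker
sudo pcs cluster node add overcloud-controller-3       # add the replacement node to the cluster
sudo pcs cluster start overcloud-controller-3          # start corosync/pacemaker on the new node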

After this, ControllerLoadBalancerDeployment_Step1 should succeed.

However, I've encountered a new issue at ControllerServicesBaseDeployment_Step2. It looks like Puppet does a health check on Galera (using clustercheck), and it appears Galera is nonoperational on the new node. I've tried to restart it, but it doesn't seem to be working and I can't figure out what's wrong or what I should do next.

I might need some help diagnosing Galera on the cluster. Michele, any chance you could help me with some diagnostic steps?
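
(For reference, a couple of generic Galera health checks that can be run on the new node; these are standard MySQL/Galera commands, assuming the clustercheck script is installed, and are not steps from the documented procedure:)

sudo clustercheck                              # the same script the HAProxy health check calls
sudo mysql -e "SHOW STATUS LIKE 'wsrep_%';"    # wsrep_ready, wsrep_cluster_status, wsrep_cluster_size, etc.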

Comment 11 Michele Baldessari 2016-04-20 19:29:10 UTC
So Dan sent me some logs of Galera failing to start; it is a repeating sequence of the following:
160420 15:30:11 mysqld_safe mysqld from pid file /var/run/mysql/mysqld.pid ended
160420 15:30:23 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
160420 15:30:23 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.1d7gIm' --pid-file='/var/lib/mysql/overcloud-controller-3.localdomain-recover.pid'
160420 15:30:23 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
160420 15:30:23 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
160420 15:30:23 [Warning] Could not increase number of max_open_files to more than 1024 (request: 4907)
160420 15:30:25 mysqld_safe WSREP: Recovered position 00000000-0000-0000-0000-000000000000:-1
160420 15:30:25 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
160420 15:30:25 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
160420 15:30:25 [Note] WSREP: wsrep_start_position var submitted: '00000000-0000-0000-0000-000000000000:-1'
160420 15:30:25 [Warning] Could not increase number of max_open_files to more than 1024 (request: 4907)
160420 15:30:25 InnoDB: The InnoDB memory heap is disabled
160420 15:30:25 InnoDB: Mutexes and rw_locks use GCC atomic builtins
160420 15:30:25 InnoDB: Compressed tables use zlib 1.2.7
160420 15:30:25 InnoDB: Using Linux native AIO
160420 15:30:25 InnoDB: Initializing buffer pool, size = 128.0M
160420 15:30:25 InnoDB: Completed initialization of buffer pool
160420 15:30:25 InnoDB: highest supported file format is Barracuda.
160420 15:30:26  InnoDB: Waiting for the background threads to start
160420 15:30:27 Percona XtraDB (http://www.percona.com) 5.5.41-MariaDB-37.0 started; log sequence number 1598129
160420 15:30:27 [Note] Plugin 'FEEDBACK' is disabled.
160420 15:30:27 [Warning] Failed to setup SSL
160420 15:30:27 [Warning] SSL error: SSL_CTX_set_default_verify_paths failed
160420 15:30:27 [Note] Server socket created on IP: '192.168.201.33'.
160420 15:30:27 [Note] WSREP: Recovered position: 00000000-0000-0000-0000-000000000000:-1
160420 15:30:27  InnoDB: Starting shutdown...
160420 15:30:27  InnoDB: Shutdown completed; log sequence number 1598129
160420 15:30:27 [Note] /usr/libexec/mysqld: Shutdown complete

160420 15:30:28 mysqld_safe mysqld from pid file /var/run/mysql/mysqld.pid ended


So Galera starts correctly but seems to be shut down right afterwards. This
makes me suspect that it is pacemaker deciding to shut it down on this node.

Can you please post sosreports from all three controllers so we can try
to figure out what is going on here?
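
(Generic ways to gather what is being asked for here; standard commands and default paths, nothing specific to this environment:)

sudo pcs status --full | grep -A 4 galera    # pacemaker's view of the galera master/slave set
sudo cat /var/lib/mysql/grastate.dat         # the node's last recorded wsrep UUID/seqno
sudo sosreport --batch                       # generate the requested sosreport on each controller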

Tomorrow I am travelling; if it is urgent, please check with dciabrin.

Comment 26 Dan Macpherson 2016-05-12 23:20:23 UTC
Sasha has verified the Controller node replacement procedure in this BZ:

https://bugzilla.redhat.com/show_bug.cgi?id=1327717

Closing this BZ.

Comment 27 Marius Cornea 2016-05-16 15:14:29 UTC
After step 9 (Enable Galera on the new node), I get the following results:

 Master/Slave Set: galera-master [galera]
     galera	(ocf::heartbeat:galera):	FAILED Master overcloud-controller-0 (unmanaged)
     galera	(ocf::heartbeat:galera):	FAILED Master overcloud-controller-2 (unmanaged)
     Stopped: [ overcloud-controller-3 ]
 Clone Set: mongod-clone [mongod]
--
* galera_promote_0 on overcloud-controller-0 'unknown error' (1): call=371, status=complete, exitreason='Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.',
    last-rc-change='Mon May 16 14:30:35 2016', queued=0ms, exec=130ms
* galera_promote_0 on overcloud-controller-2 'unknown error' (1): call=367, status=complete, exitreason='Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.',
    last-rc-change='Mon May 16 14:30:40 2016', queued=0ms, exec=130ms
* galera_monitor_20000 on overcloud-controller-3 'not running' (7): call=891, status=complete, exitreason='none',
    last-rc-change='Mon May 16 14:39:36 2016', queued=57ms, exec=59ms
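
(For anyone hitting the same state: the resources above show as unmanaged with failed promote/monitor actions, and the generic way to hand them back to pacemaker and retry is the pcs commands below; they are not presented as the official fix for this report:)

sudo pcs resource manage galera     # put the unmanaged galera copies back under pacemaker's control
sudo pcs resource cleanup galera    # clear the failed actions so pacemaker re-probes and retries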

Comment 29 Marius Cornea 2016-05-16 15:15:19 UTC
*** Bug 1336468 has been marked as a duplicate of this bug. ***

Comment 42 Marius Cornea 2016-06-06 15:06:37 UTC
I'm following the docs and I have a couple of suggestions:

1. At the end of Finalizing Overcloud Services we could do a 'pcs resource cleanup' to clear up any Failed actions that show up in pcs status.

2. Delete the existing neutron agents that point to overcloud-controller-1.localdomain. I think this should be done after Finalizing L3 Agent Router Hosting.

neutron agent-list  -F id -F host | grep overcloud-controller-1
neutron agent-delete $id

3. In the end nova-consoleauth doesn't run on the new controller:

[stack@undercloud ~]$ nova service-list  | grep consoleauth
| 11 | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-06-06T14:55:15.000000 | -               |
| 14 | nova-consoleauth | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-06-06T14:55:16.000000 | -               |


so we might want to restart the openstack-nova-consoleauth resource for it to show up in the service list:

After 'pcs resource restart openstack-nova-consoleauth':

[stack@undercloud ~]$ nova service-list  | grep consoleauth
| 11 | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-06-06T15:03:42.000000 | -               |
| 14 | nova-consoleauth | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-06-06T15:03:42.000000 | -               |
| 57 | nova-consoleauth | overcloud-controller-3.localdomain | internal | enabled | up    | 2016-06-06T15:03:42.000000 | -               |

What do you think?
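
(Pulling the three suggestions above together as one rough sequence, with an example agent ID placeholder; pcs commands run on a controller, neutron commands with overcloud credentials sourced:)

sudo pcs resource cleanup                                         # 1. clear any Failed Actions left in pcs status
neutron agent-list -F id -F host | grep overcloud-controller-1    # 2. list agents still pointing at the removed node
neutron agent-delete <agent-id>                                   #    delete each of them
sudo pcs resource restart openstack-nova-consoleauth              # 3. so the new controller shows up in nova service-list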

Comment 43 Dan Macpherson 2016-06-07 01:08:08 UTC
These things should be fine.

In general, though, was the procedure a success? Or did you still encounter issues when starting Galera?

Comment 44 Marius Cornea 2016-06-07 04:24:53 UTC
(In reply to Dan Macpherson from comment #43)
> These things should be fine.
> 
> In general, though, was the procedure a success? Or did you still encounter
> issues when starting Galera?

Yes, it goes smoothly; I haven't hit any issues so far.

Comment 45 Dan Macpherson 2016-06-07 06:04:31 UTC
Awesome. Tonight I'll make the adjustments from your previous comment and we should have this issue resolved completely.

Comment 47 Dan Macpherson 2016-06-07 15:07:54 UTC
Staged version:
https://access.stage.redhat.com/documentation/en/red-hat-openstack-platform/8/director-installation-and-usage/94-replacing-controller-nodes

Marius, how do the changes look? Is it okay to switch this BZ to VERIFIED?

Comment 48 Marius Cornea 2016-06-08 07:55:15 UTC
(In reply to Dan Macpherson from comment #47)
> Staged version:
> https://access.stage.redhat.com/documentation/en/red-hat-openstack-platform/
> 8/director-installation-and-usage/94-replacing-controller-nodes
> 
> Marius, how do the changes look? Is it okay to switch this BZ to VERIFIED?

I checked the staged version, but it looks like the changes are not there. Could you please check again that the changes are present? Thanks.

Comment 49 Dan Macpherson 2016-06-08 13:59:31 UTC
I performed a book rebuild. The changes should be there now.

Comment 50 Marius Cornea 2016-06-09 14:22:35 UTC
Looks good. I moved it to VERIFIED. Thanks!

Comment 51 Dan Macpherson 2016-06-16 04:41:16 UTC
Changes are now live on the customer portal.

