rhel-osp-director: failed to replace controller on 8.0: Error: /usr/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1 returned 1 instead of one of [0]

Environment:
openstack-tripleo-heat-templates-0.8.14-5.el7ost.noarch
openstack-puppet-modules-7.0.17-1.el7ost.noarch
instack-undercloud-2.2.7-2.el7ost.noarch
openstack-tripleo-heat-templates-kilo-0.8.14-5.el7ost.noarch

Steps to reproduce:
1. Deploy OC 7.3
2. Upgrade to 8.0
3. Start the procedure to replace a controller: "After identifying the node index, redeploy the Overcloud and include the remove-node.yaml environment file"

Here's the replacement command:

export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates $THT \
  -e $THT/environments/storage-environment.yaml \
  -e $THT/environments/network-isolation.yaml \
  -e /home/stack/ssl-heat-templates/environments/puppet-ceph-external.yaml \
  -e /home/stack/network-environment.yaml \
  -e /home/stack/ssl-heat-templates/environments/enable-tls.yaml \
  -e /home/stack/ssl-heat-templates/environments/inject-trust-anchor.yaml \
  -e /home/stack/post.yaml \
  --control-scale 3 \
  --compute-scale 1 \
  --compute-flavor compute --control-flavor control --ceph-storage-flavor ceph-storage \
  --neutron-tunnel-types vxlan,gre --neutron-network-type vxlan,gre \
  --ntp-server clock.redhat.com \
  -e /home/stack/remove-node.yaml \
  --timeout 180

Here's the deployment command:

export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates $THT \
  -e $THT/environments/storage-environment.yaml \
  -e $THT/environments/network-isolation.yaml \
  -e /home/stack/ssl-heat-templates/environments/puppet-ceph-external.yaml \
  -e /home/stack/network-environment.yaml \
  -e /home/stack/ssl-heat-templates/environments/enable-tls.yaml \
  -e /home/stack/ssl-heat-templates/environments/inject-trust-anchor.yaml \
  -e /home/stack/post.yaml \
  --control-scale 3 \
  --compute-scale 1 \
  --compute-flavor compute --control-flavor control --ceph-storage-flavor ceph-storage \
  --neutron-tunnel-types vxlan,gre --neutron-network-type vxlan,gre \
  --ntp-server clock.redhat.com \
  --timeout 180

Result:

2016-04-12 20:26:34 [ControllerDeployment]: SIGNAL_COMPLETE Unknown
Stack overcloud UPDATE_FAILED
Heat Stack update failed.

Running heat deployment-show revealed:

Notice: /File[/etc/haproxy/haproxy.cfg]/seluser: seluser changed 'unconfined_u' to 'system_u'
Notice: Finished catalog run in 3739.17 seconds
", "deploy_stderr": "Could not retrieve fact='apache_version', resolution='<anonymous>': undefined method `[]' for nil:NilClass
Could not retrieve fact='apache_version', resolution='<anonymous>': undefined method `[]' for nil:NilClass
Warning: Scope(Class[Mongodb::Server]): Replset specified, but no replset_members or replset_config provided.
Warning: Scope(Haproxy::Config[haproxy]): haproxy: The $merge_options parameter will default to true in the next major release. Please review the documentation regarding the implications.
Error: /usr/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1 returned 1 instead of one of [0]
Error: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: change from notrun to 0 failed: /usr/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1 returned 1 instead of one of [0]
Warning: /Stage[main]/Pacemaker::Corosync/Notify[pacemaker settled]: Skipping because of failed dependencies
Warning: /Stage[main]/Pacemaker::Stonith/Exec[Disable STONITH]: Skipping because of failed dependencies
", "deploy_status_code": 6 }, "creation_time": "2016-04-12T19:23:37", "updated_time": "2016-04-12T20:26:28", "input_values": {}, "action": "CREATE", "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 6", "id": "6c9eb7cb-c57b-421e-b179-366001731070" }
[stack@undercloud ~]$

Expected result:
Replace a controller with no issues.
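For reference, a minimal sketch of how the failed deployment output above can be pulled up on the undercloud (the deployment ID is a placeholder, not a value from this report):

# On the undercloud, as the stack user
source ~/stackrc

# List failed resources in the overcloud stack, including nested templates
heat resource-list --nested-depth 5 overcloud | grep FAILED

# Show the software deployment that failed; <deployment-id> is the
# physical_resource_id of the failed ControllerDeployment resource
heat deployment-show <deployment-id>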
Engineering is working on a documentation update to resolve this issue.
Reproduced the issue on a clean 8.0 deployment (not after upgrade). This blocks verification of https://bugzilla.redhat.com/show_bug.cgi?id=1286302.
So controller-1 was replaced by a new node, controller-3. The reason for the failure is that on controller-3 puppet gets the following from pcs status:

Error: /usr/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1 returned 1 instead of one of [0]

The above is just a symptom. The real cause is that neither corosync nor pacemaker have even been started on this node:

~sosreport-overcloud-controller-3.localdomain-20160412204841
╰─$ grep -Eir "corosync|pacemaker" sos_commands/systemd/systemctl_list-units_--all
corosync.service    loaded inactive dead    Corosync Cluster Engine
pacemaker.service   loaded inactive dead    Pacemaker High Availability Cluster Manager

On the other two nodes (0, 2) pacemaker runs fine and pcs status has the proper "partition with quorum" output. So we need to understand why puppet has not managed to spin up pacemaker on controller-3. It correctly set up pcsd:

pcsd.service    loaded active   running PCS GUI and remote configuration interface

But it seems it gave up right after pcsd was set up and the hacluster password was set:

Apr 12 16:26:26 localhost os-collect-config: Notice: /Stage[main]/Pacemaker::Service/Service[pcsd]/ensure: ensure changed 'stopped' to 'running'
Apr 12 16:26:26 localhost os-collect-config: Notice: /Stage[main]/Pacemaker::Corosync/Exec[enable-not-start-tripleo_cluster]/returns: executed successfully
Apr 12 16:26:26 localhost os-collect-config: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Set password for hacluster user on tripleo_cluster]/returns: executed successfully
Apr 12 16:26:26 localhost os-collect-config: Notice: /Stage[main]/Pacemaker::Corosync/Exec[auth-successful-across-all-nodes]/returns: executed successfully
Apr 12 16:26:26 localhost os-collect-config: Notice: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: Error: cluster is not currently running on this node
Apr 12 16:26:26 localhost os-collect-config: Notice: /Stage[main]/Pacemaker::Corosync/Notify[pacemaker settled]: Dependency Exec[wait-for-settle] has failures: true
Apr 12 16:26:26 localhost os-collect-config: Notice: /Stage[main]/Pacemaker::Stonith/Exec[Disable STONITH]: Dependency Exec[wait-for-settle] has failures: true
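A quick way to confirm the same symptom directly on the replacement node (a sketch; run on the new controller with sudo):

# Quorum as seen by pacemaker -- this is what the puppet exec greps for
sudo pcs status | grep 'partition with quorum'

# Service state of the cluster components; on controller-3 only pcsd is running
sudo systemctl status pcsd corosync pacemaker --no-pager

# Puppet/os-collect-config output for the pacemaker steps
sudo journalctl -u os-collect-config | grep -i pacemaker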
Dan Macpherson, just assigned this to you -- can you work with Michele to sort out what (if anything) is missing in the controller replacement docs for OSP 8?
This error is normal behavior (at least in terms of our current process for replacing controller nodes). As Michele said in comment #6, this error occurs because the node hasn't joined the cluster yet. After this failure occurs, you need to manually remove the details for the old node (which at this stage has been deleted) and add the new node to the cluster. You also need to update the Keystone files on the new node. After this, ControllerLoadBalancerDeployment_Step1 should succeed. However, I've encountered a new issue at ControllerServicesBaseDeployment_Step2. It looks like Puppet does a health check on Galera (using clustercheck), but Galera appears to be nonoperational on the new node. I've tried to restart it, but it doesn't seem to be working and I can't figure out what's wrong or what I should do next. I might need some help with diagnosing Galera on the cluster. Michele, any chance you could help me with some diagnostic steps?
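For context, the manual cluster membership fix-up described above looks roughly like the following (a sketch only; node names and the hacluster password are examples, and the exact sequence belongs in the controller replacement docs):

# On one of the surviving controllers (e.g. overcloud-controller-0):
# drop the deleted node from the cluster configuration
sudo pcs cluster node remove overcloud-controller-1

# authenticate pcsd on the replacement node, then add it to the cluster
sudo pcs cluster auth overcloud-controller-3 -u hacluster -p <password>
sudo pcs cluster node add overcloud-controller-3

# start the cluster services on the new node
sudo pcs cluster start overcloud-controller-3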
So Dan sent me some logs about galera failing to start, and it is a repeating sequence of the following:

160420 15:30:11 mysqld_safe mysqld from pid file /var/run/mysql/mysqld.pid ended
160420 15:30:23 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
160420 15:30:23 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.1d7gIm' --pid-file='/var/lib/mysql/overcloud-controller-3.localdomain-recover.pid'
160420 15:30:23 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
160420 15:30:23 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
160420 15:30:23 [Warning] Could not increase number of max_open_files to more than 1024 (request: 4907)
160420 15:30:25 mysqld_safe WSREP: Recovered position 00000000-0000-0000-0000-000000000000:-1
160420 15:30:25 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
160420 15:30:25 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
160420 15:30:25 [Note] WSREP: wsrep_start_position var submitted: '00000000-0000-0000-0000-000000000000:-1'
160420 15:30:25 [Warning] Could not increase number of max_open_files to more than 1024 (request: 4907)
160420 15:30:25 InnoDB: The InnoDB memory heap is disabled
160420 15:30:25 InnoDB: Mutexes and rw_locks use GCC atomic builtins
160420 15:30:25 InnoDB: Compressed tables use zlib 1.2.7
160420 15:30:25 InnoDB: Using Linux native AIO
160420 15:30:25 InnoDB: Initializing buffer pool, size = 128.0M
160420 15:30:25 InnoDB: Completed initialization of buffer pool
160420 15:30:25 InnoDB: highest supported file format is Barracuda.
160420 15:30:26 InnoDB: Waiting for the background threads to start
160420 15:30:27 Percona XtraDB (http://www.percona.com) 5.5.41-MariaDB-37.0 started; log sequence number 1598129
160420 15:30:27 [Note] Plugin 'FEEDBACK' is disabled.
160420 15:30:27 [Warning] Failed to setup SSL
160420 15:30:27 [Warning] SSL error: SSL_CTX_set_default_verify_paths failed
160420 15:30:27 [Note] Server socket created on IP: '192.168.201.33'.
160420 15:30:27 [Note] WSREP: Recovered position: 00000000-0000-0000-0000-000000000000:-1
160420 15:30:27 InnoDB: Starting shutdown...
160420 15:30:27 InnoDB: Shutdown completed; log sequence number 1598129
160420 15:30:27 [Note] /usr/libexec/mysqld: Shutdown complete
160420 15:30:28 mysqld_safe mysqld from pid file /var/run/mysql/mysqld.pid ended

So galera starts correctly but seems to be shut down right afterwards. This makes me suspect that it is pacemaker deciding to shut this node down. Can you please post sosreports from all three controllers so we can try and figure out what is going on here? Tomorrow I am travelling; if it is urgent, please check with dciabrin.
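A few diagnostic commands that can help confirm whether Pacemaker is the one stopping mysqld on the new node (a sketch; the datadir path comes from the log above, everything else is a generic check):

# State of the galera resource as seen by the cluster
sudo pcs status resources | grep -A 3 galera

# Pacemaker's decisions about the galera resource on this node
sudo grep -i galera /var/log/messages | tail -n 50

# Last known Galera commit position on this node (all zeros / -1 means
# the node has no previous cluster state to recover from)
sudo cat /var/lib/mysql/grastate.dat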
Sasha has verified the Controller node replacement procedure in this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1327717 Closing this BZ.
After step 9 ("Enable Galera on the new node") I get the following results:

Master/Slave Set: galera-master [galera]
     galera (ocf::heartbeat:galera): FAILED Master overcloud-controller-0 (unmanaged)
     galera (ocf::heartbeat:galera): FAILED Master overcloud-controller-2 (unmanaged)
     Stopped: [ overcloud-controller-3 ]
Clone Set: mongod-clone [mongod]
--
* galera_promote_0 on overcloud-controller-0 'unknown error' (1): call=371, status=complete, exitreason='Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.', last-rc-change='Mon May 16 14:30:35 2016', queued=0ms, exec=130ms
* galera_promote_0 on overcloud-controller-2 'unknown error' (1): call=367, status=complete, exitreason='Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.', last-rc-change='Mon May 16 14:30:40 2016', queued=0ms, exec=130ms
* galera_monitor_20000 on overcloud-controller-3 'not running' (7): call=891, status=complete, exitreason='none', last-rc-change='Mon May 16 14:39:36 2016', queued=57ms, exec=59ms
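(For readers hitting the same state: these are common recovery steps when the galera resource is left FAILED/unmanaged, not the fix verified in this BZ, so treat them as a sketch.)

# Clear the failed actions so Pacemaker retries the galera resource
sudo pcs resource cleanup galera

# If the resource was left unmanaged, hand it back to the cluster
sudo pcs resource manage galera

# Watch the promotion attempts
sudo pcs status | grep -A 4 galera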
*** Bug 1336468 has been marked as a duplicate of this bug. ***
I'm following the docs and I have a couple of suggestions:

1. At the end of "Finalizing Overcloud Services" we could run 'pcs resource cleanup' to clear any failed actions that show up in pcs status.

2. Delete the existing neutron agents that still point to overcloud-controller-1.localdomain. I think this should be done after "Finalizing L3 Agent Router Hosting" (see the loop sketch after this comment):

neutron agent-list -F id -F host | grep overcloud-controller-1
neutron agent-delete $id

3. In the end, nova-consoleauth doesn't run on the new controller:

[stack@undercloud ~]$ nova service-list | grep consoleauth
| 11 | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up | 2016-06-06T14:55:15.000000 | - |
| 14 | nova-consoleauth | overcloud-controller-2.localdomain | internal | enabled | up | 2016-06-06T14:55:16.000000 | - |

so we might want to restart the openstack-nova-consoleauth resource for it to show up in the service list. After 'pcs resource restart openstack-nova-consoleauth':

[stack@undercloud ~]$ nova service-list | grep consoleauth
| 11 | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up | 2016-06-06T15:03:42.000000 | - |
| 14 | nova-consoleauth | overcloud-controller-2.localdomain | internal | enabled | up | 2016-06-06T15:03:42.000000 | - |
| 57 | nova-consoleauth | overcloud-controller-3.localdomain | internal | enabled | up | 2016-06-06T15:03:42.000000 | - |

What do you think?
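One possible way to script the agent clean-up from suggestion 2 (a sketch; the awk field assumes the default tabular output of neutron agent-list):

# Remove every neutron agent still registered against the deleted controller
for id in $(neutron agent-list -F id -F host | grep overcloud-controller-1 | awk '{print $2}'); do
    neutron agent-delete "$id"
done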
These things should be fine. In general, though, was the procedure a success? Or did you still encounter issues when starting Galera?
(In reply to Dan Macpherson from comment #43)
> These things should be fine.
>
> In general, though, was the procedure a success? Or did you still encounter
> issues when starting Galera?

Yes, it went smoothly; I haven't hit any issues so far.
Awesome. Tonight I'll make the adjustments from your previous comment and we should have this issue resolved completely.
Staged version: https://access.stage.redhat.com/documentation/en/red-hat-openstack-platform/8/director-installation-and-usage/94-replacing-controller-nodes Marius, how do the changes look? Is it okay to switch this BZ to VERIFIED?
(In reply to Dan Macpherson from comment #47)
> Staged version:
> https://access.stage.redhat.com/documentation/en/red-hat-openstack-platform/
> 8/director-installation-and-usage/94-replacing-controller-nodes
>
> Marius, how do the changes look? Is it okay to switch this BZ to VERIFIED?

I checked the staged version, but it looks like the changes are not there. Could you please check again that the changes are present? Thanks.
I performed a book rebuild. The changes should be there now.
Looks good. I moved it to verified. Thanks!
Changes now live on the customer portal.