| Summary: | Controller replacement procedure fails while waiting for Galera to start on all the nodes | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Marius Cornea <mcornea> |
| Component: | documentation | Assignee: | Dan Macpherson <dmacpher> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | RHOS Documentation Team <rhos-docs> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 10.0 (Newton) | CC: | abeekhof, agurenko, dbecker, dciabrin, dmacpher, fdinitto, jslagle, mburns, mcornea, michele, morazi, rhel-osp-director-maint, srevivo |
| Target Milestone: | ga | Keywords: | Documentation, ZStream |
| Target Release: | 10.0 (Newton) | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2018-07-20 08:46:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
There might be two unrelated things at play here that caused the galera cluster to stop and fail to restart.
The sosreports from rhbz#1393367 show that at some point during the replacement procedure, pacemaker was asked to restart galera, e.g. on ctrl0:
Nov 09 11:45:33 [32212] overcloud-controller-0.localdomain pengine: info: check_action_definition: params:reload <parameters wsrep_cluster_address="gcomm://overcloud-controller-0,overcloud-controller-3,overcloud-controller-2" enable_creation="true" additional_parameters="--open-files-limit=16384"/>
Nov 09 11:45:33 [32212] overcloud-controller-0.localdomain pengine: info: check_action_definition: Parameters to galera:0_start_0 on overcloud-controller-0 changed: was 7581597d2403b46ccfd48d3a7285d437 vs. now 6e14c76ea825f45eb0b458c1dd4d5cdf (reload:3.0.10) 0:0;30:11:0:1cafa7b3-f67e-4617-9fcd-803aa5bfb178
Nov 09 11:45:33 [32212] overcloud-controller-0.localdomain pengine: info: check_action_definition: params:reload <parameters wsrep_cluster_address="gcomm://overcloud-controller-0,overcloud-controller-3,overcloud-controller-2" enable_creation="true" additional_parameters="--open-files-limit=16384" CRM_meta_timeout="30000"/>
Nov 09 11:45:33 [32212] overcloud-controller-0.localdomain pengine: info: check_action_definition: Parameters to galera:0_monitor_10000 on overcloud-controller-0 changed: was a541fe86df643b88084fbbbc2f568fbf vs. now 25c5cf6f7050773f4744106cac3a6396 (reload:3.0.10) 0:8;29:12:8:1cafa7b3-f67e-4617-9fcd-803aa5bfb178
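(For reference, a hedged way to locate these pengine decisions on a controller; the log path is an assumption, since RHEL 7 clusters may log to /var/log/cluster/corosync.log or /var/log/pacemaker.log depending on configuration:)

    # Assumption: pacemaker logs to /var/log/cluster/corosync.log on this node;
    # some deployments use /var/log/pacemaker.log instead.
    sudo grep -E 'check_action_definition.*galera' /var/log/cluster/corosync.log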
The only log entry I could find that might indicate a change to the galera resource is from a few minutes earlier:
Nov 09 10:41:03 overcloud-controller-0.localdomain os-collect-config[3073]: Notice: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql/Tripleo::Pacemaker::Resource_restart_flag[galera-master]/Exec[galera-master resource restart flag]: Triggered 'refresh' from 1 events
Now, on the existing ctrl0 and ctrl2: the restart apparently called the "promote" op before all cloned resources had gone to the Stopped state, which is unexpected for the galera resource agent, because it then tries to restart galera without a bootstrap node having been elected first.
pcs status' output:
Master/Slave Set: galera-master [galera]
galera (ocf::heartbeat:galera): FAILED Master overcloud-controller-0 (blocked)
galera (ocf::heartbeat:galera): FAILED Master overcloud-controller-2 (blocked)
Slaves: [ overcloud-controller-3 ]
Failed Actions:
* galera_promote_0 on overcloud-controller-0 'unknown error' (1): call=79, status=complete, exitreason='Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.',
last-rc-change='Wed Nov 9 11:45:43 2016', queued=0ms, exec=249ms
* galera_promote_0 on overcloud-controller-2 'unknown error' (1): call=71, status=complete, exitreason='Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.',
last-rc-change='Wed Nov 9 11:45:43 2016', queued=0ms, exec=230ms
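(To illustrate why the promote fails: the galera resource agent needs the last committed seqno from every node listed in the gcomm:// address before it can elect a bootstrap node. A hedged sketch for checking that state on each controller; the heat-admin user, the hostnames, and the /var/lib/mysql data directory are assumptions based on a default OSP deployment:)

    # Hypothetical check from the undercloud; hostnames must resolve there.
    for node in overcloud-controller-0 overcloud-controller-2 overcloud-controller-3; do
      echo "== $node =="
      ssh heat-admin@$node "sudo cat /var/lib/mysql/grastate.dat"   # shows cluster uuid and seqno
    done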
Anyway, the galera resource should not be restarted until ctrl3 is added to the cluster, otherwise the bootstrap won't be possible.
One should probably wrap the replacement steps in "pcs resource unmanage galera" to prevent any voluntary restart.
I'm trying to figure out a sequence of pcs commands that would work.
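(A minimal sketch of that wrapping idea, assuming it is run from a surviving controller; the exact placement within the replacement procedure is worked out in the following comments:)

    # Before the replacement steps that could trigger a restart:
    sudo pcs resource unmanage galera    # pacemaker stops acting on galera

    # ... run the controller replacement steps ...

    # After the new node has joined the cluster:
    sudo pcs resource manage galera      # hand galera back to pacemaker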
Rationale for updating the step involving galera in the procedure:

While the nodes are being replaced, the galera cluster _must not_ be stopped (or restarted), because it needs to know the last commit of _all_ the nodes specified in the gcomm:// address in order to be able to elect a bootstrap node and restart the galera cluster.

This bz shows that for some reason, steps of the procedure can now cause the galera resource to be restarted, so this must be prevented.

I would propose to add three steps:

1. Move the galera resource out of pacemaker control _just before_ removing the corosync node:

   pcs resource unmanage galera

   Reading the procedure, I think you should add it at the end of section 9.4.2, just before:

   [stack@director ~]$ openstack overcloud deploy --templates --control-scale 3 -e ~/templates/remove-controller.yaml [OTHER OPTIONS]

2. Once the new node is added to the corosync cluster, force pacemaker to reprobe the state of the galera resource on all the nodes. In section 9.4.3, right after step 11, add:

   pcs resource cleanup galera

3. Give back control of galera to pacemaker:

   pcs resource manage galera

Knowing the state of the resource, pacemaker won't try to restart it, and there will be no service outage when galera is started on the new node.

Marius, could you test the procedure with those 3 additional steps? I can follow up on the actions if needed.

(In reply to Damien Ciabrini from comment #3)

Thanks, Damien. I was able to move forward after adding the mentioned steps. One thing I'm noticing, though, is that in the end all the resources show up as unmanaged. Should there be another step to get them managed again, or should this be covered by the overcloud deploy re-run (right after step 16 in 9.4.3), which should get the deployment into a converged state?
[heat-admin@overcloud-controller-0 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-0 (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum
Last updated: Fri Nov 11 09:09:16 2016
Last change: Thu Nov 10 21:16:21 2016 by root via cibadmin on overcloud-controller-3

              *** Resource management is DISABLED ***
  The cluster will not attempt to start, stop or recover services

3 nodes and 19 resources configured

Online: [ overcloud-controller-0 overcloud-controller-2 overcloud-controller-3 ]

Full list of resources:

 ip-10.0.0.10 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 (unmanaged)
 ip-192.168.0.14 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 (unmanaged)
 Clone Set: haproxy-clone [haproxy] (unmanaged)
     haproxy (systemd:haproxy): Started overcloud-controller-0 (unmanaged)
     haproxy (systemd:haproxy): Started overcloud-controller-2 (unmanaged)
     haproxy (systemd:haproxy): Started overcloud-controller-3 (unmanaged)
 Master/Slave Set: galera-master [galera] (unmanaged)
     galera (ocf::heartbeat:galera): Master overcloud-controller-0 (unmanaged)
     galera (ocf::heartbeat:galera): Master overcloud-controller-2 (unmanaged)
     galera (ocf::heartbeat:galera): Master overcloud-controller-3 (unmanaged)
 ip-10.0.0.145 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 (unmanaged)
 Clone Set: rabbitmq-clone [rabbitmq] (unmanaged)
     rabbitmq (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-0 (unmanaged)
     rabbitmq (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-2 (unmanaged)
     rabbitmq (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-3 (unmanaged)
 ip-10.0.1.12 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 (unmanaged)
 ip-10.0.0.18 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 (unmanaged)
 Master/Slave Set: redis-master [redis] (unmanaged)
     redis (ocf::heartbeat:redis): Master overcloud-controller-0 (unmanaged)
     redis (ocf::heartbeat:redis): Slave overcloud-controller-2 (unmanaged)
     redis (ocf::heartbeat:redis): Slave overcloud-controller-3 (unmanaged)
 ip-172.16.18.27 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 (unmanaged)
 openstack-cinder-volume (systemd:openstack-cinder-volume): Started overcloud-controller-0 (unmanaged)

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Moving forward - I was able to bring the resources back to the managed state after running 'pcs property set maintenance-mode=false --wait'.

Marius, since this appears to be a documentation/procedural issue, should we reassign the bug to documentation?

(In reply to Fabio Massimo Di Nitto from comment #6)
> Marius, since this appears to be a documentation/procedural issue, should we
> reassign the bug to documentation?

Yes, switched the component.

Hi Marius and Damien,

I've been trying to reproduce the issue on my own environment but can't seem to, so I might need some help with inserting those three steps.

Here is the current documentation:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/director_installation_and_usage/sect-scaling_the_overcloud#sect-Replacing_Controller_Nodes-Manual_Intervention

Just to follow Damien's instructions...

> 1. Move the galera resource out of pacemaker control _just before_ removing
> the corosync node:
>
> pcs resource unmanage galera
>
> Reading the procedure, I think you should add it at the end of section 9.4.2,
> just before:
>
> [stack@director ~]$ openstack overcloud deploy --templates --control-scale 3
> -e ~/templates/remove-controller.yaml [OTHER OPTIONS]

This is pretty straightforward. No questions here.

> 2. Once the new node is added to the corosync cluster, force pacemaker to
> reprobe the state of the galera resource on all the nodes. In section 9.4.3,
> right after step 11, add:
>
> pcs resource cleanup galera

So Step 11 now appears to be where you edit the corosync.conf file and change the nodeid, so the steps might have changed since this bug was filed. Which command do we run "pcs resource cleanup galera" after? Is it after we add the node ("sudo pcs cluster node add overcloud-controller-3") or after we restart corosync ("sudo pcs cluster reload corosync")?

> 3. Give back control of galera to pacemaker.
>
> pcs resource manage galera
>
> Knowing the state of the resource, pacemaker won't try to restart it, and
> there will be no service outage when galera is started on the new node.

At what stage do we run this command?

Any help with this is most appreciated.

(In reply to Dan Macpherson from comment #9)
> Which command do we run "pcs resource cleanup galera" after? Is it after
> we add the node ("sudo pcs cluster node add overcloud-controller-3") or
> after we restart corosync ("sudo pcs cluster reload corosync")?

According to my notes and the script that I used when I tested this, it should be added right after pcs cluster start overcloud-controller-3 (8.4.3.13):

pcs resource cleanup galera

> > 3. Give back control of galera to pacemaker.
> >
> > pcs resource manage galera
>
> At what stage do we run this command?
This should be right after the step above:

pcs resource manage galera

In the automated script, this is the block describing these steps:

    - name: start cluster on new controller node
      shell: pcs cluster start overcloud-controller-3

    - block:
        - name: reprobe state of galera resource
          shell: pcs resource cleanup galera
        - name: give back control of galera to pcmk
          shell: pcs resource manage galera
      when: installer.product.version > 9

> Any help with this is most appreciated.

Thanks, Marius! Now that you've pointed it out, I can see why the steps need to be placed in that order.

I've included additional step 1 here:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html-single/director_installation_and_usage/#sect-Replacing_Controller_Nodes-Node_Replacement

The text reads:

"""
The overcloud’s database must continue running during the replacement procedure. To ensure Pacemaker does not stop Galera during this procedure, select a running Controller node and run the following command on the undercloud using the Controller node’s IP address:

[stack@director ~]$ ssh heat-admin.0.47 "sudo pcs resource unmanage galera"
"""

And additional steps 2 and 3 are here:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html-single/director_installation_and_usage/#sect-Replacing_Controller_Nodes-Manual_Intervention

The text reads:

"""
Restart the Galera cluster and return it to Pacemaker management:

[heat-admin@overcloud-controller-0 ~]$ sudo pcs resource cleanup galera
[heat-admin@overcloud-controller-0 ~]$ sudo pcs resource manage galera
"""

Does all this align with what you've tested? Anything I should clarify?

Marius, given the changes that Dan did, I am going ahead and closing this one. Please reopen if you feel the changes need more work.

Thanks,
Michele
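(Putting the thread together, a hedged recap of where the three extra commands sit in the final documented flow; the controller IP is an illustrative placeholder and the section names refer to the documentation quoted above:)

    # 1. Before re-running the overcloud deploy that removes the old controller
    #    (end of the "Node Replacement" section), from the undercloud:
    ssh heat-admin@<controller-ip> "sudo pcs resource unmanage galera"   # <controller-ip> is a placeholder

    # 2. After "pcs cluster start overcloud-controller-3" in the manual
    #    intervention section, on a running controller:
    sudo pcs resource cleanup galera
    sudo pcs resource manage galera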
Description of problem:

I am following the controller replacement procedure [1] on OSP10 and it fails at step 14, "Wait until the Galera service starts on all nodes".

This is the output of pcs status:

pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-0 (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum
Last updated: Wed Nov 9 12:17:11 2016
Last change: Wed Nov 9 11:46:15 2016 by root via crm_attribute on overcloud-controller-3

3 nodes and 19 resources configured

Online: [ overcloud-controller-0 overcloud-controller-2 overcloud-controller-3 ]

Full list of resources:

 ip-192.168.0.17 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 ip-10.0.1.10 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2
 ip-10.0.0.147 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-2 overcloud-controller-3 ]
 Master/Slave Set: galera-master [galera]
     galera (ocf::heartbeat:galera): FAILED Master overcloud-controller-0 (blocked)
     galera (ocf::heartbeat:galera): FAILED Master overcloud-controller-2 (blocked)
     Slaves: [ overcloud-controller-3 ]
 ip-10.0.0.19 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 ip-172.16.18.34 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-2 overcloud-controller-3 ]
 ip-10.0.0.18 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-0 ]
     Slaves: [ overcloud-controller-2 overcloud-controller-3 ]
 openstack-cinder-volume (systemd:openstack-cinder-volume): Started overcloud-controller-0

Failed Actions:
* galera_promote_0 on overcloud-controller-0 'unknown error' (1): call=79, status=complete, exitreason='Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.',
    last-rc-change='Wed Nov 9 11:45:43 2016', queued=0ms, exec=249ms
* galera_promote_0 on overcloud-controller-2 'unknown error' (1): call=71, status=complete, exitreason='Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.',
    last-rc-change='Wed Nov 9 11:45:43 2016', queued=0ms, exec=230ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[1] https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/single/director-installation-and-usage/#sect-Replacing_Controller_Nodes

Version-Release number of selected component (if applicable):
mariadb-galera-common-5.5.42-5.el7ost.x86_64
mariadb-libs-5.5.52-1.el7.x86_64
mariadb-5.5.52-1.el7.x86_64
mariadb-galera-server-5.5.42-5.el7ost.x86_64
resource-agents-3.9.5-82.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Follow the controller replacement procedure on OSP10

Actual results:
Galera cannot start on all the nodes in the cluster.

Expected results:
Galera starts on all nodes in the cluster and the replacement procedure can proceed.
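(If the cluster is already stuck in this blocked state, a hedged recovery sketch based on the workaround discussed in the comments above; it assumes the replacement node has already joined corosync and is run on a surviving controller:)

    sudo pcs resource cleanup galera   # clear the failed promote actions and reprobe all nodes
    sudo pcs status                    # galera should then elect a bootstrap node and start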