Created attachment 1143148 [details]
crm report

Description of problem:

[root@airfrance-2 ~]# crm_simulate -SL

Current cluster status:
Node airfrance-1 (1): UNCLEAN (offline)
Online: [ airfrance-2 ]
RemoteOnline: [ airfrance-3 ]

 airfrance-3 (ocf::pacemaker:remote): Started airfrance-1 (UNCLEAN)
 killer (stonith:fence_xvm): Started airfrance-2
 test (ocf::heartbeat:Dummy): Started airfrance-3
 Master/Slave Set: galera-master [galera]
     galera (ocf::pacemaker:Stateful): Master airfrance-1 (UNCLEAN)
     Masters: [ airfrance-2 ]
     Stopped: [ airfrance-3 ]

Transition Summary:
 * Stop airfrance-3 (airfrance-1)
 * Stop test (airfrance-3 - blocked)
 * Demote galera:0 (Master -> Stopped airfrance-1)

Executing cluster transition:
 * Fencing airfrance-1 (reboot)
 * Pseudo action: stonith_complete
 * Pseudo action: airfrance-3_stop_0
Transition failed: terminated
An invalid transition was produced
[...]

[root@airfrance-2 ~]# pcs resource cleanup test
Waiting for 1 replies from the CRMd. OK
Cleaning up test on airfrance-2, removing fail-count-test
Cleaning up test on airfrance-3, removing fail-count-test

[root@airfrance-2 ~]# crm_simulate -SL

Current cluster status:
Node airfrance-1 (1): UNCLEAN (offline)
Online: [ airfrance-2 ]
RemoteOnline: [ airfrance-3 ]

 airfrance-3 (ocf::pacemaker:remote): Started airfrance-1 (UNCLEAN)
 killer (stonith:fence_xvm): Started airfrance-2
 test (ocf::heartbeat:Dummy): Started airfrance-3
 Master/Slave Set: galera-master [galera]
     galera (ocf::pacemaker:Stateful): Master airfrance-1 (UNCLEAN)
     Masters: [ airfrance-2 ]
     Stopped: [ airfrance-3 ]

Transition Summary:
 * Stop airfrance-3 (airfrance-1)
 * Stop test (airfrance-3 - blocked)
 * Demote galera:0 (Master -> Stopped airfrance-1)

Executing cluster transition:
 * Fencing airfrance-1 (reboot)
 * Pseudo action: stonith_complete
 * Pseudo action: airfrance-3_stop_0
Transition failed: terminated
An invalid transition was produced
[...]

[root@airfrance-2 ~]# pcs resource cleanup airfrance-3
Waiting for 1 replies from the CRMd. OK
Cleaning up airfrance-3 on airfrance-2, removing fail-count-airfrance-3
Cleaning up airfrance-3 on airfrance-3, removing fail-count-airfrance-3

[root@airfrance-2 ~]# crm_simulate -SL

Current cluster status:
Node airfrance-1 (1): UNCLEAN (offline)
Online: [ airfrance-2 ]
RemoteOnline: [ airfrance-3 ]

 airfrance-3 (ocf::pacemaker:remote): Started airfrance-1 (UNCLEAN)
 killer (stonith:fence_xvm): Started airfrance-2
 test (ocf::heartbeat:Dummy): Started airfrance-3
 Master/Slave Set: galera-master [galera]
     galera (ocf::pacemaker:Stateful): Master airfrance-1 (UNCLEAN)
     Masters: [ airfrance-2 ]
     Stopped: [ airfrance-3 ]

Transition Summary:
 * Stop airfrance-3 (airfrance-1)
 * Stop test (airfrance-3 - blocked)
 * Demote galera:0 (Master -> Stopped airfrance-1)

Executing cluster transition:
 * Fencing airfrance-1 (reboot)
 * Pseudo action: stonith_complete
 * Pseudo action: airfrance-3_stop_0
Transition failed: terminated
An invalid transition was produced
[...]

[root@airfrance-2 ~]# pcs stonith confirm airfrance-0
Node: airfrance-0 confirmed fenced

[root@airfrance-2 ~]# crm_simulate -SL

Current cluster status:
Node airfrance-1 (1): UNCLEAN (offline)
Online: [ airfrance-2 ]
RemoteOnline: [ airfrance-3 ]

 airfrance-3 (ocf::pacemaker:remote): Started airfrance-1 (UNCLEAN)
 killer (stonith:fence_xvm): Started airfrance-2
 test (ocf::heartbeat:Dummy): Started airfrance-3
 Master/Slave Set: galera-master [galera]
     galera (ocf::pacemaker:Stateful): Master airfrance-1 (UNCLEAN)
     Masters: [ airfrance-2 ]
     Stopped: [ airfrance-3 ]

Transition Summary:
 * Stop airfrance-3 (airfrance-1)
 * Stop test (airfrance-3 - blocked)
 * Demote galera:0 (Master -> Stopped airfrance-1)

Executing cluster transition:
 * Fencing airfrance-1 (reboot)
 * Pseudo action: stonith_complete
 * Pseudo action: airfrance-3_stop_0
Transition failed: terminated
An invalid transition was produced
[...]

Version-Release number of selected component (if applicable):
Pacemaker 1.1.13-10.el7_2.2

How reproducible:
trivial

Steps to Reproduce:
1. 2 nodes + 1 remote
2. constraints as per crm report
3. virsh destroy node 1

Actual results:
cluster fully locked up because the stop action for the resource inside the remote node cannot run
- cleanup of the resource in the remote node: no effect
- cleanup of the remote node resource: no effect
- confirm stonith action: no effect

Expected results:
even if the transition can't complete (which it should!), at least one of the cleanup actions should unblock the cluster

Additional info:
see attachment (if crm_report ever completes due to bug #1323544)

A combination of factors was at play here:

1. Fencing was temporarily busted (fence_xvm returned no hosts)
2. If you look at the transcript above, I confirmed "airfrance-0", which doesn't exist, instead of airfrance-1
3. The constraint:

<rsc_location id="cli-ban-airfrance-3-on-airfrance-2" rsc="airfrance-3" role="Started" node="airfrance-2" score="-INFINITY"/>

prevented the connection resource from being started on the remaining node, which tripped up the "I don't really need to stop resources on the remote node" logic. That logic needs to be fixed, but the fix can be done at a lower priority.

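For anyone triaging a similar lockup, a leftover ban like the one in item 3 can be spotted by scanning the CIB XML for -INFINITY location constraints on the connection resource. The following is only a rough sketch: the function name banned_nodes and the SAMPLE fragment are made up for illustration (the constraint itself is copied from this comment), and it only looks at simple node/score constraints, not rule-based ones.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET


def banned_nodes(cib_xml: str, resource: str) -> list[str]:
    """Return the nodes on which `resource` has a -INFINITY location constraint.

    Hypothetical helper: only plain node/score rsc_location entries are
    considered; constraints expressed through <rule> children are ignored.
    """
    root = ET.fromstring(cib_xml)
    nodes = []
    for loc in root.iter("rsc_location"):
        if loc.get("rsc") == resource and loc.get("score") == "-INFINITY":
            node = loc.get("node")
            if node is not None:
                nodes.append(node)
    return nodes


# Illustrative fragment; a real CIB wraps this in <cib><configuration>...
SAMPLE = """
<constraints>
  <rsc_location id="cli-ban-airfrance-3-on-airfrance-2" rsc="airfrance-3"
                role="Started" node="airfrance-2" score="-INFINITY"/>
</constraints>
"""

print(banned_nodes(SAMPLE, "airfrance-3"))  # ['airfrance-2']
```

If the only surviving cluster node shows up in that list, the connection resource has nowhere to go, which matches the situation described above.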
Capacity constrained, moving to 7.4
Unfortunately, this is unlikely to be addressed in the 7.4 timeframe
The invalid transition can be reproduced with upstream versions through 1.1.16, but no longer occurs with upstream 1.1.17 and later, so it was fixed somewhere along the way. I'll add an anonymized regression test for it.

QA: To test, grab cib.xml.live from the attached crm_report and run "crm_simulate -Sx cib.xml.live". Before the fix, it outputs "Invalid transition", while after the fix, the simulation completes successfully.

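The QA check above could be scripted along these lines. This is a sketch, not part of any Pacemaker test suite: the helper names are invented here, the match is a case-insensitive search for "invalid transition" (the transcript in the description prints "An invalid transition was produced"), and stdout and stderr are both searched because it is an assumption which stream carries the message.

```python
import subprocess

# Substring to look for, lowercased; taken from the transcript in this report.
INVALID_MARKER = "invalid transition"


def has_invalid_transition(output: str) -> bool:
    """Return True if crm_simulate output reports an invalid transition."""
    return INVALID_MARKER in output.lower()


def simulate(cib_path: str) -> str:
    """Run `crm_simulate -S -x <cib_path>` and return stdout plus stderr.

    Requires the Pacemaker command-line tools to be installed.
    """
    result = subprocess.run(
        ["crm_simulate", "-S", "-x", cib_path],
        capture_output=True,
        text=True,
        check=False,  # inspect the output ourselves rather than raise
    )
    return result.stdout + result.stderr


# Usage on a machine with Pacemaker installed:
#   if has_invalid_transition(simulate("cib.xml.live")):
#       print("bug reproduced")
```

Keeping the text check separate from the subprocess call means the check can be exercised against saved output without a cluster.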
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:0860