Bug 1265208

Summary: Cluster state inconsistency during rolling upgrade
Product: Red Hat Enterprise Linux 7
Reporter: Patrik Hagara <phagara>
Component: pacemaker
Assignee: Andrew Beekhof <abeekhof>
Status: CLOSED WONTFIX
QA Contact: cluster-qe <cluster-qe>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 7.1
CC: abeekhof, cfeist, cluster-maint, jkortus, phagara
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-19 02:34:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Description                                         Flags
  crm report                                          none
  rolling upgrade test log                            none
  'pcs status' output from a good node                none
  'pcs status' output from a bad/inconsistent node    none
  crm report with pacemaker-1.1.12-22.el7_1.4         none
  crm report with brew scratch build                  none

Description Patrik Hagara 2015-09-22 11:41:13 UTC
Created attachment 1075783 [details]
crm report

Description of problem:
While testing the rolling upgrade scenario of a pacemaker cluster (starting on RHEL 7.0), the first node upgraded to 7.1 sees a different cluster state after unstandby -- according to the other nodes it has rejoined the cluster, but the node itself still thinks it is in standby mode. The upgrade was performed as described in [1].

[1] https://access.redhat.com/solutions/1297313


Version-Release number of selected component (if applicable):
"old" nodes: pacemaker-1.1.10-32.el7_0.1.x86_64
upgraded node: pacemaker-1.1.12-22.el7.x86_64


How reproducible:
always


Steps to Reproduce:
1. running RHEL-7.0 cluster
2. perform a rolling upgrade of all nodes to 7.0.z (described in [1])
3. cluster running fine, state consistent across nodes
4. upgrade a single node to 7.1 (standby, stop, yum update, reboot, start, unstandby; see the sketch after this list)
5. run "pcs status" on all nodes, note the inconsistency on the upgraded node
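
For reference, step 4 amounts to roughly the following on the node being upgraded (a minimal sketch; host-064 is the node from the attached report and the commands mirror what the test runs via qarsh):

  pcs cluster standby host-064       # move resources off the node
  pcs cluster stop host-064          # stop the cluster stack on it
  yum -y update                      # pull in the newer pacemaker packages
  reboot
  # once the node is back up:
  pcs cluster start host-064
  pcs cluster unstandby host-064     # the inconsistency shows up after this step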


Actual results:
upgraded node has an inconsistent/invalid view of cluster state


Expected results:
all nodes must agree on the current cluster state


Additional info:
beaker job: https://beaker.cluster-qe.lab.eng.brq.redhat.com/bkr/jobs/25164

Comment 1 Patrik Hagara 2015-09-22 11:41:53 UTC
Created attachment 1075784 [details]
rolling upgrade test log

Comment 2 Andrew Beekhof 2015-09-23 22:18:24 UTC
Please explain what "upgraded node has an inconsistent/invalid view of cluster state" means.

Comment 3 Patrik Hagara 2015-09-24 14:28:08 UTC
Created attachment 1076563 [details]
'pcs status' output from a good node

Comment 4 Patrik Hagara 2015-09-24 14:28:40 UTC
Created attachment 1076564 [details]
'pcs status' output from a bad/inconsistent node

Comment 5 Andrew Beekhof 2015-09-24 22:22:40 UTC
Did you do this step:

Confirm that everything appears to be normal, and that the node seems to be a proper member joined with the other members. Check pcs status or other commands as needed to confirm the node's state in the cluster.

Or jump straight from 'pcs cluster start' to 'pcs cluster unstandby' ?
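
In other words, something along these lines between the start and the unstandby (the wait loop here is only illustrative, not part of the documented procedure; the crmadmin check is the one the test already runs):

  pcs cluster start host-064
  # wait until the node shows as Online in 'pcs status' before taking it out of standby
  until pcs status | grep "Online:" | grep -q host-064; do sleep 5; done
  crmadmin -S $(crmadmin -D | rev | cut -d " " -f1 | rev)
  pcs cluster unstandby host-064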

Comment 6 Andrew Beekhof 2015-09-24 23:02:37 UTC
The last qarshd logs are:

Sep 22 03:54:24 host-064 qarshd[31455]: Running cmdline: pcs cluster standby host-064
Sep 22 03:54:24 host-064 qarshd[31473]: Running cmdline: crmadmin -S $(crmadmin -D | rev | cut -d " " -f1 | rev)
Sep 22 03:54:30 host-064 qarshd[31550]: Running cmdline: crmadmin -S $(crmadmin -D | rev | cut -d " " -f1 | rev)
Sep 22 03:54:30 host-064 qarshd[31563]: Running cmdline: pcs cluster stop host-064
Sep 22 03:54:32 host-064 qarshd[31621]: Running cmdline: yum -y update
Sep 22 03:54:38 host-064 qarshd[31683]: Running cmdline: reboot

How did the cluster get out of standby?

Comment 7 Andrew Beekhof 2015-09-24 23:47:34 UTC
Looks like something incorrectly triggered the compatibility code on host-063:

# grep "legacy" /Users/beekhof/Downloads/crm_report/host-06*.virt.lab.msp.redhat.com/pacemaker.log
/Users/beekhof/Downloads/crm_report/host-064.virt.lab.msp.redhat.com/pacemaker.log:Sep 22 04:03:02 [1906] host-064.virt.lab.msp.redhat.com        cib:     info: parse_peer_options_v2: 	Detected legacy cib_apply_diff global update from host-063


and no attempts are made to disable it until another node joins/leaves:

cib/main.c-423-    switch (type) {
cib/main.c-424-        case crm_status_processes:
cib/main.c:425:            if (legacy_mode && is_not_set(node->processes, crm_get_cluster_proc())) {
cib/main.c-426-                uint32_t old = data? *(const uint32_t *)data : 0;
cib/main.c-427-
cib/main.c-428-                if ((node->processes ^ old) & crm_proc_cpg) {
cib/main.c:429:                    crm_info("Attempting to disable legacy mode after %s left the cluster",
cib/main.c-430-                             node->uname);
cib/main.c:431:                    legacy_mode = FALSE;

This meant that 064 delegated:

Sep 22 04:03:47 [1906] host-064.virt.lab.msp.redhat.com        cib:     info: cib_process_request: 	Forwarding cib_delete operation for section nodes to master (origin=local/crm_attribute/4)

to the DC:

Sep 22 04:03:46 [1911] host-063.virt.lab.msp.redhat.com        cib:   notice: cib:diff: 	Diff: --- 0.34.20
Sep 22 04:03:46 [1911] host-063.virt.lab.msp.redhat.com        cib:   notice: cib:diff: 	Diff: +++ 0.35.1 0786e5a242d27c835cdc1e0ba48c6c53
Sep 22 04:03:46 [1911] host-063.virt.lab.msp.redhat.com        cib:   notice: cib:diff: 	--           <nvpair id="nodes-3-standby" name="standby" value="on"/>
Sep 22 04:03:46 [1911] host-063.virt.lab.msp.redhat.com        cib:   notice: cib:diff: 	++ <cib epoch="35" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Tue Sep 22 04:03:46 2015" update-origin="host-064" update-client="crm_attribute" crm_feature_set="3.0.7" have-quorum="1" dc-uuid="2"/>
Sep 22 04:03:46 [1911] host-063.virt.lab.msp.redhat.com        cib:     info: cib_process_request: 	Completed cib_delete operation for section nodes: OK (rc=0, origin=host-064/crm_attribute/4, version=0.35.1)

but never applied it itself. 

It is possible the DC did try to help out, but the diff had other issues:

Sep 22 04:03:47 [1906] host-064.virt.lab.msp.redhat.com        cib:     info: xml_apply_patchset: 	v1 digest mis-match: expected 0786e5a242d27c835cdc1e0ba48c6c53, calculated 22e849322db4bfc87e5d23d566c93883
Sep 22 04:03:47 [1906] host-064.virt.lab.msp.redhat.com        cib:     info: __xml_diff_object: 	Moved node@id (0 -> 1)
Sep 22 04:03:47 [1906] host-064.virt.lab.msp.redhat.com        cib:     info: __xml_diff_object: 	Moved node@uname (1 -> 0)
Sep 22 04:03:47 [1906] host-064.virt.lab.msp.redhat.com        cib:     info: xml_patch_version_check: 	Current num_updates is too high (20 > 1)

Investigating why the compatibility code was triggered on 063
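
(For anyone else reproducing this, a rough way to compare what each node thinks the CIB version is and whether the compatibility path fired; the node names are the ones from this report and the pacemaker log path may differ per setup:)

  for n in host-063 host-064; do
      echo "== $n =="
      qarsh $n 'cibadmin --query | head -n 1'           # prints <cib epoch=... num_updates=... ...>
      qarsh $n 'grep -c legacy /var/log/pacemaker.log'
  done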

Comment 8 Andrew Beekhof 2015-09-25 00:24:13 UTC
Oh, the other two nodes are still on the old cib code.

Some compatibility fixes went into pacemaker-1.1.12-22.el7_1.4, it would be worth testing with that instead.

Comment 9 Patrik Hagara 2015-09-30 13:28:50 UTC
Reproducible even with pacemaker-1.1.12-22.el7_1.4 -- virt-072 (the DC, old pacemaker version) sees all nodes online, both upgraded nodes (73 and 78) see virt-073 in standby.

[root@virt-071 ~]# qarsh virt-072 rpm -q pacemaker
pacemaker-1.1.10-32.el7_0.1.x86_64
[root@virt-071 ~]# qarsh virt-073 rpm -q pacemaker
pacemaker-1.1.12-22.el7_1.4.x86_64
[root@virt-071 ~]# qarsh virt-078 rpm -q pacemaker
pacemaker-1.1.12-22.el7_1.4.x86_64

Comment 10 Patrik Hagara 2015-09-30 13:30:14 UTC
Created attachment 1078641 [details]
crm report with pacemaker-1.1.12-22.el7_1.4

Comment 11 Andrew Beekhof 2015-10-13 02:36:29 UTC
Ok, I managed to reproduce this.  Thank you for persisting.

One of the reasons for the new cib code was problems with the old patch format.
The improvements to the patch handling meant that there were additional conditions under which patches would be rejected.

In the cases I saw, it was caused by either:
- fingerprint mismatches between the actual and expected results, or
- attrd doing some local writes which bumped the local version past the DC's.

The situation would have resolved itself with further cib updates (e.g. resource stop/start operations would have triggered a resync), which is probably why customers haven't reported this.

The following patch basically dials up the paranoia and does a full re-sync if a patchset fails to apply while we are in legacy mode.
Scratch build available at:
   http://brewweb.devel.redhat.com/brew/taskinfo?taskID=9947531

Can you put that through your test and confirm?


diff --git a/cib/messages.c b/cib/messages.c
index eca63b9..ff197cd 100644
--- a/cib/messages.c
+++ b/cib/messages.c
@@ -369,6 +369,7 @@ cib_server_process_diff(const char *op, int options, const char *section, xmlNod
     }
 
     rc = cib_process_diff(op, options, section, req, input, existing_cib, result_cib, answer);
+    crm_trace("result: %s (%d), %s", pcmk_strerror(rc), rc, cib_is_master?"master":"slave");
 
     if (rc == -pcmk_err_diff_resync && cib_is_master == FALSE) {
         free_xml(*result_cib);
@@ -380,6 +381,13 @@ cib_server_process_diff(const char *op, int options, const char *section, xmlNod
         if (options & cib_force_diff) {
             crm_warn("Not requesting full refresh in R/W mode");
         }
+
+    } else if(rc != pcmk_ok && cib_legacy_mode()) {
+        crm_warn("Something went wrong in compatibility mode, requesting full refresh");
+        xml_log_patchset(LOG_INFO, __FUNCTION__, input);
+        free_xml(*result_cib);
+        *result_cib = NULL;
+        send_sync_request(NULL);
     }
 
     return rc;

Comment 12 Patrik Hagara 2015-10-15 09:04:16 UTC
Created attachment 1083198 [details]
crm report with brew scratch build

Comment 13 Patrik Hagara 2015-10-15 09:13:48 UTC
Looks like a full resync was triggered, but my test still failed while trying to move a resource to a specific node (virt-073 and virt-078 are running pacemaker from brew, virt-079 is on the RHEL 7.0.z pacemaker version):

> 15.10.2015 10:52:49 INFO GenericTest.run: running move
> 15.10.2015 10:52:49 INFO GenericTest.run: -----------
> 15.10.2015 10:52:49 DEBUG clusterstack.loggedcall: Calling `<function nodes_by_resource_role at 0xc0a578>` with args=(<QAlib.clusterstack.PacemakerClusterStack object at 0xe16890>, 'www'), kwargs={'role': 'Started'}
> 15.10.2015 10:52:50 DEBUG clusterstack.loggedcall: `<function nodes_by_resource_role at 0xc0a578>` took 0.93s to return: ['virt-073.cluster-qe.lab.eng.brq.redhat.com']
> 15.10.2015 10:52:50 DEBUG clusterstack.loggedcall: Calling `<function resource_move at 0xc0a938>` with args=(<QAlib.clusterstack.PacemakerClusterStack object at 0xe16890>, 'www'), kwargs={'moveto': 'virt-078.cluster-qe.lab.eng.brq.redhat.com', 'timeout': 60}
> 15.10.2015 10:52:50 DEBUG clusterstack.loggedcall: Calling `<function nodes_by_resource_role at 0xc0a578>` with args=(<QAlib.clusterstack.PacemakerClusterStack object at 0xe16890>, 'www'), kwargs={'role': 'Started'}
> 15.10.2015 10:52:51 DEBUG clusterstack.loggedcall: `<function nodes_by_resource_role at 0xc0a578>` took 0.98s to return: ['virt-073.cluster-qe.lab.eng.brq.redhat.com']
> 15.10.2015 10:52:52 ERROR clusterstack.resource_move: move (by starting) failed:
> Traceback (most recent call last):
>   File "/usr/tests/sts-rhel7.0/lib/QAlib/clusterstack.py", line 580, in resource_move
>     "pcs resource move %s %s" % (resource, moveto))
>   File "/usr/tests/sts-rhel7.0/lib/QAlib/cmd.py", line 47, in qarsh
>     % (args, ret.retcode, ret.out, ret.err))
> QarshError: qarsh command `['qarsh', 'root.lab.eng.brq.redhat.com', 'pcs resource move www virt-078.cluster-qe.lab.eng.brq.redhat.com']` exited 1: STDOUT=``; STDERR=`Error: error moving/banning/clearing resource
> Error performing operation: Update was older than existing configuration
> 
> `
> 15.10.2015 10:52:52 DEBUG clusterstack.loggedcall: `<function resource_move at 0xc0a938>` raised an exception after 1.46s
> Traceback (most recent call last):
>   File "/usr/tests/sts-rhel7.0/lib/QAlib/clusterstack.py", line 51, in loggedcall
>     ret = func(*args, **kargs)
>   File "/usr/tests/sts-rhel7.0/lib/QAlib/clusterstack.py", line 583, in resource_move
>     raise ClusterStackError("Failed to start (move) resource"
> ClusterStackError: Failed to start (move) resource (see above for details)
> 15.10.2015 10:52:52 ERROR GenericTest.run: Failure on move.
> 15.10.2015 10:52:52 ERROR pacemaker-test.<module>: Test RollerApache failed.

Comment 14 Patrik Hagara 2015-10-15 09:28:19 UTC
> Oct 15 10:52:51 virt-073 qarshd[14745]: Running cmdline: pcs resource move www virt-078.cluster-qe.lab.eng.brq.redhat.com
> Oct 15 10:52:52 virt-073 cib[1467]: warning: Something went wrong in compatibility mode, requesting full refresh
> Oct 15 10:52:52 virt-073 cib[1467]: error: crm_xml_replace: Triggered assert at xml.c:2739 : old_value != value
> Oct 15 10:52:52 virt-073 attrd[1470]: notice: Updating all attributes after cib_refresh_notify event
> Oct 15 10:52:52 virt-073 apache(www)[14749]: INFO: Attempting graceful stop of apache PID 14361
> Oct 15 10:52:53 virt-073 apache(www)[14749]: INFO: apache stopped.
> Oct 15 10:52:53 virt-073 crmd[1472]: notice: Operation www_stop_0: ok (node=virt-073.cluster-qe.lab.eng.brq.redhat.com, call=27, rc=0, cib-update=23, confirmed=true)
> Oct 15 10:52:53 virt-073 cib[1467]: warning: Something went wrong in compatibility mode, requesting full refresh
> Oct 15 10:52:53 virt-073 crmd[1472]: warning: Resource update 23 failed: (rc=-205) Update was older than existing configuration
> Oct 15 10:52:54 virt-073 cib[1467]: notice: Not applying diff 0.0.2 -> 0.46.3 (sync in progress)
> Oct 15 10:52:54 virt-073 cib[1467]: notice: Not applying diff 0.0.3 -> 0.46.4 (sync in progress)

Comment 15 Patrik Hagara 2015-10-15 12:02:45 UTC
And on the node running the old pacemaker version:

> Oct 15 10:52:52 virt-079 cib[1615]: notice: cib:diff: Diff: --- 0.45.3
> Oct 15 10:52:52 virt-079 cib[1615]: notice: cib:diff: Diff: +++ 0.46.1 0a6c16f96e84ac649376aebb06869aaf
> Oct 15 10:52:52 virt-079 cib[1615]: notice: cib:diff: --       <rsc_location node="virt-073.cluster-qe.lab.eng.brq.redhat.com" id="cli-prefer-www"/>
> Oct 15 10:52:52 virt-079 cib[1615]: notice: cib:diff: ++       <rsc_location id="cli-prefer-www" rsc="www" role="Started" node="virt-078.cluster-qe.lab.eng.brq.redhat.com" score="INFINITY"/>
> Oct 15 10:52:52 virt-079 crmd[1620]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Oct 15 10:52:52 virt-079 pengine[1619]: notice: LogActions: Move    www (Started virt-073.cluster-qe.lab.eng.brq.redhat.com -> virt-078.cluster-qe.lab.eng.brq.redhat.com)
> Oct 15 10:52:52 virt-079 pengine[1619]: notice: process_pe_message: Calculated Transition 26: /var/lib/pacemaker/pengine/pe-input-26.bz2
> Oct 15 10:52:52 virt-079 crmd[1620]: notice: te_rsc_command: Initiating action 16: stop www_stop_0 on virt-073.cluster-qe.lab.eng.brq.redhat.com
> Oct 15 10:52:52 virt-079 stonith-ng[1616]: notice: stonith_device_register: Added 'fence-virt-073' to the device list (3 active devices)
> Oct 15 10:52:52 virt-079 stonith-ng[1616]: notice: stonith_device_register: Device 'fence-virt-078' already existed in device list (3 active devices)
> Oct 15 10:52:52 virt-079 stonith-ng[1616]: notice: stonith_device_register: Added 'fence-virt-079' to the device list (3 active devices)
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update   <diff format="2">
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update     <version>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update       <source admin_epoch="0" epoch="46" num_updates="1"/>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update       <target admin_epoch="0" epoch="46" num_updates="2"/>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update     </version>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update     <change operation="modify" path="/cib">
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update       <change-list>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update         <change-attr name="num_updates" operation="set" value="2"/>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update       </change-list>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update       <change-result>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update         <cib epoch="46" num_updates="2" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Thu Oct 15 10:52:52 2015" update-origin="virt-079.cluster-qe.lab.eng.brq.redhat.com" update-client="virt-078.cluster-qe.lab.eng.brq.redhat.com" crm_feature_set="3.0.7" have-quorum="1" dc-uuid="3"/>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update       </change-result>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update     </change>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update   </diff>
> Oct 15 10:52:52 virt-079 cib[1615]: error: cib_process_diff: Diff -1.-1.-1 -> -1.-1.-1 from virt-078.cluster-qe.lab.eng.brq.redhat.com not applied to 0.46.1: + and - versions in the diff did not change in global update
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update   <diff format="2">
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update     <version>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update       <source admin_epoch="0" epoch="46" num_updates="2"/>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update       <target admin_epoch="0" epoch="46" num_updates="3"/>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update     </version>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update     <change operation="modify" path="/cib">
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update       <change-list>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update         <change-attr name="num_updates" operation="set" value="3"/>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update       </change-list>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update       <change-result>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update         <cib epoch="46" num_updates="3" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Thu Oct 15 10:52:52 2015" update-origin="virt-079.cluster-qe.lab.eng.brq.redhat.com" update-client="virt-078.cluster-qe.lab.eng.brq.redhat.com" crm_feature_set="3.0.7" have-quorum="1" dc-uuid="3"/>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update       </change-result>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update     </change>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update   </diff>
> Oct 15 10:52:52 virt-079 cib[1615]: error: cib_process_diff: Diff -1.-1.-1 -> -1.-1.-1 from virt-078.cluster-qe.lab.eng.brq.redhat.com not applied to 0.46.1: + and - versions in the diff did not change in global update
> Oct 15 10:52:53 virt-079 crmd[1620]: notice: te_rsc_command: Initiating action 17: start www_start_0 on virt-078.cluster-qe.lab.eng.brq.redhat.com
> Oct 15 10:52:54 virt-079 crmd[1620]: notice: te_rsc_command: Initiating action 18: monitor www_monitor_30000 on virt-078.cluster-qe.lab.eng.brq.redhat.com
> Oct 15 10:52:54 virt-079 crmd[1620]: notice: run_graph: Transition 26 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-26.bz2): Complete
> Oct 15 10:52:54 virt-079 crmd[1620]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]