Bug 1265208
Summary:          Cluster state inconsistency during rolling upgrade

Product:          Red Hat Enterprise Linux 7
Component:        pacemaker
Version:          7.1
Status:           CLOSED WONTFIX
Severity:         unspecified
Priority:         unspecified
Target Milestone: rc
Target Release:   ---
Hardware:         Unspecified
OS:               Unspecified
Reporter:         Patrik Hagara <phagara>
Assignee:         Andrew Beekhof <abeekhof>
QA Contact:       cluster-qe <cluster-qe>
CC:               abeekhof, cfeist, cluster-maint, jkortus, phagara
Doc Type:         Bug Fix
Type:             Bug
Last Closed:      2016-02-19 02:34:47 UTC
Description
Patrik Hagara
2015-09-22 11:41:13 UTC
Created attachment 1075784 [details]
rolling upgrade test log
Please explain what "upgraded node has an inconsistent/invalid view of cluster state" means.

Created attachment 1076563 [details]
'pcs status' output from a good node
Created attachment 1076564 [details]
'pcs status' output from a bad/inconsistent node
Did you do this step: "Confirm that everything appears to be normal, and that the node seems to be a proper member joined with the other members. Check pcs status or other commands as needed to confirm the node's state in the cluster." Or did you jump straight from 'pcs cluster start' to 'pcs cluster unstandby'?

The last qarshd logs are:

Sep 22 03:54:24 host-064 qarshd[31455]: Running cmdline: pcs cluster standby host-064
Sep 22 03:54:24 host-064 qarshd[31473]: Running cmdline: crmadmin -S $(crmadmin -D | rev | cut -d " " -f1 | rev)
Sep 22 03:54:30 host-064 qarshd[31550]: Running cmdline: crmadmin -S $(crmadmin -D | rev | cut -d " " -f1 | rev)
Sep 22 03:54:30 host-064 qarshd[31563]: Running cmdline: pcs cluster stop host-064
Sep 22 03:54:32 host-064 qarshd[31621]: Running cmdline: yum -y update
Sep 22 03:54:38 host-064 qarshd[31683]: Running cmdline: reboot

How did the cluster get out of standby?

Looks like something incorrectly triggered the compatibility code on host-063:

# grep "legacy" /Users/beekhof/Downloads/crm_report/host-06*.virt.lab.msp.redhat.com/pacemaker.log
/Users/beekhof/Downloads/crm_report/host-064.virt.lab.msp.redhat.com/pacemaker.log:Sep 22 04:03:02 [1906] host-064.virt.lab.msp.redhat.com cib: info: parse_peer_options_v2: Detected legacy cib_apply_diff global update from host-063

and no attempts are made to disable it until another node joins/leaves:

cib/main.c-423-    switch (type) {
cib/main.c-424-        case crm_status_processes:
cib/main.c:425:            if (legacy_mode && is_not_set(node->processes, crm_get_cluster_proc())) {
cib/main.c-426-                uint32_t old = data? *(const uint32_t *)data : 0;
cib/main.c-427-
cib/main.c-428-                if ((node->processes ^ old) & crm_proc_cpg) {
cib/main.c:429:                    crm_info("Attempting to disable legacy mode after %s left the cluster",
cib/main.c-430-                             node->uname);
cib/main.c:431:                    legacy_mode = FALSE;

This meant that 064 delegated:

Sep 22 04:03:47 [1906] host-064.virt.lab.msp.redhat.com cib: info: cib_process_request: Forwarding cib_delete operation for section nodes to master (origin=local/crm_attribute/4)

to the DC:

Sep 22 04:03:46 [1911] host-063.virt.lab.msp.redhat.com cib: notice: cib:diff: Diff: --- 0.34.20
Sep 22 04:03:46 [1911] host-063.virt.lab.msp.redhat.com cib: notice: cib:diff: Diff: +++ 0.35.1 0786e5a242d27c835cdc1e0ba48c6c53
Sep 22 04:03:46 [1911] host-063.virt.lab.msp.redhat.com cib: notice: cib:diff: -- <nvpair id="nodes-3-standby" name="standby" value="on"/>
Sep 22 04:03:46 [1911] host-063.virt.lab.msp.redhat.com cib: notice: cib:diff: ++ <cib epoch="35" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Tue Sep 22 04:03:46 2015" update-origin="host-064" update-client="crm_attribute" crm_feature_set="3.0.7" have-quorum="1" dc-uuid="2"/>
Sep 22 04:03:46 [1911] host-063.virt.lab.msp.redhat.com cib: info: cib_process_request: Completed cib_delete operation for section nodes: OK (rc=0, origin=host-064/crm_attribute/4, version=0.35.1)

but never applied it itself.
It is possible the DC did try to help out, but the diff had other issues:

Sep 22 04:03:47 [1906] host-064.virt.lab.msp.redhat.com cib: info: xml_apply_patchset: v1 digest mis-match: expected 0786e5a242d27c835cdc1e0ba48c6c53, calculated 22e849322db4bfc87e5d23d566c93883
Sep 22 04:03:47 [1906] host-064.virt.lab.msp.redhat.com cib: info: __xml_diff_object: Moved node@id (0 -> 1)
Sep 22 04:03:47 [1906] host-064.virt.lab.msp.redhat.com cib: info: __xml_diff_object: Moved node@uname (1 -> 0)
Sep 22 04:03:47 [1906] host-064.virt.lab.msp.redhat.com cib: info: xml_patch_version_check: Current num_updates is too high (20 > 1)

Investigating why the compatibility code was triggered on 063.

Oh, the other two nodes are still on the old cib code. Some compatibility fixes went into pacemaker-1.1.12-22.el7_1.4; it would be worth testing with that instead.

Reproducible even with pacemaker-1.1.12-22.el7_1.4 -- virt-072 (the DC, old pacemaker version) sees all nodes online, while both upgraded nodes (73 and 78) see virt-073 in standby.

[root@virt-071 ~]# qarsh virt-072 rpm -q pacemaker
pacemaker-1.1.10-32.el7_0.1.x86_64
[root@virt-071 ~]# qarsh virt-073 rpm -q pacemaker
pacemaker-1.1.12-22.el7_1.4.x86_64
[root@virt-071 ~]# qarsh virt-078 rpm -q pacemaker
pacemaker-1.1.12-22.el7_1.4.x86_64

Created attachment 1078641 [details]
crm report with pacemaker-1.1.12-22.el7_1.4
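The two rejections above ("v1 digest mis-match" and "Current num_updates is too high") both come down to how a CIB patch is validated against the local copy's version triple. As a rough illustration only (not pacemaker's actual code; the real check lives in xml_patch_version_check()), the precedence of (admin_epoch, epoch, num_updates) can be sketched like this:

```python
def cib_version_cmp(a, b):
    """Compare two (admin_epoch, epoch, num_updates) triples.

    Returns -1, 0, or 1. admin_epoch dominates epoch, which dominates
    num_updates -- so 0.35.1 is *newer* than 0.34.20 despite the
    smaller final component.
    """
    for x, y in zip(a, b):
        if x != y:
            return -1 if x < y else 1
    return 0


def patch_applies(current, patch_source):
    """Illustrative rule: a patch built against `patch_source` only
    applies cleanly when the local CIB is exactly at that version.
    If the local copy has moved ahead (e.g. attrd bumped num_updates
    with a local write), the patch is rejected, which matches the
    "Current num_updates is too high (20 > 1)" log line above."""
    return cib_version_cmp(current, patch_source) == 0
```

Under this model, a node whose local CIB sits at 0.35.20 rejects a patch generated against 0.35.1, even though both are newer than the 0.34.x the diff was ultimately derived from.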
Ok, I managed to reproduce this. Thank you for persisting.

One of the reasons for the new CIB diff format was problems with the old patch format. The improvements to the patch handling meant that there were additional conditions under which patches would be rejected. In the cases I saw, rejection was caused by either:

- fingerprint mismatches between the actual and expected results, or
- attrd doing some local writes which bumped the local version past the DC's.

The situation would have resolved itself with further cib updates (e.g. resource stop/start operations would have triggered a resync), which is probably why customers haven't reported this.

The following patch basically dials up the paranoia and does a full re-sync if a patchset fails to apply while we are in legacy mode. Scratch build available at: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=9947531

Can you put that through your test and confirm?

diff --git a/cib/messages.c b/cib/messages.c
index eca63b9..ff197cd 100644
--- a/cib/messages.c
+++ b/cib/messages.c
@@ -369,6 +369,7 @@ cib_server_process_diff(const char *op, int options, const char *section, xmlNod
     }
     rc = cib_process_diff(op, options, section, req, input, existing_cib, result_cib, answer);
+    crm_trace("result: %s (%d), %s", pcmk_strerror(rc), rc, cib_is_master?"master":"slave");
     if (rc == -pcmk_err_diff_resync && cib_is_master == FALSE) {
         free_xml(*result_cib);
@@ -380,6 +381,13 @@ cib_server_process_diff(const char *op, int options, const char *section, xmlNod
         if (options & cib_force_diff) {
             crm_warn("Not requesting full refresh in R/W mode");
         }
+
+    } else if(rc != pcmk_ok && cib_legacy_mode()) {
+        crm_warn("Something went wrong in compatibility mode, requesting full refresh");
+        xml_log_patchset(LOG_INFO, __FUNCTION__, input);
+        free_xml(*result_cib);
+        *result_cib = NULL;
+        send_sync_request(NULL);
     }
     return rc;

Created attachment 1083198 [details]
crm report with brew scratch build
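The control flow the scratch-build patch adds can be modelled roughly as follows. This is an illustrative Python sketch under stated assumptions, not the C implementation: `apply_diff` stands in for cib_process_diff(), `request_full_sync` for send_sync_request(), and -205 for -pcmk_err_diff_resync.

```python
PCMK_OK = 0
ERR_DIFF_RESYNC = -205  # stand-in for -pcmk_err_diff_resync


def process_diff(apply_diff, is_master, legacy_mode, request_full_sync):
    """Sketch of the patched cib_server_process_diff() logic: if the
    diff fails to apply for *any* reason while in legacy/compatibility
    mode, discard the partial result and request a full CIB copy from
    a peer, rather than leaving this node silently out of sync."""
    rc = apply_diff()
    if rc == ERR_DIFF_RESYNC and not is_master:
        # pre-existing path: explicit resync request from the diff code
        request_full_sync()
    elif rc != PCMK_OK and legacy_mode:
        # new fallback added by the patch: "dial up the paranoia"
        request_full_sync()
    return rc
```

The design point is that a full resync is expensive but always safe, whereas a silently rejected patch leaves the node's view of the cluster diverging until some later update happens to trigger a resync.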
Looks like a full resync was triggered, but my test still failed while trying to move a resource to a specific node (virt-073 and virt-078 are running pacemaker from brew, virt-079 is on the rhel-7.0.z pacemaker version):
> 15.10.2015 10:52:49 INFO GenericTest.run: running move
> 15.10.2015 10:52:49 INFO GenericTest.run: -----------
> 15.10.2015 10:52:49 DEBUG clusterstack.loggedcall: Calling `<function nodes_by_resource_role at 0xc0a578>` with args=(<QAlib.clusterstack.PacemakerClusterStack object at 0xe16890>, 'www'), kwargs={'role': 'Started'}
> 15.10.2015 10:52:50 DEBUG clusterstack.loggedcall: `<function nodes_by_resource_role at 0xc0a578>` took 0.93s to return: ['virt-073.cluster-qe.lab.eng.brq.redhat.com']
> 15.10.2015 10:52:50 DEBUG clusterstack.loggedcall: Calling `<function resource_move at 0xc0a938>` with args=(<QAlib.clusterstack.PacemakerClusterStack object at 0xe16890>, 'www'), kwargs={'moveto': 'virt-078.cluster-qe.lab.eng.brq.redhat.com', 'timeout': 60}
> 15.10.2015 10:52:50 DEBUG clusterstack.loggedcall: Calling `<function nodes_by_resource_role at 0xc0a578>` with args=(<QAlib.clusterstack.PacemakerClusterStack object at 0xe16890>, 'www'), kwargs={'role': 'Started'}
> 15.10.2015 10:52:51 DEBUG clusterstack.loggedcall: `<function nodes_by_resource_role at 0xc0a578>` took 0.98s to return: ['virt-073.cluster-qe.lab.eng.brq.redhat.com']
> 15.10.2015 10:52:52 ERROR clusterstack.resource_move: move (by starting) failed:
> Traceback (most recent call last):
> File "/usr/tests/sts-rhel7.0/lib/QAlib/clusterstack.py", line 580, in resource_move
> "pcs resource move %s %s" % (resource, moveto))
> File "/usr/tests/sts-rhel7.0/lib/QAlib/cmd.py", line 47, in qarsh
> % (args, ret.retcode, ret.out, ret.err))
> QarshError: qarsh command `['qarsh', 'root.lab.eng.brq.redhat.com', 'pcs resource move www virt-078.cluster-qe.lab.eng.brq.redhat.com']` exited 1: STDOUT=``; STDERR=`Error: error moving/banning/clearing resource
> Error performing operation: Update was older than existing configuration
>
> `
> 15.10.2015 10:52:52 DEBUG clusterstack.loggedcall: `<function resource_move at 0xc0a938>` raised an exception after 1.46s
> Traceback (most recent call last):
> File "/usr/tests/sts-rhel7.0/lib/QAlib/clusterstack.py", line 51, in loggedcall
> ret = func(*args, **kargs)
> File "/usr/tests/sts-rhel7.0/lib/QAlib/clusterstack.py", line 583, in resource_move
> raise ClusterStackError("Failed to start (move) resource"
> ClusterStackError: Failed to start (move) resource (see above for details)
> 15.10.2015 10:52:52 ERROR GenericTest.run: Failure on move.
> 15.10.2015 10:52:52 ERROR pacemaker-test.<module>: Test RollerApache failed.
> Oct 15 10:52:51 virt-073 qarshd[14745]: Running cmdline: pcs resource move www virt-078.cluster-qe.lab.eng.brq.redhat.com
> Oct 15 10:52:52 virt-073 cib[1467]: warning: Something went wrong in compatibility mode, requesting full refresh
> Oct 15 10:52:52 virt-073 cib[1467]: error: crm_xml_replace: Triggered assert at xml.c:2739 : old_value != value
> Oct 15 10:52:52 virt-073 attrd[1470]: notice: Updating all attributes after cib_refresh_notify event
> Oct 15 10:52:52 virt-073 apache(www)[14749]: INFO: Attempting graceful stop of apache PID 14361
> Oct 15 10:52:53 virt-073 apache(www)[14749]: INFO: apache stopped.
> Oct 15 10:52:53 virt-073 crmd[1472]: notice: Operation www_stop_0: ok (node=virt-073.cluster-qe.lab.eng.brq.redhat.com, call=27, rc=0, cib-update=23, confirmed=true)
> Oct 15 10:52:53 virt-073 cib[1467]: warning: Something went wrong in compatibility mode, requesting full refresh
> Oct 15 10:52:53 virt-073 crmd[1472]: warning: Resource update 23 failed: (rc=-205) Update was older than existing configuration
> Oct 15 10:52:54 virt-073 cib[1467]: notice: Not applying diff 0.0.2 -> 0.46.3 (sync in progress)
> Oct 15 10:52:54 virt-073 cib[1467]: notice: Not applying diff 0.0.3 -> 0.46.4 (sync in progress)
and on the node running old pacemaker:
> Oct 15 10:52:52 virt-079 cib[1615]: notice: cib:diff: Diff: --- 0.45.3
> Oct 15 10:52:52 virt-079 cib[1615]: notice: cib:diff: Diff: +++ 0.46.1 0a6c16f96e84ac649376aebb06869aaf
> Oct 15 10:52:52 virt-079 cib[1615]: notice: cib:diff: -- <rsc_location node="virt-073.cluster-qe.lab.eng.brq.redhat.com" id="cli-prefer-www"/>
> Oct 15 10:52:52 virt-079 cib[1615]: notice: cib:diff: ++ <rsc_location id="cli-prefer-www" rsc="www" role="Started" node="virt-078.cluster-qe.lab.eng.brq.redhat.com" score="INFINITY"/>
> Oct 15 10:52:52 virt-079 crmd[1620]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Oct 15 10:52:52 virt-079 pengine[1619]: notice: LogActions: Move www (Started virt-073.cluster-qe.lab.eng.brq.redhat.com -> virt-078.cluster-qe.lab.eng.brq.redhat.com)
> Oct 15 10:52:52 virt-079 pengine[1619]: notice: process_pe_message: Calculated Transition 26: /var/lib/pacemaker/pengine/pe-input-26.bz2
> Oct 15 10:52:52 virt-079 crmd[1620]: notice: te_rsc_command: Initiating action 16: stop www_stop_0 on virt-073.cluster-qe.lab.eng.brq.redhat.com
> Oct 15 10:52:52 virt-079 stonith-ng[1616]: notice: stonith_device_register: Added 'fence-virt-073' to the device list (3 active devices)
> Oct 15 10:52:52 virt-079 stonith-ng[1616]: notice: stonith_device_register: Device 'fence-virt-078' already existed in device list (3 active devices)
> Oct 15 10:52:52 virt-079 stonith-ng[1616]: notice: stonith_device_register: Added 'fence-virt-079' to the device list (3 active devices)
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update <diff format="2">
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update <version>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update <source admin_epoch="0" epoch="46" num_updates="1"/>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update <target admin_epoch="0" epoch="46" num_updates="2"/>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update </version>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update <change operation="modify" path="/cib">
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update <change-list>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update <change-attr name="num_updates" operation="set" value="2"/>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update </change-list>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update <change-result>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update <cib epoch="46" num_updates="2" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Thu Oct 15 10:52:52 2015" update-origin="virt-079.cluster-qe.lab.eng.brq.redhat.com" update-client="virt-078.cluster-qe.lab.eng.brq.redhat.com" crm_feature_set="3.0.7" have-quorum="1" dc-uuid="3"/>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update </change-result>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update </change>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update </diff>
> Oct 15 10:52:52 virt-079 cib[1615]: error: cib_process_diff: Diff -1.-1.-1 -> -1.-1.-1 from virt-078.cluster-qe.lab.eng.brq.redhat.com not applied to 0.46.1: + and - versions in the diff did not change in global update
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update <diff format="2">
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update <version>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update <source admin_epoch="0" epoch="46" num_updates="2"/>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update <target admin_epoch="0" epoch="46" num_updates="3"/>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update </version>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update <change operation="modify" path="/cib">
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update <change-list>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update <change-attr name="num_updates" operation="set" value="3"/>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update </change-list>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update <change-result>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update <cib epoch="46" num_updates="3" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Thu Oct 15 10:52:52 2015" update-origin="virt-079.cluster-qe.lab.eng.brq.redhat.com" update-client="virt-078.cluster-qe.lab.eng.brq.redhat.com" crm_feature_set="3.0.7" have-quorum="1" dc-uuid="3"/>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update </change-result>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update </change>
> Oct 15 10:52:52 virt-079 cib[1615]: warning: cib_process_diff: Bad global update </diff>
> Oct 15 10:52:52 virt-079 cib[1615]: error: cib_process_diff: Diff -1.-1.-1 -> -1.-1.-1 from virt-078.cluster-qe.lab.eng.brq.redhat.com not applied to 0.46.1: + and - versions in the diff did not change in global update
> Oct 15 10:52:53 virt-079 crmd[1620]: notice: te_rsc_command: Initiating action 17: start www_start_0 on virt-078.cluster-qe.lab.eng.brq.redhat.com
> Oct 15 10:52:54 virt-079 crmd[1620]: notice: te_rsc_command: Initiating action 18: monitor www_monitor_30000 on virt-078.cluster-qe.lab.eng.brq.redhat.com
> Oct 15 10:52:54 virt-079 crmd[1620]: notice: run_graph: Transition 26 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-26.bz2): Complete
> Oct 15 10:52:54 virt-079 crmd[1620]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]