Bug 684825
Summary: | [PATCH] Fix pacemaker's wrong quorum view in a CMAN+pacemaker cluster |
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Simone Gotti <simone.gotti> |
Component: | pacemaker | Assignee: | Andrew Beekhof <abeekhof> |
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
Severity: | medium | Priority: | urgent |
Version: | 6.0 | CC: | cluster-maint, djansa, uwe.knop |
Target Milestone: | rc | Keywords: | TechPreview |
Target Release: | 6.1 | Hardware: | All |
OS: | Linux | |
Fixed In Version: | pacemaker-1.1.5-3.el6 | Doc Type: | Technology Preview |
Doc Text: | In a cluster environment managed by both Pacemaker and the CMAN cluster management subsystem, frequent leaving and joining of a node could cause Pacemaker's quorum view to be incorrect. This update applies a patch that addresses this issue, so that the leaving and joining of a node no longer causes Pacemaker's quorum view to be different from CMAN's. |
Last Closed: | 2011-05-19 13:49:39 UTC |
It's very important that we have an up-to-date view of membership/quorum. The attached patch is correct.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
In a cluster environment managed by both Pacemaker and the CMAN cluster management subsystem, frequent leaving and joining of a node could cause Pacemaker's quorum view to be incorrect. This update applies a patch that addresses this issue, so that the leaving and joining of a node no longer causes Pacemaker's quorum view to be different from CMAN's.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0642.html
Created attachment 484227
Patch to deque all the cman events from crmd

While testing a cman+pacemaker cluster on RHEL 6 I noticed a very nasty behavior when some nodes were leaving and rejoining the cluster.

When a node starts leaving and rejoining the cluster, pacemaker's quorum view sometimes starts to differ from cman's quorum view. The one not telling the truth was pacemaker.

I reproduced the problem with a simple test case made of 2 nodes using cman (no two_node flag) and pacemaker (started only on the first node: pcmk01).

For the tests I was using the latest version of pacemaker (1.1.5) while keeping the original versions of the corosync and cluster (cman) packages provided by RHEL 6 (corosync-1.2.3-21.el6.x86_64, cman-3.0.12-23.el6.4.x86_64).

The problem is that when a node joins a cluster (starting cman), the cman on the other nodes emits not one but 2 events (I didn't investigate whether this is normal or present only in some versions of cman), but when crmd calls cman_dispatch it uses the flag CMAN_DISPATCH_ONE, so only one of the two events is dequeued. On the next cluster event, the old event is dequeued instead of the new one.

The fix I tried uses CMAN_DISPATCH_ALL instead of CMAN_DISPATCH_ONE and it looks like it's working.

I'm CCing the cluster-devel list as they may be interested in the double event emitted by cman.

Thanks. Bye!

== Test case ==

=== Without the patch ===

Start with both nodes with cman started (so the cluster is quorate).

Now stop cman on pcmk02. Output on pcmk01:

pcmk01 corosync[16793]: [CMAN ] quorum lost, blocking activity
pcmk01 corosync[16793]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
pcmk01 corosync[16793]: [QUORUM] Members[1]: 1
pcmk01 corosync[16793]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
pcmk01 corosync[16793]: [CPG ] downlist received left_list: 1
pcmk01 corosync[16793]: [CPG ] chosen downlist from node r(0) ip(192.168.200.71)
pcmk01 corosync[16793]: [MAIN ] Completed service synchronization, ready to provide service.
pcmk01 crmd: [16993]: notice: cman_event_callback: Membership 668: quorum lost

Only one event is enqueued.

Now start cman again on pcmk02. Output on pcmk01:

pcmk01 corosync[16793]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
pcmk01 corosync[16793]: [CMAN ] quorum regained, resuming activity
pcmk01 corosync[16793]: [QUORUM] This node is within the primary component and will provide service.
pcmk01 corosync[16793]: [QUORUM] Members[2]: 1 2
pcmk01 corosync[16793]: [QUORUM] Members[2]: 1 2
pcmk01 crmd: [16993]: notice: cman_event_callback: Membership 672: quorum acquired
pcmk01 corosync[16793]: [CPG ] downlist received left_list: 0
pcmk01 corosync[16793]: [CPG ] downlist received left_list: 0
pcmk01 corosync[16793]: [CPG ] chosen downlist from node r(0) ip(192.168.200.71)
pcmk01 corosync[16793]: [MAIN ] Completed service synchronization, ready to provide service.

As you can see, two events are enqueued and only one is dequeued (due to the CMAN_DISPATCH_ONE flag passed to cman_dispatch). The quorum is regained on both cman and crmd, but another event saying that the quorum is regained is still in the queue.

Now stop cman again on pcmk02. Output on pcmk01:

pcmk01 corosync[16793]: [CMAN ] quorum lost, blocking activity
pcmk01 corosync[16793]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
pcmk01 corosync[16793]: [QUORUM] Members[1]: 1
pcmk01 corosync[16793]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
pcmk01 corosync[16793]: [CPG ] downlist received left_list: 1
pcmk01 corosync[16793]: [CPG ] chosen downlist from node r(0) ip(192.168.200.71)
pcmk01 corosync[16793]: [MAIN ] Completed service synchronization, ready to provide service.
pcmk01 crmd: [16993]: info: cman_event_callback: Membership 676: quorum retained

CMAN says that the quorum is lost and only one event is dispatched. But crmd dequeued the previous event and thinks that we still have the quorum.

Now start cman again on pcmk02. Output on pcmk01:

pcmk01 corosync[16793]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
pcmk01 corosync[16793]: [CMAN ] quorum regained, resuming activity
pcmk01 corosync[16793]: [QUORUM] This node is within the primary component and will provide service.
pcmk01 corosync[16793]: [QUORUM] Members[2]: 1 2
pcmk01 corosync[16793]: [QUORUM] Members[2]: 1 2
pcmk01 crmd: [16993]: notice: cman_event_callback: Membership 680: quorum lost
pcmk01 corosync[16793]: [CPG ] downlist received left_list: 0
pcmk01 corosync[16793]: [CPG ] downlist received left_list: 0
pcmk01 corosync[16793]: [CPG ] chosen downlist from node r(0) ip(192.168.200.71)
pcmk01 corosync[16793]: [MAIN ] Completed service synchronization, ready to provide service.

CMAN says that the quorum is regained, but crmd again dequeued the old event and now says that the quorum is lost.

And so on...

=== With the patch ===

Stop cman on pcmk02. Output on pcmk01:

pcmk01 corosync[13149]: [CMAN ] quorum lost, blocking activity
pcmk01 corosync[13149]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
pcmk01 corosync[13149]: [QUORUM] Members[1]: 1
pcmk01 corosync[13149]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
pcmk01 corosync[13149]: [CPG ] downlist received left_list: 1
pcmk01 corosync[13149]: [CPG ] chosen downlist from node r(0) ip(192.168.200.71)
pcmk01 corosync[13149]: [MAIN ] Completed service synchronization, ready to provide service.
pcmk01 crmd: [13351]: notice: cman_event_callback: Membership 648: quorum lost

Only one event is enqueued.

Now start cman again on pcmk02. Output on pcmk01:

pcmk01 corosync[13149]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
pcmk01 corosync[13149]: [CMAN ] quorum regained, resuming activity
pcmk01 corosync[13149]: [QUORUM] This node is within the primary component and will provide service.
pcmk01 corosync[13149]: [QUORUM] Members[2]: 1 2
pcmk01 corosync[13149]: [QUORUM] Members[2]: 1 2
pcmk01 crmd: [13351]: notice: cman_event_callback: Membership 652: quorum acquired
pcmk01 corosync[13149]: [CPG ] downlist received left_list: 0
pcmk01 corosync[13149]: [CPG ] downlist received left_list: 0
pcmk01 corosync[13149]: [CPG ] chosen downlist from node r(0) ip(192.168.200.71)
pcmk01 corosync[13149]: [MAIN ] Completed service synchronization, ready to provide service.
pcmk01 crmd: [13351]: info: cman_event_callback: Membership 652: quorum retained

As you can see, two events are enqueued and both are dequeued.
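
For readers who want to see why dispatching a single event per wakeup lets crmd drift, the test case can be replayed with a toy model. This is only a sketch and not pacemaker or libcman code: the names `simulate` and `drain_all` are hypothetical, and it assumes that each queued event carries the quorum state at the time it was emitted, that a leave emits one event and a join emits two (as in the logs in this report), and that crmd makes exactly one dispatch call per membership change.

```python
from collections import deque


def simulate(drain_all):
    """Replay stop/start/stop/start of cman on the second node.

    Returns one (cman_quorate, crmd_quorate) pair per membership change.
    drain_all=False models the CMAN_DISPATCH_ONE behaviour described in
    this report; drain_all=True models CMAN_DISPATCH_ALL.
    """
    queue = deque()        # events emitted by cman, not yet seen by crmd
    crmd_quorate = True    # both nodes are up at the start
    history = []

    # (quorum state after the change, number of events emitted).
    # Assumption taken from the logs: a leave emits 1 event, a join emits 2.
    changes = [(False, 1), (True, 2), (False, 1), (True, 2)]

    for cman_quorate, n_events in changes:
        queue.extend([cman_quorate] * n_events)

        # Assumption: crmd wakes up once per change and calls dispatch once.
        while queue:
            crmd_quorate = queue.popleft()
            if not drain_all:
                break  # only the oldest queued event is handled

        history.append((cman_quorate, crmd_quorate))
    return history


if __name__ == "__main__":
    for label, drain_all in (("dispatch one", False), ("dispatch all", True)):
        print(label)
        for cman, crmd in simulate(drain_all):
            note = "" if cman == crmd else "  <-- views differ"
            print(f"  cman quorate={cman}  crmd quorate={crmd}{note}")
```

In this model the single-event variant agrees with cman for the first two changes and then disagrees on the third and fourth, which is the pattern shown in the "Without the patch" logs. The draining variant agrees on every change.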