Bug 819144
Summary: fence_tool reports incorrect data in corner case situation

| Field | Value |
|---|---|
| Product | Fedora |
| Component | cluster |
| Version | 16 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED WONTFIX |
| Severity | low |
| Priority | low |
| Reporter | sunshineway <xwzh2008> |
| Assignee | David Teigland <teigland> |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | agk, ccaulfie, cfeist, cluster-maint, fdinitto, lhh, mkelly, rpeterso, swhiteho, teigland |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2013-02-14 00:50:45 UTC |
Please attach also cluster.conf. If the issue is reproducible constantly, can you also run a session with `<logging debug="on"/>` and capture sosreports from both nodes? What versions of cman/corosync etc. are installed on the system?

(In reply to comment #2)
> Please attach also cluster.conf.

The cluster.conf follows:

```xml
<?xml version="1.0"?>
<cluster config_version="1" name="test-cluster">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="node01" nodeid="1" votes="1"/>
    <clusternode name="node02" nodeid="2" votes="1"/>
  </clusternodes>
  <cman two_node="1" expected_votes="1"/>
  <fencedevices/>
  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
  <logging debug="on"/>
</cluster>
```

> If the issue is reproducible constantly, can you also run a session with
> <logging debug="on"/> and capture sosreports from both nodes?

Debug is already turned on; the corosync debug messages are in the attachment.

> What version of cman/corosync etc. are installed on the system?

cman-3.0.12.1-23.el6.x86_64
corosync-1.4.1-4.el6.x86_64

Since the RHEL 6.3 External Beta has begun and this bug remains unresolved, it has been rejected, as it is not proposed as an exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
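A side note on the posted configuration: cman's two-node special mode requires `two_node="1"` to be paired with `expected_votes="1"` and exactly two cluster nodes, and the `<fencedevices/>` section here is empty even though fencing is required. The following is a minimal sketch (a hypothetical checker, not a tool shipped with cman) of validating those invariants with Python's standard `xml.etree`, using the cluster.conf posted above inlined as a string:

```python
# Sketch: sanity-check the two_node/expected_votes pairing in a cluster.conf.
# The config below is the one posted in this report; the checker itself is a
# hypothetical helper for illustration, not part of cman.
import xml.etree.ElementTree as ET

CLUSTER_CONF = """<?xml version="1.0"?>
<cluster config_version="1" name="test-cluster">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="node01" nodeid="1" votes="1"/>
    <clusternode name="node02" nodeid="2" votes="1"/>
  </clusternodes>
  <cman two_node="1" expected_votes="1"/>
  <fencedevices/>
  <logging debug="on"/>
</cluster>"""

def check_cluster_conf(conf_xml):
    root = ET.fromstring(conf_xml)
    cman = root.find("cman")
    nodes = root.findall("clusternodes/clusternode")
    problems = []
    if cman is not None and cman.get("two_node") == "1":
        # two_node="1" is only valid with exactly two nodes and expected_votes="1"
        if len(nodes) != 2:
            problems.append("two_node=1 but cluster has %d nodes" % len(nodes))
        if cman.get("expected_votes") != "1":
            problems.append("two_node=1 requires expected_votes=1")
    if not root.findall("fencedevices/fencedevice"):
        problems.append("no fence devices configured (fencing is required)")
    return problems

print(check_cluster_conf(CLUSTER_CONF))
# -> ['no fence devices configured (fencing is required)']
```

For this report's config the only finding is the missing fence device, which matches the triage comments later in the thread.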
>
> Steps to Reproduce:
> 1. Create a two-node cluster.
> 2. Make one node hang.
> 3. Recover the node from hanging.
> 4. Reboot or shut down the node, then fence_ack_manual it.
I am puzzled: how do you make the node hang? What process do you use to simulate the failure?
(In reply to comment #5)
> > Steps to Reproduce:
> > 1. Create a two-node cluster;
> > 2. Make one node hang;
> > 3. Recover the node from hanging;
> > 4. Reboot or shut down the node, then fence_ack_manual it.
>
> I am puzzled: how do you make the node hang? What process do you use to
> simulate the failure?

The method for making a node hang is as follows:

Step 1: I use VirtualBox to create two virtual nodes running Red Hat 6.2, create a two-node cluster with Conga, and run it.

Step 2: I then close one of the virtual nodes with the "Save the machine state" option. The online node detects that the "closed" node has lost heartbeat and begins to fence it, but the fence fails.

Step 3: I then reopen the "closed" node; its state is restored and corosync keeps running as before. The online node detects that the "closed" node is alive and forms a new membership:

```
Apr 19 08:55:19 node02 corosync[23125]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 19 08:55:19 node02 corosync[23125]: [QUORUM] Members[2]: 1 2
Apr 19 08:55:19 node02 corosync[23125]: [QUORUM] Members[2]: 1 2
Apr 19 08:55:19 node02 dlm_controld[23187]: cpg_mcast_joined error 12 handle 3006c83e00000000 protocol
Apr 19 08:55:19 node02 gfs_controld[23244]: cpg_mcast_joined error 12 handle 71f3245400000000 protocol
```

However, the dlm_controld and gfs_controld daemons print cpg_mcast_joined errors.

Step 4: I then reboot the "closed" node and fence_ack_manual it (instead of using a fence device), but when it joins the cluster, the cluster is still abnormal:

```
[root@node02 cluster]# group_tool -n
fence domain
member count  1
victim count  2
victim now    0
master nodeid 2
wait state    messages
members       2
all nodes
nodeid 1 member 1 victim 1 last fence master 2 how override
nodeid 2 member 0 victim 1 last fence master 0 how none
```

This situation really happened on a physical machine; I am just reproducing the problem with VirtualBox.
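The stuck state in the output above can be spotted mechanically: a healthy, settled fence domain shows a victim count of 0 and a wait state of "none". A small sketch (a hypothetical parser for illustration, not a cman utility) that reads the fence-domain summary quoted above and flags the inconsistency:

```python
# Sketch: parse the fence-domain summary printed by group_tool/fence_tool ls
# and flag the stuck state seen above (victims still recorded while the
# domain waits on messages). Hypothetical helper, not part of cman.

GROUP_TOOL_OUTPUT = """\
fence domain
member count  1
victim count  2
victim now    0
master nodeid 2
wait state    messages
members       2
"""

def parse_domain(text):
    """Turn 'key   value' summary lines into a dict."""
    info = {}
    for line in text.splitlines()[1:]:  # skip the "fence domain" banner
        if not line.strip():
            continue
        key, _, value = line.rpartition(" ")
        info[key.strip()] = value
    return info

def looks_stuck(info):
    # A finished fence domain should show zero victims and wait state "none".
    return int(info.get("victim count", "0")) > 0 and info.get("wait state") != "none"

info = parse_domain(GROUP_TOOL_OUTPUT)
print(looks_stuck(info))  # True for the abnormal output quoted above
```

Run against the node02 output above, this reports the domain as stuck (victim count 2, wait state "messages"), which is exactly the anomaly the reporter describes.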
If the OS or corosync hangs for a while, is not successfully fenced by the other node, and then recovers, this situation can cause the problem.

Moving this bug upstream for the following reasons:
- Fencing is a strict requirement in RHEL6, and here it is not configured; this report is a corner-case scenario that cannot be deployed in production.
- Current corosync + cman packages in RHEL (including the ones mentioned in the report) behave correctly and do not show the problem at all.
- Customer/RHEL bugs should be escalated via Global Support Services.

After investigation, it appears that it is possible (rarely) to trigger a race condition in which a cluster remerge presents the above output from fence_tool ls. The output appears to be gathered "too fast"; waiting a bit longer causes cman to die (as expected) on one of the nodes, and the surviving node does the right thing all the way. In other cases (depending on when the race triggers) fence_tool ls does show some inconsistencies. It appears to happen only on the node where cman has been killed and where fenced is waiting to be fenced (as expected). The cpg_mcast_joined error messages have never shown up in my logs. After a forceful reboot of the suspended node, fence_ack_manual restores conditions to normal.

Suggested actions here are:
- Add fencing (as it is a requirement).
- Update packages to a more recent version that includes the fixes for bug #663397, which, based on my understanding, mitigate or fix part of those race conditions.

I tried again with the rpm packages that include the fixes for bug #663397 and found something new and interesting: if I make the master nodeid (the one group_tool shows) hang, the problem still exists as above. But if I make a non-master nodeid hang for more than 10 seconds, then recover it, reboot it, and execute fence_ack_manual on the master nodeid, the cluster becomes normal; when the non-master nodeid starts successfully, it can join the cluster. I never saw cman being killed.
I think something may be wrong in how the fenced process handles this situation.

Please stop reassigning this Bugzilla to RHEL products. The configuration in which you are experiencing the issue is not supported in RHEL. This bug applies upstream, where it has to be fixed first.

This message is a reminder that Fedora 16 is nearing its end of life. Approximately four weeks from now, Fedora will stop maintaining and issuing updates for Fedora 16. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue, and we are sorry that we may not be able to fix it before Fedora 16 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to click on "Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result, we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen it against that version. Thank you for reporting this bug, and we are sorry it could not be fixed.
Created attachment 582249 [details]
log-20120505: the cluster becomes abnormal when a node recovers after a hang

Description of problem:
This cluster has two nodes (node01 and node02) and is running. Node01 hangs (corosync/cman hang as well); node02 detects that node01's heartbeat is lost and begins to fence node01, but the fence fails. Node01 then recovers. At this point I shut down node01 and use the manual method (the fence_ack_manual command) to fence it, but the cluster is still abnormal. Even after node01 powers on and joins the cluster, the cluster cannot recover. /var/log/messages prints the following messages:

```
Apr 19 08:55:19 node02 corosync[23125]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 19 08:55:19 node02 corosync[23125]: [QUORUM] Members[2]: 1 2
Apr 19 08:55:19 node02 corosync[23125]: [QUORUM] Members[2]: 1 2
Apr 19 08:55:19 node02 dlm_controld[23187]: cpg_mcast_joined error 12 handle 3006c83e00000000 protocol
Apr 19 08:55:19 node02 gfs_controld[23244]: cpg_mcast_joined error 12 handle 71f3245400000000 protocol
Apr 19 08:55:19 node02 corosync[23125]: [CPG   ] chosen downlist: sender r(0) ip(192.168.1.102) ; members(old:2 left:1)
Apr 19 08:55:19 node02 corosync[23125]: [MAIN  ] Completed service synchronization, ready to provide service.
Apr 19 08:55:21 node02 fenced[23173]: cpg_mcast_joined error 12 handle 7fdcc23300000000 protocol
Apr 19 08:56:51 node02 corosync[23125]: [TOTEM ] A processor failed, forming new configuration.
Apr 19 08:56:53 node02 kernel: dlm: closing connection to node 1
Apr 19 08:56:53 node02 corosync[23125]: [QUORUM] Members[1]: 2
Apr 19 08:56:53 node02 corosync[23125]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 19 08:56:53 node02 corosync[23125]: [CPG   ] chosen downlist: sender r(0) ip(192.168.1.103) ; members(old:2 left:1)
Apr 19 08:56:53 node02 corosync[23125]: [MAIN  ] Completed service synchronization, ready to provide service.
```
```
Apr 19 08:57:09 node02 fenced[23173]: fence node01 overridden by administrator intervention
Apr 19 08:57:09 node02 fenced[23173]: cpg_mcast_joined error 12 handle 71f3245400000001 victim_done
Apr 19 08:57:09 node02 fenced[23173]: cpg_mcast_joined error 12 handle 71f3245400000001 complete
Apr 19 08:57:09 node02 fenced[23173]: cpg_mcast_joined error 12 handle 71f3245400000001 start
```

```
[root@node01 cluster]# group_tool -n
fence domain
member count  1
victim count  0
victim now    0
master nodeid 1
wait state    none
members       1
all nodes
nodeid 1 member 1 victim 0 last fence master 0 how none
nodeid 2 member 0 victim 0 last fence master 0 how none

[root@node02 cluster]# group_tool -n
fence domain
member count  1
victim count  2    // victim count becomes 2 ??
victim now    0
master nodeid 2
wait state    messages
members       2
all nodes
nodeid 1 member 1 victim 1 last fence master 2 how override
nodeid 2 member 0 victim 1 last fence master 0 how none
```

Version-Release number of selected component (if applicable):
cman-3.0.12.1-23.el6.x86_64
corosync-1.4.1-4.el6.x86_64

How reproducible:
See below.

Steps to Reproduce:
1. Create a two-node cluster.
2. Make one node hang.
3. Recover the node from hanging.
4. Reboot or shut down the node, then fence_ack_manual it.

Actual results:
The cluster is abnormal after the node is fenced.

Expected results:
The cluster should return to normal.

Additional info:
Log file attached.
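The `cpg_mcast_joined error 12` lines recurring in the logs can be decoded against corosync's `cs_error_t` enum. Assuming the numbering from corosync's corotypes.h (worth verifying against the headers actually installed), code 12 would be CS_ERR_NOT_EXIST. A minimal decoding sketch:

```python
# Sketch: decode the numeric cpg_mcast_joined error code seen in the logs.
# The mapping below assumes corosync's cs_error_t numbering (corotypes.h);
# verify against the corosync headers installed on the affected system.
CS_ERRORS = {
    1: "CS_OK",
    2: "CS_ERR_LIBRARY",
    3: "CS_ERR_VERSION",
    4: "CS_ERR_INIT",
    5: "CS_ERR_TIMEOUT",
    6: "CS_ERR_TRY_AGAIN",
    7: "CS_ERR_INVALID_PARAM",
    8: "CS_ERR_NO_MEMORY",
    9: "CS_ERR_BAD_HANDLE",
    10: "CS_ERR_BUSY",
    11: "CS_ERR_ACCESS",
    12: "CS_ERR_NOT_EXIST",
}

LOG_LINE = "dlm_controld[23187]: cpg_mcast_joined error 12 handle 3006c83e00000000 protocol"

def decode(line):
    # Pull out the integer that follows the word "error" in a daemon log line.
    words = line.split()
    code = int(words[words.index("error") + 1])
    return CS_ERRORS.get(code, "unknown")

print(decode(LOG_LINE))
```

Under that assumption, the daemons are failing to multicast on a CPG handle that corosync no longer considers valid after the remerge, which is consistent with the stuck fence-domain state shown above.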