Bug 819144

Summary: fence_tool reports incorrect data in corner case situation
Product: Fedora
Component: cluster
Version: 16
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: low
Priority: low
Reporter: sunshineway <xwzh2008>
Assignee: David Teigland <teigland>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: agk, ccaulfie, cfeist, cluster-maint, fdinitto, lhh, mkelly, rpeterso, swhiteho, teigland
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-02-14 00:50:45 UTC
Attachments:
log-20120505: the cluster becomes abnormal when a node recovers after a hang

Description sunshineway 2012-05-05 04:22:58 UTC
Created attachment 582249
log-20120505: the cluster becomes abnormal when a node recovers after a hang

Description of problem:
This cluster has two nodes (node01 and node02). While the cluster is running, node01 hangs (corosync/cman hang with it). node02 detects that node01's heartbeat is lost and begins to fence node01, but the fence attempt fails. node01 then recovers; at that point I shut node01 down and fence it manually with the fence_ack_manual command, but the cluster is still abnormal. Even after node01 powers on and joins the cluster again, the cluster cannot recover.

/var/log/messages shows the following messages:
Apr 19 08:55:19 node02 corosync[23125]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 19 08:55:19 node02 corosync[23125]:   [QUORUM] Members[2]: 1 2
Apr 19 08:55:19 node02 corosync[23125]:   [QUORUM] Members[2]: 1 2
Apr 19 08:55:19 node02 dlm_controld[23187]: cpg_mcast_joined error 12 handle 3006c83e00000000 protocol
Apr 19 08:55:19 node02 gfs_controld[23244]: cpg_mcast_joined error 12 handle 71f3245400000000 protocol
Apr 19 08:55:19 node02 corosync[23125]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.1.102) ; members(old:2 left:1)
Apr 19 08:55:19 node02 corosync[23125]:   [MAIN  ] Completed service synchronization, ready to provide service.
Apr 19 08:55:21 node02 fenced[23173]: cpg_mcast_joined error 12 handle 7fdcc23300000000 protocol
Apr 19 08:56:51 node02 corosync[23125]:   [TOTEM ] A processor failed, forming new configuration.
Apr 19 08:56:53 node02 kernel: dlm: closing connection to node 1
Apr 19 08:56:53 node02 corosync[23125]:   [QUORUM] Members[1]: 2
Apr 19 08:56:53 node02 corosync[23125]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 19 08:56:53 node02 corosync[23125]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.1.103) ; members(old:2 left:1)
Apr 19 08:56:53 node02 corosync[23125]:   [MAIN  ] Completed service synchronization, ready to provide service.
Apr 19 08:57:09 node02 fenced[23173]: fence node01 overridden by administrator intervention
Apr 19 08:57:09 node02 fenced[23173]: cpg_mcast_joined error 12 handle 71f3245400000001 victim_done
Apr 19 08:57:09 node02 fenced[23173]: cpg_mcast_joined error 12 handle 71f3245400000001 complete
Apr 19 08:57:09 node02 fenced[23173]: cpg_mcast_joined error 12 handle 71f3245400000001 start
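
Note: error 12 from cpg_mcast_joined appears to correspond to CS_ERR_NOT_EXIST in corosync's cs_error_t, which would mean the daemons are still multicasting on CPG handles that no longer exist after the remerge. A sketch of capturing fenced's view at this point, assuming the cman 3.x tool syntax:

[root@node02 cluster]# fence_tool ls     # fenced's summary of the fence domain
[root@node02 cluster]# fence_tool dump   # fenced's internal debug buffer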

[root@node01 cluster]# group_tool -n
fence domain
member count  1
victim count  0
victim now    0
master nodeid 1
wait state    none
members       1 
all nodes
nodeid 1 member 1 victim 0 last fence master 0 how none
nodeid 2 member 0 victim 0 last fence master 0 how none


[root@node02 cluster]# group_tool -n
fence domain
member count  1
victim count  2       //victim count becomes 2 ??
victim now    0
master nodeid 2
wait state    messages
members       2 
all nodes
nodeid 1 member 1 victim 1 last fence master 2 how override
nodeid 2 member 0 victim 1 last fence master 0 how none


Version-Release number of selected component (if applicable):
cman-3.0.12.1-23.el6.x86_64
corosync-1.4.1-4.el6.x86_64

How reproducible:
See below.

Steps to Reproduce:
1. Create a two-node cluster.
2. Make one node hang.
3. Recover the node from the hang.
4. Reboot or shut down the node, then fence_ack_manual it (sketched below).
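A rough sketch of these steps as performed with VirtualBox (comment 6 below has the details; the VBoxManage command lines and VM names here are illustrative, not taken from the report):

# Hang node01 by freezing the whole VM ("Save the machine state"):
VBoxManage controlvm node01 savestate
# node02 now loses node01's heartbeat and tries, and fails, to fence it.

# Recover node01 from the hang:
VBoxManage startvm node01

# Shut node01 down again, then acknowledge the fence manually from node02:
VBoxManage controlvm node01 poweroff
fence_ack_manual node01   # some fenced versions expect: fence_ack_manual -n node01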
  
Actual results:
The cluster remains abnormal after the node is fenced.

Expected results:
The cluster should return to normal.

Additional info:
Log file attached.

Comment 2 Fabio Massimo Di Nitto 2012-05-05 04:46:11 UTC
Please also attach cluster.conf.

If the issue is reproducible constantly, can you also run a session with
<logging debug="on"/> and capture sosreports from both nodes?

What versions of cman/corosync etc. are installed on the system?

Comment 3 sunshineway 2012-05-05 05:30:17 UTC
(In reply to comment #2)
> Please also attach cluster.conf.
> 

The cluster.conf follows:

<?xml version="1.0"?>
<cluster config_version="1" name="test-cluster">
	<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
	<clusternodes>
		<clusternode name="node01" nodeid="1" votes="1"/>
		<clusternode name="node02" nodeid="2" votes="1"/>
	</clusternodes>
	<cman two_node="1" expected_votes="1"/>
	<fencedevices/>
	<rm>
		<failoverdomains/>
		<resources/>
	</rm>
	<logging debug="on"/>
</cluster>
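
Note that <fencedevices> is empty here; comment 7 below calls fencing a strict requirement. For comparison, a minimal sketch of a fenced configuration (fence_ipmilan and all names/addresses/credentials are placeholders, not from this cluster):

	<clusternode name="node01" nodeid="1" votes="1">
		<fence>
			<method name="1">
				<device name="ipmi-node01"/>
			</method>
		</fence>
	</clusternode>
	...
	<fencedevices>
		<fencedevice agent="fence_ipmilan" name="ipmi-node01" ipaddr="192.168.1.201" login="admin" passwd="example"/>
	</fencedevices>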

> If the issue is reproducible constantly, can you also run a session with
> <logging debug="on"/> and capture sosreports from both nodes?
> 

Debug logging is already turned on, and the corosync debug messages are in the attachment.

> What versions of cman/corosync etc. are installed on the system?
cman-3.0.12.1-23.el6.x86_64
corosync-1.4.1-4.el6.x86_64

Comment 4 RHEL Program Management 2012-05-09 04:04:01 UTC
Since RHEL 6.3 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 5 Fabio Massimo Di Nitto 2012-05-09 07:46:48 UTC
> 
> Steps to Reproduce:
> 1. Create a two-node cluster.
> 2. Make one node hang.
> 3. Recover the node from the hang.
> 4. Reboot or shut down the node, then fence_ack_manual it.


I am puzzled: how do you make the node hang? What process do you use to simulate the failure?

Comment 6 sunshineway 2012-05-10 15:10:48 UTC
(In reply to comment #5)
> > 
> > Steps to Reproduce:
> > 1. Create a two-node cluster.
> > 2. Make one node hang.
> > 3. Recover the node from the hang.
> > 4. Reboot or shut down the node, then fence_ack_manual it.
> 
> 
> I am puzzled: how do you make the node hang? What process do you use to
> simulate the failure?

The method for making a node hang is as follows:
Step 1: I use VirtualBox to create two virtual nodes running Red Hat 6.2, build a two-node cluster with Conga, and start it.
Step 2: I then close one of the virtual nodes with the option "Save the machine state". The online node detects that the "closed" node has lost its heartbeat and begins to fence it, but the fence fails.
Step 3: I then reopen the "closed" node; its state is restored and corosync runs as before.
The online node detects that the "closed" node is alive and forms a new membership:
Apr 19 08:55:19 node02 corosync[23125]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 19 08:55:19 node02 corosync[23125]:   [QUORUM] Members[2]: 1 2
Apr 19 08:55:19 node02 corosync[23125]:   [QUORUM] Members[2]: 1 2
Apr 19 08:55:19 node02 dlm_controld[23187]: cpg_mcast_joined error 12 handle 3006c83e00000000 protocol
Apr 19 08:55:19 node02 gfs_controld[23244]: cpg_mcast_joined error 12 handle 71f3245400000000 protocol

But the dlm_controld and gfs_controld daemons print cpg_mcast_joined errors.

Step 4: I then reboot the "closed" node and fence_ack_manual it (instead of using a fence device), but even when it joins the cluster, the cluster is still abnormal:

[root@node02 cluster]# group_tool -n
fence domain
member count  1
victim count  2      
victim now    0
master nodeid 2
wait state    messages
members       2 
all nodes
nodeid 1 member 1 victim 1 last fence master 2 how override
nodeid 2 member 0 victim 1 last fence master 0 how none


This situation really happened on a physical machine; I merely reproduced the problem with VirtualBox. If the OS or corosync hangs for a while, is not successfully fenced by the other node, and then recovers, the problem can occur.
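
For what it is worth, a hypothetical way to simulate the same hang on physical hardware, without VirtualBox, would be to freeze only the corosync process (my assumption; this was not tested in this report):

pkill -STOP corosync   # corosync stops responding; the peer sees heartbeat loss
sleep 60               # keep it frozen past the token timeout
pkill -CONT corosync   # "recover" the node; the remerge race can now trigger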

Comment 7 Fabio Massimo Di Nitto 2012-05-11 07:24:25 UTC
Moving this bug upstream for the following reasons:

- Fencing is a strict requirement in RHEL6, and here it is not configured;
  this report covers a corner-case scenario that would not be deployed in production.
- Current corosync + cman packages in RHEL (including the ones mentioned
  in the report) behave correctly and do not show the problem at all.
- Customer/RHEL bugs should be escalated via Global Support Services.

After investigation, it appears that it is possible (rarely) to trigger a race condition in which a cluster remerge presents the above output from fence_tool ls.

The output appears to be gathered "too fast": waiting a bit longer causes cman to die (as expected) on one of the nodes, and the surviving node does the right thing all the way.

In other cases (depending on when the race triggered), fence_tool ls does show some inconsistencies. This appears to happen only on the node where cman has been killed and where fenced is waiting to be fenced (as expected).

The cpg_mcast_joined error messages have never shown up in my logs.

After a forceful reboot of the suspended node, fence_ack_manual restores conditions to normal.

Suggested actions here are:

- Add fencing (as it is a requirement).
- Update the packages to a more recent version that includes the fixes for
  bug #663397, which, based on my understanding, mitigate or fix part of those
  race conditions.

Comment 8 sunshineway 2012-05-12 07:42:02 UTC
I tried again with the RPM packages that include the fixes for bug #663397.
I found something new and interesting:
If I make the master node (as reported by group_tool) hang, the problem still occurs as above. But if I make a non-master node hang for more than 10 seconds, then recover it, reboot it, and run fence_ack_manual on the master node, the cluster becomes normal, and once the non-master node starts up it can rejoin the cluster. I never saw cman get killed. I think the fenced process mishandles this situation somewhere.
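
A small sketch of how the two cases can be distinguished, using the group_tool output shown earlier (the awk pattern is mine, not from the report):

[root@node01 cluster]# group_tool -n | awk '/master nodeid/ {print $3}'
# hang that node to keep reproducing the failure; hang the other node
# to see the recovery path described above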

Comment 10 Fabio Massimo Di Nitto 2012-06-19 03:00:02 UTC
Please stop reassigning this Bugzilla to RHEL products.

The configuration in which you are experiencing the issue is not supported in RHEL.

This bug applies upstream where it has to be fixed first.

Comment 11 Fedora End Of Life 2013-01-16 22:15:20 UTC
This message is a reminder that Fedora 16 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 16. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 16 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 12 Fedora End Of Life 2013-02-14 00:50:50 UTC
Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.