Bug 454702 - fence_tool join occasionally hangs during revolver runs on the smoke cluster
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.4
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Assignee: David Teigland
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-07-09 20:04 UTC by Abhijith Das
Modified: 2009-05-01 19:08 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-05-01 19:08:02 UTC
Target Upstream Version:
Embargoed:



Description Abhijith Das 2008-07-09 20:04:16 UTC
Description of problem:

After about 10 hours of running, fence_tool join (invoked by the cman init
script) hung on winston, a node in the smoke cluster that revolver had just
rebooted.

fence_tool dump from merit (the node that fenced winston):

dump read: Success
1215573989 our_nodeid 2 our_name merit
1215573989 listen 4 member 5 groupd 7
1215573994 client 3: join default
1215573994 delay post_join 20s post_fail 2000s
1215573994 added 4 nodes from ccs
1215574041 setid default 65538
1215574041 start default 1 members 2 3 
1215574041 do_recovery stop 0 start 1 finish 0
1215574041 finish default 1
1215574041 stop default
1215574041 start default 2 members 1 2 3 
1215574041 do_recovery stop 1 start 2 finish 1
1215574041 finish default 2
1215574041 stop default
1215574041 start default 3 members 5 1 2 3 
1215574041 do_recovery stop 2 start 3 finish 2
1215574041 finish default 3
1215574410 stop default
1215574410 start default 9 members 5 2 3 
1215574410 do_recovery stop 3 start 9 finish 3
1215574410 add node 1 to list 1
1215574411 node "winston" not a cman member, cn 1
1215574412 node "winston" not a cman member, cn 1
1215574413 node "winston" not a cman member, cn 1
1215574414 node "winston" not a cman member, cn 1
1215574415 node "winston" not a cman member, cn 1
1215574416 node "winston" not a cman member, cn 1
1215574417 node "winston" not a cman member, cn 1
1215574418 node "winston" not a cman member, cn 1
1215574419 node "winston" not a cman member, cn 1
1215574420 node "winston" not a cman member, cn 1
1215574421 node "winston" not a cman member, cn 1
1215574422 node "winston" not a cman member, cn 1
1215574423 node "winston" not a cman member, cn 1
1215574424 node "winston" not a cman member, cn 1
1215574425 node "winston" not a cman member, cn 1
1215574426 node "winston" not a cman member, cn 1
1215574427 node "winston" not a cman member, cn 1
1215574428 node "winston" not a cman member, cn 1
1215574429 node "winston" not a cman member, cn 1
1215574430 node "winston" not a cman member, cn 1
1215574431 node "winston" not a cman member, cn 1
1215574432 node "winston" not a cman member, cn 1
1215574433 node "winston" not a cman member, cn 1
1215574434 node "winston" not a cman member, cn 1
1215574435 node "winston" not a cman member, cn 1
1215574436 node "winston" not a cman member, cn 1
1215574437 node "winston" not a cman member, cn 1
1215574438 node "winston" not a cman member, cn 1
1215574439 node "winston" not a cman member, cn 1
1215574440 node "winston" not a cman member, cn 1
1215574441 node "winston" not a cman member, cn 1
1215574442 node "winston" not a cman member, cn 1
1215574443 node "winston" not a cman member, cn 1
1215574444 node "winston" not a cman member, cn 1
1215574445 node "winston" not a cman member, cn 1
1215574446 node "winston" not a cman member, cn 1
1215574447 node "winston" not a cman member, cn 1
1215574448 node "winston" not a cman member, cn 1
1215574449 node "winston" not a cman member, cn 1
1215574450 node "winston" not a cman member, cn 1
1215574451 node "winston" not a cman member, cn 1
1215574452 node "winston" not a cman member, cn 1
1215574453 node "winston" not a cman member, cn 1
1215574454 node "winston" not a cman member, cn 1
1215574455 node "winston" not a cman member, cn 1
1215574456 node "winston" not a cman member, cn 1
1215574457 node "winston" not a cman member, cn 1
1215574458 node "winston" not a cman member, cn 1
1215574459 node "winston" not a cman member, cn 1
1215574460 node "winston" not a cman member, cn 1
1215574461 node "winston" not a cman member, cn 1
1215574462 node "winston" not a cman member, cn 1
1215574463 node "winston" not a cman member, cn 1
1215574464 node "winston" not a cman member, cn 1
1215574465 reduce victim winston
1215574465 delay of 55s leaves 0 victims
1215574465 finish default 9
1215574466 stop default
1215574466 start default 13 members 1 5 2 3 
1215574466 do_recovery stop 9 start 13 finish 9
1215574466 finish default 13
1215574814 stop default
1215574814 start default 17 members 5 2 3 
1215574814 do_recovery stop 13 start 17 finish 13
1215574814 add node 1 to list 1
1215574815 node "winston" not a cman member, cn 1
1215574816 node "winston" not a cman member, cn 1
1215574817 node "winston" not a cman member, cn 1
1215574818 node "winston" not a cman member, cn 1
1215574819 node "winston" not a cman member, cn 1
1215574820 node "winston" not a cman member, cn 1
1215574821 node "winston" not a cman member, cn 1
1215574822 node "winston" not a cman member, cn 1
1215574823 node "winston" not a cman member, cn 1
1215574824 node "winston" not a cman member, cn 1
1215574825 node "winston" not a cman member, cn 1
1215574826 node "winston" not a cman member, cn 1
1215574827 node "winston" not a cman member, cn 1
1215574828 node "winston" not a cman member, cn 1
1215574829 node "winston" not a cman member, cn 1
1215574830 node "winston" not a cman member, cn 1
1215574831 node "winston" not a cman member, cn 1
1215574832 node "winston" not a cman member, cn 1
1215574833 node "winston" not a cman member, cn 1
1215574834 node "winston" not a cman member, cn 1
1215574835 node "winston" not a cman member, cn 1
1215574836 node "winston" not a cman member, cn 1
1215574837 node "winston" not a cman member, cn 1
1215574838 node "winston" not a cman member, cn 1
1215574839 node "winston" not a cman member, cn 1
1215574840 node "winston" not a cman member, cn 1
1215574841 node "winston" not a cman member, cn 1
1215574842 node "winston" not a cman member, cn 1
1215574843 node "winston" not a cman member, cn 1
1215574844 node "winston" not a cman member, cn 1
1215574845 node "winston" not a cman member, cn 1
1215574846 node "winston" not a cman member, cn 1
1215574847 node "winston" not a cman member, cn 1
1215574848 node "winston" not a cman member, cn 1
1215574849 node "winston" not a cman member, cn 1
1215574850 node "winston" not a cman member, cn 1
1215574851 node "winston" not a cman member, cn 1
1215574852 node "winston" not a cman member, cn 1
1215574853 node "winston" not a cman member, cn 1
1215574854 node "winston" not a cman member, cn 1
1215574855 node "winston" not a cman member, cn 1
1215574856 node "winston" not a cman member, cn 1
1215574857 node "winston" not a cman member, cn 1
1215574858 node "winston" not a cman member, cn 1
1215574859 node "winston" not a cman member, cn 1
1215574860 node "winston" not a cman member, cn 1
1215574861 node "winston" not a cman member, cn 1
1215574862 node "winston" not a cman member, cn 1
1215574863 node "winston" not a cman member, cn 1
1215574864 node "winston" not a cman member, cn 1
1215574865 node "winston" not a cman member, cn 1
1215574866 node "winston" not a cman member, cn 1
1215574867 node "winston" not a cman member, cn 1
1215574868 node "winston" not in groupd cpg
1215574869 reduce victim winston
1215574869 delay of 55s leaves 0 victims
1215574869 finish default 17
1215615683 client 3: dump
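
Most of the dump above is the same message repeated once per second, which makes the state transitions hard to spot. The following is a minimal sketch, not part of fence_tool or any cluster package, that collapses consecutive identical messages and converts the epoch timestamps to UTC. The function names `summarize_dump` and `fmt_utc` are made up for illustration, and it assumes the dump has been saved to a text file and that a Python 3 interpreter is available on the machine used for analysis.

```python
import re
from datetime import datetime, timezone

# Matches "<epoch seconds> <message>"; trailing whitespace is dropped.
LINE_RE = re.compile(r"^(\d+)\s+(.*\S)\s*$")

def summarize_dump(text):
    """Collapse consecutive lines carrying the same message.

    Returns a list of (first_ts, last_ts, count, message) tuples.
    Lines without a leading timestamp (e.g. "dump read: Success") are skipped.
    """
    runs = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        ts, msg = int(m.group(1)), m.group(2)
        if runs and runs[-1][3] == msg:
            first, _, count, _ = runs[-1]
            runs[-1] = (first, ts, count + 1, msg)
        else:
            runs.append((ts, ts, 1, msg))
    return runs

def fmt_utc(ts):
    """Render an epoch timestamp as a UTC date and time string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

def print_summary(text):
    """Print one line per run: start time, repeat count, message."""
    for first, last, count, msg in summarize_dump(text):
        span = "" if count == 1 else " (x%d over %ds)" % (count, last - first)
        print("%s  %s%s" % (fmt_utc(first), msg, span))
```

Calling `print_summary(open("merit-fence-dump.txt").read())` (the file name is hypothetical) would print each run of "not a cman member" lines as a single entry with its repeat count.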


Version-Release number of selected component (if applicable):


How reproducible:
Unknown. Seen once so far.

Steps to Reproduce:
1. Run the QE revolver test with 1 fs and with init scripts enabled.
  
Actual results:
After several iterations spanning hours, the cman init script hangs in the
"Starting fencing..." state.

Expected results:
The rebooted node should recover without hanging.

Additional info:
Please let me know what additional info you need from the cluster if/when this
happens again.

Comment 1 David Teigland 2008-07-10 14:24:40 UTC
Can't say much without looking at winston itself. The last thing
merit sees is that winston joins the cluster and starts groupd.

If it happens again we'll need to log into the node with the stuck
fence_tool join and see what it's blocked on.
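
One possible way to collect that information is sketched below. It is not a procedure described in this bug: it reads the standard Linux procfs files `/proc/<pid>/cmdline`, `/proc/<pid>/status` and `/proc/<pid>/wchan` for every process whose command line contains a given string. The function name `blocked_state` and the `proc_root` parameter are made up for illustration, and the code assumes a Python 3 interpreter, which a stock RHEL 5 install may not have; the same files can be read by hand instead.

```python
import os

def blocked_state(name, proc_root="/proc"):
    """Return sorted [(pid, state, wchan)] for processes whose command line contains name.

    state is the "State:" field from /proc/<pid>/status;
    wchan is the kernel symbol the task is sleeping in, or "-" if empty.
    """
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "cmdline"), "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
            if name not in cmdline:
                continue
            state = "?"
            with open(os.path.join(base, "status")) as f:
                for line in f:
                    if line.startswith("State:"):
                        state = line.split(":", 1)[1].strip()
                        break
            with open(os.path.join(base, "wchan")) as f:
                wchan = f.read().strip() or "-"
        except OSError:
            continue  # process exited meanwhile, or a file is not readable
        found.append((int(entry), state, wchan))
    return sorted(found)
```

Running `blocked_state("fence_tool")` on the stuck node would list each matching process with its scheduler state and wait channel, which could be attached to the bug along with the local fence_tool dump.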


