Bug 128432 - cluster services get stuck in recovery state
Summary: cluster services get stuck in recovery state
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: gfs
Version: 4
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: David Teigland
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2004-07-22 20:56 UTC by Corey Marthaler
Modified: 2010-01-12 02:54 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-08-25 17:10:25 UTC
Embargoed:



Description Corey Marthaler 2004-07-22 20:57:00 UTC
Description of problem:
After nodes go down and an attempt is made to bring them back into the
cluster, the nodes that stayed up have their services stuck in the
recovery state.

morph-01 and morph-05 were the nodes which were shot.


[root@morph-01 root]# cat /proc/cluster/services

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           0   2 join      S-1,80,6
[]

DLM Lock Space:  "clvmd"                             0   3 join      S-1,80,6
[]

[root@morph-02 root]# cat /proc/cluster/services

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 4 -
[6 4 3 1]

DLM Lock Space:  "clvmd"                             2   3 recover 0 -
[4 3 1 6]

DLM Lock Space:  "foobar0"                           3   4 recover 0 -
[6 4 3 1]

DLM Lock Space:  "foobar1"                           5   6 recover 0 -
[6 4 3 1]

DLM Lock Space:  "foobar2"                           7   8 recover 0 -
[6 4 3 1]

GFS Mount Group: "foobar0"                           4   5 recover 0 -
[6 4 3 1]

GFS Mount Group: "foobar1"                           6   7 recover 0 -
[6 4 3 1]

GFS Mount Group: "foobar2"                           8   9 recover 0 -
[6 4 3 1]


[root@morph-03 root]# cat /proc/cluster/services

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 4 -
[4 6 3 1]

DLM Lock Space:  "clvmd"                             2   3 recover 0 -
[4 1 3 6]

DLM Lock Space:  "foobar0"                           3   4 recover 0 -
[4 6 3 1]

DLM Lock Space:  "foobar1"                           5   6 recover 0 -
[4 6 3 1]

DLM Lock Space:  "foobar2"                           7   8 recover 0 -
[4 6 3 1]

GFS Mount Group: "foobar0"                           4   5 recover 0 -
[4 6 3 1]

GFS Mount Group: "foobar1"                           6   7 recover 0 -
[4 6 3 1]

GFS Mount Group: "foobar2"                           8   9 recover 0 -
[4 6 3 1]



[root@morph-04 root]# cat /proc/cluster/services

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 4 -
[3 4 6 1]

DLM Lock Space:  "clvmd"                             2   3 recover 0 -
[1 3 4 6]

DLM Lock Space:  "foobar0"                           3   4 recover 0 -
[3 4 6 1]

DLM Lock Space:  "foobar1"                           5   6 recover 0 -
[3 4 6 1]

DLM Lock Space:  "foobar2"                           7   8 recover 0 -
[3 4 6 1]

GFS Mount Group: "foobar0"                           4   5 recover 0 -
[3 4 6 1]

GFS Mount Group: "foobar1"                           6   7 recover 0 -
[3 4 6 1]

GFS Mount Group: "foobar2"                           8   9 recover 0 -
[3 4 6 1]


[root@morph-05 root]# cat /proc/cluster/services

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           0   2 join      S-1,80,6
[]

DLM Lock Space:  "clvmd"                             0   3 join      S-1,280,6
[]


[root@morph-06 root]# cat /proc/cluster/services

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 2 -
[1 3 4 6]

DLM Lock Space:  "clvmd"                             2   3 recover 0 -
[1 4 3 6]

DLM Lock Space:  "foobar0"                           3   4 recover 0 -
[1 3 4 6]

DLM Lock Space:  "foobar1"                           5   6 recover 0 -
[1 3 4 6]

DLM Lock Space:  "foobar2"                           7   8 recover 0 -
[1 3 4 6]

GFS Mount Group: "foobar0"                           4   5 recover 0 -
[1 3 4 6]

GFS Mount Group: "foobar1"                           6   7 recover 0 -
[1 3 4 6]

GFS Mount Group: "foobar2"                           8   9 recover 0 -
[1 3 4 6]



How reproducible:
Sometimes

Comment 1 David Teigland 2004-07-23 03:43:33 UTC
morph-06 is in recover state 2 for the fence domain which means
it's waiting for fenced or the agent to complete the fencing operation.
So, the fencing operation is stuck for some reason -- maybe the same
reason as bz 127021?

The other nodes in recover state 4 are waiting for morph-06 to finish
before doing anything else.

morph-01 and morph-05 are trying to join the fence domain but must wait
until the fence domain completes recovery.
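
The reading of the output above can be sketched in code. This is a minimal, hypothetical helper, not part of the cluster suite: it parses text in the layout shown in this report and maps the fence domain state to the meanings given in this comment. The function names and the regular expression are assumptions made for illustration, and only the states mentioned here (recover 2, recover 4, join) are covered.

```python
import re

# Matches one service line in the layout shown in this report, e.g.
#   Fence Domain:    "default"    1   2 recover 2 -
# The number after the state is optional (absent for "join").
SERVICE_RE = re.compile(
    r'^(?P<service>[A-Za-z ]+):\s+"(?P<name>[^"]+)"\s+'
    r'(?P<gid>\d+)\s+(?P<lid>\d+)\s+(?P<state>[a-z]+)(?:\s+(?P<step>\d+))?'
)


def parse_services(text):
    """Return one dict per service entry, with its member node list."""
    entries = []
    for line in text.splitlines():
        match = SERVICE_RE.match(line)
        if match:
            entry = match.groupdict()
            entry["members"] = []
            entries.append(entry)
        elif line.startswith("[") and entries:
            # Member line such as "[1 3 4 6]" belongs to the previous entry.
            entries[-1]["members"] = line.strip("[] ").split()
    return entries


def describe_fence_domain(entries):
    """Explain the fence domain state using only the meanings from this comment."""
    for entry in entries:
        if entry["service"] != "Fence Domain":
            continue
        if entry["state"] == "recover" and entry["step"] == "2":
            return "waiting for fenced or the fence agent to complete fencing"
        if entry["state"] == "recover" and entry["step"] == "4":
            return "waiting for another node to finish"
        if entry["state"] == "join":
            return "waiting to join until the fence domain completes recovery"
    return "unknown"


# Sample taken from the morph-06 output in the description.
SAMPLE = '''Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 2 -
[1 3 4 6]

DLM Lock Space:  "clvmd"                             2   3 recover 0 -
[1 4 3 6]
'''

if __name__ == "__main__":
    print(describe_fence_domain(parse_services(SAMPLE)))
```

Running the same check on every node would single out the one node in recover state 2 (morph-06 here) as the place to look for the stuck fencing operation.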

Comment 2 Corey Marthaler 2004-07-23 22:32:07 UTC
These nodes have your patch from yesterday, which I thought fixes 
127021; at least I no longer see the random fencing during startup. 
But I don't see the fence attempt during recovery either. 

Comment 3 David Teigland 2004-08-19 04:55:46 UTC
Do you still see this?  Everything in the original report looks ok --
as if a fencing operation is in progress on morph-06.

Comment 4 Corey Marthaler 2004-08-25 15:30:47 UTC
I do still see this problem of all services being stuck in the 
recovery state when one or many nodes are taken down and then 
brought back up.  
 
Also, I still never see any fence messages/attempts, which is 
apparently the reason this problem occurs. 

Comment 5 Corey Marthaler 2004-08-25 17:10:25 UTC
The perl Net-Telnet rpm needs to be installed.  
 
I forgot that there are no checks/warnings for perl being installed, 
since we are building everything. 
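
Since nothing warns about the missing dependency, a pre-flight check along these lines could catch it before a node is taken down. This is only a sketch: `perl_module_available` is a hypothetical helper name, and it relies on the standard `perl -MModule -e 1` idiom, which exits non-zero when the module cannot be loaded.

```python
import subprocess


def perl_module_available(module, perl="perl"):
    """Return True if the given perl interpreter can load the named module."""
    try:
        result = subprocess.run(
            [perl, "-M" + module, "-e", "1"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # perl itself is missing or cannot be executed.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if not perl_module_available("Net::Telnet"):
        print("warning: perl Net::Telnet is not installed")
```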

Comment 6 Kiersten (Kerri) Anderson 2004-11-16 19:13:44 UTC
Updating version to the right level in the defects.  Sorry for the storm.

