Bug 128432 - cluster services get stuck in recovery state
Status: CLOSED NOTABUG
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: gfs
Version: 4
Hardware: i686 Linux
Priority: medium
Severity: medium
Assigned To: David Teigland
QA Contact: Cluster QE
Reported: 2004-07-22 16:56 EDT by Corey Marthaler
Modified: 2010-01-11 21:54 EST (History)
Doc Type: Bug Fix
Last Closed: 2004-08-25 13:10:25 EDT

Description Corey Marthaler 2004-07-22 16:57:00 EDT
Description of problem:
After nodes go down and an attempt is made to bring them back into the
cluster, the nodes that stayed up have their services stuck in the
recovery state.

morph-01 and morph-05 were the nodes which were shot.


[root@morph-01 root]# cat /proc/cluster/services

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           0   2 join      S-1,80,6
[]

DLM Lock Space:  "clvmd"                             0   3 join      S-1,80,6
[]

[root@morph-02 root]# cat /proc/cluster/services

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 4 -
[6 4 3 1]

DLM Lock Space:  "clvmd"                             2   3 recover 0 -
[4 3 1 6]

DLM Lock Space:  "foobar0"                           3   4 recover 0 -
[6 4 3 1]

DLM Lock Space:  "foobar1"                           5   6 recover 0 -
[6 4 3 1]

DLM Lock Space:  "foobar2"                           7   8 recover 0 -
[6 4 3 1]

GFS Mount Group: "foobar0"                           4   5 recover 0 -
[6 4 3 1]

GFS Mount Group: "foobar1"                           6   7 recover 0 -
[6 4 3 1]

GFS Mount Group: "foobar2"                           8   9 recover 0 -
[6 4 3 1]


[root@morph-03 root]# cat /proc/cluster/services

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 4 -
[4 6 3 1]

DLM Lock Space:  "clvmd"                             2   3 recover 0 -
[4 1 3 6]

DLM Lock Space:  "foobar0"                           3   4 recover 0 -
[4 6 3 1]

DLM Lock Space:  "foobar1"                           5   6 recover 0 -
[4 6 3 1]

DLM Lock Space:  "foobar2"                           7   8 recover 0 -
[4 6 3 1]

GFS Mount Group: "foobar0"                           4   5 recover 0 -
[4 6 3 1]

GFS Mount Group: "foobar1"                           6   7 recover 0 -
[4 6 3 1]

GFS Mount Group: "foobar2"                           8   9 recover 0 -
[4 6 3 1]



[root@morph-04 root]# cat /proc/cluster/services

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 4 -
[3 4 6 1]

DLM Lock Space:  "clvmd"                             2   3 recover 0 -
[1 3 4 6]

DLM Lock Space:  "foobar0"                           3   4 recover 0 -
[3 4 6 1]

DLM Lock Space:  "foobar1"                           5   6 recover 0 -
[3 4 6 1]

DLM Lock Space:  "foobar2"                           7   8 recover 0 -
[3 4 6 1]

GFS Mount Group: "foobar0"                           4   5 recover 0 -
[3 4 6 1]

GFS Mount Group: "foobar1"                           6   7 recover 0 -
[3 4 6 1]

GFS Mount Group: "foobar2"                           8   9 recover 0 -
[3 4 6 1]


[root@morph-05 root]# cat /proc/cluster/services

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           0   2 join      S-1,80,6
[]

DLM Lock Space:  "clvmd"                             0   3 join      S-1,280,6
[]


[root@morph-06 root]# cat /proc/cluster/services

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 2 -
[1 3 4 6]

DLM Lock Space:  "clvmd"                             2   3 recover 0 -
[1 4 3 6]

DLM Lock Space:  "foobar0"                           3   4 recover 0 -
[1 3 4 6]

DLM Lock Space:  "foobar1"                           5   6 recover 0 -
[1 3 4 6]

DLM Lock Space:  "foobar2"                           7   8 recover 0 -
[1 3 4 6]

GFS Mount Group: "foobar0"                           4   5 recover 0 -
[1 3 4 6]

GFS Mount Group: "foobar1"                           6   7 recover 0 -
[1 3 4 6]

GFS Mount Group: "foobar2"                           8   9 recover 0 -
[1 3 4 6]



How reproducible:
Sometimes
Comment 1 David Teigland 2004-07-22 23:43:33 EDT
morph-06 is in recover state 2 for the fence domain which means
it's waiting for fenced or the agent to complete the fencing operation.
So, the fencing operation is stuck for some reason -- maybe the same
reason as bz 127021?

The other nodes in recover state 4 are waiting for morph-06 to finish
before doing anything else.

morph-01 and morph-05 are trying to join the fence domain but must wait
until the fence domain completes recovery.
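
The reading of the services output described in this comment can be sketched in code. This is a minimal illustration, not part of the cluster tooling: the function names (parse_services, describe_fence_state) and the regular expression are hypothetical, the row layout is assumed from the output pasted in the description, and only recover states 2 and 4 are given meanings because those are the only ones explained here.

```python
import re

# Meanings taken from the comment above; other state numbers are not
# described in this report, so they are deliberately left out.
FENCE_RECOVER_MEANING = {
    2: "waiting for fenced or the agent to complete the fencing operation",
    4: "waiting for another node to finish its recovery step",
}

# Assumed row layout: <Type>: "<name>" <GID> <LID> <state> [<state number>] ...
ROW = re.compile(
    r'^(?P<type>[^:]+):\s+"(?P<name>[^"]+)"\s+(?P<gid>\d+)\s+(?P<lid>\d+)\s+'
    r'(?P<state>[a-z]+)(?:\s+(?P<num>\d+))?'
)


def parse_services(text):
    """Return one dict per service row; header and member-list lines are skipped."""
    services = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            num = match.group("num")
            services.append({
                "type": match.group("type"),
                "name": match.group("name"),
                "state": match.group("state"),
                "num": int(num) if num is not None else None,
            })
    return services


def describe_fence_state(text):
    """Explain the fence domain's recover state, or return None if it is not recovering."""
    for svc in parse_services(text):
        if svc["type"] == "Fence Domain" and svc["state"] == "recover":
            return FENCE_RECOVER_MEANING.get(
                svc["num"], "recover state not described in this report"
            )
    return None
```

Feeding it the contents of /proc/cluster/services from each node would show which node is waiting on the fencing operation (state 2) and which nodes are only waiting on that node (state 4).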
Comment 2 Corey Marthaler 2004-07-23 18:32:07 EDT
These nodes have your patch from yesterday, which I thought fixed
127021; at least I no longer see the random fencing during startup.
But I also don't see a fence attempt during recovery.
Comment 3 David Teigland 2004-08-19 00:55:46 EDT
Do you still see this? Everything in the original report looks OK,
as if a fencing operation is in progress on morph-06.
Comment 4 Corey Marthaler 2004-08-25 11:30:47 EDT
I do still see this problem of all services being stuck in the
recovery state when one or more nodes are taken down and then
brought back up.

Also, I still never see any fence messages or attempts, which is
apparently the reason this problem occurs.
Comment 5 Corey Marthaler 2004-08-25 13:10:25 EDT
The perl Net-Telnet rpm needs to be installed.

I forgot that there are no checks or warnings for perl being installed,
since we are building everything.
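
Since nothing in the build checked for this dependency, a preflight check along these lines could surface the problem before a fencing operation is needed. This is only a sketch: the function name perl_module_available is hypothetical, and it relies on the standard behavior that `perl -M<module> -e 1` exits non-zero when the module cannot be loaded. It is not something the cluster packages are known to ship.

```python
import subprocess


def perl_module_available(module, run=subprocess.run):
    """Return True if `perl -M<module> -e 1` exits with status 0.

    `run` is injectable so the check can be exercised without perl installed.
    """
    try:
        result = run(
            ["perl", f"-M{module}", "-e", "1"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        # perl itself is missing from PATH
        return False
    return result.returncode == 0
```

Calling perl_module_available("Net::Telnet") on each node before starting the cluster would give an early warning when the module (or perl itself) is absent.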
Comment 6 Kiersten (Kerri) Anderson 2004-11-16 14:13:44 EST
Updating version to the right level in the defects.  Sorry for the storm.
