Bug 130915 - dlm recovery not happening
Status: CLOSED WORKSFORME
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: gfs
Version: 4
Hardware: i686 Linux
Priority: high   Severity: medium
Assigned To: David Teigland
QA Contact: GFS Bugs
Depends On:
Blocks:
Reported: 2004-08-25 14:57 EDT by Corey Marthaler
Modified: 2010-01-11 21:57 EST
CC List: 1 user

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2005-01-31 18:11:20 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Corey Marthaler 2004-08-25 14:57:05 EDT
Description of problem: 
Basic recovery scenario: a healthy cluster running I/O. Two nodes 
(morph-02 and morph-03) are shot, and the attempt to bring them back 
into the cluster leaves the clvmd lock space stuck in the "join" or 
"update" state. 
 
 
morph-01: 
 
[root@morph-01 root]# cat /proc/cluster/services 
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[5 6 2 1 3 4]

DLM Lock Space:  "clvmd"                             2   3 update    U-4,1,3
[5 1 2 6 3]

DLM Lock Space:  "corey0"                            3   4 run       -
[5 6 2 1]

DLM Lock Space:  "corey1"                            5   6 run       -
[5 6 2 1]

DLM Lock Space:  "corey2"                            7   8 run       -
[5 6 2 1]

DLM Lock Space:  "corey3"                            9  10 run       -
[5 6 2 1]

GFS Mount Group: "corey0"                            4   5 run       -
[5 6 2 1]

GFS Mount Group: "corey1"                            6   7 run       -
[5 6 2 1]

GFS Mount Group: "corey2"                            8   9 run       -
[5 6 2 1]

GFS Mount Group: "corey3"                           10  11 run       -
[5 6 2 1]
 
 
morph-02: 
[root@morph-02 root]# cat /proc/cluster/services 
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 1 5 6 3 4]

DLM Lock Space:  "clvmd"                             2   3 join      S-6,20,5
[2 1 5 6 3]
 
 
morph-03: 
[root@morph-03 root]# cat /proc/cluster/services 
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3 5 6 4]

DLM Lock Space:  "clvmd"                             0   3 join      S-1,80,6
[]
 
 
morph-04: 
[root@morph-04 root]# cat /proc/cluster/services 
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[5 6 2 1 3 4]

DLM Lock Space:  "clvmd"                             2   3 update    U-4,1,3
[5 1 2 6 3]

DLM Lock Space:  "corey0"                            3   4 run       -
[5 6 2 1]

DLM Lock Space:  "corey1"                            5   6 run       -
[5 6 2 1]

DLM Lock Space:  "corey2"                            7   8 run       -
[5 6 2 1]

DLM Lock Space:  "corey3"                            9  10 run       -
[5 6 2 1]

GFS Mount Group: "corey0"                            4   5 run       -
[5 6 2 1]

GFS Mount Group: "corey1"                            6   7 run       -
[5 6 2 1]

GFS Mount Group: "corey2"                            8   9 run       -
[5 6 2 1]

GFS Mount Group: "corey3"                           10  11 run       -
[5 6 2 1]
 
 
morph-05: 
[root@morph-05 root]# cat /proc/cluster/services 
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 5 6 1 3 4]

DLM Lock Space:  "clvmd"                             2   3 update    U-4,1,3
[1 2 5 6 3]

DLM Lock Space:  "corey0"                            3   4 run       -
[2 5 6 1]

DLM Lock Space:  "corey1"                            5   6 run       -
[2 5 6 1]

DLM Lock Space:  "corey2"                            7   8 run       -
[2 5 6 1]

DLM Lock Space:  "corey3"                            9  10 run       -
[2 5 6 1]

GFS Mount Group: "corey0"                            4   5 run       -
[2 5 6 1]

GFS Mount Group: "corey1"                            6   7 run       -
[2 5 6 1]

GFS Mount Group: "corey2"                            8   9 run       -
[2 5 6 1]

GFS Mount Group: "corey3"                           10  11 run       -
[2 5 6 1]
 
 
morph-06: 
[root@morph-06 root]# cat /proc/cluster/services 
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 5 6 3 4]

DLM Lock Space:  "clvmd"                             2   3 update    U-4,1,3
[1 5 2 6 3]

DLM Lock Space:  "corey0"                            3   4 run       -
[1 2 5 6]

DLM Lock Space:  "corey1"                            5   6 run       -
[1 2 5 6]

DLM Lock Space:  "corey2"                            7   8 run       -
[1 2 5 6]

DLM Lock Space:  "corey3"                            9  10 run       -
[1 2 5 6]

GFS Mount Group: "corey0"                            4   5 run       -
[1 2 5 6]

GFS Mount Group: "corey1"                            6   7 run       -
[1 2 5 6]

GFS Mount Group: "corey2"                            8   9 run       -
[1 2 5 6]

GFS Mount Group: "corey3"                           10  11 run       -
[1 2 5 6]
 
 
How reproducible: 
Sometimes
Comment 1 Corey Marthaler 2004-08-31 14:57:48 EDT
Reproduced again today.
Comment 2 Corey Marthaler 2004-09-01 12:15:15 EDT
I have also seen the GFS mount group services get stuck in the
update/join state.

[root@morph-05 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[5 4 2 1 6 3]

DLM Lock Space:  "clvmd"                             2   3 run       -
[5 4 2 1 6 3]

DLM Lock Space:  "corey0"                            3   4 run       -
[5 4 2 1 6 3]

GFS Mount Group: "corey0"                            4   5 join     
S-6,20,6
[5 4 2 1 6 3]


All others in cluster:

[root@morph-04 tmp]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[5 1 4 6 2 3]

DLM Lock Space:  "clvmd"                             2   3 run       -
[1 5 4 6 2 3]

DLM Lock Space:  "corey0"                            3   4 run       -
[5 1 4 6 2 3]

DLM Lock Space:  "corey1"                            5   6 run       -
[5 1 4 6 2]

GFS Mount Group: "corey0"                            4   5 update   
U-4,1,3
[5 1 4 6 2 3]

GFS Mount Group: "corey1"                            6   7 run       -
[5 1 4 6 2]
Comment 3 Corey Marthaler 2004-09-01 12:16:45 EDT
Bumping priority because all of my recovery testing eventually hits this
bug and doesn't get very far.
Comment 4 Kiersten (Kerri) Anderson 2004-09-01 12:25:54 EDT
Assigning this to Dave and Patrick, since it seems like it is probably
infrastructure, not the clients.
Comment 5 Christine Caulfield 2004-09-13 08:08:12 EDT
I think this should be fixed by the checkin I did to cman late last week. 

The symptom was that if two nodes died at the same time, only one of them
was reported to the DLM, so the TCP connection to the other node never got
severed.
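
For illustration only, a minimal userspace sketch of that behaviour; the
names here (struct transition, dlm_notify_node_down) are hypothetical and
are not the real cman/dlm interfaces. The point is simply that every node
that died in a transition has to be reported to the DLM, not just the first:

#include <stdio.h>

#define MAX_NODES 16

/* One cluster membership transition: the set of nodes that went away. */
struct transition {
        int dead_nodes[MAX_NODES];
        int num_dead;
};

/* Hypothetical hook: tell the DLM a node is gone so it can sever the
   TCP connection and trigger lockspace recovery for that node. */
static void dlm_notify_node_down(int nodeid)
{
        printf("notify DLM: node %d down\n", nodeid);
}

/* Old behaviour: only one death was reported per transition. */
static void notify_dlm_old(const struct transition *t)
{
        if (t->num_dead > 0)
                dlm_notify_node_down(t->dead_nodes[0]);
}

/* Fixed behaviour: report every node that died in the transition. */
static void notify_dlm_fixed(const struct transition *t)
{
        for (int i = 0; i < t->num_dead; i++)
                dlm_notify_node_down(t->dead_nodes[i]);
}

int main(void)
{
        /* Two nodes dying at the same time (arbitrary node IDs). */
        struct transition t = { .dead_nodes = { 3, 4 }, .num_dead = 2 };

        notify_dlm_old(&t);    /* only the first death reaches the DLM */
        notify_dlm_fixed(&t);  /* both deaths reach the DLM */
        return 0;
}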
Comment 6 Corey Marthaler 2004-09-15 18:00:52 EDT
Hit this exact same scenario with the latest code: 
 
[root@morph-01 tmp]# cat /proc/cluster/services 
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 3 4 5 2 6]

DLM Lock Space:  "clvmd"                             2   3 update    U-4,1,6
[1 3 4 5 2 6]

DLM Lock Space:  "corey0"                            3   4 run       -
[1 3 4 5 2]

DLM Lock Space:  "corey1"                            5   6 run       -
[1 3 4 5 2]

DLM Lock Space:  "corey2"                            7   8 run       -
[1 3 4 5 2]

GFS Mount Group: "corey0"                            4   5 run       -
[1 3 4 5 2]

GFS Mount Group: "corey1"                            6   7 run       -
[1 3 4 5 2]

GFS Mount Group: "corey2"                            8   9 run       -
[1 3 4 5 2]
 
 
[root@morph-06 root]# cat /proc/cluster/services 
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[3 2 1 4 5 6]

DLM Lock Space:  "clvmd"                             2   3 join      S-6,20,6
[3 2 1 4 5 6]
 
 
Comment 7 Kiersten (Kerri) Anderson 2004-11-16 14:03:21 EST
Updating version to the right level in the defects.  Sorry for the storm.
Comment 8 David Teigland 2005-01-04 04:03:42 EST
These cases where lockspace recovery wasn't happening are quite
possibly the effect of the dlm_recoverd thread problems we were
having.  The wakeup to the recovery thread was being missed, so
no recovery would ever happen (or, later with dynamically created
threads, no new recovery thread would be created to do the recovery).

If this was the problem, it should now be fixed.  More info on the state
of the dlm kernel threads would be needed if this is seen again.

I'm dismissing the comment about mount-group recovery since that 
would be a different and unrelated problem.
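
To make the missed-wakeup failure mode concrete, here is a small
self-contained pthread sketch; it is not the dlm_recoverd code, and the
names (recovery_pending, request_recovery, recoverd) are made up. Checking
the "work to do" flag under the same mutex the waker holds is what keeps a
wakeup from being lost between the check and the sleep:

#include <pthread.h>
#include <stdio.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wake = PTHREAD_COND_INITIALIZER;
static bool recovery_pending = false;   /* "work to do" flag */
static bool shutting_down = false;

/* Caller side: flag that recovery is needed and wake the thread.
   Because the flag is set under the mutex, the wakeup cannot be lost. */
static void request_recovery(void)
{
        pthread_mutex_lock(&lock);
        recovery_pending = true;
        pthread_cond_signal(&wake);
        pthread_mutex_unlock(&lock);
}

/* Recovery thread: sleep only while there is genuinely nothing to do.
   Re-checking recovery_pending around pthread_cond_wait() is what
   prevents the "wakeup arrived before we slept" race. */
static void *recoverd(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!shutting_down) {
                while (!recovery_pending && !shutting_down)
                        pthread_cond_wait(&wake, &lock);
                if (recovery_pending) {
                        recovery_pending = false;
                        pthread_mutex_unlock(&lock);
                        printf("running lockspace recovery\n");
                        pthread_mutex_lock(&lock);
                }
        }
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t t;
        pthread_create(&t, NULL, recoverd, NULL);
        request_recovery();            /* never lost, even if sent early */

        pthread_mutex_lock(&lock);     /* tell the thread to exit */
        shutting_down = true;
        pthread_cond_signal(&wake);
        pthread_mutex_unlock(&lock);
        pthread_join(t, NULL);
        return 0;
}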
Comment 9 Corey Marthaler 2005-01-06 14:50:48 EST
FYI, the comment about mount-group services getting stuck is tracked in bz 143269.
Comment 10 Corey Marthaler 2005-01-06 15:18:06 EST
Hmmm, it must automatically move from NEEDINFO to ASSIGNED regardless
of whether info was given; moving back to NEEDINFO.
Comment 11 Corey Marthaler 2005-01-31 18:11:20 EST
This bug seemed to be specifically about recovery problems with the
DLM/clvmd service, and I have not seen those in a long time. Recovery
problems with any other DLM/GFS services should be documented in
bug 145683.
