Description of problem: basic recovery scenario on a healthy cluster running I/O. Two nodes are shot (morph-02 and morph-03), and the attempt to bring them back into the cluster leaves clvmd stuck in the "join" or "update" state.

morph-01:
[root@morph-01 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[5 6 2 1 3 4]

DLM Lock Space:  "clvmd"                             2   3 update    U-4,1,3
[5 1 2 6 3]

DLM Lock Space:  "corey0"                            3   4 run       -
[5 6 2 1]

DLM Lock Space:  "corey1"                            5   6 run       -
[5 6 2 1]

DLM Lock Space:  "corey2"                            7   8 run       -
[5 6 2 1]

DLM Lock Space:  "corey3"                            9  10 run       -
[5 6 2 1]

GFS Mount Group: "corey0"                            4   5 run       -
[5 6 2 1]

GFS Mount Group: "corey1"                            6   7 run       -
[5 6 2 1]

GFS Mount Group: "corey2"                            8   9 run       -
[5 6 2 1]

GFS Mount Group: "corey3"                           10  11 run       -
[5 6 2 1]

morph-02:
[root@morph-02 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 1 5 6 3 4]

DLM Lock Space:  "clvmd"                             2   3 join      S-6,20,5
[2 1 5 6 3]

morph-03:
[root@morph-03 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3 5 6 4]

DLM Lock Space:  "clvmd"                             0   3 join      S-1,80,6
[]

morph-04:
[root@morph-04 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[5 6 2 1 3 4]

DLM Lock Space:  "clvmd"                             2   3 update    U-4,1,3
[5 1 2 6 3]

DLM Lock Space:  "corey0"                            3   4 run       -
[5 6 2 1]

DLM Lock Space:  "corey1"                            5   6 run       -
[5 6 2 1]

DLM Lock Space:  "corey2"                            7   8 run       -
[5 6 2 1]

DLM Lock Space:  "corey3"                            9  10 run       -
[5 6 2 1]

GFS Mount Group: "corey0"                            4   5 run       -
[5 6 2 1]

GFS Mount Group: "corey1"                            6   7 run       -
[5 6 2 1]

GFS Mount Group: "corey2"                            8   9 run       -
[5 6 2 1]

GFS Mount Group: "corey3"                           10  11 run       -
[5 6 2 1]

morph-05:
[root@morph-05 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 5 6 1 3 4]

DLM Lock Space:  "clvmd"                             2   3 update    U-4,1,3
[1 2 5 6 3]

DLM Lock Space:  "corey0"                            3   4 run       -
[2 5 6 1]

DLM Lock Space:  "corey1"                            5   6 run       -
[2 5 6 1]

DLM Lock Space:  "corey2"                            7   8 run       -
[2 5 6 1]

DLM Lock Space:  "corey3"                            9  10 run       -
[2 5 6 1]

GFS Mount Group: "corey0"                            4   5 run       -
[2 5 6 1]

GFS Mount Group: "corey1"                            6   7 run       -
[2 5 6 1]

GFS Mount Group: "corey2"                            8   9 run       -
[2 5 6 1]

GFS Mount Group: "corey3"                           10  11 run       -
[2 5 6 1]

morph-06:
[root@morph-06 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 5 6 3 4]

DLM Lock Space:  "clvmd"                             2   3 update    U-4,1,3
[1 5 2 6 3]

DLM Lock Space:  "corey0"                            3   4 run       -
[1 2 5 6]

DLM Lock Space:  "corey1"                            5   6 run       -
[1 2 5 6]

DLM Lock Space:  "corey2"                            7   8 run       -
[1 2 5 6]

DLM Lock Space:  "corey3"                            9  10 run       -
[1 2 5 6]

GFS Mount Group: "corey0"                            4   5 run       -
[1 2 5 6]

GFS Mount Group: "corey1"                            6   7 run       -
[1 2 5 6]

GFS Mount Group: "corey2"                            8   9 run       -
[1 2 5 6]

GFS Mount Group: "corey3"                           10  11 run       -
[1 2 5 6]

How reproducible:
Sometimes
Reproduced again today.
I have also seen the GFS mount group services get stuck in the update/join state.

[root@morph-05 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[5 4 2 1 6 3]

DLM Lock Space:  "clvmd"                             2   3 run       -
[5 4 2 1 6 3]

DLM Lock Space:  "corey0"                            3   4 run       -
[5 4 2 1 6 3]

GFS Mount Group: "corey0"                            4   5 join      S-6,20,6
[5 4 2 1 6 3]

All others in cluster:

[root@morph-04 tmp]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[5 1 4 6 2 3]

DLM Lock Space:  "clvmd"                             2   3 run       -
[1 5 4 6 2 3]

DLM Lock Space:  "corey0"                            3   4 run       -
[5 1 4 6 2 3]

DLM Lock Space:  "corey1"                            5   6 run       -
[5 1 4 6 2]

GFS Mount Group: "corey0"                            4   5 update    U-4,1,3
[5 1 4 6 2 3]

GFS Mount Group: "corey1"                            6   7 run       -
[5 1 4 6 2]
Bumping priority because all my recovery testing eventually hits this bug and doesn't get very far.
Assigning this to Dave and Patrick, since it is probably an infrastructure problem rather than one in the client services.
I think this should be fixed by the checkin I did to cman late last week. The symptom was that if two nodes died at the same time, only one of them was reported to the DLM, so the TCP connection to the other was never severed.
Hit this exact same scenario with the latest code:

[root@morph-01 tmp]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 3 4 5 2 6]

DLM Lock Space:  "clvmd"                             2   3 update    U-4,1,6
[1 3 4 5 2 6]

DLM Lock Space:  "corey0"                            3   4 run       -
[1 3 4 5 2]

DLM Lock Space:  "corey1"                            5   6 run       -
[1 3 4 5 2]

DLM Lock Space:  "corey2"                            7   8 run       -
[1 3 4 5 2]

GFS Mount Group: "corey0"                            4   5 run       -
[1 3 4 5 2]

GFS Mount Group: "corey1"                            6   7 run       -
[1 3 4 5 2]

GFS Mount Group: "corey2"                            8   9 run       -
[1 3 4 5 2]

[root@morph-06 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[3 2 1 4 5 6]

DLM Lock Space:  "clvmd"                             2   3 join      S-6,20,6
[3 2 1 4 5 6]
Updating version to the right level in the defects. Sorry for the storm.
These cases where lockspace recovery wasn't happening are quite possibly an effect of the dlm_recoverd thread problems we were having. The wakeup to the recovery thread was being missed, so recovery never ran (or, later, with dynamically created threads, no new recovery thread was created to do the recovery). If that was the problem, it should now be fixed. If this is seen again, we will need more information on the state of the dlm kernel threads. I'm dismissing the comment about mount-group recovery, since that would be a different and unrelated problem.
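For anyone who hits this again, a rough sketch of how to capture that thread state on the stuck node (the grep patterns and log-dump step are illustrative, not an official procedure; the sysrq task dump needs root):

```shell
# 1. Are the dlm kernel threads (dlm_recoverd etc.) alive, and in what
#    scheduler state?  D = uninterruptible sleep, S = sleeping, R = running.
ps -eo pid,stat,wchan,comm | grep dlm || echo "no dlm threads found"

# 2. Dump kernel stacks of all tasks to the kernel log, so we can see
#    exactly where the recovery thread is blocked (requires root/sysrq).
if [ -w /proc/sysrq-trigger ]; then
    echo t > /proc/sysrq-trigger
    dmesg | grep -A 8 dlm_recoverd || true
else
    echo "need root for a sysrq task dump" >&2
fi
```

Attaching that output plus /proc/cluster/services from each node should be enough to tell whether dlm_recoverd is missing, sleeping, or wedged.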
FYI, the comment about mount-group services getting stuck is tracked as bz143269.
Hmmm, it must automatically move from NEEDINFO to ASSIGNED regardless of the info given; moving back to NEEDINFO.
This bug seemed to be specifically about recovery problems with the DLM/clvmd service, and I have not seen those in a long time. Recovery problems with any other DLM/GFS services should be documented in bug 145683.