Description of problem:
I was running revolver with 3 gfs filesystems on top of 3 cmirrors. After many
iterations, gfs journal recovery got stuck waiting behind the cmirror recovery
taking place on one of the mirrors.

[Revolver output (after shooting taft-02)]
Verifying that recovery properly took place on the node(s) which stayed in the cluster
checking Fence recovery...
checking DLM recovery...
checking GFS recovery...
waited around 20 minutes... assuming that GFS recovery is hung

[root@taft-01 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           3   2 run       U-1,10,1
[4 3 2]
DLM Lock Space:  "clvmd"                            11   3 run       U-1,10,1
[4 3 2]
DLM Lock Space:  "clustered_log"                    15   4 run       -
[4 3 2]
DLM Lock Space:  "gfs1"                             16   6 run       -
[4 3 2]
DLM Lock Space:  "gfs2"                             18   8 run       -
[4 3 2]
DLM Lock Space:  "gfs3"                             20  10 run       -
[4 3 2]
GFS Mount Group: "gfs1"                             17   7 run       -
[4 3 2]
GFS Mount Group: "gfs2"                             19   9 recover 4 -
[4 3 2]
GFS Mount Group: "gfs3"                             21  11 run       -
[4 3 2]
User:            "usrm::manager"                    23   5 recover 0 -
[4 3 2]

[root@taft-02 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           3   2 join      S-4,4,1
[2 3 4 1]
DLM Lock Space:  "clvmd"                            11   3 join      S-4,4,1
[2 3 4 1]
User:            "usrm::manager"                     0   4 join      S-1,80,4
[]

[root@taft-03 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           3   2 run       U-1,10,1
[4 3 2]
DLM Lock Space:  "clvmd"                            11   3 run       U-1,10,1
[4 3 2]
DLM Lock Space:  "clustered_log"                    15   4 run       -
[4 3 2]
DLM Lock Space:  "gfs1"                             16   6 run       -
[4 3 2]
DLM Lock Space:  "gfs2"                             18   8 run       -
[4 3 2]
DLM Lock Space:  "gfs3"                             20  10 run       -
[4 3 2]
GFS Mount Group: "gfs1"                             17   7 run       -
[4 3 2]
GFS Mount Group: "gfs2"                             19   9 recover 4 -
[4 3 2]
GFS Mount Group: "gfs3"                             21  11 run       -
[4 3 2]
User:            "usrm::manager"                    23   5 recover 0 -
[4 3 2]

[root@taft-04 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           3   2 run       U-1,10,1
[4 3 2]
DLM Lock Space:  "clvmd"                            11   3 run       U-1,10,1
[4 3 2]
DLM Lock Space:  "clustered_log"                    15   4 run       -
[4 3 2]
DLM Lock Space:  "gfs1"                             16   6 run       -
[4 3 2]
DLM Lock Space:  "gfs2"                             18   8 run       -
[4 3 2]
DLM Lock Space:  "gfs3"                             20  10 run       -
[4 3 2]
GFS Mount Group: "gfs1"                             17   7 run       -
[4 3 2]
GFS Mount Group: "gfs2"                             19   9 recover 2 -
[4 3 2]
GFS Mount Group: "gfs3"                             21  11 run       -
[4 3 2]
User:            "usrm::manager"                    23   5 recover 0 -
[4 3 2]

[root@taft-01 ~]# ps aux | grep gfs
root      5564  0.0  0.0      0     0 ?        S    Apr03   0:16 [gfs_scand]
root      5565  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_glockd]
root      5572  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_recoverd]
root      5573  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_logd]
root      5574  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_quotad]
root      5575  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_inoded]
root      5594  0.0  0.0      0     0 ?        S    Apr03   0:16 [gfs_scand]
root      5595  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_glockd]
root      5596  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_recoverd]
root      5597  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_logd]
root      5598  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_quotad]
root      5599  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_inoded]
root      5625  0.0  0.0      0     0 ?        S    Apr03   0:17 [gfs_scand]
root      5626  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_glockd]
root      5627  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_recoverd]
root      5628  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_logd]
root      5629  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_quotad]
root      5630  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_inoded]
root     29628  0.0  0.0  51084   696 pts/0    S+   13:23   0:00 grep gfs

[root@taft-03 ~]# ps aux | grep gfs
root      5745  0.0  0.0      0     0 ?        S    Apr03   0:16 [gfs_scand]
root      5746  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_glockd]
root      5747  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_recoverd]
root      5748  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_logd]
root      5749  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_quotad]
root      5750  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_inoded]
root      5769  0.0  0.0      0     0 ?        S    Apr03   0:16 [gfs_scand]
root      5770  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_glockd]
root      5771  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_recoverd]
root      5772  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_logd]
root      5773  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_quotad]
root      5774  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_inoded]
root      5790  0.0  0.0      0     0 ?        S    Apr03   0:17 [gfs_scand]
root      5791  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_glockd]
root      5801  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_recoverd]
root      5802  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_logd]
root      5803  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_quotad]
root      5804  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_inoded]
root     30369  0.0  0.0  51104   696 pts/0    R+   13:23   0:00 grep gfs

[root@taft-04 ~]# ps aux | grep gfs
root      5590  0.0  0.0      0     0 ?        S    Apr03   0:17 [gfs_scand]
root      5591  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_glockd]
root      5592  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_recoverd]
root      5593  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_logd]
root      5594  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_quotad]
root      5595  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_inoded]
root      5620  0.0  0.0      0     0 ?        S    Apr03   0:17 [gfs_scand]
root      5621  0.0  0.0      0     0 ?        D    Apr03   0:00 [gfs_glockd]
root      5622  0.0  0.0      0     0 ?        D    Apr03   0:00 [gfs_recoverd]
root      5623  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_logd]
root      5624  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_quotad]
root      5625  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_inoded]
root      5644  0.0  0.0      0     0 ?        S    Apr03   0:16 [gfs_scand]
root      5645  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_glockd]
root      5653  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_recoverd]
root      5654  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_logd]
root      5655  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_quotad]
root      5656  0.0  0.0      0     0 ?        S    Apr03   0:00 [gfs_inoded]
root     30320  0.0  0.0  51088   696 pts/0    R+   13:23   0:00 grep gfs

Version-Release number of selected component (if applicable):
2.6.9-50.ELlargesmp
cman-kernel-2.6.9-49.2
dlm-kernel-2.6.9-46.14
cmirror-kernel-2.6.9-27.0
This problem is related to bug #235039 and bug #239040. The problem is that one machine might be caching clear region requests, one of which is the clear of the region that is being recovered. The mark can't happen while the region is being recovered, so it gets delayed. If the machine with the cached clear regions never flushes them, it can result in an indefinite hang - i.e. this bug. It's a somewhat rare condition. We'll need to work around the issue until the kernel is fixed.
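A minimal sketch of the hang as read from the comment above, assuming (as the comment implies) that recovery of the region cannot move on until the cached clear is actually delivered; all names here (struct region, try_mark, try_finish_recovery) are hypothetical and do not correspond to the real cmirror/dm-log symbols:

#include <stdio.h>

#define REGION_COUNT 8

struct region {
        int recovering;     /* log server is resyncing this region          */
        int clear_cached;   /* another node holds an unflushed clear for it */
        int mark_pending;   /* a writer's mark request is waiting           */
};

static struct region rgn[REGION_COUNT];

/* Writer path: a mark is delayed while the region is being recovered. */
static int try_mark(int r)
{
        if (rgn[r].recovering) {
                rgn[r].mark_pending = 1;   /* deferred until recovery ends */
                return 0;
        }
        return 1;
}

/* Recovery path: recovery of the region cannot finish while the clear
 * for it still sits, unflushed, in another node's cache. */
static int try_finish_recovery(int r)
{
        if (rgn[r].clear_cached)
                return 0;
        rgn[r].recovering = 0;
        return 1;
}

int main(void)
{
        int r = 3;

        rgn[r].recovering = 1;     /* region under recovery               */
        rgn[r].clear_cached = 1;   /* ...and its clear is never flushed   */

        if (!try_mark(r) && !try_finish_recovery(r))
                printf("region %d: mark and recovery both stuck\n", r);
        return 0;
}

Nothing in this state machine ever forces the caching node to flush, which is why the hang can persist indefinitely.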
Moved the recovery/write conflict checking from mark_region to flush. This should fix the issue that I see. If the mirror is getting into this state via a method I don't yet understand, I'll need to be more aggressive.
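A rough, purely illustrative sketch of that change, with hypothetical names and a trivial in-memory log; it is not the actual cmirror patch:

#include <stdio.h>

#define NR_REGIONS 8

struct log_state {
        int recovering[NR_REGIONS];   /* region currently being resynced   */
        int queued[NR_REGIONS];       /* mark requests waiting to be sent  */
        int marked[NR_REGIONS];       /* marks actually applied to the log */
};

static int region_recovering(struct log_state *ls, int r)
{
        return ls->recovering[r];
}

/* Before the workaround: the recovery/write conflict was checked as soon
 * as the region was marked, so the mark could be parked behind an
 * in-flight recovery (and, per the analysis above, stall indefinitely). */
static int mark_region_old(struct log_state *ls, int r)
{
        if (region_recovering(ls, r))
                return -1;            /* delayed until recovery completes */
        ls->marked[r] = 1;
        return 0;
}

/* After the workaround: marking only queues the request... */
static void mark_region_new(struct log_state *ls, int r)
{
        ls->queued[r] = 1;
}

/* ...and the conflict check is done at flush time, by which point any
 * cached clears travel with the flush and recovery can make progress. */
static void flush_requests(struct log_state *ls)
{
        for (int r = 0; r < NR_REGIONS; r++) {
                if (!ls->queued[r])
                        continue;
                if (region_recovering(ls, r))
                        continue;      /* re-checked on the next flush */
                ls->marked[r] = 1;
                ls->queued[r] = 0;
        }
}

int main(void)
{
        struct log_state ls = { { 0 } };

        ls.recovering[3] = 1;
        printf("old mark of region 3: %s\n",
               mark_region_old(&ls, 3) ? "delayed" : "applied");

        mark_region_new(&ls, 3);       /* just queued, nothing blocks      */
        ls.recovering[3] = 0;          /* recovery completes in the interim */
        flush_requests(&ls);
        printf("new path, region 3 marked after flush: %d\n", ls.marked[3]);
        return 0;
}

The point of deferring the check is simply that by the time a flush happens, the cached clears have been pushed out as well, so the conflict with recovery resolves instead of wedging.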
I guess I should have seen this before... This is a duplicate of 217438.

*** This bug has been marked as a duplicate of 217438 ***