Bug 235252 - cmirror synchronization deadlocked waiting for response from server
Status: CLOSED DUPLICATE of bug 217438
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: cmirror
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assigned To: Jonathan Earl Brassow
QA Contact: Cluster QE
Depends On:
Blocks:
Reported: 2007-04-04 14:26 EDT by Corey Marthaler
Modified: 2010-01-11 21:03 EST
CC List: 5 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2007-04-05 10:35:54 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments: None
Description Corey Marthaler 2007-04-04 14:26:11 EDT
Description of problem:
I was running revolver with three GFS filesystems on top of three cmirrors. After many
iterations, GFS journal recovery got stuck waiting for cmirror recovery to complete
on one of the mirrors.


[Revolver output (after shooting taft-02)]
Verifying that recovery properly took place on the node(s) which stayed in the
cluster
checking Fence recovery...
checking DLM recovery...
checking GFS recovery...
waited around 20 minutes...
assuming that GFS recovery is hung

[root@taft-01 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           3   2 run       U-1,10,1
[4 3 2]

DLM Lock Space:  "clvmd"                            11   3 run       U-1,10,1
[4 3 2]

DLM Lock Space:  "clustered_log"                    15   4 run       -
[4 3 2]

DLM Lock Space:  "gfs1"                             16   6 run       -
[4 3 2]

DLM Lock Space:  "gfs2"                             18   8 run       -
[4 3 2]

DLM Lock Space:  "gfs3"                             20  10 run       -
[4 3 2]

GFS Mount Group: "gfs1"                             17   7 run       -
[4 3 2]

GFS Mount Group: "gfs2"                             19   9 recover 4 -
[4 3 2]

GFS Mount Group: "gfs3"                             21  11 run       -
[4 3 2]

User:            "usrm::manager"                    23   5 recover 0 -
[4 3 2]


[root@taft-02 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           3   2 join      S-4,4,1
[2 3 4 1]

DLM Lock Space:  "clvmd"                            11   3 join      S-4,4,1
[2 3 4 1]

User:            "usrm::manager"                     0   4 join      S-1,80,4
[]



[root@taft-03 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           3   2 run       U-1,10,1
[4 3 2]

DLM Lock Space:  "clvmd"                            11   3 run       U-1,10,1
[4 3 2]

DLM Lock Space:  "clustered_log"                    15   4 run       -
[4 3 2]

DLM Lock Space:  "gfs1"                             16   6 run       -
[4 3 2]

DLM Lock Space:  "gfs2"                             18   8 run       -
[4 3 2]

DLM Lock Space:  "gfs3"                             20  10 run       -
[4 3 2]

GFS Mount Group: "gfs1"                             17   7 run       -
[4 3 2]

GFS Mount Group: "gfs2"                             19   9 recover 4 -
[4 3 2]

GFS Mount Group: "gfs3"                             21  11 run       -
[4 3 2]

User:            "usrm::manager"                    23   5 recover 0 -
[4 3 2]



[root@taft-04 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           3   2 run       U-1,10,1
[4 3 2]

DLM Lock Space:  "clvmd"                            11   3 run       U-1,10,1
[4 3 2]

DLM Lock Space:  "clustered_log"                    15   4 run       -
[4 3 2]

DLM Lock Space:  "gfs1"                             16   6 run       -
[4 3 2]

DLM Lock Space:  "gfs2"                             18   8 run       -
[4 3 2]

DLM Lock Space:  "gfs3"                             20  10 run       -
[4 3 2]

GFS Mount Group: "gfs1"                             17   7 run       -
[4 3 2]

GFS Mount Group: "gfs2"                             19   9 recover 2 -
[4 3 2]

GFS Mount Group: "gfs3"                             21  11 run       -
[4 3 2]

User:            "usrm::manager"                    23   5 recover 0 -
[4 3 2]



[root@taft-01 ~]# ps aux | grep gfs
root      5564  0.0  0.0     0    0 ?        S    Apr03   0:16 [gfs_scand]
root      5565  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_glockd]
root      5572  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_recoverd]
root      5573  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_logd]
root      5574  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_quotad]
root      5575  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_inoded]
root      5594  0.0  0.0     0    0 ?        S    Apr03   0:16 [gfs_scand]
root      5595  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_glockd]
root      5596  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_recoverd]
root      5597  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_logd]
root      5598  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_quotad]
root      5599  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_inoded]
root      5625  0.0  0.0     0    0 ?        S    Apr03   0:17 [gfs_scand]
root      5626  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_glockd]
root      5627  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_recoverd]
root      5628  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_logd]
root      5629  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_quotad]
root      5630  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_inoded]
root     29628  0.0  0.0 51084  696 pts/0    S+   13:23   0:00 grep gfs



[root@taft-03 ~]# ps aux | grep gfs
root      5745  0.0  0.0     0    0 ?        S    Apr03   0:16 [gfs_scand]
root      5746  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_glockd]
root      5747  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_recoverd]
root      5748  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_logd]
root      5749  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_quotad]
root      5750  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_inoded]
root      5769  0.0  0.0     0    0 ?        S    Apr03   0:16 [gfs_scand]
root      5770  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_glockd]
root      5771  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_recoverd]
root      5772  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_logd]
root      5773  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_quotad]
root      5774  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_inoded]
root      5790  0.0  0.0     0    0 ?        S    Apr03   0:17 [gfs_scand]
root      5791  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_glockd]
root      5801  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_recoverd]
root      5802  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_logd]
root      5803  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_quotad]
root      5804  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_inoded]
root     30369  0.0  0.0 51104  696 pts/0    R+   13:23   0:00 grep gfs



[root@taft-04 ~]# ps aux | grep gfs
root      5590  0.0  0.0     0    0 ?        S    Apr03   0:17 [gfs_scand]
root      5591  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_glockd]
root      5592  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_recoverd]
root      5593  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_logd]
root      5594  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_quotad]
root      5595  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_inoded]
root      5620  0.0  0.0     0    0 ?        S    Apr03   0:17 [gfs_scand]
root      5621  0.0  0.0     0    0 ?        D    Apr03   0:00 [gfs_glockd]
root      5622  0.0  0.0     0    0 ?        D    Apr03   0:00 [gfs_recoverd]
root      5623  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_logd]
root      5624  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_quotad]
root      5625  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_inoded]
root      5644  0.0  0.0     0    0 ?        S    Apr03   0:16 [gfs_scand]
root      5645  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_glockd]
root      5653  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_recoverd]
root      5654  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_logd]
root      5655  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_quotad]
root      5656  0.0  0.0     0    0 ?        S    Apr03   0:00 [gfs_inoded]
root     30320  0.0  0.0 51088  696 pts/0    R+   13:23   0:00 grep gfs




Version-Release number of selected component (if applicable):
2.6.9-50.ELlargesmp
cman-kernel-2.6.9-49.2
dlm-kernel-2.6.9-46.14
cmirror-kernel-2.6.9-27.0
Comment 1 Jonathan Earl Brassow 2007-04-04 15:47:04 EDT
This problem has to do with bug #235039 and bug #239040.

The problem is that one machine might be caching clear-region requests, one of
which is the clear of the region currently being recovered.  A mark of that region
can't happen while the region is being recovered, so it gets delayed.  If the
machine holding the cached clear requests never flushes them, the result is an
indefinite hang, i.e. this bug.

This is a somewhat rare condition.  We'll need to work around the issue until the
kernel is fixed.
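
To illustrate the scenario described above, here is a minimal, hypothetical C model of the hang; the types and names are invented for illustration and are not the actual cmirror code. A mark of a region cannot be granted while that region is under recovery, and recovery cannot finish while another node sits on an unflushed clear for it, so the mark (and the GFS journal replay queued behind it) waits forever.

/*
 * Hypothetical, simplified model of the hang described above; the
 * structures and names here are invented and are not the actual
 * cmirror code.
 */
#include <stdbool.h>
#include <stdio.h>

struct region_state {
    bool under_recovery;   /* the log server is resyncing this region */
    bool clear_cached;     /* another node holds an unflushed clear   */
};

/* Server-side check: a mark of a region is delayed while that region
 * is being recovered. */
static bool mark_allowed(const struct region_state *r)
{
    return !r->under_recovery;
}

int main(void)
{
    struct region_state r = { .under_recovery = true, .clear_cached = true };

    /* The node caching the clear never flushes it, so recovery of the
     * region never completes, mark_allowed() never becomes true, and
     * the write (plus the GFS journal replay behind it) hangs. */
    if (r.clear_cached && !mark_allowed(&r))
        puts("mark of recovering region blocked indefinitely (this bug)");
    return 0;
}
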
Comment 2 Jonathan Earl Brassow 2007-04-04 17:39:26 EDT
Moved recovery/write conflict checking from mark_region to flush.

This should fix the issue that I see.  If the mirror is getting into this state
via a method I don't yet understand, I'll need to be more aggressive.
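
As a rough illustration of that workaround direction, here is a hypothetical sketch in C; the queue, structures, and function names are invented for illustration and are not the actual cmirror patch. The point is only that the recovery/write conflict is evaluated when the queued requests are flushed as a batch, rather than as each mark_region request arrives.

/*
 * Hypothetical sketch of checking the recovery/write conflict at flush
 * time instead of in mark_region; not the actual cmirror code.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_PENDING 16
#define NUM_REGIONS 8

struct request { int region; bool is_mark; };

struct log_client {
    struct request queue[MAX_PENDING];
    int            n_pending;
    bool           recovering[NUM_REGIONS];  /* per-region recovery flag */
};

/* mark_region only queues the request; no conflict check happens here. */
static void mark_region(struct log_client *c, int region)
{
    c->queue[c->n_pending++] = (struct request){ region, true };
}

/* The conflict check happens at flush time, over the whole batch, so a
 * blocked mark is always handled together with any queued clears. */
static void flush(struct log_client *c)
{
    for (int i = 0; i < c->n_pending; i++) {
        struct request *r = &c->queue[i];
        if (r->is_mark && c->recovering[r->region]) {
            printf("mark of region %d deferred until recovery completes\n",
                   r->region);
            continue;   /* this toy model simply skips the deferred mark */
        }
        printf("%s of region %d sent to the server\n",
               r->is_mark ? "mark" : "clear", r->region);
    }
    c->n_pending = 0;
}

int main(void)
{
    struct log_client c = { .n_pending = 0 };
    c.recovering[3] = true;           /* region 3 is being recovered */

    mark_region(&c, 3);               /* collides with the recovery  */
    mark_region(&c, 5);               /* unrelated region            */
    flush(&c);                        /* conflict resolved here      */
    return 0;
}
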
Comment 3 Jonathan Earl Brassow 2007-04-05 10:35:54 EDT
I guess I should have seen this before...  This is a duplicate of 217438.


*** This bug has been marked as a duplicate of 217438 ***
