Bug 373711 - recovery "stuck" on 3 remaining nodes after fourth node is shot.
Status: CLOSED DUPLICATE of bug 373671
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: cman-kernel
Version: 4
Hardware: ia64 Linux
Priority: medium
Severity: high
Assigned To: Christine Caulfield
QA Contact: Cluster QE
Reported: 2007-11-09 15:30 EST by Dean Jansa
Modified: 2009-04-16 16:01 EDT
Doc Type: Bug Fix
Last Closed: 2007-11-14 09:18:05 EST


Attachments
link-13 stack (74.45 KB, text/plain), 2007-11-09 15:30 EST, Dean Jansa
link-14 stack (120.49 KB, text/plain), 2007-11-09 15:32 EST, Dean Jansa
link-15 stack (106.88 KB, text/plain), 2007-11-09 15:33 EST, Dean Jansa
link-16 stack (104.31 KB, text/plain), 2007-11-09 15:33 EST, Dean Jansa

Description Dean Jansa 2007-11-09 15:30:37 EST
Description of problem:

Four-node cluster: link-{13,14,15,16}
link-13 was shot during recovery testing, and the remaining 3 nodes never
completed recovery.  Dave dug around and found that link-{14,15,16} are all
waiting for a barrier to complete, which is not happening for an unknown reason.


Version-Release number of selected component (if applicable):

2.6.9-55.0.12.ELlargesmp
cman-kernel-largesmp-2.6.9-50.2.0.6
dlm-kernel-largesmp-2.6.9-46.16.0.12



How reproducible:

Haven't tried at this point.


Steps to Reproduce:
1.  Run revolver with a single GFS file system on a 4-node cluster.
 
----------------------------
[root@link-13 ~]# cat /proc/cluster/services 
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 join      S-4,4,1
[2 3 4 1]

DLM Lock Space:  "clvmd"                             4   3 join      S-4,4,1
[2 3 4 1]

User:            "usrm::manager"                    11   4 run       -
[1]


---------
[root@link-14 ~]# cat /proc/cluster/services 
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       U-1,10,1
[4 2 3]

DLM Lock Space:  "clvmd"                             4   4 run       U-1,10,1
[4 2 3]

DLM Lock Space:  "link_ia640"                        5   5 run       -
[4 2 3]

DLM Lock Space:  "link_ia641"                        7   7 run       -
[4 2 3]

DLM Lock Space:  "link_ia642"                        9   9 recover 4 -
[4 2 3]

GFS Mount Group: "link_ia640"                        6   6 recover 0 -
[4 2 3]

GFS Mount Group: "link_ia641"                        8   8 recover 0 -
[4 2 3]

GFS Mount Group: "link_ia642"                       10  10 recover 0 -
[4 2 3]

--------------------
[root@link-15 ~]# cat /proc/cluster/services 
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       U-1,10,1
[2 4 3]

DLM Lock Space:  "clvmd"                             4   4 run       U-1,10,1
[2 4 3]

DLM Lock Space:  "link_ia640"                        5   5 run       -
[2 4 3]

DLM Lock Space:  "link_ia641"                        7   7 run       -
[2 4 3]

DLM Lock Space:  "link_ia642"                        9   9 recover 4 -
[2 4 3]

GFS Mount Group: "link_ia640"                        6   6 recover 0 -
[2 4 3]

GFS Mount Group: "link_ia641"                        8   8 recover 0 -
[2 4 3]

GFS Mount Group: "link_ia642"                       10  10 recover 0 -
[2 4 3]

--------------------

[root@link-16 ~]# cat /proc/cluster/services 
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       U-1,10,1
[2 3 4]

DLM Lock Space:  "clvmd"                             4   4 run       U-1,10,1
[2 3 4]

DLM Lock Space:  "link_ia640"                        5   5 run       -
[2 3 4]

DLM Lock Space:  "link_ia641"                        7   7 run       -
[2 3 4]

DLM Lock Space:  "link_ia642"                        9   9 recover 4 -
[2 3 4]

GFS Mount Group: "link_ia640"                        6   6 recover 0 -
[2 3 4]

GFS Mount Group: "link_ia641"                        8   8 recover 0 -
[2 3 4]

GFS Mount Group: "link_ia642"                       10  10 recover 0 -
[2 3 4]
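
For triage, here is a rough sketch of how the /proc/cluster/services dumps above could be scanned for services that are not in the "run" state. It is not part of cman or any Cluster Suite tooling: the function name not_running and the regular expression are hypothetical, and the parsing assumes the exact column layout shown in this report (quoted name, GID, LID, state, code).

```python
import re

# Assumed layout of one service line, taken from the dumps in this report, e.g.
#   DLM Lock Space:  "link_ia642"      9   9 recover 4 -
# The state may contain a space ("recover 4"), so it is matched lazily and
# the final non-blank token is treated as the code column.
SERVICE_RE = re.compile(
    r'^(?P<kind>[^:]+):\s+"(?P<name>[^"]+)"\s+(?P<gid>\d+)\s+(?P<lid>\d+)\s+'
    r'(?P<state>.+?)\s+(?P<code>\S+)\s*$'
)


def not_running(services_text):
    """Return (kind, name, state, code) for each service whose state is not 'run'.

    Header lines, member lists such as "[4 2 3]", and blank lines do not
    match the pattern and are skipped.
    """
    found = []
    for line in services_text.splitlines():
        m = SERVICE_RE.match(line)
        if m and m.group("state") != "run":
            found.append(
                (m.group("kind"), m.group("name"), m.group("state"), m.group("code"))
            )
    return found
```

Run against the link-14 dump, this would list the "link_ia642" lock space and the three GFS mount groups, which is the set of services that never left recovery.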
Comment 1 Dean Jansa 2007-11-09 15:30:37 EST
Created attachment 253351
link-13 stack
Comment 2 Dean Jansa 2007-11-09 15:32:36 EST
Created attachment 253361
link-14 stack
Comment 3 Dean Jansa 2007-11-09 15:33:20 EST
Created attachment 253371
link-15 stack
Comment 4 Dean Jansa 2007-11-09 15:33:47 EST
Created attachment 253381
link-16 stack
Comment 5 Christine Caulfield 2007-11-12 04:50:23 EST
I strongly suspect this is the same as
https://bugzilla.redhat.com/show_bug.cgi?id=299061#c39
Comment 6 Christine Caulfield 2007-11-14 09:18:05 EST
The above now has its own BZ, so I'll make this a duplicate of that bug. The
patch is ready.

*** This bug has been marked as a duplicate of 373671 ***
