Bug 252209

Summary: mount attempt deadlocks after gulm recovery
Product: [Retired] Red Hat Cluster Suite
Reporter: Corey Marthaler <cmarthal>
Component: gulm
Assignee: Chris Feist <cfeist>
Status: CLOSED WONTFIX
QA Contact: Cluster QE <mspqa-list>
Severity: low
Priority: low
Version: 4
CC: cluster-maint, nstraz
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2010-04-20 15:03:38 UTC
Attachments:
kern dump from taft-04 (flags: none)
kernel dump from taft-04 during the hung mnt attempt (flags: none)

Description Corey Marthaler 2007-08-14 19:00:30 UTC
Description of problem:
This appears to be the same as closed bz 183383.

[revolver]
================================================================================
[revolver] Senario iteration 0.6 started at Mon Aug 13 17:30:58 CDT 2007
[revolver] Sleeping 2 minute(s) to let the I/O get its lock count up...
[revolver]      Gulm Status
[revolver]      ===========
[revolver]      taft-02: Master
[revolver]      taft-01: Slave
[revolver]      taft-03: Slave
[revolver]      taft-04: Client
[revolver] Senario: GULM kill all Clients and Slaves
[revolver]
[revolver] Those picked to face the revolver... taft-04 taft-03 taft-01
[revolver] Feeling lucky taft-04? Well do ya? Go'head make my day...
[revolver] Didn't receive heartbeat for 5 seconds
[revolver] Feeling lucky taft-03? Well do ya? Go'head make my day...
[revolver] Didn't receive heartbeat for 5 seconds
[revolver] Feeling lucky taft-01? Well do ya? Go'head make my day...
[revolver] Didn't receive heartbeat for 5 seconds
[revolver]
[revolver] Verify that taft-04 has been removed from cluster on remaining nodes
[revolver] Verify that taft-03 has been removed from cluster on remaining nodes
[revolver] Verify that taft-01 has been removed from cluster on remaining nodes
[revolver] Verifying that the dueler(s) are alive
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] Still not all alive, sleeping another 10 seconds
[...]
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] All killed nodes are back up, making sure they're qarshable...
[revolver] Verifying that recovery properly took place on the node(s) which
stayed in the cluster
[revolver] checking Gulm recovery...
[revolver] Verifying that clvmd was started properly on the dueler(s)
[revolver] mounting /dev/mapper/TAFT_CLUSTER-TAFT_CLUSTER0 on /mnt/TAFT_CLUSTER0
on taft-04
[revolver] mounting /dev/mapper/TAFT_CLUSTER-TAFT_CLUSTER1 on /mnt/TAFT_CLUSTER1
on taft-04
[revolver] mounting /dev/mapper/TAFT_CLUSTER-TAFT_CLUSTER2 on /mnt/TAFT_CLUSTER2
on taft-04
PAN2 caught SIGINT: ALL STOP!!!


[root@taft-04 ~]# ps -ef | grep mount
root      5258  5257  0 Aug13 ?        00:00:00 mount -t gfs -o debug /dev/mappe2
root     21267 31714  0 13:51 ttyS0    00:00:00 grep mount

Version-Release number of selected component (if applicable):
gulm-1.0.10-0


I'll attach the stack traces...

Comment 1 Corey Marthaler 2007-08-14 19:03:37 UTC
Created attachment 161297
kern dump from taft-04

Comment 2 Chris Feist 2007-08-20 22:20:17 UTC
This is caused by a limitation in the protocol: it doesn't tell us whether a
node is rejoining the cluster or joining it for the first time.  I'm working on
a solution to this issue that doesn't require changing the protocol.
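
For illustration only, here is a minimal sketch of how a rejoin could be told
apart from a first join without adding anything to the wire protocol, by keeping
local state about which node names have logged in before. This is not gulm's
actual code or the fix that was planned; the MembershipTracker class and its
method names are hypothetical.

```python
class MembershipTracker:
    """Hypothetical helper: remembers which node names have logged in before."""

    def __init__(self):
        self._seen = set()    # every node name that has ever logged in
        self._active = set()  # node names currently logged in

    def login(self, node):
        """Record a login and classify it as 'first_join' or 'rejoin'."""
        kind = "rejoin" if node in self._seen else "first_join"
        self._seen.add(node)
        self._active.add(node)
        return kind

    def expire(self, node):
        """Record that a node dropped out of the cluster (e.g. was fenced)."""
        self._active.discard(node)

    def is_active(self, node):
        return node in self._active
```

With this state, a node such as taft-04 that is killed and then logs in again
would be classified as a rejoin, which is the distinction the protocol itself
does not provide.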

Comment 3 Corey Marthaler 2007-11-26 16:37:11 UTC
FYI - hit this bug again during 4.6 regression testing.

2.6.9-67.ELsmp
gulm-1.0.10-0

Comment 4 Corey Marthaler 2008-04-29 13:53:02 UTC
Hit this issue again during 4.6.Z testing. Note, this may be the same issue as
bz 382671.

================================================================================
[revolver] Senario iteration 0.6 started at Tue Apr 29 00:21:14 CDT 2008
[revolver] Sleeping 5 minute(s) to let the I/O get its lock count up...
[revolver]      Gulm Status
[revolver]      ===========
[revolver]      taft-02: Master
[revolver]      taft-03: Slave
[revolver]      taft-01: Slave
[revolver]      taft-04: Client
[revolver] Senario: GULM kill all Clients and Slaves
[revolver]
[revolver] Those picked to face the revolver... taft-04 taft-01 taft-03
[revolver] Feeling lucky taft-04? Well do ya? Go'head make my day...
[revolver] Didn't receive heartbeat for 5 seconds
[revolver] Feeling lucky taft-01? Well do ya? Go'head make my day...
[revolver] Didn't receive heartbeat for 5 seconds
[revolver] Feeling lucky taft-03? Well do ya? Go'head make my day...
[revolver] Didn't receive heartbeat for 5 seconds
[revolver]
[revolver] Verify that taft-04 has been removed from cluster on remaining nodes
[revolver] Verify that taft-01 has been removed from cluster on remaining nodes
[revolver] Verify that taft-03 has been removed from cluster on remaining nodes
[revolver] Verifying that the dueler(s) are alive
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] All killed nodes are back up, making sure they're qarshable...
[revolver] Verifying that recovery properly took place on the node(s) which
stayed in the cluster
[revolver] checking Gulm recovery...
[revolver] Verifying that clvmd was started properly on the dueler(s)
[revolver] mounting /dev/mapper/TAFT_CLUSTER-TAFT_CLUSTER0 on /mnt/TAFT_CLUSTER0
on taft-04
[STUCK]

root      5640  5639  0 00:28 ?        00:00:00 mount -t gfs -o debug
/dev/mapper/TAFT_CLUSTER-TAFT_CLUSTER0 /mnt/TAFT_CLUSTER0

Comment 5 Corey Marthaler 2008-04-29 13:57:43 UTC
Created attachment 304116
kernel dump from taft-04 during the hung mnt attempt

Comment 6 Corey Marthaler 2008-07-17 15:03:06 UTC
Appear to have reproduced this again during 4.7 GA regression testing. It
requires all Slaves to be killed (leaving only the Master).
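
As a rough aid for checking that condition against a revolver log, the sketch
below parses the "Gulm Status" block and reports whether a given set of killed
nodes leaves only the Master running. It assumes the status lines look exactly
like the ones pasted in this bug; the function names are made up for this
example and are not part of revolver or gulm.

```python
import re

# Matches status lines such as "[revolver]      taft-02: Master".
STATUS_RE = re.compile(r"^\[revolver\]\s+(\S+):\s+(Master|Slave|Client)\s*$")


def parse_gulm_status(log_text):
    """Return a dict mapping node name to its gulm role, taken from the log."""
    roles = {}
    for line in log_text.splitlines():
        match = STATUS_RE.match(line)
        if match:
            roles[match.group(1)] = match.group(2)
    return roles


def only_master_survives(roles, killed):
    """True if every node left after removing `killed` is a Master."""
    survivors = [role for node, role in roles.items() if node not in killed]
    return bool(survivors) and all(role == "Master" for role in survivors)


if __name__ == "__main__":
    sample = """\
[revolver]      Gulm Status
[revolver]      ===========
[revolver]      taft-02: Master
[revolver]      taft-01: Slave
[revolver]      taft-03: Slave
[revolver]      taft-04: Client
[revolver] Senario: GULM kill all Clients and Slaves
"""
    roles = parse_gulm_status(sample)
    print(only_master_survives(roles, {"taft-04", "taft-03", "taft-01"}))
```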

Comment 7 Nate Straz 2008-07-17 15:21:18 UTC
*** Bug 382671 has been marked as a duplicate of this bug. ***

Comment 8 Corey Marthaler 2008-09-03 21:19:20 UTC
Reproduced during 4.7.Z testing.

[revolver] Senario iteration 0.4 started at Wed Sep  3 10:24:21 CDT 2008
[revolver] Sleeping 3 minute(s) to let the I/O get its lock count up...
[revolver]      Gulm Status
[revolver]      ===========
[revolver]      grant-02: Slave
[revolver]      grant-03: Master
[revolver]      grant-01: Slave
[revolver] Senario: GULM kill all Slaves

Comment 9 RHEL Program Management 2010-04-20 15:03:38 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.