Bug 252209 - mount attempt deadlocks after gulm recovery
Summary: mount attempt deadlocks after gulm recovery
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: gulm
Version: 4
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Assignee: Chris Feist
QA Contact: Cluster QE
URL:
Whiteboard:
Duplicates: 382671
Depends On:
Blocks:
Reported: 2007-08-14 19:00 UTC by Corey Marthaler
Modified: 2010-04-20 15:03 UTC (History)
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-04-20 15:03:38 UTC
Embargoed:


Attachments
kern dump from taft-04 (95.35 KB, text/plain)
2007-08-14 19:03 UTC, Corey Marthaler
kernel dump from taft-04 during the hung mnt attempt (86.73 KB, text/plain)
2008-04-29 13:57 UTC, Corey Marthaler

Description Corey Marthaler 2007-08-14 19:00:30 UTC
Description of problem:
This appears to be the same as closed bz 183383.

[revolver]
================================================================================
[revolver] Senario iteration 0.6 started at Mon Aug 13 17:30:58 CDT 2007
[revolver] Sleeping 2 minute(s) to let the I/O get its lock count up...
[revolver]      Gulm Status
[revolver]      ===========
[revolver]      taft-02: Master
[revolver]      taft-01: Slave
[revolver]      taft-03: Slave
[revolver]      taft-04: Client
[revolver] Senario: GULM kill all Clients and Slaves
[revolver]
[revolver] Those picked to face the revolver... taft-04 taft-03 taft-01
[revolver] Feeling lucky taft-04? Well do ya? Go'head make my day...
[revolver] Didn't receive heartbeat for 5 seconds
[revolver] Feeling lucky taft-03? Well do ya? Go'head make my day...
[revolver] Didn't receive heartbeat for 5 seconds
[revolver] Feeling lucky taft-01? Well do ya? Go'head make my day...
[revolver] Didn't receive heartbeat for 5 seconds
[revolver]
[revolver] Verify that taft-04 has been removed from cluster on remaining nodes
[revolver] Verify that taft-03 has been removed from cluster on remaining nodes
[revolver] Verify that taft-01 has been removed from cluster on remaining nodes
[revolver] Verifying that the dueler(s) are alive
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] Still not all alive, sleeping another 10 seconds
[...]
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] All killed nodes are back up, making sure they're qarshable...
[revolver] Verifying that recovery properly took place on the node(s) which
stayed in the cluster
[revolver] checking Gulm recovery...
[revolver] Verifying that clvmd was started properly on the dueler(s)
[revolver] mounting /dev/mapper/TAFT_CLUSTER-TAFT_CLUSTER0 on /mnt/TAFT_CLUSTER0
on taft-04
[revolver] mounting /dev/mapper/TAFT_CLUSTER-TAFT_CLUSTER1 on /mnt/TAFT_CLUSTER1
on taft-04
[revolver] mounting /dev/mapper/TAFT_CLUSTER-TAFT_CLUSTER2 on /mnt/TAFT_CLUSTER2
on taft-04
PAN2 caught SIGINT: ALL STOP!!!


[root@taft-04 ~]# ps -ef | grep mount
root      5258  5257  0 Aug13 ?        00:00:00 mount -t gfs -o debug /dev/mappe2
root     21267 31714  0 13:51 ttyS0    00:00:00 grep mount

Version-Release number of selected component (if applicable):
gulm-1.0.10-0


I'll attach the stack traces...

Comment 1 Corey Marthaler 2007-08-14 19:03:37 UTC
Created attachment 161297
kern dump from taft-04

Comment 2 Chris Feist 2007-08-20 22:20:17 UTC
This is caused by a limitation in the gulm protocol: it does not tell us whether
a node is rejoining the cluster or joining it for the first time.  I'm working
on a solution that does not require changing the protocol.
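
To illustrate the missing distinction, here is a minimal sketch in Python. It is
not gulm code and not the fix under development. It assumes a hypothetical
server-side MembershipTracker that remembers which node names have logged in
before, so a later login can be classified as a rejoin without adding a field to
the wire protocol. All class and method names are made up for this example.

```python
from enum import Enum


class JoinKind(Enum):
    FIRST_JOIN = "first_join"
    REJOIN = "rejoin"


class MembershipTracker:
    """Hypothetical server-side bookkeeping; not gulm's actual data structure."""

    def __init__(self):
        # Nodes that have logged in at least once since this tracker started.
        self._seen = set()
        # Nodes currently considered members.
        self._active = set()

    def login(self, node):
        """Record a login and classify it using only locally kept history."""
        kind = JoinKind.REJOIN if node in self._seen else JoinKind.FIRST_JOIN
        self._seen.add(node)
        self._active.add(node)
        return kind

    def expire(self, node):
        """Mark a node as gone, e.g. after missed heartbeats."""
        self._active.discard(node)

    def is_active(self, node):
        return node in self._active


if __name__ == "__main__":
    tracker = MembershipTracker()
    first = tracker.login("taft-04")   # never seen before
    tracker.expire("taft-04")          # node is killed
    second = tracker.login("taft-04")  # same name comes back
    print(first.value, second.value)
```

A tracker like this loses its history if the server holding it restarts, so it
only covers the case where that server stays up across the kill and recovery.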

Comment 3 Corey Marthaler 2007-11-26 16:37:11 UTC
FYI - hit this bug again during 4.6 regression testing.

2.6.9-67.ELsmp
gulm-1.0.10-0

Comment 4 Corey Marthaler 2008-04-29 13:53:02 UTC
Hit this issue again during 4.6.Z testing. Note, this may be the same issue as
bz 382671.

================================================================================
[revolver] Senario iteration 0.6 started at Tue Apr 29 00:21:14 CDT 2008
[revolver] Sleeping 5 minute(s) to let the I/O get its lock count up...
[revolver]      Gulm Status
[revolver]      ===========
[revolver]      taft-02: Master
[revolver]      taft-03: Slave
[revolver]      taft-01: Slave
[revolver]      taft-04: Client
[revolver] Senario: GULM kill all Clients and Slaves
[revolver]
[revolver] Those picked to face the revolver... taft-04 taft-01 taft-03
[revolver] Feeling lucky taft-04? Well do ya? Go'head make my day...
[revolver] Didn't receive heartbeat for 5 seconds
[revolver] Feeling lucky taft-01? Well do ya? Go'head make my day...
[revolver] Didn't receive heartbeat for 5 seconds
[revolver] Feeling lucky taft-03? Well do ya? Go'head make my day...
[revolver] Didn't receive heartbeat for 5 seconds
[revolver]
[revolver] Verify that taft-04 has been removed from cluster on remaining nodes
[revolver] Verify that taft-01 has been removed from cluster on remaining nodes
[revolver] Verify that taft-03 has been removed from cluster on remaining nodes
[revolver] Verifying that the dueler(s) are alive
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] All killed nodes are back up, making sure they're qarshable...
[revolver] Verifying that recovery properly took place on the node(s) which
stayed in the cluster
[revolver] checking Gulm recovery...
[revolver] Verifying that clvmd was started properly on the dueler(s)
[revolver] mounting /dev/mapper/TAFT_CLUSTER-TAFT_CLUSTER0 on /mnt/TAFT_CLUSTER0
on taft-04
[STUCK]

root      5640  5639  0 00:28 ?        00:00:00 mount -t gfs -o debug
/dev/mapper/TAFT_CLUSTER-TAFT_CLUSTER0 /mnt/TAFT_CLUSTER0

Comment 5 Corey Marthaler 2008-04-29 13:57:43 UTC
Created attachment 304116
kernel dump from taft-04 during the hung mnt attempt

Comment 6 Corey Marthaler 2008-07-17 15:03:06 UTC
I appear to have reproduced this again during 4.7 GA regression testing.
Reproducing it requires killing all Slaves (leaving only the Master).
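
A small helper can check whether a given revolver iteration meets that condition.
This is a sketch that assumes the log format seen in the excerpts in this bug
(the "Gulm Status" lines and the "Those picked to face the revolver..." line).
The function names are hypothetical and are not part of revolver.

```python
import re

# Matches status lines such as "[revolver]      taft-02: Master".
STATUS_LINE = re.compile(r"^\[revolver\]\s+(\S+):\s+(Master|Slave|Client)\s*$")
# Matches the line that lists the nodes chosen to be killed.
VICTIMS_LINE = re.compile(
    r"^\[revolver\] Those picked to face the revolver\.\.\.\s+(.*)$"
)


def parse_iteration(lines):
    """Return (roles, victims) parsed from one iteration's log lines."""
    roles = {}
    victims = []
    for line in lines:
        match = STATUS_LINE.match(line)
        if match:
            roles[match.group(1)] = match.group(2)
            continue
        match = VICTIMS_LINE.match(line)
        if match:
            victims = match.group(1).split()
    return roles, victims


def all_slaves_killed(roles, victims):
    """True if there is at least one Slave and every Slave is a victim."""
    slaves = {node for node, role in roles.items() if role == "Slave"}
    return bool(slaves) and slaves <= set(victims)


SAMPLE = [
    "[revolver]      Gulm Status",
    "[revolver]      ===========",
    "[revolver]      taft-02: Master",
    "[revolver]      taft-03: Slave",
    "[revolver]      taft-01: Slave",
    "[revolver]      taft-04: Client",
    "[revolver] Senario: GULM kill all Clients and Slaves",
    "[revolver] Those picked to face the revolver... taft-04 taft-01 taft-03",
]

if __name__ == "__main__":
    roles, victims = parse_iteration(SAMPLE)
    print(all_slaves_killed(roles, victims))
```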

Comment 7 Nate Straz 2008-07-17 15:21:18 UTC
*** Bug 382671 has been marked as a duplicate of this bug. ***

Comment 8 Corey Marthaler 2008-09-03 21:19:20 UTC
Reproduced during 4.7.Z testing.

[revolver] Senario iteration 0.4 started at Wed Sep  3 10:24:21 CDT 2008
[revolver] Sleeping 3 minute(s) to let the I/O get its lock count up...
[revolver]      Gulm Status
[revolver]      ===========
[revolver]      grant-02: Slave
[revolver]      grant-03: Master
[revolver]      grant-01: Slave
[revolver] Senario: GULM kill all Slaves

Comment 9 RHEL Program Management 2010-04-20 15:03:38 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.

