Description of problem:
This appears to be the same as closed bz 183383.

[revolver] ================================================================================
[revolver] Senario iteration 0.6 started at Mon Aug 13 17:30:58 CDT 2007
[revolver] Sleeping 2 minute(s) to let the I/O get its lock count up...
[revolver] Gulm Status
[revolver] ===========
[revolver] taft-02: Master
[revolver] taft-01: Slave
[revolver] taft-03: Slave
[revolver] taft-04: Client
[revolver] Senario: GULM kill all Clients and Slaves
[revolver]
[revolver] Those picked to face the revolver... taft-04 taft-03 taft-01
[revolver] Feeling lucky taft-04? Well do ya? Go'head make my day...
[revolver] Didn't receive heartbeat for 5 seconds
[revolver] Feeling lucky taft-03? Well do ya? Go'head make my day...
[revolver] Didn't receive heartbeat for 5 seconds
[revolver] Feeling lucky taft-01? Well do ya? Go'head make my day...
[revolver] Didn't receive heartbeat for 5 seconds
[revolver]
[revolver] Verify that taft-04 has been removed from cluster on remaining nodes
[revolver] Verify that taft-03 has been removed from cluster on remaining nodes
[revolver] Verify that taft-01 has been removed from cluster on remaining nodes
[revolver] Verifying that the dueler(s) are alive
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] Still not all alive, sleeping another 10 seconds
[...]
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] All killed nodes are back up, making sure they're qarshable...
[revolver] Verifying that recovery properly took place on the node(s) which stayed in the cluster
[revolver] checking Gulm recovery...
[revolver] Verifying that clvmd was started properly on the dueler(s)
[revolver] mounting /dev/mapper/TAFT_CLUSTER-TAFT_CLUSTER0 on /mnt/TAFT_CLUSTER0 on taft-04
[revolver] mounting /dev/mapper/TAFT_CLUSTER-TAFT_CLUSTER1 on /mnt/TAFT_CLUSTER1 on taft-04
[revolver] mounting /dev/mapper/TAFT_CLUSTER-TAFT_CLUSTER2 on /mnt/TAFT_CLUSTER2 on taft-04
PAN2 caught SIGINT: ALL STOP!!!

[root@taft-04 ~]# ps -ef | grep mount
root      5258  5257  0 Aug13 ?      00:00:00 mount -t gfs -o debug /dev/mappe2
root     21267 31714  0 13:51 ttyS0  00:00:00 grep mount

Version-Release number of selected component (if applicable):
gulm-1.0.10-0

I'll attach the stack traces...
Created attachment 161297 [details] kern dump from taft-04
This is caused by a problem with the protocol: it does not tell us whether a node is rejoining the cluster or joining the cluster for the first time. I'm working on a solution to this issue that does not require changing the protocol.
FYI - hit this bug again during 4.6 regression testing.

2.6.9-67.ELsmp
gulm-1.0.10-0
Hit this issue again during 4.6.Z testing. Note, this may be the same issue as bz 382671.

[revolver] ================================================================================
[revolver] Senario iteration 0.6 started at Tue Apr 29 00:21:14 CDT 2008
[revolver] Sleeping 5 minute(s) to let the I/O get its lock count up...
[revolver] Gulm Status
[revolver] ===========
[revolver] taft-02: Master
[revolver] taft-03: Slave
[revolver] taft-01: Slave
[revolver] taft-04: Client
[revolver] Senario: GULM kill all Clients and Slaves
[revolver]
[revolver] Those picked to face the revolver... taft-04 taft-01 taft-03
[revolver] Feeling lucky taft-04? Well do ya? Go'head make my day...
[revolver] Didn't receive heartbeat for 5 seconds
[revolver] Feeling lucky taft-01? Well do ya? Go'head make my day...
[revolver] Didn't receive heartbeat for 5 seconds
[revolver] Feeling lucky taft-03? Well do ya? Go'head make my day...
[revolver] Didn't receive heartbeat for 5 seconds
[revolver]
[revolver] Verify that taft-04 has been removed from cluster on remaining nodes
[revolver] Verify that taft-01 has been removed from cluster on remaining nodes
[revolver] Verify that taft-03 has been removed from cluster on remaining nodes
[revolver] Verifying that the dueler(s) are alive
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] All killed nodes are back up, making sure they're qarshable...
[revolver] Verifying that recovery properly took place on the node(s) which stayed in the cluster
[revolver] checking Gulm recovery...
[revolver] Verifying that clvmd was started properly on the dueler(s)
[revolver] mounting /dev/mapper/TAFT_CLUSTER-TAFT_CLUSTER0 on /mnt/TAFT_CLUSTER0 on taft-04 [STUCK]

root 5640 5639 0 00:28 ? 00:00:00 mount -t gfs -o debug /dev/mapper/TAFT_CLUSTER-TAFT_CLUSTER0 /mnt/TAFT_CLUSTER0
Created attachment 304116 [details] kernel dump from taft-04 during the hung mnt attempt
Appears we reproduced this again during 4.7 GA regression testing. It requires killing all Slaves (leaving only the Master).
*** Bug 382671 has been marked as a duplicate of this bug. ***
Reproduced during 4.7.Z testing.

[revolver] Senario iteration 0.4 started at Wed Sep 3 10:24:21 CDT 2008
[revolver] Sleeping 3 minute(s) to let the I/O get its lock count up...
[revolver] Gulm Status
[revolver] ===========
[revolver] grant-02: Slave
[revolver] grant-03: Master
[revolver] grant-01: Slave
[revolver] Senario: GULM kill all Slaves
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.