Description of problem:
I've reproduced this issue 3 times now while attempting to recreate bz 480401 with fewer filesystems. Each time I attempt to simultaneously mount 50 gfs filesystems, the mounts end up deadlocked in JOIN_STOP_WAIT/FAIL_ALL_STOPPED.

Here's what a few of the GFS services look like:

[root@taft-02 tmp]# head cman
type  level name    id       state
fence 0     default 00010001 FAIL_STOP_WAIT   [1 2 3 4]
dlm   1     clvmd   00010003 FAIL_STOP_WAIT   [1 2 3 4]
dlm   1     vgblv1  00650002 FAIL_ALL_STOPPED [2 3]
dlm   1     vgclv2  006b0002 none             [2]
dlm   1     vgblv2  006f0002 FAIL_ALL_STOPPED [...]
gfs   2     vgelv5  00000000 JOIN_STOP_WAIT   [1 2 3]
gfs   2     vgflv2  00030001 JOIN_STOP_WAIT   [1 2 3 4]
gfs   2     vgflv5  00000000 JOIN_STOP_WAIT   [1 2 3 4]
gfs   2     vgflv6  00000000 JOIN_STOP_WAIT   [1 2 3]
gfs   2     vgelv7  00000000 JOIN_STOP_WAIT   [1 2 3]
gfs   2     vgflv4  000b0003 FAIL_ALL_STOPPED [2 3 4]
gfs   2     vgflv1  000d0003 FAIL_STOP_WAIT   [1 2 3 4]
gfs   2     vgdlv7  00000000 JOIN_STOP_WAIT

Version-Release number of selected component (if applicable):
2.6.18-128.el5
lvm2-2.02.40-6.el5          BUILT: Fri Oct 24 07:37:33 CDT 2008
lvm2-cluster-2.02.40-7.el5  BUILT: Wed Nov 26 07:19:19 CST 2008
device-mapper-1.02.28-2.el5 BUILT: Fri Sep 19 02:50:32 CDT 2008
kmod-gfs-0.1.31-3.el5       BUILT: Wed 17 Dec 2008 03:19:30 PM CST

How reproducible:
often

I'll attach the log/kernel dumps from the 4 taft nodes...
Created attachment 329410 [details] log from taft-01
Created attachment 329411 [details] log from taft-02
Created attachment 329412 [details] log from taft-03
Created attachment 329413 [details] log from taft-04
This doesn't look like a gfs issue to me; it sounds more like a cluster infrastructure problem.
comment 3 shows the likely problem:

Jan 19 16:01:01 taft-03 gfs_controld[6625]: Assertion failed on line 411 of file recover.c
Assertion:  "memb->opts & MEMB_OPT_RECOVER"

(It appears the groupd state in the description was collected after sysrq-t wrecked the cluster, so it's not useful.)
I'll need a dump of the gfs_controld debug log from all nodes next time this happens:

group_tool dump gfs > gfs_controld.txt

It might be good to grep /var/log/messages for that error message first to verify that it's the same bug that's been reproduced.
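A minimal sketch of that collection step; the assertion text and the group_tool invocation are taken from this report, while the `collect` helper and the output filename are invented for illustration:

```shell
# Check whether the same assertion fired, and if so capture the
# gfs_controld debug log. Run on each cluster node.
MSG='Assertion failed on line 411 of file recover.c'

collect() {
    log=$1
    if grep -q "$MSG" "$log"; then
        # same bug reproduced; dump the debug log for this node
        group_tool dump gfs > "gfs_controld.$(hostname).txt"
        echo "dumped gfs_controld log"
    else
        echo "assertion not found in $log"
    fi
}

collect /dev/null    # in practice: collect /var/log/messages
```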
I appear to have hit something similar to this. On all three grant nodes I tried this:

[root@grant-01 ~]# for i in $(seq 1 20); do mount /dev/B/lv$i /mnt/B$i; done

[and all are hung]

grant-01:
root 11717 10021 0 May07 pts/0 00:00:00 mount /dev/B/lv6 /mnt/B6
root 11720 11717 0 May07 pts/0 00:00:00 /sbin/mount.gfs /dev/B/lv6 /mnt/B6 -o rw

grant-02:
root 11713 10056 0 May07 pts/0 00:00:00 mount /dev/B/lv6 /mnt/B6
root 11718 11713 0 May07 pts/0 00:00:00 /sbin/mount.gfs /dev/B/lv6 /mnt/B6 -o rw

grant-03:
root 11713 10048 0 May07 pts/0 00:00:00 mount /dev/B/lv6 /mnt/B6
root 11716 11713 0 May07 pts/0 00:00:00 /sbin/mount.gfs /dev/B/lv6 /mnt/B6 -o rw

grant-02 is the only node with a stuck cman service:

[root@grant-02 ~]# cman_tool services
type  level name    id       state
fence 0     default 00010001 none           [1 2 3]
dlm   1     clvmd   00010003 none           [1 2 3]
dlm   1     B1      00020002 none           [1 2 3]
dlm   1     B2      00040002 none           [1 2 3]
dlm   1     B3      00060002 none           [1 2 3]
dlm   1     B4      00080002 none           [1 2 3]
dlm   1     B5      000a0002 none           [1 2 3]
dlm   1     B6      00000000 JOIN_STOP_WAIT [-1341969328 2]
gfs   2     B1      00010002 none           [1 2 3]
gfs   2     B2      00030002 none           [1 2 3]
gfs   2     B3      00050002 none           [1 2 3]
gfs   2     B4      00070002 none           [1 2 3]
gfs   2     B5      00090002 none           [1 2 3]
gfs   2     B6      000b0002 none           [1 2 3]

I'll attach a group_tool dump of gfs from each node.
Created attachment 343153 [details] group_tool dump from grant-01
Created attachment 343154 [details] group_tool dump from grant-02
Created attachment 343156 [details] group_tool dump from grant-03
Here's the only interesting thing in the syslog (on grant-02):

May  7 16:17:07 grant-02 kernel: Trying to join cluster "lock_dlm", "GRANT:B5"
May  7 16:17:07 grant-02 kernel: Joined cluster. Now mounting FS...
May  7 16:17:07 grant-02 kernel: GFS: fsid=GRANT:B5.0: jid=0: Trying to acquire journal lock...
May  7 16:17:07 grant-02 kernel: GFS: fsid=GRANT:B5.0: jid=0: Looking at journal...
May  7 16:17:07 grant-02 kernel: GFS: fsid=GRANT:B5.0: jid=0: Done
May  7 16:17:07 grant-02 kernel: GFS: fsid=GRANT:B5.0: jid=1: Trying to acquire journal lock...
May  7 16:17:07 grant-02 kernel: GFS: fsid=GRANT:B5.0: jid=1: Looking at journal...
May  7 16:17:07 grant-02 kernel: GFS: fsid=GRANT:B5.0: jid=1: Done
May  7 16:17:07 grant-02 kernel: GFS: fsid=GRANT:B5.0: jid=2: Trying to acquire journal lock...
May  7 16:17:07 grant-02 kernel: GFS: fsid=GRANT:B5.0: jid=2: Looking at journal...
May  7 16:17:07 grant-02 openais[10160]: [TOTEM] Retransmit List: 313
May  7 16:17:07 grant-02 kernel: GFS: fsid=GRANT:B5.0: jid=2: Done
May  7 16:17:07 grant-02 kernel: Trying to join cluster "lock_dlm", "GRANT:B6"
May  7 16:17:09 grant-02 openais[10160]: [TOTEM] Retransmit List: 395
May  7 16:17:09 grant-02 openais[10160]: [TOTEM] Retransmit List: 3aa
May  7 16:20:20 grant-02 kernel: dlm: B6: group join failed -512 0
May  7 16:20:20 grant-02 kernel: lock_dlm: dlm_new_lockspace error -512
May  7 16:20:20 grant-02 kernel: can't mount proto=lock_dlm, table=GRANT:B6, hostdata=jid=0:id=720898:first=1
May  7 16:20:20 grant-02 kernel: Trying to join cluster "lock_dlm", "GRANT:B6"
May  7 16:20:20 grant-02 dlm_controld[10186]: process_uevent online@ error -17 errno
I snuck onto grant-02 and also grabbed the debug log from groupd, which shows this:

1241731027 1:B6 got join
1241731027 1:B6 is cpg client 19 name 1_B6 handle 79a1deaa0000000e
1241731027 1:B6 cpg_join ok
1241731027 1:B6 waiting for first cpg event
1241731027 1:B6 confchg left 0 joined 1 total 2
1241731027 1:B6 process_node_join 2
1241731027 1:B6 cpg add node 2 total 1
1241731027 1:B6 cpg add node -1341969328 total 2
1241731027 1:B6 make_event_id 200020001 nodeid 2 memb_count 2 type 1

So the confchg gives us two members, the first with nodeid 2 and the second with nodeid -1341969328. It's not immediately clear whether the confchg should have been for just one member, or whether there are really two members and the nodeid is bad.
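As a quick sanity check (my own arithmetic, not something from the logs), that signed value can be reinterpreted as the unsigned 32-bit quantity a nodeid field would actually carry:

```shell
# Reinterpret the bogus signed nodeid as an unsigned 32-bit value.
printf '0x%08x\n' $(( (-1341969328) & 0xffffffff ))
# prints 0xb0032c50
```

0xb0032c50 doesn't resemble any plausible node id in this cluster, which would fit the "nodeid is bad" theory, though it doesn't say where the garbage value came from.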
Created attachment 343166 [details] group_tool dump from grant-02
Chrissie, I wonder if this is related to the other recent openais regressions.
*** This bug has been marked as a duplicate of bug 499734 ***
changed component to openais.