Description of problem:

I was playing around with mounting and unmounting 30 GFS filesystems and was seeing no problems, so I then created 30 more and attempted to mount them. Shortly after, the mount commands hung.

These are the stuck services (on taft-04):

DLM Lock Space:  "31"                             305 306 update    U-4,1,2
[1 2]

DLM Lock Space:  "33"                             305 308 join      S-8,40,3
[1 2 4]

4 D root     14137 11551  0  78   0 - 12492 kcl_jo 09:46 pts/0    00:00:00 mount -t gfs /dev/vg/gfs33 /mnt/gfs33

Sep 12 09:56:37 taft-04 kernel: mount         D 0000010217ba2ca0     0 14137  14546                     (NOTLB)
Sep 12 09:56:37 taft-04 kernel: 00000102120a18f8 0000000000000006 ffffffffa025cf24 0000010212ba5400
Sep 12 09:56:37 taft-04 kernel: 0000010200000000 dead4ead00000001 00000102120a18c8 0000000180132155
Sep 12 09:56:38 taft-04 kernel: 000001020d5d37f0 0000000000005f7a
Sep 12 09:56:38 taft-04 kernel: Call Trace:<ffffffffa025cf24>{:dlm:dlm_recoverd+0} <ffffffff8030a405>{wait_for_completion+167}
Sep 12 09:56:38 taft-04 kernel: <ffffffff80133da9>{default_wake_function+0} <ffffffff80133da9>{default_wake_function+0}
Sep 12 09:56:38 taft-04 kernel: <ffffffffa023a175>{:cman:kcl_join_service+381} <ffffffffa02549ca>{:dlm:dlm_new_lockspace+1401}
Sep 12 09:56:38 taft-04 kernel: <ffffffffa02ce94f>{:lock_dlm:lm_dlm_mount+1322} <ffffffffa0278310>{:lock_harness:lm_mount+358}
Sep 12 09:56:38 taft-04 kernel: <ffffffffa02936e4>{:gfs:lm_cb+0} <ffffffffa02938bd>{:gfs:gfs_lm_mount+453}
Sep 12 09:56:38 taft-04 kernel: <ffffffffa029f768>{:gfs:gfs_get_sb+1552} <ffffffff80185bd2>{do_lookup+44}
Sep 12 09:56:38 taft-04 kernel: <ffffffff8018f019>{dput+56} <ffffffff80186c39>{__link_path_walk+3427}
Sep 12 09:56:38 taft-04 kernel: <ffffffff8018f019>{dput+56} <ffffffff80161936>{cache_alloc_refill+390}
Sep 12 09:56:38 taft-04 kernel: <ffffffff8017fb41>{do_kern_mount+161} <ffffffff801951a0>{do_mount+1690}
Sep 12 09:56:38 taft-04 kernel: <ffffffff801ea821>{__up_read+16} <ffffffff80123ed3>{do_page_fault+575}
Sep 12 09:56:38 taft-04 kernel: <ffffffff80169ca3>{handle_mm_fault+373} <ffffffff8018728a>{__user_walk+94}
Sep 12 09:56:38 taft-04 kernel: <ffffffff80195543>{sys_mount+186} <ffffffff8011026a>{system_call+126}

[root@taft-04 ~]# cat /proc/cluster/dlm_debug
waiting requests
28 marked 0 requests
28 recover event 1575 done
28 move flags 0,0,1 ids 1555,1575,1575
28 process held requests
28 processed 0 requests
28 resend marked requests
28 resent 0 requests
28 recover event 1575 finished
31 move flags 0,1,0 ids 0,1577,0
31 move use event 1577
31 recover event 1577 (first)
31 add nodes
31 total nodes 1
31 rebuild resource directory
31 rebuilt 0 resources
31 recover event 1577 done
31 move flags 0,0,1 ids 0,1577,1577
31 process held requests
31 processed 0 requests
31 recover event 1577 finished
31 move flags 1,0,0 ids 1577,1577,1577
31 move flags 0,1,0 ids 1577,1580,1577
31 move use event 1580
31 recover event 1580
31 add node 2
31 total nodes 2
31 rebuild resource directory
31 rebuilt 7 resources
31 purge requests
31 purged 0 requests
33 move flags 0,1,0 ids 0,1579,0
33 move use event 1579
33 recover event 1579 (first)
33 add nodes
33 total nodes 3
33 rebuild resource directory
33 rebuilt 0 resources
33 unexpected rcom msgid a/8 nodeid=2
33 recover event 1579 done

[root@taft-04 ~]# cat /proc/cluster/dlm_locks

[root@taft-04 ~]# cat /proc/cluster/dlm_stats
DLM stats (HZ=1000)

Lock operations:     293925
Unlock operations:   288971
Convert operations:  249200
Completion ASTs:     832029
Blocking ASTs:          497

Lockqueue        num  waittime   ave
WAIT_RSB      217955    205340     0
WAIT_CONV        154       472     3
WAIT_GRANT      5998      8541     1
WAIT_UNLOCK     4465    139725    31
Total         228572    354078     1

Version-Release number of selected component (if applicable):

[root@taft-04 ~]# uname -ar
Linux taft-04 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 17:57:31 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux

[root@taft-04 ~]# rpm -q dlm
dlm-1.0.1-1
I don't know whether they are related, but I can reproduce both this and bz#206212 with the following script:

NUM=60
for i in `seq 1 $NUM`; do
  ./lstest -l SPACE-$i -d0 "$@" &
done

It uses the lstest program in cluster/dlm/tests/usertest.
Marking as a beta2 blocker for RHEL5. Also cc'ing Steve Dake since it might be an openais issue.
Sorry, this is a RHEL4 bug that was improperly marked for cluster-5.0; re-marked for cluster-4.5.
Trying to reproduce in my own cluster. If you happen to hit this again, please collect all of the following from each node:

foreach: mount >> log.txt
foreach: cman_tool nodes >> log.txt
foreach: cman_tool services >> log.txt
foreach: ps ax -o pid,cmd,wchan >> log.txt
foreach: cat /proc/cluster/sm_debug >> log.txt
foreach: cat /proc/cluster/dlm_debug >> log.txt
foreach: cat /proc/cluster/lock_dlm/debug >> log.txt
foreach: sysrq
foreach: collect /var/log/messages
I think the key thing here is:

33 unexpected rcom msgid a/8 nodeid=2

When it happens to me, there is always such a message in the dlm_debug log. The "received" msg numbers often seem to be junk too. My latest occurrence was:

gfs-2 unexpected rcom msgid d/25b nodeid=6

So I'm going to look for message corruption.
I left a mount/unmount test on 20 fs's running over the weekend and found the same thing -- one node reported the rcom msgid error before everything hung. Interestingly, just before it came the error message I added last week when fixing bug 206212, where process_startdone_barrier_new() gets an sg with no sevent for it. We now print this new error (below) instead of oopsing on the null pointer.

Sep 15 21:29:26 bull-02 kernel: SM: 010020dc process_startdone_barrier_new: no sev 0
Sep 15 21:29:29 bull-02 kernel: GFS: Trying to join cluster "lock_dlm", "bull:x10"
Sep 15 21:29:31 bull-02 kernel: dlm: x9: unexpected rcom msgid 5/27e59 nodeid=4
bull-01 (nodeid 1):
DLM Lock Space:  "x9"                            8412 28080 run       -
[1 2]

bull-02 (nodeid 2):
DLM Lock Space:  "x9"                            8412 27804 join      S-8,40,2
[1 2]

20dc == 8412 == lockspace x9

So, bull-02 is joining lockspace x9, which already contains bull-01. bull-02 is waiting for the final barrier that says everyone is done and will complete the join. bull-01 has apparently gotten the barrier and moved on to the normal running state.

When the barrier callback is delivered on bull-02 (with a success status of 0), process_startdone_barrier_new() doesn't find an sevent on the sg structure for x9, so it ignores the barrier callback and does the printk in comment 6. This explains why x9 is still in sevent state SEST_BARRIER_WAIT. What doesn't make sense is why the barrier callback doesn't find the sevent structure -- clearly one exists, as it's shown by /proc/cluster/services.

This doesn't seem to explain the subsequent dlm msgid printk in any way; nodeid 4 (sender of the rcom message) isn't in lockspace x9 and doesn't appear to be joining it either.
Hmm, I think the DLM message might just be related to the SM problem. Look at this:

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[7 5 6 8]

DLM Lock Space:  "clvmd"                             1   3 run       -
[5 6 7 8]

DLM Lock Space:  "SPACE-51"                         53  54 run       -
[5 6 7 8]

DLM Lock Space:  "gfs-1"                            54  55 update    U-4,1,6
[7 6]

DLM Lock Space:  "gfs-2"                            54  57 join      S-8,40,2
[5 7]

GFS Mount Group: "gfs-1"                            56  56 run       -
[7]

See, lockspaces gfs-1 & gfs-2 have the same global ID!
Good catch, that's exactly the bug here -- two groups are being created with the same global id. SM gets a barrier callback specifying the gid, does a gid-to-sg lookup, gets the wrong sg back, finds that sg doesn't have an sevent and prints "process_startdone_barrier_new: no sev 0". The correct sg is left without the barrier callback it needed. Not sure yet how to fix the creation of global id's.
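To illustrate the failure mode, here is a minimal standalone sketch -- not the actual SM code; the struct and function names are simplified stand-ins for the sg/sevent structures mentioned above -- of how a first-match gid-to-sg lookup delivers the barrier to the wrong group when two groups share a gid:

/* Simplified illustration: when two groups carry the same global id, the
 * gid-to-sg lookup stops at the first match, which may be the group with
 * no pending sevent, so the real joiner never sees its barrier callback. */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

struct sg {
	uint32_t    global_id;
	const char *name;
	int         has_sevent;   /* join in progress, waiting for barrier */
};

/* Hypothetical lookup: first sg whose global_id matches wins. */
static struct sg *gid_to_sg(struct sg *groups, size_t n, uint32_t gid)
{
	for (size_t i = 0; i < n; i++)
		if (groups[i].global_id == gid)
			return &groups[i];
	return NULL;
}

int main(void)
{
	/* Both gfs-1 and gfs-2 ended up with gid 54 (see the cman_tool
	 * services output above). */
	struct sg groups[] = {
		{ 54, "gfs-1", 0 },   /* already past its join, no sevent   */
		{ 54, "gfs-2", 1 },   /* joining, stuck in SEST_BARRIER_WAIT */
	};

	struct sg *sg = gid_to_sg(groups, 2, 54);
	if (sg && !sg->has_sevent)
		printf("process_startdone_barrier_new: no sev 0 "
		       "(lookup returned \"%s\"; barrier meant for \"gfs-2\" is lost)\n",
		       sg->name);
	return 0;
}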
If someone had seen this bug they would have gotten kernel oopses (prior to a recent change) or the "process_startdone_barrier_new" message (after the recent change). The "recent change" was the fix for bug 206212, which turns the oops into a kernel message plus a hang when this bug (duplicate global ids) is hit. So, fixing this bug will make the fix for bug 206212 pretty irrelevant. Note the comment in bug 206212 that states "the deeper question is why" -- this bug is the deeper why.
I think we should try to get a fix for this into the errata release; it turns out to be pretty simple to hit. Working on a possible solution.
Before choosing a new global id for a new group, a node discovers what the current max group id seen by any node in the cluster is and picks new=max+1. The problem is when two nodes do this in parallel for two different new groups (that are at the same level) and end up assigning the same global id to the two different groups. To fix this, the idea is to have a node take this max group id and add its own nodeid to it, so new=max+our_nodeid. This is similar to designating some of the id bits to be the nodeid of the node creating the group, but that method would break backward compatibility. What I'm not certain of is whether there are cases where the two nodes might see different values of max so that their different nodeids would make the two new values the same again.
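Not the actual SM code, but a tiny standalone sketch of the two assignment schemes described above, using the numbers from the cman_tool output earlier in this bug (highest existing gid 53 for SPACE-51, so both joiners picked 54 under the old scheme):

/* Old scheme vs. proposed scheme for assigning a new group's global id.
 * With new = max + 1, two nodes creating groups in parallel pick the same
 * id; with new = max + our_nodeid, a shared view of max yields distinct
 * ids as long as the creating nodeids differ. */
#include <stdio.h>
#include <stdint.h>

static uint32_t new_gid_old(uint32_t cluster_max)
{
	return cluster_max + 1;
}

static uint32_t new_gid_new(uint32_t cluster_max, uint32_t our_nodeid)
{
	return cluster_max + our_nodeid;
}

int main(void)
{
	uint32_t max = 53;   /* highest gid seen by both nodes */

	printf("old: node1=%u node2=%u\n",
	       new_gid_old(max), new_gid_old(max));        /* 54 and 54: collision */
	printf("new: node1=%u node2=%u\n",
	       new_gid_new(max, 1), new_gid_new(max, 2));  /* 54 and 55: distinct  */
	return 0;
}

As the last paragraph above notes, this only holds if the two nodes agree on max; the open question is whether there are windows where they see different values.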
The test results of the idea in comment 15 look good, although I'm doing more testing; I may also add a check somewhere to give us an explicit warning if duplicates exist. I'm pretty satisfied that adding the nodeid will solve the problem: when two nodes are creating a new group in parallel, they'll have a common max value AFAICS, so no max(4)+nodeid(1)=gid(5) vs max(3)+nodeid(2)=gid(5) collision. There are a number of reasons we can't use uuidgen: we're buried in the kernel, our id is 32 bits, and we need to maintain backward compat. In groupd (RHEL5), gid's are still created (by an even simpler local nodeid | local counter), but not used, only passed to dlm_controld and gfs_controld. gfs_controld can just use a local counter it keeps itself (i.e. it's not global), and I hope to do something simpler for dlm lockspace id's within the dlm. Then the gid's can be removed completely.
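For reference, a hypothetical sketch of the "local nodeid | local counter" style of id mentioned above for groupd. The actual bit split groupd uses may differ; this just shows why two nodes cannot collide when the creating nodeid is part of the value:

/* Hypothetical 32-bit id: high half carries the creating node's id, low
 * half a per-node counter, so ids from different nodes never collide. */
#include <stdio.h>
#include <stdint.h>

static uint32_t make_id(uint16_t nodeid, uint16_t counter)
{
	return ((uint32_t)nodeid << 16) | counter;
}

int main(void)
{
	printf("node 1, counter 7 -> 0x%08x\n", (unsigned)make_id(1, 7));
	printf("node 2, counter 7 -> 0x%08x\n", (unsigned)make_id(2, 7));
	return 0;
}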
Checked the fix into the RHEL4 branch. http://sources.redhat.com/ml/cluster-cvs/2006-q3/msg00661.html
This "fix" for this from comment 19 is bad and makes the problem much worse. Testing what I expect is the correct fix right now.
Just made/mounted 65 gfs filesystems on a 4-node cluster. Marking verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0135.html
Test loops mounting four fs's in parallel on four nodes, then unmounts all in parallel. After some time it hit this bug again; note that lockspaces lv0 and lv1 both have GID 5531.

[root@marathon-01 src]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2    2 run       -
[1 2 3 4 5]

DLM Lock Space:  "clvmd"                             3    3 run       -
[1 2 3 4 5]

DLM Lock Space:  "lv2"                            5527 2584 run       -
[1 3]

DLM Lock Space:  "lv1"                            5531 2586 update    U-4,1,2
[1 2]

DLM Lock Space:  "lv0"                            5531 2588 join      S-8,40,2
[1 2]

GFS Mount Group: "lv2"                            5530 2585 run       -
[1 3]

GFS Mount Group: "lv1"                            5532 2587 run       -
[1]

[root@marathon-02 src]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2    2 run       -
[1 2 3 4 5]

DLM Lock Space:  "clvmd"                             3    3 run       -
[1 2 3 4 5]

DLM Lock Space:  "lv3"                            5526 2582 run       -
[2 4 3]

DLM Lock Space:  "lv0"                            5531 2584 update    U-4,1,1
[2 1]

DLM Lock Space:  "lv1"                            5531 2586 join      S-8,40,2
[1 2]

GFS Mount Group: "lv3"                            5529 2583 run       -
[2 4 3]

GFS Mount Group: "lv0"                            5533 2585 run       -
[2]

marathon-01: dlm: lv0: unexpected rcom msgid 5/4 nodeid=2
marathon-02: dlm: lv1: unexpected rcom msgid 4/1 nodeid=1
Changing acks to target 4.8 (instead of 4.5).
It's not possible to fix this without major work that would break compatibility. No users or customers have ever reported seeing this, and hitting it in practice would be highly unlikely.