Description of problem:
I was trying to recreate bug #434962 using clues from bug #436383 and I hit an apparent problem somewhere between lock_dlm and dlm. I got these messages:

lock_dlm: exxonfs: gdlm_lock 2,17 err=-16 cur=3 req=5 lkf=3044 flags=80
GFS: fsid=bobs_exxon:exxonfs.2: fatal: assertion "!(lck_ret & LM_OUT_ERROR)" failed

Version-Release number of selected component (if applicable):
RHEL5.2 prototype with the 2.6.18-83.el5 kernel and kmod-gfs-0.1.23-3.el5.

How reproducible:
Unknown. Got it twice in a row, though, without much trouble.

Steps to Reproduce:
Requires 900G of shared storage.
1. Reboot all nodes.
2. On one node:
   vgchange -an exxon_vg
   lvremove /dev/exxon_vg/exxon_lv
   lvcreate -l 102399 exxon_vg -n exxon_lv
   mkfs.gfs -O -j3 -p lock_dlm -t bobs_exxon:exxonfs /dev/exxon_vg/exxon_lv
3. On all nodes:
   mount -t gfs /dev/exxon_vg/exxon_lv /mnt/gfs
4. On both nodes 2 and 3:
   for i in `seq 1 200`; do echo $i ; time df -h /mnt/gfs ; sleep 1; done
5. While that loop is running on nodes 2 and 3, do this on node 1:
   lvextend -l +51199 /dev/exxon_vg/exxon_lv
   gfs_grow /mnt/gfs
6. If that works, do this on node 1:
   lvextend -l +76799 /dev/exxon_vg/exxon_lv
   gfs_grow /mnt/gfs

Actual results:
lock_dlm: exxonfs: gdlm_lock 2,17 err=-16 cur=3 req=5 lkf=3044 flags=80
GFS: fsid=bobs_exxon:exxonfs.2: fatal: assertion "!(lck_ret & LM_OUT_ERROR)" failed
GFS: fsid=bobs_exxon:exxonfs.2: function = gfs_glock_xmote_th
GFS: fsid=bobs_exxon:exxonfs.2: file = /builddir/build/BUILD/gfs-kmod-0.1.23/_kmod_build_/src2
GFS: fsid=bobs_exxon:exxonfs.2: time = 1205945156

Expected results:
No lock_dlm errors should be reported and the gfs_grow should be successful.

Additional info:
I did this several times with smaller chunks, i.e. lvcreate -L 50G and then lvextend by +50G, going from 50G all the way up to 900G (so nearly 20 extensions) one after another, but the problem did not recreate. It was only when I started extending by a large size that I ran into this problem.
I tried this same set of commands on a single-node system using the lock_nolock protocol, with one terminal running the loop and the other running gfs_grow. That also did not recreate the problem. This may be the root cause of the two bugs mentioned above. For the latest recreation I ran with dlm_debug = "1" and asked Dave T. to look at the failed system; perhaps he can add what he observed.
More detailed dmesg after the failure:

lock_dlm: gdlm_cancel 2,17 flags 0
dlm: exxonfs: add overlap 19f0001 cur 2 new 4 count 2 flags 100000
lock_dlm: gdlm_cancel rv 0 2,17 flags 88
lock_dlm: complete dlm cancel 2,17 flags 8
lock_dlm: gdlm_cancel 2,17 flags 80
lock_dlm: gdlm_cancel skip 2,17 flags 0
lock_dlm: gdlm_cancel 2,17 flags 80
lock_dlm: gdlm_cancel rv 0 2,17 flags 8
lock_dlm: exxonfs: gdlm_lock 2,17 err=-16 cur=3 req=5 lkf=3044 flags=80
GFS: fsid=bobs_exxon:exxonfs.2: fatal: assertion "!(lck_ret & LM_OUT_ERROR)" failed
GFS: fsid=bobs_exxon:exxonfs.2: function = gfs_glock_xmote_th
GFS: fsid=bobs_exxon:exxonfs.2: file = /builddir/build/BUILD/gfs-kmod-0.1.23/_kmod_build_/src/gfs/glock.c, line = 1162
GFS: fsid=bobs_exxon:exxonfs.2: time = 1205945156
GFS: fsid=bobs_exxon:exxonfs.2: about to withdraw from the cluster
GFS: fsid=bobs_exxon:exxonfs.2: telling LM to withdraw
dlm: exxonfs: leaving the lockspace group...
dlm: exxonfs: group event done 0 0
Created attachment 298706 [details] Output from group_tool dump gfs after failure
I managed to recreate the problem with a test patch from Dave, and got these messages:

GFS: fsid=bobs_exxon:exxonfs.2: jid=2: Trying to acquire journal lock...
GFS: fsid=bobs_exxon:exxonfs.2: jid=2: Looking at journal...
GFS: fsid=bobs_exxon:exxonfs.2: jid=2: Done
lock_dlm: gdlm_cancel 2,17 flags 0
lock_dlm: gdlm_cancel skip 2,17 flags 0
lock_dlm: gdlm_cancel 2,17 flags 80
lock_dlm: gdlm_cancel skip 2,17 flags 0
lock_dlm: gdlm_cancel 2,17 flags 80
lock_dlm: gdlm_cancel rv 0 2,17 flags 8
lock_dlm: exxonfs: gdlm_lock 2,17 err=-16 cur=3 req=5 lkf=3044 flags=80

It took several (maybe 10) attempts before it recreated this time.
With the debug code, the problem would not recreate after hours of trying. The new debug printk messages changed the timing too much. I modified the patch so that it keeps the history in memory, and dumps it out when the problem occurs. With this modification the problem recreated the first time I tried. And here's what I got:

GFS: fsid=bobs_exxon:exxonfs.2: jid=2: Trying to acquire journal lock...
GFS: fsid=bobs_exxon:exxonfs.2: jid=2: Looking at journal...
GFS: fsid=bobs_exxon:exxonfs.2: jid=2: Done
217 lock 0,3 flags 0 pid 7439 lpflags 0
217 complete sb_status 0 lpflags 0
217 lock 3,1 flags 10 pid 7487 lpflags 0
217 complete sb_status 0 lpflags 0
217 lock 1,3 flags 1 pid 7487 lpflags 0
217 cancel pid 7487 lpflags 80
217 complete sb_status 0 lpflags 0
217 lock 3,1 flags 10 pid 7451 lpflags 0
217 complete sb_status 0 lpflags 0
217 lock 1,3 flags 1 pid 7487 lpflags 0
217 cancel pid 7487 lpflags 80
217 complete sb_status 0 lpflags 8
217 lock 3,1 flags 10 pid 7451 lpflags 8
lock_dlm: exxonfs: gdlm_lock 2,17 err=-16 cur=3 req=5 lkf=3044 flags=80
Stepping back, this looks like roughly the same issue Abhi and I were working on in April 2007, in response to this email:

https://www.redhat.com/archives/linux-cluster/2007-April/msg00182.html

Here's an email I sent summarizing our conclusions back then:

On Mon, Apr 23, 2007 at 05:45:23PM +0100, Robert Clark wrote:
> > > I have a script which runs gfs_quota to set quotas for all the users
> > > on my GFS filesystem. When it's run simultaneously on two nodes, errors
> > > like the following begin to appear:
> > >
> > > lock_dlm: lm_dlm_cancel 2,34 flags 80
> > > lock_dlm: lm_dlm_cancel rv 0 2,34 flags 40080
> > > lock_dlm: complete dlm cancel 2,34 flags 40000
> >
> > Writes to a hidden/internal file (like the quota file) set the PRIORITY
> > flag in the glock request, which causes the cancels. I don't understand
> > why gfs would want to do cancels in this case, though; I was very
> > surprised to see that. So, the first job is to figure out why these
> > writes need to do cancels (or if that's the bug right there).
>
> Is the uncertainty about whether hidden writes should be prioritised,
> or whether the priority should trigger cancels? Sorry, my understanding
> of all this is still a bit cursory.
>
> I don't know if this helps, but trawling back through the CVS log
> reveals that the condition for the call to handle_cancels (now
> do_cancels) in glock_wait_internal was changed from LM_FLAG_NOEXP to
> LM_FLAG_PRIORITY in revision 1.8 for bz133368:
>
> http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/gfs-kernel/src/gfs/glock.c?cvsroot=cluster#rev1.8
>
> Apart from that, the code hasn't changed much that I can see.

Heh, thanks a lot for digging that out; I'd been meaning to go back and see if there was some history for that. It further confirms what we're becoming convinced of here, which is that we don't need or want to do cancels when locking the quota file (which probably amounts to just removing the PRIORITY flag from the hfile_write function).
The PRIORITY flag has had an ambiguous meaning over the years. In the CVS check-in you found, PRIORITY suddenly came to mean "cancel" in addition to the vague "give this request preference" meaning it held before. This correctly fixed bz133368, but it wasn't necessary (and was probably incorrect) for hfile_write's use of PRIORITY. I'm guessing Ken didn't check for other PRIORITY users, or did and thought it wouldn't make any difference in this case (it usually doesn't).

We've been looking at how the quota file is locked by gfs, and we can't see why it would ever need or want to do cancels as part of that. We believe that in your case, since you're running gfs_quota on more than one node, the quota daemon is being delayed trying to lock the quota file, which makes it likely that the gfs_quota command will come in and cancel the lock request from the daemon.

We're going to remove the PRIORITY flag from hfile_write and run some tests to verify that it works ok.

Dave
I think the unresolved question that Abhi and I had last spring was whether we should simply remove the PRIORITY flag from hfile_write (which removes both the cancels and the higher preference in the dlm), or whether we should remove the cancels but keep the higher dlm preference for the lock. I didn't think this middle ground was necessary, but it may be a little safer given that it's a smaller change to the existing behavior, i.e. removing just the part we don't want (the cancels) and leaving the other part.
Bob, could you run your tests with no PRIORITY flag in hfile_write?
It's running now, but it's extremely slow. A gfs_grow used to take 10-15 seconds with the PRIORITY flag set when growing a file system to 900G. Without PRIORITY set, it's been over five minutes and it's only gotten up to 215G (and growing).
It took probably about an hour, maybe longer, to do the gfs_grow without the PRIORITY bit set.
Created attachment 299022 [details]
patch to try

I'm guessing that you were actively using the fs while trying to grow. If there's little to no activity, does the grow happen quickly? With the attached patch, we keep LM_FLAG_PRIORITY so that the lock's priority is kept in the dlm, but add GL_NOCANCEL_OTHER to avoid canceling other outstanding locks.
With this patch, things are back to being fast again. However, the problem is still there. On my third attempt to recreate the problem I got this:

Trying to join cluster "lock_dlm", "bobs_exxon:exxonfs"
Joined cluster. Now mounting FS...
GFS: fsid=bobs_exxon:exxonfs.1: jid=1: Trying to acquire journal lock...
GFS: fsid=bobs_exxon:exxonfs.1: jid=1: Looking at journal...
GFS: fsid=bobs_exxon:exxonfs.1: jid=1: Done
217 lock 3,1 flags 10 pid 10962 lpflags 0
217 complete sb_status 0 lpflags 0
217 lock 1,3 flags 1 pid 10999 lpflags 0
217 cancel pid 10999 lpflags 80
217 complete sb_status 0 lpflags 8
217 lock 3,1 flags 10 pid 10962 lpflags 8
lock_dlm: exxonfs: gdlm_lock 2,17 err=-16 cur=3 req=5 lkf=3044 flags=80
GFS: fsid=bobs_exxon:exxonfs.1: fatal: assertion "!(lck_ret & LM_OUT_ERROR)" failed
GFS: fsid=bobs_exxon:exxonfs.1: function = gfs_glock_xmote_th
GFS: fsid=bobs_exxon:exxonfs.1: file = /home/bob/cluster/gfs-kernel/src/gfs/glock.c, line = 112
GFS: fsid=bobs_exxon:exxonfs.1: time = 1206533294
GFS: fsid=bobs_exxon:exxonfs.1: about to withdraw from the cluster
GFS: fsid=bobs_exxon:exxonfs.1: telling LM to withdraw
Then the patch isn't working like it's supposed to; there are not supposed to be any cancels happening.
My bad. I screwed up the patch. With Dave's patch in place (correctly) I was able to run this scenario at least a dozen times with no failures while df loops were going on simultaneously from two other nodes. The problem did not recreate. Now I need to scratch build my cluster to RHEL4 and try to recreate the problem there.
After many attempts, I have not been able to recreate the failure on RHEL4. So I'll defer to Dave as to whether we should back-port the fix to RHEL4 or just do the fix in RHEL5.
Commit in RHEL5 branch: 0847ffdaf607aafd538e949c91eb47f2a06c4335
Also pushed to STABLE2: 86a728d286e95b6db8a46ecd8c9b9a6540d2b138
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0132.html