Bug 438268

Summary: GFS: gfs_grow causes lock_dlm: exxonfs: gdlm_lock 2,17 err=-16

Product: Red Hat Enterprise Linux 5
Version: 5.2
Component: gfs-kmod
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Milestone: rc
Reporter: Robert Peterson <rpeterso>
Assignee: Robert Peterson <rpeterso>
QA Contact: GFS Bugs <gfs-bugs>
CC: edamato, mkarg, tao, teigland
Doc Type: Bug Fix
Bug Blocks: 487026
Cloned as: 487026
Last Closed: 2009-01-20 21:18:19 UTC

Attachments:
  Output from group_tool dump gfs after failure (flags: none)
  patch to try (flags: none)

Description Robert Peterson 2008-03-19 22:14:53 UTC
Description of problem:
I was trying to recreate bug #434962 using clues from bug #436383
and I hit an apparent problem somewhere between lock_dlm and dlm.
I got these messages:

lock_dlm: exxonfs: gdlm_lock 2,17 err=-16 cur=3 req=5 lkf=3044 flags=80
GFS: fsid=bobs_exxon:exxonfs.2: fatal: assertion "!(lck_ret & LM_OUT_ERROR)" failed

Version-Release number of selected component (if applicable):
RHEL5.2 prototype with the 2.6.18-83.el5 kernel and kmod-gfs-0.1.23-3.el5.

How reproducible:
Unknown.  Got it twice in a row though without much trouble.

Steps to Reproduce:
Requires 900G shared storage.

1. reboot all nodes
2. On one node:
   vgchange -an exxon_vg
   lvremove /dev/exxon_vg/exxon_lv
   lvcreate -l 102399 exxon_vg -n exxon_lv
   mkfs.gfs -O -j3 -p lock_dlm -t bobs_exxon:exxonfs /dev/exxon_vg/exxon_lv
3. On all nodes:
   mount -tgfs /dev/exxon_vg/exxon_lv /mnt/gfs
4. On both nodes 2 and 3:
   for i in `seq 1 200`; do echo $i ; time df -h /mnt/gfs ; sleep 1; done
5. While that loop is running on nodes 2 and 3, do this on node 1:
   lvextend -l +51199 /dev/exxon_vg/exxon_lv 
   gfs_grow /mnt/gfs
6. If that works, do this on node 1:
   lvextend -l +76799 /dev/exxon_vg/exxon_lv 
   gfs_grow /mnt/gfs

Actual results:
lock_dlm: exxonfs: gdlm_lock 2,17 err=-16 cur=3 req=5 lkf=3044 flags=80
GFS: fsid=bobs_exxon:exxonfs.2: fatal: assertion "!(lck_ret & LM_OUT_ERROR)" failed
GFS: fsid=bobs_exxon:exxonfs.2:   function = gfs_glock_xmote_th
GFS: fsid=bobs_exxon:exxonfs.2:   file =
/builddir/build/BUILD/gfs-kmod-0.1.23/_kmod_build_/src2
GFS: fsid=bobs_exxon:exxonfs.2:   time = 1205945156

Expected results:
No lock_dlm errors should be reported and the gfs_grow should be
successful.

Additional info:
I did this several times with smaller chunks, i.e. lvcreate -L 50G
and then lvextend by +50G.  That way I could go from 50G all the way
up to 900G (so nearly 20 times) one after another, but the problem did
not recreate.  It was only when I started extending by a large size
that I ran into this problem.

I tried this same set of commands using a single-node system, where
one terminal was doing the loop and the other was doing gfs_grow,
using the lock_nolock protocol.  That also did not recreate the problem.

This may be the root cause of the two bugs mentioned above.

With the latest recreation I ran with dlm_debug = "1" and asked Dave T.
to look at the failed system.  Perhaps he can add what he observed.

Comment 1 Robert Peterson 2008-03-20 13:49:25 UTC
More detailed dmesg after the failure:

lock_dlm: gdlm_cancel 2,17 flags 0
dlm: exxonfs: add overlap 19f0001 cur 2 new 4 count 2 flags 100000
lock_dlm: gdlm_cancel rv 0 2,17 flags 88
lock_dlm: complete dlm cancel 2,17 flags 8
lock_dlm: gdlm_cancel 2,17 flags 80
lock_dlm: gdlm_cancel skip 2,17 flags 0
lock_dlm: gdlm_cancel 2,17 flags 80
lock_dlm: gdlm_cancel rv 0 2,17 flags 8
lock_dlm: exxonfs: gdlm_lock 2,17 err=-16 cur=3 req=5 lkf=3044 flags=80
GFS: fsid=bobs_exxon:exxonfs.2: fatal: assertion "!(lck_ret & LM_OUT_ERROR)" failed
GFS: fsid=bobs_exxon:exxonfs.2:   function = gfs_glock_xmote_th
GFS: fsid=bobs_exxon:exxonfs.2:   file =
/builddir/build/BUILD/gfs-kmod-0.1.23/_kmod_build_/src/gfs/glock.c, line = 1162
GFS: fsid=bobs_exxon:exxonfs.2:   time = 1205945156
GFS: fsid=bobs_exxon:exxonfs.2: about to withdraw from the cluster
GFS: fsid=bobs_exxon:exxonfs.2: telling LM to withdraw
dlm: exxonfs: leaving the lockspace group...
dlm: exxonfs: group event done 0 0


Comment 2 Robert Peterson 2008-03-20 15:26:37 UTC
Created attachment 298706 [details]
Output from group_tool dump gfs after failure

Comment 3 Robert Peterson 2008-03-20 16:38:31 UTC
I managed to recreate the problem with a test patch from Dave, and
got these messages:

GFS: fsid=bobs_exxon:exxonfs.2: jid=2: Trying to acquire journal lock...
GFS: fsid=bobs_exxon:exxonfs.2: jid=2: Looking at journal...
GFS: fsid=bobs_exxon:exxonfs.2: jid=2: Done
lock_dlm: gdlm_cancel 2,17 flags 0
lock_dlm: gdlm_cancel skip 2,17 flags 0
lock_dlm: gdlm_cancel 2,17 flags 80
lock_dlm: gdlm_cancel skip 2,17 flags 0
lock_dlm: gdlm_cancel 2,17 flags 80
lock_dlm: gdlm_cancel rv 0 2,17 flags 8
lock_dlm: exxonfs: gdlm_lock 2,17 err=-16 cur=3 req=5 lkf=3044 flags=80

It took several (maybe 10) attempts before it recreated this time.


Comment 4 Robert Peterson 2008-03-20 21:59:13 UTC
With the debug code, the problem would not recreate after hours of
trying.  The new debug printk messages changed the timing too much.
I modified the patch so that it keeps the history in memory, and 
dumps it out when the problem occurs.  With this modification
the problem recreated the first time I tried.  And here's what I got:

GFS: fsid=bobs_exxon:exxonfs.2: jid=2: Trying to acquire journal lock...
GFS: fsid=bobs_exxon:exxonfs.2: jid=2: Looking at journal...
GFS: fsid=bobs_exxon:exxonfs.2: jid=2: Done
217 lock 0,3 flags 0 pid 7439 lpflags 0
217 complete sb_status 0 lpflags 0
217 lock 3,1 flags 10 pid 7487 lpflags 0
217 complete sb_status 0 lpflags 0
217 lock 1,3 flags 1 pid 7487 lpflags 0
217 cancel pid 7487 lpflags 80
217 complete sb_status 0 lpflags 0
217 lock 3,1 flags 10 pid 7451 lpflags 0
217 complete sb_status 0 lpflags 0
217 lock 1,3 flags 1 pid 7487 lpflags 0
217 cancel pid 7487 lpflags 80
217 complete sb_status 0 lpflags 8
217 lock 3,1 flags 10 pid 7451 lpflags 8
lock_dlm: exxonfs: gdlm_lock 2,17 err=-16 cur=3 req=5 lkf=3044 flags=80
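
For reference, a minimal user-space sketch of the in-memory history
technique described above (this is not the actual debug patch; the
buffer size, names, and sample events are placeholders):

/* Record lock events in a small in-memory ring buffer instead of
 * printk'ing them (which disturbs the timing), and dump the history
 * only when the failure is detected. */
#include <stdio.h>

#define HIST_SIZE 64

static char hist[HIST_SIZE][80];
static unsigned int hist_next;

static void hist_record(const char *msg)
{
    snprintf(hist[hist_next % HIST_SIZE], sizeof(hist[0]), "%s", msg);
    hist_next++;
}

static void hist_dump(void)
{
    unsigned int start = hist_next > HIST_SIZE ? hist_next - HIST_SIZE : 0;
    unsigned int i;

    for (i = start; i < hist_next; i++)
        printf("%s\n", hist[i % HIST_SIZE]);
}

int main(void)
{
    hist_record("217 lock 3,1 flags 10 pid 7451 lpflags 0");
    hist_record("217 cancel pid 7487 lpflags 80");
    /* ...on the error path (e.g. err=-16), dump the saved history: */
    hist_dump();
    return 0;
}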


Comment 5 David Teigland 2008-03-24 19:17:51 UTC
Stepping back a bit, this looks like much the same issue Abhi and I were
working on in Apr 2007 in response to this email:
https://www.redhat.com/archives/linux-cluster/2007-April/msg00182.html

Here's an email I sent summarizing our conclusions back then:

On Mon, Apr 23, 2007 at 05:45:23PM +0100, Robert Clark wrote:
> > >   I have a script which runs gfs_quota to set quotas for all the users
> > > on my GFS filesystem. When it's run simultaneously on two nodes, errors
> > > like the following begin to appear:
> > > 
> > >     lock_dlm: lm_dlm_cancel 2,34 flags 80
> > >     lock_dlm: lm_dlm_cancel rv 0 2,34 flags 40080
> > >     lock_dlm: complete dlm cancel 2,34 flags 40000
> > 
> > Writes to a hidden/internal file (like the quota file) set the PRIORITY
> > flag in the glock request, which cause the cancels.  I don't understand
> > why gfs would want to do cancels in this case, though, I was very
> > surprised to see that.  So, first job is to figure out why these writes
> > need to do cancels (or if that's the bug right there)
> 
>   Is the uncertainty about whether hidden writes should be prioritised
> or whether the priority should trigger cancels? Sorry, my understanding
> of all this is still a bit cursory.
> 
>   I don't know if this helps, but trawling back through the CVS log
> reveals that the conditions for the call to handle_cancels (now
> do_cancels) in glock_wait_internal was changed from LM_FLAG_NOEXP to
> LM_FLAG_PRIORITY in revision 1.8 for bz133368:
> 
>     http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/gfs-kernel/src/gfs/glock.c?cvsroot=cluster#rev1.8
> 
> apart from that, the code hasn't changed much that I can see.

Heh, thanks a lot for digging that out, I'd been meaning to go back and
see if there was some history for that.  It further confirms what we're
becoming convinced of here, which is that we don't need or want to do
cancels when locking the quota file (which probably amounts to just
removing the PRIORITY flag from the hfile_write function).

The PRIORITY flag has had an ambiguous meaning over the years.  In the cvs
check-in you found, PRIORITY suddenly came to mean "cancel" in addition to
the vague "give this request preference" meaning it held before.  This
correctly fixed bz133368, but wasn't necessary (and was probably
incorrect) for hfile_write's PRIORITY use.  I'm guessing Ken didn't check
for other PRIORITY users, or did and thought it wouldn't make any
difference in this case (it usually doesn't).

We've been looking at how the quota file is locked by gfs, and we can't
see why it would ever need or want to do cancels as part of that.  We
believe that in your case, since you're running gfs_quota on more than one
node, the quota daemon is being delayed trying to lock the quota file,
which makes it likely that the gfs_quota command will come in and cancel
the lock request from the daemon.

We're going to be removing the PRIORITY flag from hfile_write and running
some tests to verify that it works ok.

Dave
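
A standalone sketch of the conditional change being discussed (this is
not the actual gfs glock.c; the flag values here are arbitrary
placeholders):

#include <stdio.h>

#define LM_FLAG_NOEXP     0x01  /* placeholder value */
#define LM_FLAG_PRIORITY  0x02  /* placeholder value */

/* Before rev 1.8: glock_wait_internal() only did cancels for
 * LM_FLAG_NOEXP requests. */
static int should_do_cancels_old(unsigned int flags)
{
    return (flags & LM_FLAG_NOEXP) != 0;
}

/* After rev 1.8 (the fix for bz133368): any LM_FLAG_PRIORITY request
 * triggers do_cancels(), which now includes the quota/hidden-file
 * writes done through hfile_write(). */
static int should_do_cancels_new(unsigned int flags)
{
    return (flags & LM_FLAG_PRIORITY) != 0;
}

int main(void)
{
    unsigned int hfile_write_flags = LM_FLAG_PRIORITY;

    printf("old rule cancels: %d\n", should_do_cancels_old(hfile_write_flags));
    printf("new rule cancels: %d\n", should_do_cancels_new(hfile_write_flags));
    return 0;
}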

Comment 6 David Teigland 2008-03-24 19:35:19 UTC
I think the unresolved question that Abhi and I had last spring was whether
we should just remove the PRIORITY flag from hfile_write (which removes both
the cancels and the higher preference in the dlm), or whether we should remove
the cancels but keep the higher dlm preference for the lock.  I didn't think
that this middle ground was necessary, but thought it may be a little safer
given that it would be a smaller change to the existing behavior, i.e. removing
just the part we don't want (cancels) and leaving the other part.


Comment 7 David Teigland 2008-03-24 19:43:07 UTC
Bob, could you run your tests with no PRIORITY flag in hfile_write?


Comment 8 Robert Peterson 2008-03-24 22:10:22 UTC
It's running now, but it's extremely slow.  A gfs_grow used to
take 10-15 seconds with the PRIORITY flag set, when growing a file
system to 900G.  Without PRIORITY set, it's been over five minutes
and it's only gotten up to 215G (and growing).


Comment 9 Robert Peterson 2008-03-25 00:53:12 UTC
It took probably about an hour, maybe longer, to do the gfs_grow
without the PRIORITY bit set.


Comment 10 David Teigland 2008-03-25 13:41:19 UTC
Created attachment 299022 [details]
patch to try

I'm guessing that you were actively using the fs while trying to grow.
If there's little to no activity, does the grow happen quickly?

With the attached patch, we keep LM_FLAG_PRIORITY so that the lock
priority is kept in the dlm, but add GL_NOCANCEL_OTHER to avoid
canceling other outstanding locks.
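
A standalone model of that approach (not the attached patch itself;
the flag values are placeholders):

#include <assert.h>

#define LM_FLAG_PRIORITY   0x02  /* placeholder value */
#define GL_NOCANCEL_OTHER  0x10  /* placeholder value */

/* LM_FLAG_PRIORITY still asks for cancels of competing requests,
 * unless the caller also set GL_NOCANCEL_OTHER. */
static int should_do_cancels(unsigned int lm_flags, unsigned int gl_flags)
{
    if (!(lm_flags & LM_FLAG_PRIORITY))
        return 0;
    if (gl_flags & GL_NOCANCEL_OTHER)
        return 0;
    return 1;
}

int main(void)
{
    /* gfs_grow-style request: dlm priority kept, cancels suppressed. */
    assert(should_do_cancels(LM_FLAG_PRIORITY, GL_NOCANCEL_OTHER) == 0);
    /* A priority request without the new flag still cancels. */
    assert(should_do_cancels(LM_FLAG_PRIORITY, 0) == 1);
    return 0;
}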

Comment 11 Robert Peterson 2008-03-26 17:16:13 UTC
With this patch, things are back to being fast again.  However, the
problem is still there.  On my third attempt to recreate the problem
I got this:

Trying to join cluster "lock_dlm", "bobs_exxon:exxonfs"
Joined cluster. Now mounting FS...
GFS: fsid=bobs_exxon:exxonfs.1: jid=1: Trying to acquire journal lock...
GFS: fsid=bobs_exxon:exxonfs.1: jid=1: Looking at journal...
GFS: fsid=bobs_exxon:exxonfs.1: jid=1: Done
217 lock 3,1 flags 10 pid 10962 lpflags 0
217 complete sb_status 0 lpflags 0
217 lock 1,3 flags 1 pid 10999 lpflags 0
217 cancel pid 10999 lpflags 80
217 complete sb_status 0 lpflags 8
217 lock 3,1 flags 10 pid 10962 lpflags 8
lock_dlm: exxonfs: gdlm_lock 2,17 err=-16 cur=3 req=5 lkf=3044 flags=80
GFS: fsid=bobs_exxon:exxonfs.1: fatal: assertion "!(lck_ret & LM_OUT_ERROR)" failed
GFS: fsid=bobs_exxon:exxonfs.1:   function = gfs_glock_xmote_th
GFS: fsid=bobs_exxon:exxonfs.1:   file =
/home/bob/cluster/gfs-kernel/src/gfs/glock.c, line = 112
GFS: fsid=bobs_exxon:exxonfs.1:   time = 1206533294
GFS: fsid=bobs_exxon:exxonfs.1: about to withdraw from the cluster
GFS: fsid=bobs_exxon:exxonfs.1: telling LM to withdraw


Comment 12 David Teigland 2008-03-26 17:42:20 UTC
Then the patch isn't working like it's supposed to; there are not supposed to
be any cancels happening.


Comment 13 Robert Peterson 2008-03-26 19:09:54 UTC
My bad.  I screwed up the patch.  With Dave's patch correctly in place,
I was able to run this scenario at least a dozen times with no failures
while df loops were going on simultaneously from two other nodes.
The problem did not recreate.  Now I need to scratch build my cluster
to RHEL4 and try to recreate the problem there.


Comment 14 Robert Peterson 2008-04-07 15:21:44 UTC
After many attempts, I have not been able to recreate a hang in RHEL4.
So I'll defer to Dave as to whether we should back-port it to RHEL4 or
just do the fix in RHEL5.


Comment 15 David Teigland 2008-04-15 18:06:47 UTC
commit in RHEL5 branch: 0847ffdaf607aafd538e949c91eb47f2a06c4335
also pushed to STABLE2: 86a728d286e95b6db8a46ecd8c9b9a6540d2b138



Comment 17 RHEL Program Management 2008-07-31 22:00:22 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 20 errata-xmlrpc 2009-01-20 21:18:19 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0132.html