Bug 595397

Summary: GFS2: stuck in inode wait, no glocks stuck
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.6
Reporter: Robert Peterson <rpeterso>
Assignee: Robert Peterson <rpeterso>
QA Contact: Cluster QE <mspqa-list>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: adas, allen, amark, bmarzins, bturner, cww, dhoward, everett.bennett, james, james.brown, jwest, jwilson, liko, nstraz, rpeterso, rwheeler, snagar, swhiteho, syeghiay, teigland
Target Milestone: rc
Keywords: ZStream
Hardware: All
OS: Linux
Doc Type: Bug Fix
Clone Of: 583737
Last Closed: 2011-01-13 21:33:40 UTC
Bug Blocks: 566771, 582359, 651805    
Attachments:
  Latest patch from 14 June 2010
  Latest patch from 06 Jul 2010

Comment 1 RHEL Program Management 2010-05-24 14:33:55 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 2 Robert Peterson 2010-06-05 20:41:13 UTC
Time for an update on this problem.

I back-ported the RHEL6 patches to RHEL5.6 and, given enough
pressure, I can recreate various problems that did not appear
in RHEL6.0.  The main problem seems to occur in the gfs2-minrg
coherency test.  The symptom is this:

Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: 
 [<0000000000000000>] _stext+0x7ffff000/0x1000
PGD 2041d8067 PUD 20a010067 PMD 0 
Oops: 0010 [1] SMP 
last sysfs file: /devices/pci0000:00/0000:00:1c.5/0000:06:00.0/irq
CPU 1 
Modules linked in: md5 sctp lock_dlm gfs2(U) dlm configfs autofs4 hidp rfcomm l2cap bluetooth dm_log_clustered(U) lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sr_mod joydev tg3 ide_cd serio_raw pata_sil680 i3000_edac i2c_i801 shpchp cdrom i2c_core edac_mc sg pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 3191, comm: dlm_astd Tainted: G      2.6.18-200.el5 #1
RIP: 0010:[<0000000000000000>]  [<0000000000000000>] _stext+0x7ffff000/0x1000
RSP: 0018:ffff8102314bfeb8  EFLAGS: 00010247
RAX: 0000000000000000 RBX: ffff8101fbeec4d8 RCX: 0000000000000001
RDX: ffff810208b63188 RSI: 0000000000000101 RDI: 0000000000200200
RBP: 0000000000000005 R08: ffff8102314be000 R09: ffff810232b75280
R10: 0000000000000003 R11: 0000000000000000 R12: ffffffff888212c1
R13: 0000000000000002 R14: ffff81023149fca8 R15: ffffffff800a07e7
FS:  0000000000000000(0000) GS:ffff8101fffb77c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000204270000 CR4: 00000000000006e0
Process dlm_astd (pid: 3191, threadinfo ffff8102314be000, task ffff810231462080)
Stack:  ffffffff88775238 ffff81023149fcb8 0000000000000000 ffffffff88775137
 ffff81023149fcb8 0000000000000282 ffffffff80032880 0000000000000000
 ffff810231462080 ffffffffffffffff ffffffffffffffff ffffffffffffffff
Call Trace:
 [<ffffffff88775238>] :dlm:dlm_astd+0x101/0x14f
 [<ffffffff88775137>] :dlm:dlm_astd+0x0/0x14f
 [<ffffffff80032880>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a07e7>] keventd_create_kthread+0x0/0xc4

In other words, dlm sends GFS2 a glock callback and that apparently
has a bad pointer.

I've traced the problem back to the fact that gfs2 deletes the
lock_dlm lock structure before the dlm is done using it.

This is _always_ on an iopen glock by the way.

The reason, it turns out, is that function gfs2_clear_inode was
called, destroying the inode and subsequently its iopen glock.
So it's clearly a use-after-free error, and the corrupt pointer is
due to the memory of the lock_dlm lock structure being reused.

Today I backtracked some more and, as I feared, the problem seems
due to the sequence of events when dealing with reclaiming unlinked
dinodes.

Today's instrumentation makes me believe very strongly that we have
a race.  One unknown process holds the inode's iopen glock.  Another
is trying to allocate blocks in the same rgrp and finds a block that
is unlinked and needs to be reclaimed.  When the second process calls
my new function gfs2_process_unlinked_inode, the function does a
LM_FLAG_TRY trylock on the iopen glock, but when that fails with
TRY_FAILED, function gfs2_glock_nq_init, having found an error, calls
uninit on the holder.  The problem is that the gfs2_holder is common
to all users of the ip, also known as ip->i_iopen_gh.  The first
process blocks in the dlm.  The second one's TRY_FAILED attempt nukes
the holder being waited on by the first one's attempt.  When dlm
responds with its callback, the memory has been destroyed.  It almost
seems like there are two scenarios happening here, but they both stem
from this re-use of the iopen holder.  I've written about this
concern before.
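
To make the race concrete, here is a sketch of the helper involved
(simplified from the gfs2_glock_nq_init inline in the RHEL5 glock.h;
treat the exact body as an approximation) and the interleaving
described above:

    static inline int gfs2_glock_nq_init(struct gfs2_glock *gl,
                                         unsigned int state, int flags,
                                         struct gfs2_holder *gh)
    {
            int error;

            gfs2_holder_init(gl, state, flags, gh);
            error = gfs2_glock_nq(gh);
            if (error)
                    gfs2_holder_uninit(gh); /* <-- destroys the holder */
            return error;
    }

    /*
     * The race, assuming both callers pass the holder embedded in the
     * inode (ip->i_iopen_gh) rather than a private one:
     *
     * Process A                        Process B
     * ---------                        ---------
     * gfs2_glock_nq_init(io_gl, ...,
     *                    &ip->i_iopen_gh)
     *   ...blocks in the dlm...
     *                                  gfs2_glock_nq_init(io_gl,
     *                                      LM_FLAG_TRY, ...,
     *                                      &ip->i_iopen_gh)
     *                                    fails with TRY_FAILED
     *                                    -> gfs2_holder_uninit() nukes
     *                                       the holder A is waiting on
     */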

The reason this does not happen in RHEL6 is that in RHEL6 the
lock_dlm layer has been integrated with the glock layer, so the
lock_dlm locking structure isn't relied upon, and the glock will
still exist.

Today I attempted another solution, but that failed and I know why.
I'll craft another solution tomorrow or Monday.  When done,
whatever solution I come up with should be ported back to RHEL6.

Comment 3 Steve Whitehouse 2010-06-07 11:53:43 UTC
Why does the solution require a RHEL6 backport if RHEL6 doesn't have the problem?

Comment 4 Steve Whitehouse 2010-06-10 12:09:59 UTC
The trace in comment #2 has an RIP value of 0. That suggests that this is a call to a NULL function pointer. However, in the dlm ast code, there is a check for that occurring on the two ast/bast functions. In addition, the gfs2 code only ever uses constant values for those functions for all glocks.

This suggests that something else might be wrong, or that this trace isn't giving us all the details. If the functions were NULL then the code in astd should have picked that up and not called the callbacks.

It would be worth adding a check to the astd loop to verify that the ref count on the lkbs is correct. That should catch any use-after-free problems. The same check could also be added into dlm_add_ast in case we can catch it earlier; something along the lines of the sketch below.
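
A rough sketch only -- the dlm refcounts lkbs via the kref lkb_ref,
but the helper name and the exact hook points here are illustrative,
not the dlm's:

    /* hypothetical debug check: complain if astd is about to deliver
     * a callback for an lkb whose refcount has already hit zero */
    static void check_lkb_ref(struct dlm_lkb *lkb)
    {
            if (atomic_read(&lkb->lkb_ref.refcount) <= 0)
                    printk(KERN_ERR "dlm: (b)ast on dead lkb %p\n", lkb);
    }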

I guess we should also add a check for NULL parameters being provided to the ast/bast functions too.

If you get any better/different stack traces, please upload those to the bz too.

Comment 5 Robert Peterson 2010-06-10 13:27:07 UTC
Yesterday, for a while, I thought the problem was due to the fact
that the i_iopen_gh holder was being used half of the time without
the TRY flag (via gfs2_inode_lookup) and half of the time with the
TRY flag (by gfs2_process_unlinked_inode), so the TRY bit was being
turned on and off while multiple glock operations were taking place.
As an experiment, I changed the lock in gfs2_process_unlinked_inode
to not have the TRY flag.  The failing scenario ran successfully
about eight times (each run being about 10 to 15 minutes), but then
it failed with the same error.  So the problem isn't the TRY flag.

I added instrumentation to dlm to see if it was perhaps using a
pointer after it was done with it, but dlm passed the test with
flying colors: the same corruption occurred with the dlm having
apparently done nothing wrong (that I could tell with that
instrumentation).  Basically, I made the dlm completely destroy
its lkb pointer whenever it was freed, so accesses to it after
the lkb was freed should have triggered a gpf or something.
This was a complex, pervasive patch that yielded no results, but
at least it told me what's _not_ happening.

This is similar to a patch I did a couple weeks ago where I added
checks to the dlm to make sure lvbs were not overwritten.  That
ended the same way: dlm seems innocent.

Before that patch, I had another that verified in the dlm that
everything was set properly in the lkb.  Then a similar one in
gfs2 that made sure the glock was proper before submitting the
request to dlm.  Then another in lock_dlm that made sure the
bast address had not changed from its original address in various
places (it was only different just before the failing bast call),
and Dave did a similar patch for the dlm.

So the big question is: why would dlm make a call to the bast
function for a freed gdlm_lock?  Instrumentation has all but
proven the gdlm_lock associated with the call was set correctly,
locked correctly, unlocked correctly and then freed properly.
I can even tell you the path by which it was freed:

   gfs2_clear_inode
      gfs2_glock_dq_uninit(&ip->i_iopen_gh);
         gfs2_glock_put
            glock_free
               gfs2_lm_put_lock
                  lm_put_lock
                     gdlm_put_lock
                        kfree the lock_dlm structure

This seems to be a case of memory corruption, because the gdlm_lock
structure getting the bast call indicates (via instrumentation) that
the memory was already freed.  The corruption is always in the
gdlm_lock structure for an inode's iopen glock.  All attempts to
detect corruption before the dlm call have yielded no results.

Comment 6 Steve Whitehouse 2010-06-10 13:44:02 UTC
We do have some other clues though, in that it might be the same thing as that reported in bug #568102 comment #14. I've been trying to figure out what possible problems might cause that symptom and whether it might be the same problem as we are seeing here.

Looking at the above call graph, I'd suggest that the bug might simply be not waiting for the reply to the dlm demote request before freeing the lock_dlm structure. At least I can't see where we wait for that here.

In RHEL6 at least we know that we don't free the glock until the final reply from the DLM telling us that the lock is ready to be freed. I'd suggest poisoning the structure when it's freed so that it will be obvious if it is a freed structure that's being dereferenced. The slab debugging code should be able to do that for you, or just memset it to some obvious value.
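
A minimal poison-on-free sketch of the sort being suggested (the
helper name is illustrative; 0x66 is the value Bob later used, per
comment #9):

    /* fill the structure with an obvious pattern before freeing it,
     * so a stale dereference fails loudly instead of silently
     * reading reused memory */
    static void gdlm_free_lp(struct gdlm_lock *lp)
    {
            memset(lp, 0x66, sizeof(*lp));
            kfree(lp);
    }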

Comment 7 Robert Peterson 2010-06-10 22:47:54 UTC
Using some more complex and creative instrumentation, I might
have made a breakthrough here.  I set a mousetrap looking for
places where we delete a gdlm_lock while there's still something
queued on its lkb's lkb_astqueue.  It took a long time to
recreate and there were a lot of false positives, but finally
I got a failure.  Right before the failure, I got what seemed
to be a positive.  The "failing" call trace looked like this:


Call Trace:
 [<ffffffff8869f695>] :lock_dlm:gdlm_delete_lp+0x118/0x136
 [<ffffffff88624410>] :gfs2:glock_free+0x20/0x3f
 [<ffffffff886247f2>] :gfs2:gfs2_glock_put+0xe5/0x111
 [<ffffffff80022e20>] clear_inode+0xd2/0x123
 [<ffffffff80039d99>] generic_drop_inode+0x146/0x15a
 [<ffffffff886283d7>] :gfs2:gfs2_process_unlinked_inode+0x1be/0x1c7
 [<ffffffff88623e76>] :gfs2:gfs2_glock_schedule_for_reclaim+0x5d/0x9a
 [<ffffffff8863902e>] :gfs2:gfs2_inplace_reserve_i+0x663/0x6f4
 [<ffffffff8012b77b>] selinux_capable+0x15/0x2a
 [<ffffffff8862ff6e>] :gfs2:gfs2_page_mkwrite+0x13e/0x2dd
 [<ffffffff8000aadf>] get_page_from_freelist+0x378/0x43a
 [<ffffffff8862fe90>] :gfs2:gfs2_page_mkwrite+0x60/0x2dd
 [<ffffffff80008b59>] __handle_mm_fault+0x3f6/0xfaa
 [<ffffffff886251bb>] :gfs2:run_queue+0xad/0x18a
 [<ffffffff88623e76>] :gfs2:gfs2_glock_schedule_for_reclaim+0x5d/0x9a
 [<ffffffff80066b55>] do_page_fault+0x4cb/0x874
 [<ffffffff8001cd14>] vma_link+0xd0/0xfd
 [<ffffffff8000e129>] do_mmap_pgoff+0x615/0x780
 [<ffffffff8005dde9>] error_exit+0x0/0x84

 [<ffffffff8004d7eb>] run_workqueue+0x99/0xf6
 [<ffffffff8004a021>] worker_thread+0x0/0x122
 [<ffffffff800a0ac1>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8004a111>] worker_thread+0xf0/0x122
 [<ffffffff8008cf9f>] default_wake_function+0x0/0xe
 [<ffffffff800a0ac1>] keventd_create_kthread+0x0/0xc4
lock_dlm: fatal assertion failed "lp->gdlm_magic == GDLM_MAGIC"
lock_dlm:  gdlm_lock hosed 7e1df400 1,736c lp=ffff81007519c1c0 lkb=ffff81005d52de78

Using objdump, I isolated the instruction pointer (RIP) to this
statement in function gfs2_process_unlinked_inode:
	error = gfs2_glock_nq_init(ip->i_gl, LM_ST_EXCLUSIVE, LM_FLAG_TRY,
				   &gh);

So here's my latest theory:
1. Function gfs2_process_unlinked_inode queues up a TRY lock
   request for an inode's i_iopen glock.
2. Next, it queues up a TRY lock for the i_gl glock.
3. The second glock fails with -EAGAIN (TRYFAILED), which
   causes the error path to immediately dequeue the holder
   for the i_iopen glock.
4. The holder is freed while there's still a bast queued on dlm's
   queue for the lkb.
5. The memory is reused by some process, thus overwriting the
   memory with trash (and it has the right to do so).
6. The dlm takes the bast off its queue and tries to call the
   bast function, but now the memory for the valid lkb points to
   the freed memory, and thus the bast address is hosed, so it
   chokes.

I don't know all the ins and outs of dlm's flags and how this
combination is supposed to work.  According to Chrissie's dlm
user's guide:

If the lock manager cannot grant your lock request, it normally adds your request to the end of the wait queue,
along with all other blocked lock requests on the lock resource. However, you can specify that the lock
manager not queue your request if it cannot be granted immediately by specifying the LKF_NOQUEUE flag
as an argument to the lock routine.
If your lock request cannot be granted immediately, the dlm_lock routine returns the status 0 and the AST is
queued with the status EAGAIN in the status field of the lock status block.

And according to the comments in one of the header files:

 * DLM_LKF_NOQUEUEBAST
 *
 * Send blocking AST's before returning -EAGAIN to the caller.  It is only
 * used along with the NOQUEUE flag.  Blocking AST's are not sent for failed
 * NOQUEUE requests otherwise.
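
For reference, lock_dlm maps gfs2's two "try" variants onto dlm flags
roughly as follows (paraphrased from lock_dlm's make_flags(); a
sketch, not the verbatim source):

    /* excerpt-style: lkf accumulates the dlm flags for the request */
    if (flags & LM_FLAG_TRY)
            lkf |= DLM_LKF_NOQUEUE;
    if (flags & LM_FLAG_TRY_1CB)
            lkf |= DLM_LKF_NOQUEUE | DLM_LKF_NOQUEUEBAST;

So with plain LM_FLAG_TRY the current holders never see a blocking
callback for the failed request, whereas TRY_1CB delivers exactly one
bast to them before the -EAGAIN comes back.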

Based on these findings I've made a patch that changes the glock
call in function gfs2_process_unlinked_inode from LM_FLAG_TRY to
LM_FLAG_TRY_1CB.  So far I haven't hit any failures.  I need to
do more investigation because I'm not sure if this is the
"right thing to do."  I need to understand the nuances of the
dlm and TRY locks better before I'll feel comfortable with this.
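
The change under test amounts to swapping the flag in the statement
quoted above (shown against that statement for illustration; the
final patch may apply the flag to other glock calls as well):

	error = gfs2_glock_nq_init(ip->i_gl, LM_ST_EXCLUSIVE,
				   LM_FLAG_TRY_1CB, &gh);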

Tomorrow I may just make this change to the uninstrumented
version of gfs2 and see if the problem goes away, but in the
meantime I need to speak with Steve Whitehouse and Dave Teigland
about how this is "supposed" to work.

Adding Dave to the cc list.

Comment 8 Steve Whitehouse 2010-06-11 09:34:26 UTC
I think there is probably some confusion in comment #7. The trace shows generic_drop_inode calling clear_inode. There is no suggestion that I can see which might point to there being a call to delete_inode. It probably doesn't make a lot of difference whether there was or not though, since we'd land up in clear_inode anyway.

From the traces I've seen so far, the problem has been in the ast, not the bast, so I think that #4 in your "So here's my latest theory:" list is wrong unless there is some other evidence.

If we have a problem with using memory which has been freed, then we need to ensure that we don't free that memory until after whatever is using it has finished. Although adding the _1CB might be harmless in this case, it won't actually do anything other than change the timing.

I've not seen any evidence so far that my theory of comment #6 (2nd paragraph) is wrong, so I still think that is the best bet.

Comment 9 Robert Peterson 2010-06-11 17:11:42 UTC
Regarding comment #8:

The problem is, in fact, detected on a bast, not an ast.
The evidence you're looking for is this: the corrupt gdlm_lock is
first noticed by instrumentation I added to function gdlm_bast, and
not by any other function's instrumentation.  And gdlm_bast is called
only from the loop where the dlm processes its lkb_astqueue, which
carries both ast and bast requests.

A few days ago, I took the suggestion given in comment #6 and
memsetted the entire gdlm_lock to 0x66 just before freeing it.
There were absolutely no references to the gdlm_lock by either
lock_dlm or gfs2 after it was freed, or it would have segfaulted.

The code I'm running has numerous calls to a "check_integrity"
function that verifies the gdlm_lock is not corrupt any time
something significant happens to it, and none of these triggers.

The very first report of the corruption comes from function gdlm_bast
and it reports that, when it was called by dlm, the structure is
now filled with 0x66.  That can only mean the dlm has called it for
an lkb that's still left on its lkb_astqueue for which the
gdlm_lock has been freed.

Before this memset was added, the check_integrity function reported
the gdlm_lock's lock numbers, and they normally started with "5,"
meaning that before it was freed, it used to correspond to an iopen
glock.

The most suspicious thing I've found in the code is the fact that
dlm protects its ast queue with a spinlock called ast_queue_lock,
and yet other functions mess with the queue without that spinlock
held, but with proc->asts_spin held instead.  Not with the global
queue directly, because it's a variable declared in ast.c, but with
individual lkbs' lkb_astqueue list heads.  This may be nothing.  But,
for example, dlm function dlm_user_add_ast and others manipulate some
lkb's lkb_astqueue, and hopefully the lkbs in question are not queued
on the global queue when they do.

So I stand by my comments in comment #7.

Comment 10 Steve Whitehouse 2010-06-14 09:10:08 UTC
Then it is very much starting to sound like you've simply run into the same dlm bug as we had in RHEL6 (now fixed) with reordering, which has yet to be backported to RHEL5. Perhaps we should have been a bit more proactive in pushing for that backport :(

The two different locks are due to different locking schemes according to whether the (b)ast is being delivered to userland or to the kernel. I think the locking is ok on the kernel side, but I've not been through the userland side of things in any detail.

Comment 11 David Teigland 2010-06-14 15:59:15 UTC
The cast/bast ordering problems were fixed in RHEL6 under bug 568102.

And Steve is correct, the proc->asts_spin locking is only used for userland locks, and wouldn't be a factor for gfs locks.

Comment 12 Robert Peterson 2010-06-14 22:06:56 UTC
Created attachment 423982 [details]
Latest patch from 14 June 2010

I figured out what the problem was, at least the one I was
stuck on.  The problem was related to the fact that the new
function, gfs2_process_unlinked_inode, was using a TRY lock
to try to hold the iopen and i_gl glocks for an unlinked
dinode to be reclaimed.  In the failing case, the try lock
was succeeding for the iopen glock but failing for the i_gl
glock.  The error code path ended up calling function
gfs2_glock_put twice for the iopen glock: once for the normal
put and once for the error-exit put.  This is still the
case (and can be seen) in today's function gfs2_inode_lookup,
but since it doesn't use try locks, the glock is a lot less
likely to error out.  We may still want to fix it.
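
In outline, the bad error path looked something like this (a
reconstruction from the description above, with the lock states and
flags assumed, not copied from the actual code):

    error = gfs2_glock_nq_init(io_gl, LM_ST_SHARED, LM_FLAG_TRY,
                               &ip->i_iopen_gh);
    if (error)
            goto fail_iopen;
    gfs2_glock_put(io_gl);          /* normal put: the holder keeps its ref */

    error = gfs2_glock_nq_init(ip->i_gl, LM_ST_EXCLUSIVE, LM_FLAG_TRY, &gh);
    if (error)
            goto fail_iopen;        /* TRY failed under contention... */
    ...
    fail_iopen:
            gfs2_glock_put(io_gl);  /* ...and io_gl gets put a second time */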

So here is my latest patch, which includes a better error path
for gfs2_process_unlinked_inode (gfs2_inode_lookup still has the
issue; we may want to change it, and I'd like to discuss that
with Steve Whitehouse).

This patch doesn't include some of my other code changes, for
example, the one to protect the iopen glock from being
reinitialized by a failed glock_nq_init.  It has some of my
changes though.  I'd like Steve and possibly others to review
the patch to see which changes they deem unwanted/unnecessary.

I've run this version a few times without error, and it should
have failed if the problem was still there.  I did, however,
get other errors, which are known problems.  Another problem is
that one of my nodes reported hardware errors: looks like I may
have a bad hard disk to complicate matters.

Comment 13 Steve Whitehouse 2010-06-15 07:49:02 UTC
Yes, I think we should fix gfs2_inode_lookup as well. Anyway, that sounds good and I can't see any issues with the patch. If it is working ok, then let's go with that and get this one posted.

Comment 14 Robert Peterson 2010-06-23 14:47:20 UTC
*** Bug 582359 has been marked as a duplicate of this bug. ***

Comment 15 Robert Peterson 2010-07-06 22:52:32 UTC
Created attachment 429927 [details]
Latest patch from 06 Jul 2010

This patch fixes three bugs having to do with the reclaiming
of unlinked dinodes.  The three were discovered during RHEL6
testing and reported as these bugzilla numbers:

bug #570182 - GFS2: glock held by ended process
bug #583737 - GFS2: stuck in inode wait, no glocks stuck
bug #604244 - GFS2: kernel NULL pointer dereference from dlm_astd

This patch is a combined version of three patches that already
appear upstream in the -nmw git repository and in the RHEL6 kernel.

Due to the serious nature of the problems, the patches were
back-ported to RHEL5 for inclusion in the RHEL5.6 kernel under
the bugzilla bug #595397.

The first bug #570182 was caused by GFS2 not following its documented
lock ordering rules while reclaiming unlinked dinodes.  The fix
was to split the gfs2_inode_lookup function into a separate function
just for reclaim: gfs2_process_unlinked_inode.  The new function
follows the lock ordering rules properly by doing two things:
first, by using a "try" lock during the lookup; and second, by
returning the dinode block number to a higher-level function that
can deal with the inode outside the locks that were causing deadlock
(rather than trying to do the locking while the other locks were held).

The second bug #583737 was caused by a problem with the patch
for the first bug #570182 (this attachment is the combined patch).
Basically, the first patch nearly closed the timing window, but
there was still a case where it could hang.

The third bug #604244 was due to a problem with the second patch.
Basically, if the "try" lock failed due to lock contention, the
glock was improperly released twice, which caused dlm_astd to
reference a pointer to freed storage.  Now we set the freed pointer
to NULL so that it's not unlocked a second time.
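
The fix for this third bug reduces to clearing the local pointer once
ownership of the reference has passed to the holder, so the error
path cannot put the glock again (a sketch of the shape described
above, assuming an io_gl local as in the sketch under comment #12,
not the literal patch):

    gfs2_glock_put(io_gl);
    io_gl = NULL;                   /* ref now owned by ip->i_iopen_gh */
    ...
    fail_iopen:
            if (io_gl)
                    gfs2_glock_put(io_gl);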

This patch was tested and found to be correct on the roth-0{1,2,3}
cluster using the brawl test.

I posted it to rhkernel-list for inclusion into the 5.6 kernel.

Comment 16 Everett Bennett 2010-07-07 12:56:51 UTC
Is this patch available for testing in RHEL 5.5 x86_64?  If so, how may I obtain it?

Comment 17 Robert Peterson 2010-07-07 14:14:44 UTC
I put a compiled x86_64 version for RHEL5.x on my people page
if you want to try it.  This has not undergone our rigorous
file system testing process, so there are absolutely no guarantees:

http://people.redhat.com/rpeterso/Experimental/RHEL5.x/gfs2/gfs2.ko

Comment 19 Jarod Wilson 2010-07-12 15:46:52 UTC
in kernel-2.6.18-206.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 21 Robert Peterson 2010-08-10 16:27:58 UTC
Adding James to the cc list.  There's really no reason this
bz needs to be private; I think maybe that's the default.
James: comment #15 should be of particular interest to you.

Comment 24 Ben Turner 2010-11-10 15:45:42 UTC
*** Bug 645563 has been marked as a duplicate of this bug. ***

Comment 30 errata-xmlrpc 2011-01-13 21:33:40 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html