This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Time for an update on this problem. I back-ported the RHEL6 patches to RHEL5.6 and, given enough pressure, I can recreate various problems that did not appear in RHEL6.0. The main problem seems to occur in the gfs2-minrg coherency test. The symptom is this:

Unable to handle kernel NULL pointer dereference at 0000000000000000
RIP: [<0000000000000000>] _stext+0x7ffff000/0x1000
PGD 2041d8067 PUD 20a010067 PMD 0
Oops: 0010 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:1c.5/0000:06:00.0/irq
CPU 1
Modules linked in: md5 sctp lock_dlm gfs2(U) dlm configfs autofs4 hidp rfcomm l2cap bluetooth dm_log_clustered(U) lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sr_mod joydev tg3 ide_cd serio_raw pata_sil680 i3000_edac i2c_i801 shpchp cdrom i2c_core edac_mc sg pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 3191, comm: dlm_astd Tainted: G 2.6.18-200.el5 #1
RIP: 0010:[<0000000000000000>] [<0000000000000000>] _stext+0x7ffff000/0x1000
RSP: 0018:ffff8102314bfeb8 EFLAGS: 00010247
RAX: 0000000000000000 RBX: ffff8101fbeec4d8 RCX: 0000000000000001
RDX: ffff810208b63188 RSI: 0000000000000101 RDI: 0000000000200200
RBP: 0000000000000005 R08: ffff8102314be000 R09: ffff810232b75280
R10: 0000000000000003 R11: 0000000000000000 R12: ffffffff888212c1
R13: 0000000000000002 R14: ffff81023149fca8 R15: ffffffff800a07e7
FS: 0000000000000000(0000) GS:ffff8101fffb77c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000204270000 CR4: 00000000000006e0
Process dlm_astd (pid: 3191, threadinfo ffff8102314be000, task ffff810231462080)
Stack: ffffffff88775238 ffff81023149fcb8 0000000000000000 ffffffff88775137
 ffff81023149fcb8 0000000000000282 ffffffff80032880 0000000000000000
 ffff810231462080 ffffffffffffffff ffffffffffffffff ffffffffffffffff
Call Trace:
 [<ffffffff88775238>] :dlm:dlm_astd+0x101/0x14f
 [<ffffffff88775137>] :dlm:dlm_astd+0x0/0x14f
 [<ffffffff80032880>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a07e7>] keventd_create_kthread+0x0/0xc4

In other words, dlm sends GFS2 a glock callback and that callback apparently has a bad pointer. I've backtracked the problem to the fact that gfs2 has deleted the lock_dlm lock structure before the dlm is done using it. This is _always_ on an iopen glock, by the way. The reason, it turns out, is that function gfs2_clear_inode was called, destroying the inode and subsequently its i_open glock, so it's clearly a use-after-free error: the corrupt pointer is due to the memory of the lock_dlm lock structure being reused. Today I backtracked some more and, as I feared, the problem seems due to the sequence of events when dealing with reclaiming unlinked dinodes. Today's instrumentation makes me believe very strongly that we have a race: one unknown process holds the inode's i_open glock, while another is trying to allocate blocks in the same rgrp and finds a block that is unlinked and needs to be reclaimed.
When the second process calls my new function gfs2_process_unlinked_inode, the function does an LM_FLAG_TRY trylock on the i_open glock, but when that fails with TRY_FAILED, function gfs2_glock_nq_init, having found an error, calls uninit on the holder. The problem is that the gfs2_holder is common to all users of the ip, also known as ip->i_iopen_gh. The first process blocks in the dlm. The second one's TRY_FAILED attempt nukes the holder being waited on by the first one's attempt. When dlm responds with its callback, the memory has been destroyed.

It almost seems like there are two scenarios happening here, but they both stem from this re-use of the i_open holder. I've written about this concern before.

The reason this does not happen in RHEL6 is that in RHEL6 lock_dlm has been integrated with the glock layer, so the lock_dlm locking structure isn't relied upon and the glock will still exist.

Today I attempted another solution, but that failed and I know why. I'll craft another solution tomorrow or Monday. When done, whatever solution I come up with should be ported back to RHEL6.
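For reference, the reason a failed TRY attempt tears down the shared holder is visible in the shape of gfs2_glock_nq_init itself. A simplified sketch of that helper (close to the real inline in the glock header, but treat it as illustrative rather than a verbatim quote):

static inline int gfs2_glock_nq_init(struct gfs2_glock *gl,
                                     unsigned int state, int flags,
                                     struct gfs2_holder *gh)
{
        int error;

        /* (re)initializes the holder in place -- even if it is the shared
           ip->i_iopen_gh that another request may already be using */
        gfs2_holder_init(gl, state, flags, gh);

        error = gfs2_glock_nq(gh);      /* queue the request; TRY requests may fail */
        if (error)
                gfs2_holder_uninit(gh); /* error path tears the holder down again */

        return error;
}

Because ip->i_iopen_gh is embedded in the gfs2_inode and shared by every path that locks the iopen glock, the gfs2_holder_uninit() in that error branch can wipe out a holder that a concurrent, still-blocked request is depending on, which is exactly the re-use described above.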
Why does the solution require a RHEL6 backport if RHEL6 doesn't have the problem?
The trace in comment #2 has an RIP value of 0. That suggests that this is a call to a NULL function pointer. However, in the dlm ast code there is a check for that on the two ast/bast functions, and the gfs2 code only ever uses constant values for those functions for all glocks. This suggests that something else might be wrong, or that this trace isn't giving us all the details. If the functions were NULL, then the code in astd should have picked that up and not called the callbacks.

It would be worth adding a check to the astd loop to verify that the ref count on the lkbs is correct. That should catch any use-after-free problems. The same check could also be added to dlm_add_ast in case we can catch it earlier. We should probably also add a check for NULL parameters being passed to the ast/bast functions. If you get any better/different stack traces, please upload those to the bz too.
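To make that concrete, here is a minimal sketch of the kind of defensive check I mean. The dlm_lkb field names used here (lkb_ref, lkb_astaddr, lkb_bastaddr) are assumptions made for the illustration and may not match the RHEL5 dlm source exactly:

#include <linux/kernel.h>

/*
 * Sketch only: a sanity check astd could run before delivering a callback,
 * and that dlm_add_ast could run at queue time.  Field names are
 * illustrative assumptions, not quotes of the dlm code.
 */
static int lkb_looks_sane(struct dlm_lkb *lkb)
{
        /* a freed or recycled lkb should trip at least one of these */
        if (atomic_read(&lkb->lkb_ref) <= 0) {
                printk(KERN_ERR "dlm: lkb %p has a bad refcount\n", lkb);
                return 0;
        }
        if (lkb->lkb_astaddr == NULL && lkb->lkb_bastaddr == NULL) {
                printk(KERN_ERR "dlm: lkb %p has no ast/bast set\n", lkb);
                return 0;
        }
        return 1;
}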
Yesterday, for a while, I thought the problem was due to the fact that the i_iopen_gh holder was being used half of the time without the TRY flag (via gfs2_inode_lookup) and half of the time with the TRY flag (by gfs2_process_unlinked_inode). The TRY bit was getting turned on and off while multiple glock operations were taking place. As an experiment, I changed the lock in gfs2_process_unlinked_inode to not have the TRY flag. The failing scenario ran successfully about eight times (each run being about 10 to 15 minutes), but then it failed with the same error. So the problem isn't the TRY flag.

I added instrumentation to dlm to see if it was perhaps using a pointer after it was done with it, but dlm passed the test with flying colors: the same corruption occurred with the dlm having apparently done nothing wrong (that I could tell with that instrumentation). Basically, I made the dlm completely destroy its lkb pointer whenever it was freed, so any access to the lkb after it was freed should have triggered a GPF or something. This was a complex, pervasive patch that yielded no results, but at least it told me what's _not_ happening. This is similar to a patch I did a couple of weeks ago where I added checks to the dlm to make sure lvbs were not overwritten. That ended the same way: dlm seems innocent. Before that patch, I had another that verified in the dlm that everything was set properly in the lkb. Then a similar one in gfs2 that made sure the glock was proper before submitting the request to dlm. Then another in lock_dlm that made sure the bast address had not changed from its original address in various places (it was only different just before the failing bast call), and Dave did a similar patch for the dlm.

So the big question is: why would dlm make a call to the bast function for a freed gdlm_lock? Instrumentation has all but proven that the gdlm_lock associated with the call was set correctly, locked correctly, unlocked correctly and then freed properly. I can even tell you the path by which it was freed:

gfs2_clear_inode
   gfs2_glock_dq_uninit(&ip->i_iopen_gh);
      gfs2_glock_put
         glock_free
            gfs2_lm_put_lock
               lm_put_lock
                  gdlm_put_lock
                     kfree of the lock_dlm structure

This seems to be a case of memory corruption, because the gdlm_lock structure getting the bast call indicates (via instrumentation) that the memory was already freed. The corruption is always in the gdlm_lock structure for an i_iopen inode. All attempts to detect corruption before the dlm call have yielded no results.
We do have some other clues, though, in that it might be the same thing as reported in bug #568102 comment #14. I've been trying to figure out what possible problems might cause that symptom and whether it might be the same problem as we are seeing here. Looking at the above call graph, I'd suggest that the bug might simply be not waiting for the reply to the dlm demote request before freeing the lock_dlm structure; at least I can't see where we wait for that here. In RHEL6, at least, we know that we don't free the glock until the final reply from the DLM telling us that the lock is ready to be freed. I'd suggest poisoning the structure when it's freed so that it will be obvious if it is a freed structure that's being dereferenced. The slab debugging code should be able to do that for you, or just memset it to some obvious value.
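Something along these lines would do (a sketch only; the helper name and poison value are made up for illustration):

#include <linux/slab.h>
#include <linux/string.h>

#define GDLM_FREE_POISON 0x6b   /* any obvious, recognizable byte value */

/*
 * Illustrative only: poison the gdlm_lock before freeing it so any later
 * dereference shows up as an obviously bogus value.  The helper name is
 * made up; CONFIG_DEBUG_SLAB's object poisoning gives a similar effect
 * without code changes.
 */
static void gdlm_free_lp(struct gdlm_lock *lp)
{
        memset(lp, GDLM_FREE_POISON, sizeof(*lp));
        kfree(lp);
}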
Using some more complex and creative instrumentation, I might have made a breakthrough here. I set a mousetrap looking for places where we delete a gdlm_lock while there's still something queued on its lkb's lkb_astqueue. It took a long time to recreate and there were a lot of false positives, but finally I got a failure. Right before the failure, I got what seemed to be a positive. The "failing" call trace looked like this:

Call Trace:
 [<ffffffff8869f695>] :lock_dlm:gdlm_delete_lp+0x118/0x136
 [<ffffffff88624410>] :gfs2:glock_free+0x20/0x3f
 [<ffffffff886247f2>] :gfs2:gfs2_glock_put+0xe5/0x111
 [<ffffffff80022e20>] clear_inode+0xd2/0x123
 [<ffffffff80039d99>] generic_drop_inode+0x146/0x15a
 [<ffffffff886283d7>] :gfs2:gfs2_process_unlinked_inode+0x1be/0x1c7
 [<ffffffff88623e76>] :gfs2:gfs2_glock_schedule_for_reclaim+0x5d/0x9a
 [<ffffffff8863902e>] :gfs2:gfs2_inplace_reserve_i+0x663/0x6f4
 [<ffffffff8012b77b>] selinux_capable+0x15/0x2a
 [<ffffffff8862ff6e>] :gfs2:gfs2_page_mkwrite+0x13e/0x2dd
 [<ffffffff8000aadf>] get_page_from_freelist+0x378/0x43a
 [<ffffffff8862fe90>] :gfs2:gfs2_page_mkwrite+0x60/0x2dd
 [<ffffffff80008b59>] __handle_mm_fault+0x3f6/0xfaa
 [<ffffffff886251bb>] :gfs2:run_queue+0xad/0x18a
 [<ffffffff88623e76>] :gfs2:gfs2_glock_schedule_for_reclaim+0x5d/0x9a
 [<ffffffff80066b55>] do_page_fault+0x4cb/0x874
 [<ffffffff8001cd14>] vma_link+0xd0/0xfd
 [<ffffffff8000e129>] do_mmap_pgoff+0x615/0x780
 [<ffffffff8005dde9>] error_exit+0x0/0x84
 [<ffffffff8004d7eb>] run_workqueue+0x99/0xf6
 [<ffffffff8004a021>] worker_thread+0x0/0x122
 [<ffffffff800a0ac1>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8004a111>] worker_thread+0xf0/0x122
 [<ffffffff8008cf9f>] default_wake_function+0x0/0xe
 [<ffffffff800a0ac1>] keventd_create_kthread+0x0/0xc4

lock_dlm: fatal assertion failed "lp->gdlm_magic == GDLM_MAGIC"
lock_dlm: gdlm_lock hosed 7e1df400 1,736c lp=ffff81007519c1c0 lkb=ffff81005d52de78

Using objdump, I isolated the eip to this statement in function gfs2_process_unlinked_inode:

error = gfs2_glock_nq_init(ip->i_gl, LM_ST_EXCLUSIVE, LM_FLAG_TRY, &gh);

So here's my latest theory:

1. Function gfs2_process_unlinked_inode queues up a TRY lock request for an inode's i_iopen glock.
2. Next, it queues up a TRY lock for the i_gl glock.
3. The second glock fails with -EAGAIN (TRYFAILED), which causes the gfs2_glock_nq_init to dequeue the holder for the i_iopen glock immediately.
4. The holder is freed while there's still a bast queued on dlm's queue for the lkb.
5. The memory is reused by some process, thus overwriting the memory with trash (and it has the right to do so).
6. The dlm takes the bast off its queue and tries to call the bast function, but now the memory for the valid lkb points to the freed memory, and thus the bast address is hosed, so it chokes.

I don't know all the ins and outs of dlm's flags and how this combination is supposed to work. According to Chrissie's dlm user's guide:

If the lock manager cannot grant your lock request, it normally adds your request to the end of the wait queue, along with all other blocked lock requests on the lock resource. However, you can specify that the lock manager not queue your request if it cannot be granted immediately by specifying the LKF_NOQUEUE flag as an argument to the lock routine. If your lock request cannot be granted immediately, the dlm_lock routine returns the status 0 and the AST is queued with the status EAGAIN in the status field of the lock status block.
And according to the comments in one of the header files:

 * DLM_LKF_NOQUEUEBAST
 *
 * Send blocking AST's before returning -EAGAIN to the caller. It is only
 * used along with the NOQUEUE flag. Blocking AST's are not sent for failed
 * NOQUEUE requests otherwise.

Based on these findings I've made a patch that changes the glock call in function gfs2_process_unlinked_inode from LM_FLAG_TRY to LM_FLAG_TRY_1CB. So far I haven't hit any failures. I need to do more investigation because I'm not sure if this is the "right thing to do." I need to understand the nuances of the dlm and TRY locks better before I'll feel comfortable with this. Tomorrow I may just make this change to the uninstrumented version of gfs2 and see if the problem goes away, but in the meantime I need to speak with Steve Whitehouse and Dave Teigland about how this is "supposed" to work. Adding Dave to the cc list.
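For reference, the experiment amounts to a one-flag change to the statement quoted in comment #8, roughly (a sketch of the experimental change, not a final patch):

-        error = gfs2_glock_nq_init(ip->i_gl, LM_ST_EXCLUSIVE, LM_FLAG_TRY, &gh);
+        error = gfs2_glock_nq_init(ip->i_gl, LM_ST_EXCLUSIVE, LM_FLAG_TRY_1CB, &gh);

The idea is that lock_dlm translates LM_FLAG_TRY_1CB into DLM_LKF_NOQUEUE plus DLM_LKF_NOQUEUEBAST, whereas plain LM_FLAG_TRY presumably maps to NOQUEUE alone, so the current holder at least receives a blocking callback before the request comes back with -EAGAIN.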
I think there is probably some confusion in comment #7. The trace shows generic_drop_inode calling clear_inode. There is no suggestion that I can see which might point to there being a call to delete_inode. It probably doesn't make a lot of difference whether there was or not, though, since we'd land up in clear_inode anyway. From the traces I've seen so far, the problem has been in the ast, not the bast, so I think that #4 in your "So here's my latest theory:" is wrong unless there is some other evidence. If we have a problem with using memory which has been freed, then we need to ensure that we don't free that memory until after whatever is using it has finished. Although adding the _1CB might be harmless in this case, it won't actually do anything other than change the timing. I've not seen any evidence so far that my theory of comment #6 (2nd paragraph) is wrong, so I still think that is the best bet.
Regarding comment #8: The problem is, in fact, detected on a bast, not an ast. The evidence you're looking for is this: the corrupt gdlm_lock is first noticed by instrumentation I added to function gdlm_bast and not by any other function's instrumentation. And gdlm_bast is called only when dlm calls from the loop where it processes things on its lkb_astqueue, which includes both ast and bast locks.

A few days ago, I took the suggestion given in comment #6 and memset the entire gdlm_lock to 0x66 just before freeing it. There were absolutely no references to the gdlm_lock by either lock_dlm or gfs2 after it was freed, or it would have segfaulted. The code I'm running has numerous calls to a "check_integrity" function that verifies the gdlm_lock is not corrupt any time something significant happens to it, and none of these triggered. The very first report of the corruption comes from function gdlm_bast, and it reports that, when it was called by dlm, the structure is now filled with 0x66. That can only mean the dlm has called it for an lkb that's still left on its lkb_astqueue for which the gdlm_lock has been freed. Before this memset was added, the check_integrity function reported the gdlm_lock's lock numbers, and they normally started with "5," meaning that before it was freed, it used to correspond to an i_open glock.

The most suspicious thing I've found in the code is the fact that dlm protects its lkb_astqueue with a spinlock called ast_queue_lock, and yet other functions mess with the queue without that spinlock held, but with proc->asts_spin held instead. Not with the queue directly, because it's a global variable declared in ast.c, but with some lkb's lkb_astqueue. This may be nothing. But for example, dlm function dlm_user_add_ast and others mess with some lkb's lkb_astqueue, and hopefully the lkbs in question are not queued on the lkb_astqueue when they do. So I stand by my comments in comment #7.
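For reference, the "check_integrity" instrumentation mentioned above is essentially a magic-number assertion; a minimal sketch (the gdlm_magic field is the instrumentation field visible in the assertion message in comment #8; GDLM_MAGIC's value and the helper's exact shape are assumptions made for this illustration):

#include <linux/kernel.h>

#define GDLM_MAGIC 0x676c636b   /* arbitrary value for the sketch */

/* Sketch only: verify a gdlm_lock still carries its instrumentation magic. */
static void gdlm_check_integrity(struct gdlm_lock *lp, const char *caller)
{
        if (lp->gdlm_magic != GDLM_MAGIC) {
                printk(KERN_ERR "lock_dlm: %s: gdlm_lock %p hosed\n",
                       caller, lp);
                BUG();
        }
}

With the 0x66 poisoning in place, any freed gdlm_lock handed to gdlm_bast fails this check immediately, which is exactly what the failure shows.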
Then it is very much starting to sound like you've simply run into the same dlm bug as we had in RHEL6 (now fixed) with reordering, and which has yet to be back-ported to RHEL5. Perhaps we should have been a bit more proactive in pushing for that backport :( The two different locks are due to different locking schemes according to whether the (b)ast is being delivered to userland or to the kernel. I think the locking is ok on the kernel side, but I've not been through the userland side of things in any detail.
The cast/bast ordering problems were addressed in RHEL6 bug 568102. And Steve is correct: the proc->asts_spin locking is only used for userland locks, and wouldn't be a factor for gfs locks.
Created attachment 423982 [details]
Latest patch from 14 June 2010

I figured out what the problem was, at least the one I was stuck on. The problem was related to the fact that the new function, gfs2_process_unlinked_inode, was using a TRY lock to try to hold the i_open and i_gl glocks for an unlinked dinode to be reclaimed. In the failing case, the try lock was succeeding for the i_open glock but failing for the i_gl glock. The error code path ended up calling function gfs2_glock_put twice for the i_open glock: once for the normal put and once for the error-exit put. This is still the case (and can be seen) in today's function gfs2_inode_lookup, but since it doesn't use try locks, the glock is a lot less likely to error out. We may still want to fix it.

So here is my latest patch, which includes a better error path for gfs2_process_unlinked_inode (gfs2_inode_lookup still has the issue; we may want to change it, and I'd like to discuss that with Steve Whitehouse). This patch doesn't include some of my other code changes, for example the one to protect the i_open glock from being reinitialized by a failed glock_nq_init, but it does have some of my changes. I'd like Steve and possibly others to review the patch to see which changes they deem unwanted/unnecessary.

I've run this version a few times without error, and it should have failed if the problem was still there. I did, however, get other errors, which are known problems. Another complication is that one of my nodes reported hardware errors: it looks like I may have a bad hard disk.
Yes, I think we should fix gfs2_inode_lookup as well. Anyway, that sounds good and I can't see any issues with the patch. If it is working ok, then let's go with that and get this one posted.
*** Bug 582359 has been marked as a duplicate of this bug. ***
Created attachment 429927 [details]
Latest patch from 06 Jul 2010

This patch fixes three bugs having to do with the reclaiming of unlinked dinodes. The three were discovered during RHEL6 testing and reported under these bugzilla numbers:

bug #570182 - GFS2: glock held by ended process
bug #583737 - GFS2: stuck in inode wait, no glocks stuck
bug #604244 - GFS2: kernel NULL pointer dereference from dlm_astd

This patch is a combined version of three patches that already appear upstream in the -nmw git repository and in the RHEL6 kernel. Due to the serious nature of the problems, the patches were back-ported to RHEL5 for inclusion in the RHEL5.6 kernel under bugzilla bug #595397.

The first bug, #570182, was caused by GFS2 not following its documented lock ordering rules while reclaiming unlinked dinodes. The fix was to split the gfs2_inode_lookup function into a separate function just for reclaim: gfs2_process_unlinked_inode. The new function follows the lock ordering rules properly by doing two things: first, by using a "try" lock during the lookup; second, by returning the dinode block number to a higher-level function that can deal with the inode outside the locks that were causing the deadlock (rather than trying to do the locking while the other locks were held).

The second bug, #583737, was caused by a problem with the patch for the first bug, #570182, which is why this is a combined patch. Basically, the timing window was nearly closed by the first patch, but there was still a case where it could hang.

The third bug, #604244, was due to a problem with the second patch. Basically, if the "try" lock failed due to lock contention, the glock was improperly released twice, which caused dlm_astd to reference a pointer to freed storage. Now we set the freed pointer to NULL so that it's not unlocked a second time.

This patch was tested and found to be correct on the roth-0{1,2,3} cluster using the brawl test. I posted it to rhkernel-list for inclusion in the 5.6 kernel.
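For reference, the third fix boils down to the standard guard against a double release: once the error path has dropped the iopen holder, the pointer it would otherwise be released through again is cleared, so the later teardown becomes a no-op. Roughly this pattern (a paraphrase, not the literal patch; the attached patch is authoritative about exactly which pointer is cleared and where):

        if (error) {                                    /* the "try" lock failed */
                gfs2_glock_dq_uninit(&ip->i_iopen_gh);  /* release the holder once */
                ip->i_iopen_gh.gh_gl = NULL;            /* mark it as already released */
                goto out;
        }
        ...
        /* later teardown only releases a holder whose glock pointer is still set */
        if (ip->i_iopen_gh.gh_gl)
                gfs2_glock_dq_uninit(&ip->i_iopen_gh);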
Is this patch available for testing in RHEL 5.5 x86_64? If so, how may I obtain it?
I put a compiled x86_64 version for RHEL5.x on my people page if you want to try it. This has not undergone our rigorous file system testing process, so there are absolutely no guarantees: http://people.redhat.com/rpeterso/Experimental/RHEL5.x/gfs2/gfs2.ko
in kernel-2.6.18-206.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.
Adding James to the cc list. There's really no reason this bz needs to be private; I think maybe that's the default. James: comment #15 should be of particular interest to you.
*** Bug 645563 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0017.html