+++ This bug was initially created as a clone of Bug #250772 +++

Description of problem:

We removed some kludges from the code and now the problem is back where unstuffing a journaled inode causes the kernel to panic. Part of the removed kludging looked like:

        if (!bh) {
                list_move(&bd->bd_ail_st_list, &ai->ai_ail2_list);
                continue;
        }

in functions gfs2_ail1_start_one and gfs2_ail1_empty_one. It was removed as part of the recent fix for bug 248176. As an experiment, I tried adding this back in, but it only pushed the problem further out and it failed later on. I discussed this at length with Steve Whitehouse and he agrees that this was a kludge and was rightfully removed: a NULL bh should not be on the ail lists to begin with.

I tried some experiments at Steve's suggestion, all of which failed, but they told us more about the problem. After Steve left for the day, I did a bunch of experiments and debugging to find out exactly where these NULL entries are coming from. The sequence of events seems to be this: as part of the write operation, gfs2_invalidatepage gets called when it's done. gfs2_invalidatepage calls discard_buffer, which sets:

        bh->b_private = NULL;

But that buffer header hasn't been flushed out to disk yet. So when log_flush gets called and the bh gets passed down the active items lists, that's where we encounter a problem.

I checked the source code in git, and this call to discard_buffer and its zeroing out of b_private have been in there since day one. I tried making discard_buffer not do that, but of course, that also failed, most likely because discard_buffer still does a bunch of resetting of the bits in the bh. It fails in gfs2_ail2_empty_one on the statement:

        bh_ip = GFS2_I(bd->bd_bh->b_page->mapping->host);

I think that perhaps we shouldn't call discard_buffer at all until the log/lops functions (the gfs2_log_flush process) have finished with the buffers/buffer descriptors. Having the log code reference these pages and bds, especially after discard_buffer calls kmem_cache_free on the bd, is asking for corruption and kernel panics. Perhaps Steve has some more thoughts on how to do this properly.
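For illustration, the failing interaction described above boils down to something like this. This is a simplified sketch, not the actual source; the function shape is an approximation of the code paths named in the description:

        /* discard_buffer() severs the bh from its gfs2_bufdata and frees
         * the bd, but the bd can still be queued on an AIL list: */
        static void discard_buffer(struct gfs2_sbd *sdp, struct buffer_head *bh)
        {
                struct gfs2_bufdata *bd = bh->b_private;

                bh->b_private = NULL;   /* the bh no longer points at the bd... */
                if (bd)
                        kmem_cache_free(gfs2_bufdata_cachep, bd);
                                        /* ...but the bd may still be linked on
                                           an ail list via bd_ail_st_list */
                /* plus the clearing of various bh state bits, omitted here */
        }

        /* Later, the log flush walks the AIL and dereferences the freed bd,
         * which is where gfs2_ail2_empty_one() blows up: */
        bh_ip = GFS2_I(bd->bd_bh->b_page->mapping->host);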
-- Additional comment from rpeterso on 2007-08-13 13:06 EST --

Created an attachment (id=161189)
Patch to fix the unstuff problem

Abhi, can you test this patch with your quota unstuff patch? This is half-Steve and half-me to fix these problems.

-- Additional comment from adas on 2007-08-13 18:42 EST --

Created an attachment (id=161236)
Patch to forcibly unstuff the quota inode in gfs2_adjust_quota()

Bob, I tried this unstuffing patch of mine on top of your patch in the previous comment and things seem to work fine. Steve, can you look through this patch and tell me if it looks good?

-- Additional comment from rpeterso on 2007-08-13 18:56 EST --

I've been doing some more testing on the above patch. I found another test that fails, even with this patch. This used to work a month or so ago. It's test #4 listed here:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=243899#c12

It basically hits the stuck_releasepage code in ops_address.c, which tells me that somewhere a brelse on a buffer head was forgotten. Either that, or someone did an extra bh get. It gives the following values:

GFS2: fsid=bob_cluster2:test_gfs.0: stuck in gfs2_releasepage() d6c79ca0
GFS2: fsid=bob_cluster2:test_gfs.0: blkno = 147019, bh->b_count = 1
GFS2: fsid=bob_cluster2:test_gfs.0: pinned = 0
GFS2: fsid=bob_cluster2:test_gfs.0: bh->b_private = !NULL
GFS2: fsid=bob_cluster2:test_gfs.0: gl = (2, 99316)
GFS2: fsid=bob_cluster2:test_gfs.0: bd_list_tr = no, bd_le.le_list = no
GFS2: fsid=bob_cluster2:test_gfs.0: ip = 22 99316
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[0] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[1] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[2] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[3] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[4] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[5] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[6] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[7] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[8] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[9] = NULL

BTW, that block didn't look interesting, and it's not in an RG or anything: just binary data in the middle (174MB) of the file, and it's not always the same block. I tried backing off my patch and had the same problem. I tried some older versions of the code, but they all have various problems doing this same test. Since the buffer isn't pinned, the pins must match the unpins, so that's apparently not it. Since the i_cache entries are all NULL, that tells me that gfs2_meta_cache_flush has already run and done its brelse's as well, so that's apparently not it either. I did notice in gfs2_unpin that it will only do the brelse if (bd->bd_ail), so perhaps we have a bd_ail management issue? I'm still investigating. Perhaps Steve can try this on his system, since he invented the test.

-- Additional comment from swhiteho on 2007-08-14 04:18 EST --

Abhi, your patch from comment #13 looks ok to me.

-- Additional comment from swhiteho on 2007-08-14 05:47 EST --

Bob, I tried the test with just my patch and the change to gfs2_ail2_empty_one() and it worked ok for me. Perhaps that's a clue as to where the problem lies? To be honest, I've never really understood why it is that we wait in gfs2_releasepage() at all. There is a comment in the VFS code which indicates that there is a plan to make releasepage non-blocking in the future anyway. ext3 doesn't block in its releasepage, so if that becomes a problem, then I think we should adopt the same solution.
-- Additional comment from rpeterso on 2007-08-14 10:38 EST --

I added a call to dump_stack() when the failure is detected and got this:

GFS2: fsid=bob_cluster2:test_gfs.0: stuck in gfs2_releasepage() c3d1eca0
 [<e02b9763>] gfs2_releasepage+0xef/0x3b1 [gfs2]
 [<e02b9674>] gfs2_releasepage+0x0/0x3b1 [gfs2]
 [<c0147e3f>] try_to_release_page+0x30/0x42
 [<c014d6ed>] shrink_inactive_list+0x4f9/0x7c7
 [<c01e49c0>] nfs_access_cache_shrinker+0x20/0x182
 [<c014da85>] shrink_zone+0xca/0xef
 [<c014ded7>] kswapd+0x288/0x405
 [<c0133e5d>] autoremove_wake_function+0x0/0x35
 [<c011dd60>] complete+0x39/0x48
 [<c014dc4f>] kswapd+0x0/0x405
 [<c0133d97>] kthread+0x38/0x5d
 [<c0133d5f>] kthread+0x0/0x5d
 [<c0105a6f>] kernel_thread_helper+0x7/0x10
 =======================
GFS2: fsid=bob_cluster2:test_gfs.0: blkno = 146511, bh->b_count = 1
GFS2: fsid=bob_cluster2:test_gfs.0: pinned = 1
GFS2: fsid=bob_cluster2:test_gfs.0: bh->b_private = !NULL
GFS2: fsid=bob_cluster2:test_gfs.0: gl = (2, 99316)
GFS2: fsid=bob_cluster2:test_gfs.0: bd_list_tr = no, bd_le.le_list = yes
GFS2: fsid=bob_cluster2:test_gfs.0: ip = 22 99316
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[0] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[1] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[2] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[3] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[4] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[5] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[6] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[7] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[8] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[9] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: stuck in gfs2_releasepage() c3d1eca0
 [<e02b9763>] gfs2_releasepage+0xef/0x3b1 [gfs2]
 [<e02b9674>] gfs2_releasepage+0x0/0x3b1 [gfs2]
 [<c0147e3f>] try_to_release_page+0x30/0x42
 [<c014d6ed>] shrink_inactive_list+0x4f9/0x7c7
 [<c041f1e2>] __sched_text_start+0x53a/0x5c9
 [<c012bb1e>] lock_timer_base+0x19/0x35
 [<c014da85>] shrink_zone+0xca/0xef
 [<c014e3f8>] try_to_free_pages+0x11d/0x1f1
 [<c014a2cf>] __alloc_pages+0x18f/0x28b
 [<c014732f>] generic_file_buffered_write+0x1b3/0x5d2
 [<c041f985>] __wait_on_bit_lock+0x4b/0x52
 [<c0147bd6>] __generic_file_aio_write_nolock+0x488/0x4e7
 [<c0147c87>] generic_file_aio_write+0x52/0xb0
 [<c01622f9>] do_sync_write+0xc7/0x10a
 [<c0128852>] tasklet_action+0x46/0x90
 [<c0133e5d>] autoremove_wake_function+0x0/0x35
 [<c012872d>] irq_exit+0x53/0x6b
 [<c011896e>] smp_apic_timer_interrupt+0x74/0x80
 [<c01058ec>] apic_timer_interrupt+0x28/0x30
 [<c0162232>] do_sync_write+0x0/0x10a
 [<c0162a72>] vfs_write+0x8a/0x10c
 [<c0162fde>] sys_write+0x41/0x67
 [<c0104e1e>] sysenter_past_esp+0x5f/0x85
 =======================
GFS2: fsid=bob_cluster2:test_gfs.0: blkno = 146535, bh->b_count = 1
GFS2: fsid=bob_cluster2:test_gfs.0: pinned = 1
GFS2: fsid=bob_cluster2:test_gfs.0: bh->b_private = !NULL
GFS2: fsid=bob_cluster2:test_gfs.0: gl = (2, 99316)
GFS2: fsid=bob_cluster2:test_gfs.0: bd_list_tr = no, bd_le.le_list = yes
GFS2: fsid=bob_cluster2:test_gfs.0: ip = 22 99316
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[0] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[1] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[2] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[3] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[4] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[5] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[6] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[7] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[8] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[9] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: stuck in gfs2_releasepage() c3d1eca0
 [<e02b9763>] gfs2_releasepage+0xef/0x3b1 [gfs2]
 [<c011dbcc>] dequeue_entity+0x75/0x92
 [<e02b9674>] gfs2_releasepage+0x0/0x3b1 [gfs2]
 [<c0147e3f>] try_to_release_page+0x30/0x42
 [<c014d6ed>] shrink_inactive_list+0x4f9/0x7c7
 [<c012bb1e>] lock_timer_base+0x19/0x35
 [<c012bb7e>] try_to_del_timer_sync+0x44/0x4a
 [<c041f902>] schedule_timeout+0x79/0x8d
 [<c012b88c>] process_timeout+0x0/0x5
 [<c041f8f4>] schedule_timeout+0x6b/0x8d
 [<c014f0cd>] congestion_wait+0x5b/0x64
 [<c0133e5d>] autoremove_wake_function+0x0/0x35
 [<c014da85>] shrink_zone+0xca/0xef
 [<c014e3f8>] try_to_free_pages+0x11d/0x1f1
 [<c014a2cf>] __alloc_pages+0x18f/0x28b
 [<c015f317>] __slab_alloc+0x177/0x4a7
 [<c015f75d>] kmem_cache_alloc+0x3c/0x7c
 [<c017c08f>] alloc_buffer_head+0x10/0x37
 [<c017c08f>] alloc_buffer_head+0x10/0x37
 [<e02b6265>] gfs2_log_fake_buf+0x56/0x10e [gfs2]
 [<e02b79ac>] databuf_lo_before_commit+0x39d/0x509 [gfs2]
 [<e02b6003>] gfs2_log_flush+0x110/0x2eb [gfs2]
 [<e02b5579>] gfs2_ail1_empty+0x2e/0x84 [gfs2]
 [<e02aa63a>] gfs2_logd+0x89/0x13b [gfs2]
 [<c011dd60>] complete+0x39/0x48
 [<e02aa5b1>] gfs2_logd+0x0/0x13b [gfs2]
 [<c0133d97>] kthread+0x38/0x5d
 [<c0133d5f>] kthread+0x0/0x5d
 [<c0105a6f>] kernel_thread_helper+0x7/0x10
 =======================
GFS2: fsid=bob_cluster2:test_gfs.0: blkno = 146543, bh->b_count = 1
GFS2: fsid=bob_cluster2:test_gfs.0: pinned = 1
GFS2: fsid=bob_cluster2:test_gfs.0: bh->b_private = !NULL
GFS2: fsid=bob_cluster2:test_gfs.0: gl = (2, 99316)
GFS2: fsid=bob_cluster2:test_gfs.0: bd_list_tr = no, bd_le.le_list = yes
GFS2: fsid=bob_cluster2:test_gfs.0: ip = 22 99316
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[0] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[1] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[2] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[3] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[4] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[5] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[6] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[7] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[8] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[9] = NULL

I'm not using NFS, so this is apparently just the VFS doing its page maintenance.
Here's what I know so far: this problem definitely recreates every time I try this test on trin-10, and it recreates regardless of whether or not I'm running with the latest few patches. I'm guessing Steve wasn't able to recreate it because his machine has more memory.

For the record, to reproduce this problem I paste this into a shell:

service cman start
service clvmd start
mkfs.gfs2 -O -t bob_cluster2:test_gfs -p lock_dlm -j 3 /dev/trin_vg/hell
mount -tgfs2 /dev/trin_vg/hell /mnt/hell
cd /mnt/hell
mkdir foo
cd foo
touch foo
chattr +j foo
cat /home/devel/gfs2-2.6.git.tgz >foo
cd /home/devel/gfs2-2.6.git/
umount /mnt/hell
service clvmd stop
service cman stop
rmmod lock_dlm
rmmod gfs2

The sequence of events looks kind of like this:

1. Approximately 32600 pages are pinned by gfs2_pin().
2. Approximately 32600 pages go through gfs2_unpin(), but there's no bd_ail list, so it doesn't do brelse() on them. Instead, it moves them to the jurisdiction of the glock with b_count == 1 (sketched in the fragment after this comment). I did have an indication that 5 buffers had bd->bd_ail and were therefore brelse'd during gfs2_unpin.
3. Function log_pull_tail gets called.
4. Function gfs2_ail2_empty_one gets called, doing brelse and therefore setting b_count = 0.
5. It pins another 24000 pages.
6. Eventually meta_go_sync() gets called for the exclusive glock, but only twice, and the GLF_DIRTY flag wasn't set for the glock, so it didn't do anything.
7. Eventually function gfs2_ail_empty_gl gets called, but it only finds two bh's in the gl_ail_list and brelse's them, changing b_count from 1 to 0.

However, when the problem occurs, gfs2_releasepage is definitely sitting on a bh that has bh->b_count == 1, and it appears to be in the middle of the data, at least on disk. It seems like these formerly pinned pages that were transferred to the glock should have been marked dirty so that they would be processed and released. It also seems like the buffers should be released sooner than waiting for a meta_go_sync or an inode_go_sync, but I'm not sure about that yet.
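For reference, step 2 above hinges on a fragment of gfs2_unpin() that looks roughly like this (paraphrased, not the exact source):

        /* Roughly what gfs2_unpin() does when moving a formerly pinned
         * buffer onto the active items list: the brelse() only happens
         * for buffers that already have an AIL association; otherwise
         * the buffer is left under the glock's jurisdiction with
         * b_count still at 1. */
        if (bd->bd_ail) {
                list_del(&bd->bd_ail_st_list);
                brelse(bh);                     /* drops the pin's reference */
        }
        list_add(&bd->bd_ail_st_list, &ai->ai_ail1_list);
        bd->bd_ail = ai;                        /* now owned by the new AIL */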
I have an altered gfs2_releasepage() which no longer blocks. As a result, I see a bunch of gfs2_bufdata which are "stuck" after the umount. If I repeat the test without journaled data, I don't see any "stuck" gfs2_bufdata. So the problem appears only to relate to flushing journaled data on umount rather than flushing during normal operation (since the amount that's left is a lot less than the total number of blocks written), which probably means it's relatively easy to fix.
Created attachment 161347 [details]
My current patch

This is what I'm using at the moment for testing. Does this solve the blocking problem for you? If so, do you see any of the asserts in log_shutdown()?
Created attachment 161356 [details]
The updated patch

Sorry, I realised that I'd attached the wrong version of this.
This patch from comment #4 seems to fix the problem and I can umount without any asserts or problems in log_shutdown. I'll run all my "Hell" test cases with it and see how it fares.
Do check cat /proc/slabinfo | grep gfs2 and see whether all the bufdatas drain away a few moments after umount. If not, then we are both seeing the same thing, and we just need to work out what extra flush needs to occur and where.
The patch passes all the "Hell" test cases. I don't have Abhi's patch yet for unstuffing the quota file, but that should be re-tested as well. My kernel seems to have been built with SLUB, not SLAB, so I don't have /proc/slabinfo. I'll recompile my kernel with SLAB and re-do.
With SLAB enabled in the kernel, I do see the numbers in /proc/slabinfo go down to zero after everything is done. Then I can remove the gfs2 module safely. However, if I immediately try to do the umount (as is the case where I paste the test into the command line) I get this error:

slab error in kmem_cache_destroy(): cache `gfs2_bufdata': Can't free all objects
 [<c015e03d>] kmem_cache_destroy+0x84/0xc5
 [<e02c55cf>] exit_gfs2_fs+0x28/0x41 [gfs2]
 [<c014171c>] sys_delete_module+0x1a0/0x1c8
 [<c0153433>] remove_vma+0x36/0x3b
 [<c0104e1e>] sysenter_past_esp+0x5f/0x85
 [<c0410000>] xdr_partial_copy_from_skb+0x128/0x171
 =======================
BUG: unable to handle kernel paging request at virtual address e02c7876
printing eip: c021d581
*pde = 1f5cc067
*pte = 00000000
Oops: 0000 [#1]
SMP
Modules linked in: dlm configfs qla2xxx
CPU: 0
EIP: 0060:[<c021d581>] Not tainted VLI
EFLAGS: 00010297 (2.6.23-rc2 #2)
EIP is at strnlen+0x6/0x15
eax: e02c7876 ebx: e02c7876 ecx: e02c7876 edx: fffffffe
esi: d56f40cc edi: d7b49ee4 ebp: 00000011 esp: d7b49e78
ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068
Process cat (pid: 3927, ti=d7b48000 task=d5f52030 task.ti=d7b48000)
Stack: c021cce0 c150bc64 00000000 c0148255 c014a0a7 00000001 00000044 00000f34
       d56f40cc 0050bd08 d6e05588 d56f5000 ffffffff 00000010 c04dc174 d56f40cc
       ddcf0380 d6da8b80 00000000 c0175baf d7b49ee4 d7b49ee4 00005c04 e02c7876
Call Trace:
 [<c021cce0>] vsnprintf+0x2af/0x47c
 [<c0148255>] filemap_fault+0x217/0x373
 [<c014a0a7>] get_page_from_freelist+0x250/0x2dd
 [<c0175baf>] seq_printf+0x2e/0x4b
 [<c015edd3>] s_show+0x1b9/0x22f
 [<c017603e>] seq_read+0xe7/0x271
 [<c0175f57>] seq_read+0x0/0x271
 [<c0189d9e>] proc_reg_read+0x5c/0x6f
 [<c0189d42>] proc_reg_read+0x0/0x6f
 [<c0161744>] vfs_read+0x88/0x10a
 [<c0161b3f>] sys_read+0x41/0x67
 [<c0104e1e>] sysenter_past_esp+0x5f/0x85
 [<c0410000>] xdr_partial_copy_from_skb+0x128/0x171
 =======================
Code: c9 74 0c f2 ae 74 05 bf 01 00 00 00 4f 89 fa 5f 89 d0 c3 85 c9 57 89 c7 89 d0 74 05 f2 ae 75 01 4f 89 f8 5f c3 89 c1 89 c8 eb 06 <80> 38 00 74 07 40 4a 83 fa ff 75 f4 29 c8 c3 57 83 c9 ff 56 89
EIP: [<c021d581>] strnlen+0x6/0x15 SS:ESP 0068:d7b49e78

Perhaps gfs2_log_shutdown should wait for the buffers? Or a call from exit_gfs2_fs at the very least.
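One way to implement that suggestion, purely as a sketch (the counter and wait queue here are hypothetical names, not existing GFS2 symbols), would be to have module exit wait until every gfs2_bufdata has actually been freed before tearing the cache down:

        /* Hypothetical: count outstanding bufdata objects so that module
         * exit can block until they have all drained away. */
        static atomic_t gfs2_bufdata_outstanding = ATOMIC_INIT(0);
        static DECLARE_WAIT_QUEUE_HEAD(gfs2_bufdata_wait);

        /* Every kmem_cache_free(gfs2_bufdata_cachep, bd) would then pair with: */
        if (atomic_dec_and_test(&gfs2_bufdata_outstanding))
                wake_up(&gfs2_bufdata_wait);

        /* ...and exit_gfs2_fs() would wait before destroying the cache: */
        wait_event(gfs2_bufdata_wait,
                   atomic_read(&gfs2_bufdata_outstanding) == 0);
        kmem_cache_destroy(gfs2_bufdata_cachep);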
Created attachment 161370 [details]
Latest patch

See if this fixes things for you? It works for me. If this works, then we need to rerun all the tests, including revolver & ddio etc., on it. I want to be 100% sure that we've not broken anything else along the way this time.
Created attachment 161376 [details]
Latest patch

This fixes the ->bd_ail problem.
This version of the patch from comment #10 works properly if we add "bd &&" to the "if (bd->bd_ail)" clause. In other words:

        if (bd && bd->bd_ail)

With this change, the code passes all six "hell" tests. I'm running revolver on the RHEL5 version now and it's up to iteration 3.3.
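In context, the guarded clause would look something like this (an illustrative fragment only; bd can legitimately be NULL here because discard_buffer() clears bh->b_private, as described in the original report):

        struct gfs2_bufdata *bd = bh->b_private;

        /* Test bd before dereferencing it: a NULL b_private just means
         * the buffer was already discarded, so there is no AIL entry
         * to remove. */
        if (bd && bd->bd_ail) {
                list_del(&bd->bd_ail_st_list);
                brelse(bh);
        }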
That sounds promising. I will have another change for this patch by tomorrow with a bit of luck. We need an extra flush in inode_go_sync() for journaled files in order to fix 252392 and I need to look at that very carefully to work out exactly in which order that extra flush should occur.
Revolver is still running on RHEL5 with this patch. It's currently on iteration 18.2 with no problems.
Created attachment 161639 [details]
Another updated patch

This includes the fix for NULL bufdatas and also the fix which I expect will solve bz #252392. I've tested the latter here (I can't reproduce the exact bug, but I was able to reproduce a problem which goes away with this patch applied), so I think this is the right fix.

I had actually spotted the problem earlier when looking into this bug, but hadn't been able to work out why the fix apparently made the problem worse rather than better. Now that I know the real answer to this bug, all has become clear, and I can see why this is required and why it appeared to have the wrong effect before.

The problem was related to the ordering of flushing when dropping or otherwise syncing a glock. In the journaled case we must flush the data blocks _after_ the journal flush (the data blocks will not be marked dirty until the journal flush has taken place); in the writeback and ordered cases we must flush the data blocks _before_ the journal flush (we want them stable on disk before we write metadata changes to the journal). Provided this version passes the tests, I'd hope that we can declare this patch the final version.
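The ordering rule described in the comment above can be summarised in a sketch like this (illustrative only, not the patch itself; it assumes the flush happens somewhere in the glock sync path with the inode ip and superblock sdp in scope):

        /* For journaled data, pages only become dirty once the log has
         * been flushed, so the data writeback must come second. For
         * ordered or writeback data, the pages must be stable on disk
         * before metadata changes hit the journal, so the data
         * writeback comes first. */
        if (gfs2_is_jdata(ip)) {
                gfs2_log_flush(sdp, ip->i_gl);              /* journal first */
                filemap_fdatawrite(ip->i_inode.i_mapping);  /* then data */
        } else {
                filemap_fdatawrite(ip->i_inode.i_mapping);  /* data first */
                gfs2_log_flush(sdp, ip->i_gl);              /* then journal */
        }
        filemap_fdatawait(ip->i_inode.i_mapping);           /* wait either way */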
The RHEL5.1 patch was tested on the roth-0{1,2,3} cluster. I ran all six "hell" tests against it, plus Dean's scenario for bug #252392 and all of them were successful. This was done by loading the kernel-2.6.18-40.el5, applying the patch, compiling, rebooting, and running the tests. Steve put the patch into the upstream git tree, and I posted the RHEL5.1 patch to rhkernel-list. So I'm changing status to POST and transferring to Don Zickus.
*** Bug 252392 has been marked as a duplicate of this bug. ***
in 2.6.18-42.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0959.html