Bug 252191
Summary: | GFS2: More problems unstuffing journaled files | |
---|---|---|---
Product: | Red Hat Enterprise Linux 5 | Reporter: | Robert Peterson <rpeterso>
Component: | kernel | Assignee: | Don Zickus <dzickus>
Status: | CLOSED ERRATA | QA Contact: | GFS Bugs <gfs-bugs>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 5.0 | CC: | adas, bmarzins, djansa, lwang, nobody+wcheng, rkenna, rpeterso, swhiteho
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | RHBA-2007-0959 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2007-11-07 19:59:16 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 250772 | |
Attachments: | | |
Description
Robert Peterson
2007-08-14 17:16:18 UTC
Here's what I know so far: this problem recreates every time I run this test on trin-10, regardless of whether or not I'm running with the latest few patches. I'm guessing Steve wasn't able to recreate it because his machine has more memory.

For the record, to reproduce this problem I paste this into a shell:

service cman start
service clvmd start
mkfs.gfs2 -O -t bob_cluster2:test_gfs -p lock_dlm -j 3 /dev/trin_vg/hell
mount -tgfs2 /dev/trin_vg/hell /mnt/hell
cd /mnt/hell
mkdir foo
cd foo
touch foo
chattr +j foo
cat /home/devel/gfs2-2.6.git.tgz >foo
cd /home/devel/gfs2-2.6.git/
umount /mnt/hell
service clvmd stop
service cman stop
rmmod lock_dlm
rmmod gfs2

The sequence of events looks kind of like this:

1. Approximately 32600 pages are pinned by gfs2_pin().
2. Approximately 32600 pages go through gfs2_unpin() but there's no bd_ail list, so it doesn't do brelse() on them. Instead, it moves them to the jurisdiction of the glock with b_count==1. I did have an indication that 5 buffers had bd->bd_ail and were therefore brelse'd during gfs2_unpin.
3. Function log_pull_tail gets called.
4. Function gfs2_ail2_empty_one gets called, doing brelse and therefore setting b_count = 0.
5. It pins another 24000 pages.
6. Eventually meta_go_sync() gets called for the exclusive glock, but only twice, and the GLF_DIRTY flag wasn't set for the glock, so it didn't do anything.
7. Eventually function gfs2_ail_empty_gl gets called but only finds two bh's in the gl_ail_list and it brelse's them, changing b_count from 1 to 0.

However, when the problem occurs, gfs2_releasepage is definitely sitting on a bh that has bh->b_count == 1, and it seems to be in the middle of the data, at least on disk. It seems like these formerly pinned pages that were transferred to the glock should have been marked dirty so that they would be processed and released.
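The reference-count behaviour in the steps above can be sketched with a toy model. This is not kernel code: `toy_bh`, `toy_pin`, `toy_unpin`, `toy_brelse` and `toy_can_release` are hypothetical names, and the model assumes that pinning holds one reference and that a buffer can only be released once its count is back to zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct buffer_head: only the reference count and an
 * "is on an AIL list" flag are modelled. Not the kernel definition. */
struct toy_bh {
    int  b_count;  /* references currently held on the buffer */
    bool on_ail;   /* models a non-NULL bd->bd_ail */
};

/* Drop one reference, in the way brelse() does. */
static void toy_brelse(struct toy_bh *bh)
{
    assert(bh->b_count > 0);
    bh->b_count--;
}

/* Assumption: pinning holds one reference on the buffer. */
static void toy_pin(struct toy_bh *bh)
{
    bh->b_count++;
}

/* Unpin as described in step 2: the reference is dropped only when the
 * buffer is on an AIL list. Otherwise the buffer is handed over with
 * its count still raised. */
static void toy_unpin(struct toy_bh *bh)
{
    if (bh->on_ail)
        toy_brelse(bh);
}

/* Assumption: a buffer can be released only when nobody holds it. */
static bool toy_can_release(const struct toy_bh *bh)
{
    return bh->b_count == 0;
}
```

In this model a buffer that is pinned and then unpinned while `on_ail` is false keeps `b_count == 1`, so `toy_can_release()` stays false until something else calls `toy_brelse()`, which mirrors the stuck buffer described above.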
It also seems like the buffers should be released sooner than waiting for a meta_go_sync or an inode_go_sync, but I'm not sure about that yet.

I have an altered gfs2_releasepage() which no longer blocks. As a result I see a bunch of gfs2_bufdata which are "stuck" after the umount. If I repeat the test, but not using journaled data, then I don't see any "stuck" gfs2_bufdata. So the problem appears to relate only to flushing journaled data on umount rather than flushing during normal operation (since the amount that's left is a lot less than the total number of blocks written), which probably means it's relatively easy to fix.

Created attachment 161347 [details]
My current patch
This is what I'm using at the moment for testing. Does this solve the blocking
problem for you? If so do you see any of the asserts in log_shutdown() ?
Created attachment 161356 [details]
The updated patch
Sorry, I realised that I'd attached the wrong version of this.
This patch from comment #4 seems to fix the problem and I can umount without any asserts or problems in log_shutdown. I'll run all my "Hell" test cases with it and see how it fares.

Do check cat /proc/slabinfo | grep gfs2 and see if all the bufdata's drain away a few moments after umount. If not, then we are both seeing the same thing now. We just need to work out what extra flush needs to occur and where.

The patch passes all the "Hell" test cases. I don't have Abhi's patch yet for unstuffing the quota file, but that should be re-tested as well. My kernel seems to have been built with SLUB, not SLAB, so I don't have /proc/slabinfo. I'll recompile my kernel with SLAB and re-do.

With SLAB enabled in the kernel, I do see the numbers in /proc/slabinfo go down to zero after everything is done. Then I can remove the gfs2 module safely. However, if I immediately try to do the umount (as is the case where I paste the test into the command line) I get this error:

slab error in kmem_cache_destroy(): cache `gfs2_bufdata': Can't free all objects
 [<c015e03d>] kmem_cache_destroy+0x84/0xc5
 [<e02c55cf>] exit_gfs2_fs+0x28/0x41 [gfs2]
 [<c014171c>] sys_delete_module+0x1a0/0x1c8
 [<c0153433>] remove_vma+0x36/0x3b
 [<c0104e1e>] sysenter_past_esp+0x5f/0x85
 [<c0410000>] xdr_partial_copy_from_skb+0x128/0x171
 =======================
BUG: unable to handle kernel paging request at virtual address e02c7876
 printing eip:
c021d581
*pde = 1f5cc067
*pte = 00000000
Oops: 0000 [#1]
SMP
Modules linked in: dlm configfs qla2xxx
CPU: 0
EIP: 0060:[<c021d581>] Not tainted VLI
EFLAGS: 00010297 (2.6.23-rc2 #2)
EIP is at strnlen+0x6/0x15
eax: e02c7876 ebx: e02c7876 ecx: e02c7876 edx: fffffffe
esi: d56f40cc edi: d7b49ee4 ebp: 00000011 esp: d7b49e78
ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068
Process cat (pid: 3927, ti=d7b48000 task=d5f52030 task.ti=d7b48000)
Stack: c021cce0 c150bc64 00000000 c0148255 c014a0a7 00000001 00000044 00000f34
       d56f40cc 0050bd08 d6e05588 d56f5000 ffffffff 00000010 c04dc174 d56f40cc
       ddcf0380 d6da8b80 00000000 c0175baf d7b49ee4 d7b49ee4 00005c04 e02c7876
Call Trace:
 [<c021cce0>] vsnprintf+0x2af/0x47c
 [<c0148255>] filemap_fault+0x217/0x373
 [<c014a0a7>] get_page_from_freelist+0x250/0x2dd
 [<c0175baf>] seq_printf+0x2e/0x4b
 [<c015edd3>] s_show+0x1b9/0x22f
 [<c017603e>] seq_read+0xe7/0x271
 [<c0175f57>] seq_read+0x0/0x271
 [<c0189d9e>] proc_reg_read+0x5c/0x6f
 [<c0189d42>] proc_reg_read+0x0/0x6f
 [<c0161744>] vfs_read+0x88/0x10a
 [<c0161b3f>] sys_read+0x41/0x67
 [<c0104e1e>] sysenter_past_esp+0x5f/0x85
 [<c0410000>] xdr_partial_copy_from_skb+0x128/0x171
 =======================
Code: c9 74 0c f2 ae 74 05 bf 01 00 00 00 4f 89 fa 5f 89 d0 c3 85 c9 57 89 c7 89 d0 74 05 f2 ae 75 01 4f 89 f8 5f c3 89 c1 89 c8 eb 06 <80> 38 00 74 07 40 4a 83 fa ff 75 f4 29 c8 c3 57 83 c9 ff 56 89
EIP: [<c021d581>] strnlen+0x6/0x15 SS:ESP 0068:d7b49e78

Perhaps gfs2_log_shutdown should wait for the buffers? Or a call from exit_gfs2_fs at the very least.

Created attachment 161370 [details]
Latest patch
See if this fixes things for you? It works for me.
If this works, then we need to rerun all the tests including revolver & ddio
etc. on it. I want to be 100% sure that we've not broken anything else along
the way this time.
Created attachment 161376 [details]
Latest patch
This fixes the ->bd_ail problem.
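As the next comment reports, the bd->bd_ail test also has to cope with a buffer that has no gfs2_bufdata attached at all. A minimal sketch of that guard follows; the `guard_*` types and `guard_unpin()` are hypothetical stand-ins for illustration, not the structures or code in the attached patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; only the fields needed for the check exist. */
struct guard_ail     { int id; };
struct guard_bufdata { struct guard_ail *bd_ail; };
struct guard_bh      { int b_count; struct guard_bufdata *bd; };

/* Drop the buffer's reference only when it has bufdata AND that bufdata
 * is on an AIL list. A NULL bd is treated as "not on any list" instead
 * of being dereferenced. Returns true when a reference was dropped. */
static bool guard_unpin(struct guard_bh *bh)
{
    assert(bh != NULL);
    struct guard_bufdata *bd = bh->bd;

    if (bd && bd->bd_ail) {
        bh->b_count--;
        return true;
    }
    return false;
}
```

Because `&&` short-circuits, `bd->bd_ail` is never evaluated when `bd` is NULL, which is the whole point of adding "bd &&" to the condition.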
This version of the patch from comment #10 works properly if we add "bd &&" to the "if (bd->bd_ail)" clause. In other words:

if (bd && bd->bd_ail)

With this change, the code passes all six "hell" tests. I'm running revolver on the RHEL5 version now and it's up to iteration 3.3.

That sounds promising. I will have another change for this patch by tomorrow with a bit of luck. We need an extra flush in inode_go_sync() for journaled files in order to fix 252392 and I need to look at that very carefully to work out exactly in which order that extra flush should occur.

Revolver is still running on RHEL5 with this patch. It's currently on iteration 18.2 with no problems.

Created attachment 161639 [details]
Another updated patch
This includes the fix for NULL bufdata's and also the fix which I expect will solve bz #252392. I've tested the latter here (I can't reproduce the exact bug, but I was able to reproduce a problem which goes away with this patch applied) so I think this is the right fix.

I had actually spotted the problem earlier when looking into this bug but hadn't been able to work out why the fix apparently made the problem worse rather than better. Now that I know the real answer to this bug, all has become clear and I can see why this is required and why it appeared to have the wrong effect before.

The problem was related to the ordering of flushing when dropping or otherwise sync'ing a glock. In the journaled case we must flush the data blocks _after_ the journal flush (the data blocks will not be marked dirty until the journal flush has taken place); in the writeback & ordered cases we must flush the data blocks _before_ the journal flush (we want them stable on disk before we write metadata changes to the journal).

Providing this version passes the tests, I'd hope that we can declare this patch the final version.

The RHEL5.1 patch was tested on the roth-0{1,2,3} cluster. I ran all six "hell" tests against it, plus Dean's scenario for bug #252392, and all of them were successful. This was done by loading the kernel-2.6.18-40.el5, applying the patch, compiling, rebooting, and running the tests. Steve put the patch into the upstream git tree, and I posted the RHEL5.1 patch to rhkernel-list. So I'm changing status to POST and transferring to Don Zickus.

*** Bug 252392 has been marked as a duplicate of this bug. ***

in 2.6.18-42.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html
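The flush-ordering rule explained earlier in this thread (journaled data is flushed after the journal, writeback and ordered data before it) can be written down as a small decision function. This is only a sketch of the rule as stated in the comments; the enum names and `sketch_sync_order()` are hypothetical and do not correspond to identifiers in the actual patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the three data modes mentioned in the thread. */
enum sketch_data_mode { SKETCH_JOURNALED, SKETCH_ORDERED, SKETCH_WRITEBACK };

/* The two flush steps whose relative order matters. */
enum sketch_step { SKETCH_FLUSH_JOURNAL, SKETCH_FLUSH_DATA };

/* Fill order[0] and order[1] with the two steps in the order they
 * should run when a glock is dropped or otherwise synced. */
static void sketch_sync_order(enum sketch_data_mode mode,
                              enum sketch_step order[2])
{
    assert(order != NULL);

    if (mode == SKETCH_JOURNALED) {
        /* Data blocks are not marked dirty until the journal flush
         * has taken place, so the journal goes first. */
        order[0] = SKETCH_FLUSH_JOURNAL;
        order[1] = SKETCH_FLUSH_DATA;
    } else {
        /* Writeback and ordered: data must be stable on disk before
         * metadata changes are written to the journal. */
        order[0] = SKETCH_FLUSH_DATA;
        order[1] = SKETCH_FLUSH_JOURNAL;
    }
}
```

Getting this order backwards for journaled files would run the data flush before any blocks are marked dirty, which is consistent with the earlier observation that the fix first appeared to make things worse.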