Bug 516025
| Summary: | GFS2: couldn't free all objects in gfs2_bufdata cache on unload | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Nate Straz <nstraz> |
| Component: | kernel | Assignee: | Robert Peterson <rpeterso> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | medium | Priority: | low |
| Version: | 5.3 | CC: | adas, bmarzins, chen1537, jharriga, lmiksik, nstraz, rpeterso, rwheeler, swhiteho |
| Target Milestone: | rc | Keywords: | Reopened |
| Target Release: | --- | Hardware: | All |
| OS: | Linux | Doc Type: | Bug Fix |
| Fixed In Version: | kernel-2.6.18-397.el5 | Last Closed: | 2014-09-16 00:32:16 UTC |
Description (Nate Straz, 2009-08-06 12:52:21 UTC)
The gfs2_bufdata slab cache is in the gfs2 module, not the lock_dlm module. I wonder if this is another consequence of the bug which Fabio reported on the upstream kernel. It looks like we might be allowing the module to unload too early, although it's odd that it's the bufdata slab cache rather than the glock slab cache that has appeared. The glock issue is fixed upstream and in RHEL6; it doesn't directly translate to RHEL5 due to the lock_dlm module getting in the way. We may need to implement some other kind of waiting before the module unloads, since the upstream solution to the problem is far too invasive for RHEL5. I suspect that the same issue lies at the root of it, though. Putting into needinfo until it can be reproduced.

Since this does not appear to have occurred again and we have no known reproducer, I'm closing this for the time being. It has been in needinfo for some time and it was originally reported in 2009, so if it hasn't occurred again in the meantime, it doesn't seem to be something that happens very often, at least. Please reopen if it happens again.

I hit this again while running regressions on 5.9 with kernel-2.6.18-336.el5:

    slab error in kmem_cache_destroy(): cache `gfs2_bufdata': Can't free all objects
    Call Trace:
     [<ffffffff800e0f70>] kmem_cache_destroy+0x80/0x179
     [<ffffffff8867ee1e>] :gfs2:exit_gfs2_fs+0x56/0x74
     [<ffffffff800a9c24>] sys_delete_module+0x19f/0x1d4
     [<ffffffff8005d29e>] tracesys+0xd5/0xdf
    SLAB: cache with size 96 has lost its name

What kind of info would be useful if I hit this again?

The only slab we have that's 96 bytes long is gfs2_bufdata, so there must have been a buffer that was not written or something. Was there more info in dmesg?

Nate, I opened bug #1024024 before I saw this one out there. We've got a somewhat reliable way to recreate/debug it on RHEL6. Since there's a recent push to close RHEL5 bugs and minimize our fixes to RHEL5, can we close this bug and just work on the problem in the name of bug #1024024? Or should we close that one as a duplicate of this one? This one has apparently sat dormant for more than a year, but I'm now actively looking into the problem from a RHEL6 standpoint.

Since there hasn't been any new information on this problem for well over a year, we could probably close it again. To the best of my knowledge, the QE people have not hit it again. However, I am able to recreate the issue in RHEL6 and I'm debugging it in the name of bug #1024024 as noted above. When that bz gets solved, we can decide whether to back-port the patch to RHEL5. It seems rare enough that customers aren't hitting it. In the meantime, I'm reassigning it to myself so the issue is all in one place.

Now that bug #1024024 is solved and I've got two patches that fix the problem, I have determined that the bug exists in RHEL5 and that the patches can easily be ported back. Therefore I'm going to proceed with that plan. Requesting flags for 5.11.

Created attachment 838395 [details]
RHEL5 patch to fix the problem
This patch combines the two pertinent upstream patches.
The attached "combined" patch was posted to rhkernel-list for
inclusion into RHEL5.11. Changing status to POST.
It was tested on my gfs-i16c-0{1,2}.mpc.lab.eng.bos.redhat.com
(aka hlab{1,2}) cluster.
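As background for the failure signature in this bug: kmem_cache_destroy() refuses to tear down a slab cache that still contains live objects, which is exactly what a leaked gfs2_bufdata causes at module unload. Below is a minimal sketch of that pattern as a hypothetical demo module; the names are illustrative, not the gfs2 code, and it uses the modern five-argument kmem_cache_create() signature (the 2.6.18-era kernel also took ctor/dtor arguments):

```c
#include <linux/module.h>
#include <linux/slab.h>

static struct kmem_cache *demo_cachep;
static void *leaked_obj;	/* never freed: simulates the stranded bd */

static int __init demo_init(void)
{
	/* 96 bytes, the same object size as gfs2_bufdata in the report */
	demo_cachep = kmem_cache_create("demo_bufdata", 96, 0, 0, NULL);
	if (!demo_cachep)
		return -ENOMEM;
	leaked_obj = kmem_cache_alloc(demo_cachep, GFP_KERNEL);
	return leaked_obj ? 0 : -ENOMEM;
}

static void __exit demo_exit(void)
{
	/* One object is still allocated, so on a 2.6.18-era kernel this
	 * logs "slab error in kmem_cache_destroy(): cache `demo_bufdata':
	 * Can't free all objects" -- the same complaint exit_gfs2_fs()
	 * triggers in this bug. */
	kmem_cache_destroy(demo_cachep);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
```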
I hit this again while running on kernel-2.6.18-389.el5. It happened after running through the brawl load with the "sync" mount option.

    slab error in kmem_cache_destroy(): cache `gfs2_bufdata': Can't free all objects
    Call Trace:
     [<ffffffff800e177f>] kmem_cache_destroy+0x80/0x179
     [<ffffffff887a7f22>] :gfs2:exit_gfs2_fs+0x56/0x74
     [<ffffffff800aa17a>] sys_delete_module+0x19f/0x1d4
     [<ffffffff8005d29e>] tracesys+0xd5/0xdf
    Unable to handle kernel paging request at ffffffff887aa9bb
    RIP: [<ffffffff8003be32>] strnlen+0x11/0x1a
    PGD 203067 PUD 205063 PMD 38ecc067 PTE 0
    Oops: 0000 [1] SMP
    last sysfs file: /kernel/dlm/brawl0/event_done
    CPU 0
    Modules linked in: dm_log_clustered(U) gfs(U) dlm gnbd(U) configfs lpfc scsi_transport_fc sg be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic cxgb3i iptable_filter ip_tables x_tables sctp autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ipv6 xfrm_nalgo crypto_api uio libcxgbi cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy i2c_piix4 i2c_core pcspkr serio_raw tpm_tis virtio_net virtio_balloon tpm tpm_bios dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ahci ata_piix libata sd_mod scsi_mod virtio_blk virtio_pci virtio_ring virtio ext3 jbd uhci_hcd ohci_hcd ehci_hcd
    Pid: 5091, comm: awk Tainted: G -------------------- 2.6.18-389.el5 #1
    RIP: 0010:[<ffffffff8003be32>]  [<ffffffff8003be32>] strnlen+0x11/0x1a
    RSP: 0018:ffff81001c8ebbf0  EFLAGS: 00010297
    RAX: ffffffff887aa9bb RBX: 0000000000000010 RCX: 0000000000000000
    RDX: ffffffff887aa9bb RSI: fffffffffffffffe RDI: ffffffff887aa9bb
    RBP: ffff810005f480cc R08: 0000000000000028 R09: 0000000000000060
    R10: ffff81001f6127c0 R11: 0000000000000048 R12: ffffffff887aa9bb
    R13: 0000000000000011 R14: ffff810005f49000 R15: ffff810005f49000
    FS:  00002afd82454f70(0000) GS:ffffffff80436000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffffff887aa9bb CR3: 00000000306f3000 CR4: 00000000000006e0
    Process awk (pid: 5091, threadinfo ffff81001c8ea000, task ffff810015ce70c0)
    Stack:  ffffffff8015acd4 ffffffff802c4152 ffff81001c8ebd28 0000000000000004
     ffff810005f480cc ffffffff802c4157 ffffffff8001a98f 0000001000000004
     0000000a00000011 ffffffffffffffff ffffffff8001a55f 0000000000000f34
    Call Trace:
     [<ffffffff8015acd4>] string+0x39/0x98
     [<ffffffff8001a98f>] vsnprintf+0x5d7/0xb54
     [<ffffffff8001a55f>] vsnprintf+0x1a7/0xb54
     [<ffffffff800423fd>] d_rehash+0x21/0x34
     [<ffffffff8000ab68>] get_page_from_freelist+0x380/0x442
     [<ffffffff800561c0>] seq_printf+0x67/0x8f
     [<ffffffff80132581>] inode_has_perm+0x56/0x63
     [<ffffffff800abca4>] s_show+0x22c/0x291
     [<ffffffff8003fa24>] seq_read+0x105/0x28c
     [<ffffffff8010c67a>] proc_reg_read+0x7e/0x99
     [<ffffffff8000b735>] vfs_read+0xcb/0x171
     [<ffffffff80011d9a>] sys_read+0x45/0x6e
     [<ffffffff8005d29e>] tracesys+0xd5/0xdf

I am able to reproduce this using:

    gfs/bin/brawl -i 100 -f gfs2 -S QUICK -t gfs2-sync

Update on this problem: I am now able to recreate it using a command similar to the one in comment #16:

    gfs/bin/brawl -i 100 -f gfs2 -R /home/bob/hlab.xml -l /usr/tests/sts-rhel5.11 -r /usr/tests/sts-rhel5.11 -S QUICK -t gfs2-sync &> /tmp/brawl.out

I've been using instrumentation to debug the issue, and I'm making some headway. Progress is slow because it takes a long time to recreate the problem. What I know so far:
1. The bd left over in the slab corresponds to a buffer_head that contains user data.
2. The bd was last manipulated by function gfs2_invalidatepage's call to gfs2_discard.
3. gfs2_discard found the bd was on an le list, so it dequeued its bd_le. From there it was lost.
4. The call to gfs2_invalidatepage came from aops->invalidatepage via the gfs2_ordered_aops. In other words, it did NOT come from gfs2_writepage_common, gfs2_write_jdata_pagevec, the jdata aops, nor the writeback aops.

I'm continuing to debug this with instrumentation. Latest findings:

1. gfs2_ordered_writepage called gfs2_page_add_databufs to add the bd to the buffer_head.
2. databuf_lo_add was called normally. This is not jdata, so pin/unpin was never called.
3. gfs2_releasepage was called for this page, but bh->b_count was non-zero, so it did "goto cannot_release".
4. The buffer was never included in gfs2_write_blocks, nor was it handled by the databuf lops: databuf_lo_before_commit was never called, etc.

When gfs2_ordered_write was called, the undead bd was on the sd_log_le_ordered list. However, it was not marked dirty at the time, so gfs2_ordered_write didn't do anything with it. At the time of failure, the buffer's only flag is BH_Uptodate.

Created attachment 916771 [details]
PATCH: GFS2: Detect and release delayed bufdata structures

This addendum patch detects and circumvents the problem. It seems pretty reliable. This is what I propose we ship for this problem. The patch is not needed in RHEL6 or above because the problem does not exist there; it was most likely fixed by the numerous patches we did to ordered writes, all of which seem too intrusive and risky to backport to RHEL5.

I tested this on my gfs-i16c-0{1,2}.mpc.lab.eng.bos.redhat.com (hlab) cluster using my new "g" kernel that contains this patch plus the five for bug #1022254. I used brawl, which was reliably recreating the problem.

I've now done two complete runs of the failing brawl scenario (each takes more than 9 hours) with no problems, using my "g" kernel that contains the patch.

So the following statements are proven facts based on instrumentation. The order of events is:

- A write is occurring, and gfs2_ordered_writepage gets called by vfs.
- The page sometimes has no buffers, so gfs2_ordered_writepage calls create_empty_buffers and sets the bh dirty. (This isn't always the case; sometimes the page has buffers and falls into the same trap.)
- gfs2_ordered_writepage calls gfs2_page_add_databufs, which in turn adds the new dirty buffer to the transaction via gfs2_trans_add_bh.
- gfs2_trans_add_bh creates a bd for the bh and calls the lops add function, databuf_lo_add.
- databuf_lo_add adds the bd to sd_log_le_ordered (the ordered write list).
- gfs2_ordered_writepage has no choice but to call block_write_full_page.
- gfs2_ordered_write gets called and takes the bd off the le_ordered list, but then it sees !buffer_dirty, so it can't submit it for IO, nor can it brelse it. (proof: dbg7==02)
- gfs2_ordered_wait gets called and takes the bd off the sd_log_le_ordered list. It has no choice.
- This happens before gfs2_releasepage().
- The call stack proves: the glock is dropped and glock_work_func calls do_xmote.
- The call stack proves: function inode_go_inval calls truncate_inode_pages, which calls gfs2_releasepage.
- Function gfs2_invalidatepage gets called, sees that offset==0, so it calls try_to_release_page (proof: dbg6==1).
- Function try_to_release_page calls a_ops->releasepage, AKA gfs2_releasepage.
- Function gfs2_releasepage notices that b_count is non-zero, so it can't release the page. Instead, it does nothing and leaves the bd stranded.
- This is the last time gfs2_releasepage is called for this bh/bd pair (which are still properly linked both ways). This was proven by virtue of the fact that dbg2 has not changed.

After weeks of trying various things, I FINALLY caught the culprit; in other words, who is still using the bd while it is being released. The call trace looks like this:

    bh/bd=ffff810678aa4c70/(null)
    ----------- [cut here ] --------- [please bite here ] ---------
    Kernel BUG at include/linux/buffer_head.h:263
    invalid opcode: 0000 [1] SMP
    last sysfs file: /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:01.0/0000:06:00.1/host
    CPU 1
    Modules linked in: lock_dlm gfs2 sctp dlm configfs autofs4 hidp rfcomm l2cap bluetooth lockdd
    Pid: 0, comm: swapper Not tainted 2.6.18-392.el5.bz516025i #1
    RIP: 0010:[<ffffffff800324d4>]  [<ffffffff800324d4>] end_buffer_read_sync+0x23/0x35
    RSP: 0018:ffff810487143cf0  EFLAGS: 00010202
    RAX: 0000000000004001 RBX: ffff810678aa4c70 RCX: ffff81048704e0c0
    RDX: ffff81025ed3d9d0 RSI: 0000000000000286 RDI: ffff81048704e0b8
    RBP: ffff81086866d3c0 R08: ffff81048704e0c0 R09: 000000000000002a
    R10: ffff810487143ae0 R11: ffffffff800423c1 R12: 0000000000000000
    R13: ffff81066a0a3948 R14: ffff810287196400 R15: ffff81046eb805b8
    FS:  0000000000000000(0000) GS:ffff8102871912c0(0000) knlGS:0000000000000000
    CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
    CR2: 000000001fefd000 CR3: 00000002601f4000 CR4: 00000000000006a0
    Process swapper (pid: 0, threadinfo ffff810108cc2000, task ffff8106871467b0)
    Stack:  ffff81086866d940 ffffffff800423f0 0000000000000000 ffffffff882b25af
     0000000000000000 0000000000000000 ffff81086866d640 0000000000000000
     ffff81066a0a3378 ffff81027e347c00 ffff81046f821768 ffffffff882b25af
    Call Trace:
     <IRQ>  [<ffffffff800423f0>] end_bio_bh_io_sync+0x2f/0x3b
     [<ffffffff882b25af>] :dm_mod:clone_endio+0xc1/0xd0
     [<ffffffff882b25af>] :dm_mod:clone_endio+0xc1/0xd0
     [<ffffffff882b25af>] :dm_mod:clone_endio+0xc1/0xd0
     [<ffffffff8002d3f4>] __end_that_request_first+0x270/0x4ea
     [<ffffffff8807a745>] :scsi_mod:scsi_end_request+0x27/0xcd
     [<ffffffff8807a969>] :scsi_mod:scsi_io_completion+0x17e/0x339
     [<ffffffff880a87ce>] :sd_mod:sd_rw_intr+0x21e/0x258
     [<ffffffff8817acf4>] :qla2xxx:qla24xx_process_response_queue+0xae/0x242
     [<ffffffff8807abe5>] :scsi_mod:scsi_device_unbusy+0x67/0x81
     [<ffffffff80038147>] blk_done_softirq+0x5f/0x6d
     [<ffffffff800125b5>] __do_softirq+0x89/0x133
     [<ffffffff8005e30c>] call_softirq+0x1c/0x28
     [<ffffffff8006d644>] do_softirq+0x2c/0x7d
     [<ffffffff8006d4d4>] do_IRQ+0xee/0xf7
     [<ffffffff8005d625>] ret_from_intr+0x0/0xa
     <EOI>  [<ffffffff801a66fd>] acpi_safe_halt+0x25/0x36
     [<ffffffff801a6f7b>] acpi_processor_idle_simple+0xfc/0x341
     [<ffffffff801a6e7f>] acpi_processor_idle_simple+0x0/0x341
     [<ffffffff801a6e7f>] acpi_processor_idle_simple+0x0/0x341
     [<ffffffff80049670>] cpu_idle+0x95/0xb8
     [<ffffffff80078d43>] start_secondary+0x479/0x488
    Code: 0f 0b 68 e0 44 2c 80 c2 07 01 eb fe f0 ff 4b 58 5b c3 41 57
    RIP  [<ffffffff800324d4>] end_buffer_read_sync+0x23/0x35
     RSP <ffff810487143cf0>
    <0>Kernel panic - not syncing: Aiee, killing interrupt handler!

So there you have it: there's IO still happening for the buffer while it's being released, and at the end of the IO, when end_bio_bh_io_sync is called, it decrements the b_count.

Created attachment 920713 [details]
GFS2: Detect and release delayed bufdata structures
This patch has a correction from the previous version.
It also has a better description.
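The test_clear_buffer_release() call visible in the diff further down implies a private buffer_head state bit that lets gfs2_releasepage hand off a bufdata it cannot free (because I/O is still in flight) for gfs2_ordered_wait to free later. A sketch of how such a flag is conventionally declared, assuming the patch follows the standard BUFFER_FNS pattern (the BH_Release name and placement are assumptions, not the verbatim patch):

```c
#include <linux/buffer_head.h>

/* Assumed declaration: a private bh state bit marking a buffer whose
 * gfs2_bufdata could not be freed at release time because I/O was still
 * outstanding. gfs2_releasepage would set it; gfs2_ordered_wait tests
 * and clears it after wait_on_buffer() and frees the bd then. */
enum {
	BH_Release = BH_PrivateStart,
};

BUFFER_FNS(Release, release)	 /* set_buffer_release(), clear_buffer_release(),
				  * buffer_release() */
TAS_BUFFER_FNS(Release, release) /* test_set_buffer_release(),
				  * test_clear_buffer_release() */
```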
The patch from comment #25 passed a complete 100-iteration run of brawl, so I'm planning to post it.

The addendum patch is now posted to rhkernel-list. It was tested on my hlab{1,2} (gfs-i16c-0{1,2}) cluster using 100 iterations of brawl. Changing the status to POST.
Addendum patch has been NACKed. Resetting status to ASSIGNED until this is resolved.

Replacement addendum patch was just posted to rhkernel-list.
This is identical to the previous addendum patch except for
this modification:
    diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
    index a547106..a7acfae 100644
    --- a/fs/gfs2/log.c
    +++ b/fs/gfs2/log.c
    @@ -844,6 +844,7 @@ static void gfs2_ordered_wait(struct gfs2_sbd *sdp)
     			get_bh(bh);
     			gfs2_log_unlock(sdp);
     			wait_on_buffer(bh);
    +			gfs2_log_lock(sdp);
     			if (test_clear_buffer_release(bh)) {
     				bd->bd_bh = NULL;
     				bh->b_private = NULL;
    @@ -851,7 +852,6 @@ static void gfs2_ordered_wait(struct gfs2_sbd *sdp)
     				kmem_cache_free(gfs2_bufdata_cachep, bd);
     			}
     			brelse(bh);
    -			gfs2_log_lock(sdp);
     			continue;
     		}
     		list_del_init(&bd->bd_le.le_list);
In other words, the gfs2_log_lock was moved up in order to cover
a timing window that was potentially a problem.
I am retesting on my hlab cluster, and so far brawl has run
successfully overnight. I'll let it run all weekend.
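To make the timing window mentioned above concrete, here is a reconstruction of the interleaving that moving the lock closes. This is inferred from the diff and the surrounding discussion, not taken from the patch itself:

```c
/*
 * Old ordering (gfs2_log_lock re-taken only after the flag test and
 * the bd teardown):
 *
 *   gfs2_ordered_wait                 another gfs2_log_lock holder
 *   -----------------                 ----------------------------
 *   gfs2_log_unlock(sdp);
 *   wait_on_buffer(bh);
 *                                     gfs2_log_lock(sdp);
 *                                     inspects bh->b_private and/or
 *                                     sets the release flag on this bh;
 *                                     gfs2_log_unlock(sdp);
 *   test_clear_buffer_release(bh);    <-- ran unlocked, so it could
 *   bd->bd_bh = NULL;                     interleave with the right
 *   bh->b_private = NULL;                 column and tear down the bd
 *   kmem_cache_free(cachep, bd);          while it was still in use
 *   gfs2_log_lock(sdp);
 *
 * New ordering: gfs2_log_lock(sdp) is taken immediately after
 * wait_on_buffer(), so the flag test, the b_private clearing, and the
 * kmem_cache_free() are all serialized against every other path that
 * manipulates the bh/bd linkage under the log lock.
 */
```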
I made it through a weekend run of brawl without hitting this with kernel-2.6.18-397.el5.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1196.html

Patch(es) available in kernel-2.6.18-398.el5.

You can download this test kernel (or newer) from http://people.redhat.com/agordeev/el5/

Detailed testing feedback is always welcomed. If you require guidance regarding testing, please ask the bug assignee.