Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 516025

Summary: GFS2: couldn't free all objects in gfs2_bufdata cache on unload
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: low
Reporter: Nate Straz <nstraz>
Assignee: Robert Peterson <rpeterso>
QA Contact: Cluster QE <mspqa-list>
CC: adas, bmarzins, chen1537, jharriga, lmiksik, nstraz, rpeterso, rwheeler, swhiteho
Keywords: Reopened
Target Milestone: rc
Target Release: ---
Fixed In Version: kernel-2.6.18-397.el5
Doc Type: Bug Fix
Last Closed: 2014-09-16 00:32:16 UTC
Attachments:
- RHEL5 patch to fix the problem (flags: none)
- PATCH: GFS2: Detect and release delayed bufdata structures (flags: none)
- GFS2: Detect and release delayed bufdata structures (flags: none)

Description Nate Straz 2009-08-06 12:52:21 UTC
Description of problem:

After running our standard set of regression tests on GFS2 on a six node cluster, one node was not able to destroy the gfs2_bufdata slab cache during unload of the lock_dlm module.

rmmod:
slab error in kmem_cache_destroy(): cache `gfs2_bufdata': Can't free all objects
 [<c04704e2>] kmem_cache_destroy+0x61/0xf3
 [<f8fcd622>] exit_gfs2_fs+0x3c/0x55 [gfs2]
 [<c043db98>] sys_delete_module+0x192/0x1ba
 [<c04483fe>] audit_syscall_entry+0x15a/0x18c
 [<c0404f17>] syscall_call+0x7/0xb
 =======================
SLAB: cache with size 48 has lost its name

subsequent modprobe failed:
DLM (built Jul 27 2009 17:45:57) installed
kmem_cache_create: duplicate cache gfs2_bufdata
 [<c0470075>] kmem_cache_create+0x453/0x483
 [<f897f10e>] init_gfs2_fs+0x8d/0x186 [gfs2]
 [<c043d7d5>] sys_init_module+0x1aed/0x1caa
 [<c0470481>] kmem_cache_destroy+0x0/0xf3
 [<c04483fe>] audit_syscall_entry+0x15a/0x18c
 [<c0404f17>] syscall_call+0x7/0xb
 =======================


Version-Release number of selected component (if applicable):
kernel-2.6.18-160.el5

How reproducible:
Unknown

Comment 1 Steve Whitehouse 2010-01-14 10:27:00 UTC
The gfs2_bufdata slab cache is in the gfs2 module, not the lock_dlm module. I wonder if this is another consequence of the bug which Fabio reported on the upstream kernel. It looks like we might be allowing the module to unload too early, although it's odd that it's the bufdata slab cache rather than the glock slab cache that has appeared.

Comment 2 Steve Whitehouse 2010-02-17 17:47:45 UTC
The glock issue is fixed upstream and in RHEL6. It doesn't directly translate to RHEL5 due to the lock_dlm module getting in the way.

We may need to implement some other kind of waiting before the module unloads since the upstream solution to the problem is far too invasive for RHEL5. I suspect that the same issue lies at the root of it though.

Comment 3 Steve Whitehouse 2011-01-24 13:02:17 UTC
Putting into needinfo until it can be reproduced.

Comment 4 Steve Whitehouse 2011-05-12 10:24:19 UTC
Since this does not appear to have occurred again and we have no known reproducer, I'm closing this for the time being. It has been in needinfo for some time and it was originally reported in 2009, so if it hasn't recurred in the meantime, then it doesn't seem to be something that happens very often.

Please reopen if it happens again.

Comment 5 Nate Straz 2012-09-07 13:58:01 UTC
I hit this again while running regressions on 5.9 with kernel-2.6.18-336.el5.

slab error in kmem_cache_destroy(): cache `gfs2_bufdata': Can't free all objects

Call Trace:
 [<ffffffff800e0f70>] kmem_cache_destroy+0x80/0x179
 [<ffffffff8867ee1e>] :gfs2:exit_gfs2_fs+0x56/0x74
 [<ffffffff800a9c24>] sys_delete_module+0x19f/0x1d4
 [<ffffffff8005d29e>] tracesys+0xd5/0xdf

SLAB: cache with size 96 has lost its name

What kind of info would be useful if I hit this again?

Comment 6 Robert Peterson 2012-09-11 15:01:45 UTC
The only slab we have that's 96 bytes long is gfs2_bufdata.
So there must have been a buffer that was not written or
something. Was there more info in dmesg?

Comment 7 Robert Peterson 2013-10-30 15:40:30 UTC
Nate, I opened bug #1024024 before I saw this one out there.
We've got a somewhat reliable way to recreate/debug it on RHEL6.
Since there's a recent push to close RHEL5 bugs, and minimize
our fixes to RHEL5, can we close this bug, and just work on
the problem in the name of bug #1024024? Or should we close
that one as a duplicate of this one? This one has apparently
sat dormant for more than a year, but I'm now actively looking
into the problem from a RHEL6 standpoint.

Comment 8 Robert Peterson 2013-11-04 18:35:31 UTC
Since there hasn't been any new information on this problem for
well over a year, we could probably close it again. To the best
of my knowledge, the QE people have not hit it again. However,
I am able to recreate the issue in RHEL6 and I'm debugging it
in the name of bug #1024024 as noted above.

When that bz gets solved, we can decide whether to back-port the
patch to RHEL5. It seems rare enough that customers aren't
hitting it. In the meantime, I'm reassigning it to myself so the
issue is all in one place.

Comment 9 Robert Peterson 2013-12-17 20:41:23 UTC
Now that bug #1024024 is solved and I've got two patches that
fix the problem, I have determined that the bug exists in RHEL5
and the patches could be easily ported back to RHEL5. Therefore
I'm going to proceed with that plan. Requesting flags for 5.11.

Comment 10 Robert Peterson 2013-12-18 15:26:16 UTC
Created attachment 838395 [details]
RHEL5 patch to fix the problem

This patch combines the two pertinent upstream patches.

Comment 11 Robert Peterson 2013-12-18 16:10:23 UTC
The attached "combined" patch was posted to rhkernel-list for
inclusion into RHEL5.11. Changing status to POST.
It was tested on my gfs-i16c-0{1,2}.mpc.lab.eng.bos.redhat.com
(aka hlab{1,2}) cluster.

Comment 15 Nate Straz 2014-05-29 13:15:37 UTC
I hit this again while running on kernel-2.6.18-389.el5.  It happened after running through the brawl load with the "sync" mount option.


slab error in kmem_cache_destroy(): cache `gfs2_bufdata': Can't free all objects

Call Trace:
 [<ffffffff800e177f>] kmem_cache_destroy+0x80/0x179
 [<ffffffff887a7f22>] :gfs2:exit_gfs2_fs+0x56/0x74
 [<ffffffff800aa17a>] sys_delete_module+0x19f/0x1d4
 [<ffffffff8005d29e>] tracesys+0xd5/0xdf

Unable to handle kernel paging request at ffffffff887aa9bb RIP: 
 [<ffffffff8003be32>] strnlen+0x11/0x1a
PGD 203067 PUD 205063 PMD 38ecc067 PTE 0
Oops: 0000 [1] SMP 
last sysfs file: /kernel/dlm/brawl0/event_done
CPU 0 
Modules linked in: dm_log_clustered(U) gfs(U) dlm gnbd(U) configfs lpfc scsi_transport_fc sg be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic cxgb3i iptable_filter ip_tables x_tables sctp autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ipv6 xfrm_nalgo crypto_api uio libcxgbi cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy i2c_piix4 i2c_core pcspkr serio_raw tpm_tis virtio_net virtio_balloon tpm tpm_bios dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ahci ata_piix libata sd_mod scsi_mod virtio_blk virtio_pci virtio_ring virtio ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 5091, comm: awk Tainted: G     --------------------    2.6.18-389.el5 #1
RIP: 0010:[<ffffffff8003be32>]  [<ffffffff8003be32>] strnlen+0x11/0x1a
RSP: 0018:ffff81001c8ebbf0  EFLAGS: 00010297
RAX: ffffffff887aa9bb RBX: 0000000000000010 RCX: 0000000000000000
RDX: ffffffff887aa9bb RSI: fffffffffffffffe RDI: ffffffff887aa9bb
RBP: ffff810005f480cc R08: 0000000000000028 R09: 0000000000000060
R10: ffff81001f6127c0 R11: 0000000000000048 R12: ffffffff887aa9bb
R13: 0000000000000011 R14: ffff810005f49000 R15: ffff810005f49000
FS:  00002afd82454f70(0000) GS:ffffffff80436000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff887aa9bb CR3: 00000000306f3000 CR4: 00000000000006e0
Process awk (pid: 5091, threadinfo ffff81001c8ea000, task ffff810015ce70c0)
Stack:  ffffffff8015acd4 ffffffff802c4152 ffff81001c8ebd28 0000000000000004
 ffff810005f480cc ffffffff802c4157 ffffffff8001a98f 0000001000000004
 0000000a00000011 ffffffffffffffff ffffffff8001a55f 0000000000000f34
Call Trace:
 [<ffffffff8015acd4>] string+0x39/0x98
 [<ffffffff8001a98f>] vsnprintf+0x5d7/0xb54
 [<ffffffff8001a55f>] vsnprintf+0x1a7/0xb54
 [<ffffffff800423fd>] d_rehash+0x21/0x34
 [<ffffffff8000ab68>] get_page_from_freelist+0x380/0x442
 [<ffffffff800561c0>] seq_printf+0x67/0x8f
 [<ffffffff80132581>] inode_has_perm+0x56/0x63
 [<ffffffff800abca4>] s_show+0x22c/0x291
 [<ffffffff8003fa24>] seq_read+0x105/0x28c
 [<ffffffff8010c67a>] proc_reg_read+0x7e/0x99
 [<ffffffff8000b735>] vfs_read+0xcb/0x171
 [<ffffffff80011d9a>] sys_read+0x45/0x6e
 [<ffffffff8005d29e>] tracesys+0xd5/0xdf

Comment 16 Nate Straz 2014-05-29 19:20:51 UTC
I am able to reproduce this using:

gfs/bin/brawl -i 100 -f gfs2 -S QUICK -t gfs2-sync

Comment 17 Robert Peterson 2014-06-16 15:26:40 UTC
Update on this problem: I am now able to recreate it using a command
similar to the one in comment #16:

gfs/bin/brawl -i 100 -f gfs2 -R /home/bob/hlab.xml -l /usr/tests/sts-rhel5.11 -r /usr/tests/sts-rhel5.11 -S QUICK -t gfs2-sync &> /tmp/brawl.out

I've been using instrumentation to debug the issue, and making
some headway. Progress is slow because it takes a long time to
recreate the problem.

What I know so far is:
1. The bd left over in the slab corresponds to a buffer_head that
   contains user data.
2. The bd was last manipulated by function gfs2_invalidatepage's
   call to gfs2_discard.
3. gfs2_discard found the bd was on an le list, so it dequeued it
   from its bd_le. From there it was lost.
4. The call to gfs2_invalidatepage came from aops->invalidatepage
   via the gfs2_ordered_aops. In other words, it did NOT come from
   gfs2_writepage_common, gfs2_write_jdata_pagevec, jdata aops,
   nor writeback_aops.

I'm continuing to debug this with instrumentation.

Comment 18 Robert Peterson 2014-06-18 14:53:45 UTC
Latest findings:

1. gfs2_ordered_writepage called gfs2_page_add_databufs to add the
   bd to the buffer_head.
2. databuf_lo_add was called normally
   This is not jdata, so pin/unpin was never called.
3. gfs2_releasepage was called for this page, but bh->b_count was
   non-zero, so it did "goto cannot_release".
4. The buffer was never included in gfs2_write_blocks, nor was
   the databuf lops: databuf_lo_before_commit was never called, etc.

Comment 19 Robert Peterson 2014-06-18 17:11:48 UTC
When gfs2_ordered_write was called, the undead bd was on the
sd_log_le_ordered list. However, it was not marked dirty at
the time. So gfs2_ordered_write didn't do anything with it.
At the time of failure, the buffer's only flag is BH_Uptodate.

Comment 21 Robert Peterson 2014-07-09 13:12:33 UTC
Created attachment 916771 [details]
PATCH: GFS2: Detect and release delayed bufdata structures

This addendum patch detects and circumvents the problem.
It seems pretty reliable. This is what I propose we ship for
this problem. The patch is not needed in RHEL6 or above because
the problem does not exist there; it was most likely fixed by
the numerous patches we did to ordered writes, all of which
seem too intrusive and risky to backport to RHEL5.

I tested this on my gfs-i16c-0{1,2}.mpc.lab.eng.bos.redhat.com
(hlab) cluster using my new "g" kernel that contains this patch
plus the five for bug #1022254. I used brawl, which was reliably
recreating the problem.

Comment 22 Robert Peterson 2014-07-10 12:52:04 UTC
I've now done two complete runs of the failing brawl scenario
with no problems, using my "g" kernel that contains the patch
(each run takes more than 9 hours).

Comment 23 Robert Peterson 2014-07-10 13:36:27 UTC
So the following statements are proven facts based on instrumentation:
The order of events is:

- A write is occurring, and gfs2_ordered_writepage gets called by vfs.
- The page sometimes has no buffers, so gfs2_ordered_writepage calls
  create_empty_buffers and sets the bh dirty. (This isn't always the
  case. Sometimes the page has buffers and falls into the same trap).
- gfs2_ordered_writepage calls gfs2_page_add_databufs, which in turn
  adds the new dirty buffer to the transaction via gfs2_trans_add_bh.
- gfs2_trans_add_bh creates a bd for the bh and calls the lops add
  function, databuf_lo_add.
- databuf_lo_add adds the bd to the sd_log_le_ordered (ordered write
  list).
- gfs2_ordered_writepage has no choice but to call block_write_full_page
- gfs2_ordered_write gets called, takes the bd off the le_ordered list,
  but then it sees !buffer_dirty, so it can't submit it for IO, nor
  can it brelse. (proof: dbg7==02)
- gfs2_ordered_wait gets called, takes the bd off the
  sd_log_le_ordered list. It has no choice.
- This happens before gfs2_releasepage().
- The call stack proves: The glock is dropped and the glock_work_func
  calls do_xmote.
- The call stack proves: Function inode_go_inval calls
  truncate_inode_pages, which calls gfs2_releasepage.
- Function gfs2_invalidatepage gets called, sees that offset==0, so
  it calls try_to_release_page (proof: dbg6==1).
- Function try_to_release_page calls a_ops->releasepage, AKA
  gfs2_releasepage.
- Function gfs2_releasepage notices that b_count is non-zero, so it
  can't release the page. Instead, it does nothing and leaves the bd
  stranded.
- This is the last time gfs2_releasepage is called for this bh/bd
  pair (which are still properly linked both ways). This was proven
  by virtue of the fact that dbg2 has not changed.
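
The stranded-bd state at the end of this sequence can be modeled with a minimal user-space C sketch (hypothetical, heavily simplified structs, not the real kernel types): the releasepage path sees a held b_count, bails out, and the bufdata stays linked, which is exactly what later prevents kmem_cache_destroy from freeing the slab.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's buffer_head and
 * gfs2_bufdata; only the fields this sketch needs. */
struct buffer_head;
struct bufdata {
    struct buffer_head *bd_bh;   /* back-pointer to the buffer */
};
struct buffer_head {
    int b_count;                 /* reference count; in-flight I/O holds a ref */
    struct bufdata *b_private;   /* the attached bufdata, if any */
};

/* Sketch of the releasepage decision described above: if anyone still
 * holds a reference, do nothing and leave the bh/bd pair linked
 * ("cannot_release"). Returns 1 if the pair was unlinked. */
static int releasepage_sketch(struct buffer_head *bh)
{
    if (bh->b_count != 0)
        return 0;                /* bd stays stranded in the slab */
    if (bh->b_private) {
        bh->b_private->bd_bh = NULL;
        bh->b_private = NULL;    /* the real code frees the bd to the slab here */
    }
    return 1;
}
```

With b_count still held, the call returns 0 and the bd remains linked; once the count drops to zero, a retry unlinks the pair.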

Comment 24 Robert Peterson 2014-07-11 13:51:56 UTC
After weeks of trying various things, I FINALLY caught the
culprit, in other words, who was still using the bd while it was
being released. The call trace looks like this:

bh/bd=ffff810678aa4c70/(null)
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at include/linux/buffer_head.h:263
invalid opcode: 0000 [1] SMP 
last sysfs file: /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:01.0/0000:06:00.1/host
CPU 1 
Modules linked in: lock_dlm gfs2 sctp dlm configfs autofs4 hidp rfcomm l2cap bluetooth lockdd
Pid: 0, comm: swapper Not tainted 2.6.18-392.el5.bz516025i #1
RIP: 0010:[<ffffffff800324d4>]  [<ffffffff800324d4>] end_buffer_read_sync+0x23/0x35
RSP: 0018:ffff810487143cf0  EFLAGS: 00010202
RAX: 0000000000004001 RBX: ffff810678aa4c70 RCX: ffff81048704e0c0
RDX: ffff81025ed3d9d0 RSI: 0000000000000286 RDI: ffff81048704e0b8
RBP: ffff81086866d3c0 R08: ffff81048704e0c0 R09: 000000000000002a
R10: ffff810487143ae0 R11: ffffffff800423c1 R12: 0000000000000000
R13: ffff81066a0a3948 R14: ffff810287196400 R15: ffff81046eb805b8
FS:  0000000000000000(0000) GS:ffff8102871912c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000001fefd000 CR3: 00000002601f4000 CR4: 00000000000006a0
Process swapper (pid: 0, threadinfo ffff810108cc2000, task ffff8106871467b0)
Stack:  ffff81086866d940 ffffffff800423f0 0000000000000000 ffffffff882b25af
 0000000000000000 0000000000000000 ffff81086866d640 0000000000000000
 ffff81066a0a3378 ffff81027e347c00 ffff81046f821768 ffffffff882b25af
Call Trace:
 <IRQ>  [<ffffffff800423f0>] end_bio_bh_io_sync+0x2f/0x3b
 [<ffffffff882b25af>] :dm_mod:clone_endio+0xc1/0xd0
 [<ffffffff882b25af>] :dm_mod:clone_endio+0xc1/0xd0
 [<ffffffff882b25af>] :dm_mod:clone_endio+0xc1/0xd0
 [<ffffffff8002d3f4>] __end_that_request_first+0x270/0x4ea
 [<ffffffff8807a745>] :scsi_mod:scsi_end_request+0x27/0xcd
 [<ffffffff8807a969>] :scsi_mod:scsi_io_completion+0x17e/0x339
 [<ffffffff880a87ce>] :sd_mod:sd_rw_intr+0x21e/0x258
 [<ffffffff8817acf4>] :qla2xxx:qla24xx_process_response_queue+0xae/0x242
 [<ffffffff8807abe5>] :scsi_mod:scsi_device_unbusy+0x67/0x81
 [<ffffffff80038147>] blk_done_softirq+0x5f/0x6d
 [<ffffffff800125b5>] __do_softirq+0x89/0x133
 [<ffffffff8005e30c>] call_softirq+0x1c/0x28
 [<ffffffff8006d644>] do_softirq+0x2c/0x7d
 [<ffffffff8006d4d4>] do_IRQ+0xee/0xf7
 [<ffffffff8005d625>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff801a66fd>] acpi_safe_halt+0x25/0x36
 [<ffffffff801a6f7b>] acpi_processor_idle_simple+0xfc/0x341
 [<ffffffff801a6e7f>] acpi_processor_idle_simple+0x0/0x341
 [<ffffffff801a6e7f>] acpi_processor_idle_simple+0x0/0x341
 [<ffffffff80049670>] cpu_idle+0x95/0xb8
 [<ffffffff80078d43>] start_secondary+0x479/0x488


Code: 0f 0b 68 e0 44 2c 80 c2 07 01 eb fe f0 ff 4b 58 5b c3 41 57 
RIP  [<ffffffff800324d4>] end_buffer_read_sync+0x23/0x35
 RSP <ffff810487143cf0>
 <0>Kernel panic - not syncing: Aiee, killing interrupt handler!

So there you have it: there's IO still happening for the
buffer while it's being released, and at the end of the IO,
when end_bio_bh_io_sync is called, it decrements the b_count.
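
The reference counting involved can be shown with a toy model (assumed names, not the real kernel API): I/O submission takes a reference on the buffer, the completion handler drops it, so a release attempt made while the I/O is still in flight necessarily fails.

```c
#include <assert.h>

/* Toy model of the race in this comment: an in-flight read holds a
 * reference on the buffer, and the completion path (analogous to
 * end_buffer_read_sync) only drops it when the bio finishes. */
struct toy_bh {
    int b_count;
    int under_io;
};

static void submit_read(struct toy_bh *bh)  /* submission takes a ref */
{
    bh->b_count++;
    bh->under_io = 1;
}

static void end_io(struct toy_bh *bh)       /* completion drops the ref */
{
    bh->under_io = 0;
    bh->b_count--;                          /* the decrement seen in the trace */
}

/* A release attempt can only succeed once no references remain. */
static int can_release(const struct toy_bh *bh)
{
    return bh->b_count == 0;
}
```

This is why the eventual fix waits for the buffer before deciding whether the bd can be torn down.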

Comment 25 Robert Peterson 2014-07-24 19:36:50 UTC
Created attachment 920713 [details]
GFS2: Detect and release delayed bufdata structures

This patch has a correction from the previous version.
It also has a better description.

Comment 26 Robert Peterson 2014-07-28 13:32:08 UTC
The patch from comment #25 passed a complete 100-iteration run
of brawl, so I'm planning to post it.

Comment 27 Robert Peterson 2014-07-28 13:55:47 UTC
The addendum patch is now posted to rhkernel-list. It was tested
on my hlab{1,2} (gfs-i16c-0{1,2}) cluster using 100 iterations
of brawl. Changing the status to POST.

Comment 28 Robert Peterson 2014-07-31 15:50:50 UTC
Addendum patch has been NACKed. Resetting status to ASSIGNED
until this is resolved.

Comment 29 Robert Peterson 2014-08-01 15:52:32 UTC
Replacement addendum patch was just posted to rhkernel-list.
This is identical to the previous addendum patch except for
this modification:

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index a547106..a7acfae 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -844,6 +844,7 @@ static void gfs2_ordered_wait(struct gfs2_sbd *sdp)
                         get_bh(bh);
                         gfs2_log_unlock(sdp);
                         wait_on_buffer(bh);
+                        gfs2_log_lock(sdp);
                         if (test_clear_buffer_release(bh)) {
                                 bd->bd_bh = NULL;
                                 bh->b_private = NULL;
@@ -851,7 +852,6 @@ static void gfs2_ordered_wait(struct gfs2_sbd *sdp)
                                 kmem_cache_free(gfs2_bufdata_cachep, bd);
                         }
                         brelse(bh);
-                        gfs2_log_lock(sdp);
                         continue;
                 }
                 list_del_init(&bd->bd_le.le_list);

In other words, the gfs2_log_lock was moved up in order to cover
a timing window that was potentially a problem.
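
The effect of moving the lock can be illustrated with a small sketch (a mock lock and assumed helper names, not the kernel API): after the change, the test-and-clear of the release flag and the bd teardown always run with the log lock held, closing the window between wait_on_buffer() and the test.

```c
#include <assert.h>

/* Mock lock standing in for gfs2_log_lock/gfs2_log_unlock. */
static int log_lock_held;
static void mock_log_lock(void)   { log_lock_held = 1; }
static void mock_log_unlock(void) { log_lock_held = 0; }

struct mock_bd {
    int release_flag;   /* stands in for the buffer's release bit */
    int freed;          /* set when the bd would be freed to the slab */
};

/* Shared state may only be inspected and modified under the lock;
 * the assert is the invariant the replacement patch restores. */
static void test_clear_and_free(struct mock_bd *bd)
{
    assert(log_lock_held);
    if (bd->release_flag) {
        bd->release_flag = 0;
        bd->freed = 1;          /* kmem_cache_free(...) in the real code */
    }
}

/* Post-fix ordering: reacquire the lock right after the wait
 * completes, BEFORE touching the shared bd. */
static void ordered_wait_step(struct mock_bd *bd)
{
    /* wait_on_buffer(bh) completes here, outside the lock */
    mock_log_lock();            /* moved up by the patch */
    test_clear_and_free(bd);
    mock_log_unlock();
}
```

In the pre-fix ordering, test_clear_and_free would have run after the unlock, tripping the invariant.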

I am retesting on my hlab cluster, and so far brawl has run
successfully overnight. I'll let it run all weekend.

Comment 32 Nate Straz 2014-08-11 14:47:59 UTC
I made it through a weekend run of brawl without hitting this with kernel-2.6.18-397.el5.

Comment 34 errata-xmlrpc 2014-09-16 00:32:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1196.html

Comment 35 Alexander Gordeev 2014-09-16 08:57:31 UTC
Patch(es) available in kernel-2.6.18-398.el5
You can download this test kernel (or newer) from http://people.redhat.com/agordeev/el5/
Detailed testing feedback is always welcomed.
If you require guidance regarding testing, please ask the bug assignee.