Bugzilla will be upgraded to version 5.0. The upgrade date is tentatively scheduled for 2 December 2018, pending final testing and feedback.
Bug 690900 - slab corruption after seeing some nfs-related BUG: warning
slab corruption after seeing some nfs-related BUG: warning
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel (Show other bugs)
6.0
All Linux
high Severity high
: rc
: ---
Assigned To: J. Bruce Fields
Filesystem QE
: Reopened
Depends On: 589512
Blocks:
  Show dependency treegraph
 
Reported: 2011-03-25 14:25 EDT by J. Bruce Fields
Modified: 2013-03-03 19:20 EST (History)
17 users (show)

See Also:
Fixed In Version: kernel-2.6.32-130.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 589512
Environment:
Last Closed: 2011-05-19 08:55:38 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:0542 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 6.1 kernel security, bug fix and enhancement update 2011-05-19 07:58:07 EDT

  None (edit)
Description J. Bruce Fields 2011-03-25 14:25:02 EDT
+++ This bug was initially created as a clone of Bug #589512 +++

Description of problem:
Two different boxes encountered a slab corruption after seeing such messages:
BUG: warning at lib/kref.c:32/kref_get() (Not tainted)

Call Trace:
 [<ffffffff800368c3>] kref_get+0x38/0x3d
 [<ffffffff884cb59e>] :sunrpc:svcauth_unix_set_client+0x87/0xc5
 [<ffffffff884c82e5>] :sunrpc:svc_process+0x2b0/0x71b
 [<ffffffff8008a4ad>] default_wake_function+0x0/0xe
 [<ffffffff884c98f1>] :sunrpc:svc_send+0xda/0x10d
 [<ffffffff8851408a>] :lockd:lockd+0x0/0x272
 [<ffffffff88514211>] :lockd:lockd+0x187/0x272
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8851408a>] :lockd:lockd+0x0/0x272
 [<ffffffff8851408a>] :lockd:lockd+0x0/0x272
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


In both cases, we've had the following backtraces:
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lib/list_debug.c:70
invalid opcode: 0000 [1] SMP 
last sysfs file: /class/fc_remote_ports/rport-1:0-1/roles
CPU 4 
Modules linked in: mptctl mptbase ipmi_devintf ipmi_si ipmi_msghandler nfsd exportfs lockd nfs_acl auth_rpcgss sunrpc autofs4 bonding dm_round_robin dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport st shpchp sg tg3 serio_raw pcspkr hpilo libphy bnx2x dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage qla2xxx scsi_transport_fc cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 54, comm: events/4 Not tainted 2.6.18-128.2.1.el5 #1
RIP: 0010:[<ffffffff8014c3db>]  [<ffffffff8014c3db>] list_del+0x48/0x71
RSP: 0018:ffff81031f9fdd70  EFLAGS: 00010086
RAX: 0000000000000058 RBX: ffff81019db49140 RCX: ffffffff802f7aa8
RDX: ffffffff802f7aa8 RSI: 0000000000000000 RDI: ffffffff802f7aa0
RBP: ffff8101a568cdc0 R08: ffffffff802f7aa8 R09: 000000000000003f
R10: ffff81031f9fda10 R11: 0000000000000280 R12: ffff81019e454540
R13: ffff8102f3ebec00 R14: 0000000000000000 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff81031ffe1240(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000129c9490 CR3: 00000003101bd000 CR4: 00000000000006e0
Process events/4 (pid: 54, threadinfo ffff81031f9fc000, task ffff81019f7dc7a0)
Stack:  ffff81019db49140 ffffffff800d732e ffff81031f2a8440 000000011f2a8440
 ffff81031fe1f618 ffff81031fe1f618 0000000000000001 ffff81031fe1f600
 0000000000000001 ffff8101a568cdc0 ffff81019e454540 ffffffff800d7447
Call Trace:
 [<ffffffff800d732e>] free_block+0xb5/0x143
 [<ffffffff800d7447>] drain_array+0x8b/0xc0
 [<ffffffff800d7e84>] cache_reap+0x0/0x217
 [<ffffffff800d7f29>] cache_reap+0xa5/0x217
 [<ffffffff8004d1c5>] run_workqueue+0x94/0xe4
 [<ffffffff80049a3f>] worker_thread+0x0/0x122
 [<ffffffff80049b2f>] worker_thread+0xf0/0x122
 [<ffffffff8008a4ad>] default_wake_function+0x0/0xe
 [<ffffffff800323b8>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800322ba>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11



Version-Release number of selected component (if applicable):
RHEL5.3 2.6.18-128.2.1.el5

How reproducible:
Not known, in both cases the uptime was ~= 2 month
vmcores available for both systems, now trying to run memtest to be sure it's not RAM/hw

--- Additional comment from bfields@redhat.com on 2011-03-24 23:19:45 EDT ---

My reproducer is:

0. Patch server to BUG after it sees an auth_domain reference count exceed 1000.
1. Do NFSv3 mount on client
2. Run connectathon tests in a loop on client.  (cthon locking tests would have been sufficient.)
3. Run "exportfs -f" in a loop on the server.

And, after digging through the code some more.... There's a spurious "rqstp->rq_client = NULL" in the lockd code, leftover from an old patch that moved the rq_client management out of the lockd code into common sunrpc code, but left behind this one piece.  The result is that the rq_client isn't put when it should be, leading to this refcount imbalance.  After 4 billion lock requests or so the refcount overflows to 0, then to 1, and then the next auth_domain_put() incorrectly frees the auth_domain while it is still in use.

Confirmed that I no longer hit the bug in my reproducer above after removing this line.

diff --git a/fs/nfsd/lockd.c b/fs/nfsd/lockd.c
index 0c6d816..7c831a2 100644
--- a/fs/nfsd/lockd.c
+++ b/fs/nfsd/lockd.c
@@ -38,7 +38,6 @@ nlm_fopen(struct svc_rqst *rqstp, struct nfs_fh *f, struct fil
e **filp)
 	exp_readlock();
 	nfserr = nfsd_open(rqstp, &fh, S_IFREG, NFSD_MAY_LOCK, filp);
 	fh_put(&fh);
-	rqstp->rq_client = NULL;
 	exp_readunlock();
  	/* We return nlm error codes as nlm doesn't know
 	 * about nfsd, but nfsd does know about nlm..
Comment 1 KernelOops Bot 2011-03-25 14:25:50 EDT
 with this filelineno:  bug 589512
Comment 5 RHEL Product and Program Management 2011-03-25 14:39:46 EDT
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.
Comment 9 Aristeu Rozanski 2011-04-07 09:51:41 EDT
Patch(es) available on kernel-2.6.32-130.el6
Comment 13 errata-xmlrpc 2011-05-19 08:55:38 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html

Note You need to log in before you can comment on or make changes to this bug.