690900 – slab corruption after seeing some nfs-related BUG: warning


Summary: slab corruption after seeing some nfs-related BUG: warning

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	6.0
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	J. Bruce Fields
QA Contact:	Filesystem QE
Docs Contact:
URL:
Whiteboard:
Depends On:	589512
Blocks:

Reported:	2011-03-25 18:25 UTC by J. Bruce Fields
Modified:	2013-03-04 00:20 UTC (History)
CC List:	17 users (show)
Fixed In Version:	kernel-2.6.32-130.el6
Doc Type:	Bug Fix
Doc Text:
Clone Of:	589512
Environment:
Last Closed:	2011-05-19 12:55:38 UTC
Target Upstream Version:
Embargoed:


Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2011:0542	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 6.1 kernel security, bug fix and enhancement update	2011-05-19 11:58:07 UTC

Description J. Bruce Fields 2011-03-25 18:25:02 UTC

+++ This bug was initially created as a clone of Bug #589512 +++

Description of problem:
Two different boxes encountered a slab corruption after seeing such messages:
BUG: warning at lib/kref.c:32/kref_get() (Not tainted)

Call Trace:
 [<ffffffff800368c3>] kref_get+0x38/0x3d
 [<ffffffff884cb59e>] :sunrpc:svcauth_unix_set_client+0x87/0xc5
 [<ffffffff884c82e5>] :sunrpc:svc_process+0x2b0/0x71b
 [<ffffffff8008a4ad>] default_wake_function+0x0/0xe
 [<ffffffff884c98f1>] :sunrpc:svc_send+0xda/0x10d
 [<ffffffff8851408a>] :lockd:lockd+0x0/0x272
 [<ffffffff88514211>] :lockd:lockd+0x187/0x272
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8851408a>] :lockd:lockd+0x0/0x272
 [<ffffffff8851408a>] :lockd:lockd+0x0/0x272
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


In both cases, we've had the following backtraces:
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lib/list_debug.c:70
invalid opcode: 0000 [1] SMP 
last sysfs file: /class/fc_remote_ports/rport-1:0-1/roles
CPU 4 
Modules linked in: mptctl mptbase ipmi_devintf ipmi_si ipmi_msghandler nfsd exportfs lockd nfs_acl auth_rpcgss sunrpc autofs4 bonding dm_round_robin dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport st shpchp sg tg3 serio_raw pcspkr hpilo libphy bnx2x dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage qla2xxx scsi_transport_fc cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 54, comm: events/4 Not tainted 2.6.18-128.2.1.el5 #1
RIP: 0010:[<ffffffff8014c3db>]  [<ffffffff8014c3db>] list_del+0x48/0x71
RSP: 0018:ffff81031f9fdd70  EFLAGS: 00010086
RAX: 0000000000000058 RBX: ffff81019db49140 RCX: ffffffff802f7aa8
RDX: ffffffff802f7aa8 RSI: 0000000000000000 RDI: ffffffff802f7aa0
RBP: ffff8101a568cdc0 R08: ffffffff802f7aa8 R09: 000000000000003f
R10: ffff81031f9fda10 R11: 0000000000000280 R12: ffff81019e454540
R13: ffff8102f3ebec00 R14: 0000000000000000 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff81031ffe1240(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000129c9490 CR3: 00000003101bd000 CR4: 00000000000006e0
Process events/4 (pid: 54, threadinfo ffff81031f9fc000, task ffff81019f7dc7a0)
Stack:  ffff81019db49140 ffffffff800d732e ffff81031f2a8440 000000011f2a8440
 ffff81031fe1f618 ffff81031fe1f618 0000000000000001 ffff81031fe1f600
 0000000000000001 ffff8101a568cdc0 ffff81019e454540 ffffffff800d7447
Call Trace:
 [<ffffffff800d732e>] free_block+0xb5/0x143
 [<ffffffff800d7447>] drain_array+0x8b/0xc0
 [<ffffffff800d7e84>] cache_reap+0x0/0x217
 [<ffffffff800d7f29>] cache_reap+0xa5/0x217
 [<ffffffff8004d1c5>] run_workqueue+0x94/0xe4
 [<ffffffff80049a3f>] worker_thread+0x0/0x122
 [<ffffffff80049b2f>] worker_thread+0xf0/0x122
 [<ffffffff8008a4ad>] default_wake_function+0x0/0xe
 [<ffffffff800323b8>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800322ba>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11



Version-Release number of selected component (if applicable):
RHEL5.3 2.6.18-128.2.1.el5

How reproducible:
Not known; in both cases the uptime was about 2 months.
vmcores are available for both systems; now trying to run memtest to rule out a RAM/hardware fault.

--- Additional comment from bfields on 2011-03-24 23:19:45 EDT ---

My reproducer is:

0. Patch server to BUG after it sees an auth_domain reference count exceed 1000.
1. Do NFSv3 mount on client
2. Run connectathon tests in a loop on client.  (cthon locking tests would have been sufficient.)
3. Run "exportfs -f" in a loop on the server.

After digging through the code some more: there's a spurious "rqstp->rq_client = NULL" in the lockd code, left over from an old patch that moved the rq_client management out of the lockd code into common sunrpc code but left this one piece behind.  The result is that the rq_client isn't put when it should be, leading to this refcount imbalance.  After 4 billion lock requests or so the refcount overflows to 0, then to 1, and then the next auth_domain_put() incorrectly frees the auth_domain while it is still in use.

Confirmed that I no longer hit the bug in my reproducer above after removing this line.

diff --git a/fs/nfsd/lockd.c b/fs/nfsd/lockd.c
index 0c6d816..7c831a2 100644
--- a/fs/nfsd/lockd.c
+++ b/fs/nfsd/lockd.c
@@ -38,7 +38,6 @@ nlm_fopen(struct svc_rqst *rqstp, struct nfs_fh *f, struct file **filp)
 	exp_readlock();
 	nfserr = nfsd_open(rqstp, &fh, S_IFREG, NFSD_MAY_LOCK, filp);
 	fh_put(&fh);
-	rqstp->rq_client = NULL;
 	exp_readunlock();
  	/* We return nlm error codes as nlm doesn't know
 	 * about nfsd, but nfsd does know about nlm..

Comment 1 KernelOops Bot 2011-03-25 18:25:50 UTC

 with this filelineno:  bug 589512

Comment 5 RHEL Program Management 2011-03-25 18:39:46 UTC

This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 9 Aristeu Rozanski 2011-04-07 13:51:41 UTC

Patch(es) available on kernel-2.6.32-130.el6

Comment 13 errata-xmlrpc 2011-05-19 12:55:38 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html
