Red Hat Bugzilla – Bug 690900
slab corruption after seeing some nfs-related BUG: warning
Last modified: 2013-03-03 19:20:37 EST
+++ This bug was initially created as a clone of Bug #589512 +++

Description of problem:

Two different boxes encountered a slab corruption after seeing such messages:

BUG: warning at lib/kref.c:32/kref_get() (Not tainted)

Call Trace:
 [<ffffffff800368c3>] kref_get+0x38/0x3d
 [<ffffffff884cb59e>] :sunrpc:svcauth_unix_set_client+0x87/0xc5
 [<ffffffff884c82e5>] :sunrpc:svc_process+0x2b0/0x71b
 [<ffffffff8008a4ad>] default_wake_function+0x0/0xe
 [<ffffffff884c98f1>] :sunrpc:svc_send+0xda/0x10d
 [<ffffffff8851408a>] :lockd:lockd+0x0/0x272
 [<ffffffff88514211>] :lockd:lockd+0x187/0x272
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8851408a>] :lockd:lockd+0x0/0x272
 [<ffffffff8851408a>] :lockd:lockd+0x0/0x272
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

In both cases, we've had the following backtraces:

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lib/list_debug.c:70
invalid opcode: 0000 [1] SMP
last sysfs file: /class/fc_remote_ports/rport-1:0-1/roles
CPU 4
Modules linked in: mptctl mptbase ipmi_devintf ipmi_si ipmi_msghandler nfsd exportfs lockd nfs_acl auth_rpcgss sunrpc autofs4 bonding dm_round_robin dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport st shpchp sg tg3 serio_raw pcspkr hpilo libphy bnx2x dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage qla2xxx scsi_transport_fc cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 54, comm: events/4 Not tainted 2.6.18-128.2.1.el5 #1
RIP: 0010:[<ffffffff8014c3db>] [<ffffffff8014c3db>] list_del+0x48/0x71
RSP: 0018:ffff81031f9fdd70 EFLAGS: 00010086
RAX: 0000000000000058 RBX: ffff81019db49140 RCX: ffffffff802f7aa8
RDX: ffffffff802f7aa8 RSI: 0000000000000000 RDI: ffffffff802f7aa0
RBP: ffff8101a568cdc0 R08: ffffffff802f7aa8 R09: 000000000000003f
R10: ffff81031f9fda10 R11: 0000000000000280 R12: ffff81019e454540
R13: ffff8102f3ebec00 R14: 0000000000000000 R15: 0000000000000001
FS: 0000000000000000(0000) GS:ffff81031ffe1240(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000129c9490 CR3: 00000003101bd000 CR4: 00000000000006e0
Process events/4 (pid: 54, threadinfo ffff81031f9fc000, task ffff81019f7dc7a0)
Stack: ffff81019db49140 ffffffff800d732e ffff81031f2a8440 000000011f2a8440
 ffff81031fe1f618 ffff81031fe1f618 0000000000000001 ffff81031fe1f600
 0000000000000001 ffff8101a568cdc0 ffff81019e454540 ffffffff800d7447

Call Trace:
 [<ffffffff800d732e>] free_block+0xb5/0x143
 [<ffffffff800d7447>] drain_array+0x8b/0xc0
 [<ffffffff800d7e84>] cache_reap+0x0/0x217
 [<ffffffff800d7f29>] cache_reap+0xa5/0x217
 [<ffffffff8004d1c5>] run_workqueue+0x94/0xe4
 [<ffffffff80049a3f>] worker_thread+0x0/0x122
 [<ffffffff80049b2f>] worker_thread+0xf0/0x122
 [<ffffffff8008a4ad>] default_wake_function+0x0/0xe
 [<ffffffff800323b8>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800322ba>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Version-Release number of selected component (if applicable):
RHEL5.3 2.6.18-128.2.1.el5

How reproducible:
Not known, in both cases the uptime was ~= 2 month

vmcores available for both systems, now trying to run memtest to be sure it's not RAM/hw

--- Additional comment from bfields@redhat.com on 2011-03-24 23:19:45 EDT ---

My reproducer is:

0. Patch server to BUG after it sees an auth_domain reference count exceed 1000.
1. Do NFSv3 mount on client
2. Run connectathon tests in a loop on client. (cthon locking tests would have been sufficient.)
3. Run "exportfs -f" in a loop on the server.

And, after digging through the code some more....

There's a spurious "rqstp->rq_client = NULL" in the lockd code, leftover from an old patch that moved the rq_client management out of the lockd code into common sunrpc code, but left behind this one piece. The result is that the rq_client isn't put when it should be, leading to this refcount imbalance.
After 4 billion lock requests or so the refcount overflows to 0, then to 1, and then the next auth_domain_put() incorrectly frees the auth_domain while it is still in use.

Confirmed that I no longer hit the bug in my reproducer above after removing this line.

diff --git a/fs/nfsd/lockd.c b/fs/nfsd/lockd.c
index 0c6d816..7c831a2 100644
--- a/fs/nfsd/lockd.c
+++ b/fs/nfsd/lockd.c
@@ -38,7 +38,6 @@ nlm_fopen(struct svc_rqst *rqstp, struct nfs_fh *f, struct file **filp)
 	exp_readlock();
 	nfserr = nfsd_open(rqstp, &fh, S_IFREG, NFSD_MAY_LOCK, filp);
 	fh_put(&fh);
-	rqstp->rq_client = NULL;
 	exp_readunlock();
 	/* We return nlm error codes as nlm doesn't know
 	 * about nfsd, but nfsd does know about nlm..
with this filelineno: bug 589512
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Patch(es) available on kernel-2.6.32-130.el6
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0542.html