Bug 589512

Summary: slab corruption after seeing some nfs-related BUG: warning
Product: Red Hat Enterprise Linux 5 Reporter: Veaceslav Falico <vfalico>
Component: kernelAssignee: J. Bruce Fields <bfields>
Status: CLOSED ERRATA QA Contact: Jian Li <jiali>
Severity: high Docs Contact:
Priority: high    
Version: 5.3CC: anton, bfields, cward, dhoward, ekuric, james.brown, jiali, jlayton, jthomas, kchoi, lwoodman, nmurray, peterm, qcai, rmitchel, rwheeler, sprabhu, steved, syeghiay, tao, tumeya, vgaikwad, yanwang
Target Milestone: rcKeywords: OtherQA, Reopened, ZStream
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: kernel-2.6.18-255.el5 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 690900 (view as bug list) Environment:
Last Closed: 2011-07-21 10:03:37 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 690900, 697448    
Attachments:
Description Flags
backport of possible fixes from upstream
none
possible bugfixes + some debugging code
none
Simple program to lock files in a loop. none

Description Veaceslav Falico 2010-05-06 10:59:25 UTC
Description of problem:
Two different boxes encountered a slab corruption after seeing such messages:
BUG: warning at lib/kref.c:32/kref_get() (Not tainted)

Call Trace:
 [<ffffffff800368c3>] kref_get+0x38/0x3d
 [<ffffffff884cb59e>] :sunrpc:svcauth_unix_set_client+0x87/0xc5
 [<ffffffff884c82e5>] :sunrpc:svc_process+0x2b0/0x71b
 [<ffffffff8008a4ad>] default_wake_function+0x0/0xe
 [<ffffffff884c98f1>] :sunrpc:svc_send+0xda/0x10d
 [<ffffffff8851408a>] :lockd:lockd+0x0/0x272
 [<ffffffff88514211>] :lockd:lockd+0x187/0x272
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8851408a>] :lockd:lockd+0x0/0x272
 [<ffffffff8851408a>] :lockd:lockd+0x0/0x272
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


In both cases, we've had the following backtraces:
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lib/list_debug.c:70
invalid opcode: 0000 [1] SMP 
last sysfs file: /class/fc_remote_ports/rport-1:0-1/roles
CPU 4 
Modules linked in: mptctl mptbase ipmi_devintf ipmi_si ipmi_msghandler nfsd exportfs lockd nfs_acl auth_rpcgss sunrpc autofs4 bonding dm_round_robin dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport st shpchp sg tg3 serio_raw pcspkr hpilo libphy bnx2x dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage qla2xxx scsi_transport_fc cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 54, comm: events/4 Not tainted 2.6.18-128.2.1.el5 #1
RIP: 0010:[<ffffffff8014c3db>]  [<ffffffff8014c3db>] list_del+0x48/0x71
RSP: 0018:ffff81031f9fdd70  EFLAGS: 00010086
RAX: 0000000000000058 RBX: ffff81019db49140 RCX: ffffffff802f7aa8
RDX: ffffffff802f7aa8 RSI: 0000000000000000 RDI: ffffffff802f7aa0
RBP: ffff8101a568cdc0 R08: ffffffff802f7aa8 R09: 000000000000003f
R10: ffff81031f9fda10 R11: 0000000000000280 R12: ffff81019e454540
R13: ffff8102f3ebec00 R14: 0000000000000000 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff81031ffe1240(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000129c9490 CR3: 00000003101bd000 CR4: 00000000000006e0
Process events/4 (pid: 54, threadinfo ffff81031f9fc000, task ffff81019f7dc7a0)
Stack:  ffff81019db49140 ffffffff800d732e ffff81031f2a8440 000000011f2a8440
 ffff81031fe1f618 ffff81031fe1f618 0000000000000001 ffff81031fe1f600
 0000000000000001 ffff8101a568cdc0 ffff81019e454540 ffffffff800d7447
Call Trace:
 [<ffffffff800d732e>] free_block+0xb5/0x143
 [<ffffffff800d7447>] drain_array+0x8b/0xc0
 [<ffffffff800d7e84>] cache_reap+0x0/0x217
 [<ffffffff800d7f29>] cache_reap+0xa5/0x217
 [<ffffffff8004d1c5>] run_workqueue+0x94/0xe4
 [<ffffffff80049a3f>] worker_thread+0x0/0x122
 [<ffffffff80049b2f>] worker_thread+0xf0/0x122
 [<ffffffff8008a4ad>] default_wake_function+0x0/0xe
 [<ffffffff800323b8>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800322ba>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11



Version-Release number of selected component (if applicable):
RHEL5.3 2.6.18-128.2.1.el5

How reproducible:
Not known, in both cases the uptime was ~= 2 month
vmcores available for both systems, now trying to run memtest to be sure it's not RAM/hw

Comment 3 Veaceslav Falico 2010-05-13 09:40:31 UTC
The memtest ran for almost a day without any errors. So this one seems to be hardware-unrelated.

Comment 4 Larry Woodman 2010-05-13 14:28:30 UTC
Its clear why the system panic'd, svcauth_unix_set_client() is taking a referenced to a freed rq_client structure which is in the list of free 64-byte slab objects.  Unfortunately I dont know enough about the sun rpc code to debug this.

Larry Woodman

Comment 9 Veaceslav Falico 2010-05-25 12:43:40 UTC
Hello,

Is there anything I can do to speed up the issue? Any help/additional analysis?

Thank you!

Comment 28 RHEL Program Management 2010-12-07 10:21:34 UTC
This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.6 and Red Hat does not plan to fix this issue the currently developed update.

Contact your manager or support representative in case you need to escalate this bug.

Comment 30 RHEL Program Management 2010-12-07 11:15:06 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.

Comment 32 Jeff Layton 2010-12-22 20:52:37 UTC
Reopening bug since jthomas has some new info. Here's a possible scenario, and I don't think this is taken care of, even upstream...

auth_domain_put uses atomic_dec_and_lock to take the auth_domain_lock when the count reaches zero. Suppose though, that we're racing with another thread that's in auth_domain_lookup at the time...

t1) auth_domain_lookup takes the auth_domain_lock

t2) auth_domain_put decrements the refcount to zero, and ends up spinning on the lock

t1) auth_domain_lookup finds an auth_domain and does a kref_get. That increments the refcount from 0 -> 1 and triggers the kref warning. The spinlock is then released.

t2) acquires the lock and then does the hlist_del and domain_release to tear down the auth_domain

Now, that said, I still don't see how we'd end up with the kref_get warning in svcauth_unix_set_client. That should be taking doing a kref_get against a auth_domain to which we already have an entry, since we hold a reference to the ip_map.

Still, there do seem to be some holes in the refcounting here. Some of that may be mitigated by getting them on a kernel with the fixes for bug 548846. That has a bug where the auth_domain_lock will never be released, which could possibly widen the window for the above race (or something similar).

Comment 33 Jeff Layton 2010-12-22 21:42:09 UTC
Nope, my theory in comment #32 is wrong. atomic_dec_and_lock is truly atomic, even thought he comments on it don't really indicate that. The above race shouldn't be possible.

So, it seems like the ip_map that was found in svcauth_unix_set_client was likely already bad somehow. Either it had a pointer to a bogus auth_domain, or was itself already destroyed.

Comment 34 Jeff Layton 2010-12-23 14:54:21 UTC
Ok, I've looked over the code and the kref warning and I suspect that this will probably not be fixed by the patch for 548846 after all. I could be wrong, but it looks like we have a use-after-free.

The most likely scenario is a refcount imbalance -- too many auth_domain "puts" and not enough "gets". That patch fixes a refcount imbalance, but it fixes it in the other direction -- i.e. it removes an extraneous "get". It also fixes a spinlock imbalance, but that's unlikely to cause refcounting problems, but rather deadlocks.

After looking at this with Bruce yesterday, he said this:

16:59 < bfields> jlayton, I don't understand the auth_domain_put() in 
                 auth_unix_lookup().
17:03 < bfields> I think there's no way that auth_domain_put can be right.

...so that might be one avenue to investigate

Comment 35 Jeff Layton 2010-12-23 15:23:27 UTC
I'm not sure that looks suspect to me...

        if ((ipm->m_client->addr_changes - ipm->m_add_change) >0) {
                if (test_and_set_bit(CACHE_NEGATIVE, &ipm->h.flags) == 0)
                        auth_domain_put(&ipm->m_client->h);
                rv = NULL;
        } else {
                rv = &ipm->m_client->h;
                kref_get(&rv->ref);
        }


...so basically if m_client->addr_changes is larger than ipm->m_add_change, then I guess that signifies that the m_client has had auth_unix_forget_old called on it and existing ip_map entries that point to it shouldn't be trusted anymore.

We then do a test_and_set_bit for CACHE_NEGATIVE. If that returns true then that means that the CACHE_NEGATIVE bit was already set. If it's already set, then the auth_domain will likely have already been put. So that looks right to me.

One thing that might be interesting here though is to also zero out or poison the m_client pointer after we do an auth_domain_put on it. The pointer is invalid at that point, so that might make us trigger an oops sooner instead of corrupting memory.

Comment 36 Jeff Layton 2010-12-23 16:18:39 UTC
OTOH...we are calling auth_domain_put on the m_client without any regard for other users of it. We hold a reference to the ip_map at that point, but it seems like other tasks could too. Maybe that's the race that Bruce means here. If that's case, perhaps we need to use a different mechanism to do this.

There's no way to know for sure whether this is the ultimate cause of their problem, but it seems like something that needs fixing.

Comment 37 Jeff Layton 2010-12-23 16:40:44 UTC
Reassigning to Bruce for now. He has a better understanding of this code and is in a better position to fix it upstream...

Comment 38 J. Bruce Fields 2010-12-23 19:10:50 UTC
The auth_domain_put() in auth_unix_lookup() is a bug, and I'll make a patch for it, but: normally both the addr_changes and m_add_change fields are zero.  The only code that increments either is called from the "legacy" nfsserverctl interfaces, which shouldn't be getting used in this case unless there's been a very old nfs-utils (<1.1.0?) installed, or the "nfsd" filesystem fails to mount for some reason.  Neither is likely.

(One exception: ip_map_parse will increment m_add_change if an expiry time of 0x7FFFFFFF is passed in; but no nfs-utils has ever done that, that I can see.)

So I'm still looking for an explanation.

Comment 39 Jeff Layton 2010-12-23 19:18:19 UTC
...and even in that case, this:

 (ipm->m_client->addr_changes - ipm->m_add_change) > 0

...would be equivalent to:

 (0 - 1) > 0

...and that would return false. So I think you're right and this is not the likely cause.

Comment 41 J. Bruce Fields 2011-01-25 00:53:12 UTC
Created attachment 475093 [details]
backport of possible fixes from upstream

We're not working with very much information here yet, unfortunately.  But the attached patches:

1. Close a race which would only be seen when using the legacy nfsd interface, which rhel5 nfs-utils shouldn't be.  But perhaps a userspace bug or odd configuration is causing it to be used in this case.
2. Add warning printk's on use of the legacy interface, to help detect whether someone is still using it.
3. Close an smp race which could allow a cpu to see the flag indicating that a cache entry is valid while not yet seeing the updated contents of the cache entry.  Could explain the reported backtrace, though it seems likely to be a rare race.

The upstream versions of these patches have gotten some testing, but these backports are compile-tested only so far.

Comment 62 J. Bruce Fields 2011-03-09 16:54:36 UTC
Created attachment 483261 [details]
possible bugfixes + some debugging code

Also add a couple more assertions so we BUG() a little sooner in the case of an auth_domain refcount imbalance.

Comment 63 Jon Thomas 2011-03-23 12:47:18 UTC
Hi,

The kernel from comment 59 crashed and we have a vmcore. 

megatron.gsslab.rdu.redhat.com:/cores/20110323021323

It looks like the kernel needs to be rebuilt to get debuginfo.

Comment 71 J. Bruce Fields 2011-03-25 03:19:45 UTC
My reproducer is:

0. Patch server to BUG after it sees an auth_domain reference count exceed 1000.
1. Do NFSv3 mount on client
2. Run connectathon tests in a loop on client.  (cthon locking tests would have been sufficient.)
3. Run "exportfs -f" in a loop on the server.

And, after digging through the code some more.... There's a spurious "rqstp->rq_client = NULL" in the lockd code, leftover from an old patch that moved the rq_client management out of the lockd code into common sunrpc code, but left behind this one piece.  The result is that the rq_client isn't put when it should be, leading to this refcount imbalance.  After 4 billion lock requests or so the refcount overflows to 0, then to 1, and then the next auth_domain_put() incorrectly frees the auth_domain while it is still in use.

Confirmed that I no longer hit the bug in my reproducer above after removing this line.

diff --git a/fs/nfsd/lockd.c b/fs/nfsd/lockd.c
index 0c6d816..7c831a2 100644
--- a/fs/nfsd/lockd.c
+++ b/fs/nfsd/lockd.c
@@ -38,7 +38,6 @@ nlm_fopen(struct svc_rqst *rqstp, struct nfs_fh *f, struct fil
e **filp)
 	exp_readlock();
 	nfserr = nfsd_open(rqstp, &fh, S_IFREG, NFSD_MAY_LOCK, filp);
 	fh_put(&fh);
-	rqstp->rq_client = NULL;
 	exp_readunlock();
  	/* We return nlm error codes as nlm doesn't know
 	 * about nfsd, but nfsd does know about nlm..

Comment 73 RHEL Program Management 2011-03-25 04:15:34 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 77 Jarod Wilson 2011-04-01 22:03:01 UTC
Patch(es) available in kernel-2.6.18-254.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 80 Jarod Wilson 2011-04-04 21:57:45 UTC
Patch(es) available in kernel-2.6.18-255.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 81 Sachin Prabhu 2011-04-05 12:49:38 UTC
A quick test to check for increasing refcount on the auth_domain. 

On the NFS server, run the following stap script.
#stap -e 'probe module("sunrpc").function("auth_domain_lookup").return{printf("%s %d\n", kernel_string($return->name), $return->ref->refcount->counter);}'

This probes the auth_domain_lookup function. On return, it prints out the auth_domain clients and refcount.

On the NFS Client, you can run your usual workload to reproduce the problem. 

If you would like to generate an artificial workload to quickly hit the problem, you can hit it in the following manner using a simple program(locking in this case) to generate the load.
for i in {1..1000}; do mount vm130-21:/exports /mnt;cd /mnt; ./locking a xxx; cd /; umount /mnt; done
Where vm130-21:/exports was the NFS server in my test setup

With affected kernels, you will see the refcount continously increasing

# stap -e 'probe module("sunrpc").function("auth_domain_lookup").return{printf("%s %d\n", kernel_string($return->name), $return->ref->refcount->counter);}'
* 2606
* 2606
* 2606
* 2607
* 2607
* 2607
* 2608
* 2608
* 2608
* 2609
* 2609
* 2609
* 2610
* 2610
...

On a NFS server which contains the fixed kernel, you shouldn't see the reference count increasing.
ex:
# uname -a
Linux vm130-21.seg1.gsslab.fab.redhat.com 2.6.18-255.el5 #1 SMP Fri Apr 1 19:35:11 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
# stap -e 'probe module("sunrpc").function("auth_domain_lookup").return{printf("%s %d\n", kernel_string($return->name), $return->ref->refcount->counter);}'
* 4
* 4
* 4
* 4
* 4
* 4
* 4
* 4

Comment 82 Sachin Prabhu 2011-04-07 09:56:20 UTC
Created attachment 490504 [details]
Simple program to lock files in a loop.

Simple program to lock files in a loop. To be used to check reference count using stap script.

Comment 87 Jian Li 2011-06-23 10:10:44 UTC
This bug is reproduced on 2.6.18-253.el5, and verified on 2.6.18-268.el5.

Test steps:
[[server]] share one directory with nfs, check result with stap.

[[client]] mount server's directory, test with flock.
[root@intel-s3e3144-03 ~]# for i in {1..10000}; do mount hp-dl380g6-01.rhts.eng.bos.redhat.com:/test /mnt/test; flock /mnt/test/lockfile -c "ls " ; umount /mnt/test ; done

Test result:
========reproducer
[root@hp-dl380g6-01 ~]# uname -a
Linux hp-dl380g6-01.rhts.eng.bos.redhat.com 2.6.18-253.el5 #1 SMP Tue Mar 29 19:22:17 EDT 2011 i686 i686 i386 GNU/Linux
[root@hp-dl380g6-01 ~]# stap -e 'probe module("sunrpc").function("auth_domain_lookup").return { printf("%s %d\n",kernel_string($return->name), $return->ref->refcount->counter);}'
* 11
* 11
* 11
** ** snip **
* 1169
* 1169
* 1169
* 1170
* 1170
* 1170

========verify
[root@hp-dl380g6-01 ~]# uname -a
Linux hp-dl380g6-01.rhts.eng.bos.redhat.com 2.6.18-268.el5PAE #1 SMP Tue Jun 14 18:30:19 EDT 2011 i686 i686 i386 GNU/Linux
[root@hp-dl380g6-01 ~]# stap -e 'probe module("sunrpc").function("auth_domain_lookup").return { printf("%s %d\n",kernel_string($return->name), $return->ref->refcount->counter);}'
* 4
* 4
** ** snip **
* 4
* 4

Comment 88 errata-xmlrpc 2011-07-21 10:03:37 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html