Bug 589512
| Field | Value |
|---|---|
| Summary | slab corruption after seeing some nfs-related BUG: warning |
| Product | Red Hat Enterprise Linux 5 |
| Component | kernel |
| Version | 5.3 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Reporter | Veaceslav Falico <vfalico> |
| Assignee | J. Bruce Fields <bfields> |
| QA Contact | Jian Li <jiali> |
| CC | anton, bfields, cward, dhoward, ekuric, james.brown, jiali, jlayton, jthomas, kchoi, lwoodman, nmurray, peterm, qcai, rmitchel, rwheeler, sprabhu, steved, syeghiay, tao, tumeya, vgaikwad, yanwang |
| Target Milestone | rc |
| Keywords | OtherQA, Reopened, ZStream |
| Hardware | All |
| OS | Linux |
| Fixed In Version | kernel-2.6.18-255.el5 |
| Doc Type | Bug Fix |
| Clones | 690900 (view as bug list) |
| Bug Blocks | 690900, 697448 |
| Last Closed | 2011-07-21 10:03:37 UTC |
Description
Veaceslav Falico, 2010-05-06 10:59:25 UTC
The memtest ran for almost a day without any errors, so this one seems to be hardware-unrelated. It's clear why the system panicked: svcauth_unix_set_client() is taking a reference to a freed rq_client structure which is in the list of free 64-byte slab objects. Unfortunately I don't know enough about the sunrpc code to debug this. -- Larry Woodman

Hello, is there anything I can do to speed up the issue? Any help/additional analysis? Thank you!

This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.6, and Red Hat does not plan to fix this issue in the currently developed update. Contact your manager or support representative in case you need to escalate this bug. Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.

Reopening bug since jthomas has some new info.

Here's a possible scenario, and I don't think this is taken care of, even upstream. auth_domain_put uses atomic_dec_and_lock to take the auth_domain_lock when the count reaches zero. Suppose, though, that we're racing with another thread that's in auth_domain_lookup at the time:

t1) auth_domain_lookup takes the auth_domain_lock
t2) auth_domain_put decrements the refcount to zero, and ends up spinning on the lock
t1) auth_domain_lookup finds an auth_domain and does a kref_get. That increments the refcount from 0 -> 1 and triggers the kref warning. The spinlock is then released.
t2) acquires the lock and then does the hlist_del and domain_release to tear down the auth_domain

Now, that said, I still don't see how we'd end up with the kref_get warning in svcauth_unix_set_client. That should be doing a kref_get against an auth_domain to which we already have a reference, since we hold a reference to the ip_map. Still, there do seem to be some holes in the refcounting here. Some of that may be mitigated by getting them on a kernel with the fixes for bug 548846.
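For readers following the lookup/put pair under discussion, here is a minimal userspace sketch of the pattern, with a plain mutex standing in for auth_domain_lock and a plain counter for the kref. All names are illustrative stand-ins, not the actual kernel code. Note that this sketch takes the lock around the whole decrement, whereas the kernel's put path uses atomic_dec_and_lock() so it only takes the lock when the count actually hits zero; the hypothesized race above lives entirely in the window between "decrement" and "acquire lock", which is exactly what atomic_dec_and_lock() closes (as the next comment concludes).

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Illustrative stand-ins for the kernel structures. */
struct domain {
    int ref;                            /* stands in for the kref */
};

static pthread_mutex_t domain_lock = PTHREAD_MUTEX_INITIALIZER;
static struct domain *the_domain;       /* stands in for the hash list */

/* Sketch of the put side: drop a reference; if it hits zero, unhash
 * and free.  Here the decrement happens under the lock, so lookup can
 * never observe a zero refcount.  The kernel achieves the same
 * guarantee without locking on every put via atomic_dec_and_lock(). */
void domain_put(struct domain *d)
{
    pthread_mutex_lock(&domain_lock);
    if (--d->ref == 0) {
        if (the_domain == d)
            the_domain = NULL;          /* "hlist_del" */
        free(d);                        /* "domain_release" */
    }
    pthread_mutex_unlock(&domain_lock);
}

/* Sketch of the lookup side: find under the lock and take a
 * reference before releasing it ("kref_get" under the lock). */
struct domain *domain_lookup(void)
{
    struct domain *d;

    pthread_mutex_lock(&domain_lock);
    d = the_domain;
    if (d)
        d->ref++;
    pthread_mutex_unlock(&domain_lock);
    return d;
}
```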
That has a bug where the auth_domain_lock will never be released, which could possibly widen the window for the above race (or something similar).

Nope, my theory in comment #32 is wrong. atomic_dec_and_lock is truly atomic, even though the comments on it don't really indicate that. The above race shouldn't be possible. So it seems like the ip_map that was found in svcauth_unix_set_client was likely already bad somehow: either it had a pointer to a bogus auth_domain, or was itself already destroyed.

Ok, I've looked over the code and the kref warning, and I suspect that this will probably not be fixed by the patch for 548846 after all. I could be wrong, but it looks like we have a use-after-free. The most likely scenario is a refcount imbalance -- too many auth_domain "puts" and not enough "gets". That patch fixes a refcount imbalance, but it fixes it in the other direction -- i.e. it removes an extraneous "get". It also fixes a spinlock imbalance, but that's unlikely to cause refcounting problems, rather deadlocks.

After looking at this with Bruce yesterday, he said this:

16:59 < bfields> jlayton, I don't understand the auth_domain_put() in auth_unix_lookup().
17:03 < bfields> I think there's no way that auth_domain_put can be right.

...so that might be one avenue to investigate. I'm not sure that looks suspect to me, though:

```c
if ((ipm->m_client->addr_changes - ipm->m_add_change) > 0) {
	if (test_and_set_bit(CACHE_NEGATIVE, &ipm->h.flags) == 0)
		auth_domain_put(&ipm->m_client->h);
	rv = NULL;
} else {
	rv = &ipm->m_client->h;
	kref_get(&rv->ref);
}
```

...so basically, if m_client->addr_changes is larger than ipm->m_add_change, then I guess that signifies that the m_client has had auth_unix_forget_old called on it, and existing ip_map entries that point to it shouldn't be trusted anymore. We then do a test_and_set_bit for CACHE_NEGATIVE. If that returns true, then the CACHE_NEGATIVE bit was already set. If it's already set, then the auth_domain will likely have already been put. So that looks right to me.

One thing that might be interesting here, though, is to also zero out or poison the m_client pointer after we do an auth_domain_put on it. The pointer is invalid at that point, so that might make us trigger an oops sooner instead of corrupting memory.

OTOH... we are calling auth_domain_put on the m_client without any regard for other users of it. We hold a reference to the ip_map at that point, but it seems like other tasks could too. Maybe that's the race that Bruce means here. If that's the case, perhaps we need to use a different mechanism to do this. There's no way to know for sure whether this is the ultimate cause of their problem, but it seems like something that needs fixing. Reassigning to Bruce for now. He has a better understanding of this code and is in a better position to fix it upstream...

The auth_domain_put() in auth_unix_lookup() is a bug, and I'll make a patch for it, but: normally both the addr_changes and m_add_change fields are zero. The only code that increments either is called from the "legacy" nfsservctl interfaces, which shouldn't be getting used in this case unless there's been a very old nfs-utils (<1.1.0?) installed, or the "nfsd" filesystem fails to mount for some reason. Neither is likely. (One exception: ip_map_parse will increment m_add_change if an expiry time of 0x7FFFFFFF is passed in; but no nfs-utils has ever done that, that I can see.) So I'm still looking for an explanation.

...and even in that case, this:

    (ipm->m_client->addr_changes - ipm->m_add_change) > 0

...would be equivalent to:

    (0 - 1) > 0

...and that would return false. So I think you're right and this is not the likely cause.

Created attachment 475093 [details]
backport of possible fixes from upstream
We're not working with very much information here yet, unfortunately. But the attached patches:
1. Close a race which would only be seen when using the legacy nfsd interface, which rhel5 nfs-utils shouldn't be. But perhaps a userspace bug or odd configuration is causing it to be used in this case.
2. Add warning printk's on use of the legacy interface, to help detect whether someone is still using it.
3. Close an smp race which could allow a cpu to see the flag indicating that a cache entry is valid while not yet seeing the updated contents of the cache entry. Could explain the reported backtrace, though it seems likely to be a rare race.
The upstream versions of these patches have gotten some testing, but these backports are compile-tested only so far.
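Item 3 above is the classic "publish the valid flag before the data is visible" SMP ordering bug. The sketch below, in userspace C11 atomics rather than kernel barrier primitives, illustrates the pattern such a fix enforces; the structure and function names are hypothetical stand-ins, not the actual sunrpc cache code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a cache entry: some contents plus a
 * "valid" flag that publishes them to readers on other CPUs. */
struct cache_entry {
    int contents;
    atomic_bool valid;
};

/* Writer: fill in the entry, THEN publish.  The release store
 * guarantees that any reader who observes valid == true also observes
 * the contents written before it.  With a plain (relaxed) store, a
 * CPU could see the flag while still seeing stale contents, which is
 * the race described in item 3. */
void cache_fill(struct cache_entry *e, int value)
{
    e->contents = value;
    atomic_store_explicit(&e->valid, true, memory_order_release);
}

/* Reader: check the flag with an acquire load before trusting the
 * contents.  Returns -1 if the entry is not yet valid. */
int cache_read(struct cache_entry *e)
{
    if (!atomic_load_explicit(&e->valid, memory_order_acquire))
        return -1;
    return e->contents;
}
```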
Created attachment 483261 [details]
possible bugfixes + some debugging code
Also add a couple more assertions so we BUG() a little sooner in the case of an auth_domain refcount imbalance.
Hi, the kernel from comment 59 crashed and we have a vmcore: megatron.gsslab.rdu.redhat.com:/cores/20110323021323. It looks like the kernel needs to be rebuilt to get debuginfo.

My reproducer is:

0. Patch the server to BUG after it sees an auth_domain reference count exceed 1000.
1. Do an NFSv3 mount on the client.
2. Run connectathon tests in a loop on the client. (The cthon locking tests would have been sufficient.)
3. Run "exportfs -f" in a loop on the server.

And, after digging through the code some more... there's a spurious "rqstp->rq_client = NULL" in the lockd code, leftover from an old patch that moved the rq_client management out of the lockd code into common sunrpc code, but left behind this one piece. The result is that the rq_client isn't put when it should be, leading to this refcount imbalance. After 4 billion lock requests or so the refcount overflows to 0, then to 1, and then the next auth_domain_put() incorrectly frees the auth_domain while it is still in use.

Confirmed that I no longer hit the bug in my reproducer above after removing this line:

```diff
diff --git a/fs/nfsd/lockd.c b/fs/nfsd/lockd.c
index 0c6d816..7c831a2 100644
--- a/fs/nfsd/lockd.c
+++ b/fs/nfsd/lockd.c
@@ -38,7 +38,6 @@ nlm_fopen(struct svc_rqst *rqstp, struct nfs_fh *f, struct file **filp)
 	exp_readlock();
 	nfserr = nfsd_open(rqstp, &fh, S_IFREG, NFSD_MAY_LOCK, filp);
 	fh_put(&fh);
-	rqstp->rq_client = NULL;
 	exp_readunlock();
 	/* We return nlm error codes as nlm doesn't know
 	 * about nfsd, but nfsd does know about nlm..
```

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
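The arithmetic behind the "after 4 billion lock requests" failure mode is plain 32-bit wraparound, sketched below with an ordinary unsigned counter (illustrative only; the kernel's kref wraps an atomic counter, but the modular arithmetic is identical):

```c
#include <stdint.h>

/* One leaked get per lock request: a 32-bit refcount climbs until it
 * wraps mod 2^32.  Starting from 1 (the long-lived reference), 2^32
 * leaked increments bring it back to 1; one increment earlier it sits
 * at 0, and the next put would free the auth_domain while it is still
 * in use. */
uint32_t leak_refs(uint32_t ref, uint64_t leaked_gets)
{
    return (uint32_t)(ref + leaked_gets);   /* wraps mod 2^32 */
}
```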
Patch(es) available in kernel-2.6.18-254.el5. You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5. Detailed testing feedback is always welcomed.

Patch(es) available in kernel-2.6.18-255.el5. You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5. Detailed testing feedback is always welcomed.

A quick test to check for an increasing refcount on the auth_domain: on the NFS server, run the following stap script. It probes the auth_domain_lookup function and, on return, prints the auth_domain client name and refcount.

```
# stap -e 'probe module("sunrpc").function("auth_domain_lookup").return{printf("%s %d\n", kernel_string($return->name), $return->ref->refcount->counter);}'
```

On the NFS client, you can run your usual workload to reproduce the problem. If you would like to generate an artificial workload to quickly hit it, you can use a simple program (locking, in this case) to generate the load:

```
for i in {1..1000}; do mount vm130-21:/exports /mnt; cd /mnt; ./locking a xxx; cd /; umount /mnt; done
```

where vm130-21:/exports was the NFS server in my test setup.

With affected kernels, you will see the refcount continuously increasing:

```
# stap -e 'probe module("sunrpc").function("auth_domain_lookup").return{printf("%s %d\n", kernel_string($return->name), $return->ref->refcount->counter);}'
* 2606
* 2606
* 2606
* 2607
* 2607
* 2607
* 2608
* 2608
* 2608
* 2609
* 2609
* 2609
* 2610
* 2610
...
```

On an NFS server with the fixed kernel, you shouldn't see the reference count increasing. For example:

```
# uname -a
Linux vm130-21.seg1.gsslab.fab.redhat.com 2.6.18-255.el5 #1 SMP Fri Apr 1 19:35:11 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
# stap -e 'probe module("sunrpc").function("auth_domain_lookup").return{printf("%s %d\n", kernel_string($return->name), $return->ref->refcount->counter);}'
* 4
* 4
* 4
* 4
* 4
* 4
* 4
* 4
```

Created attachment 490504 [details]
Simple program to lock files in a loop. To be used to check the reference count using the stap script.
This bug is reproduced on 2.6.18-253.el5, and verified on 2.6.18-268.el5.

Test steps:

[[server]] Share one directory with NFS; check the result with stap.
[[client]] Mount the server's directory and test with flock:

```
[root@intel-s3e3144-03 ~]# for i in {1..10000}; do mount hp-dl380g6-01.rhts.eng.bos.redhat.com:/test /mnt/test; flock /mnt/test/lockfile -c "ls "; umount /mnt/test; done
```

Test result:

======== reproducer

```
[root@hp-dl380g6-01 ~]# uname -a
Linux hp-dl380g6-01.rhts.eng.bos.redhat.com 2.6.18-253.el5 #1 SMP Tue Mar 29 19:22:17 EDT 2011 i686 i686 i386 GNU/Linux
[root@hp-dl380g6-01 ~]# stap -e 'probe module("sunrpc").function("auth_domain_lookup").return { printf("%s %d\n",kernel_string($return->name), $return->ref->refcount->counter);}'
* 11
* 11
* 11
** snip **
* 1169
* 1169
* 1169
* 1170
* 1170
* 1170
```

======== verify

```
[root@hp-dl380g6-01 ~]# uname -a
Linux hp-dl380g6-01.rhts.eng.bos.redhat.com 2.6.18-268.el5PAE #1 SMP Tue Jun 14 18:30:19 EDT 2011 i686 i686 i386 GNU/Linux
[root@hp-dl380g6-01 ~]# stap -e 'probe module("sunrpc").function("auth_domain_lookup").return { printf("%s %d\n",kernel_string($return->name), $return->ref->refcount->counter);}'
* 4
* 4
** snip **
* 4
* 4
```

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html