Bug 697448

Summary: slab corruption after seeing some nfs-related BUG: warning [rhel-5.6.z]
Product: Red Hat Enterprise Linux 5 Reporter: RHEL Program Management <pm-rhel>
Component: kernel Assignee: Phillip Lougher <plougher>
Status: CLOSED ERRATA QA Contact: Jian Li <jiali>
Severity: high Docs Contact:
Priority: high    
Version: 5.3 CC: anton, bfields, dhoward, ekuric, james.brown, jiali, jlayton, jthomas, kchoi, lwoodman, nmurray, pm-eus, qcai, rmitchel, rwheeler, sprabhu, steved, tao, tumeya, vfalico, vgaikwad, yanwang
Target Milestone: rc Keywords: OtherQA, Reopened, ZStream
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: kernel-2.6.18-238.12.1.el5 Doc Type: Bug Fix
Doc Text:
An NFS server uses reference-counted structures, called auth_domains, to identify which group of clients (for example, 192.168.0.0/24 or *.foo.edu) the client who sent an RPC request belongs to. The server NLM code incorrectly took an extra reference of the auth_domain associated with each NLM RPC request, and never dropped that reference. The reference count is an unsigned 32-bit value, so after 2^32 (about 4 billion) lock operations from the same client or group of clients, the reference count would overflow to 0, and the kernel would incorrectly think that the auth_domain should be freed. As a result, the kernel would panic. This update removes the extra reference-count increment from the server NLM code, and the kernel no longer panics.
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-05-31 14:11:07 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 589512    
Bug Blocks:    

Description RHEL Program Management 2011-04-18 10:33:25 UTC
This bug has been copied from bug #589512 and has been proposed
to be backported to 5.6 z-stream (EUS).

Comment 5 Phillip Lougher 2011-05-09 15:16:15 UTC
in kernel-2.6.18-238.12.1.el5

linux-2.6-fs-nfsd-fix-auth_domain-reference-leak-on-nlm-operations.patch

Comment 7 Jian Li 2011-05-19 10:28:26 UTC
The bug is reproduced in 2.6.18-238.el5 and verified in 2.6.18-238.12.1.el5 (RHEL6). 

This test uses one NFS client and one NFS server.
On the NFS client, the test command is:
[root@ibm-ls22-01 ~]# for i in {1..100}; do mount intel-s3e36-01.rhts.eng.rdu.redhat.com:/mnt/test /mnt/test; flock /mnt/test/lockfile -c "sleep 1" ; umount /mnt/test ; done

On the NFS server, the test command is:
stap -e 'probe module("sunrpc").function("auth_domain_lookup").return { printf("%s %d\n",kernel_string($return->name), $return->ref->refcount->counter);}'

The output is as follows:
====reproducer
[root@intel-s3e36-01 ~]# uname -a
Linux intel-s3e36-01.rhts.eng.rdu.redhat.com 2.6.18-238.el5 #1 SMP Sun Dec 19 14:22:44 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
[root@intel-s3e36-01 ~]# stap -e 'probe module("sunrpc").function("auth_domain_lookup").return { printf("%s %d\n",kernel_string($return->name), $return->ref->refcount->counter);}'
* 4
* 4
* 4
* 5
* 5
* 5
* 6
* 6
* 6
* 7
* 7
* 7
* 8
* 7
* 8
* 8
* 8
====verify
[root@intel-s3e36-01 ~]# uname -a
Linux intel-s3e36-01.rhts.eng.rdu.redhat.com 2.6.18-238.12.1.el5 #1 SMP Sat May 7 20:18:50 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
[root@intel-s3e36-01 ~]# stap -e 'probe module("sunrpc").function("auth_domain_lookup").return { printf("%s %d\n",kernel_string($return->name), $return->ref->refcount->counter);}'
* 4
* 4
* 4
* 4
* 4
* 4
* 4
* 4
* 4
* 4
* 4
* 4
* 4
* 4
* 4
* 4
* 4
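
The two stap outputs above can be compared mechanically. The following is a minimal sketch, not part of the original test: it assumes each stap line has the form "<domain name> <counter>" as printed by the probe above, and the function name refcount_drift is hypothetical. It reports, per domain, how far the counter moved between the first and last sample.

```python
def refcount_drift(stap_lines):
    """Return {domain_name: last_counter - first_counter} for stap output lines.

    Assumes each line looks like "<name> <counter>", which is what the
    printf in the probe above emits. Lines in any other shape are skipped.
    """
    first, last = {}, {}
    for line in stap_lines:
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # not a "<name> <counter>" sample
        name, count = parts[0], int(parts[1])
        first.setdefault(name, count)  # remember the first sample only
        last[name] = count             # keep overwriting with the latest
    return {name: last[name] - first[name] for name in first}


# Counter values copied from the two runs in this comment.
reproducer = ["* %d" % c for c in
              [4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 7, 8, 8, 8]]
verified = ["* 4"] * 17

print(refcount_drift(reproducer))  # {'*': 4}
print(refcount_drift(verified))    # {'*': 0}
```

A non-zero drift across repeated mount/lock/umount cycles is what suggests a leaked reference; a drift of zero matches the fixed kernel's output. The single dip from 8 to 7 in the reproducer run is why the sketch compares first and last samples rather than requiring a strictly increasing sequence.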

Comment 8 errata-xmlrpc 2011-05-31 14:11:07 UTC
An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0833.html

Comment 9 Martin Prpič 2011-06-02 13:34:34 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
An NFS server uses reference-counted structures, called auth_domains, to identify which group of clients (for example, 192.168.0.0/24 or *.foo.edu) the client who sent an RPC request belongs to. The server NLM code incorrectly took an extra reference of the auth_domain associated with each NLM RPC request, and never dropped that reference. The reference count is an unsigned 32-bit value, so after 232 (about 4 billion) lock operations from the same client or group of clients, the reference count would overflow to 0, and the kernel would incorrectly think that the auth_domain should be freed. As a result, the kernel would panic. This update removes the extra reference-count increment from the server NLM code, and the kernel no longer panics.

Comment 10 J. Bruce Fields 2011-06-02 15:06:29 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-An NFS server uses reference-counted structures, called auth_domains, to identify which group of clients (for example, 192.168.0.0/24 or *.foo.edu) the client who sent an RPC request belongs to. The server NLM code incorrectly took an extra reference of the auth_domain associated with each NLM RPC request, and never dropped that reference. The reference count is an unsigned 32-bit value, so after 232 (about 4 billion) lock operations from the same client or group of clients, the reference count would overflow to 0, and the kernel would incorrectly think that the auth_domain should be freed. As a result, the kernel would panic. This update removes the extra reference-count increment from the server NLM code, and the kernel no longer panics.
+An NFS server uses reference-counted structures, called auth_domains, to identify which group of clients (for example, 192.168.0.0/24 or *.foo.edu) the client who sent an RPC request belongs to. The server NLM code incorrectly took an extra reference of the auth_domain associated with each NLM RPC request, and never dropped that reference. The reference count is an unsigned 32-bit value, so after 2^32 (about 4 billion) lock operations from the same client or group of clients, the reference count would overflow to 0, and the kernel would incorrectly think that the auth_domain should be freed. As a result, the kernel would panic. This update removes the extra reference-count increment from the server NLM code, and the kernel no longer panics.
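
The wraparound described in the technical note can be illustrated with a toy model. This is a sketch under stated assumptions, not the kernel code: the class RefCounted and the functions nlm_request and requests_until_freed are hypothetical names, and the 32-bit counter is emulated with a mask. The starting count is placed just below 2^32 so the wrap happens after a couple of requests instead of about 4 billion.

```python
UINT32_MASK = 0xFFFFFFFF  # the note says the count is an unsigned 32-bit value


class RefCounted:
    """Toy stand-in for a reference-counted object (hypothetical, not kernel code)."""

    def __init__(self, count):
        self.count = count & UINT32_MASK
        self.freed = False

    def get(self):
        # Take a reference; the increment wraps at 2^32.
        self.count = (self.count + 1) & UINT32_MASK

    def put(self):
        # Drop a reference; reaching zero is treated as "last user gone".
        self.count = (self.count - 1) & UINT32_MASK
        if self.count == 0:
            self.freed = True


def nlm_request(domain, leak):
    """Model one lock request: one balanced get/put, plus the leaked get if leak=True."""
    domain.get()
    if leak:
        domain.get()  # the extra reference that is never dropped
    domain.put()


def requests_until_freed(start, leak, limit=1000):
    """Return how many requests it takes before the object is marked freed, or None."""
    domain = RefCounted(start)
    for n in range(1, limit + 1):
        nlm_request(domain, leak)
        if domain.freed:
            return n
    return None


print(requests_until_freed(0xFFFFFFFE, leak=True))   # 2
print(requests_until_freed(4, leak=False))           # None
```

With the leak, every request leaves the count one higher, so the count eventually wraps past zero and a later put lands on zero while legitimate holders still exist. Without the leak, the count returns to its starting value after each request and never reaches zero.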