Bug 854217

Summary:	Kernel panic on rmmod nfs with nfs 4.1
Product:	Red Hat Enterprise Linux 6	Reporter:	Markku Tavasti <mtavasti>
Component:	kernel	Assignee:	nfs-maint
Status:	CLOSED WONTFIX	QA Contact:	Red Hat Kernel QE team <kernel-qe>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	6.3	CC:	dhowells, mtavasti, rwheeler, steved, swhiteho
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:	fscache
Last Closed:	2017-12-06 12:55:56 UTC	Type:	Bug

Description Markku Tavasti 2012-09-04 12:07:02 UTC

Description of problem:

We are testing the suitability of NFS 4.1 for production. The NetApp we are using crashes if an NFS 4.1 volume under heavy I/O load is moved to another node.

In this particular test we used a RHEL 6.3 server as the client.
 * We mounted the volume like this: mount -t nfs4 -o minorversion=1 10.x.x.1:/vol_stresstest /mnt
 * We ran nfsstone and bonnie while moving the volume on the NetApp at the same time -> NetApp crashed
 * After this:
[root@sohjo ~]# lsmod
Module                  Size  Used by
nfs_layout_nfsv41_files    18540  0
nfs                   402784  1 nfs_layout_nfsv41_files
lockd                  74270  1 nfs
fscache                46859  1 nfs
nfs_acl                 2647  1 nfs
auth_rpcgss            44895  1 nfs
mptctl                 31976  1
mptbase                94037  1 mptctl
sunrpc                263516  13 nfs_layout_nfsv41_files,nfs,lockd,nfs_acl,auth_rpcgss
bonding               127806  0
8021q                  25941  1 bonding
garp                    7344  1 8021q
stp                     2173  1 garp
llc                     5642  2 garp,stp
ipv6                  322541  105 bonding
power_meter             9343  0
sg                     30124  0
be2net                 73845  0
microcode             112653  0
serio_raw               4818  0
iTCO_wdt               13934  0
iTCO_vendor_support     3088  1 iTCO_wdt
hpilo                   8237  12
hpwdt                   7094  0
i7core_edac            18184  0
edac_core              46773  1 i7core_edac
shpchp                 33482  0
ext4                  371395  6
mbcache                 8144  1 ext4
jbd2                   93312  1 ext4
dm_round_robin          2717  2
sd_mod                 39488  4
crc_t10dif              1541  1 sd_mod
qla2xxx               400792  14
scsi_transport_fc      55235  1 qla2xxx
scsi_tgt               12173  1 scsi_transport_fc
hpsa                   52456  0
dm_multipath           17649  2 dm_round_robin
dm_mirror              14101  0
dm_region_hash         12170  1 dm_mirror
dm_log                 10122  2 dm_mirror,dm_region_hash
dm_mod                 81692  31 dm_multipath,dm_mirror,dm_log
[root@sohjo ~]# rmmod nfs_layout_nfsv41_files
[root@sohjo ~]# lsmod
Module                  Size  Used by
nfs                   402784  0 
lockd                  74270  1 nfs
fscache                46859  1 nfs
nfs_acl                 2647  1 nfs
auth_rpcgss            44895  1 nfs
mptctl                 31976  1 
mptbase                94037  1 mptctl
sunrpc                263516  12 nfs,lockd,nfs_acl,auth_rpcgss
bonding               127806  0 
8021q                  25941  1 bonding
garp                    7344  1 8021q
stp                     2173  1 garp
llc                     5642  2 garp,stp
ipv6                  322541  105 bonding
power_meter             9343  0
sg                     30124  0
be2net                 73845  0
microcode             112653  0
serio_raw               4818  0
iTCO_wdt               13934  0
iTCO_vendor_support     3088  1 iTCO_wdt
hpilo                   8237  12
hpwdt                   7094  0
i7core_edac            18184  0
edac_core              46773  1 i7core_edac
shpchp                 33482  0
ext4                  371395  6
mbcache                 8144  1 ext4
jbd2                   93312  1 ext4
dm_round_robin          2717  2
sd_mod                 39488  4
crc_t10dif              1541  1 sd_mod
qla2xxx               400792  14
scsi_transport_fc      55235  1 qla2xxx
scsi_tgt               12173  1 scsi_transport_fc
hpsa                   52456  0
dm_multipath           17649  2 dm_round_robin
dm_mirror              14101  0
dm_region_hash         12170  1 dm_mirror
dm_log                 10122  2 dm_mirror,dm_region_hash
dm_mod                 81692  31 dm_multipath,dm_mirror,dm_log
[root@sohjo ~]# rmmod nfs
Timeout, server xxx not responding.
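
A minimal sketch, assuming only the three-column `lsmod` layout shown above, of how such a listing could be parsed to see which modules hold a reference on another. The function names `parse_lsmod` and `blockers` are hypothetical helpers, not part of any tool used in this report. Note that in the listing above `nfs` already shows a count of 0 before `rmmod nfs` is run, so this count evidently does not capture whatever state was still registered with FS-Cache.

```python
def parse_lsmod(text):
    """Parse `lsmod` output into {module: (size, refcount, [users])}."""
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything that is not
        # "<name> <size> <refcount> [users]" (for example shell prompts).
        if len(fields) < 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        users = fields[3].split(",") if len(fields) > 3 else []
        modules[fields[0]] = (int(fields[1]), int(fields[2]), users)
    return modules


def blockers(modules, name):
    """Return the modules listed as users of `name`, or None if unknown."""
    entry = modules.get(name)
    return None if entry is None else entry[2]


# Lines copied from the first listing in this report.
sample = """\
Module                  Size  Used by
nfs_layout_nfsv41_files    18540  0
nfs                   402784  1 nfs_layout_nfsv41_files
lockd                  74270  1 nfs
fscache                46859  1 nfs
mptctl                 31976  1
"""

mods = parse_lsmod(sample)
print(blockers(mods, "nfs"))      # ['nfs_layout_nfsv41_files']
print(blockers(mods, "fscache"))  # ['nfs']
```

This matches the order used above: `nfs_layout_nfsv41_files` had to be removed before `nfs` reported no users.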


From syslog:

kernel: FS-Cache: Cookie 'FSDEF.netfs' still has children
kernel: kernel BUG at fs/fscache/cookie.c:433!
kernel: RIP [<ffffffffa02dcd9a>] __fscache_relinquish_cookie+0x1da/0x260 [fscache]
kernel: Kernel panic - not syncing: Fatal exception

Version-Release number of selected component (if applicable):
2.6.32-279.5.1.el6.x86_64

How reproducible: seen once

Additional info: we have seen several other NFSv4-related problems in similar cases and will file them as separate bugs.

Comment 2 Markku Tavasti 2012-09-05 05:47:52 UTC

This bug is reproducible in our environment.

Needed tool: http://wiki.linux-nfs.org/wiki/index.php/NFSometer

Procedure to reproduce:

 * Clean the nfsometer cache
 * Run: './nfsometer.py 10.x.x.x:/volume bonnie++'
   * While nfsometer is running the NFS v4.1 test, interrupt it
 * Try to run nfsometer again; it complains that the NFS client is not idle
 * umount has no effect
 
From terminal:
----------------------------------------------------------
xxx 18:15 ~/nfsometer/nfsometer $ sudo umount /mnt
xxx 18:15 ~/nfsometer/nfsometer $ ./nfsometer.py -m v3 -n 2 10.x.x.x:/fail_nas2a bonnie++ 
Error: NFS client not idle (check /proc/fs/nfsfs/servers)
xxx 18:15 ~/nfsometer/nfsometer $ sudo umount /mnt
umount: /mnt: not mounted
xxx 18:15 ~/nfsometer/nfsometer $ cat /proc/fs/nfsfs/servers
NV SERVER   PORT USE HOSTNAME
v4 0axxxxxx  801   1 10.x.x.x
xxx 18:15 ~/nfsometer/nfsometer $ lsmod
Module                  Size  Used by
nfs_layout_nfsv41_files    18540  0 
nfs                   402784  1 nfs_layout_nfsv41_files
lockd                  74270  1 nfs
fscache                46859  1 nfs
nfs_acl                 2647  1 nfs
auth_rpcgss            44895  1 nfs

[root@xxx ~]# rmmod nfs_layout_nfsv41_files
[root@xxx ~]# lsmod
Module                  Size  Used by
nfs                   402784  0 
lockd                  74270  1 nfs
fscache                46859  1 nfs
nfs_acl                 2647  1 nfs
auth_rpcgss            44895  1 nfs
mptctl                 31976  1 
[root@xxx ~]# rmmod nfs
Timeout, server xxx not responding.
----------------------------------------------------------
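
A minimal sketch of a check that could be run before `rmmod nfs`, based on the `/proc/fs/nfsfs/servers` output shown above. It assumes the five-column layout from that output and assumes that the USE column is a use count where anything above zero means the client is not idle; both are assumptions drawn from this terminal session, not from kernel documentation. The function name `busy_nfs_servers` is hypothetical.

```python
def busy_nfs_servers(text):
    """Return (version, hostname, use) for every server entry with USE > 0."""
    busy = []
    for line in text.splitlines():
        fields = line.split()
        # Expect exactly: NV SERVER PORT USE HOSTNAME; skip the header row.
        if len(fields) != 5 or fields[0] == "NV":
            continue
        version, _server, _port, use, hostname = fields
        if use.isdigit() and int(use) > 0:
            busy.append((version, hostname, int(use)))
    return busy


# Contents of /proc/fs/nfsfs/servers as pasted in this comment.
sample = """\
NV SERVER   PORT USE HOSTNAME
v4 0axxxxxx  801   1 10.x.x.x
"""

print(busy_nfs_servers(sample))  # [('v4', '10.x.x.x', 1)]
```

On a live system the text would come from reading `/proc/fs/nfsfs/servers`; a non-empty result here corresponds to the state in which `rmmod nfs` triggered the panic.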

From syslog:
----------------------------------------------------------
kernel: Pid: 27556, comm: rmmod Not tainted 2.6.32-279.5.1.el6.x86_64 #1 HP ProLiant BL460c G7
kernel: RIP: 0010:[<ffffffffa02dcd9a>]  [<ffffffffa02dcd9a>] __fscache_relinquish_cookie+0x1da/0x260 [fscache]
kernel: RSP: 0018:ffff88013ec83e58  EFLAGS: 00010296
kernel: RAX: 0000000000000038 RBX: ffff88016d2b2540 RCX: 0000000000000fb0
kernel: RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000246
kernel: RBP: ffff88013ec83e98 R08: 0000000000000000 R09: ffffffff8163ab80
kernel: R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
kernel: R13: ffff88013ec83f18 R14: 0000000000000000 R15: 0000000000000001
kernel: FS:  00007f352bc4d700(0000) GS:ffff88002df00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
kernel: CR2: 0000003dabe77850 CR3: 000000012f188000 CR4: 00000000000006e0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
kernel: Process rmmod (pid: 27556, threadinfo ffff88013ec82000, task ffff880170578aa0)
kernel: Stack:
kernel: ffff8801833f7c40 0000000000000001 ffff88013ec83e88 ffffffffa0358ee0
kernel: <d> ffffffffa0358f20 ffff88013ec83f18 0000000000000000 0000000000000001
kernel: <d> ffff88013ec83eb8 ffffffffa02dd2a6 0000000000000880 0000000000000880
kernel: Call Trace:
kernel: [<ffffffffa02dd2a6>] __fscache_unregister_netfs+0x36/0x60 [fscache]
kernel: [<ffffffffa0347c45>] nfs_fscache_unregister+0x15/0x20 [nfs]
kernel: [<ffffffffa0348296>] exit_nfs_fs+0x2e/0x71 [nfs]
kernel: [<ffffffff810adf04>] sys_delete_module+0x194/0x260
kernel: [<ffffffff8150339e>] ? do_page_fault+0x3e/0xa0
kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
kernel: Code: 00 00 00 48 c7 c2 20 d2 2d a0 be 01 00 00 00 e8 5d 1d 22 e1 e9 84 fe ff ff 48 8b 73 18 48 c7 c7 28 13 2e a0 31 c0 e8 78 05 22 e1 <0f> 0b eb fe 48 c7 c7 bc 27 2e a0 31 c0 e8 66 05 22 e1 48 c7 c7 
kernel: RIP  [<ffffffffa02dcd9a>] __fscache_relinquish_cookie+0x1da/0x260 [fscache]
kernel: RSP <ffff88013ec83e58>
kernel: ---[ end trace 63babae348bd714a ]---
kernel: Kernel panic - not syncing: Fatal exception
kernel: Pid: 27556, comm: rmmod Tainted: G      D    ---------------    2.6.32-279.5.1.el6.x86_64 #1
kernel: Call Trace:
kernel: [<ffffffff814fd24a>] ? panic+0xa0/0x168
kernel: [<ffffffff815013e4>] ? oops_end+0xe4/0x100
kernel: [<ffffffff8100f26b>] ? die+0x5b/0x90
kernel: [<ffffffff81500cb4>] ? do_trap+0xc4/0x160
kernel: [<ffffffff8100ce35>] ? do_invalid_op+0x95/0xb0
kernel: [<ffffffffa02dcd9a>] ? __fscache_relinquish_cookie+0x1da/0x260 [fscache]
kernel: [<ffffffff8106c621>] ? vprintk+0x251/0x560
kernel: [<ffffffff81164810>] ? do_drain+0x0/0xa0
kernel: [<ffffffff8100bedb>] ? invalid_op+0x1b/0x20
kernel: [<ffffffffa02dcd9a>] ? __fscache_relinquish_cookie+0x1da/0x260 [fscache]
kernel: [<ffffffffa02dcd9a>] ? __fscache_relinquish_cookie+0x1da/0x260 [fscache]
kernel: [<ffffffffa02dd2a6>] ? __fscache_unregister_netfs+0x36/0x60 [fscache]
kernel: [<ffffffffa0347c45>] ? nfs_fscache_unregister+0x15/0x20 [nfs]
kernel: [<ffffffffa0348296>] ? exit_nfs_fs+0x2e/0x71 [nfs]
kernel: [<ffffffff810adf04>] ? sys_delete_module+0x194/0x260
kernel: [<ffffffff8150339e>] ? do_page_fault+0x3e/0xa0
kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
----------------------------------------------------------
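
A minimal sketch, assuming the `[<address>] function+0xoffset/0xsize [module]` frame format seen in the syslog excerpt above, of how the call trace could be reduced to the chain of functions leading to the BUG. Frames prefixed with `?` are ones the kernel's stack unwinder is not sure about, so the sketch separates them out. The names `FRAME` and `parse_frames` are hypothetical.

```python
import re

# One stack frame: "[<addr>] [? ]func+0xoff/0xsize [module]"
FRAME = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<guess>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(log):
    """Return (function, module or None, reliable) for each frame line."""
    frames = []
    for line in log.splitlines():
        m = FRAME.search(line)
        if m:
            frames.append(
                (m.group("func"), m.group("module"), m.group("guess") is None)
            )
    return frames


# First call trace from the syslog excerpt in this comment.
sample = """\
kernel: [<ffffffffa02dd2a6>] __fscache_unregister_netfs+0x36/0x60 [fscache]
kernel: [<ffffffffa0347c45>] nfs_fscache_unregister+0x15/0x20 [nfs]
kernel: [<ffffffffa0348296>] exit_nfs_fs+0x2e/0x71 [nfs]
kernel: [<ffffffff810adf04>] sys_delete_module+0x194/0x260
kernel: [<ffffffff8150339e>] ? do_page_fault+0x3e/0xa0
kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
"""

frames = parse_frames(sample)
reliable = [func for func, _module, ok in frames if ok]
print(reliable)
```

Read from the bottom up, the reliable frames show the path from the `delete_module` system call through the NFS module exit routine into FS-Cache netfs unregistration, which is where the cookie check fired.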

Comment 3 Ric Wheeler 2012-09-05 14:48:45 UTC

Hi Markku,

Can you please open a BZ with Red Hat support? They are the first line of information gathering for us and help debug issues.

They also work to prioritize which issues get looked at by development.

Thanks!

Comment 4 RHEL Program Management 2012-12-14 08:14:37 UTC

This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 5 Steve Dickson 2013-05-19 20:37:52 UTC

David,

any clue as to what's going on?

Comment 7 Steve Whitehouse 2015-12-11 13:29:50 UTC

Please confirm whether this bug is still an issue; otherwise we'll close it out, since it seems to be a very old bug and does not appear to have gone through our support team. Apologies if something has been missed on our side - please let us know if that is the case.

Comment 8 Jan Kurik 2017-12-06 12:55:56 UTC

Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available.

The official life cycle policy can be reviewed here:

http://redhat.com/rhel/lifecycle

This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL:

https://access.redhat.com/

Comment 9 Red Hat Bugzilla 2023-09-14 01:37:05 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days