Bug 489931

Summary: NFS umount deadlock in rpciod with rpc_shutdown_client()
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.4
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Target Milestone: rc
Keywords: OtherQA
Reporter: Jeff Layton <jlayton>
Assignee: Ian Kent <ikent>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: cward, d.bein, djeffery, ikent, james.brown, jlayton, jtluka, qcai, rwheeler, steved, tao
Doc Type: Bug Fix
Clone Of: 487699
Last Closed: 2010-03-30 07:32:51 UTC
Bug Blocks: 499522, 526950, 533192
Attachments:
Description                           Flags
patchset #1 (bad, leads to oops)      none
Upstream patch #1                     none
Backport fix for upstream patch #1    none
Upstream patch #2                     none
Upstream patch #3                     none
Upstream patch #4                     none

Comment 1 Ian Kent 2009-03-17 11:55:30 UTC
The mail thread below appears to cover the important issues
related to this problem (copied from bug 487699#45). Attempting
to backport all the relevant patches to RHEL-4 seems a
little too risky, as some of the dependent infrastructure is
quite different.

http://marc.info/?l=linux-nfs&m=120000214806703&w=2

Comment 2 Jeff Layton 2009-05-13 14:42:50 UTC
Created attachment 343779 [details]
patchset #1 (bad, leads to oops)

This is my first stab at a patchset for this. When I run the cthon04 tests on a kernel with this set applied, it oopses fairly quickly:

general protection fault: 0000 [1] SMP 
last sysfs file: /block/dm-0/range
CPU 0 
Modules linked in: nfs(FU) lockd fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth rpcsec_gss_krb5(FU) auth_rpcgss(FU) testmgr_cipher testmgr aead crypto_blkcipher crypto_algapi des sunrpc(FU) ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy xen_vnif xen_balloon i2c_piix4 xen_vbd i2c_core xen_platform_pci serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 2771, comm: nfsiod Tainted: GF     2.6.18-144.el5debug #1
RIP: 0010:[<ffffffff88501261>]  [<ffffffff88501261>] :nfs:nfs_inode_remove_request+0x15/0x9f
RSP: 0018:ffff81002938bdc0  EFLAGS: 00010286
RAX: 6b6b6b6b6b6b6b6b RBX: ffff81002961b658 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff810029b75e74 RDI: ffff81002961b658
RBP: ffff810029b75c20 R08: ffff81003ffee1c0 R09: ffff810000012c00
R10: ffff81002961b6d0 R11: 0000000000000060 R12: ffff81002961b658
R13: 0000000000000282 R14: ffff810029b75c28 R15: ffffffff883a5fb6
FS:  0000000000000000(0000) GS:ffffffff80424000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000003bab48b610 CR3: 000000003a3d0000 CR4: 00000000000006e0
Process nfsiod (pid: 2771, threadinfo ffff81002938a000, task ffff81002a138800)
Stack:  0000000000000297 ffff81002961b658 ffff810029b75c20 ffff810029b75c28
 0000000000000282 ffffffff88503613 ffff810029b75c28 ffff810029b75cf8
 0000000000000000 ffffffff883a5bb4 ffff810029b75c28 ffffffff883a5df8
Call Trace:
 [<ffffffff88503613>] :nfs:nfs_commit_done+0x138/0x195
 [<ffffffff883a5bb4>] :sunrpc:rpc_exit_task+0x25/0x6e
 [<ffffffff883a5df8>] :sunrpc:__rpc_execute+0x92/0x250
 [<ffffffff800509c6>] run_workqueue+0x9a/0xf4
 [<ffffffff8004d1c5>] worker_thread+0x0/0x122
 [<ffffffff800a3a80>] keventd_create_kthread+0x0/0xc9
 [<ffffffff8004d2b5>] worker_thread+0xf0/0x122
 [<ffffffff8008ffe6>] default_wake_function+0x0/0xe
 [<ffffffff800a3a80>] keventd_create_kthread+0x0/0xc9
 [<ffffffff800353b2>] kthread+0xfe/0x132

...still looking at the cause, but it seems like this set is uncovering a race of some sort.
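
One detail in the register dump above may be worth noting: RAX holds 6b6b6b6b6b6b6b6b. 0x6b is the POISON_FREE byte defined in the kernel's include/linux/poison.h, which is written over freed slab objects when slab debugging is enabled. A pointer made entirely of that byte suggests, but does not prove, a use-after-free, which would be consistent with a race. Whether this debug kernel was built with slab poisoning enabled is an assumption here. The Python sketch below uses a hypothetical helper, poisoned_registers, to scan oops text for registers holding that pattern; it is an illustration, not a tool from this bug.

```python
import re

# POISON_FREE from include/linux/poison.h: the byte written over freed slab
# objects when slab debugging is enabled (assumed to be the case here).
POISON_FREE = 0x6B


def poisoned_registers(oops_text):
    """Return names of registers whose 64-bit value is all POISON_FREE bytes.

    Hypothetical helper for illustration only. It matches "Rxx: <16 hex digits>"
    pairs, so segment-prefixed entries such as RIP and RSP are skipped.
    """
    hits = []
    pattern = r"\b(R[A-Z0-9]{2}): ([0-9a-f]{16})\b"
    for name, value in re.findall(pattern, oops_text):
        if bytes.fromhex(value) == bytes([POISON_FREE]) * 8:
            hits.append(name)
    return hits


# Two register lines copied from the oops in this comment.
sample = """RAX: 6b6b6b6b6b6b6b6b RBX: ffff81002961b658 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff810029b75e74 RDI: ffff81002961b658"""

print(poisoned_registers(sample))  # prints ['RAX']
```

A hit only shows that a value read from freed memory ended up in a register; it does not say which object was freed or by whom.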

Comment 10 Ian Kent 2009-11-05 03:18:05 UTC
Created attachment 367564 [details]
Upstream patch #1

Comment 11 Ian Kent 2009-11-05 03:18:54 UTC
Created attachment 367565 [details]
Backport fix for upstream patch #1

Comment 12 Ian Kent 2009-11-05 03:19:34 UTC
Created attachment 367566 [details]
Upstream patch #2

Comment 13 Ian Kent 2009-11-05 03:20:19 UTC
Created attachment 367567 [details]
Upstream patch #3

Comment 14 Ian Kent 2009-11-05 03:20:58 UTC
Created attachment 367568 [details]
Upstream patch #4

Comment 15 Ian Kent 2009-11-05 03:27:46 UTC
The above upstream patch series (originally from the post in
comment #2) allegedly resolves the deadlock issue reported
here. An additional backport patch is included to account for
differences between this kernel and the kernel the series was
originally written against. I've run the NFS connectathon
tests against it to check basic functionality, but I can't
really test the actual deadlock problem.

Comment 16 RHEL Program Management 2009-11-05 03:31:14 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 17 Ian Kent 2009-11-05 03:45:20 UTC
A kernel build with the above patches can be found at:
http://people.redhat.com/~ikent/kernel-2.6.18-172.el5.bz489931.1

Could we test this out, please?

If the kernel packages needed aren't present please let me know
and I'll upload the needed packages.

Comment 18 Jan Tluka 2009-11-18 15:37:40 UTC
Do we have a reproducer available? Could the reporter provide reproduction steps so that QE can verify this?

Comment 19 Jeff Layton 2009-11-18 17:12:33 UTC
I don't believe we have a reproducer. I only cloned this from the RHEL4 bz based on an analysis of the RHEL4 core that indicated that this problem was also present in RHEL5.

Comment 21 Chris Ward 2009-11-19 15:05:47 UTC
@Jeff, @GSS

We need to confirm that there is third-party commitment to 
test for the resolution of this request during the RHEL 5.5 
Beta Test Phase before we can approve it for acceptance 
into the release.

RHEL 5.5 Beta Test Phase is expected to begin around February
2010.

In order to avoid any unnecessary delays, please post a 
confirmation as soon as possible, including the contact 
information for testing engineers.

Any additional information about alternative testing variations we 
could use to reproduce this issue in-house would be appreciated.

Comment 29 Don Zickus 2009-12-09 18:11:21 UTC
The patches are in kernel-2.6.18-178.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.

Comment 39 Chris Ward 2010-02-11 10:26:08 UTC
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.

Comment 43 errata-xmlrpc 2010-03-30 07:32:51 UTC
An advisory has been issued which should address the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html

Comment 45 David Bein 2010-05-03 03:04:25 UTC
Any chance that some future release of RHEL5.4 will include the patches for
this from RHEL5.5?

Comment 46 Jeff Layton 2013-07-02 14:23:01 UTC
*** Bug 488063 has been marked as a duplicate of this bug. ***