The mail thread below appears to cover the important issues related to this problem (copied from bug 487699#45). An attempt at a backport of all the relevant patches for RHEL-4 seems a little too risky, as some of the dependent infrastructure is quite different. http://marc.info/?l=linux-nfs&m=120000214806703&w=2
Created attachment 343779 [details] patchset #1 (bad, leads to oops)
This is my first stab at a patchset for this. When I run cthon04 test on a kernel with this set, it oopses fairly quickly:

general protection fault: 0000 [1] SMP
last sysfs file: /block/dm-0/range
CPU 0
Modules linked in: nfs(FU) lockd fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth rpcsec_gss_krb5(FU) auth_rpcgss(FU) testmgr_cipher testmgr aead crypto_blkcipher crypto_algapi des sunrpc(FU) ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy xen_vnif xen_balloon i2c_piix4 xen_vbd i2c_core xen_platform_pci serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 2771, comm: nfsiod Tainted: GF 2.6.18-144.el5debug #1
RIP: 0010:[<ffffffff88501261>] [<ffffffff88501261>] :nfs:nfs_inode_remove_request+0x15/0x9f
RSP: 0018:ffff81002938bdc0 EFLAGS: 00010286
RAX: 6b6b6b6b6b6b6b6b RBX: ffff81002961b658 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff810029b75e74 RDI: ffff81002961b658
RBP: ffff810029b75c20 R08: ffff81003ffee1c0 R09: ffff810000012c00
R10: ffff81002961b6d0 R11: 0000000000000060 R12: ffff81002961b658
R13: 0000000000000282 R14: ffff810029b75c28 R15: ffffffff883a5fb6
FS: 0000000000000000(0000) GS:ffffffff80424000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000003bab48b610 CR3: 000000003a3d0000 CR4: 00000000000006e0
Process nfsiod (pid: 2771, threadinfo ffff81002938a000, task ffff81002a138800)
Stack: 0000000000000297 ffff81002961b658 ffff810029b75c20 ffff810029b75c28
 0000000000000282 ffffffff88503613 ffff810029b75c28 ffff810029b75cf8
 0000000000000000 ffffffff883a5bb4 ffff810029b75c28 ffffffff883a5df8
Call Trace:
 [<ffffffff88503613>] :nfs:nfs_commit_done+0x138/0x195
 [<ffffffff883a5bb4>] :sunrpc:rpc_exit_task+0x25/0x6e
 [<ffffffff883a5df8>] :sunrpc:__rpc_execute+0x92/0x250
 [<ffffffff800509c6>] run_workqueue+0x9a/0xf4
 [<ffffffff8004d1c5>] worker_thread+0x0/0x122
 [<ffffffff800a3a80>] keventd_create_kthread+0x0/0xc9
 [<ffffffff8004d2b5>] worker_thread+0xf0/0x122
 [<ffffffff8008ffe6>] default_wake_function+0x0/0xe
 [<ffffffff800a3a80>] keventd_create_kthread+0x0/0xc9
 [<ffffffff800353b2>] kthread+0xfe/0x132

...still looking at the cause, but it seems like this set is uncovering a race of some sort.
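A note on reading the register dump in the oops above: RAX holds 6b6b6b6b6b6b6b6b, which matches the slab POISON_FREE byte (0x6b) that kernels with slab debugging write over freed objects. One possible reading, consistent with a race, is that nfs_inode_remove_request followed a pointer into memory that had already been freed; this is only an interpretation of the dump, not a confirmed diagnosis. The following is a minimal Python sketch for spotting such values in a pasted oops. The helper name poisoned_registers and the regular expression are hypothetical and are not part of any patch attached here.

```python
import re

# Slab poison bytes as defined in the kernel's include/linux/poison.h.
POISON_BYTES = {
    0x6B: "POISON_FREE (freed object)",
    0x5A: "POISON_INUSE (uninitialised object)",
}

# Matches x86_64 oops register fields such as "RAX: 6b6b6b6b6b6b6b6b".
# Hypothetical pattern: it assumes 16 hex digits follow "NAME: ".
REG_RE = re.compile(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b")


def poisoned_registers(oops_text):
    """Return {register: description} for registers filled with one poison byte."""
    hits = {}
    for name, value in REG_RE.findall(oops_text):
        raw = bytes.fromhex(value)
        # A register counts only if all eight bytes are the same poison byte.
        if len(set(raw)) == 1 and raw[0] in POISON_BYTES:
            hits[name] = POISON_BYTES[raw[0]]
    return hits


sample = "RAX: 6b6b6b6b6b6b6b6b RBX: ffff81002961b658 RCX: 0000000000000000"
print(poisoned_registers(sample))
```

Run against the full oops text, this should flag only RAX, since no other register in the dump consists of a single repeated poison byte.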
Created attachment 367564 [details] Upstream patch #1
Created attachment 367565 [details] Backport fix for upstream patch #1
Created attachment 367566 [details] Upstream patch #2
Created attachment 367567 [details] Upstream patch #3
Created attachment 367568 [details] Upstream patch #4
The above upstream patch series (originally from the post in comment #2) allegedly resolves the deadlock issue reported here. An additional patch is included to adapt the series to differences between this kernel and the one the series was originally created against. I've tested against NFS connectathon to ensure basic functionality but can't really test the actual deadlock problem.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
A kernel build with the above patches can be found at: http://people.redhat.com/~ikent/kernel-2.6.18-172.el5.bz489931.1 Could we test this out, please? If the kernel packages needed aren't present, please let me know and I'll upload them.
Do we have a reproducer available? Could the reporter provide reproduction steps so that QE can verify this?
I don't believe we have a reproducer. I only cloned this from the RHEL4 bz based on an analysis of the RHEL4 core that indicated that this problem was also present in RHEL5.
@Jeff, @GSS We need to confirm that there is third-party commitment to test for the resolution of this request during the RHEL 5.5 Beta Test Phase before we can approve it for acceptance into the release. RHEL 5.5 Beta Test Phase is expected to begin around February 2010. In order to avoid any unnecessary delays, please post a confirmation as soon as possible, including the contact information for testing engineers. Any additional information about alternative testing variations we could use to reproduce this issue in-house would be appreciated.
in kernel-2.6.18-178.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~ RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner. Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value. If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0178.html
Any chance that some future release of RHEL5.4 will include the patches for this from RHEL5.5?
*** Bug 488063 has been marked as a duplicate of this bug. ***