Bug 741517

Summary:	deadlock flushing dirty pages doing pNFS
Product:	Red Hat Enterprise Linux 6	Reporter:	Ricardo Labiaga <ricardo.labiaga>
Component:	kernel	Assignee:	Steve Dickson <steved>
Status:	CLOSED WORKSFORME	QA Contact:	Red Hat Kernel QE team <kernel-qe>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	6.2	CC:	bfields, bikash.choudhury, dhowells, iisaman, jlayton, ricardo.labiaga, rwheeler, sprabhu, steved, trond.myklebust
Target Milestone:	rc
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:	pNFS
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2012-05-29 15:55:12 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	767187

Description Ricardo Labiaga 2011-09-27 06:40:13 UTC

Description of problem:
During heavy write traffic from a pNFS client to the ONTAP 8 c-mode Data Server,
the Logical Interface (LIF) is migrated to a different cluster node.  The
traffic stops and restarts as expected when the LIF comes back up on the new
node.  During revert (when the LIF is sent back to its original node), the
writes deadlock on the pNFS client.

Version-Release number of selected component (if applicable):
RHEL 6.2 -193 kernel.  

How reproducible:
Not every time, but can be reproduced.

Steps to Reproduce:
1. Multiple sio processes against multiple mounts
2. LIF Migrate and LIF revert on the NetApp ONTAP c-mode node
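
The report does not give the sio command line, so the exact load cannot be reproduced from this text alone. The call trace under "Actual results" does show the blocked writers going through sys_pwrite64 and generic_write_sync, which is the path taken by positional writes on a file with synchronous write semantics. A minimal stand-in for one such writer, assuming O_SYNC is how the sync behavior was requested (the function name, block size, and block count are hypothetical and not taken from sio), might look like:

```python
import os


def sync_pwrite_load(path: str, block_size: int = 4096, blocks: int = 16) -> int:
    """Write `blocks` blocks with pwrite() on an O_SYNC descriptor.

    Hypothetical stand-in for a single sio writer; the real sio options
    are not stated in this report. Returns the number of bytes written.
    """
    # O_SYNC makes every write wait until the data reaches the server,
    # which is what routes the write through generic_write_sync().
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
    try:
        buf = b"\xa5" * block_size
        written = 0
        for i in range(blocks):
            # Positional write, matching the pwrite64 entry in the trace.
            written += os.pwrite(fd, buf, i * block_size)
        return written
    finally:
        os.close(fd)
```

Pointing several of these at files on different pNFS mounts while the LIF is migrated and reverted would approximate step 1, but this is only a sketch and has not been verified to trigger the hang.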
  
Actual results:
This is a stack trace from a 2.6.39.1 kernel, which is the basis of the RHEL 6.2 NFS client backport.  We'll provide a RHEL 6.2 stack trace shortly.

INFO: task sio:2469 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
sio            D 0000000000000001     0 2469      1 0x00020080
 ffff88007ae25ac8 0000000000000086 ffff88007aac2408 ffff88007a83d600
 ffff88007ae24010 ffff88006be15c40 0000000000013a00 ffff88007ae25fd8
 ffff88007ae25fd8 0000000000013a00 ffff88007be00000 ffff88006be15c40
Call Trace:
 [<ffffffff810cb11d>] ? __lock_page+0x6d/0x6d
 [<ffffffff8144c616>] io_schedule+0x8c/0xcf
 [<ffffffff810cb12b>] sleep_on_page+0xe/0x12
 [<ffffffff8144cce8>] __wait_on_bit+0x48/0x7b
 [<ffffffff810cb2a0>] wait_on_page_bit+0x72/0x79
 [<ffffffff8106930c>] ? autoremove_wake_function+0x3d/0x3d
 [<ffffffff810d3d05>] ? pagevec_lookup_tag+0x25/0x2e
 [<ffffffff810cb5d2>] filemap_fdatawait_range+0xa4/0x17e
 [<ffffffff810d35c4>] ? do_writepages+0x21/0x2d
 [<ffffffff810cb722>] ? __filemap_fdatawrite_range+0x50/0x52
 [<ffffffff810cb767>] filemap_write_and_wait_range+0x43/0x56
 [<ffffffff81132ecc>] vfs_fsync_range+0x35/0x7a
 [<ffffffff81132f52>] generic_write_sync+0x41/0x43
 [<ffffffff810cbbef>] generic_file_aio_write+0x91/0xb8
 [<ffffffffa026726e>] nfs_file_write+0xd9/0x16e [nfs]
 [<ffffffff81110c39>] do_sync_write+0xcb/0x108
 [<ffffffff811dfd92>] ? selinux_file_permission+0x5c/0xb0
 [<ffffffff811da712>] ? security_file_permission+0x2e/0x33
 [<ffffffff8111160a>] vfs_write+0xae/0x10a
 [<ffffffff81119b9e>] ? path_put+0x22/0x27
 [<ffffffff811116c0>] sys_pwrite64+0x5a/0x79
 [<ffffffff81037317>] sys32_pwrite+0x1c/0x1e
 [<ffffffff81455cc0>] sysenter_dispatch+0x7/0x2e
INFO: task sio:2470 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
sio            D ffff88007ad05f80     0 2470      1 0x00020080
 ffff88006c3dfac8 0000000000000082 ffff88007aac0c08 ffff88007a83d600
 ffff88006c3de010 ffff88006bdeae20 0000000000013a00 ffff88006c3dffd8
 ffff88006c3dffd8 0000000000013a00 ffff88007aed8000 ffff88006bdeae20
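
When many sio processes hang at once, the log fills with reports like the ones above. A small helper along the following lines can condense them to one entry per blocked task. It is a sketch only: the function name and regular expressions are assumptions based on the message format shown in this report. Frames prefixed with "?" are skipped, since the kernel uses that marker for addresses found on the stack that are not reliably part of the call chain.

```python
import re
from typing import Dict, List

# Header line of a hung-task report, e.g.
# "INFO: task sio:2469 blocked for more than 120 seconds."
TASK_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# One call-trace entry, e.g. " [<ffffffff8144c616>] io_schedule+0x8c/0xcf"
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_hung_tasks(log: str) -> List[Dict]:
    """Return one dict per blocked task with its reliable call-trace frames."""
    tasks: List[Dict] = []
    current = None
    for line in log.splitlines():
        header = TASK_RE.search(line)
        if header:
            current = {
                "comm": header["comm"],
                "pid": int(header["pid"]),
                "secs": int(header["secs"]),
                "frames": [],
            }
            tasks.append(current)
            continue
        frame = FRAME_RE.search(line)
        # Keep only frames without the "?" marker.
        if frame and current is not None and not frame["unreliable"]:
            current["frames"].append(frame["func"])
    return tasks
```

Run against the output above, this yields two tasks (pids 2469 and 2470). The first has a reliable chain starting at io_schedule and ending at sysenter_dispatch; the second has no frames because its trace is cut off in this report.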


Expected results:
Writes should stop briefly (as if the server had rebooted) and should proceed after reestablishing the session.

Additional info:
Fred Isaman at NetApp is studying the NFS client code to understand how it is that this condition can be hit.

Comment 2 RHEL Program Management 2011-10-07 15:50:40 UTC

Since RHEL 6.2 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 3 Fred Isaman 2011-11-07 19:13:40 UTC

(In reply to comment #0)
> Fred Isaman at NetApp is studying the NFS client code to understand how it is
> that this condition can be hit.

This was triggered by an incorrect RPC_PROG_MISMATCH server response.  I've submitted a client patch upstream that will dump a relevant message ("program 100003, version 4 unsupported by server") into the log before hanging to aid debugging.
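
For readers unfamiliar with where PROG_MISMATCH sits in an RPC reply, the following userspace sketch decodes the reply header as laid out in the ONC RPC specification (RFC 5531) and produces a message in the same form as the one quoted above. It is not the kernel patch: the function name is hypothetical, and the "supported versions" suffix is an addition for illustration, taken from the low/high version pair that the protocol returns with PROG_MISMATCH.

```python
import struct

# Constants from the ONC RPC protocol (RFC 5531).
REPLY = 1          # msg_type
MSG_ACCEPTED = 0   # reply_stat
PROG_MISMATCH = 2  # accept_stat


def describe_reply(reply: bytes, prog: int, vers: int) -> str:
    """Decode an RPC reply header and describe its accept status.

    `prog` and `vers` are the program and version the client called with;
    they are not carried in the reply itself.
    """
    xid, msg_type, reply_stat = struct.unpack_from(">III", reply, 0)
    if msg_type != REPLY:
        raise ValueError("not an RPC reply")
    if reply_stat != MSG_ACCEPTED:
        return "call denied by server"
    # Verifier: flavor, length, then the body padded to a 4-byte boundary.
    _flavor, verf_len = struct.unpack_from(">II", reply, 12)
    offset = 20 + ((verf_len + 3) & ~3)
    (accept_stat,) = struct.unpack_from(">I", reply, offset)
    if accept_stat == PROG_MISMATCH:
        # The server reports the lowest and highest versions it supports.
        low, high = struct.unpack_from(">II", reply, offset + 4)
        return (
            f"program {prog}, version {vers} unsupported by server "
            f"(server reports versions {low}-{high})"
        )
    return f"accept_stat {accept_stat}"
```

For example, a reply with an empty verifier, accept_stat 2, and a version range of 2 to 3, checked against an NFSv4 call (program 100003, version 4), is described as "program 100003, version 4 unsupported by server (server reports versions 2-3)".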

Comment 6 Ricardo Labiaga 2012-05-29 15:55:12 UTC

The problem was triggered by a misbehaving server.  Although bugs in the server shouldn't cause the client to freeze, the hang occurs pretty low in the RPC layer and dealing with it is not deemed important enough.  No one has reported this problem other than in the cases caused by the RPC_PROG_MISMATCH response, so I'm not going to request a fix at this time.