Bug 808635

Summary: [nfs3/xfs] NFS server becomes temporarily unresponsive under heavy I/O on XFS filesystem
Product: Red Hat Enterprise Linux 6 Reporter: Maciej Puzio <c3377936>
Component: kernel Assignee: Red Hat Kernel Manager <kernel-mgr>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 6.2 CC: rwheeler, torel
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-07-27 14:39:15 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description: Excerpt from /var/log/messages Flags: none

Description Maciej Puzio 2012-03-30 23:05:45 UTC
Created attachment 574102
Excerpt from /var/log/messages

Description of problem:
Heavy local I/O activity causes nfsd, lockd and rpcbind to become unresponsive for about 1-2 minutes. This problem was first observed after an upgrade to kernel 2.6.32-220.4.2.el6.x86_64, and continues with kernel 2.6.32-220.7.1.el6.x86_64. Downgrading to kernel 2.6.32-131.21.1.el6.x86_64 resolves the problem, suggesting that this is a regression introduced in RHEL 6.2.

Details:
This problem has been observed on a machine which is a production NFS fileserver using XFS to store user home directories. At night it runs rsnapshot, which takes around 35 minutes to complete. In the middle of this run, clients report that the NFS server is unresponsive, for example:
Mar  8 03:22:46 giga kernel: [2528650.152883] lockd: server wind not responding, still trying
Mar  8 03:24:41 giga kernel: [2528764.520874] lockd: server wind OK

Mar  9 03:57:55 giga kernel: [2616988.229604] rpcbind: server wind not responding, timed out

Mar 21 04:00:30 giga kernel: [3648349.365498] nfs: server wind not responding, still trying
Mar 21 04:01:39 giga kernel: [3648418.079378] nfs: server wind OK

Mar 21 04:01:39 work dovecot: dovecot: chdir(/share/wind/john) blocked for 21 secs
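
The length of each stall can be read off these client-side messages by pairing every "not responding" line with the next "OK" line for the same service and server. The following is only a minimal sketch of that pairing, not a tool used in this report: the function name `outages` and the regular expression are illustrative, and the year is an assumption because syslog timestamps do not include one.

```python
import re
from datetime import datetime

# Matches client-side kernel messages in the format quoted above, e.g.
#   Mar  8 03:22:46 giga kernel: [2528650.152883] lockd: server wind not responding, still trying
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: \[\s*[\d.]+\] "
    r"(?P<svc>\w+): server (?P<server>\S+) (?P<state>not responding|OK)"
)


def outages(lines, year=2012):
    """Yield (service, server, seconds) for each 'not responding' -> 'OK' pair.

    `year` is an assumption: syslog timestamps carry no year.
    """
    pending = {}  # (service, server) -> time of first 'not responding'
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # e.g. the dovecot line, which is not a kernel message
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        key = (m["svc"], m["server"])
        if m["state"] == "not responding":
            # Keep the earliest timestamp if the message repeats.
            pending.setdefault(key, ts)
        elif key in pending:
            start = pending.pop(key)
            yield key[0], key[1], int((ts - start).total_seconds())
```

Applied to the excerpt above, this yields a 115-second stall for lockd on Mar 8 and a 69-second stall for nfs on Mar 21. The rpcbind entry has no matching "OK" line, so it produces no duration.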

On one occasion the affected machine reported a soft lockup of nfsd and lockd, which appears to be caused by waiting on a mutex in XFS-related code. Please see the attached system log for details.

The workaround is to switch back to kernel 2.6.32-131.21.1.el6.x86_64.

Version-Release number of selected component (if applicable):
kernel 2.6.32-220.7.1.el6.x86_64
kernel 2.6.32-220.4.2.el6.x86_64

How reproducible:
Intermittent, but fairly frequent.

Steps to Reproduce:
As the affected machine is a production server, I was unable to perform the stress testing necessary to find a reliable way to reproduce this problem.

Comment 2 Ric Wheeler 2012-04-01 13:48:09 UTC
Can you please open a support ticket through Red Hat support? They will help you gather information and debug first-level issues, and then work with development if that is required.

If you don't have a support agreement in place, best to raise these issues on upstream community lists.

Thanks!

Comment 3 Maciej Puzio 2012-04-09 19:44:43 UTC
Ric, thank you for your advice. However, please keep in mind that this is a bug report, and not a service request. As I stated above, I was able to work around this issue, and do not need assistance at this time. I will gladly cooperate in efforts to fix this bug, if Red Hat is interested in fixing it.

Comment 4 Ric Wheeler 2012-04-09 19:54:48 UTC
Not really advice - I manage all of the file system developers and we use bugzilla to support our customers. If you have a support contract, you should work with our support team since they do a lot of the work and often resolve the issues before hitting the core developer team.

If you don't have a support contract, please take the support request to the community lists. All of our developers are very active in helping out on community issues, but of course, our customers do take priority.

Thanks!

Comment 5 RHEL Program Management 2012-05-03 05:24:20 UTC
Since RHEL 6.3 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 6 Tore H. Larsen 2012-07-27 11:29:00 UTC
cc

Comment 7 Ric Wheeler 2012-07-27 14:39:15 UTC
Please reopen if you can work with Red Hat support to gather data.