Created attachment 574102
Excerpt from /var/log/messages
Description of problem:
Heavy local input/output activity causes nfsd, lockd and rpcbind to become unresponsive for about 1-2 minutes at a time. The problem was first observed after an upgrade to kernel 2.6.32-220.4.2.el6.x86_64 and persists with kernel 2.6.32-220.7.1.el6.x86_64. Downgrading to kernel 2.6.32-131.21.1.el6.x86_64 resolves the problem, which suggests a regression introduced in RHEL 6.2.
Details:
This problem has been observed on a production NFS file server that uses XFS to store user home directories. At night it runs rsnapshot, which takes around 35 minutes to complete. Partway through this run, clients report that the NFS server is unresponsive, for example:
Mar 8 03:22:46 giga kernel: [2528650.152883] lockd: server wind not responding, still trying
Mar 8 03:24:41 giga kernel: [2528764.520874] lockd: server wind OK
Mar 9 03:57:55 giga kernel: [2616988.229604] rpcbind: server wind not responding, timed out
Mar 21 04:00:30 giga kernel: [3648349.365498] nfs: server wind not responding, still trying
Mar 21 04:01:39 giga kernel: [3648418.079378] nfs: server wind OK
Mar 21 04:01:39 work dovecot: dovecot: chdir(/share/wind/john) blocked for 21 secs
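As a rough cross-check of the "1-2 minutes" figure, each client "not responding" message can be paired with the next "OK" message from the same subsystem and the difference taken between their bracketed kernel uptime stamps. The following is a minimal Python sketch, assuming the syslog line layout shown above; the function name `outages` is hypothetical, and the sample lines are copied verbatim from this report.

```python
# Client-side kernel messages copied from the report above (client host "giga").
LOG = """\
Mar 8 03:22:46 giga kernel: [2528650.152883] lockd: server wind not responding, still trying
Mar 8 03:24:41 giga kernel: [2528764.520874] lockd: server wind OK
Mar 9 03:57:55 giga kernel: [2616988.229604] rpcbind: server wind not responding, timed out
Mar 21 04:00:30 giga kernel: [3648349.365498] nfs: server wind not responding, still trying
Mar 21 04:01:39 giga kernel: [3648418.079378] nfs: server wind OK
"""


def outages(log_text):
    """Pair each 'not responding' message with the next 'OK' message from the
    same subsystem and return a list of (subsystem, seconds) tuples.

    Durations use the bracketed [seconds.microseconds] uptime stamp that the
    client kernel prepends, assuming the whitespace-separated layout:
    month day time host "kernel:" [uptime] subsystem: message...
    """
    pending = {}  # subsystem -> uptime at which it stopped responding
    result = []
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) < 7 or parts[4] != "kernel:":
            continue  # not a kernel message in the expected layout
        uptime = float(parts[5].strip("[]"))
        subsystem = parts[6].rstrip(":")
        message = " ".join(parts[7:])
        if "not responding" in message:
            pending[subsystem] = uptime
        elif message.endswith("OK") and subsystem in pending:
            result.append((subsystem, uptime - pending.pop(subsystem)))
    # Subsystems still in `pending` (here rpcbind, which timed out) have no
    # matching recovery message and are not reported.
    return result


for subsystem, seconds in outages(LOG):
    print(f"{subsystem}: {seconds:.0f} s")
```

On the lines above this prints `lockd: 114 s` and `nfs: 69 s`, consistent with the reported 1-2 minute stalls. These figures only measure the interval between the client's own messages, so they approximate, rather than exactly bound, how long the server was unresponsive.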
On one occasion the affected machine reported a soft lockup of nfsd and lockd, which appears to be caused by waiting on a mutex in XFS-related code. Please see the attached system log for details.
The workaround is to switch back to kernel 2.6.32-131.21.1.el6.x86_64.
Version-Release number of selected component (if applicable):
kernel 2.6.32-220.7.1.el6.x86_64
kernel 2.6.32-220.4.2.el6.x86_64
How reproducible:
Intermittent, but fairly frequent.
Steps to Reproduce:
As the affected machine is a production server, I was unable to perform the stress testing needed to find a reliable way to reproduce this problem.
Can you please open a support ticket through Red Hat support? They will help you gather information and debug first level issues and then work with development if that is required.
If you don't have a support agreement in place, it is best to raise these issues on the upstream community lists.
Thanks!
Ric, thank you for your advice. However, please keep in mind that this is a bug report, not a service request. As I stated above, I was able to work around this issue and do not need assistance at this time. I will gladly cooperate in efforts to fix this bug if Red Hat is interested in fixing it.
Not really advice: I manage all of the file system developers, and we use Bugzilla to support our customers. If you have a support contract, you should work with our support team, since they do a lot of the work and often resolve issues before they reach the core developer team.
If you don't have a support contract, please take the support request to the community lists. All of our developers are very active in helping out on community issues, but of course, our customers do take priority.
Thanks!
Comment 5 RHEL Program Management
2012-05-03 05:24:20 UTC
Since RHEL 6.3 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.
Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.