Bug 506140
| Field | Value |
|---|---|
| Summary | GFS2: Filesystem deadlock when running SPECsfs on BIGI test bed. |
| Product | Red Hat Enterprise Linux 5 |
| Reporter | Ben Marzinski <bmarzins> |
| Component | kernel |
| Assignee | Ben Marzinski <bmarzins> |
| Status | CLOSED ERRATA |
| QA Contact | Cluster QE <mspqa-list> |
| Severity | medium |
| Priority | low |
| Version | 5.4 |
| CC | bmarson, dzickus, jgiles, jtluka, rwheeler, swhiteho |
| Target Milestone | rc |
| Hardware | All |
| OS | Linux |
| Doc Type | Bug Fix |
| Last Closed | 2009-09-02 08:54:31 UTC |
Description
Ben Marzinski
2009-06-15 18:11:33 UTC
Created attachment 347978: crash and glock info
The important process is pid 15509. All the other stuck processes are waiting for a glock that it is holding, and it is waiting for glock (3/8034), which nobody is holding.
All of the glock_workqueue processes are idle, so nothing is running the glock queue.
A new potential blocker that fell out of debugging Barry's original test case.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Created attachment 348028: debugging patch
This patch will keep a list of enqueues, dequeues and calls to glock_work_func. It stores the last 2048 actions per filesystem. Hopefully this will let us narrow down why the glock isn't getting promoted.
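The debugging patch itself is not reproduced here. As a rough illustration of the kind of per-filesystem history it describes, the following is a minimal userspace sketch of a fixed-size ring buffer that keeps only the most recent 2048 events. The type and function names (`glock_trace`, `trace_record`, `trace_get`) are hypothetical, the three event kinds simply mirror the actions listed above, the type/number split assumes the "3/8034" notation means glock type 3, number 8034, and the locking a real kernel version would need around `trace_record` is omitted.

```c
#include <assert.h>

/* Hypothetical event kinds, mirroring the actions the debugging patch
 * is described as recording. */
enum glock_event { EV_ENQUEUE, EV_DEQUEUE, EV_WORK_FUNC };

#define GLOCK_TRACE_SIZE 2048   /* keep only the last 2048 actions */

struct glock_trace_entry {
	enum glock_event ev;
	unsigned int lock_type;      /* assumed glock type, e.g. the 3 in "3/8034" */
	unsigned long long lock_num; /* assumed glock number, e.g. the 8034 */
	int pid;                     /* process that performed the action */
};

/* One buffer per filesystem (hypothetical). Zero-initialize before use. */
struct glock_trace {
	struct glock_trace_entry ent[GLOCK_TRACE_SIZE];
	unsigned long head;          /* total number of events ever recorded */
};

/* Record an event, overwriting the oldest one once the buffer is full.
 * No locking here: a kernel version would need to serialize callers. */
static void trace_record(struct glock_trace *t, enum glock_event ev,
			 unsigned int type, unsigned long long num, int pid)
{
	struct glock_trace_entry *e = &t->ent[t->head % GLOCK_TRACE_SIZE];

	e->ev = ev;
	e->lock_type = type;
	e->lock_num = num;
	e->pid = pid;
	t->head++;
}

/* Number of entries currently available (at most GLOCK_TRACE_SIZE). */
static unsigned long trace_count(const struct glock_trace *t)
{
	return t->head < GLOCK_TRACE_SIZE ? t->head : GLOCK_TRACE_SIZE;
}

/* Return the i-th surviving entry, where i == 0 is the oldest. */
static const struct glock_trace_entry *
trace_get(const struct glock_trace *t, unsigned long i)
{
	unsigned long oldest = t->head - trace_count(t);

	assert(i < trace_count(t));
	return &t->ent[(oldest + i) % GLOCK_TRACE_SIZE];
}
```

Once more than 2048 events have been recorded, the oldest ones are overwritten, so the buffer always holds the tail of the history leading up to a hang, which is the part that matters when asking why a glock was never promoted.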
If this bug is a dup of the other one, can we close it as such?

I thought we could keep the original bug open for the performance issue, and track the actual hang with this one. We might want to open one more bug for the panic.

Steve found a place where this hang can happen. gfs2_shrink_glock_memory() locks the glock's GLF_LOCK bit, but doesn't always reschedule a glock_workqueue process to perform the promotions related to the glock. If a glock_workqueue process tries to work on the glock while it is locked by gfs2_shrink_glock_memory(), it will see that the GLF_LOCK bit is set and assume that whoever locked it is going to deal with the lock themselves. I believe that Steve is working on a patch to fix this and to keep the iopen glocks off the LRU list to help solve bug 504335.

Created attachment 349021: Patch to always queue work when we lock GLF_LOCK
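The contents of attachment 349021 are not shown here. The following single-threaded C sketch only replays the interleaving described above, to show why a worker that backs off on a set GLF_LOCK bit can strand a waiting holder, and why always queueing work after taking the bit avoids that. Everything in it is hypothetical: `struct glock`, `glock_work`, `run_workqueue`, `replay` and the other names are simplified stand-ins, not the GFS2 structures or the real `gfs2_shrink_glock_memory()` and `glock_work_func` code, and a plain `bool` plays the role of the atomic GLF_LOCK bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified glock: just enough state to replay
 * the interleaving described in this bug. Not the GFS2 data structure. */
struct glock {
	bool glf_lock;        /* stands in for the GLF_LOCK bit */
	bool waiter;          /* a holder (like pid 15509) waiting for promotion */
	bool granted;         /* has the waiter been promoted? */
	int work_queued;      /* pending runs of the glock work function */
};

/* Single-threaded stand-in for test_and_set_bit(): returns the old value. */
static bool test_and_set_lock(struct glock *gl)
{
	bool old = gl->glf_lock;

	gl->glf_lock = true;
	return old;
}

/* Stand-in for queueing the glock on glock_workqueue. */
static void queue_glock_work(struct glock *gl)
{
	gl->work_queued++;
}

/* Stand-in for the glock work function: if GLF_LOCK is already set,
 * assume whoever set it will run the promotions, and do nothing. */
static void glock_work(struct glock *gl)
{
	if (test_and_set_lock(gl))
		return;
	if (gl->waiter) {
		gl->waiter = false;
		gl->granted = true;   /* promote the waiting holder */
	}
	gl->glf_lock = false;
}

/* Drain the simulated workqueue. */
static void run_workqueue(struct glock *gl)
{
	while (gl->work_queued > 0) {
		gl->work_queued--;
		glock_work(gl);
	}
}

/* Replay: a shrinker holds GLF_LOCK while a request arrives and the work
 * function runs. Returns true if the waiter is eventually granted. */
static bool replay(bool always_queue_after_lock)
{
	struct glock gl = { 0 };

	test_and_set_lock(&gl);       /* shrinker takes GLF_LOCK */
	gl.waiter = true;             /* a process enqueues a holder ... */
	queue_glock_work(&gl);        /* ... and queues glock work */
	run_workqueue(&gl);           /* worker sees GLF_LOCK set and backs off */
	gl.glf_lock = false;          /* shrinker drops GLF_LOCK */
	if (always_queue_after_lock)
		queue_glock_work(&gl);    /* the fix: always requeue the work */
	run_workqueue(&gl);           /* does nothing unless work was queued */
	assert(!gl.glf_lock);         /* nobody may leave GLF_LOCK set */
	return gl.granted;
}
```

With `replay(false)` the waiter is never granted: the only queued work ran while the shrinker held GLF_LOCK and returned early, and nothing queues it again, which is consistent with the idle glock_workqueue processes in the original report. With `replay(true)` the extra queued work runs after the bit is released and promotes the waiter.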
Posted in kernel-2.6.18-156.el5.

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.

Patch is in -158.el5. Adding SanityOnly.

Patch was tested thoroughly in -156 with postmark. We also tested SPECsfs and saw no deadlock condition.
Barry

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html