Bug 506140 - GFS2: Filesystem deadlock when running SPECsfs on BIGI test bed.
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All Linux
Priority: low  Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Ben Marzinski
QA Contact: Cluster QE
Reported: 2009-06-15 14:11 EDT by Ben Marzinski
Modified: 2009-09-03 10:09 EDT
Doc Type: Bug Fix
Last Closed: 2009-09-02 04:54:31 EDT

Attachments
crash and glock info (196.41 KB, text/plain), 2009-06-15 14:23 EDT, Ben Marzinski
debugging patch (2.79 KB, patch), 2009-06-15 19:06 EDT, Ben Marzinski
Patch to always queue work when we lock GLF_LOCK (1.54 KB, patch), 2009-06-22 21:13 EDT, Ben Marzinski

Description Ben Marzinski 2009-06-15 14:11:33 EDT
Description of problem:
Running the SPECsfs workload from bz #504335 on a single-node GFS2 filesystem on the BIGI test bed occasionally causes the machine to hang.

Version-Release number of selected component (if applicable):
kernel-2.6.18-152.el5

How reproducible:
Occasionally on the BIGI test bed.  Has not been reproduced elsewhere.


Additional info:
Looking at the stack traces and glock dumps, the main problem appears to be a process that is stuck waiting for a resource group lock. The glock is in a compatible state, so the lock would be granted to that process if the glock were processed. For some reason, however, nothing has run through the holder queue on the glock to promote it.
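
To illustrate what "running through the holder queue" means here, the following is a minimal userspace sketch, not the GFS2 code. Every name in it (model_glock, model_holder, model_may_grant, model_run_queue) and the three-state lock model are assumptions made for illustration. The point it shows is that a waiter whose request is compatible with the glock's current state still stays ungranted until something actually walks the queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified lock states; not the kernel's definitions. */
enum model_state { MODEL_UNLOCKED, MODEL_SHARED, MODEL_EXCLUSIVE };

struct model_holder {
    enum model_state wanted;  /* state this holder asked for */
    bool granted;             /* true once the holder has been promoted */
};

struct model_glock {
    enum model_state state;      /* state the lock is currently held in */
    struct model_holder *queue;  /* holders in arrival order */
    size_t nholders;
};

/* A request is compatible if the lock is already in the wanted state and,
 * for an exclusive request, no other holder has been granted yet. */
bool model_may_grant(const struct model_glock *gl, const struct model_holder *h)
{
    size_t i;

    if (h->wanted == MODEL_UNLOCKED || gl->state != h->wanted)
        return false;
    if (h->wanted == MODEL_EXCLUSIVE) {
        for (i = 0; i < gl->nholders; i++)
            if (gl->queue[i].granted)
                return false;
    }
    return true;
}

/* Walk the queue in order and promote waiters until one cannot be granted.
 * Returns the number of holders promoted by this pass. */
size_t model_run_queue(struct model_glock *gl)
{
    size_t i, promoted = 0;

    assert(gl->queue != NULL || gl->nholders == 0);
    for (i = 0; i < gl->nholders; i++) {
        struct model_holder *h = &gl->queue[i];

        if (h->granted)
            continue;
        if (!model_may_grant(gl, h))
            break;
        h->granted = true;
        promoted++;
    }
    return promoted;
}
```

In this model, a holder that wants MODEL_SHARED on a glock already in MODEL_SHARED remains with granted == false indefinitely unless model_run_queue() is called, which mirrors the symptom described above.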
Comment 1 Ben Marzinski 2009-06-15 14:23:34 EDT
Created attachment 347978
crash and glock info

The important process is pid 15509. All the other stuck processes are waiting for a glock that it is holding, and it is waiting for glock (3/8034), which nobody is holding.

All of the glock_workqueue processes are idle, so there is nothing to run the glock queue.
Comment 2 Ric Wheeler 2009-06-15 14:37:20 EDT
A new potential blocker that fell out of debugging Barry's original test case.
Comment 3 RHEL Product and Program Management 2009-06-15 14:51:45 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 4 Ben Marzinski 2009-06-15 19:06:50 EDT
Created attachment 348028
debugging patch

This patch will keep a list of enqueues, dequeues and calls to glock_work_func.  It stores the last 2048 actions per filesystem. Hopefully this will let us narrow down why the glock isn't getting promoted.
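
The attached patch itself is not reproduced in this report. As a rough idea of the technique, here is a minimal userspace sketch of a fixed-size ring that retains only the most recent 2048 actions. The type and function names (trace_ring, trace_record, trace_get, and so on) are hypothetical and are not taken from the patch; only the 2048-entry limit and the three kinds of action come from the comment above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical event kinds, named after the actions listed in the comment. */
enum trace_kind { TRACE_ENQUEUE, TRACE_DEQUEUE, TRACE_WORK_FUNC };

struct trace_event {
    enum trace_kind kind;
    unsigned long glock_number;  /* which glock the action touched */
    unsigned long seq;           /* position in the overall event stream */
};

#define TRACE_SLOTS 2048         /* "last 2048 actions" per filesystem */

struct trace_ring {
    struct trace_event slot[TRACE_SLOTS];
    unsigned long next_seq;      /* total number of events ever recorded */
};

void trace_init(struct trace_ring *r)
{
    memset(r, 0, sizeof(*r));
}

/* Record one action, overwriting the oldest entry once the ring is full. */
void trace_record(struct trace_ring *r, enum trace_kind kind,
                  unsigned long glock_number)
{
    struct trace_event *e = &r->slot[r->next_seq % TRACE_SLOTS];

    e->kind = kind;
    e->glock_number = glock_number;
    e->seq = r->next_seq++;
}

/* Number of events currently retained (at most TRACE_SLOTS). */
size_t trace_count(const struct trace_ring *r)
{
    return r->next_seq < TRACE_SLOTS ? (size_t)r->next_seq : TRACE_SLOTS;
}

/* Return the i-th retained event, oldest first; i must be < trace_count(). */
const struct trace_event *trace_get(const struct trace_ring *r, size_t i)
{
    unsigned long oldest;

    assert(i < trace_count(r));
    oldest = r->next_seq - trace_count(r);
    return &r->slot[(oldest + i) % TRACE_SLOTS];
}
```

A kernel version would presumably need one ring per mounted filesystem and some form of locking around trace_record(); both are left out of this sketch.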
Comment 5 Steve Whitehouse 2009-06-16 09:57:11 EDT
If this bug is a dup of the other one, can we close it as such?
Comment 6 Ben Marzinski 2009-06-16 10:24:24 EDT
I thought we could keep the original bug open for the performance issue, and track the actual hang with this one. We might want to open one more bug for the panic.
Comment 7 Ben Marzinski 2009-06-17 02:21:06 EDT
Steve found a place where this hang can happen. gfs2_shrink_glock_memory() locks the glock's GLF_LOCK bit, but doesn't always reschedule a glock_workqueue process to perform the promotions related to the glock. If a glock_workqueue process tries to work on the glock while it is locked by gfs2_shrink_glock_memory(), it will see that the GLF_LOCK bit is locked and assume that whoever locked it is going to deal with the lock themselves.

I believe that Steve is working on a patch to fix this and keep the iopen glocks off the lru list to help solve 504335.
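
The missed promotion described in this comment can be modelled in a few lines of single-threaded userspace C. This is only a sketch under stated assumptions: the struct, the function names, and the always_queue_work switch are all hypothetical, a plain bool stands in for the GLF_LOCK bit, and the real kernel code paths are more involved. It shows the worker backing off while the bit is held, and the waiter being promoted only if the lock owner queues work again when it is done.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one glock with a single waiting holder. */
struct lock_model {
    bool locked;          /* stands in for the GLF_LOCK bit */
    bool work_queued;     /* a workqueue pass is pending for this glock */
    bool waiter_granted;  /* the waiting holder has been promoted */
};

/* One workqueue pass: if the bit is held, assume the owner will handle
 * promotions and back off; otherwise promote the waiter. */
void model_work_func(struct lock_model *gl)
{
    gl->work_queued = false;
    if (gl->locked)
        return;
    gl->waiter_granted = true;
}

/* Run pending work until the (modelled) workqueue is idle. */
void model_drain(struct lock_model *gl)
{
    while (gl->work_queued)
        model_work_func(gl);
}

/* The shrinker path takes the bit... */
void model_shrink_begin(struct lock_model *gl)
{
    assert(!gl->locked);
    gl->locked = true;
}

/* ...and later drops it, optionally queueing work for the glock. */
void model_shrink_end(struct lock_model *gl, bool always_queue_work)
{
    gl->locked = false;
    if (always_queue_work)
        gl->work_queued = true;
}

/* Returns true if the waiter ends up promoted once everything is idle. */
bool model_scenario(bool always_queue_work)
{
    /* A holder has been enqueued and a work pass is already pending. */
    struct lock_model gl = { false, true, false };

    model_shrink_begin(&gl);  /* shrinker wins the race for the bit */
    model_drain(&gl);         /* worker runs, sees the bit, backs off */
    model_shrink_end(&gl, always_queue_work);
    model_drain(&gl);         /* only does anything if work was requeued */
    return gl.waiter_granted;
}
```

With always_queue_work set to false the scenario ends with the workqueue idle and the waiter still not promoted, which matches the hang seen in the glock dumps; with it set to true the second pass promotes the waiter.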
Comment 8 Ben Marzinski 2009-06-22 21:13:49 EDT
Created attachment 349021
Patch to always queue work when we lock GLF_LOCK
Comment 9 Ben Marzinski 2009-06-22 21:14:24 EDT
Posted
Comment 10 Don Zickus 2009-06-30 16:22:51 EDT
in kernel-2.6.18-156.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 12 Jan Tluka 2009-07-20 11:28:11 EDT
Patch is in -158.el5. Adding SanityOnly.
Comment 13 Barry Marson 2009-07-21 15:08:02 EDT
Patch was tested thoroughly in -156 with postmark. We also tested SPECsfs and saw no deadlock condition.

Barry
Comment 15 errata-xmlrpc 2009-09-02 04:54:31 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
