Bug 455940

Summary: GFS2: hang running consistency test.
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.2
Hardware: All
OS: Linux
Severity: low
Priority: high
Status: CLOSED DUPLICATE
Reporter: Ben Marzinski <bmarzins>
Assignee: Steve Whitehouse <swhiteho>
QA Contact: Cluster QE <mspqa-list>
CC: adas, cluster-maint, edamato, rpeterso
Target Milestone: rc
Doc Type: Bug Fix
Last Closed: 2008-09-17 14:57:44 UTC
Attachments:
- Patch which might fix the hang
- A second attempt at a test patch.
- A third attempt at a test patch
- Another attempt at trying to work out what's going on
- Updated patch
- Back to first principles
- Updated test patch

Description Ben Marzinski 2008-07-18 21:58:04 UTC
Description of problem:
When QA runs the consistency test on a five-node cluster, at least one node
will often hang.

Version-Release number of selected component (if applicable):
kmod-gfs2-1.92-1.1.abhi.7
kernel-2.6.18-97.el5

How reproducible:
Easily, though not on every run.

Steps to Reproduce:
1. I'll let QA explain them.
  
Actual results:


Expected results:


Additional info:

Comment 1 Ben Marzinski 2008-07-18 22:02:20 UTC
Looking at the lock dumps for a case where three of the five nodes
(marathon-01, marathon-03, and marathon-05) hung, I see the following.

On marathon-01:
-------------------------------
G:  s:UN n:2/a6ac f:l t:EX d:UN/0 l:0 a:0 r:4
 H: s:EX f:aW e:0 p:3635 [d_doio] gfs2_prepare_write+0xc7/0x360 [gfs2]

Resource ffff81006d463480 Name (len=24) "       2            a6ac"
Master Copy
Granted Queue
01ce0001 PR Remote:   4 01ad0001
Conversion Queue
00660001 NL (EX) Remote:   5 01b70001
021c0002 NL (EX) Remote:   3 03340001
02a10001 NL (EX)
Waiting Queue

On marathon-03:
----------------------------
G:  s:UN n:2/a6ac f:l t:EX d:UN/0 l:0 a:0 r:8
 H: s:EX f:aW e:0 p:3445 [d_doio] gfs2_prepare_write+0xc7/0x360 [gfs2]
 H: s:SH f:aW e:0 p:3444 [d_doio] gfs2_readpage+0x6a/0x18a [gfs2]
 H: s:SH f:aW e:0 p:3447 [d_doio] gfs2_readpage+0x6a/0x18a [gfs2]
 H: s:SH f:aW e:0 p:3443 [d_doio] gfs2_readpage+0x6a/0x18a [gfs2]
 H: s:SH f:aW e:0 p:3446 [d_doio] gfs2_readpage+0x6a/0x18a [gfs2]

Resource ffff81006b8dc280 Name (len=24) "       2            a6ac"
Local Copy, Master is node 1
Granted Queue
Conversion Queue
03340001 NL (EX) Master:     021c0002
Waiting Queue

On marathon-05:
----------------------------
G:  s:UN n:2/a6ac f:l t:EX d:UN/0 l:0 a:0 r:8
 H: s:EX f:aW e:0 p:3426 [d_doio] gfs2_prepare_write+0xc7/0x360 [gfs2]
 H: s:SH f:aW e:0 p:3422 [d_doio] gfs2_readpage+0x6a/0x18a [gfs2]
 H: s:SH f:aW e:0 p:3425 [d_doio] gfs2_readpage+0x6a/0x18a [gfs2]
 H: s:SH f:aW e:0 p:3423 [d_doio] gfs2_readpage+0x6a/0x18a [gfs2]
 H: s:SH f:aW e:0 p:3424 [d_doio] gfs2_readpage+0x6a/0x18a [gfs2]

Resource ffff81006cba35c0 Name (len=24) "       2            a6ac"
Local Copy, Master is node 1
Granted Queue
Conversion Queue
01b70001 NL (EX) Master:     00660001
Waiting Queue

On marathon-04:
-----------------------------
G:  s:SH n:2/a6ac f: t:SH d:EX/0 l:0 a:0 r:3
 I: n:1048619/42668 t:8 f:0x00000010

Resource ffff81006c391680 Name (len=24) "       2            a6ac"
Local Copy, Master is node 1
Granted Queue
01ad0001 PR Master:     01ce0001
Conversion Queue
Waiting Queue

marathon-04 has the lock in the shared state, but there's nothing running on
marathon-04 to keep the lock open.  All the tests have completed on it.

Comment 2 Dean Jansa 2008-07-18 22:07:44 UTC
/usr/tests/sts-rhel5.2/gfs/bin/coherency -h
Usage: /usr/tests/sts-rhel5.2/gfs/bin/coherency [options]
-i <iterations>         Run each test case for n iterations.
                        (Default is 120 seconds, 0 runs forever)
-j                      Jdata is use, do not do mmap.
-l <STS_LROOT>          If not given, use what is in env
-m <mountpoint>         Start IO on fs mounted on mountpoint.
-r <STS_RROOT>          If not given, use what is in env
-s <random seed>        Seed tests with this seed (Default is PID)
-u <user>               Run IO as user. (Default is root)
-w <N>                  Run N test case at once. (Default is 1)
-O <result dir>         Directory where results and 
                        temporary files will be placed.
-R <resource_file>      Get the list of nodes to use from resource_file.
-W <N>                  Run N processes on each node. (Default is 2)

I ran with all the defaults on the marathon cluster.

It doesn't hang in the same test case each time.  The last two times I've seen
it stuck in the pwrite/pread and write/mmread cases (that is, pwrite writing
the data with pread verifying it, and write writing the data with mmread
verifying it).  So I can't give you a single test case which always hangs.

Comment 3 Dean Jansa 2008-07-21 14:20:15 UTC
FWIW --
I hit this hang again over the weekend.  The test case was sync-mmwrite-read.
Two nodes hung: marathon-03 and marathon-04.  27 hours later the OOM killer got
both of them, so I'm afraid there is nothing of use here aside from the fact
that they were leaking memory somehow.  I'll restart.

Comment 4 RHEL Program Management 2008-07-21 15:10:59 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 5 Steve Whitehouse 2008-07-21 16:20:11 UTC
Created attachment 312277 [details]
Patch which might fix the hang

After this patch, there will be only one place in the code which resets the
GLF_DEMOTE flag.  The plan is to try to catch any code paths which might
otherwise reset it.  I believe that, due to recent changes, the case where
demote_state == state in the glock workqueue function should not happen.  On
that basis I've put a GLOCK_BUG_ON() in there to be sure of that.

Also, when we have GLF_PENDING_DEMOTE set, we should not try to promote a lock,
as that might lead to deadlocks and lost wakeups, so we add an extra test for
that as well.

Longer term, we need better debugging in this area, as it's been the source of
a number of problems in the past.  I'm not sure what the best way to tackle it
is at the moment.
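
For illustration, here is a minimal sketch of the extra test described above.
It is not the attached patch: the simplified struct and the
may_grant_promotion() helper are hypothetical, and only the idea taken from the
comment is that a new holder is refused while a demote is still outstanding, so
the demote (and the waiter on the other node) cannot be lost.

/* Illustrative sketch only; not the shipped gfs2 change. */
#include <stdbool.h>
#include <stdio.h>

enum lm_state { LM_ST_UNLOCKED, LM_ST_SHARED, LM_ST_DEFERRED, LM_ST_EXCLUSIVE };

/* Hypothetical, simplified stand-in for struct gfs2_glock. */
struct sketch_glock {
	enum lm_state state;	/* state the glock currently holds */
	bool pending_demote;	/* a demote request has not been serviced yet */
};

/* While a demote is pending, refuse to grant any new holder. */
static bool may_grant_promotion(const struct sketch_glock *gl, enum lm_state wanted)
{
	if (gl->pending_demote)
		return false;			/* service the demote first */
	/* Placeholder compatibility rule; the real checks are richer. */
	return gl->state == LM_ST_UNLOCKED || wanted == gl->state;
}

int main(void)
{
	struct sketch_glock gl = { .state = LM_ST_SHARED, .pending_demote = true };

	printf("grant EX while demote pending? %s\n",
	       may_grant_promotion(&gl, LM_ST_EXCLUSIVE) ? "yes" : "no");
	return 0;
}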

Comment 6 Steve Whitehouse 2008-07-22 17:04:47 UTC
Created attachment 312369 [details]
A second attempt at a test patch.

Comment 7 Steve Whitehouse 2008-07-23 15:44:02 UTC
Created attachment 312492 [details]
A third attempt at a test patch

Comment 8 Steve Whitehouse 2008-07-24 16:12:54 UTC
Created attachment 312577 [details]
Another attempt at trying to work out what's going on

So this patch changes the demote flags.  We get rid of GLF_DEMOTE and
GLF_PENDING_DEMOTE and replace them with GLF_LOCAL_DEMOTE and GLF_REMOTE_DEMOTE
in order to make the logic a bit clearer.  Also, since a GLF_LOCAL_DEMOTE is
always to LM_ST_UNLOCKED, we can keep gl_demote_state for use only with remote
demotes.

This should make the glock dumps more meaningful... or that's what I'm hoping,
anyway.
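
To make the split concrete, here is a small sketch.  The flag names come from
the comment above, but the struct and helpers are simplified illustrations
rather than the attachment itself: a local demote carries no target state
because it is always to LM_ST_UNLOCKED, while a remote demote records its
target in the demote_state field.

/* Sketch of the flag split from comment #8; illustrative only. */
enum lm_state { LM_ST_UNLOCKED, LM_ST_SHARED, LM_ST_DEFERRED, LM_ST_EXCLUSIVE };

enum {
	GLF_LOCAL_DEMOTE  = 1u << 0,	/* demote we decided on locally */
	GLF_REMOTE_DEMOTE = 1u << 1,	/* demote requested by another node */
};

struct sketch_glock {
	unsigned int flags;
	enum lm_state demote_state;	/* meaningful only for remote demotes */
};

/* A local demote is always all the way down to LM_ST_UNLOCKED... */
static void request_local_demote(struct sketch_glock *gl)
{
	gl->flags |= GLF_LOCAL_DEMOTE;
	/* no target recorded: it is implied by the flag */
}

/* ...so demote_state can be reserved for the remote case. */
static void request_remote_demote(struct sketch_glock *gl, enum lm_state target)
{
	gl->flags |= GLF_REMOTE_DEMOTE;
	gl->demote_state = target;
}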

Comment 9 Steve Whitehouse 2008-07-24 20:47:43 UTC
Created attachment 312596 [details]
Updated patch

Comment 10 Steve Whitehouse 2008-07-25 13:52:57 UTC
Created attachment 312646 [details]
Back to first principles

So we know that when the hang occurs, we have "lost" a demote request
somewhere.  Assuming the request was actually delivered, there are only two
places where this could happen.  This patch adds a flag which is set in one of
those places and cleared in the other.  That way we'll know which of them was
passed most recently when we see the hang.
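
A minimal sketch of that tracing idea follows (hypothetical names, not the
attachment): the marker is set in one of the two demote-handling paths and
cleared in the other, so a later glock dump shows which path ran most recently.

/* Sketch of the "which path ran last?" marker from comment #10; illustrative only. */
#include <stdbool.h>

struct sketch_glock {
	bool trace_flag;	/* the 'T' flag referred to in comment #11 */
};

/* The first of the two places a demote request can be consumed sets the marker... */
static void demote_path_one(struct sketch_glock *gl)
{
	gl->trace_flag = true;
	/* ... handle the demote ... */
}

/* ...and the second clears it. */
static void demote_path_two(struct sketch_glock *gl)
{
	gl->trace_flag = false;
	/* ... handle the demote ... */
}

/*
 * When the hang is reproduced, the value of trace_flag in the glock dump
 * says which of the two paths last handled a demote for this glock.
 */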

Comment 11 Steve Whitehouse 2008-07-25 15:45:49 UTC
Going back to comment #1, the lock which is causing the problem is being cached
in the SH state.  Let's assume, given the lack of demote flags, that the last
operation was a demote (since that's the most likely way the flags got
canceled).  There are several possible paths that could have been taken:

EX -> SH (demote)
DF -> SH (demote) - very unlikely, as there are no waiting DF holders
DF -> UN -> SH (demote) - also very unlikely, for the same reason

So EX -> SH is the most likely previous operation for the glock in question. If
we assume that the previous operation was a promote, then we have the following
possible paths:

UN -> SH (promote)
DF -> SH (promote)

If the 'T' flag shows up when we run the patch from comment #10, then we can
reduce the possible previous state changes down to just one of those sets. So
the other thing I'm thinking of doing is adding a prev_state variable to each
glock so that we can see what that was. That should be just enough to tell us
which of the possible state changes is happening to our glock.

I've already been through the code for all of the above this morning and I've
not spotted anything that's obviously wrong so far.

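To make the prev_state idea concrete, here is a small sketch (illustrative
names only, not the attached patch): each state change saves the outgoing state
before installing the new one, and the dump prints both, which is enough to
tell the candidate transitions above apart.

/* Sketch of recording the previous glock state (comments #11 and #12); illustrative only. */
#include <stdio.h>

enum lm_state { LM_ST_UNLOCKED, LM_ST_SHARED, LM_ST_DEFERRED, LM_ST_EXCLUSIVE };

static const char *state_name(enum lm_state s)
{
	static const char * const names[] = { "UN", "SH", "DF", "EX" };
	return names[s];
}

struct sketch_glock {
	enum lm_state state;
	enum lm_state prev_state;	/* the proposed debugging field */
};

/* Every state change remembers where it came from. */
static void set_glock_state(struct sketch_glock *gl, enum lm_state new_state)
{
	gl->prev_state = gl->state;
	gl->state = new_state;
}

/* The glock dump can then print both states. */
static void dump_glock(const struct sketch_glock *gl)
{
	printf("G: s:%s prev:%s\n", state_name(gl->state), state_name(gl->prev_state));
}

int main(void)
{
	struct sketch_glock gl = { .state = LM_ST_EXCLUSIVE, .prev_state = LM_ST_UNLOCKED };

	set_glock_state(&gl, LM_ST_SHARED);	/* an EX -> SH demote */
	dump_glock(&gl);			/* prints: G: s:SH prev:EX */
	return 0;
}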

Comment 12 Steve Whitehouse 2008-07-25 15:55:24 UTC
Created attachment 312657 [details]
Updated test patch

This one will record the previous lock state as well as the current state.

Comment 13 Steve Whitehouse 2008-08-14 16:17:34 UTC
Looks like a possible dup of #458684. Cannot reproduce at the moment. Will look at this again when we have a fix for #458684.

Comment 14 Steve Whitehouse 2008-09-17 14:57:44 UTC

*** This bug has been marked as a duplicate of bug 458684 ***