Bug 217356

Summary: GFS2 hangs waiting for semaphore?
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.1
Hardware: All
OS: Linux
Severity: high
Priority: medium
Status: CLOSED ERRATA
Reporter: Fabio Massimo Di Nitto <fdinitto>
Assignee: Don Zickus <dzickus>
QA Contact: GFS Bugs <gfs-bugs>
CC: ccaulfie, kanderso, lwang, rkenna, swhiteho
Fixed In Version: RHBA-2007-0959
Doc Type: Bug Fix
Last Closed: 2007-11-07 19:15:32 UTC
Bug Depends On: 230693
Attachments:
dmesg
ps ax
gfs2_test_nowriteback
gfs2_test_writeback
dmesg from the first node hanging
ps ax from the first hanging node
archkdb cluster file system
system lockdump
glock re-ordering
Patch which applies against RHEL 5.1

Description Fabio Massimo Di Nitto 2006-11-27 13:29:48 UTC
Description of problem:

While verifying bug #203705 I noticed that dbench was slowly losing speed
when accessing the GFS2 partition. Some processes on one of the nodes were
stuck in D state.

Version-Release number of selected component (if applicable):

-nmw git tree (latest SHA: see comment 3)

How reproducible:

same setup as #203705

Steps to Reproduce:
1. Run the same setup.
2. Processes end up stuck in D state.
  
Actual results:

See the attached ps ax and dmesg output.

Expected results:

world domination^W^Wgfs2 shouldn't hang.

Comment 1 Fabio Massimo Di Nitto 2006-11-27 13:29:48 UTC
Created attachment 142167 [details]
dmesg

Comment 2 Fabio Massimo Di Nitto 2006-11-27 13:30:38 UTC
Created attachment 142168 [details]
ps ax

Comment 3 Fabio Massimo Di Nitto 2006-11-27 13:33:48 UTC
I forgot the git SHA in the original report: 3777ae733096259bdfcf0c1e6f3246412aa3a7cd.

Cheers
Fabio

Comment 4 Steve Whitehouse 2006-11-27 13:46:56 UTC
Thanks for the bug report. It will take a little while to untangle what's going on,
but it looks like there is enough information here to work it out.


Comment 5 Fabio Massimo Di Nitto 2006-11-27 14:43:30 UTC
I was talking with Patrick on IRC and he suggested trying CONFIG_DLM_DEBUG plus
debugfs mounted somewhere to gather more info. I can run this within the next
day. Is there anything else you would like me to try? Slam some printk's into
lock_dlm?

Fabio

Comment 6 Steve Whitehouse 2006-11-27 14:50:11 UTC
Well, looking at the traces, I wonder if it's related to memory allocation. One
thing to try is mounting with data=writeback, since that means there will be
no potential log flush in writepages, which is something it appears to be
waiting for.

It looks like a lot of the other threads are also waiting on the journal flush,
so I suspect it's got something to do with that, but the trouble is working
out what is cause and what is effect.

Comment 7 Fabio Massimo Di Nitto 2006-11-28 05:34:49 UTC
Created attachment 142255 [details]
gfs2_test_nowriteback

Comment 8 Fabio Massimo Di Nitto 2006-11-28 05:36:20 UTC
Created attachment 142256 [details]
gfs2_test_writeback

Comment 9 Fabio Massimo Di Nitto 2006-11-28 05:37:18 UTC
Created attachment 142257 [details]
dmesg from the first node hanging

Comment 10 Fabio Massimo Di Nitto 2006-11-28 05:38:24 UTC
Created attachment 142258 [details]
ps ax from the first hanging node

Comment 11 Fabio Massimo Di Nitto 2006-11-28 05:39:11 UTC
So I tested three more combinations:

the above kernel with data=writeback, and a kernel with dlm_debug enabled both
with and without data=writeback. In all cases gfs2 hangs.

In the attachments:

gfs2_test_nowriteback.bz2 and gfs2_test_writeback.bz2 from debugfs.

Another thing I noticed is that the first process to hang is gfs2_glockd on one
node; lock_dlm hanging on the other node is a consequence, not the cause.

More files in the attachments:

ps ax and dmesg from the first node that hangs


Comment 13 Russell Cattelan 2006-12-06 00:11:12 UTC
Created attachment 142905 [details]
archkdb cluster file system

At first glance this looks like a deadlock nightmare.

Going to run in non-cluster mode.

Comment 14 Fabio Massimo Di Nitto 2006-12-08 03:36:36 UTC
Russell,

I can reproduce this problem easily. Would it help for me to provide you access
to the cluster and SAN? It's not a production environment so we can use it in
whatever way we like.

Fabio

Comment 18 Russell Cattelan 2007-01-03 21:23:19 UTC
I am totally lost here as to what the problem is.
This has something to do with the glock code, which I am
not very familiar with.

gfs2 uses a lot more glocks than gfs1 for the same dbench 4 run:

gfs1:
bear-02[3:17pm]# gfs_tool lockdump /gfs1_nolock | grep Glock | wc -l
16138

vs gfs2:
xenon[2:24pm]# grep Glock b03-lockdump-3 | wc -l
125035

Suggestions welcome.



Comment 19 Steve Whitehouse 2007-01-04 14:29:45 UTC
Here are a few thoughts which might help to get a better understanding of what's
going on here. Firstly, try turning on all the lock debugging options, including
lockdep (but not debugging of lockdep itself, unless you want it to run _really_
slowly :-). Then, once it's hung, the "echo t >/proc/sysrq-trigger" output will
include a list of held locks (spinlocks & mutexes) in the system.

Out of all the locks reported by the lockdump, most will not be interesting, so
the question is how best to narrow down the interesting ones. I don't think the
different numbers of locks reported are significant, since they only reflect a
difference in the amount of caching in each case.

One way to get just what's needed might be to write a routine to iterate over
the locks and report only those related to a particular pid (i.e. those with
a holder or waiter[123] gfs2_holder structure pointing to the task in
question). That gives you a much shorter list of just the locks related to the
stuck processes. You don't need anything fancy: use a sysfs file to pass the
pid, or an ioctl, or whatever method is easiest since it's only a temporary
thing, and then the (modified) existing glock printing code to dump the
results via dmesg.

It might not be the best way to do it, but it will probably work.
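
As a rough illustration of the kind of filter routine being suggested, here is a
minimal sketch written as if it lived in fs/gfs2/glock.c and assuming the RHEL 5-era
glock layout (gl_holders / gl_waiters1 / gl_waiters3 lists and a gh_owner task
pointer in gfs2_holder); dump_glock() and the gl_hash_table walk stand in for the
existing printing and hash-table code, so treat the names as approximate rather
than the exact in-tree API:

static int holder_matches_pid(struct list_head *head, pid_t pid)
{
        struct gfs2_holder *gh;

        /* walk one of a glock's holder/waiter lists */
        list_for_each_entry(gh, head, gh_list)
                if (gh->gh_owner && gh->gh_owner->pid == pid)
                        return 1;
        return 0;
}

static void dump_glocks_for_pid(pid_t pid)
{
        struct gfs2_glock *gl;
        unsigned int i;

        for (i = 0; i < GFS2_GL_HASH_SIZE; i++) {
                /* per-bucket locking omitted for brevity */
                list_for_each_entry(gl, &gl_hash_table[i].hb_list, gl_list) {
                        if (holder_matches_pid(&gl->gl_holders, pid) ||
                            holder_matches_pid(&gl->gl_waiters1, pid) ||
                            holder_matches_pid(&gl->gl_waiters3, pid))
                                dump_glock(gl); /* existing printer -> dmesg */
                }
        }
}

The pid itself could then be fed in through a temporary sysfs file or ioctl, as
suggested above.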


Comment 20 Russell Cattelan 2007-01-04 17:07:49 UTC
Created attachment 144819 [details]
system lockdump

Attaching lockdump from the kernel.

Comment 21 Russell Cattelan 2007-01-04 17:46:17 UTC
Even if the glocks are "just cached", why would gfs2 need
10 times as many glocks as gfs1 for the same dbench run?

Comment 22 Steve Whitehouse 2007-01-11 12:24:36 UTC
The way to approach this is to look for which process is holding the rwsem, since
that is the one which most of the other processes are blocked up against. My
initial bet is that it's this one:

[143156.736000] dbench        D F35D5D60     0  5859   5858          5860 (NOTLB)
[143156.744000]        f35d5d74 00200086 00000002 f35d5d60 f35d5d5c 00000000 f35d5e3c f337051c
[143156.752000]        00000000 f337051c 00000007 f6146e70 aa987f00 0000007e 00000000 f6146f94
[143156.760000]        c3821780 00000000 0000ee4e 00000000 c014f8ac c164dc00 00000000 c164dc00
[143156.768000] Call Trace:
[143156.772000]  [<c03c7205>] wait_for_completion+0x74/0xaa
[143156.776000]  [<c0242a66>] glock_wait_internal+0xf2/0x21f
[143156.784000]  [<c0242d0d>] gfs2_glock_nq+0x17a/0x1ba
[143156.788000]  [<c024492e>] gfs2_change_nlink+0x78/0xaa
[143156.792000]  [<c024e8e7>] gfs2_unlink+0xc3/0x100
[143156.796000]  [<c0171b8b>] vfs_unlink+0x73/0xaf
[143156.800000]  [<c01734ad>] do_unlinkat+0x87/0x112
[143156.808000]  [<c0173548>] sys_unlink+0x10/0x12
[143156.812000]  [<c0102eac>] syscall_call+0x7/0xb

The reason is that it's doing something which can only be done under a
transaction, and the rwsem in question is sdp->sd_log_flush_lock, which
is taken on each transaction in log.c:gfs2_log_reserve().
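
To make the dependency concrete, here is a minimal sketch of the pattern being
described, assuming the usual arrangement of sd_log_flush_lock (read-locked per
transaction, write-locked by the journal flush); it is an illustration of why one
stuck transaction backs everything else up, not the actual GFS2 code:

static void example_transaction(struct gfs2_sbd *sdp)
{
        down_read(&sdp->sd_log_flush_lock);     /* as in gfs2_log_reserve() */

        /*
         * ... journaled metadata changes; if a glock request made in
         * here never completes, this read lock is never released ...
         */

        up_read(&sdp->sd_log_flush_lock);
}

static void example_log_flush(struct gfs2_sbd *sdp)
{
        /*
         * Waits for every in-flight transaction to drop its read lock;
         * any new transaction then queues behind this writer.
         */
        down_write(&sdp->sd_log_flush_lock);

        /* ... write out the journal ... */

        up_write(&sdp->sd_log_flush_lock);
}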

So the question is then why this particular process is blocked, apparently
awaiting a reply from the locking module; but since in this case it's nolock, the
chances are that the problem is somewhere in the state machine in glock.c.

I still think that the difference in the number of glocks is a red herring. I'd
be very surprised if the number of glocks actually in use was much different
overall. What I would expect, particularly on a system with not a lot going on
other than the test load on gfs2, is that the glocks stay around for longer once
gfs2 has finished with them, so the lockdump shows more of them.

How did you measure the number of glocks? Just through /proc/slabinfo after
running dbench? If so, then that's probably the cause.



Comment 24 Russell Cattelan 2007-01-29 23:17:07 UTC
Created attachment 146881 [details]
glock re-ordering
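
The attachment carries no description in the thread, so purely as background on
what a "re-ordering" fix usually means (this is a generic sketch, not the contents
of the attached patch): deadlocks of this shape are typically broken by making
every code path take the locks involved in the same order.

#include <linux/mutex.h>

static DEFINE_MUTEX(lock_a);
static DEFINE_MUTEX(lock_b);

/* Deadlock-prone: the two paths disagree on ordering (A-B vs. B-A). */
static void path_one(void)
{
        mutex_lock(&lock_a);
        mutex_lock(&lock_b);
        /* ... */
        mutex_unlock(&lock_b);
        mutex_unlock(&lock_a);
}

static void path_two_before(void)
{
        mutex_lock(&lock_b);
        mutex_lock(&lock_a);    /* can deadlock against path_one() */
        /* ... */
        mutex_unlock(&lock_a);
        mutex_unlock(&lock_b);
}

/* Re-ordered: both paths now agree on A-then-B, so no cycle can form. */
static void path_two_after(void)
{
        mutex_lock(&lock_a);
        mutex_lock(&lock_b);
        /* ... */
        mutex_unlock(&lock_b);
        mutex_unlock(&lock_a);
}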

Comment 25 Steve Whitehouse 2007-02-02 15:34:42 UTC
The fix is now in the upstream -nmw git tree. It will filter back into FC and
RHEL shortly.


Comment 26 Russell Cattelan 2007-02-09 17:52:46 UTC
This appears to be resolved.
Reassigning to Steve so the patch can make it around
the tree farm.


Comment 27 Steve Whitehouse 2007-04-18 14:31:45 UTC
Created attachment 152908 [details]
Patch which applies against RHEL 5.1

This is a patch against RHEL 5.1 ported directly from upstream.

Comment 28 Don Zickus 2007-05-01 18:08:03 UTC
in 2.6.18-17.el5

Comment 31 errata-xmlrpc 2007-11-07 19:15:32 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html