Red Hat Bugzilla – Bug 217356
GFS2 hangs waiting for semaphore?
Last modified: 2007-11-30 17:07:38 EST
Description of problem:
While verifying bug #203705 I noticed that dbench was slowly decreasing in speed
when accessing the GFS2 partition. Some processes on one of the nodes were stuck
in D state.
Version-Release number of selected component (if applicable):
-nmw git tree (latest SHA:
same setup as #203705
Steps to Reproduce:
1. run the same setup
2. processes will get stuck in D state.
See ps ax and dmesg in the attachments.
world domination^W^Wgfs2 shouldn't hang.
Created attachment 142167 [details]
Created attachment 142168 [details]
Forgot the git SHA 3777ae733096259bdfcf0c1e6f3246412aa3a7cd from the original.
Thanks for the bug report. It will take a little while to untangle what's going on,
but it looks like there is enough information here to work it out.
I was talking with Patrick on IRC and he suggested trying CONFIG_DLM_DEBUG with
debugfs mounted somewhere to gather more info. I can run this within the next
day. Is there anything else you would like me to try? Slam some printk in
Well, looking at the traces, I wonder if it's related to memory allocation. One
thing to try is mounting with data=writeback, since that means that there will be
no potential log flush in writepages, which is something which it appears to be
It looks like a lot of the other threads are also waiting on a journal flush
too, so I suspect it's got something to do with that, but the trouble is working
out what is cause and what is effect.
Created attachment 142255 [details]
Created attachment 142256 [details]
Created attachment 142257 [details]
dmesg from the first node hanging
Created attachment 142258 [details]
ps ax from the first hanging node
So I tested three more combinations: the above kernel with data=writeback, and
the dlm_debug-enabled kernel with and without data=writeback. In all cases GFS2
hangs.
gfs2_test_nowritedata.bz2 and gfs2_test_writedata.bz2 from debugfs.
Another thing I noticed is that the first process to hang is gfs2_glockd on one
node; lock_dlm hanging on the other node is a consequence, not the cause.
More files in attachment:
ps ax and dmesg from the first node that hangs
Created attachment 142905 [details]
archkdb cluster file system
At first glance this looks like a deadlock nightmare.
Going to run in non-cluster mode.
I can reproduce this problem easily. Would it help for me to provide you access
to the cluster and SAN? It's not a production environment so we can use it in
whatever way we like.
I am totally lost here as to what the problem is.
This has something to do with the glock code, which I am
not very familiar with.
gfs2 does use a lot more glocks for the same dbench 4 run than gfs1:
bear-02[3:17pm]#gfs_tool lockdump /gfs1_nolock | grep Glock | wc -l
xenon[2:24pm]#grep Glock b03-lockdump-3 | wc -l
Here are a few thoughts which might help to get a better understanding of what's
going on here. Firstly, try turning on all the lock debugging options, including
lockdep (but not debugging of lockdep itself unless you want it to run _really_
slowly :-), and then when it's hung, the "echo t > /proc/sysrq-trigger" output
will include a list of held locks (spinlocks & mutexes) in the system.
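For reference, a kernel .config fragment covering the lock debugging options
mentioned above. The exact option set varies by kernel version, so treat this as
a sketch: CONFIG_PROVE_LOCKING is the lockdep prover, and CONFIG_DEBUG_LOCKDEP
is the self-debugging of lockdep that should stay off.

```text
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCKDEP=y
# CONFIG_DEBUG_LOCKDEP is not set
CONFIG_DLM_DEBUG=y
```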
Out of all the locks reported by the lockdump, most of them will not be
interesting, so the question is how best to narrow down the interesting ones. I
don't think the different numbers of locks reported is significant, since it
only reflects a difference in the amount of caching in each case.
One way to get just what's needed might be to write a routine to iterate over
the locks and report only those related to a particular pid (i.e. having a
holder or waiter gfs2_holder structure pointing to the task in question). That
gives a much shorter list of just the locks related to the stuck processes. You
don't need anything fancy: use a sysfs file, an ioctl, or whatever method is
easiest to pass in the pid, since it's only a temporary thing, and then use the
(modified) existing glock printing code to dump the results via dmesg.
It might not be the best way to do it, but it will probably work.
Created attachment 144819 [details]
Attaching lockdump from the kernel.
Even if the glocks are "just cached", why would gfs2 need
10 times the number of glocks vs. gfs1 for the same dbench run?
The way to approach this is to look for the process holding the rwsem, since
that is the one which most of the other processes are blocked up against. My
initial bet is that it's this one:
[143156.736000] dbench D F35D5D60 0 5859 5858 5860
[143156.744000] f35d5d74 00200086 00000002 f35d5d60 f35d5d5c 00000000
[143156.752000] 00000000 f337051c 00000007 f6146e70 aa987f00 0000007e
[143156.760000] c3821780 00000000 0000ee4e 00000000 c014f8ac c164dc00
[143156.768000] Call Trace:
[143156.772000] [<c03c7205>] wait_for_completion+0x74/0xaa
[143156.776000] [<c0242a66>] glock_wait_internal+0xf2/0x21f
[143156.784000] [<c0242d0d>] gfs2_glock_nq+0x17a/0x1ba
[143156.788000] [<c024492e>] gfs2_change_nlink+0x78/0xaa
[143156.792000] [<c024e8e7>] gfs2_unlink+0xc3/0x100
[143156.796000] [<c0171b8b>] vfs_unlink+0x73/0xaf
[143156.800000] [<c01734ad>] do_unlinkat+0x87/0x112
[143156.808000] [<c0173548>] sys_unlink+0x10/0x12
[143156.812000] [<c0102eac>] syscall_call+0x7/0xb
The reason is that it is doing something which can only be done under a
transaction, and the rwsem in question is sdp->sd_log_flush_lock, which
is taken on each transaction in log.c:gfs2_log_reserve().
So the question is then why this particular process is blocked, apparently
awaiting a reply from the locking module; but since in this case it's nolock,
the chances are that the problem is somewhere in the state machine in glock.c.
I still think that the difference in the number of glocks is a red herring. I'd
be very surprised if the number of glocks used was much different overall. What
I would expect, particularly on a system with not a lot going on other than the
test load on gfs2, is that the glocks would stay around for longer once gfs2
has finished with them. As a result the lockdump will show more of them.
How did you measure the number of glocks? Just through /proc/slabinfo after
running dbench? If so, then that's probably the cause.
Created attachment 146881 [details]
The fix is now in the upstream -nmw git tree. It will filter back into FC and
This appears to be resolved.
Reassigning to Steve so patch can make it around
the tree farm.
Created attachment 152908 [details]
Patch which applies against RHEL 5.1
This is a patch against RHEL 5.1 ported directly from upstream.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.