Bug 217356
| Summary: | GFS2 hangs waiting for semaphore? |
| --- | --- |
| Product: | Red Hat Enterprise Linux 5 |
| Component: | kernel |
| Version: | 5.1 |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | medium |
| Reporter: | Fabio Massimo Di Nitto <fdinitto> |
| Assignee: | Don Zickus <dzickus> |
| QA Contact: | GFS Bugs <gfs-bugs> |
| CC: | ccaulfie, kanderso, lwang, rkenna, swhiteho |
| Hardware: | All |
| OS: | Linux |
| Fixed In Version: | RHBA-2007-0959 |
| Doc Type: | Bug Fix |
| Last Closed: | 2007-11-07 19:15:32 UTC |
| Bug Depends On: | 230693 |
Description
Fabio Massimo Di Nitto, 2006-11-27 13:29:48 UTC
Created attachment 142167 [details]
dmesg
Created attachment 142168 [details]
ps ax
Forgot the git SHA 3777ae733096259bdfcf0c1e6f3246412aa3a7cd from the original. Cheers, Fabio

Thanks for the bug report. It will take a little while to untangle what's going on, but it looks like there is enough information here to work it out.

I was talking with Patrick on IRC and he suggested trying CONFIG_DLM_DEBUG plus debugfs mounted somewhere to gather more info. I can run this within the next day. Is there anything else you would like me to try? Slam some printk calls into lock_dlm? Fabio

Well, looking at the traces, I wonder if it's related to memory allocation. One thing to try is mounting with data=writeback, since that means there will be no potential log flush in writepages, which is something it appears to be waiting for. It looks like a lot of the other threads are also waiting upon a journal flush too, so I suspect it's got something to do with that, but the trouble is working out what is cause and what is effect.

Created attachment 142255 [details]
gfs2_test_nowriteback

Created attachment 142256 [details]
gfs2_test_writeback

Created attachment 142257 [details]
dmesg from the first node hanging

Created attachment 142258 [details]
ps ax from the first hanging node
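The debug setup discussed above (CONFIG_DLM_DEBUG with debugfs mounted) leaves per-lockspace debug files that are worth capturing and attaching to a report like this one. A minimal collection sketch, assuming the conventional /sys/kernel/debug mount point and a dlm/ subdirectory for the lockspace files (both are assumptions, the exact layout varies by kernel version, and snapshot_dlm_debug is a hypothetical helper, not a shipped tool):

```shell
# Hedged sketch: snapshot the DLM debug files exposed via debugfs so they
# can be compressed and attached to the bug report.
# On a live system, debugfs is typically mounted first with:
#   mount -t debugfs none /sys/kernel/debug
snapshot_dlm_debug() {
    root="$1"   # debugfs mount point, e.g. /sys/kernel/debug
    out="$2"    # directory to collect the files into
    mkdir -p "$out"
    for f in "$root"/dlm/*; do
        [ -f "$f" ] || continue          # skip if the glob matched nothing
        cp "$f" "$out/$(basename "$f")"  # one file per lockspace
    done
}
```

Capturing the files once per hang, before rebooting the node, keeps successive runs (e.g. with and without data=writeback) comparable.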
So I tested three more combinations: the above kernel with data=writeback, and the dlm_debug-enabled kernel both with and without data=writeback; in all cases gfs2 hangs. In the attachments: gfs2_test_nowritedata.bz2 and gfs2_test_writedata.bz2 from debugfs.

Another thing I noticed is that the first process to hang is gfs2_glockd on one node. lock_dlm hanging on the other node is a consequence, not the cause. More files in the attachments: ps ax and dmesg from the first node that hangs.

Created attachment 142905 [details]
archkdb cluster file system
At first glance this looks like a deadlock nightmare.

Going to run in non-cluster mode.

Russel, I can reproduce this problem easily. Would it help for me to provide you access to the cluster and SAN? It's not a production environment, so we can use it in whatever way we like. Fabio

I am totally lost here as to what the problem is. This has something to do with the glock code, which I am not very familiar with. gfs2 uses a lot more glocks for the same dbench 4 run than gfs1 does:

gfs1:
bear-02[3:17pm]# gfs_tool lockdump /gfs1_nolock | grep Glock | wc -l
16138

gfs2:
xenon[2:24pm]# grep Glock b03-lockdump-3 | wc -l
125035

Suggestions welcome.

Here are a few thoughts which might help to get a better understanding of what's going on here. Firstly, try turning on all the lock debugging options, including lockdep (but not debugging of lockdep itself unless you want it to run _really_ slowly :-) and then, when it's hung, the "echo t > /proc/sysrq-trigger" output will include a list of held locks (spinlocks & mutexes) in the system.

Out of all the locks reported by the lockdump, most will not be interesting, so the question is how best to narrow down the interesting ones. I don't think the different numbers of locks reported is significant, since it only reflects a difference in the amount of caching in each case.

One way to get just what's needed might be to write a routine to iterate over the locks and report only those related to a particular pid (i.e. those which have a holder or waiter[123] gfs2_holder structure pointing to the task in question). Then you can get a list of the locks related to the stuck processes, which will be a much shorter list. You don't need anything fancy: just use a sysfs file to pass the pid, or an ioctl, or whatever method is easiest, since it's only a temporary thing, and then the (modified) existing glock printing code to dump the results via dmesg. It might not be the best way to do it, but it will probably work.

Created attachment 144819 [details]
system lockdump
Attaching lockdump from the kernel.
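The per-pid filtering suggested above is described as a temporary kernel-side hack, but a rough first pass can also be done in userspace against a saved lockdump. A sketch, assuming blank-line-separated glock records whose holder lines read "owner = <pid>" (the field name is an assumption, the dump format varies between GFS2 versions, and filter_glock_dump is a hypothetical helper, not an existing tool):

```shell
# Hedged sketch: print only the glock records from a saved lockdump that
# name a given pid as owner. Assumes records are separated by blank lines.
filter_glock_dump() {
    pid="$1"; dump="$2"
    awk -v pid="$pid" '
        BEGIN { RS = ""; ORS = "\n\n" }               # paragraph mode: one record per glock
        $0 ~ ("owner = " pid "([^0-9]|$)") { print }  # exact-pid match, no prefix hits
    ' "$dump"
}
```

This cuts a 125,000-entry dump down to the handful of glocks tied to the stuck processes, in the spirit of the sysfs/ioctl routine proposed in the comment above.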
Even if the glocks are "just cached", why would gfs2 need ten times as many glocks as gfs1 for the same dbench run?

The way to approach this is to look for the process which is holding the rwsem, since that is the one most of the other processes are blocked up against. My initial bet is that it's this one:

[143156.736000] dbench D F35D5D60 0 5859 5858 5860 (NOTLB)
[143156.744000] f35d5d74 00200086 00000002 f35d5d60 f35d5d5c 00000000 f35d5e3c f337051c
[143156.752000] 00000000 f337051c 00000007 f6146e70 aa987f00 0000007e 00000000 f6146f94
[143156.760000] c3821780 00000000 0000ee4e 00000000 c014f8ac c164dc00 00000000 c164dc00
[143156.768000] Call Trace:
[143156.772000] [<c03c7205>] wait_for_completion+0x74/0xaa
[143156.776000] [<c0242a66>] glock_wait_internal+0xf2/0x21f
[143156.784000] [<c0242d0d>] gfs2_glock_nq+0x17a/0x1ba
[143156.788000] [<c024492e>] gfs2_change_nlink+0x78/0xaa
[143156.792000] [<c024e8e7>] gfs2_unlink+0xc3/0x100
[143156.796000] [<c0171b8b>] vfs_unlink+0x73/0xaf
[143156.800000] [<c01734ad>] do_unlinkat+0x87/0x112
[143156.808000] [<c0173548>] sys_unlink+0x10/0x12
[143156.812000] [<c0102eac>] syscall_call+0x7/0xb

The reason is that it is doing something which can only be done under a transaction, and the rwsem in question is sdp->sd_log_flush_lock, which is taken on each transaction in log.c:gfs2_log_reserve().

So the question is why this particular process is blocked, apparently awaiting a reply from the locking module; but since in this case it's nolock, the chances are that the problem is somewhere in the state machine in glock.c.

I still think that the difference in the number of glocks is a red herring. I'd be very surprised if the number of glocks used was much different overall. What I would expect, particularly on a system with little going on other than the test load on gfs2, is that the glocks would stay around for longer once gfs has finished with them. As a result the lockdump will show more of them.

How did you measure the number of glocks? Just through /proc/slabinfo after running dbench? If so, then that's probably the cause.

Created attachment 146881 [details]
glock re-ordering
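The rwsem-holder hunt described above, digging through "echo t > /proc/sysrq-trigger" output for tasks stuck in a particular function, can be partly automated against a saved dump. A sketch, assuming the 2.6-era format where each task starts with a header line carrying its state letter (" D " for uninterruptible sleep) and trace lines contain "function+0xoffset" (tasks_blocked_in is a hypothetical helper name, not an existing tool):

```shell
# Hedged sketch: list the task-header line for every task whose stack
# trace mentions a given function, from a saved sysrq-t dump.
tasks_blocked_in() {
    fn="$1"; dump="$2"
    awk -v fn="$fn" '
        / [DRSZ] /            { hdr = $0 }   # remember the latest task header
        index($0, fn "+") > 0 { print hdr }  # a trace line names the function
    ' "$dump"
}
```

For example, `tasks_blocked_in glock_wait_internal sysrq.txt` would pick out tasks such as the dbench process in the trace above, without wading through every idle kernel thread.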
The fix is now in the upstream -nmw git tree. It will filter back into FC and RHEL shortly.

This appears to be resolved. Reassigning to Steve so the patch can make it around the tree farm.

Created attachment 152908 [details]
Patch which applies against RHEL 5.1

This is a patch against RHEL 5.1, ported directly from upstream.

In 2.6.18-17.el5.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html