Bug 498976
| Summary: | GFS2 - probably lost glock call back | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Eduardo Damato <edamato> |
| Component: | kernel | Assignee: | Steve Whitehouse <swhiteho> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | high | Priority: | high |
| Version: | 5.3 | Target Milestone: | rc |
| Target Release: | --- | Hardware: | All |
| OS: | Linux | Doc Type: | Bug Fix |
| CC: | adas, bmarzins, ccaulfie, cward, dzickus, hjia, jruemker, jtluka, liko, lwang, mherbert, mrappa, ricardo.arguello, riek, rpeterso, tao, teigland | | |
| Clone Of: | | Cloned to: | 546279 (view as bug list) |
| Last Closed: | 2010-03-30 07:27:34 UTC | Regression: | --- |
| Bug Blocks: | 499522, 525215, 526947, 533192, 546279 | | |
Description
Eduardo Damato, 2009-05-04 16:18:24 UTC
Looks like it's probably a missing call back to me; I need to investigate further to be certain. The attached data doesn't include the glock or the dlm lock dumps. That information (just the raw dumps from the debugfs files) is required in order to work out what is going on here. Since the glock is missing from the hangalyser trace, it suggests that gfs2 and the dlm have confused each other, so the dlm dumps are a vital piece of the jigsaw.

Here is the dlm master copy of the lock in question:

```
Resource f7b672c0 Name (len=24) " 2 18b36cb"
Master Copy
Granted Queue
0337017d EX Remote: 3 02de000a
Conversion Queue
Waiting Queue
03ea0198 -- (PR)
02330164 -- (PR) Remote: 2 036e0051
```

This indicates that node 1 should have sent a call back to node 3. On node 3 we have:

```
G:  s:EX n:2/18b36cb f:y t:EX d:EX/0 l:0 a:0 r:3
 I: n:5103274/25900747 t:4 f:0x00000010
```

i.e. a glock which claims to have received no callback. So somewhere along the line, we've lost the call back. The question is why and where. There are three possible causes: (a) the dlm never sent a callback; (b) gfs2 ignored it for some reason; or (c) the message got lost or reordered in the lock_dlm module.

What I'm going to suggest at this stage is that, if it's possible to reproduce without too much difficulty, you should collect tcpdumps of the dlm traffic at each node. In the end, we are only interested in the last few dlm packets which relate to the lock in question (which is the directory inode lock) on the node where the lock is being cached (i.e. node 3 in the above traces). You'll have to capture all the traffic on all the nodes, though, to be sure to catch it. I'd suggest setting the snaplen to 1500 in case it chops off the end of any packets. Chrissie and/or Dave should be able to help with setting up tcpdump to capture the packets. There is also a tool which can analyse the dumps and turn them into something meaningful, but I forget its name.
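The by-hand check above (an EX grant on one node while other nodes sit on the waiting queue) can be sketched mechanically. This is a minimal, hypothetical Python parser; the dump layout is approximated from the excerpt quoted in this bug, and the real debugfs format may differ in spacing and detail:

```python
# Hypothetical sketch: scan a dlm resource dump for the lost-callback
# signature described above -- a resource whose granted queue holds an
# EX lock while the waiting queue is non-empty, meaning a callback
# should have been sent to the EX holder. Format approximated from the
# excerpt in this bug, not taken from the real debugfs layout.

def find_stuck_resources(dump):
    stuck, resource, queue = [], None, None
    granted_ex, waiting = False, []

    def flush():
        # Record the previous resource if it shows the signature.
        if resource and granted_ex and waiting:
            stuck.append((resource, list(waiting)))

    for raw in dump.splitlines():
        line = raw.strip()
        if line.startswith("Resource"):
            flush()
            resource, queue, granted_ex, waiting = line, None, False, []
        elif line.endswith("Queue"):
            queue = line.split()[0]   # "Granted", "Conversion" or "Waiting"
        elif line and queue == "Granted" and " EX " in f" {line} ":
            granted_ex = True
        elif line and queue == "Waiting":
            waiting.append(line)
    flush()
    return stuck

# The excerpt from this bug, as sample input:
sample = '''Resource f7b672c0 Name (len=24) " 2 18b36cb"
Master Copy
Granted Queue
0337017d EX Remote: 3 02de000a
Conversion Queue
Waiting Queue
03ea0198 -- (PR)
02330164 -- (PR) Remote: 2 036e0051'''

stuck = find_stuck_resources(sample)
```

Run against the excerpt, `stuck` holds the f7b672c0 resource with its two waiting-queue entries, matching the hand analysis.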
Again, Chrissie and/or Dave should be able to help. If we can prove whether the callback is making it across the wire to the stuck node, that cuts down the possibilities. If it is making it across the wire, then it will be much harder to trace what is going on, unfortunately. I'll give that issue some thought in the meantime.

wireshark is the network protocol analyser I was thinking of in comment #22.

Created attachment 344053 [details]
Instrumentation patch
This patch to the 2.6.18-128 kernel adds instrumentation to lock_dlm
so that it will tell us more about what happened in the last 50 lock
"events", whether the locks were sent to the dlm, whether they came
back, etc. This should hopefully tell us whether gfs2, dlm or
lock_dlm is at fault.
Once the system is hung, have them run a command like this:

```
cat /sys/fs/gfs2/bobs_roth\:roth2/lock_module/lock_hist
```

where "bobs_roth\:roth2" should be replaced by the lock table name as
returned by this command: "gfs2_tool sb /dev/their/device table"
We might want to capture this file from all nodes.
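For the collection step, a small hedged sketch of the path logic. `lock_hist_path` and `read_lock_hist` are hypothetical helper names, not part of any real tool; only the sysfs path layout comes from the cat command above:

```python
import os

# Hypothetical helpers: build the lock_hist sysfs path from the
# lock-table name reported by "gfs2_tool sb <device> table" and read
# it if the file exists on this node.

def lock_hist_path(table):
    # table looks like "bobs_roth:roth2"; the colon is literal in the
    # sysfs path -- the backslash in the cat example is only shell
    # escaping.
    return "/sys/fs/gfs2/%s/lock_module/lock_hist" % table

def read_lock_hist(table):
    # Returns None when the file is absent (e.g. the instrumentation
    # patch is not applied on this node).
    path = lock_hist_path(table)
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return f.read()
```

Running something like this (or the plain cat command) on every node and diffing the histories is the point of capturing the file cluster-wide.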
Created attachment 344859 [details]
Faster Patch
This patch removes the spin_lock from the lock history instrumentation.
This will hopefully bring the speed/performance back to normal, and
maybe we'll be able to recreate the problem more easily this way.
Unfortunately, that comes at the expense of possible corruption of the
history information we're trying to gather. So we still might not be
able to tell what's going on, but it's worth a try to see what we get.
I'm waiting to get the results from the previous patch. Until then, I'm putting this bug into NEEDINFO.

Unfortunately, the gfs2_hangalyzer output is zero length, so I had to do the analysis by hand. There is only one glock being waited on in this case, 2/ac811, and here are the references to it.

From node 3:

```
G:  s:UN n:2/ac811 f:l t:SH d:EX/0 l:0 a:0 r:12
 H: s:SH f:aW e:0 p:13123 [mv] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:13124 [mv] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:13125 [mv] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:13126 [mv] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:13127 [mv] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:13129 [mv] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:13131 [mv] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:18092 [python] gfs2_permission+0x69/0xb4 [gfs2]
 H: s:SH f:aW e:0 p:18066 [python] gfs2_permission+0x69/0xb4 [gfs2]
```

From node 2:

```
G:  s:EX n:2/ac811 f:y t:EX d:EX/0 l:0 a:0 r:3
 I: n:12563622/706577 t:4 f:0x00000010
```

Unfortunately, the glock trace information does not contain this glock, so it doesn't tell us any more than we already knew. Can we get a recreation where the glock trace more closely matches the hang (i.e. taken soon after the hang)?

Created attachment 360660 [details]
Potential fix
I have no way to prove whether this patch fixes this bug or not. On the other hand, the code looks wrong and there is almost certainly some kind of issue here. The patch is against upstream, but as usual RHEL is almost identical.
I'm intending to put this patch upstream anyway, subject to it not causing any issues during testing. If we can prove that it fixes this bug, then we can put it in RHEL too.
I've given it some brief testing upstream, but it needs more exposure before I can be confident about it.
Created attachment 360673 [details]
Potential fix (RHEL5 port)
This is the RHEL5 version of the previously attached upstream patch.
in kernel-2.6.18-175.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.

~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner. Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.