Bug 454355 - dlm: 3 nodes looking for a lock which does not exist?
Summary: dlm: 3 nodes looking for a lock which does not exist?
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Target Release: 5.5
Assignee: David Teigland
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-07-07 21:57 UTC by Lon Hohberger
Modified: 2018-10-20 03:18 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-09-02 21:13:09 UTC
Target Upstream Version:
Embargoed:


Attachments
Backtrace of rgmanager (9.00 KB, text/plain)
2008-07-07 21:58 UTC, Lon Hohberger
debugfs DLM information on the rgmanager lockspace (1.54 KB, text/plain)
2008-07-07 21:59 UTC, Lon Hohberger

Description Lon Hohberger 2008-07-07 21:57:13 UTC
Description of problem:

I helped a community user on #linux-cluster troubleshoot what started out
looking like an rgmanager problem, but ended up looking very much like a DLM bug.

* rgmanager-2.0.38-2.el5_2.1
* kernel-2.6.18-92.1.1.el5xen in Xen domU 

(All cluster nodes are domU)

Clustat (rgmanager utility to get info about running services) was timing out. 
In the past, this has been caused by a number of things.

Comment 1 Lon Hohberger 2008-07-07 21:58:12 UTC
Created attachment 311207
Backtrace of rgmanager

Thread 8 is stuck waiting for a reply from the DLM.

Comment 2 Lon Hohberger 2008-07-07 21:59:18 UTC
Created attachment 311208
debugfs DLM information on the rgmanager lockspace

This is from all 4 nodes.  Several are looking for the master holder of the
"usrm::vf" lock.  None are reported to be the master.

Comment 3 Chris St. Pierre 2008-07-07 22:03:44 UTC
As requested, group_tool -v on all nodes.

Since it sounds like we'll be doing some detailed troubleshooting on this, might 
as well use actual node names.

# Node: Chico.  Status: Functional.
type             level name       id       state node id local_done
fence            0     default    00010002 none
[1 2 3 4]
dlm              1     rgmanager  00010001 none
[1 2 3 4]

# Node: Zeppo.  Status: Functional.
type             level name       id       state node id local_done
fence            0     default    00010002 none
[1 2 3 4]
dlm              1     rgmanager  00010001 none
[1 2 3 4]

# Node: Harpo.  Status: Functional.  /sys/kernel/debug/dlm/rgmanager* was empty
type             level name       id       state node id local_done
fence            0     default    00010002 none
[1 2 3 4]
dlm              1     rgmanager  00010001 none
[1 2 3 4]

# Node: Groucho.  Status: rgmanager hosed.
type             level name       id       state node id local_done
fence            0     default    00010002 none
[1 2 3 4]
dlm              1     rgmanager  00010001 none
[1 2 3 4]
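
The four listings above can be compared mechanically rather than by eye. What follows is a minimal sketch, not part of any cluster tooling: it assumes the text layout is exactly what is pasted in this comment (a header row, one row per group, then a bracketed member list on the next line), and the function names `parse_group_tool` and `find_disagreements` are hypothetical helpers invented for illustration.

```python
import re


def parse_group_tool(text):
    """Parse pasted `group_tool -v` text into {(type, name): info}.

    Assumes the layout shown in this report: a group row followed by a
    bracketed list of member node ids on the next line.
    """
    groups = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, the "# Node: ..." annotations, and the header row.
        if not line or line.startswith("#") or line.startswith("type"):
            continue
        members = re.fullmatch(r"\[([\d\s]*)\]", line)
        if members:
            if current is not None:
                groups[current]["members"] = [int(n) for n in members.group(1).split()]
            continue
        fields = line.split()
        if len(fields) >= 5:
            gtype, level, name, gid, state = fields[:5]
            current = (gtype, name)
            groups[current] = {"level": int(level), "id": gid,
                               "state": state, "members": []}
    return groups


def find_disagreements(per_node):
    """Return (group, field, {node: value}) for every field the nodes disagree on."""
    all_groups = set()
    for groups in per_node.values():
        all_groups.update(groups)
    problems = []
    for key in sorted(all_groups):
        for field in ("id", "state", "members"):
            values = {node: groups.get(key, {}).get(field)
                      for node, groups in per_node.items()}
            if len({repr(v) for v in values.values()}) > 1:
                problems.append((key, field, values))
    return problems


# The listing is identical on all four nodes in this report.
SAMPLE = """\
type             level name       id       state node id local_done
fence            0     default    00010002 none
[1 2 3 4]
dlm              1     rgmanager  00010001 none
[1 2 3 4]
"""

if __name__ == "__main__":
    nodes = {name: parse_group_tool(SAMPLE)
             for name in ("Chico", "Zeppo", "Harpo", "Groucho")}
    print(find_disagreements(nodes))
```

For the listings in this comment the result is an empty list: all four nodes report the same group ids, the same state (none), and the same members [1 2 3 4].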


