Description of problem:

I helped a community user on #linux-cluster with what started out looking like an rgmanager problem, but ended up looking very much like a DLM bug.

* rgmanager-2.0.38-2.el5_2.1
* kernel-2.6.18-92.1.1.el5xen in Xen domU (all cluster nodes are domU)

Clustat (the rgmanager utility that reports on running services) was timing out. In the past, this has been caused by a number of things.
Created attachment 311207
Backtrace of rgmanager

Thread 8 is stuck waiting for a reply from the DLM.
Created attachment 311208
debugfs DLM information on the rgmanager lockspace

This is from all 4 nodes. Several nodes are looking for the master of the "usrm::vf" lock, but no node reports being the master.
As requested, group_tool -v on all nodes. Since it sounds like we'll be doing some detailed troubleshooting on this, we might as well use actual node names.

# Node: Chico. Status: Functional.
type             level name       id       state node id local_done
fence            0     default    00010002 none
[1 2 3 4]
dlm              1     rgmanager  00010001 none
[1 2 3 4]

# Node: Zeppo. Status: Functional.
type             level name       id       state node id local_done
fence            0     default    00010002 none
[1 2 3 4]
dlm              1     rgmanager  00010001 none
[1 2 3 4]

# Node: Harpo. Status: Functional. /sys/kernel/debug/dlm/rgmanager* was empty
type             level name       id       state node id local_done
fence            0     default    00010002 none
[1 2 3 4]
dlm              1     rgmanager  00010001 none
[1 2 3 4]

# Node: Groucho. Status: rgmanager hosed.
type             level name       id       state node id local_done
fence            0     default    00010002 none
[1 2 3 4]
dlm              1     rgmanager  00010001 none
[1 2 3 4]
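
For anyone repeating this comparison on a larger cluster, here is a minimal Python sketch of checking that every node reports the same groups and members. It is not part of rgmanager or the cluster tools. The helper names (parse_group_tool, disagreements) are made up for illustration, and the parser assumes the layout is exactly what was pasted above: one row per group, followed by a bracketed member list.

```python
import re

# Group row, e.g. "dlm 1 rgmanager 00010001 none".
# The header line does not match because its second field is not a number.
ROW = re.compile(r"^(\w+)\s+(\d+)\s+(\S+)\s+([0-9a-fA-F]{8})\s+(\S+)\s*$")
# Member list, e.g. "[1 2 3 4]".
MEMBERS = re.compile(r"^\[([\d\s]*)\]$")


def parse_group_tool(text):
    """Return {(type, name): (id, state, members)} from pasted group_tool -v text."""
    groups = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        row = ROW.match(line)
        if row:
            gtype, _level, name, gid, state = row.groups()
            current = (gtype, name)
            groups[current] = (gid, state, ())
            continue
        members = MEMBERS.match(line)
        if members and current is not None:
            gid, state, _ = groups[current]
            node_ids = tuple(int(n) for n in members.group(1).split())
            groups[current] = (gid, state, node_ids)
    return groups


def disagreements(per_node):
    """Return the group keys whose parsed view differs between nodes."""
    keys = set()
    for groups in per_node.values():
        keys.update(groups)
    return sorted(
        key for key in keys
        if len({groups.get(key) for groups in per_node.values()}) > 1
    )


# Output as pasted above; it was identical on all four nodes.
SAMPLE = """\
type             level name       id       state node id local_done
fence            0     default    00010002 none
[1 2 3 4]
dlm              1     rgmanager  00010001 none
[1 2 3 4]
"""

per_node = {
    name: parse_group_tool(SAMPLE)
    for name in ("Chico", "Zeppo", "Harpo", "Groucho")
}
print(disagreements(per_node))
```

With the output above this prints an empty list: all four nodes agree on membership of both the fence domain and the rgmanager lockspace, even though rgmanager is hung on Groucho. So group membership alone does not explain the hang.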