Red Hat Bugzilla – Bug 200841
rgmanager on cluster hung with "stuck with lock" errors produced for 2+ hours until a reboot
Last modified: 2009-04-16 16:20:46 EDT
Description of problem:
Users are not able to run clustat or bring services up. For 2+ hours,
/var/log/messages repeatedly logged:
Jul 30 16:07:45 flsrv02 clurgmgrd: <warning> NodeID:0000000000000003 stuck with lock usrm::vf
Version-Release number of selected component (if applicable):
How reproducible:
Sometimes. Attempted to stop/restart rgmanager apps on all nodes; the
system got into the same state and one node was fenced. Then another
reboot of all nodes fixed it.
Steps to Reproduce:
Actual results:
rgmanager is in a bad state, producing the above messages over and over.
Expected results:
rgmanager should come up and not get into this state.
Created attachment 133359 [details]
messages from node2
Created attachment 133360 [details]
messages file from node3
What version of rgmanager?
U4pre1 that you provided with the magma changes as well.
Created attachment 133519 [details]
rgmanager we are using
Created attachment 133520 [details]
magma we are using
Created attachment 133521 [details]
magma plugins we are using
These patches came from bz #193128
The "stuck lock" message started *after* the rgmanagers were sent a -9 signal.
We noticed the stop script uses SIGTERM so the daemon can exit gracefully and
clean up, and also noticed the stop script cleans up some lockfiles and pidfiles in
Lon, could this ungraceful way of stopping rgmanager (and then restarting it)
cause the issue? My guess is yes, as it mimics a coredump/bug type scenario
where the app just abruptly exits with no cleanup.
If this is the case, then we induced it here and this is an error in the use model.
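For reference, the difference between the two signals can be shown with a toy daemon. This is only a minimal sketch and not rgmanager code: `run_demo`, the temporary pidfile, and the inline worker are all made up for illustration. The worker removes its pidfile from a SIGTERM handler; SIGKILL (-9) cannot be trapped, so the same worker leaves the file behind.

```shell
#!/bin/sh
# Hypothetical demo, not rgmanager: a toy daemon that cleans up on SIGTERM.

run_demo() {
    sig="$1"                          # signal to send: TERM or KILL
    dir="$(mktemp -d)"                # scratch directory for this run
    pidfile="$dir/demo.pid"           # stands in for a daemon's pidfile

    # Toy daemon: install a TERM handler, write the pidfile, then idle.
    sh -c 'trap "rm -f \"$0\"; exit 0" TERM; echo $$ > "$0"; while :; do sleep 1; done' "$pidfile" &
    pid=$!

    # Wait until the daemon has written its pidfile (handler is installed by then).
    while [ ! -s "$pidfile" ]; do sleep 1; done

    kill -s "$sig" "$pid"
    wait "$pid" 2>/dev/null           # reap the daemon; exit status is not used

    if [ -e "$pidfile" ]; then
        echo "$sig: pidfile left behind"
    else
        echo "$sig: pidfile removed"
    fi
    rm -rf "$dir"
}

run_demo TERM
run_demo KILL
```

On a typical Linux system this should print "TERM: pidfile removed" followed by "KILL: pidfile left behind". It only illustrates userspace cleanup such as lockfiles and pidfiles; whether DLM locks are released after a kill -9 is a separate question.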
The DLM should free up the locks after you kill rgmanager with -9, I should
think... but I could be mistaken on that.
All the locks should be freed if the program is killed.
A dlm lock dump might help to see if anything is left:
echo "lockspace name" >> /proc/cluster/dlm_locks
cat /proc/cluster/dlm_locks > foo.txt
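To collect the same dump from several nodes, the two commands above could be wrapped in a small function. This is a sketch only: `dump_dlm_locks` and its optional third argument are hypothetical, added here so the function can be tried against an ordinary file; the proc path is the one given in this bug.

```shell
#!/bin/sh
# Sketch of the lock-dump steps quoted above, wrapped in a function.
# Usage: dump_dlm_locks LOCKSPACE OUTFILE [PROCFILE]
# PROCFILE defaults to the path from this bug; overriding it is a
# hypothetical hook for trying the function without a cluster.

dump_dlm_locks() {
    lockspace="$1"
    outfile="$2"
    procfile="${3:-/proc/cluster/dlm_locks}"

    if [ -z "$lockspace" ] || [ -z "$outfile" ]; then
        echo "usage: dump_dlm_locks LOCKSPACE OUTFILE [PROCFILE]" >&2
        return 2
    fi
    if [ ! -w "$procfile" ]; then
        echo "cannot write to $procfile" >&2
        return 1
    fi

    # Select the lockspace, then read its lock table back into OUTFILE.
    echo "$lockspace" >> "$procfile" || return 1
    cat "$procfile" > "$outfile"
}
```

For example, as root on each node: `dump_dlm_locks "lockspace name" foo.txt`, with the real lockspace name substituted.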
This could be related to #208968, actually
*** This bug has been marked as a duplicate of 208968 ***