/*
 * If we failed to stop the service, we're done. At this
 * point, we can't determine the service's status - so
 * trying to start it on other nodes is right out.
 */
return ABORT;
...

ABORT = service failed to stop (can happen because of a simple lock failure, or it can also indicate a fatal problem with the service or file systems). Manual intervention is required. Relocating after anything *but* a lock failure can increase the risk of data/filesystem corruption.

I think the correct thing to do here is to wait for all the forked child processes to return before letting the main clusvcmgrd exit (and for some reason, I thought it did this correctly). If there's a clusvcmgrd process still running / recovering after cluquorumd has exited, that's definitely a problem.
Created attachment 131299 [details]
Waits for all children before notifying the quorum daemon that we're exiting.

This should fix it in a rather robust fashion. Taking the lock should no longer fail (the failure that caused the service to return ABORT), because clusvcmgrd won't tell cluquorumd it's ready to exit before services have completed the operations they were performing at the time of shutdown.
FWIW, this is what rgmanager on RHEL4 does (except rgmanager has per-service event queues).
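For illustration only, the ordering described above (reap every forked child first, then tell the quorum daemon we are exiting) could be sketched roughly as follows. This is not the code from the attachment: wait_for_all_children(), notify_quorum_daemon_exiting() and shutdown_after_children() are hypothetical names, and the notification is just a message on stderr standing in for whatever clusvcmgrd actually sends to cluquorumd.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: block until every child of this process has
 * exited. Returns the number of children reaped, or -1 on an
 * unexpected waitpid() error.
 */
static int wait_for_all_children(void)
{
    int reaped = 0;

    for (;;) {
        pid_t pid = waitpid(-1, NULL, 0);

        if (pid > 0) {
            reaped++;           /* one more child has finished */
            continue;
        }
        if (errno == EINTR)
            continue;           /* interrupted by a signal; retry */
        if (errno == ECHILD)
            return reaped;      /* no children left to wait for */
        return -1;
    }
}

/* Hypothetical stand-in for the real message to the quorum daemon. */
static void notify_quorum_daemon_exiting(void)
{
    fputs("all children done; notifying quorum daemon\n", stderr);
}

/*
 * Demo of the ordering: fork nchildren workers (standing in for
 * per-service operations in flight at shutdown), wait for all of
 * them, and only then send the exit notification.
 */
static int shutdown_after_children(int nchildren)
{
    int i, reaped;

    assert(nchildren >= 0);

    for (i = 0; i < nchildren; i++) {
        pid_t pid = fork();

        if (pid < 0)
            break;              /* stop forking; still reap the rest */
        if (pid == 0)
            _exit(0);           /* child: pretend the operation is done */
    }

    reaped = wait_for_all_children();
    if (reaped >= 0)
        notify_quorum_daemon_exiting();
    return reaped;
}
```

Calling shutdown_after_children(3) from a main() forks three children and returns only after all of them have been reaped, so the notification can never race with a service operation that is still running.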
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0505.html