Description of problem:
On one node, if you run 'clusvcadm -r foo' in a loop, it bounces the service back and forth between nodes. On one of the nodes the service is using, if you run 'while : ; do service rgmanager start; sleep 60; service rgmanager stop; done', you will eventually get this:

Jul 6 13:57:55 lisa rgmanager: [17227]: <notice> Shutting down Cluster Service Manager...
Jul 6 13:57:55 lisa clurgmgrd[16434]: <notice> Shutting down
Jul 6 13:57:56 lisa clurgmgrd[16434]: <notice> Shutdown complete, exiting
Jul 6 13:57:56 lisa kernel: clurgmgrd[17239]: segfault at 0000000000000000 rip 0000000000415cd8 rsp 0000000044605f30 error 4
Jul 6 13:57:56 lisa kernel: dlm: rgmanager: group leave failed -512 0
Jul 6 13:57:56 lisa clurgmgrd[16433]: <crit> Watchdog: Daemon died, rebooting...
Jul 6 13:57:56 lisa dlm_controld[3676]: open "/sys/kernel/dlm/rgmanager/control" error -1 2
Jul 6 13:57:56 lisa dlm_controld[3676]: open "/sys/kernel/dlm/rgmanager/event_done" error -1 2
Jul 6 13:57:56 lisa kernel: md: stopping all md devices.
Jul 6 13:57:57 lisa kernel: Synchronizing SCSI cache for disk sda:

Version-Release number of selected component (if applicable):
5.1 beta

How reproducible:
difficult
I think this can be solved by tracking all threads (even simple ones) and making sure they're cleaned up in the exit path. I will test this soon.
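To illustrate the idea of tracking every thread and cleaning up in the exit path, here is a minimal sketch using POSIX threads. It is not rgmanager's actual code: the names tracked_thread_create, tracked_thread_join_all, and example_worker are hypothetical, and it assumes each tracked thread returns on its own or has already been told to stop before the exit path runs.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical registry entry: one per thread the daemon has spawned. */
struct tracked_thread {
    pthread_t tid;
    struct tracked_thread *next;
};

static struct tracked_thread *thread_list = NULL;
static pthread_mutex_t thread_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Create a joinable thread and record it in the registry.
 * Returns 0 on success, nonzero on failure. */
int tracked_thread_create(void *(*fn)(void *), void *arg)
{
    struct tracked_thread *t = malloc(sizeof(*t));
    int rc;

    if (t == NULL)
        return -1;

    /* Hold the lock across creation so the exit path never misses a thread. */
    pthread_mutex_lock(&thread_list_lock);
    rc = pthread_create(&t->tid, NULL, fn, arg);
    if (rc == 0) {
        t->next = thread_list;
        thread_list = t;
    }
    pthread_mutex_unlock(&thread_list_lock);

    if (rc != 0)
        free(t);
    return rc;
}

/* Exit path: join every recorded thread so none is still running
 * when shared state is torn down. Returns the number of threads joined. */
int tracked_thread_join_all(void)
{
    int joined = 0;

    for (;;) {
        struct tracked_thread *t;

        /* Pop one entry under the lock, then join without holding it. */
        pthread_mutex_lock(&thread_list_lock);
        t = thread_list;
        if (t != NULL)
            thread_list = t->next;
        pthread_mutex_unlock(&thread_list_lock);

        if (t == NULL)
            break;

        pthread_join(t->tid, NULL);
        free(t);
        joined++;
    }
    return joined;
}

/* Illustrative worker (assumption): sets a flag and returns immediately. */
static void *example_worker(void *arg)
{
    *(int *)arg = 1;
    return NULL;
}
```

In a daemon, the shutdown routine would call tracked_thread_join_all() after signalling workers to stop and before freeing any data those workers touch, so that no thread can dereference state that has already been released.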
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Test setup:
* 5 node cluster
* 2 exclusive services (test1, test2)

Reproduce case:
* on node 1: while :; do clusvcadm -r test1; done
* on node 2: while :; do clusvcadm -r test2; done
* on node 3 (**): while :; do service rgmanager stop; service rgmanager start; sleep 30; done

**: This needs to be one of the nodes the service is hitting.
Patches in RHEL5, RHEL51, head.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0580.html