Description of problem:
When a node running a cluster service is rebooted, it does not go down; rgmanager gets stuck in a loop trying to bring down the clurgmgrd processes.

Version-Release number of selected component (if applicable):
rgmanager-1.9.69-2
cman-1.0.17-0
ccs-1.0.10-0
kernel-smp-2.6.9-55.0.2.EL

How reproducible:
Always.

Steps to Reproduce:
1. Create a 3-node cluster with 2 services running exclusively on 2 of the nodes. The third node is a spare.
2. Reboot one of the nodes running a service (sds1). Its log shows rgmanager trying to relocate a "failed" service (sds2), which is actually running on the other node and in the started state.
3. The node will not go down. It gets stuck trying to bring down rgmanager, and the log shows:
   clurgmgrd[4627]: <err> #61: Invalid reply from member 3 during relocate operation!
4. If we issue another reboot, or restart rgmanager on the spare node, the node goes down and the service fails over to the spare node.

Note: if the node is NOT logging the following prior to the reboot, it always goes down:
clurgmgrd[4627]: <err> #61: Invalid reply from member 3 during relocate operation!

Actual results:
The node does not go down; rgmanager is stuck in a loop.

Expected results:
The node goes down, and the service it was running is moved to the healthy node.

Additional info:
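For reference, a minimal cluster.conf along the lines described in the steps above might look like the following. This is an illustrative sketch only -- node and service names are placeholders, not the customer's actual configuration (see the attached conf file for that), and fencing is omitted for brevity:

    <?xml version="1.0"?>
    <cluster name="example" config_version="1">
        <clusternodes>
            <clusternode name="node1" votes="1"/>
            <clusternode name="node2" votes="1"/>
            <clusternode name="node3" votes="1"/> <!-- spare -->
        </clusternodes>
        <cman/>
        <rm>
            <!-- exclusive="1": each service refuses to share a node
                 with another service, so only one of sds1/sds2 can
                 run per node -->
            <service name="sds1" autostart="1" exclusive="1"/>
            <service name="sds2" autostart="1" exclusive="1"/>
        </rm>
    </cluster>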
Created attachment 244631 [details]
cluster conf file.
Created attachment 244641 [details]
clustat output
Raja,

*** Additional comments from the customer ***

I tried the name change for the nodes as we discussed yesterday and it didn't make a difference. The other symptom I'm seeing, if you want to add it to the bugzilla, is that when I try to disable the services, the first one is disabled without a problem, but the second service says "stopping", followed by "starting" and "started". A second request to disable the service finally disables it (and I'm talking about the "healthy" service, not the one with the problem). I've tried this with both the GUI and the command-line approach.
Try:

http://people.redhat.com/lhh/rgmanager-1.9.69-2.1lhh.i386.rpm
http://people.redhat.com/lhh/rgmanager-1.9.69-2.1lhh.src.rpm
Created attachment 250901 [details]
Potential fix

I think that what is happening isn't related to rebooting at all:

Node A starts.
Node B says "hey, I think node A is a better node for this service foo".
Node B stops service foo.
Node A starts service foo.
Node B tells node A to start service foo.
Node A says "It's already running".
Node B says "I don't know what that means."
Node B tells node A to start service foo.
Node A says "It's already running".
... repeat.
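The loop, compiled down to a minimal C illustration (hedged: the reply names and the request_start() helper are invented for this sketch, not rgmanager's real message codes):

    /* Sketch of the relocate ping-pong described above. Reply names
     * and helpers are invented for illustration only. */
    #include <stdio.h>

    enum reply { REPLY_SUCCESS, REPLY_FAIL, REPLY_ALREADY_RUNNING };

    /* Stand-in for "send a start request to a member, read reply".
     * Here the target always answers "it's already running". */
    static enum reply request_start(int member)
    {
        (void)member;
        return REPLY_ALREADY_RUNNING;
    }

    int main(void)
    {
        int member = 3;
        int i;

        /* Bounded to three rounds for demonstration; without the
         * bound this spins forever -- which is exactly the hang. */
        for (i = 0; i < 3; i++) {
            switch (request_start(member)) {
            case REPLY_SUCCESS:
                return 0;   /* relocated, done */
            case REPLY_FAIL:
                break;      /* would try the next member */
            default:
                /* Unhandled reply: log it and ask the same member
                 * again. */
                printf("#61: Invalid reply from member %d during "
                       "relocate operation!\n", member);
            }
        }
        return 1;
    }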
[lhh@people public_html]$ md5sum rgmanager-1.9.69-2.2lhh.i386.rpm
d09ce9850fcafabbc6f2617d23e17e8c  rgmanager-1.9.69-2.2lhh.i386.rpm

http://people.redhat.com/lhh/rgmanager-1.9.69-2.2lhh.i386.rpm
Ok -- this block in rg_state.c is suspect and looks like the cause of the actual problem:

    if (need_check) {
        pthread_mutex_lock(&exclusive_mutex);
        ret = check_exclusive_resources(membership, svcName);
        if (ret != 0) {
            cml_free(membership);
            pthread_mutex_unlock(&exclusive_mutex);
            if (ret > 0)
                goto relocate;
            else
                return FAIL;
        }
    }
    cml_free(membership);
    ...

svc_start() actually calls svc_advise_start(), which correctly aborts the start request if the service is already running. However, with exclusive resources running on a node, svc_start() -- and therefore svc_advise_start() -- is never called. Instead, we jump straight to trying to relocate the service to another node, which is not what we want. That is, if the service is running, whether or not there are exclusive services running locally is irrelevant, because we're not going to start it anyway.

So we should move the start check into svc_advise_start() instead of where it is, I think. This would not only provide a modest efficiency improvement, but should also fix the problem entirely.

That said, I think the original patch should also be included as a partial fix to this bugzilla.
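To make the proposed reordering concrete, here is a hedged sketch (simplified stand-in logic, not the committed patch -- the stub helpers below are invented for illustration):

    /* Sketch of the proposed check ordering; stubs are invented. */
    #include <stdio.h>

    enum { START_OK, START_ALREADY_RUNNING, START_RELOCATE };

    /* Stubs standing in for rgmanager internals. */
    static int already_running(const char *svc)    { (void)svc; return 1; }
    static int exclusive_conflict(const char *svc) { (void)svc; return 1; }

    static int handle_start(const char *svc, int need_check)
    {
        /* Running check FIRST: if the service is already up, abort
         * the start. Whether exclusive services run locally is
         * irrelevant, since we are not going to start it anyway. */
        if (already_running(svc))
            return START_ALREADY_RUNNING;

        /* Exclusive check SECOND: only a service that actually
         * needs starting should be bounced to another node. */
        if (need_check && exclusive_conflict(svc))
            return START_RELOCATE;

        return START_OK;
    }

    int main(void)
    {
        /* The suspect ordering sent this case straight to
         * "relocate" and ping-ponged; here it aborts cleanly. */
        printf("%d\n", handle_start("sds2", 1)); /* 1 = already running */
        return 0;
    }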
Created attachment 250981 [details]
rgmanager test logs
After updating the rgmanager package to rgmanager-1.9.69-2.2lhh.i386.rpm and testing, the reboot works fine; it no longer gets stuck on rgmanager. But when the service that was running on the rebooted node fails over to the spare node, it gives the following error: after starting the failed-over service sds1, it tries to relocate the service sds2 (healthy), which is running on node 2.

Nov 7 16:08:08 snrcn-mxdp-02-sd03 clurgmgrd[1968]: <notice> Service sds1 started
Nov 7 16:08:08 snrcn-mxdp-02-sd03 clurgmgrd[1968]: <warning> #71: Relocating failed service sds2
Nov 7 16:08:08 snrcn-mxdp-02-sd03 clurgmgrd[1968]: <warning> #70: Attempting to restart service sds2 locally.

Please see the attached rgmanager_test_log for the detailed log.

-raja
Created attachment 251121 [details]
Incremental fix (need both patches for completeness)

Test packages here:

http://people.redhat.com/lhh/rgmanager-1.9.69-2.3lhh.src.rpm
http://people.redhat.com/lhh/rgmanager-1.9.69-2.3lhh.i386.rpm
The incremental fix is large, but ultimately it just ensures two things:

(a) We don't send out messages in a normal abort attempt where the service is running, and
(b) All error cases are handled in some way.

The second patch could be used without the first patch, but warnings for "Invalid reply from node X during relocate" would still appear. Ergo, the recommendation is to use both patches.
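Point (b) essentially means making the reply handling exhaustive, so "already running" ends the relocate loop instead of looking like an invalid reply. A hedged sketch of the idea (reply names invented, not rgmanager's real codes):

    /* Every reply gets an explicit disposition instead of falling
     * through to "invalid reply". Names invented for illustration. */
    #include <stdio.h>

    enum reply { REPLY_SUCCESS, REPLY_FAIL, REPLY_ALREADY_RUNNING,
                 REPLY_ABORT };

    /* Returns 1 if the relocate loop should try the next member,
     * 0 if it is finished with this service. */
    static int dispose_reply(enum reply r)
    {
        switch (r) {
        case REPLY_SUCCESS:         /* started on the target */
        case REPLY_ALREADY_RUNNING: /* nothing to do; stop asking */
        case REPLY_ABORT:           /* target vetoed; give up cleanly */
            return 0;
        case REPLY_FAIL:            /* target couldn't start it */
            return 1;
        }
        return 0; /* unreachable; silences compiler warnings */
    }

    int main(void)
    {
        printf("%d\n", dispose_reply(REPLY_ALREADY_RUNNING)); /* 0 */
        return 0;
    }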
Lon,

That does fix the problem. I am not seeing any reboot problems or spurious service-relocation attempts. Looks good. The customer is stress testing it and I don't foresee a problem; if one turns up, I will update the BZ.

thanks
-Raja
Here is the comment from the customer:

"I tested the patch in both SDS nodes that had the problem and it seems to work fine. Thanks a lot. - Rafael"

The fix provided in rgmanager-1.9.69-2.3lhh.i386.rpm works fine. Please include it in a future official rgmanager release.
Patches in CVS:

http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/rgmanager/include/resgroup.h.diff?cvsroot=cluster&only_with_tag=RHEL4&r1=1.3.2.9&r2=1.3.2.10
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/rgmanager/src/daemons/groups.c.diff?cvsroot=cluster&only_with_tag=RHEL4&r1=1.8.2.21&r2=1.8.2.22
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/rgmanager/src/daemons/rg_state.c.diff?cvsroot=cluster&only_with_tag=RHEL4&r1=1.4.2.21&r2=1.4.2.22
Note that Nathan Straz hit this during testing of RHCS 4.6 as well, giving it even more weight.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0791.html