Bug 360401 - rgmanager is stuck in a loop while rebooting a node.
Summary: rgmanager is stuck in a loop while rebooting a node.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: rgmanager
Version: 4
Hardware: i686
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Lon Hohberger
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2007-10-31 15:20 UTC by Raja
Modified: 2018-10-19 22:01 UTC
CC: 3 users

Fixed In Version: RHBA-2008-0791
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-07-25 19:15:43 UTC
Embargoed:


Attachments
cluster conf file. (1.90 KB, text/plain)
2007-10-31 15:20 UTC, Raja
clustat output (595 bytes, text/plain)
2007-10-31 15:21 UTC, Raja
Potential fix (583 bytes, patch)
2007-11-07 21:26 UTC, Lon Hohberger
rgmanager test logs (4.03 KB, text/plain)
2007-11-07 22:46 UTC, Raja
Incremental fix (need both patches for completeness) (9.89 KB, patch)
2007-11-08 01:38 UTC, Lon Hohberger


Links
Red Hat Product Errata RHBA-2008:0791 (normal, SHIPPED_LIVE): rgmanager bug fix
and enhancement update (last updated 2008-07-25 19:14:58 UTC)

Description Raja 2007-10-31 15:20:20 UTC
Description of problem:

When a node running a cluster service is rebooted, it does not go down;
rgmanager is stuck in a loop trying to bring down the clurgmgrd processes.

Version-Release number of selected component (if applicable):
rgmanager-1.9.69-2
cman-1.0.17-0
ccs-1.0.10-0
kernel-smp-2.6.9-55.0.2.EL

How reproducible:

Always.
Steps to Reproduce:
1. Create a 3-node cluster with 2 services running exclusively on 2 nodes. The
third node is a spare.
2. Reboot one of the nodes running a service (sds1). Its log shows rgmanager
trying to relocate a failed service (sds2), which is actually running on the
other node and in the started state.
3. The node will not go down. It gets stuck trying to bring down rgmanager, and
the log shows clurgmgrd[4627]: <err> #61: Invalid reply from member 3 during
relocate operation!
4. If we issue another reboot or restart rgmanager on the spare node, the node
goes down and the service fails over to the spare node.

Note: If the node is NOT logging the following message prior to the reboot, it
always goes down:
clurgmgrd[4627]: <err> #61: Invalid reply from member 3 during relocate operation!

Actual results:

Node does not go down; rgmanager is stuck in a loop.

Expected results:

The node goes down and the service it was running is moved to the healthy node.

Additional info:

Comment 1 Raja 2007-10-31 15:20:20 UTC
Created attachment 244631 [details]
cluster conf file.

Comment 2 Raja 2007-10-31 15:21:16 UTC
Created attachment 244641 [details]
clustat output

Comment 3 Raja 2007-11-05 21:36:24 UTC
Raja,
*** Additional comments from the customer ***
I tried the name change for the nodes as we discussed yesterday and it didn't
make a difference. The other symptom that I'm seeing, if you want to add it to
the bugzilla, is that if I try to disable the services, the first one
is disabled without a problem, but the second service says "stopping",
followed by "starting" and "started". A second request to disable the
service finally disables it (and I'm talking about the "healthy"
service, not the one with the problem). I've tried this with the GUI and
with the command-line approach.

Comment 5 Lon Hohberger 2007-11-07 21:26:14 UTC
Created attachment 250901 [details]
Potential fix

I think that what is happening isn't related to rebooting at all.

Node A starts
node B says "hey, I think node A is a better node for this service foo"
node B stops service foo
node A starts service foo
node B tells node A to start service foo
node A says "It's already running"
node B says "I don't know what that means."
node B tells node A to start service foo
node A says "It's already running"
...
repeat.
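
A minimal sketch of that exchange from node B's side (hypothetical names and
reply codes, not the actual rgmanager source): the relocate loop only acts on
replies it understands, so the "already running" answer falls into the invalid
case and the same start request is simply sent again.

#include <stdio.h>

enum reply { REPLY_SUCCESS, REPLY_FAIL, REPLY_ALREADY_RUNNING };

/* Stand-in for the remote start message; node A keeps answering
 * "already running" because the service really is running there. */
static enum reply send_start_request(int node, const char *svc)
{
    printf("-> member %d: start %s\n", node, svc);
    return REPLY_ALREADY_RUNNING;
}

int main(void)
{
    int retries = 0;

    for (;;) {
        switch (send_start_request(3, "foo")) {
        case REPLY_SUCCESS:
        case REPLY_FAIL:
            return 0;             /* understood replies end the exchange */
        default:
            /* logged as "#61: Invalid reply from member 3 during
             * relocate operation!" -- and then retried */
            if (++retries == 3) { /* capped only so this demo exits */
                puts("...repeat: the node never finishes shutting down");
                return 1;
            }
        }
    }
}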

Comment 7 Lon Hohberger 2007-11-07 21:55:07 UTC
[lhh@people public_html]$ md5sum rgmanager-1.9.69-2.2lhh.i386.rpm
d09ce9850fcafabbc6f2617d23e17e8c  rgmanager-1.9.69-2.2lhh.i386.rpm

http://people.redhat.com/lhh/rgmanager-1.9.69-2.2lhh.i386.rpm



Comment 8 Lon Hohberger 2007-11-07 22:41:52 UTC
Ok -- 

This block in rg_state.c is suspect and looks like the cause of the actual problem:

        if (need_check) {
                pthread_mutex_lock(&exclusive_mutex);
                ret = check_exclusive_resources(membership, svcName);
                if (ret != 0) {
                        cml_free(membership);
                        pthread_mutex_unlock(&exclusive_mutex);
                        if (ret > 0)
                                goto relocate;
                        else
                                return FAIL;
                }
        }
        cml_free(membership);
...

svc_start() actually calls svc_advise_start(), which correctly aborts the start
request if the service is already running.  However, when exclusive resources
are running on a node, svc_start(), and therefore svc_advise_start(), is never
called.

Instead, we jump right to trying to relocate the service to another node -- which
is not what we want.  That is, if the service is already running, whether or not
there are local services running is irrelevant, because we're not going to start
it anyway.

So, I think we should move the start check into svc_advise_start() instead of
where it is now.  This would not only be a modest efficiency improvement; it
should also fix the problem entirely.  That said, I think the original patch
should also be included as a partial fix to this bugzilla.
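
To make the suggested reordering concrete, here is a rough sketch (the helper
names are hypothetical; only the rg_state.c block quoted above is real) of
svc_advise_start() answering the already-running case before the
exclusive-resource check can trigger a relocation:

#include <stdio.h>

/* Hypothetical stand-ins for the real rgmanager state lookups. */
static int service_already_running(const char *svc) { (void)svc; return 1; }
static int node_has_exclusive_resources(void)       { return 1; }

/* The start check lives in svc_advise_start(): if the service is already
 * running, advise an abort before any exclusive-resource logic runs. */
static int svc_advise_start(const char *svc)
{
    if (service_already_running(svc)) {
        printf("%s already running: abort start, do not relocate\n", svc);
        return 0;                 /* abort the start request */
    }
    return 1;                     /* proceed with the start */
}

static void handle_start_request(const char *svc)
{
    if (!svc_advise_start(svc))
        return;                   /* already running: done, nothing sent */

    /* The exclusive-resource check now only matters for a service we
     * might actually start here. */
    if (node_has_exclusive_resources()) {
        printf("%s: relocating to another node\n", svc);
        return;
    }
    printf("%s: starting locally\n", svc);
}

int main(void)
{
    handle_start_request("sds2");
    return 0;
}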



Comment 9 Raja 2007-11-07 22:46:12 UTC
Created attachment 250981 [details]
rgmanager test logs

Comment 10 Raja 2007-11-07 22:47:16 UTC
After updating the rgmanager package to rgmanager-1.9.69-2.2lhh.i386.rpm and
testing, the reboot works fine and does not get stuck on rgmanager. But when
the service that was running on the rebooted node is failed over to the spare
node, it gives the following error: after starting the failed-over service sds1,
it tries to relocate the service sds2 (healthy), which is running on node 2.

Nov  7 16:08:08 snrcn-mxdp-02-sd03 clurgmgrd[1968]: <notice> Service sds1 started 

Nov  7 16:08:08 snrcn-mxdp-02-sd03 clurgmgrd[1968]: <warning> #71: Relocating
failed service sds2 
Nov  7 16:08:08 snrcn-mxdp-02-sd03 clurgmgrd[1968]: <warning> #70: Attempting to
restart service sds2 locally. 


Please see the attached rgmanager_test_log for a detailed log.

-raja

Comment 11 Lon Hohberger 2007-11-08 01:38:25 UTC
Created attachment 251121 [details]
Incremental fix (need both patches for completeness)

Test packages here:

http://people.redhat.com/lhh/rgmanager-1.9.69-2.3lhh.src.rpm
http://people.redhat.com/lhh/rgmanager-1.9.69-2.3lhh.i386.rpm

Comment 12 Lon Hohberger 2007-11-08 14:53:02 UTC
The incremental fix is large but ultimately just ensures two things:
(a) We don't send out messages in a normal abort attempt where the service is
running, and
(b) All error cases are handled in some way.

The second patch could be used without the first patch, but warnings for
"Invalid reply from node X during relocate" would still appear.  Ergo, the
recommendation is to use both patches.
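
As an illustration of (b), a hedged sketch (illustrative names, not taken from
either patch) of a reply handler in which every code has an explicit
disposition and anything unrecognized ends the attempt instead of being
retried:

#include <stdio.h>

enum reply { REPLY_SUCCESS, REPLY_FAIL, REPLY_RUNNING, REPLY_AGAIN };

static int handle_relocate_reply(enum reply r)
{
    switch (r) {
    case REPLY_SUCCESS:
        return 0;    /* started on the target node */
    case REPLY_RUNNING:
        return 0;    /* already running somewhere: nothing left to do */
    case REPLY_AGAIN:
        return 1;    /* target busy: try the next node in the list */
    case REPLY_FAIL:
        return -1;   /* hard failure: stop the relocate attempt */
    }

    /* Unknown reply (e.g. from a mismatched peer): log once and stop
     * rather than looping on it. */
    fprintf(stderr, "invalid reply %d during relocate\n", (int)r);
    return -1;
}

int main(void)
{
    /* An out-of-range value models a reply this node does not know. */
    return handle_relocate_reply((enum reply)42) < 0;
}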

Comment 13 Raja 2007-11-08 19:55:43 UTC
Lon,

That does fix the problem. I am not seeing any problems with reboots or with
the service relocation attempts. Looks good. The customer is stress testing it
and I don't foresee a problem. If one comes up, I will update the BZ.

thanks
-Raja

Comment 14 Raja 2007-11-12 17:05:27 UTC
Here is the comment from the customer.

" I tested the patch in both SDS nodes that had the problem and it seems to work
fine. Thanks a lot. - Rafael "

The fix provided in rgmanager-1.9.69-2.3lhh.i386.rpm works fine. Please
include it in a future official rgmanager release.

Comment 16 Lon Hohberger 2007-11-16 17:24:08 UTC
Note that Nathan Straz hit this during testing of RHCS 4.6 as well, giving it
even more weight.

Comment 17 RHEL Program Management 2007-11-30 19:05:08 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 23 errata-xmlrpc 2008-07-25 19:15:43 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0791.html


