Bug 523999 - "depend" attributes not interacting correctly with exclusive prioritization
Summary: "depend" attributes not interacting correctly with exclusive prioritization
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: rgmanager
Version: 5.4
Hardware: x86_64
OS: Linux
Priority: low
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Lon Hohberger
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2009-09-17 14:47 UTC by Henrique Cezar
Modified: 2016-04-26 15:31 UTC
CC: 3 users

Fixed In Version: rgmanager-2.0.52-2.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-03-30 08:49:36 UTC


Attachments
Cluster.conf (5.28 KB, text/plain)
2009-10-23 17:28 UTC, Henrique Cezar
Preliminary fix (5.61 KB, patch)
2009-12-21 22:59 UTC, Lon Hohberger
Updated fix (5.65 KB, patch)
2009-12-22 15:17 UTC, Lon Hohberger


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2010:0280 normal SHIPPED_LIVE rgmanager bug fix and enhancement update 2010-03-29 13:59:11 UTC

Description Henrique Cezar 2009-09-17 14:47:31 UTC
Description of problem:

I have an environment with five blade nodes running in an HP enclosure. When I kill the node running the cluster's master service, I see two different behaviors: sometimes only the master service fails over, and other times all services are restarted, with some of them restarted more than once.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. All 5 nodes are stable and the master service is on node 1 (ime1).
2. Fence node 1 (see the example command after this list).
3. The master service relocates to node 2 (ime2).
4. Sometimes the services on the other nodes (nodes 3, 4 and 5) are restarted, and sometimes they are not.
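For step 2, the fence can be triggered by hand with cman's fence_node utility. A hedged example, using the reporter's name for node 1 and run from any other cluster member:

	# Ask the cluster to fence (power-cycle) node ime1 immediately
	[root@ime2 ~]# fence_node ime1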
  
Actual results:
Failover does not follow any consistent pattern.

Expected results:
I expect all dependent services to be restarted whenever the master service fails over.

Additional info:

Even with depend_mode set to hard, I see all services restart only when I stop rgmanager cleanly; if I instead kill (fence) the master node, only the master service is recovered and the non-master services are not touched.

Below are the relevant parts of our cluster.conf:

....
	<rm central_processing="1" log_facility="local4" log_level="7">
		<failoverdomains>
			<failoverdomain name="all" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="tucao" priority="1"/>
				<failoverdomainnode name="tuim" priority="2"/>
				<failoverdomainnode name="tuiuiu" priority="3"/>
				<failoverdomainnode name="tuju" priority="4"/>
				<failoverdomainnode name="tuque" priority="5"/>
			</failoverdomain>
			<failoverdomain name="tucao" ordered="1" restricted="1">
				<failoverdomainnode name="tucao" priority="1"/>
			</failoverdomain>
			<failoverdomain name="tuim" ordered="1" restricted="1">
				<failoverdomainnode name="tuim" priority="1"/>
			</failoverdomain>
			<failoverdomain name="tuiuiu" ordered="1" restricted="1">
				<failoverdomainnode name="tuiuiu" priority="1"/>
			</failoverdomain>
			<failoverdomain name="tuju" ordered="1" restricted="1">
				<failoverdomainnode name="tuju" priority="1"/>
			</failoverdomain>
			<failoverdomain name="tuque" ordered="1" restricted="1">
				<failoverdomainnode name="tuque" priority="1"/>
			</failoverdomain>
		</failoverdomains>
....
		<service autostart="1" domain="all" exclusive="1" name="online1" recovery="relocate" max_restarts="0" restart_expire_time="0">
.....
		</service>
		<service autostart="1" depend="service:online1" depend_mode="hard" domain="tucao" exclusive="2" name="tucao" recovery="restart">
....
		</service>
		<service autostart="1" depend="service:online1" depend_mode="hard" domain="tuim" exclusive="2" name="tuim" recovery="restart">
....
		</service>
		<service autostart="1" depend="service:online1" depend_mode="hard" domain="tuiuiu" exclusive="2" name="tuiuiu" recovery="restart">
....
		</service>
		<service autostart="1" depend="service:online1" depend_mode="hard" domain="tuju" exclusive="2" name="tuju" recovery="restart">
....
		</service>
		<service autostart="1" depend="service:online1" depend_mode="hard" domain="tuque" exclusive="2" name="tuque" recovery="restart">
....
		</service>
.....
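For reference, the configuration above boils down to one exclusive master service (online1) plus one hard-dependent service per node. A minimal sketch of that pattern, with hypothetical service names; central_processing="1" is kept because the depend= and exclusive-priority handling are part of rgmanager's central event processing:

	<rm central_processing="1">
		<!-- Master service: exclusive priority 1, relocated on failure -->
		<service name="master" exclusive="1" recovery="relocate"/>
		<!-- Dependent service: a hard dependency means it must be stopped
		     and restarted whenever service:master stops or moves -->
		<service name="worker" depend="service:master"
		         depend_mode="hard" exclusive="2" recovery="restart"/>
	</rm>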

RUNNING A FAILOVER.

- STEP 1 - ALL NODES STABLE
Cluster Status for vivo @ Thu Sep 17 11:35:44 2009
Member Status: Quorate

 Member Name                   ID   Status
 ------ ----                   ---- ------
 tucao                             1 Online, Local, RG-Worker
 tuim                              2 Online, RG-Worker
 tuiuiu                            3 Online, RG-Worker
 tuju                              4 Online, RG-Worker
 tuque                             5 Online, RG-Worker

 Service Name         Owner (Last)         State
 ------- ----         ----- ------         -----
 service:online1      tuju                 started
 service:tucao        tucao                started
 service:tuim         tuim                 started
 service:tuiuiu       tuiuiu               started
 service:tuju         (none)               stopped
 service:tuque        tuque                started

- STEP 2 - KILLING MASTER NODE (FENCE)
Cluster Status for vivo @ Thu Sep 17 11:35:44 2009
Member Status: Quorate

 Member Name                   ID   Status
 ------ ----                   ---- ------
 tucao                             1 Online, Local, RG-Worker
 tuim                              2 Online, RG-Worker
 tuiuiu                            3 Online, RG-Worker
 tuju                              4 Offline
 tuque                             5 Online, RG-Worker

 Service Name         Owner (Last)         State
 ------- ----         ----- ------         -----
 service:online1      tuju                 started
 service:tucao        tucao                started
 service:tuim         tuim                 started
 service:tuiuiu       tuiuiu               started
 service:tuju         (none)               stopped
 service:tuque        tuque                started

- STEP 3 - FAILOVER RUNNING.
Cluster Status for vivo @ Thu Sep 17 11:35:44 2009
Member Status: Quorate

 Member Name                   ID   Status
 ------ ----                   ---- ------
 tucao                             1 Online, Local, RG-Worker
 tuim                              2 Online, RG-Worker
 tuiuiu                            3 Online, RG-Worker
 tuju                              4 Offline
 tuque                             5 Online, RG-Worker

 Service Name         Owner (Last)         State
 ------- ----         ----- ------         -----
 service:online1      tuju                 started
 service:tucao        tucao                started
 service:tuim         tuim                 started
 service:tuiuiu       tuiuiu               started
 service:tuju         (none)               stopped
 service:tuque        tuque                stopping

- STEP 4 - MASTER SERVICE STARTING.
Cluster Status for vivo @ Thu Sep 17 11:35:44 2009
Member Status: Quorate

 Member Name                   ID   Status
 ------ ----                   ---- ------
 tucao                             1 Online, Local, RG-Worker
 tuim                              2 Online, RG-Worker
 tuiuiu                            3 Online, RG-Worker
 tuju                              4 Offline
 tuque                             5 Online, RG-Worker

 Service Name         Owner (Last)         State
 ------- ----         ----- ------         -----
 service:online1      tuque                starting
 service:tucao        tucao                started
 service:tuim         tuim                 started
 service:tuiuiu       tuiuiu               started
 service:tuju         (none)               stopped
 service:tuque        (tuque)              stopped

- STEP 5 - FAILOVER DONE.
Cluster Status for vivo @ Thu Sep 17 11:35:44 2009
Member Status: Quorate

 Member Name                   ID   Status
 ------ ----                   ---- ------
 tucao                             1 Online, Local, RG-Worker
 tuim                              2 Online, RG-Worker
 tuiuiu                            3 Online, RG-Worker
 tuju                              4 Offline
 tuque                             5 Online, RG-Worker

 Service Name         Owner (Last)         State
 ------- ----         ----- ------         -----
 service:online1      tuque                started
 service:tucao        tucao                started
 service:tuim         tuim                 started
 service:tuiuiu       tuiuiu               started
 service:tuju         (none)               stopped
 service:tuque        (tuque)              stopped

Comment 1 Lon Hohberger 2009-09-22 19:07:22 UTC
Which rgmanager package are you using?

Comment 2 Henrique Cezar 2009-09-23 03:09:01 UTC
I'm using rgmanager-2.0.46-1.el5_3.4.
Actually, the release I am using is 5.3, not 5.4, so the ticket could be updated with that specification.

Comment 3 Lon Hohberger 2009-09-23 14:21:55 UTC
I need:

- logs from the nodes
- cluster.conf
- any event scripts which are not shipped with rgmanager

Comment 4 Henrique Cezar 2009-10-23 17:28:52 UTC
Created attachment 365872 [details]
Cluster.conf

This is the actual cluster.conf used on my environment.

Comment 5 Henrique Cezar 2009-10-23 17:34:55 UTC
I noticed that if the master service (online1) goes to the STOPPED state, every service that depends on online1 stops too; but if I kill the master node abruptly, the master service's state stays STARTED until the cluster detects the failure and runs the failover, at which point it changes to STARTING.

My point is: is there any way to update the STATE of the master service sooner after its node dies?

Comment 6 Lon Hohberger 2009-10-23 20:23:34 UTC
So, it seems like we aren't generating events in the correct order on node death when depend= is in use.

That should be easy(ish) to fix.

Comment 7 Lon Hohberger 2009-12-21 22:59:12 UTC
Created attachment 379729 [details]
Preliminary fix

0001-rgmanager-Fix-event-generation-with-central_process.patch

Initial pass at a fix; will test a bit more tomorrow morning.  This fixes the event generation and should apply cleanly against the following TEST rpm:

http://people.redhat.com/lhh/RHEL-5-TEST/rgmanager-2.0.52-1.36.el5.src.rpm

Comment 8 Lon Hohberger 2009-12-22 15:16:32 UTC
Configuration:
        <rm central_processing="1" log_level="5" log_facility="local4">
                <service name="1"/>
                <service name="2"/>
                <service name="3"/>
                <service name="d1" depend="service:1"/>
        </rm>
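
A configuration like this can be sanity-checked offline with rgmanager's rg_test utility before loading it; a hedged example, assuming the file is in its usual location:

	# Parse cluster.conf and print the resource tree without starting services
	[root@frederick ~]# rg_test test /etc/cluster/cluster.conf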


[root@frederick ~]# clustat
Cluster Status for lolcats @ Tue Dec 22 10:09:31 2009
Member Status: Quorate

 Member Name                        ID   Status
 ------ ----                        ---- ------
 molly                                  1 Online, RG-Master
 frederick                              2 Online, Local, RG-Worker
 /dev/hdb1                              0 Offline, Quorum Disk

 Service Name              Owner (Last)              State         
 ------- ----              ----- ------              -----         
 service:1                 frederick                 started       
 service:2                 frederick                 started       
 service:3                 molly                     started       
 service:d1                molly                     started   

---

Kill frederick.  Expected behavior is that:
 - service 1 and service 2 are started on molly
 - service d1 is restarted because service 1 had failed

---

Dec 22 10:13:01 molly clurgmgrd[27217]: <notice> Starting service:1 on [ 1 ] 
Dec 22 10:13:01 molly clurgmgrd[27217]: <notice> Starting stopped service service:1 
Dec 22 10:13:01 molly clurgmgrd[27217]: <notice> Service service:1 started 
Dec 22 10:13:01 molly clurgmgrd[27217]: <notice> Service service:1 is now running on member 1 
Dec 22 10:13:01 molly clurgmgrd[27217]: <notice> Starting service:2 on [ 1 ] 
Dec 22 10:13:01 molly clurgmgrd[27217]: <notice> Starting stopped service service:2 
Dec 22 10:13:01 molly clurgmgrd[27217]: <notice> Service service:2 started 
Dec 22 10:13:01 molly clurgmgrd[27217]: <notice> Service service:2 is now running on member 1 
Dec 22 10:13:01 molly clurgmgrd[27217]: <notice> Stopping service service:d1 
Dec 22 10:13:01 molly clurgmgrd[27217]: <notice> Service service:d1 is stopped 
Dec 22 10:13:01 molly clurgmgrd[27217]: <notice> Starting service:d1 on [ 1 ] 
Dec 22 10:13:01 molly clurgmgrd[27217]: <notice> Starting stopped service service:d1 
Dec 22 10:13:01 molly clurgmgrd[27217]: <notice> Service service:d1 started 
Dec 22 10:13:01 molly clurgmgrd[27217]: <notice> Service service:d1 is now running on member 1

Comment 9 Lon Hohberger 2009-12-22 15:17:19 UTC
Created attachment 379843 [details]
Updated fix

Fix cleans up log messages and error path if the stop fails during recovery.

Comment 15 errata-xmlrpc 2010-03-30 08:49:36 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0280.html

Comment 16 Henrique Cezar 2010-03-30 12:40:25 UTC
I would like to say thanks for your attention and help. Unfortunately, I have not been able to apply the fix yet: this is a customer environment and I do not yet have permission to make changes to it.

