Bug 1151199 - rgmanager: Restricted domain non-member can start service if all other members exhausted
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: rgmanager
Version: 6.6
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Ryan McCabe
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2014-10-09 18:46 UTC by John Ruemker
Modified: 2019-07-11 08:15 UTC (History)

Fixed In Version: rgmanager-3.0.12.1-22.el6
Doc Type: Bug Fix
Doc Text:
Previously, when relocating a service, the rgmanager utility attempted to use all nodes in a domain and if all failed, rgmanager restarted the service locally without checking whether the local node was eligible to run the service and regardless of whether the service had been started. Consequently, under certain circumstances, a service in a restricted domain could be started on a non-member node. With this update, if the service cannot be started on any domain members, the service goes back to a stopped state, and rgmanager no longer attempts to start the service on a local node outside the restricted domain.
Clone Of:
Environment:
Last Closed: 2015-07-22 07:32:40 UTC


Attachments
rgmanager: Do not restart a service locally after failed relocation if failover domains prohibit it (1.67 KB, patch)
2014-10-09 18:47 UTC, John Ruemker


Links
System ID                                  Priority  Status        Summary                   Last Updated
Red Hat Product Errata RHBA-2015:1402      normal    SHIPPED_LIVE  rgmanager bug fix update  2015-07-20 18:07:17 UTC
Red Hat Knowledge Base (Solution) 1202713  None      None          None                      Never

Description John Ruemker 2014-10-09 18:46:09 UTC
Description of problem:  When relocating a service, rgmanager tries the preferred nodes first and, if every one of them fails, restarts the service locally (even if it wasn't started to begin with).  However, it doesn't check whether the local node is eligible to run the service, so a service in a restricted domain can end up starting on a cluster node outside that domain.

A customer encountered this in the only way I could conceive of it happening, and admittedly the conditions under which it's possible are fairly narrow.  If the service is in a stopped state, you issue 'clusvcadm -r <service>' from a non-domain-member without specifying a destination, and all members of that domain fail to start the service, then once all members are exhausted the node where the command was run restarts it locally.  If the start succeeds there where it failed on the other nodes, the service stays running on this node, which is not a member of the domain.


Version-Release number of selected component (if applicable): rgmanager-3.0.12.1-19.el6


How reproducible: Easily with contrived conditions, probably rarely in the wild


Steps to Reproduce:
1. Create a restricted failover domain with one node left out.

2. Create a service that can be configured to fail on demand at startup.  I use a script resource that fails on start when a special file is present and succeeds for everything else.  Put the service in the failover domain.

3. Configure all nodes in the failover domain to fail the service on start.

4. Put the service into a stopped state:

  # clusvcadm -s <service>

5. On the non-domain-member, run 'clusvcadm -r <service>'
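
A minimal sketch of the kind of script resource described in step 2. The flag file path /root/fail-on-start and the function name svc_action are assumptions made for illustration; the report does not say which file or script was actually used. rgmanager calls a script resource with an action argument such as start, stop, or status:

```shell
#!/bin/bash
# Sketch of a test script resource (e.g. /root/script).
# FAIL_FLAG is a hypothetical path, not taken from the report.
FAIL_FLAG="${FAIL_FLAG:-/root/fail-on-start}"

svc_action() {
    case "$1" in
        start)
            # Fail the start only while the flag file is present.
            if [ -e "$FAIL_FLAG" ]; then
                echo "start refused: $FAIL_FLAG exists" >&2
                return 1
            fi
            return 0
            ;;
        *)
            # stop, status, and any other action always succeed.
            return 0
            ;;
    esac
}

# rgmanager always passes an action; do nothing when called without one.
if [ "$#" -gt 0 ]; then
    svc_action "$1"
fi
```

With a script like this, step 3 would amount to running 'touch /root/fail-on-start' on each domain member and leaving the file absent on the non-member.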

Actual results: 

		<failoverdomains>
			<failoverdomain name="1then2" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="rhel6-node1.example.com" priority="1"/>
				<failoverdomainnode name="rhel6-node2.example.com" priority="2"/>
			</failoverdomain>
			<failoverdomain name="2then1" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="rhel6-node1.example.com" priority="2"/>
				<failoverdomainnode name="rhel6-node2.example.com" priority="1"/>
			</failoverdomain>
		</failoverdomains>
		<service domain="1then2" name="test">
			<script file="/root/script" name="script"/>
		</service>

# clusvcadm -r test
Trying to relocate service:test...Failed; service running on original owner

# clustat
Cluster Status for rhel6-cluster @ Thu Oct  9 14:19:49 2014
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 rhel6-node1.example.com                                             1 Online, rgmanager
 rhel6-node2.example.com                                             2 Online, rgmanager
 rhel6-node3.example.com                                             3 Online, Local, rgmanager

 Service Name                                                     Owner (Last)                                                     State         
 ------- ----                                                     ----- ------                                                     -----         
 service:test                                                     rhel6-node3.example.com                                          started  

Oct  9 14:18:37 rhel6-node1 rgmanager[27937]: Starting stopped service service:test
Oct  9 14:18:37 rhel6-node1 rgmanager[29176]: [script] Executing /root/script start
Oct  9 14:18:37 rhel6-node1 rgmanager[29197]: [script] script:script: start of /root/script failed (returned 1)
Oct  9 14:18:37 rhel6-node1 rgmanager[27937]: start on script "script" returned 1 (generic error)
Oct  9 14:18:37 rhel6-node1 rgmanager[27937]: #68: Failed to start service:test; return value: 1
Oct  9 14:18:37 rhel6-node1 rgmanager[27937]: Stopping service service:test
Oct  9 14:18:37 rhel6-node1 rgmanager[29226]: [script] Executing /root/script stop
Oct  9 14:18:37 rhel6-node1 rgmanager[27937]: Service service:test is recovering

Oct  9 14:18:10 rhel6-node2 rgmanager[13591]: Service service:test is recovering
Oct  9 14:18:37 rhel6-node2 rgmanager[13591]: Recovering failed service service:test
Oct  9 14:18:38 rhel6-node2 rgmanager[14616]: [script] Executing /root/script start
Oct  9 14:18:38 rhel6-node2 rgmanager[14637]: [script] script:script: start of /root/script failed (returned 1)
Oct  9 14:18:38 rhel6-node2 rgmanager[13591]: start on script "script" returned 1 (generic error)
Oct  9 14:18:38 rhel6-node2 rgmanager[13591]: #68: Failed to start service:test; return value: 1
Oct  9 14:18:38 rhel6-node2 rgmanager[13591]: Stopping service service:test
Oct  9 14:18:38 rhel6-node2 rgmanager[14666]: [script] Executing /root/script stop
Oct  9 14:18:38 rhel6-node2 rgmanager[13591]: Service service:test is recovering

Oct  9 14:18:40 rhel6-node3 rgmanager[8721]: #70: Failed to relocate service:test; restarting locally
Oct  9 14:18:40 rhel6-node3 rgmanager[8721]: Recovering failed service service:test
Oct  9 14:18:40 rhel6-node3 rgmanager[9559]: [script] Executing /root/script start
Oct  9 14:18:40 rhel6-node3 rgmanager[8721]: Service service:test started


Expected results: The service goes back to a stopped state if it can't start on any domain members, and the local node doesn't try to start it if it's not in the domain.
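
The expected decision can be sketched roughly as follows. This is not the attached patch, which changes rgmanager itself; the function names and the way the domain is passed in are assumptions made purely for illustration:

```shell
#!/bin/bash
# Hypothetical helper: return 0 if a node may run a service bound to a
# failover domain.
#   $1 = node name
#   $2 = "1" if the domain is restricted, anything else otherwise
#   $3 = space-separated list of domain member names
node_is_eligible() {
    local node=$1 restricted=$2 domain_members=$3 m
    # An unrestricted domain only expresses a preference, so any node qualifies.
    if [ "$restricted" != "1" ]; then
        return 0
    fi
    for m in $domain_members; do
        if [ "$m" = "$node" ]; then
            return 0
        fi
    done
    return 1
}

# Decide what the local node does once every relocation target has failed.
# Prints "restart-locally" or "stop".
after_failed_relocation() {
    if node_is_eligible "$1" "$2" "$3"; then
        echo "restart-locally"
    else
        echo "stop"
    fi
}

members="rhel6-node1.example.com rhel6-node2.example.com"
after_failed_relocation rhel6-node3.example.com 1 "$members"   # prints "stop"
after_failed_relocation rhel6-node1.example.com 1 "$members"   # prints "restart-locally"
```

In the reproducer above, rhel6-node3 is not listed in the restricted domain "1then2", so the sketch yields "stop" for it rather than a local restart.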


Additional info:

Comment 1 John Ruemker 2014-10-09 18:47:03 UTC
Created attachment 945425
rgmanager: Do not restart a service locally after failed relocation if failover domains prohibit it

Test results with the patch applied:


Oct  9 14:14:37 rhel6-node1 rgmanager[26366]: Starting stopped service service:test
Oct  9 14:14:37 rhel6-node1 rgmanager[27637]: [script] Executing /root/script start
Oct  9 14:14:37 rhel6-node1 rgmanager[27658]: [script] script:script: start of /root/script failed (returned 1)
Oct  9 14:14:37 rhel6-node1 rgmanager[26366]: start on script "script" returned 1 (generic error)
Oct  9 14:14:37 rhel6-node1 rgmanager[26366]: #68: Failed to start service:test; return value: 1
Oct  9 14:14:37 rhel6-node1 rgmanager[26366]: Stopping service service:test
Oct  9 14:14:37 rhel6-node1 rgmanager[27687]: [script] Executing /root/script stop
Oct  9 14:14:37 rhel6-node1 rgmanager[26366]: Service service:test is recovering

Oct  9 14:14:37 rhel6-node2 rgmanager[12348]: Recovering failed service service:test
Oct  9 14:14:38 rhel6-node2 rgmanager[13379]: [script] Executing /root/script start
Oct  9 14:14:38 rhel6-node2 rgmanager[13400]: [script] script:script: start of /root/script failed (returned 1)
Oct  9 14:14:38 rhel6-node2 rgmanager[12348]: start on script "script" returned 1 (generic error)
Oct  9 14:14:38 rhel6-node2 rgmanager[12348]: #68: Failed to start service:test; return value: 1
Oct  9 14:14:38 rhel6-node2 rgmanager[12348]: Stopping service service:test
Oct  9 14:14:38 rhel6-node2 rgmanager[13429]: [script] Executing /root/script stop
Oct  9 14:14:38 rhel6-node2 rgmanager[12348]: Service service:test is recovering

Oct  9 14:14:40 rhel6-node3 rgmanager[7778]: Failed to relocate service:test; giving up
Oct  9 14:14:40 rhel6-node3 rgmanager[7778]: Stopping service service:test
Oct  9 14:14:40 rhel6-node3 rgmanager[8625]: [script] Executing /root/script stop
Oct  9 14:14:40 rhel6-node3 rgmanager[7778]: Service service:test is stopped

Comment 5 errata-xmlrpc 2015-07-22 07:32:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1402.html

