1151199 – rgmanager: Restricted domain non-member can start service if all other members exhausted



Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	rgmanager
Sub Component:
Version:	6.6
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Ryan McCabe
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2014-10-09 18:46 UTC by John Ruemker
Modified:	2019-07-11 08:15 UTC
CC List:	3 users
Fixed In Version:	rgmanager-3.0.12.1-22.el6
Doc Type:	Bug Fix
Doc Text:	Previously, when relocating a service, the rgmanager utility attempted to use all nodes in a domain and if all failed, rgmanager restarted the service locally without checking whether the local node was eligible to run the service and regardless of whether the service had been started. Consequently, under certain circumstances, a service in a restricted domain could be started on a non-member node. With this update, if the service cannot be started on any domain members, the service goes back to a stopped state, and rgmanager no longer attempts to start the service on a local node outside the restricted domain.
Clone Of:
Environment:
Last Closed:	2015-07-22 07:32:40 UTC
Target Upstream Version:
Embargoed:

Attachments
rgmanager: Do not restart a service locally after failed relocation if failover domains prohibit it (1.67 KB, patch) 2014-10-09 18:47 UTC, John Ruemker

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	1202713	0	None	None	None	Never
Red Hat Product Errata	RHBA-2015:1402	0	normal	SHIPPED_LIVE	rgmanager bug fix update	2015-07-20 18:07:17 UTC

Description John Ruemker 2014-10-09 18:46:09 UTC

Description of problem:  When relocating a service, rgmanager tries the best candidate nodes first and, if every attempt fails, restarts the service locally (even if it was not started to begin with).  However, it does not check whether the local node is eligible to run the service, so a service in a restricted domain can end up started on a node outside that domain.

A customer encountered this in the only way I could conceive of it happening, and admittedly the conditions under which it is possible are fairly narrow.  If the service is in a stopped state, you issue 'clusvcadm -r <service>' from a non-domain-member without specifying a destination, and every member of that domain fails to start the service, then once all members are exhausted the node where the command was run restarts it locally.  If the start succeeds there where it failed on the other nodes, the service stays running on this node, which is not a member of the domain.
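
The decision described above can be modelled roughly as follows. This is only an illustrative Python sketch, not rgmanager's actual C code: the function name, its parameters, and the `can_start` callback are all hypothetical, and a failed local start is simplified to "stopped". The `check_local_eligibility` flag switches between the behaviour reported here (no check) and the expected behaviour (check the domain before restarting locally).

```python
def relocate_without_target(domain_members, restricted, local_node,
                            can_start, check_local_eligibility):
    """Hypothetical model of 'clusvcadm -r <service>' with no destination.

    domain_members: ordered list of node names in the failover domain
    restricted:     True if the domain is restricted
    local_node:     node where the relocate request was issued
    can_start:      callable(node) -> bool, stands in for a start attempt
    Returns (owner, state).
    """
    # Try every domain member other than the local node, best first.
    for node in domain_members:
        if node != local_node and can_start(node):
            return node, "started"

    # All candidates are exhausted; the fallback is a local restart.
    local_allowed = (not restricted) or (local_node in domain_members)
    if check_local_eligibility and not local_allowed:
        # Expected behaviour: give up instead of violating the domain.
        return None, "stopped"

    if can_start(local_node):
        return local_node, "started"
    # Simplification: the real daemon's handling of a failed local
    # restart is not modelled here.
    return None, "stopped"
```

With members node1 and node2 in a restricted domain, a request issued on node3, and a start that only succeeds on node3, the unchecked variant returns node3 as the owner, while the checked variant leaves the service stopped.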


Version-Release number of selected component (if applicable): rgmanager-3.0.12.1-19.el6


How reproducible: Easily with contrived conditions, probably rarely in the wild


Steps to Reproduce:
1. Create a restricted failoverdomain with one node left out.

2. Create a service that can be made to fail on demand at startup.  I use a script resource that fails on start when a special file is present and succeeds for every other operation. Put the service in the failover domain.

3. Configure all nodes in the failoverdomain to fail the service on start.

4. Put the service into a stopped state:

  # clusvcadm -s <service>

5. On the non-domain-member, run 'clusvcadm -r <service>'
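
For step 2, a script resource along these lines should be enough. This is a sketch, not the script actually used for this report: the marker path /root/fail-on-start is an assumption, and the script is written in Python only for illustration (any executable that accepts start/stop/status would do). Creating the marker file on a node makes "start" fail there, which covers step 3.

```python
#!/usr/bin/env python
import os
import sys

# Hypothetical marker file; its presence makes "start" fail on this node.
FAIL_MARKER = "/root/fail-on-start"


def main(argv, marker=FAIL_MARKER):
    """Return the exit status for the requested action (argv[1])."""
    action = argv[1] if len(argv) > 1 else ""
    if action == "start":
        # Fail on start only while the marker file exists.
        return 1 if os.path.exists(marker) else 0
    # stop, status and anything else always succeed.
    return 0


# Only act when invoked with an action argument, e.g. "/root/script start".
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

To reproduce, touch the marker file on every node in the failover domain and leave it absent on the non-member.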

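Actual results: The service ends up started on rhel6-node3.example.com, which is not a member of the restricted domain "1then2".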

		<failoverdomains>
			<failoverdomain name="1then2" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="rhel6-node1.example.com" priority="1"/>
				<failoverdomainnode name="rhel6-node2.example.com" priority="2"/>
			</failoverdomain>
			<failoverdomain name="2then1" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="rhel6-node1.example.com" priority="2"/>
				<failoverdomainnode name="rhel6-node2.example.com" priority="1"/>
			</failoverdomain>
		</failoverdomains>
		<service domain="1then2" name="test">
			<script file="/root/script" name="script"/>
		</service>

# clusvcadm -r test
Trying to relocate service:test...Failed; service running on original owner

# clustat
Cluster Status for rhel6-cluster @ Thu Oct  9 14:19:49 2014
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 rhel6-node1.example.com                                             1 Online, rgmanager
 rhel6-node2.example.com                                             2 Online, rgmanager
 rhel6-node3.example.com                                             3 Online, Local, rgmanager

 Service Name                                                     Owner (Last)                                                     State         
 ------- ----                                                     ----- ------                                                     -----         
 service:test                                                     rhel6-node3.example.com                                          started  

Oct  9 14:18:37 rhel6-node1 rgmanager[27937]: Starting stopped service service:test
Oct  9 14:18:37 rhel6-node1 rgmanager[29176]: [script] Executing /root/script start
Oct  9 14:18:37 rhel6-node1 rgmanager[29197]: [script] script:script: start of /root/script failed (returned 1)
Oct  9 14:18:37 rhel6-node1 rgmanager[27937]: start on script "script" returned 1 (generic error)
Oct  9 14:18:37 rhel6-node1 rgmanager[27937]: #68: Failed to start service:test; return value: 1
Oct  9 14:18:37 rhel6-node1 rgmanager[27937]: Stopping service service:test
Oct  9 14:18:37 rhel6-node1 rgmanager[29226]: [script] Executing /root/script stop
Oct  9 14:18:37 rhel6-node1 rgmanager[27937]: Service service:test is recovering

Oct  9 14:18:10 rhel6-node2 rgmanager[13591]: Service service:test is recovering
Oct  9 14:18:37 rhel6-node2 rgmanager[13591]: Recovering failed service service:test
Oct  9 14:18:38 rhel6-node2 rgmanager[14616]: [script] Executing /root/script start
Oct  9 14:18:38 rhel6-node2 rgmanager[14637]: [script] script:script: start of /root/script failed (returned 1)
Oct  9 14:18:38 rhel6-node2 rgmanager[13591]: start on script "script" returned 1 (generic error)
Oct  9 14:18:38 rhel6-node2 rgmanager[13591]: #68: Failed to start service:test; return value: 1
Oct  9 14:18:38 rhel6-node2 rgmanager[13591]: Stopping service service:test
Oct  9 14:18:38 rhel6-node2 rgmanager[14666]: [script] Executing /root/script stop
Oct  9 14:18:38 rhel6-node2 rgmanager[13591]: Service service:test is recovering

Oct  9 14:18:40 rhel6-node3 rgmanager[8721]: #70: Failed to relocate service:test; restarting locally
Oct  9 14:18:40 rhel6-node3 rgmanager[8721]: Recovering failed service service:test
Oct  9 14:18:40 rhel6-node3 rgmanager[9559]: [script] Executing /root/script start
Oct  9 14:18:40 rhel6-node3 rgmanager[8721]: Service service:test started


Expected results: The service goes back to a stopped state if it cannot start on any domain member, and the local node does not try to start it if it is not in the domain.
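
A check for the expected result could be scripted against clustat output, roughly as below. This is a sketch with hypothetical helper names; it assumes the whitespace-separated service line layout shown in the output above, and it assumes that a last owner printed in parentheses (as clustat does for a service that is not running) should have the parentheses stripped.

```python
def service_owner(clustat_text, service):
    """Return (owner, state) for a service line in clustat output, or None."""
    for line in clustat_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == service:
            # Assumption: a non-running service shows its last owner
            # in parentheses; strip them.
            return fields[1].strip("()"), fields[-1]
    return None


def violates_restricted_domain(clustat_text, service, domain_members):
    """True if the service is started on a node outside the given domain."""
    found = service_owner(clustat_text, service)
    if found is None:
        return False
    owner, state = found
    return state == "started" and owner not in domain_members
```

Fed the clustat output from this report and the members of domain "1then2", the second helper reports a violation, because service:test is started on rhel6-node3.example.com.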


Additional info:

Comment 1 John Ruemker 2014-10-09 18:47:03 UTC

Created attachment 945425
rgmanager: Do not restart a service locally after failed relocation if failover domains prohibit it

Test results with the patch applied:


Oct  9 14:14:37 rhel6-node1 rgmanager[26366]: Starting stopped service service:test
Oct  9 14:14:37 rhel6-node1 rgmanager[27637]: [script] Executing /root/script start
Oct  9 14:14:37 rhel6-node1 rgmanager[27658]: [script] script:script: start of /root/script failed (returned 1)
Oct  9 14:14:37 rhel6-node1 rgmanager[26366]: start on script "script" returned 1 (generic error)
Oct  9 14:14:37 rhel6-node1 rgmanager[26366]: #68: Failed to start service:test; return value: 1
Oct  9 14:14:37 rhel6-node1 rgmanager[26366]: Stopping service service:test
Oct  9 14:14:37 rhel6-node1 rgmanager[27687]: [script] Executing /root/script stop
Oct  9 14:14:37 rhel6-node1 rgmanager[26366]: Service service:test is recovering

Oct  9 14:14:37 rhel6-node2 rgmanager[12348]: Recovering failed service service:test
Oct  9 14:14:38 rhel6-node2 rgmanager[13379]: [script] Executing /root/script start
Oct  9 14:14:38 rhel6-node2 rgmanager[13400]: [script] script:script: start of /root/script failed (returned 1)
Oct  9 14:14:38 rhel6-node2 rgmanager[12348]: start on script "script" returned 1 (generic error)
Oct  9 14:14:38 rhel6-node2 rgmanager[12348]: #68: Failed to start service:test; return value: 1
Oct  9 14:14:38 rhel6-node2 rgmanager[12348]: Stopping service service:test
Oct  9 14:14:38 rhel6-node2 rgmanager[13429]: [script] Executing /root/script stop
Oct  9 14:14:38 rhel6-node2 rgmanager[12348]: Service service:test is recovering

Oct  9 14:14:40 rhel6-node3 rgmanager[7778]: Failed to relocate service:test; giving up
Oct  9 14:14:40 rhel6-node3 rgmanager[7778]: Stopping service service:test
Oct  9 14:14:40 rhel6-node3 rgmanager[8625]: [script] Executing /root/script stop
Oct  9 14:14:40 rhel6-node3 rgmanager[7778]: Service service:test is stopped

Comment 5 errata-xmlrpc 2015-07-22 07:32:40 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1402.html
