Previously, when relocating a service, the rgmanager utility tried every node in the failover domain and, if all of them failed, restarted the service locally. It did so without checking whether the local node was eligible to run the service, and regardless of whether the service had been started before the relocation. Consequently, under certain circumstances, a service in a restricted domain could be started on a non-member node. With this update, if the service cannot be started on any domain member, the service goes back to a stopped state, and rgmanager no longer attempts to start the service on a local node outside the restricted domain.
Description of problem: When relocating a service, rgmanager will try the best nodes first and, if all else fails, will restart the service locally (even if it wasn't started to begin with). However, it doesn't actually check whether the local node is eligible to run the service, so you can end up with a service in a restricted domain starting on a node outside that domain.
A customer encountered this in the only way I was able to conceive that this could happen, and admittedly the conditions under which it's possible are fairly narrow. If the service is in a stopped state, you issue 'clusvcadm -r <service>' from a non-domain-member without specifying a destination, and all of the members of that domain fail to start the service, then once all members are exhausted the node where the command was run will restart it locally. If the start happens to succeed there where it failed on the other nodes, the service will stay running on this node, which is not a member of the domain.
Version-Release number of selected component (if applicable): rgmanager-3.0.12.1-19.el6
How reproducible: Easily with contrived conditions, probably rarely in the wild
Steps to Reproduce:
1. Create a restricted failoverdomain with one node left out.
2. Create a service that can be configured to fail on demand at startup. I use a script resource that fails on start if a special file is present, and succeeds for every other action. Put the service in the failover domain.
3. Configure all nodes in the failoverdomain to fail the service on start.
4. Put the service into a stopped state:
# clusvcadm -s <service>
5. On the non-domain-member, run 'clusvcadm -r <service>'
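For reference, the script resource from step 2 can be approximated as follows. This is a minimal sketch, not the exact script used here: the flag file path /root/fail-on-start and the function name script_action are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical test script for an rgmanager <script> resource.
# FAIL_FLAG is an assumed path; it is not taken from the original report.
FAIL_FLAG="${FAIL_FLAG:-/root/fail-on-start}"

script_action() {
    case "$1" in
        start)
            # Fail the start only while the flag file is present.
            if [ -e "$FAIL_FLAG" ]; then
                return 1
            fi
            return 0
            ;;
        *)
            # stop, status, and any other action always succeed.
            return 0
            ;;
    esac
}

# Dispatch only when an action was passed on the command line.
if [ "$#" -gt 0 ]; then
    script_action "$1"
    exit $?
fi
```

With something like this installed as /root/script, creating the flag file on each domain member (step 3) makes every start attempt there fail, while the node left out of the domain starts the service normally.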
Actual results:
<failoverdomains>
<failoverdomain name="1then2" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="rhel6-node1.example.com" priority="1"/>
<failoverdomainnode name="rhel6-node2.example.com" priority="2"/>
</failoverdomain>
<failoverdomain name="2then1" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="rhel6-node1.example.com" priority="2"/>
<failoverdomainnode name="rhel6-node2.example.com" priority="1"/>
</failoverdomain>
</failoverdomains>
<service domain="1then2" name="test">
<script file="/root/script" name="script"/>
</service>
# clusvcadm -r test
Trying to relocate service:test...Failed; service running on original owner
# clustat
Cluster Status for rhel6-cluster @ Thu Oct 9 14:19:49 2014
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
rhel6-node1.example.com 1 Online, rgmanager
rhel6-node2.example.com 2 Online, rgmanager
rhel6-node3.example.com 3 Online, Local, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:test rhel6-node3.example.com started
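A quick way to pull the current owner out of output like the above, so it can be compared against the failoverdomainnode entries. This is only a sketch: service_owner is a hypothetical helper, and it assumes whitespace-separated columns with the service name first and the owner second.

```shell
# Hypothetical helper: print the owner column for one service from
# clustat-style text on stdin. Assumes the service name is the first
# whitespace-separated field and the owner is the second.
service_owner() {
    awk -v svc="$1" '$1 == svc { print $2 }'
}
```

Used as 'clustat | service_owner service:test'; here the result is rhel6-node3.example.com, which is not listed in the "1then2" domain.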
Oct 9 14:18:37 rhel6-node1 rgmanager[27937]: Starting stopped service service:test
Oct 9 14:18:37 rhel6-node1 rgmanager[29176]: [script] Executing /root/script start
Oct 9 14:18:37 rhel6-node1 rgmanager[29197]: [script] script:script: start of /root/script failed (returned 1)
Oct 9 14:18:37 rhel6-node1 rgmanager[27937]: start on script "script" returned 1 (generic error)
Oct 9 14:18:37 rhel6-node1 rgmanager[27937]: #68: Failed to start service:test; return value: 1
Oct 9 14:18:37 rhel6-node1 rgmanager[27937]: Stopping service service:test
Oct 9 14:18:37 rhel6-node1 rgmanager[29226]: [script] Executing /root/script stop
Oct 9 14:18:37 rhel6-node1 rgmanager[27937]: Service service:test is recovering
Oct 9 14:18:10 rhel6-node2 rgmanager[13591]: Service service:test is recovering
Oct 9 14:18:37 rhel6-node2 rgmanager[13591]: Recovering failed service service:test
Oct 9 14:18:38 rhel6-node2 rgmanager[14616]: [script] Executing /root/script start
Oct 9 14:18:38 rhel6-node2 rgmanager[14637]: [script] script:script: start of /root/script failed (returned 1)
Oct 9 14:18:38 rhel6-node2 rgmanager[13591]: start on script "script" returned 1 (generic error)
Oct 9 14:18:38 rhel6-node2 rgmanager[13591]: #68: Failed to start service:test; return value: 1
Oct 9 14:18:38 rhel6-node2 rgmanager[13591]: Stopping service service:test
Oct 9 14:18:38 rhel6-node2 rgmanager[14666]: [script] Executing /root/script stop
Oct 9 14:18:38 rhel6-node2 rgmanager[13591]: Service service:test is recovering
Oct 9 14:18:40 rhel6-node3 rgmanager[8721]: #70: Failed to relocate service:test; restarting locally
Oct 9 14:18:40 rhel6-node3 rgmanager[8721]: Recovering failed service service:test
Oct 9 14:18:40 rhel6-node3 rgmanager[9559]: [script] Executing /root/script start
Oct 9 14:18:40 rhel6-node3 rgmanager[8721]: Service service:test started
Expected results: The service goes back to a stopped state if it can't start on any domain member, and the local node doesn't try to start it if it's not in the domain.
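The expected decision can be sketched as below. This is an illustration of the behavior described here, not rgmanager's actual code: the function name, its arguments, and the "stop" / "restart-locally" results are all hypothetical.

```shell
# Hypothetical helper: decide what the requesting node should do once
# every domain member has failed to start the service.
#   $1 = local node name
#   $2 = "1" if the failover domain is restricted, anything else otherwise
#   remaining arguments = names of the domain members
after_failed_relocation() {
    local_node="$1"
    restricted="$2"
    shift 2
    for member in "$@"; do
        # The local node is a domain member, so it is eligible.
        if [ "$member" = "$local_node" ]; then
            echo "restart-locally"
            return 0
        fi
    done
    if [ "$restricted" = "1" ]; then
        # Not a member of a restricted domain: give up and stop the service.
        echo "stop"
    else
        # An unrestricted domain does not forbid running on other nodes.
        echo "restart-locally"
    fi
}
```

For the configuration in this report, 'after_failed_relocation rhel6-node3.example.com 1 rhel6-node1.example.com rhel6-node2.example.com' prints "stop".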
Additional info:
Created attachment 945425
rgmanager: Do not restart a service locally after failed relocation if failover domains prohibit it
Test results:
Oct 9 14:14:37 rhel6-node1 rgmanager[26366]: Starting stopped service service:test
Oct 9 14:14:37 rhel6-node1 rgmanager[27637]: [script] Executing /root/script start
Oct 9 14:14:37 rhel6-node1 rgmanager[27658]: [script] script:script: start of /root/script failed (returned 1)
Oct 9 14:14:37 rhel6-node1 rgmanager[26366]: start on script "script" returned 1 (generic error)
Oct 9 14:14:37 rhel6-node1 rgmanager[26366]: #68: Failed to start service:test; return value: 1
Oct 9 14:14:37 rhel6-node1 rgmanager[26366]: Stopping service service:test
Oct 9 14:14:37 rhel6-node1 rgmanager[27687]: [script] Executing /root/script stop
Oct 9 14:14:37 rhel6-node1 rgmanager[26366]: Service service:test is recovering
Oct 9 14:14:37 rhel6-node2 rgmanager[12348]: Recovering failed service service:test
Oct 9 14:14:38 rhel6-node2 rgmanager[13379]: [script] Executing /root/script start
Oct 9 14:14:38 rhel6-node2 rgmanager[13400]: [script] script:script: start of /root/script failed (returned 1)
Oct 9 14:14:38 rhel6-node2 rgmanager[12348]: start on script "script" returned 1 (generic error)
Oct 9 14:14:38 rhel6-node2 rgmanager[12348]: #68: Failed to start service:test; return value: 1
Oct 9 14:14:38 rhel6-node2 rgmanager[12348]: Stopping service service:test
Oct 9 14:14:38 rhel6-node2 rgmanager[13429]: [script] Executing /root/script stop
Oct 9 14:14:38 rhel6-node2 rgmanager[12348]: Service service:test is recovering
Oct 9 14:14:40 rhel6-node3 rgmanager[7778]: Failed to relocate service:test; giving up
Oct 9 14:14:40 rhel6-node3 rgmanager[7778]: Stopping service service:test
Oct 9 14:14:40 rhel6-node3 rgmanager[8625]: [script] Executing /root/script stop
Oct 9 14:14:40 rhel6-node3 rgmanager[7778]: Service service:test is stopped
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHBA-2015-1402.html