Bug 621018

Summary:	Luci does not maintain service restart limit configuration
Product:	Red Hat Enterprise Linux 5	Reporter:	Alan Staples <alan.staples>
Component:	conga	Assignee:	Ryan McCabe <rmccabe>
Status:	CLOSED NOTABUG	QA Contact:	Cluster QE <mspqa-list>
Severity:	low	Docs Contact:
Priority:	low
Version:	5.5	CC:	alan.staples, cluster-maint, jha
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2011-01-26 14:58:25 UTC	Type:	---

Description Alan Staples 2010-08-04 01:38:21 UTC

Description of problem:
The luci administration pages for "service" objects have a field called "Maximum number of restart failures before relocating". Even for a service that is in a failover domain with other nodes and is configured with a "Relocate" failover policy, luci accepts input in this field, but saving changes to the service does not commit this particular change.

Version-Release number of selected component (if applicable): luci-0.12.2-12.el5


How reproducible:
Reliably reproducible

Steps to Reproduce:
1. Edit a service that has a failover policy of "Relocate".
2. Enter a positive integer into the field "Maximum number of restart failures before relocating".
3. Click "save changes".
4. Open the service again, or view /etc/cluster/cluster.conf: the number of failures allowed is still effectively 0.
  
Actual results:


Expected results:


Additional info:
My cluster is built within a VMware-server environment. Both cluster nodes are on the same physical VMware server. I am using a virtual private network for cluster communication. I have a QDisk.

Comment 2 Ryan McCabe 2010-11-16 18:56:10 UTC

The max_restarts attribute is only relevant when the recovery policy is restart. The UI needs to be fixed to disable these fields when the recovery policy is something other than restart.
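
The rule stated above can be written as a one-line predicate. The following is only an illustrative sketch: the function name restart_limits_apply is hypothetical and does not come from luci or rgmanager, and it assumes the recovery policy is available as the lowercase string "restart", as this comment spells it.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper (not part of luci or rgmanager): returns 1 when the
 * restart-limit fields are meaningful for the given recovery policy,
 * 0 otherwise. Per the comment above, only "restart" qualifies.
 */
int restart_limits_apply(const char *recovery_policy)
{
    assert(recovery_policy != NULL);
    return strcmp(recovery_policy, "restart") == 0;
}
```

A UI following this rule would leave the restart-limit fields disabled whenever the predicate returns 0, for example when the policy is "relocate".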

Comment 3 Alan Staples 2010-11-16 21:21:06 UTC

(In reply to comment #2)
> The max_restarts attribute is only relevant when the recovery policy is
> restart. The UI needs to be fixed to disable these fields when the recovery
> policy is something other than restart.

The luci GUI states "Maximum number of restart failures before relocating", which indicates to me that this should actually only be valid with a relocate policy. That makes sense to me: attempt to restart before relocating the server, since restarting may well fix the problem and relocating is a relatively expensive process.

What you're saying is that this is actually the maximum number of restart attempts for a service before disabling the service group on that particular node?

I can't find a reference to this parameter, or even to the feature, in the current Red Hat Cluster Administration Guide.

Comment 4 Ryan McCabe 2010-11-16 22:09:07 UTC

What you stated above is correct, to the best of my knowledge: restart X times, then relocate if restart fails each time. I can't find any good documentation, either, but here's a snippet from the rgmanager patch that added the feature, that confirms the explanation above:

+       /* Check restart counter/timer for this resource */
+       if (check_restart(svcName) > 0) {
+               clulog(LOG_NOTICE, "Restart threshold for %s exceeded; "
+                      "attempting to relocate\n", svcName);
+               return handle_relocate_req(svcName, RG_START_RECOVER, -1,
+                                          new_owner);

Comment 5 Lon Hohberger 2011-01-26 00:21:23 UTC

Restart counters only apply when you are using the "restart" recovery policy.

The restart counter is per-host, and is zeroed each time the service is relocated - either manually or as a consequence of a failure recovery action.

That is, when "max_restarts" is exceeded within the given "restart_expire_time", rgmanager will relocate the failing service to another host in the cluster, at which point the restart counter is reset.
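
The behaviour described above (count restarts inside a time window, relocate once the limit is exceeded, then start from zero on the new host) can be sketched as follows. This is a minimal illustration and not rgmanager's actual code: the struct, the rc_* function names, and the MAX_TRACKED cap are all hypothetical, and the sketch assumes max_restarts and restart_expire_time are a non-negative count and a positive number of seconds.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define MAX_TRACKED 32 /* hypothetical cap on remembered restarts */

/* Hypothetical per-host restart tracker; not rgmanager's real structure. */
struct restart_counter {
    int    max_restarts;        /* restarts tolerated inside the window */
    time_t expire_time;         /* window length in seconds */
    size_t count;               /* restarts currently remembered */
    time_t stamps[MAX_TRACKED]; /* restart times, oldest first */
};

void rc_init(struct restart_counter *rc, int max_restarts, time_t expire_time)
{
    assert(rc != NULL);
    assert(max_restarts >= 0 && max_restarts < MAX_TRACKED);
    assert(expire_time > 0);
    rc->max_restarts = max_restarts;
    rc->expire_time = expire_time;
    rc->count = 0;
}

/* Forget restarts that are older than the expiry window. */
static void rc_purge(struct restart_counter *rc, time_t now)
{
    size_t keep = 0;
    for (size_t i = 0; i < rc->count; i++) {
        if (now - rc->stamps[i] < rc->expire_time)
            rc->stamps[keep++] = rc->stamps[i];
    }
    rc->count = keep;
}

/*
 * Record a restart at time `now`. Returns 1 when more than max_restarts
 * restarts fall inside the window (the caller should relocate), else 0.
 */
int rc_record_restart(struct restart_counter *rc, time_t now)
{
    assert(rc != NULL);
    rc_purge(rc, now);
    if (rc->count == MAX_TRACKED)
        return 1; /* already over the limit, since max_restarts < MAX_TRACKED */
    rc->stamps[rc->count++] = now;
    return rc->count > (size_t)rc->max_restarts;
}

/* Zero the counter, as happens when the service moves to another host. */
void rc_reset(struct restart_counter *rc)
{
    assert(rc != NULL);
    rc->count = 0;
}
```

With rc_init(&rc, 2, 300), two restarts within five minutes are tolerated and the third returns 1; calling rc_reset after the relocation gives the service a fresh allowance on the new host.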

While this is not Red Hat documentation, it is quite accurate in describing how rgmanager's recovery policies work:

http://sources.redhat.com/cluster/wiki/ServicePolicies