621018 – Luci does not maintain service restart limit configuration

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	conga
Sub Component:
Version:	5.5
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	rc
Target Release:	---
Assignee:	Ryan McCabe
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2010-08-04 01:38 UTC by Alan Staples
Modified:	2011-01-26 14:58 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-01-26 14:58:25 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Description Alan Staples 2010-08-04 01:38:21 UTC

Description of problem:
The luci administration page for "service" objects has a field called "Maximum number of restart failures before relocating". Even for a service that is in a failover domain with other nodes and is configured with a "Relocate" failover policy, luci accepts input into this field, but saving changes to the service does not commit this particular change.

Version-Release number of selected component (if applicable): luci-0.12.2-12.el5


How reproducible:
Reliably reproducible

Steps to Reproduce:
1. edit a service that has a failover policy of "Relocate"
2. enter a positive integer into the field "Maximum number of restart failures before relocating"
3. click "save changes"
4. open the service again, or view /etc/cluster/cluster.conf - the number of failures allowed is still effectively 0
  
Actual results:


Expected results:


Additional info:
My cluster is built within a VMware-server environment. Both cluster nodes are on the same physical VMware server. I am using a virtual private network for cluster communication. I have a QDisk.

Comment 2 Ryan McCabe 2010-11-16 18:56:10 UTC

The max_restarts attribute is only relevant when the recovery policy is restart. The UI needs to be fixed to disable these fields when the recovery policy is something other than restart.
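
For illustration, here is a minimal Python sketch of how one might check a cluster.conf for services where max_restarts is set but cannot take effect. The sample XML, the service names, and the helper name ineffective_restart_limits are all hypothetical. The sketch also assumes that a service with no recovery attribute falls back to the restart policy.

```python
import xml.etree.ElementTree as ET

# Hypothetical cluster.conf fragment, for illustration only.
SAMPLE_CONF = """\
<cluster name="example" config_version="1">
  <rm>
    <service name="web" recovery="relocate" max_restarts="3" restart_expire_time="300"/>
    <service name="db" recovery="restart" max_restarts="2" restart_expire_time="600"/>
    <service name="nfs" recovery="relocate"/>
  </rm>
</cluster>
"""


def ineffective_restart_limits(conf_xml):
    """Return names of services that set max_restarts without recovery="restart"."""
    root = ET.fromstring(conf_xml)
    flagged = []
    for svc in root.iter("service"):
        limit = int(svc.get("max_restarts", "0"))
        # Assumption: a missing recovery attribute means the restart policy.
        policy = svc.get("recovery", "restart")
        if limit > 0 and policy != "restart":
            flagged.append(svc.get("name"))
    return flagged


print(ineffective_restart_limits(SAMPLE_CONF))
```

With the sample above, only the "web" service is flagged, since it combines a relocate policy with a non-zero max_restarts.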

Comment 3 Alan Staples 2010-11-16 21:21:06 UTC

(In reply to comment #2)
> The max_restarts attribute is only relevant when the recovery policy is
> restart. The UI needs to be fixed to disable these fields when the recovery
> policy is something other than restart.

The luci GUI states "Maximum number of restart failures before relocating", which indicates to me that this should actually only be valid with a relocate policy. That makes sense to me - attempt to restart before relocating the service, since restarting will likely fix the problem and relocating is a relatively expensive process.

What you're saying is that this is actually the maximum number of restart attempts for a service before disabling the service group on that particular node?

I can't find a reference to this parameter, or even to the feature, in the current Red Hat Cluster Administration Guide.

Comment 4 Ryan McCabe 2010-11-16 22:09:07 UTC

What you stated above is correct, to the best of my knowledge: restart X times, then relocate if the restart fails each time. I can't find any good documentation either, but here's a snippet from the rgmanager patch that added the feature, which confirms the explanation above:

+       /* Check restart counter/timer for this resource */
+       if (check_restart(svcName) > 0) {
+               clulog(LOG_NOTICE, "Restart threshold for %s exceeded; "
+                      "attempting to relocate\n", svcName);
+               return handle_relocate_req(svcName, RG_START_RECOVER, -1,
+                                          new_owner);

Comment 5 Lon Hohberger 2011-01-26 00:21:23 UTC

Restart counters only apply when you are using the "restart" recovery policy.

The restart counter is per-host, and is zeroed each time the service is relocated - either manually or as a consequence of a failure recovery action.

That is, when "max_restarts" is exceeded within the given "restart_expire_time", rgmanager will relocate the failing service to another host in the cluster, at which point the restart counter is reset.

While this is not Red Hat documentation, it is quite accurate in describing how rgmanager's recovery policies work:

http://sources.redhat.com/cluster/wiki/ServicePolicies
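
The counter behaviour described above can be sketched roughly as follows. This is an illustrative Python model, not rgmanager's actual C implementation. The class name RestartCounter and the method on_failure are hypothetical, and the exact boundary handling (whether a restart exactly restart_expire_time old still counts) is an assumption.

```python
from collections import deque


class RestartCounter:
    """Illustrative per-host restart counter (not rgmanager's real code)."""

    def __init__(self, max_restarts, restart_expire_time):
        self.max_restarts = max_restarts
        self.restart_expire_time = restart_expire_time  # seconds
        self._restarts = deque()  # timestamps of recent restarts on this host

    def on_failure(self, now):
        """Return "restart" or "relocate" for a failure observed at time `now`."""
        # Forget restarts that are older than the expiry window.
        while self._restarts and now - self._restarts[0] >= self.restart_expire_time:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_restarts:
            # Threshold exceeded: relocate, and the counter starts from zero.
            self._restarts.clear()
            return "relocate"
        self._restarts.append(now)
        return "restart"


counter = RestartCounter(max_restarts=2, restart_expire_time=300)
print([counter.on_failure(t) for t in (0, 10, 20)])
```

In this model, with max_restarts set to 2, the first two failures inside the window are handled by restarting in place and the third triggers a relocation, after which the count begins again.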
