Description of problem:
Configure a new service and try to start it; it exits with status 1 (because the
service was not properly configured). The service fails over to the 2nd node and
tries to start there, but also fails with status 1 (because the configuration was
not properly set up there either). The service fails back to the first node, then
enters the 'recovering' state (in clustat output) and stays that way
indefinitely. Trying to stop/disable the service via the GUI or clusvcadm has no
effect.
Version-Release number of selected component (if applicable):
The service must be a new service being started for the first time. The same
error does not occur if the service is already known or has previously worked -
in that case it correctly enters the 'stopped' state after failing to start on
both nodes.
Steps to Reproduce:
1. Set up a new service so that it returns 1 (error) on start (a minimal
sketch follows this list)
2. Configure it the same on all nodes, i.e. it will exit with 1 on start
3. Configure the service in system-config-cluster / cluster.conf and propagate
to all nodes
4. Try to start the service
5. It will try to start on the first node, fail over to the second node, fail
back to the first node, and then get stuck in the 'recovering' state
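For reference, a minimal sketch of a service script that reproduces step 1.
The script name and path are illustrative, not from this report; any script
resource whose 'start' exits 1 should behave the same way:

  #!/bin/sh
  # /etc/init.d/failing-svc - init script that always fails to start,
  # simulating the misconfigured service described above
  case "$1" in
      start)
          echo "failing-svc: start failed (simulated misconfiguration)"
          exit 1    # non-zero exit on start is what triggers the failover
          ;;
      stop|status)
          exit 0    # stop/status succeed so rgmanager can clean up
          ;;
      *)
          echo "Usage: $0 {start|stop|status}"
          exit 1
          ;;
  esac

In cluster.conf this would be referenced as a <script> resource inside the new
<service>, then propagated to all nodes, e.g. with 'ccs_tool update
/etc/cluster/cluster.conf' on the RHEL4-era cluster tools assumed here.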
Actual results (from clustat):

  Service Name     Owner (Last)     State
  ------- ----     ----- ------     -----
  transport1-gw    none             recovering
Expected results:
The service should fail to start on both nodes and then enter the 'stopped'
state. Once it is in the 'stopped' state, clusvcadm can try to start the
service again after you have fixed the incorrect configuration (see the
example below).
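A sketch of that expected recovery flow, using the service name from the
clustat output above and clusvcadm's standard -e (enable) flag:

  # confirm the service is in the 'stopped' state
  clustat
  # fix the broken service configuration on all nodes, then start it again:
  clusvcadm -e transport1-gw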
Sent an email to firstname.lastname@example.org entitled 'service stuck in
'recovering' state on both nodes' on Aug 28. Attached the output of:

  gdb /usr/sbin/clurgmgrd `pidof clurgmgrd`
  thr a a bt

... and /proc/cluster/dlm_debug, as requested in the cluster-list email
response.
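For anyone collecting the same data, the backtrace can also be captured
non-interactively; a sketch (the output file names are arbitrary):

  # dump backtraces of all clurgmgrd threads without an interactive gdb session
  gdb -batch -ex 'thread apply all bt' /usr/sbin/clurgmgrd `pidof clurgmgrd` \
      > clurgmgrd-backtrace.txt 2>&1
  # snapshot the DLM debug buffer at the same time
  cat /proc/cluster/dlm_debug > dlm_debug.txt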
Created attachment 185661
gdb backtrace on one node
Created attachment 185671
gdb backtrace on second node
Created attachment 185681
/proc/cluster/dlm_debug on one node
Created attachment 185691
/proc/cluster/dlm_debug on second node
This looks like a race between node 1 reconfiguring and node 2 reconfiguring.
The easiest thing to do here is to synchronize reconfiguration.

Node 1 reconfigures, gets new resource(s)
Node 1 decides to start resources, but fails
Node 1 stops resources
Node 1 tells node 2 to start resources
Node 2 says "Ehhhhh?" (it has not reconfigured yet, so it does not recognize
the new resources and rejects the start request)
Node 2 reconfigures, gets new resources - but by then the start request is gone
Service gets stuck in recovering state
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential inclusion
in a Red Hat Enterprise Linux Update release for currently deployed products.
This request is not yet committed for inclusion in an Update release.
Fixing this would be too invasive and may require adding config versions to rgmanager messages (e.g. "try again if your config version is newer than mine").
As a workaround, you can disable and then enable the service again after the
configuration transition is complete on all nodes (see the sketch below).
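A sketch of that workaround, using the service name from this report and the
standard clusvcadm -d/-e (disable/enable) flags:

  # wait for the cluster.conf update to complete on every node, then:
  clusvcadm -d transport1-gw    # disable the service stuck in 'recovering'
  clusvcadm -e transport1-gw    # try to start it again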