Bug 275751

Summary: New service gets stuck in recovering state if start script exits with status 1
Product: [Retired] Red Hat Cluster Suite
Reporter: Mark Huth <mhuth>
Component: rgmanager
Assignee: Lon Hohberger <lhh>
Status: CLOSED WONTFIX
QA Contact: Cluster QE <mspqa-list>
Severity: low
Priority: medium
Version: 4
CC: cfeist, cluster-maint, tao
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Doc Type: Bug Fix
Last Closed: 2008-12-03 17:06:39 UTC
Attachments:
  Description                                  Flags
  gdb backtrace on one node                    none
  gdb backtrace on second node                 none
  /proc/cluster/dlm_debug on one node          none
  /proc/cluster/dlm_debug on second node       none

Description Mark Huth 2007-09-04 05:46:33 UTC
Description of problem:
Configure a new service and try to start it. The start script exits with
status 1 (because the service wasn't properly configured).  The service fails
over to the 2nd node and tries to start there, but also fails with status 1
(because the configuration wasn't properly set up there either).  The service
fails back to the first node, then enters the 'recovering' state (in clustat
output) and stays that way indefinitely.
Trying to stop/disable the service via the GUI or clusvcadm has no effect.

Version-Release number of selected component (if applicable):
rgmanager-1.9.68-1

How reproducible:
The service must be new and being started for the first time.  The problem
does not occur if the service is already known or has previously worked: in
that case it correctly enters the stopped state after failing to start on
both nodes.

Steps to Reproduce:
1. Set up a new service so that it returns 1 (error) on start
2. Configure it the same way on all nodes, i.e. it exits with 1 on start
3. Configure the service in system-config-cluster / cluster.conf and propagate
to all nodes
4. Try to start the service
5. It will try to start on the first node, fail over to the second node, fail
back to the first node and then get stuck in the 'recovering' state.
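
For steps 1 and 2, a start script that always fails can be as small as the
sketch below.  It assumes the service wraps a script that rgmanager invokes
with start, stop and status arguments in the usual init-script style.  The
file name failing-service.sh and its location are hypothetical and are not
taken from the original report.

```shell
#!/bin/sh
# Hypothetical reproduction aid: writes a stand-in service script whose
# "start" action always exits with status 1, as in steps 1 and 2.
# The path below is an assumption for illustration only.
script="${TMPDIR:-/tmp}/failing-service.sh"

cat > "$script" <<'EOF'
#!/bin/sh
case "$1" in
    start)
        echo "start: simulated configuration error" >&2
        exit 1      # the failure that makes the service relocate
        ;;
    stop)
        exit 0      # stop succeeds, so the service can move to another node
        ;;
    status)
        exit 3      # LSB convention: program is not running
        ;;
    *)
        echo "usage: $0 {start|stop|status}" >&2
        exit 2
        ;;
esac
EOF
chmod +x "$script"
```

The same script would need to be copied to every node so that the start
fails identically everywhere.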
  
Actual results (from clustat):
  Service Name         Owner (Last)                   State  
  ------- ----         ----- ------                   -----  
  transport1-gw        none                           recovering


Expected results:
The service should fail to start on both nodes and then enter the 'stopped'
state.  Once it is in the 'stopped' state, clusvcadm can be used to start the
service again after the incorrect configuration has been fixed.

Additional info:
Sent an email to cluster-list entitled 'service stuck in 'recovering'
state on both nodes' on Aug 28.  Attached the output of:

gdb /usr/sbin/clurgmgrd `pidof clurgmgrd`
thr a a bt

... and /proc/cluster/dlm_debug as requested in cluster-list email response.

Comment 1 Mark Huth 2007-09-04 05:47:23 UTC
Created attachment 185661
gdb backtrace on one node

Comment 2 Mark Huth 2007-09-04 05:47:49 UTC
Created attachment 185671
gdb backtrace on second node

Comment 3 Mark Huth 2007-09-04 05:48:19 UTC
Created attachment 185681
/proc/cluster/dlm_debug on one node

Comment 4 Mark Huth 2007-09-04 05:48:56 UTC
Created attachment 185691
/proc/cluster/dlm_debug on second node

Comment 5 Lon Hohberger 2007-09-18 21:00:28 UTC
This looks like a race between node 1 reconfiguring and node 2 reconfiguring.  
The easiest thing to do here is synchronize reconfiguration.

Ex:

Node 1 reconfigures, gets new resource(s)
Node 1 decides to start resources, but fails -
Node 1 stops resources
Node 1 tells node 2 to start resources
Node 2 says "Ehhhhh?"
Node 2 reconfigures, gets new resources.

Service gets stuck in recovering state.


Comment 8 RHEL Program Management 2007-10-16 03:42:46 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 10 Lon Hohberger 2008-12-03 17:06:39 UTC
Fixing this would be too invasive and may require adding config versions to rgmanager messages (e.g. "try again if your config version is newer than mine").

You can disable and enable the service again after the transition is complete.
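
The disable/enable workaround might look like the sketch below.  clusvcadm -d
and clusvcadm -e are the disable and enable operations.  The cycle_service
helper and the CLUSVCADM override variable are hypothetical names introduced
here for illustration, and the service name in the usage comment is the one
shown in the clustat output in the description.

```shell
#!/bin/sh
# Hypothetical helper: disable a service, then enable it again.
# CLUSVCADM can be overridden (an assumption, handy for dry runs).
CLUSVCADM="${CLUSVCADM:-clusvcadm}"

cycle_service() {
    svc="${1:-}"
    if [ -z "$svc" ]; then
        echo "usage: cycle_service <service-name>" >&2
        return 2
    fi
    # Disable first; only try to enable if the disable succeeded.
    "$CLUSVCADM" -d "$svc" || return 1
    "$CLUSVCADM" -e "$svc"
}

# Usage, once the configuration transition has completed on all nodes:
#   cycle_service transport1-gw
```

The enable step will fail again, as expected, if the underlying start script
still exits with status 1, so the configuration should be fixed first.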