Description of problem:
Configure a new service and try to start it; it exits with status 1 (because the
service was not properly configured). The service fails over to the 2nd node and
tries to start there, but also fails with status 1 (because the configuration was
not properly set up there either). The service fails back to the first node, then
enters the 'recovering' state (in clustat output) and stays that way
indefinitely. Trying to stop/disable the service via the GUI or clusvcadm has no
effect.
Version-Release number of selected component (if applicable):
The service must be a new service being started for the first time. The same
error does not occur if the service is already known or has previously worked -
in that case it correctly enters the 'stopped' state after failing to start on
both nodes.
Steps to Reproduce:
1. Set up a new service so that it returns 1 (error) on start (a minimal
sketch follows this list)
2. Configure it the same on all nodes, i.e. it will exit with 1 on start
3. Configure the service in system-config-cluster / cluster.conf and propagate
to all nodes
4. Try to start the service
5. It will try to start on the first node, fail over to the second node, fail
back to the first node, and then get stuck in the 'recovering' state
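For reference, a minimal sketch of a service script that reproduces step 1.
The script name and path are illustrative, not from this report; any script
resource whose 'start' exits 1 should behave the same way:

  #!/bin/sh
  # /etc/init.d/failing-svc - init script that always fails to start,
  # simulating the misconfigured service described above
  case "$1" in
      start)
          echo "failing-svc: start failed (simulated misconfiguration)"
          exit 1    # non-zero exit on start is what triggers the failover
          ;;
      stop|status)
          exit 0    # stop/status succeed so rgmanager can clean up
          ;;
      *)
          echo "Usage: $0 {start|stop|status}"
          exit 1
          ;;
  esac

In cluster.conf this would be referenced as a <script> resource inside the new
<service>, then propagated to all nodes, e.g. with 'ccs_tool update
/etc/cluster/cluster.conf' on the RHEL4-era cluster tools assumed here.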
Actual results (from clustat):

  Service Name     Owner (Last)     State
  ------- ----     ----- ------     -----
  transport1-gw    none             recovering
Expected results:
The service should fail to start on both nodes and then enter the 'stopped'
state. Once it is in the 'stopped' state, clusvcadm can try to start the
service again after you have fixed the incorrect configuration (see the
example below).
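A sketch of that expected recovery flow, using the service name from the
clustat output above and clusvcadm's standard -e (enable) flag:

  # confirm the service is in the 'stopped' state
  clustat
  # fix the broken service configuration on all nodes, then start it again:
  clusvcadm -e transport1-gw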
Sent an email to firstname.lastname@example.org entitled 'service stuck in
'recovering' state on both nodes' on Aug 28. Attached the output of:

  gdb /usr/sbin/clurgmgrd `pidof clurgmgrd`
  thr a a bt

... and /proc/cluster/dlm_debug, as requested in the cluster-list email
response.
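For anyone collecting the same data, the backtrace can also be captured
non-interactively; a sketch (the output file names are arbitrary):

  # dump backtraces of all clurgmgrd threads without an interactive gdb session
  gdb -batch -ex 'thread apply all bt' /usr/sbin/clurgmgrd `pidof clurgmgrd` \
      > clurgmgrd-backtrace.txt 2>&1
  # snapshot the DLM debug buffer at the same time
  cat /proc/cluster/dlm_debug > dlm_debug.txt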
Created attachment 185661
gdb backtrace on one node
Created attachment 185671
gdb backtrace on second node
Created attachment 185681
/proc/cluster/dlm_debug on one node
Created attachment 185691
/proc/cluster/dlm_debug on second node
This looks like a race between node 1 reconfiguring and node 2 reconfiguring.
The easiest thing to do here is to synchronize reconfiguration.

Node 1 reconfigures, gets new resource(s)
Node 1 decides to start resources, but fails
Node 1 stops resources
Node 1 tells node 2 to start resources
Node 2 says "Ehhhhh?" (it has not reconfigured yet, so it does not recognize
the new resources and rejects the start request)
Node 2 reconfigures, gets new resources - but by then the start request is gone
Service gets stuck in recovering state
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential inclusion
in a Red Hat Enterprise Linux Update release for currently deployed products.
This request is not yet committed for inclusion in an Update release.
Fixing this would be too invasive and may require adding config versions to rgmanager messages (e.g. "try again if your config version is newer than mine").
As a workaround, you can disable and then enable the service again after the
configuration transition is complete on all nodes (see the sketch below).
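A sketch of that workaround, using the service name from this report and the
standard clusvcadm -d/-e (disable/enable) flags:

  # wait for the cluster.conf update to complete on every node, then:
  clusvcadm -d transport1-gw    # disable the service stuck in 'recovering'
  clusvcadm -e transport1-gw    # try to start it again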