Description of problem:
Configure a new service and try to start it; it exits with status 1 (because the
service was not properly configured). The service fails over to the 2nd node and
tries to start there, but also fails with status 1 (because the configuration was
not properly set up there either). The service fails back to the first node, then
enters the 'recovering' state (in clustat output) and stays that way
indefinitely. Trying to stop/disable the service via the GUI or clusvcadm has no
effect.
Version-Release number of selected component (if applicable):
The service must be a new service being started for the first time. The same
error does not occur if the service is already known or has previously worked -
in that case it correctly enters the 'stopped' state after failing to start on
both nodes.
Steps to Reproduce:
1. Set up a new service so that it returns 1 (error) on start (a minimal
sketch follows this list)
2. Configure it the same on all nodes, i.e. it will exit with 1 on start
3. Configure the service in system-config-cluster / cluster.conf and propagate
to all nodes
4. Try to start the service
5. It will try to start on the first node, fail over to the second node, fail
back to the first node, and then get stuck in the 'recovering' state
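For reference, a minimal sketch of a service script that reproduces step 1.
The script name and path are illustrative, not from this report; any script
resource whose 'start' exits 1 should behave the same way:

  #!/bin/sh
  # /etc/init.d/failing-svc - init script that always fails to start,
  # simulating the misconfigured service described above
  case "$1" in
      start)
          echo "failing-svc: start failed (simulated misconfiguration)"
          exit 1    # non-zero exit on start is what triggers the failover
          ;;
      stop|status)
          exit 0    # stop/status succeed so rgmanager can clean up
          ;;
      *)
          echo "Usage: $0 {start|stop|status}"
          exit 1
          ;;
  esac

In cluster.conf this would be referenced as a <script> resource inside the new
<service>, then propagated to all nodes, e.g. with 'ccs_tool update
/etc/cluster/cluster.conf' on the RHEL4-era cluster tools assumed here.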
Actual results (from clustat):

  Service Name     Owner (Last)     State
  ------- ----     ----- ------     -----
  transport1-gw    none             recovering
Expected results:
The service should fail to start on both nodes and then enter the 'stopped'
state. Once it is in the 'stopped' state, clusvcadm can try to start the
service again after you have fixed the incorrect configuration (see the
example below).
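A sketch of that expected recovery flow, using the service name from the
clustat output above and clusvcadm's standard -e (enable) flag:

  # confirm the service is in the 'stopped' state
  clustat
  # fix the broken service configuration on all nodes, then start it again:
  clusvcadm -e transport1-gw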
Sent an email to firstname.lastname@example.org entitled 'service stuck in
'recovering' state on both nodes' on Aug 28. Attached the output of:

  gdb /usr/sbin/clurgmgrd `pidof clurgmgrd`
  thr a a bt

... and /proc/cluster/dlm_debug, as requested in the cluster-list email
response.
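For anyone collecting the same data, the backtrace can also be captured
non-interactively; a sketch (the output file names are arbitrary):

  # dump backtraces of all clurgmgrd threads without an interactive gdb session
  gdb -batch -ex 'thread apply all bt' /usr/sbin/clurgmgrd `pidof clurgmgrd` \
      > clurgmgrd-backtrace.txt 2>&1
  # snapshot the DLM debug buffer at the same time
  cat /proc/cluster/dlm_debug > dlm_debug.txt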
Created attachment 185661
gdb backtrace on one node
Created attachment 185671
gdb backtrace on second node
Created attachment 185681
/proc/cluster/dlm_debug on one node
Created attachment 185691
/proc/cluster/dlm_debug on second node
This looks like a race between node 1 reconfiguring and node 2 reconfiguring.
The easiest thing to do here is to synchronize reconfiguration.

Node 1 reconfigures, gets new resource(s)
Node 1 decides to start resources, but fails
Node 1 stops resources
Node 1 tells node 2 to start resources
Node 2 says "Ehhhhh?" (it has not reconfigured yet, so it does not recognize
the new resources and rejects the start request)
Node 2 reconfigures, gets new resources - but by then the start request is gone
Service gets stuck in recovering state
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential inclusion
in a Red Hat Enterprise Linux Update release for currently deployed products.
This request is not yet committed for inclusion in an Update release.
Fixing this would be too invasive and may require adding config versions to rgmanager messages (e.g. "try again if your config version is newer than mine").
As a workaround, you can disable and then enable the service again after the
configuration transition is complete on all nodes (see the sketch below).
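A sketch of that workaround, using the service name from this report and the
standard clusvcadm -d/-e (disable/enable) flags:

  # wait for the cluster.conf update to complete on every node, then:
  clusvcadm -d transport1-gw    # disable the service stuck in 'recovering'
  clusvcadm -e transport1-gw    # try to start it again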