Description of problem:
If a service needs to be relocated (for example due to a failure) and a resource fails to stop (for example due to a missing script), the service can be relocated without first unmounting the shared file-systems. This causes a concurrent mount from two nodes and possibly data corruption.

Version-Release number of selected component (if applicable):
rgmanager-1.9.46

How reproducible:
Very easily: create a service with a child script resource and a file-system resource. If the script goes missing, the service tries to stop before relocating, but because the script is missing the stop is aborted and the shared file-system is never unmounted before the service is moved.

Steps to Reproduce:
1. Create a service with script and file-system resources.
2. Rename the script.
3. Let the service relocate.
4. Check the mount output on both nodes.

Actual results:
In src/daemons/restree.c, _do_child_levels stops execution and returns an error as soon as a child fails, regardless of the operation.

Expected results:
For a "stop" operation, _do_child_levels should iterate through all resources within a tree and stop them (or at least try), and only then return an error.

Additional info:
I am attaching a patch that fixes this behaviour. I'm also attaching some clulog debug outputs which show that the file-system is never unmounted before the patch and is unmounted after it, as well as the cluster.conf used for these tests.
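The expected "stop" behaviour can be illustrated with a minimal sketch. This is not the code from restree.c or from the attached patch: `struct resource`, `stop_rc`, `stop_attempted` and `stop_all` are hypothetical names invented for illustration, and reading `stop_rc` stands in for actually running a resource agent. The point is only that a failing child does not end the loop, and the first error is remembered and returned at the end.

```c
#include <stddef.h>

/* Hypothetical stand-in for a resource in a service tree. */
struct resource {
    const char *name;
    int stop_rc;        /* code a stop of this resource would return */
    int stop_attempted; /* set once a stop has been tried */
};

/*
 * Try to stop every resource, even if an earlier one fails.
 * Returns 0 if all stops succeeded, otherwise the first
 * non-zero return code that was seen.
 */
int stop_all(struct resource *res, size_t n)
{
    int rv = 0;

    for (size_t i = 0; i < n; i++) {
        res[i].stop_attempted = 1;
        int ret = res[i].stop_rc;   /* stand-in for running the agent */

        /* Remember the first error, but keep going. */
        if (ret != 0 && rv == 0)
            rv = ret;
    }
    return rv;
}
```

With a tree of { script (stop returns 5), fs (stop returns 0) }, both resources are attempted and the call returns 5, so the file-system is unmounted even though the script could not be stopped.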
Created attachment 130395 [details] patch for rgmanager-1.9.46
Created attachment 130396 [details] clulog with original rgmanager
Created attachment 130397 [details] clulog with patched rgmanager
Created attachment 130398 [details] cluster.conf used for testing (Apache service)
Ack! It should go into the failed state in this case, and not relocate at all.
Ok, so, this only happens with the not-installed return code (5) on scripts. Unfortunately, the patch breaks the resource-agent error case (1), causing a service to restart/relocate when it should move to the failed state.

Jun 2 11:02:15 red clurgmgrd[27680]: <notice> Stopping service errortest
Jun 2 11:02:15 red clurgmgrd: [27680]: <info> Executing /errortest stop
Jun 2 11:02:15 red clurgmgrd[27680]: <notice> stop on script "errortest" returned 1 (generic error)
Jun 2 11:02:15 red clurgmgrd[27680]: <notice> Service errortest is recovering
Jun 2 11:02:15 red clurgmgrd[27680]: <notice> Recovering failed service errortest
Jun 2 11:02:15 red clurgmgrd: [27680]: <info> Executing /errortest start
Jun 2 11:02:15 red clurgmgrd[27680]: <notice> Service errortest started
Jun 2 11:02:25 red clurgmgrd: [27680]: <info> Executing /errortest status
Jun 2 11:02:25 red clurgmgrd[27680]: <notice> status on script "errortest" returned 1 (generic error)
Jun 2 11:02:25 red clurgmgrd[27680]: <notice> Stopping service errortest
Jun 2 11:02:25 red clurgmgrd: [27680]: <info> Executing /errortest stop
Jun 2 11:02:25 red clurgmgrd[27680]: <notice> stop on script "errortest" returned 1 (generic error)

Because of where the patch is applied, it is likely that services with unmount failures (returning 1) will also be restarted or relocated erroneously. Maybe it's better to just treat return code (5) as a fatal error like (1) (which you also suggested)?
FWIW, returning (1) at any point on current rgmanager has the following effect:

Jun 2 10:57:13 red clurgmgrd[26047]: <notice> Stopping service errortest
Jun 2 10:57:13 red clurgmgrd: [26047]: <info> Executing /errortest stop
Jun 2 10:57:13 red clurgmgrd[26047]: <notice> stop on script "errortest" returned 1 (generic error)
Jun 2 10:57:13 red clurgmgrd[26047]: <crit> #12: RG errortest failed to stop; intervention required
Jun 2 10:57:13 red clurgmgrd[26047]: <notice> Service errortest is failed
Created attachment 130807 [details] patch for rgmanager-1.9.46 (take two)

You are right, errors weren't properly returned for RS_STOP. I am attaching a new patch which fixes this. I tested it with both OCF_RA_ERROR (1) and OCF_RA_NOT_CONFIGURED (5), and it's behaving as expected: OCF_RA_ERROR sends the service to the failed state and OCF_RA_NOT_CONFIGURED doesn't, but most importantly it *does* umount the file-systems.

-- Navid
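The behaviour discussed in this thread (a stop returning 1 sends the service to the failed state, a stop returning 5 does not, and every resource is attempted in both cases) could be summarised by a sketch like the one below. It is an illustration only, not the contents of the attached patch. The names `RC_OK`, `RC_GENERIC_ERROR`, `RC_SCRIPT_MISSING`, `enum outcome` and `classify_stop` are hypothetical, and treating any code other than 0 or 5 as fatal is an assumption, since only codes 1 and 5 are covered here.

```c
#include <stddef.h>

/* Hypothetical names for the return codes discussed in this report. */
enum { RC_OK = 0, RC_GENERIC_ERROR = 1, RC_SCRIPT_MISSING = 5 };

/* Hypothetical service outcomes after a stop attempt. */
enum outcome { SVC_STOPPED, SVC_RECOVERABLE, SVC_FAILED };

/*
 * Classify a stop from the return codes of all resources, after
 * every resource has been attempted.
 * Assumption: any code other than 0 or 5 is treated as fatal.
 */
enum outcome classify_stop(const int *rcs, size_t n)
{
    enum outcome out = SVC_STOPPED;

    for (size_t i = 0; i < n; i++) {
        if (rcs[i] == RC_OK)
            continue;
        if (rcs[i] == RC_SCRIPT_MISSING) {
            /* Not fatal on its own; never downgrade a failure. */
            if (out != SVC_FAILED)
                out = SVC_RECOVERABLE;
        } else {
            out = SVC_FAILED;   /* e.g. RC_GENERIC_ERROR */
        }
    }
    return out;
}
```

Under these assumptions, the codes {5, 0} (missing script, clean unmount) give SVC_RECOVERABLE, while {5, 1} (missing script, failed unmount) give SVC_FAILED.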
confirmed
I applied this to STABLE and RHEL4 CVS branches.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0557.html