Bug 193859 - rgmanager relocates a service forgetting to umount the file-systems
Status: CLOSED ERRATA
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: rgmanager
Version: 4
Hardware: All Linux
Priority: medium, Severity: high
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Reported: 2006-06-02 07:33 EDT by Navid Sheikhol-Eslami
Modified: 2009-04-16 16:20 EDT
Fixed In Version: RHBA-2006-0557
Doc Type: Bug Fix
Last Closed: 2006-08-10 17:24:28 EDT

Attachments
patch for rgmanager-1.9.46 (630 bytes, patch)
2006-06-02 07:33 EDT, Navid Sheikhol-Eslami
clulog with original rgmanager (3.27 KB, text/plain)
2006-06-02 07:34 EDT, Navid Sheikhol-Eslami
clulog with patched rgmanager (4.22 KB, text/plain)
2006-06-02 07:34 EDT, Navid Sheikhol-Eslami
cluster.conf used for testing (Apache service) (2.12 KB, text/plain)
2006-06-02 07:37 EDT, Navid Sheikhol-Eslami
patch for rgmanager-1.9.46 (take two) (571 bytes, patch)
2006-06-14 03:06 EDT, Navid Sheikhol-Eslami

Description Navid Sheikhol-Eslami 2006-06-02 07:33:14 EDT
Description of problem:

If a service needs to be relocated (for example due to a failure) and a resource
fails to stop (for example due to a missing script), the service could be
relocated without first unmounting the shared file-systems, causing a concurrent
mount from two nodes and possibly data corruption.

Version-Release number of selected component (if applicable):

rgmanager-1.9.46

How reproducible:

It is very easily reproducible: create a service with a child script resource
and a file-system resource. If the script goes missing, the service will try to
stop before relocating, but because stopping the script fails, the shared
file-system is never unmounted before the service is relocated.

Steps to Reproduce:
1. create a service with script and file-system resources
2. rename the script
3. let the service relocate
4. check mount output on both nodes
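
Step 4 can be checked by hand, or with a small helper along these lines. This is only a sketch: it assumes the classic `<device> on <mountpoint> type <fstype> (<options>)` layout of `mount` output, and the device name and sample outputs are hypothetical. Mount points containing spaces are not handled.

```python
from typing import Set


def mounted_devices(mount_output: str) -> Set[str]:
    """Return the device names found in the text output of `mount`."""
    devices = set()
    for line in mount_output.splitlines():
        parts = line.split()
        # Expected shape: <device> on <mountpoint> type <fstype> (<options>)
        if len(parts) >= 5 and parts[1] == "on" and parts[3] == "type":
            devices.add(parts[0])
    return devices


def concurrently_mounted(shared: Set[str], out_a: str, out_b: str) -> Set[str]:
    """Return the shared devices that appear as mounted on both nodes."""
    return shared & mounted_devices(out_a) & mounted_devices(out_b)


# Hypothetical `mount` output captured on each node after the relocation.
node_a = """/dev/sda1 on / type ext3 (rw)
/dev/sdb1 on /mnt/shared type ext3 (rw)"""
node_b = """/dev/sda1 on / type ext3 (rw)
/dev/sdb1 on /mnt/shared type ext3 (rw)"""

# Only /dev/sdb1 is declared as shared storage; local disks are ignored.
double_mounts = concurrently_mounted({"/dev/sdb1"}, node_a, node_b)
print(sorted(double_mounts))
```

A non-empty result means the shared file-system is mounted on both nodes at once, which is the condition described in this report.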
  
Actual results:

In src/daemons/restree.c, _do_child_levels stops execution and returns an error
as soon as a child fails, regardless of the operation.

Expected results:

In the case of a "stop" operation, _do_child_levels should iterate through all
resources within a tree and stop them (or at least try), and only then return
an error.
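
The expected behaviour can be modelled roughly as follows. This is a Python sketch of the idea only, not the code in restree.c or the attached patch: `stop_children`, the flat list of children, and the `/mnt/shared` mount point are all hypothetical, and the real resource tree is nested rather than flat.

```python
from typing import Callable, List, Tuple


def stop_children(
    children: List[Tuple[str, Callable[[], int]]]
) -> Tuple[int, List[str]]:
    """Try to stop every child; return (first non-zero code or 0, names attempted)."""
    first_error = 0
    attempted = []
    for name, stop in children:
        rc = stop()
        attempted.append(name)
        if rc != 0 and first_error == 0:
            first_error = rc  # remember the error, but keep going
    return first_error, attempted


# Hypothetical example: the script is missing (code 5), the file system stops fine.
mounted = {"/mnt/shared"}


def stop_script() -> int:
    return 5  # the script cannot be run


def stop_fs() -> int:
    mounted.discard("/mnt/shared")  # stands in for the real umount
    return 0


rc, attempted = stop_children([("script", stop_script), ("fs", stop_fs)])
print(rc, attempted, mounted)
```

The point of the sketch is the ordering: the error from the script is still reported to the caller, but only after the file-system stop has been attempted.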

Additional info:

I am attaching a patch which fixes this behaviour.

I'm also attaching some clulog debug outputs which show how the file-system is
never unmounted (before patch) and when it is (after patch), as well as the
cluster.conf used for these tests.
Comment 1 Navid Sheikhol-Eslami 2006-06-02 07:33:15 EDT
Created attachment 130395 [details]
patch for rgmanager-1.9.46
Comment 2 Navid Sheikhol-Eslami 2006-06-02 07:34:29 EDT
Created attachment 130396 [details]
clulog with original rgmanager
Comment 3 Navid Sheikhol-Eslami 2006-06-02 07:34:59 EDT
Created attachment 130397 [details]
clulog with patched rgmanager
Comment 4 Navid Sheikhol-Eslami 2006-06-02 07:37:15 EDT
Created attachment 130398 [details]
cluster.conf used for testing (Apache service)
Comment 5 Lon Hohberger 2006-06-02 10:01:35 EDT
Ack! It should go into the failed state in this case, and not relocate at all.
Comment 6 Lon Hohberger 2006-06-02 10:59:06 EDT
Ok, so, this only happens with not-installed return code (5) on scripts. 
Unfortunately, the patch breaks the resource-agent error case (1) - causing a
service to restart/relocate when it should move to the failed state.

Jun  2 11:02:15 red clurgmgrd[27680]: <notice> Stopping service errortest
Jun  2 11:02:15 red clurgmgrd: [27680]: <info> Executing /errortest stop
Jun  2 11:02:15 red clurgmgrd[27680]: <notice> stop on script "errortest" returned 1 (generic error)
Jun  2 11:02:15 red clurgmgrd[27680]: <notice> Service errortest is recovering
Jun  2 11:02:15 red clurgmgrd[27680]: <notice> Recovering failed service errortest
Jun  2 11:02:15 red clurgmgrd: [27680]: <info> Executing /errortest start
Jun  2 11:02:15 red clurgmgrd[27680]: <notice> Service errortest started
Jun  2 11:02:25 red clurgmgrd: [27680]: <info> Executing /errortest status
Jun  2 11:02:25 red clurgmgrd[27680]: <notice> status on script "errortest" returned 1 (generic error)
Jun  2 11:02:25 red clurgmgrd[27680]: <notice> Stopping service errortest
Jun  2 11:02:25 red clurgmgrd: [27680]: <info> Executing /errortest stop
Jun  2 11:02:25 red clurgmgrd[27680]: <notice> stop on script "errortest" returned 1 (generic error)

Because of where the patch is applied, it is likely that services whose unmount
fails (returning 1) will also be relocated / restarted erroneously.

Maybe it's better to just call return code (5) a fatal error like (1) (which you
also suggested)?
Comment 7 Lon Hohberger 2006-06-02 11:00:46 EDT
FWIW, returning (1) at any point on current rgmanager has the following effect:

Jun  2 10:57:13 red clurgmgrd[26047]: <notice> Stopping service errortest
Jun  2 10:57:13 red clurgmgrd: [26047]: <info> Executing /errortest stop
Jun  2 10:57:13 red clurgmgrd[26047]: <notice> stop on script "errortest" returned 1 (generic error)
Jun  2 10:57:13 red clurgmgrd[26047]: <crit> #12: RG errortest failed to stop; intervention required
Jun  2 10:57:13 red clurgmgrd[26047]: <notice> Service errortest is failed
Comment 9 Navid Sheikhol-Eslami 2006-06-14 03:06:41 EDT
Created attachment 130807 [details]
patch for rgmanager-1.9.46  (take two)

You are right, errors weren't properly returned for RS_STOP.

I am attaching a new patch which fixes this.

I tested it with both OCF_RA_ERROR (1) and OCF_RA_NOT_CONFIGURED (5), and it's
behaving as expected: OCF_RA_ERROR sends the service to the failed state and
OCF_RA_NOT_CONFIGURED doesn't, but most importantly it *does* umount the
file-systems.

-- Navid
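
The outcome described in this comment can be summarised as a small decision function. It is a sketch under stated assumptions, not rgmanager's logic: the constant names, the function name, and the "stopped" label for the non-failed outcome are hypothetical, and return codes other than 0, 1 and 5 are rejected because this report does not discuss them.

```python
from typing import Iterable

RC_OK = 0
RC_GENERIC_ERROR = 1   # "generic error" in the logs above
RC_NOT_INSTALLED = 5   # the code returned when the script is missing (Comment 6)

KNOWN_CODES = {RC_OK, RC_GENERIC_ERROR, RC_NOT_INSTALLED}


def service_state_after_stop(return_codes: Iterable[int]) -> str:
    """Map the stop return codes of all children to a resulting service state."""
    codes = list(return_codes)
    unknown = [rc for rc in codes if rc not in KNOWN_CODES]
    if unknown:
        raise ValueError(f"return codes not covered by this sketch: {unknown}")
    if RC_GENERIC_ERROR in codes:
        return "failed"   # intervention required, no relocation
    return "stopped"      # code 5 alone does not mark the service as failed


print(service_state_after_stop([RC_NOT_INSTALLED, RC_OK]))
print(service_state_after_stop([RC_GENERIC_ERROR, RC_OK]))
```

In both cases the list holds one code per child, which reflects the point of the second patch: every child, including the file-system, has had its stop attempted before the state is decided.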
Comment 10 Lon Hohberger 2006-06-15 13:54:25 EDT
confirmed
Comment 11 Lon Hohberger 2006-06-16 16:04:39 EDT
I applied this to STABLE and RHEL4 CVS branches.
Comment 14 Red Hat Bugzilla 2006-08-10 17:24:33 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0557.html
