Bug 193859 - rgmanager relocates a service forgetting to umount the file-systems
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: rgmanager
Version: 4
Hardware: All
OS: Linux
medium
high
Target Milestone: ---
Assignee: Lon Hohberger
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2006-06-02 11:33 UTC by Navid Sheikhol-Eslami
Modified: 2009-04-16 20:20 UTC (History)

Fixed In Version: RHBA-2006-0557
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-08-10 21:24:28 UTC
Embargoed:


Attachments
patch for rgmanager-1.9.46 (630 bytes, patch)
2006-06-02 11:33 UTC, Navid Sheikhol-Eslami
clulog with original rgmanager (3.27 KB, text/plain)
2006-06-02 11:34 UTC, Navid Sheikhol-Eslami
clulog with patched rgmanager (4.22 KB, text/plain)
2006-06-02 11:34 UTC, Navid Sheikhol-Eslami
cluster.conf used for testing (Apache service) (2.12 KB, text/plain)
2006-06-02 11:37 UTC, Navid Sheikhol-Eslami
patch for rgmanager-1.9.46 (take two) (571 bytes, patch)
2006-06-14 07:06 UTC, Navid Sheikhol-Eslami


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2006:0557 0 normal SHIPPED_LIVE rgmanager bug fix update 2006-08-10 04:00:00 UTC

Description Navid Sheikhol-Eslami 2006-06-02 11:33:14 UTC
Description of problem:

If a service needs to be relocated (for example due to a failure) and a resource
fails to stop (for example due to a missing script), the service could be
relocated without first unmounting the shared file-systems, causing a concurrent
mount from two nodes and possibly data corruption.

Version-Release number of selected component (if applicable):

rgmanager-1.9.46

How reproducible:

It is very easily reproducible: create a service with a child script resource
and a file-system resource. If the script goes missing, the service will try to
stop before relocating, but because the script is missing the shared
file-system is never unmounted before the service is migrated.

Steps to Reproduce:
1. create a service with script and file-system resources
2. rename the script
3. let the service relocate
4. check mount output on both nodes
  
Actual results:

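In src/daemons/restree.c, _do_child_levels stops executing and returns an error
as soon as a child fails, regardless of the operation. During a "stop", the
remaining children (including the file-system) are therefore never stopped.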

Expected results:

In the case of a "stop" operation, _do_child_levels should iterate through all
resources within the tree and stop them (or at least try), and only then return
an error.
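
The expected behaviour can be sketched roughly as below. This is only an
illustration of the "keep going on stop, report the first error afterwards"
idea; it is not the code from restree.c or from the attached patch. All names
(struct child, run_child, do_children, OP_START, OP_STOP) are hypothetical, and
the children are walked in array order purely for simplicity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; not rgmanager's own. */
enum op { OP_START, OP_STOP };

struct child {
    const char *name;
    int stop_rc;    /* code this child's agent returns on stop */
    int start_rc;   /* code this child's agent returns on start */
    int attempted;  /* set to 1 once an operation was tried on it */
};

/* Stand-in for invoking a resource agent on one child. */
int run_child(struct child *c, enum op op)
{
    c->attempted = 1;
    return op == OP_STOP ? c->stop_rc : c->start_rc;
}

/*
 * Walk every child. For any operation other than stop, abort on the
 * first failure. For stop, keep going so later children (such as a
 * file-system) are still stopped, then return the first error seen.
 */
int do_children(struct child *children, size_t n, enum op op)
{
    int first_err = 0;

    assert(children != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        int rc = run_child(&children[i], op);

        if (rc == 0)
            continue;
        if (op != OP_STOP)
            return rc;          /* fail fast on start and friends */
        if (first_err == 0)
            first_err = rc;     /* remember it, but keep stopping */
    }
    return first_err;
}
```

With a script child whose stop returns 5 followed by a file-system child,
do_children(..., OP_STOP) in this sketch still attempts the file-system and
then returns 5 to the caller.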

Additional info:

I am attaching a patch that fixes this behaviour.

I'm also attaching some clulog debug outputs which show how the file-system is
never unmounted (before patch) and when it is (after patch), as well as the
cluster.conf used for these tests.

Comment 1 Navid Sheikhol-Eslami 2006-06-02 11:33:15 UTC
Created attachment 130395 [details]
patch for rgmanager-1.9.46

Comment 2 Navid Sheikhol-Eslami 2006-06-02 11:34:29 UTC
Created attachment 130396 [details]
clulog with original rgmanager

Comment 3 Navid Sheikhol-Eslami 2006-06-02 11:34:59 UTC
Created attachment 130397 [details]
clulog with patched rgmanager

Comment 4 Navid Sheikhol-Eslami 2006-06-02 11:37:15 UTC
Created attachment 130398 [details]
cluster.conf used for testing (Apache service)

Comment 5 Lon Hohberger 2006-06-02 14:01:35 UTC
Ack! It should go into the failed state in this case and not relocate at all.

Comment 6 Lon Hohberger 2006-06-02 14:59:06 UTC
Ok, so, this only happens with not-installed return code (5) on scripts. 
Unfortunately, the patch breaks the resource-agent error case (1) - causing a
service to restart/relocate when it should move to the failed state.

Jun  2 11:02:15 red clurgmgrd[27680]: <notice> Stopping service errortest
Jun  2 11:02:15 red clurgmgrd: [27680]: <info> Executing /errortest stop
Jun  2 11:02:15 red clurgmgrd[27680]: <notice> stop on script "errortest" returned 1 (generic error)
Jun  2 11:02:15 red clurgmgrd[27680]: <notice> Service errortest is recovering
Jun  2 11:02:15 red clurgmgrd[27680]: <notice> Recovering failed service errortest
Jun  2 11:02:15 red clurgmgrd: [27680]: <info> Executing /errortest start
Jun  2 11:02:15 red clurgmgrd[27680]: <notice> Service errortest started
Jun  2 11:02:25 red clurgmgrd: [27680]: <info> Executing /errortest status
Jun  2 11:02:25 red clurgmgrd[27680]: <notice> status on script "errortest" returned 1 (generic error)
Jun  2 11:02:25 red clurgmgrd[27680]: <notice> Stopping service errortest
Jun  2 11:02:25 red clurgmgrd: [27680]: <info> Executing /errortest stop
Jun  2 11:02:25 red clurgmgrd[27680]: <notice> stop on script "errortest" returned 1 (generic error)

Because of the location of the patch, it is likely that services whose unmount
fails (returning 1) will also be relocated / restarted erroneously.

Maybe it's better to just call return code (5) a fatal error like (1) (which you
also suggested)?

Comment 7 Lon Hohberger 2006-06-02 15:00:46 UTC
FWIW, returning (1) at any point on current rgmanager has the following effect:

Jun  2 10:57:13 red clurgmgrd[26047]: <notice> Stopping service errortest
Jun  2 10:57:13 red clurgmgrd: [26047]: <info> Executing /errortest stop
Jun  2 10:57:13 red clurgmgrd[26047]: <notice> stop on script "errortest" returned 1 (generic error)
Jun  2 10:57:13 red clurgmgrd[26047]: <crit> #12: RG errortest failed to stop; intervention required
Jun  2 10:57:13 red clurgmgrd[26047]: <notice> Service errortest is failed


Comment 9 Navid Sheikhol-Eslami 2006-06-14 07:06:41 UTC
Created attachment 130807 [details]
patch for rgmanager-1.9.46  (take two)

You are right, errors weren't properly returned for RS_STOP.

I am attaching a new patch which fixes this.

I tested it with both OCF_RA_ERROR (1) and OCF_RA_NOT_CONFIGURED (5), and it's
behaving as expected: OCF_RA_ERROR sends the service to the failed state and
OCF_RA_NOT_CONFIGURED doesn't, but most importantly it *does* umount the
file-systems.

-- Navid
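
The outcome described in this comment can be summarised with a small sketch.
It is a reading of the thread, not code from the patch or from rgmanager: the
names RC_GENERIC_ERROR, RC_CODE_5, svc_outcome and outcome_after_stop are
hypothetical, and treating every other non-zero code like (1) is an assumption
made only to keep the example complete.

```c
/* Hypothetical constants; only the numeric values 1 and 5 come from the thread. */
enum { RC_OK = 0, RC_GENERIC_ERROR = 1, RC_CODE_5 = 5 };

enum svc_outcome {
    SVC_STOPPED,     /* stop succeeded */
    SVC_FAILED,      /* failed state; intervention required */
    SVC_NOT_FAILED   /* stop reported (5), service is not marked failed */
};

/*
 * Map the overall return code of a stop (collected after all children
 * were attempted, so file-systems have been unmounted where possible)
 * to the resulting service outcome.
 */
enum svc_outcome outcome_after_stop(int stop_rc)
{
    switch (stop_rc) {
    case RC_OK:
        return SVC_STOPPED;
    case RC_CODE_5:
        return SVC_NOT_FAILED;
    default:
        /* Assumption: any other non-zero code is handled like (1). */
        return SVC_FAILED;
    }
}
```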

Comment 10 Lon Hohberger 2006-06-15 17:54:25 UTC
confirmed

Comment 11 Lon Hohberger 2006-06-16 20:04:39 UTC
I applied this to STABLE and RHEL4 CVS branches.

Comment 14 Red Hat Bugzilla 2006-08-10 21:24:33 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0557.html


