193859 – rgmanager relocates a service forgetting to umount the file-systems

Summary: rgmanager relocates a service forgetting to umount the file-systems

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	rgmanager
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Lon Hohberger
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2006-06-02 11:33 UTC by Navid Sheikhol-Eslami
Modified:	2009-04-16 20:20 UTC
CC List:	2 users
Fixed In Version:	RHBA-2006-0557
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-08-10 21:24:28 UTC
Embargoed:

Attachments
patch for rgmanager-1.9.46 (630 bytes, patch) 2006-06-02 11:33 UTC, Navid Sheikhol-Eslami
clulog with original rgmanager (3.27 KB, text/plain) 2006-06-02 11:34 UTC, Navid Sheikhol-Eslami
clulog with patched rgmanager (4.22 KB, text/plain) 2006-06-02 11:34 UTC, Navid Sheikhol-Eslami
cluster.conf used for testing (Apache service) (2.12 KB, text/plain) 2006-06-02 11:37 UTC, Navid Sheikhol-Eslami
patch for rgmanager-1.9.46 (take two) (571 bytes, patch) 2006-06-14 07:06 UTC, Navid Sheikhol-Eslami

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2006:0557	0	normal	SHIPPED_LIVE	rgmanager bug fix update	2006-08-10 04:00:00 UTC

Description Navid Sheikhol-Eslami 2006-06-02 11:33:14 UTC

Description of problem:

If a service needs to be relocated (for example due to a failure) and a resource
fails to stop (for example due to a missing script), the service could be
relocated without first unmounting the shared file-systems, causing a concurrent
mount from two nodes and possibly data corruption.

Version-Release number of selected component (if applicable):

rgmanager-1.9.46

How reproducible:

The problem is very easy to reproduce: create a service with a child script
resource and a file-system resource. If the script goes missing, the service
tries to stop before relocating, but because the script's stop operation fails,
the shared file-system is never unmounted before the service is relocated.

Steps to Reproduce:
1. create a service with script and file-system resources
2. rename the script
3. let the service relocate
4. check mount output on both nodes
  
Actual results:

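In src/daemons/restree.c, _do_child_levels stops execution and returns an error
as soon as a child fails, regardless of the operation. The shared file-system
therefore stays mounted on the original node while the service is relocated.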

Expected results:

For a "stop" operation, _do_child_levels should iterate through all resources
in the tree and stop them (or at least try), and only then return an error if
any of them failed.
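
For illustration, the expected stop behaviour can be sketched as below. This is
a minimal, self-contained sketch and not the actual rgmanager code or the
attached patch: the names op, child, do_children, stub_missing_script and
stub_fs are hypothetical, and the child list is flattened into a plain array
instead of a resource tree.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operation codes; the real rgmanager definitions differ. */
enum op { OP_START, OP_STOP };

/* Hypothetical child resource: a callback that returns 0 on success. */
struct child {
    int (*act)(enum op op, void *ctx);
    void *ctx;
};

/*
 * Run `op` on every child in order.
 * For OP_STOP a failing child does not abort the walk: the first non-zero
 * code is remembered and returned once all remaining children were tried.
 * For any other operation the first failure is returned immediately.
 */
int do_children(struct child *kids, size_t n, enum op op)
{
    int first_err = 0;

    assert(kids != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        int rv = kids[i].act(op, kids[i].ctx);

        if (rv == 0)
            continue;
        if (op != OP_STOP)
            return rv;          /* start and friends: fail fast */
        if (first_err == 0)
            first_err = rv;     /* stop: remember the error, keep going */
    }
    return first_err;
}

/* Stub child that behaves like a missing script: always returns 5. */
int stub_missing_script(enum op op, void *ctx)
{
    (void)op;
    (void)ctx;
    return 5;
}

/* Stub child standing in for a file-system: records that stop was called. */
int stub_fs(enum op op, void *ctx)
{
    if (op == OP_STOP)
        *(int *)ctx = 1;
    return 0;
}
```

With the children ordered as { stub_missing_script, stub_fs }, a stop still
reaches stub_fs even though the script child fails first, and the script's
error code is what the caller gets back.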

Additional info:

I am attaching a patch that fixes this behaviour.

I'm also attaching some clulog debug outputs which show how the file-system is
never unmounted (before patch) and when it is (after patch), as well as the
cluster.conf used for these tests.

Comment 1 Navid Sheikhol-Eslami 2006-06-02 11:33:15 UTC

Created attachment 130395
patch for rgmanager-1.9.46

Comment 2 Navid Sheikhol-Eslami 2006-06-02 11:34:29 UTC

Created attachment 130396
clulog with original rgmanager

Comment 3 Navid Sheikhol-Eslami 2006-06-02 11:34:59 UTC

Created attachment 130397
clulog with patched rgmanager

Comment 4 Navid Sheikhol-Eslami 2006-06-02 11:37:15 UTC

Created attachment 130398
cluster.conf used for testing (Apache service)

Comment 5 Lon Hohberger 2006-06-02 14:01:35 UTC

Ack! it should go into the failed state in this case, and not relocate at all.

Comment 6 Lon Hohberger 2006-06-02 14:59:06 UTC

Ok, so, this only happens with not-installed return code (5) on scripts. 
Unfortunately, the patch breaks the resource-agent error case (1) - causing a
service to restart/relocate when it should move to the failed state.

Jun  2 11:02:15 red clurgmgrd[27680]: <notice> Stopping service errortest
Jun  2 11:02:15 red clurgmgrd: [27680]: <info> Executing /errortest stop
Jun  2 11:02:15 red clurgmgrd[27680]: <notice> stop on script "errortest" returned 1 (generic error)
Jun  2 11:02:15 red clurgmgrd[27680]: <notice> Service errortest is recovering
Jun  2 11:02:15 red clurgmgrd[27680]: <notice> Recovering failed service errortest
Jun  2 11:02:15 red clurgmgrd: [27680]: <info> Executing /errortest start
Jun  2 11:02:15 red clurgmgrd[27680]: <notice> Service errortest started
Jun  2 11:02:25 red clurgmgrd: [27680]: <info> Executing /errortest status
Jun  2 11:02:25 red clurgmgrd[27680]: <notice> status on script "errortest" returned 1 (generic error)
Jun  2 11:02:25 red clurgmgrd[27680]: <notice> Stopping service errortest
Jun  2 11:02:25 red clurgmgrd: [27680]: <info> Executing /errortest stop
Jun  2 11:02:25 red clurgmgrd[27680]: <notice> stop on script "errortest" returned 1 (generic error)

Because of where the patch sits, it is likely that unmount failures (which
return 1) would also cause the service to be relocated / restarted erroneously.

Maybe it's better to just treat return code (5) as a fatal error like (1) (which
you also suggested)?

Comment 7 Lon Hohberger 2006-06-02 15:00:46 UTC

FWIW, returning (1) at any point on current rgmanager has the following effect:

Jun  2 10:57:13 red clurgmgrd[26047]: <notice> Stopping service errortest
Jun  2 10:57:13 red clurgmgrd: [26047]: <info> Executing /errortest stop
Jun  2 10:57:13 red clurgmgrd[26047]: <notice> stop on script "errortest" returned 1 (generic error)
Jun  2 10:57:13 red clurgmgrd[26047]: <crit> #12: RG errortest failed to stop; intervention required
Jun  2 10:57:13 red clurgmgrd[26047]: <notice> Service errortest is failed

Comment 9 Navid Sheikhol-Eslami 2006-06-14 07:06:41 UTC

Created attachment 130807
patch for rgmanager-1.9.46 (take two)

You are right, errors weren't properly returned for RS_STOP.

I am attaching a new patch which fixes this.

I tested it with both OCF_RA_ERROR (1) and OCF_RA_NOT_CONFIGURED (5), and it's
behaving as expected: OCF_RA_ERROR sends the service to the failed state and
OCF_RA_NOT_CONFIGURED doesn't, but most importantly it *does* umount the
file-systems.

-- Navid
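
The outcome described in this comment can be summarised with a small sketch.
It is only an illustration under stated assumptions, not code from the patch:
the names RC_GENERIC_ERROR, RC_CODE_FIVE, svc_outcome and outcome_after_stop
are hypothetical, and treating every return code other than 0 and 5 as fatal is
an assumption, since only codes 1 and 5 were tested here.

```c
#include <assert.h>

/* Hypothetical names for the two return codes discussed in this report. */
#define RC_GENERIC_ERROR 1
#define RC_CODE_FIVE     5

/* Hypothetical outcomes of a stop that has already walked all children. */
enum svc_outcome { SVC_CAN_RELOCATE, SVC_FAILED };

/*
 * Map the aggregated stop result to the service outcome:
 *   0 -> clean stop, the service may be started elsewhere
 *   5 -> not sent to the failed state (file-systems were still unmounted)
 *   1 -> failed state, intervention required
 * Any other code is treated as fatal here; that part is an assumption.
 */
enum svc_outcome outcome_after_stop(int stop_rc)
{
    assert(stop_rc >= 0);   /* exit statuses are never negative */

    switch (stop_rc) {
    case 0:
    case RC_CODE_FIVE:
        return SVC_CAN_RELOCATE;
    case RC_GENERIC_ERROR:
    default:
        return SVC_FAILED;
    }
}
```

The mapping only makes sense after the stop operation has visited every child
resource, so that the file-systems are unmounted regardless of which code is
returned.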

Comment 10 Lon Hohberger 2006-06-15 17:54:25 UTC

confirmed

Comment 11 Lon Hohberger 2006-06-16 20:04:39 UTC

I applied this to STABLE and RHEL4 CVS branches.

Comment 14 Red Hat Bugzilla 2006-08-10 21:24:33 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0557.html
