Bug 101609 - Strange Failover behavior
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: clumanager
Version: 2.1
Hardware: i686
OS: Linux
Priority: medium
Severity: high
Assigned To: Lon Hohberger
Blocks: 87937
 
Reported: 2003-08-04 14:12 EDT by Arsene Gschwind
Modified: 2007-11-30 17:06 EST

Fixed In Version: 1.0.25-1
Doc Type: Bug Fix
Last Closed: 2003-12-10 11:21:49 EST


Attachments
Don't relocate when start + stop fail in sequence (1.28 KB, patch)
2003-08-04 15:54 EDT, Lon Hohberger

Description Arsene Gschwind 2003-08-04 14:12:34 EDT

Description of problem:
I've set up a failover cluster using RHAS 2.1 on 2 Intel Xeon nodes connected to
FC shared storage.

During the setup of my failover cluster I made some changes to the service
startup script which resulted in the service not starting: the script returned
an invalid return code. (I've found the problem.)

It seems that when the cluster daemon is not able to start the service on one
node, it tries on the other and then returns an error message. After that I
realized that my shared partition was mounted on both systems at the same time.
I could reproduce this behavior.
This behavior can be dangerous, because having two systems accessing the same
drive at the same time may corrupt the filesystem.

Version-Release number of selected component (if applicable):
clumanager-1.0.19-2

How reproducible:
Always

Steps to Reproduce:
1. Create a non-working service startup script (a minimal sketch is shown after these steps).
2. Configure a service using cluadmin with that startup script and a partition to be mounted for that service.
3. Start the service on the first node.
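
For illustration, a script along these lines is enough to trigger it (this is
only a sketch, not my actual script; the point is just that the "start" action
exits with an unexpected return code):

#!/bin/sh
# Deliberately broken service script (illustrative sketch):
# "start" exits with a code that is not treated as success.
case "$1" in
  start)
    # real application startup would go here
    exit 3    # unexpected return code -> start is reported as failed
    ;;
  stop)
    # real application shutdown would go here
    exit 0
    ;;
  *)
    exit 0
    ;;
esac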

Actual Results:  When the start returns, you will see that your partition is
mounted on both nodes.
    

Expected Results:  Normally, neither system should have that partition mounted.

Additional info:
Comment 1 Lon Hohberger 2003-08-04 15:15:56 EDT
Acknowledged.


Comment 2 Lon Hohberger 2003-08-04 15:37:41 EDT
This happens because when we start a service, its components are started in one order:

devices
filesystems
NFS
IP addresses
samba
user script

And stopped in the reverse order:

stop user script
stop samba
...

Actually, the fact that the partition isn't getting unmounted isn't a problem:
if the user script is broken, you have to assume the worst case, namely that the
application still needs the partitions, etc. for some reason.
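
Roughly speaking, the stop side behaves like the sketch below (illustrative
helper names only, not the actual clumanager code): a failure early in the
reverse-order stop sequence means the later steps, including the unmount, are
never reached.

# stop components in reverse start order; assume the worst on any failure
stop_user_script    || exit 1   # a broken user script fails here...
stop_samba          || exit 1
stop_ip_addresses   || exit 1
stop_nfs            || exit 1
unmount_filesystems || exit 1   # ...so the partition is never unmounted
stop_devices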
 
However, the service never should have been started on the second member.  The
expected behavior for service starts is:

Normal:
start service -> success returned
DONE (Service Running)

Semi-broken start (one node fails):
start service -> error returned
stop service -> success returned
(Send service to other member)
other member: start service -> success returned
DONE (Service Running)

Broken start (both nodes fail):
start service -> error returned
stop service -> success returned
(Send service to other member)
other member: start service -> error returned
other member: stop service -> success returned
other member: disable service.
DONE (Service Disabled)


Very broken start:
start service -> error returned
stop service -> error returned
disable service. (Do NOT try to start on the other member)
DONE (Service Disabled)
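
Condensed into pseudo-code (again, a sketch with illustrative names, not the
actual clumanager logic), the intended per-member policy is:

if start_service; then
    : # service running (Normal case)
elif stop_service; then
    send_service_to_other_member   # Semi-broken/Broken start: try the other
                                   # member, which disables the service if its
                                   # own start fails as well
else
    # Very broken start: stop failed too, do NOT try the other member
    disable_service
fi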


What it sounds like is that the last one ("Very broken start") is not working
correctly.  That is, the "start" and "stop" phases of the service script are
returning errors, but we're trying on both members anyway, leaving the
partitions mounted.  Is this correct?




Comment 3 Lon Hohberger 2003-08-04 15:54:30 EDT
Created attachment 93386
Don't relocate when start + stop fail in sequence
Comment 6 Lon Hohberger 2003-08-05 14:52:50 EDT
I have tested & verified that the above fix addresses the problem.  The fix is
slated for inclusion in the next erratum of Red Hat Cluster Manager.
Comment 7 Arsene Gschwind 2003-08-06 06:24:05 EDT
>What it sounds like is that the last one ("Very broken start") is not working
>correctly.  That is, the "start" and "stop" phases of the service script are
>returning errors, but we're trying on both members anyway, leaving the
>partitions mounted.  Is this correct?

Yes, that was the case. I will try the new version you sent me today.

Thanks


Comment 8 Suzanne Hillman 2003-10-07 17:05:22 EDT
Fix exists, adding to U3 blocker bug list.
