Bug 137406

Summary:

clusvcmgrd randomly restarts local oracle service

Product:

[Retired] Red Hat Cluster Suite

Reporter:

Fred New <fred.new2911>

Component:

clumanager

Assignee:

Lon Hohberger <lhh>

Status:

CLOSED ERRATA

QA Contact:

Cluster QE <mspqa-list>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

CC:

alessandro.gilardoni, cluster-maint, kanderso, tao

Target Milestone:

---

Target Release:

---

Hardware:

i386

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2005-02-28 21:11:14 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
clusvcmgrd configuration file	none
my local service startup script	none
Requested log file, filtered	none
Requested log file, complete, gzipped	none
Patch which handles other waitpid() conditions	none

Description Fred New 2004-10-28 09:48:37 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3)
Gecko/20040913 Firefox/0.10.1

Description of problem:
If I change service checkinterval to a non-zero value, clusvcmgrd will
occasionally decide that my service isn't running and restart it. 
This happens even though I have set up my script "oraservice status"
to use "exit 0".

Version-Release number of selected component (if applicable):
clumanager-1.2.16-1

How reproducible:
Sometimes

Steps to Reproduce:
1. Set up cluster service manager to automatically restart a local
service if it has failed.
2. Start the local service.
    

Actual Results:  The cluster service manager will occasionally stop
and restart the local service, even though the service is running and
"<scriptname> status" returns a 0.

Expected Results:  The cluster service manager should leave the local
service running if "<scriptname> status" returns 0.

Additional info:

Messages appearing in /var/log/messages are missing a line when
clusvcmgrd restarts the service at the wrong time.

Messages resulting from an incorrect restart:
Oct 21 18:08:47 bra01 clusvcmgrd[28522]: <warning> Restarting locally
failed service Oracle

Messages resulting from a correct restart:
Oct 20 15:44:00 bra01 clusvcmgrd: [8450]: <err> service error: Check
status failed on user script for Oracle
Oct 20 15:44:00 bra01 clusvcmgrd[8449]: <warning> Restarting locally
failed service Oracle

Comment 1 Fred New 2004-10-28 09:55:37 UTC

Created attachment 105886 [details]
clusvcmgrd configuration file

Comment 2 Fred New 2004-10-28 09:56:58 UTC

Created attachment 105887 [details]
my local service startup script

Comment 3 Lon Hohberger 2004-10-28 14:04:45 UTC

Could you set clusvcmgrd's log level to debug, reproduce, and attach
the logs?

Comment 4 Fred New 2004-10-29 06:32:58 UTC

Created attachment 105929 [details]
Requested log file, filtered

If you search for "warning" (no quotes), you will see the service manager
restarting Oracle at 19:14:21.	This log file has been filtered so I could give
you a small text file.	The filtering included limiting the time span and
removing records from clulockd.

Comment 5 Fred New 2004-10-29 06:35:14 UTC

Created attachment 105930 [details]
Requested log file, complete, gzipped

In case you want to see the entire log file, here it is.

Comment 6 Lon Hohberger 2004-10-29 16:30:44 UTC

Here's the funny-ness:

Oct 28 19:14:21 BRA01 clusvcmgrd[17055]: <debug> Exec of script
/usr/lib/clumanager/services/service, action status, service Oracle
Oct 28 19:14:21 BRA01 clusvcmgrd: [17056]: <debug> service debug:
Checking service Oracle ...
Oct 28 19:14:21 BRA01 clusvcmgrd: [17056]: <debug> service debug:
is_samba_service: service=Oracle, is None
Oct 28 19:14:21 BRA01 clusvcmgrd: [17056]: <debug> service debug:
Running user script '/etc/init.d/oraservice status'
Oct 28 19:14:21 BRA01 clusvcmgrd: [17056]: <debug> service debug:
Checked service Oracle ...
Oct 28 19:14:21 BRA01 clusvcmgrd[17055]: <debug> Exec of script for
service Oracle returned -1073767752

    ^--- There's the anomaly.  Strange...I wonder what happened
         to cause that.

Oct 28 19:14:21 BRA01 clusvcmgrd[17055]: <warning> Restarting locally
failed service Oracle

The exit status corresponds to death-by-signal.  The signal delivered
which killed the application (in this case, bash) was  56 -- which
isn't normal.  It's outside of normal signals (up in the real-time
signal set).  Are you running clumanager in one of the real-time queues?

The other interesting thing is that there should be a core file for
'bash' somewhere on your system...  Could you look around for it?

Comment 7 Lon Hohberger 2004-10-29 16:35:22 UTC

Also, do you know if there are any corresponding kernel messages?

Comment 8 Fred New 2004-11-15 13:47:28 UTC

My reply was delayed because we were trying to set up a test server
for this.  It seems we can only get this to fail on this particular
configuration:  Server - IBM X345, Fibre optic storage unit - EMC
CLARiiON AX100SC.

Real-time queue? - We just followed the Red Hat documentation for
installing Cluster Suite.  Nothing special was done to put anything in
a real-time queue, as far as I know.

We have tried geting a core file from bash by putting "ulimit -c
unlimited" in the service startup script, but still nothing.  Is there
something else we need to do?

Comment 9 Lon Hohberger 2004-11-15 16:02:26 UTC

Given that I can't reproduce it and it seems it can only be reproduced
on a certain configuration, it's probably best solved by Red Hat Support:

http://www.redhat.com/apps/support/

On RT... The RT-queue was just a question based on the fact that the
death signal was 56, which is between SIGRTMIN and SIGRTMAX.

On the core file... Normally, programs only drop core on SIGQUIT,
SIGILL, SIGSEGV, SIGABRT, SIGBUS, etc... All of them are below
SIGRTMIN.  The indication that there might be a core is the fact that
the WCOREDUMP() macro claimed that the bit was set indicating a core
file was made.

Given the odd exit signal (56), I figured there may be some indicative
reason in the core -- if a core file was actually made.

I think the proper place to call ulimit in this case is probably in
clumanager's 'service' script. (/usr/lib/clumanager/services/service,
IIRC) and not in your Oracle script.

What's interesting is that the clumanager service script is very
probably calling 'exit'.  Calling 'exit 0' is the very next thing it
does after successfully logging the following message:

Oct 28 19:14:21 BRA01 clusvcmgrd: [17056]: <debug> service debug:
Checked service Oracle ...

This is at the bottom of the 'service' script, and occurs long after
your specific Oracle script has been run.  I think the odd return code
occurs while we're trying to clean up the process.  That is, I think
it's in the exit path, which is outside of clumanager.

Unfortunately, strange things can and do happen when third party
drivers are loaded.  If possible, please try to reproduce without any
third party drivers loaded.

Comment 10 Justin Nemmers 2004-12-14 21:04:32 UTC

I'm seeing a similar issue with a service...  It's a database process,
not oracle, and it's giving me a similar error:

<from cluster debug logfile>
Dec 14 15:24:57 emooc1 clusvcmgrd[8077]: <debug> Exec of script for
service emooc returned -1073773400

this occurs even when I replace my starup script with a dummy that
contains only the following:
#!/bin/bash
exit 0

I've added a ulimit -c 10000000 to
/usr/lib/clumanager/services/service in hopes that I can get a
corefile to help with the debugging process, but it does not appear
that I've gotten one.

Comment 11 Fred New 2004-12-15 10:47:16 UTC

We think that there is no crash happening in
/usr/lib/clumanager/services/service so there shouldn't be any core
file created.  If there is actually an attempt to create a core file,
maybe the ulimit should go in /etc/rc.d/init.d/clumanager.  (Our
system is in production, so it is hard to schedule a restart of
clumanager to test my theory.)  And just to make sure we have all
bases covered, we have modified the ulimit statement in /etc/profile, too.

Comment 12 Lon Hohberger 2004-12-15 16:12:09 UTC

I agree.  It looks like the process is dying with an invalid exit
signal.  Can you confirm whether or not it's always the same number
appearing in the logs?

Another person has reported this problem as well.  They're using a
different database solution (= not Oracle), so it's probably not
Oracle-centric.  I didn't think it was Oracle-centric, and this
confirms that thought.

Anyway, I think I know what the problem is...

Comment 13 Lon Hohberger 2004-12-15 16:18:24 UTC

Created attachment 108627 [details]
Patch which handles other waitpid() conditions

local_ret wasn't initialized.  If waitpid() was returning -1 and errno wasn't
EINTR (e.g. it was ECHILD or EINVAL, though both being unlikely; the latter
moreso), it would contain the contents of whatever memory location it happened
to point to after waitpid() exited.

Comment 14 Lon Hohberger 2004-12-15 16:27:20 UTC

Note - To be clear: In clusvcmgrd, waitpid() should _never_ return
-1/ECHILD or -1/EINVAL; why it might be doing so is another issue.

Comment 15 Fred New 2004-12-16 08:47:09 UTC

It sounds like you (Lon) are beyond the question you asked, but yes,
the number appearing in the log is different each time.

Comment 16 Lon Hohberger 2005-01-07 22:07:06 UTC

Hi, I made some packages with the patch which should fix this problem:

http://people.redhat.com/lhh/clumanager-1.2.23-0.4lhh.i386.rpm
http://people.redhat.com/lhh/clumanager-1.2.23-0.4lhh.src.rpm

Please see if they work for you.  They're not official errata, but
they should fix this problem.

Comment 19 Lon Hohberger 2005-01-12 16:42:04 UTC

1.2.23 packages have a bug which prevents IP addresses from properly
starting in some cases.

Consequently, I have fixed this problem and made a new test package:

http://people.redhat.com/lhh/clumanager-1.2.24-0.1.i386.rpm
http://people.redhat.com/lhh/clumanager-1.2.24-0.1.src.rpm

Comment 20 Alessandro Gilardoni 2005-02-14 22:13:29 UTC

I have the same problem with my jboss service (i have also an oracle service
that as the same problem but it's not very repetitive).
The jboss service it's restarted nearly one time every day.
The clusvcmgrd report the same error "<debug> Exec of script for service jboss
returned -1073757400" and it's traced like <debug> nd not <error> (this is the
first strange thing). The number in the log change sometimes, but it doesn't
change anytime.The versione of the clumanager i'm using is 1.2.22.
Can u say me something ? Can i use that patch in a production environment ?
Thanks

Comment 23 Jay Turner 2005-05-25 16:41:01 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-047.html

Comment 24 Bouraoui 2006-03-06 14:07:48 UTC

well, I have the same error: <debug> Exec of script
/usr/lib/clumanager/services/service, action stop, service oracle_cluster and I
have clumanager 1.2.26.1-1 ? , my service is the oracle database script.
N.B: I use Bonding ethernet (active passive mode).