From Bugzilla Helper: User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20040913 Firefox/0.10.1 Description of problem: If I change service checkinterval to a non-zero value, clusvcmgrd will occasionally decide that my service isn't running and restart it. This happens even though I have set up my script "oraservice status" to use "exit 0". Version-Release number of selected component (if applicable): clumanager-1.2.16-1 How reproducible: Sometimes Steps to Reproduce: 1. Set up cluster service manager to automatically restart a local service if it has failed. 2. Start the local service. Actual Results: The cluster service manager will occasionally stop and restart the local service, even though the service is running and "<scriptname> status" returns a 0. Expected Results: The cluster service manager should leave the local service running if "<scriptname> status" returns 0. Additional info: Messages appearing in /var/log/messages are missing a line when clusvcmgrd restarts the service at the wrong time. Messages resulting from an incorrect restart: Oct 21 18:08:47 bra01 clusvcmgrd[28522]: <warning> Restarting locally failed service Oracle Messages resulting from a correct restart: Oct 20 15:44:00 bra01 clusvcmgrd: [8450]: <err> service error: Check status failed on user script for Oracle Oct 20 15:44:00 bra01 clusvcmgrd[8449]: <warning> Restarting locally failed service Oracle
Created attachment 105886 [details] clusvcmgrd configuration file
Created attachment 105887 [details] my local service startup script
Could you set clusvcmgrd's log level to debug, reproduce, and attach the logs?
Created attachment 105929 [details] Requested log file, filtered If you search for "warning" (no quotes), you will see the service manager restarting Oracle at 19:14:21. This log file has been filtered so I could give you a small text file. The filtering included limiting the time span and removing records from clulockd.
Created attachment 105930 [details] Requested log file, complete, gzipped In case you want to see the entire log file, here it is.
Here's the funny-ness: Oct 28 19:14:21 BRA01 clusvcmgrd[17055]: <debug> Exec of script /usr/lib/clumanager/services/service, action status, service Oracle Oct 28 19:14:21 BRA01 clusvcmgrd: [17056]: <debug> service debug: Checking service Oracle ... Oct 28 19:14:21 BRA01 clusvcmgrd: [17056]: <debug> service debug: is_samba_service: service=Oracle, is None Oct 28 19:14:21 BRA01 clusvcmgrd: [17056]: <debug> service debug: Running user script '/etc/init.d/oraservice status' Oct 28 19:14:21 BRA01 clusvcmgrd: [17056]: <debug> service debug: Checked service Oracle ... Oct 28 19:14:21 BRA01 clusvcmgrd[17055]: <debug> Exec of script for service Oracle returned -1073767752 ^--- There's the anomaly. Strange...I wonder what happened to cause that. Oct 28 19:14:21 BRA01 clusvcmgrd[17055]: <warning> Restarting locally failed service Oracle The exit status corresponds to death-by-signal. The signal delivered which killed the application (in this case, bash) was 56 -- which isn't normal. It's outside of normal signals (up in the real-time signal set). Are you running clumanager in one of the real-time queues? The other interesting thing is that there should be a core file for 'bash' somewhere on your system... Could you look around for it?
Also, do you know if there are any corresponding kernel messages?
My reply was delayed because we were trying to set up a test server for this. It seems we can only get this to fail on this particular configuration: Server - IBM X345, Fibre optic storage unit - EMC CLARiiON AX100SC. Real-time queue? - We just followed the Red Hat documentation for installing Cluster Suite. Nothing special was done to put anything in a real-time queue, as far as I know. We have tried geting a core file from bash by putting "ulimit -c unlimited" in the service startup script, but still nothing. Is there something else we need to do?
Given that I can't reproduce it and it seems it can only be reproduced on a certain configuration, it's probably best solved by Red Hat Support: http://www.redhat.com/apps/support/ On RT... The RT-queue was just a question based on the fact that the death signal was 56, which is between SIGRTMIN and SIGRTMAX. On the core file... Normally, programs only drop core on SIGQUIT, SIGILL, SIGSEGV, SIGABRT, SIGBUS, etc... All of them are below SIGRTMIN. The indication that there might be a core is the fact that the WCOREDUMP() macro claimed that the bit was set indicating a core file was made. Given the odd exit signal (56), I figured there may be some indicative reason in the core -- if a core file was actually made. I think the proper place to call ulimit in this case is probably in clumanager's 'service' script. (/usr/lib/clumanager/services/service, IIRC) and not in your Oracle script. What's interesting is that the clumanager service script is very probably calling 'exit'. Calling 'exit 0' is the very next thing it does after successfully logging the following message: Oct 28 19:14:21 BRA01 clusvcmgrd: [17056]: <debug> service debug: Checked service Oracle ... This is at the bottom of the 'service' script, and occurs long after your specific Oracle script has been run. I think the odd return code occurs while we're trying to clean up the process. That is, I think it's in the exit path, which is outside of clumanager. Unfortunately, strange things can and do happen when third party drivers are loaded. If possible, please try to reproduce without any third party drivers loaded.
I'm seeing a similar issue with a service... It's a database process, not oracle, and it's giving me a similar error: <from cluster debug logfile> Dec 14 15:24:57 emooc1 clusvcmgrd[8077]: <debug> Exec of script for service emooc returned -1073773400 this occurs even when I replace my starup script with a dummy that contains only the following: #!/bin/bash exit 0 I've added a ulimit -c 10000000 to /usr/lib/clumanager/services/service in hopes that I can get a corefile to help with the debugging process, but it does not appear that I've gotten one.
We think that there is no crash happening in /usr/lib/clumanager/services/service so there shouldn't be any core file created. If there is actually an attempt to create a core file, maybe the ulimit should go in /etc/rc.d/init.d/clumanager. (Our system is in production, so it is hard to schedule a restart of clumanager to test my theory.) And just to make sure we have all bases covered, we have modified the ulimit statement in /etc/profile, too.
I agree. It looks like the process is dying with an invalid exit signal. Can you confirm whether or not it's always the same number appearing in the logs? Another person has reported this problem as well. They're using a different database solution (= not Oracle), so it's probably not Oracle-centric. I didn't think it was Oracle-centric, and this confirms that thought. Anyway, I think I know what the problem is...
Created attachment 108627 [details] Patch which handles other waitpid() conditions local_ret wasn't initialized. If waitpid() was returning -1 and errno wasn't EINTR (e.g. it was ECHILD or EINVAL, though both being unlikely; the latter moreso), it would contain the contents of whatever memory location it happened to point to after waitpid() exited.
Note - To be clear: In clusvcmgrd, waitpid() should _never_ return -1/ECHILD or -1/EINVAL; why it might be doing so is another issue.
It sounds like you (Lon) are beyond the question you asked, but yes, the number appearing in the log is different each time.
Hi, I made some packages with the patch which should fix this problem: http://people.redhat.com/lhh/clumanager-1.2.23-0.4lhh.i386.rpm http://people.redhat.com/lhh/clumanager-1.2.23-0.4lhh.src.rpm Please see if they work for you. They're not official errata, but they should fix this problem.
1.2.23 packages have a bug which prevents IP addresses from properly starting in some cases. Consequently, I have fixed this problem and made a new test package: http://people.redhat.com/lhh/clumanager-1.2.24-0.1.i386.rpm http://people.redhat.com/lhh/clumanager-1.2.24-0.1.src.rpm
I have the same problem with my jboss service (i have also an oracle service that as the same problem but it's not very repetitive). The jboss service it's restarted nearly one time every day. The clusvcmgrd report the same error "<debug> Exec of script for service jboss returned -1073757400" and it's traced like <debug> nd not <error> (this is the first strange thing). The number in the log change sometimes, but it doesn't change anytime.The versione of the clumanager i'm using is 1.2.22. Can u say me something ? Can i use that patch in a production environment ? Thanks
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2005-047.html
well, I have the same error: <debug> Exec of script /usr/lib/clumanager/services/service, action stop, service oracle_cluster and I have clumanager 1.2.26.1-1 ? , my service is the oracle database script. N.B: I use Bonding ethernet (active passive mode).