Bug 137406
Summary: clusvcmgrd randomly restarts local oracle service
Product: [Retired] Red Hat Cluster Suite
Reporter: Fred New <fred.new2911>
Component: clumanager
Assignee: Lon Hohberger <lhh>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: medium
Version: 3
CC: alessandro.gilardoni, cluster-maint, kanderso, tao
Hardware: i386
OS: Linux
Doc Type: Bug Fix
Last Closed: 2005-02-28 21:11:14 UTC
Description
Fred New
2004-10-28 09:48:37 UTC
Created attachment 105886 [details]
clusvcmgrd configuration file
Created attachment 105887 [details]
my local service startup script
Could you set clusvcmgrd's log level to debug, reproduce, and attach the logs?
Created attachment 105929 [details]
Requested log file, filtered
If you search for "warning" (no quotes), you will see the service manager
restarting Oracle at 19:14:21. This log file has been filtered so I could give
you a small text file. The filtering included limiting the time span and
removing records from clulockd.
Created attachment 105930 [details]
Requested log file, complete, gzipped
In case you want to see the entire log file, here it is.
Here's the funny-ness:

Oct 28 19:14:21 BRA01 clusvcmgrd[17055]: <debug> Exec of script /usr/lib/clumanager/services/service, action status, service Oracle
Oct 28 19:14:21 BRA01 clusvcmgrd: [17056]: <debug> service debug: Checking service Oracle ...
Oct 28 19:14:21 BRA01 clusvcmgrd: [17056]: <debug> service debug: is_samba_service: service=Oracle, is None
Oct 28 19:14:21 BRA01 clusvcmgrd: [17056]: <debug> service debug: Running user script '/etc/init.d/oraservice status'
Oct 28 19:14:21 BRA01 clusvcmgrd: [17056]: <debug> service debug: Checked service Oracle ...
Oct 28 19:14:21 BRA01 clusvcmgrd[17055]: <debug> Exec of script for service Oracle returned -1073767752
^--- There's the anomaly. Strange...I wonder what happened to cause that.
Oct 28 19:14:21 BRA01 clusvcmgrd[17055]: <warning> Restarting locally failed service Oracle

The exit status corresponds to death-by-signal. The signal delivered which killed the application (in this case, bash) was 56 -- which isn't normal. It's outside of normal signals (up in the real-time signal set). Are you running clumanager in one of the real-time queues?

The other interesting thing is that there should be a core file for 'bash' somewhere on your system... Could you look around for it? Also, do you know if there are any corresponding kernel messages?

My reply was delayed because we were trying to set up a test server for this. It seems we can only get this to fail on this particular configuration: Server - IBM X345, Fibre optic storage unit - EMC CLARiiON AX100SC.

Real-time queue? - We just followed the Red Hat documentation for installing Cluster Suite. Nothing special was done to put anything in a real-time queue, as far as I know.

We have tried getting a core file from bash by putting "ulimit -c unlimited" in the service startup script, but still nothing. Is there something else we need to do?
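For anyone wanting to check the "death-by-signal 56, core bit set" reading of the logged value, here is a minimal Python sketch that decodes it by hand. It assumes the usual Linux/glibc wait-status layout (low 7 bits = terminating signal, bit 0x80 = core-dump flag). `decode_wait_status` is a hypothetical helper written for illustration, not part of clumanager, and the fallback numbers 34 and 64 for SIGRTMIN/SIGRTMAX are assumed typical Linux values.

```python
import signal

# Real-time signal range; the fallbacks are assumed Linux/glibc values.
RT_MIN = int(getattr(signal, "SIGRTMIN", 34))
RT_MAX = int(getattr(signal, "SIGRTMAX", 64))


def decode_wait_status(raw):
    """Decode a wait status roughly the way the W* macros in <sys/wait.h> do."""
    status = raw & 0xFFFFFFFF      # view the logged signed int as 32 unsigned bits
    termsig = status & 0x7F        # WTERMSIG: low 7 bits
    # WIFSIGNALED: low 7 bits are neither 0 (normal exit) nor 0x7F (stopped)
    signaled = termsig not in (0, 0x7F)
    return {
        "hex": f"0x{status:08X}",
        "signaled": signaled,
        "termsig": termsig if signaled else None,
        "core_flag": signaled and bool(status & 0x80),   # WCOREDUMP bit
        "realtime": signaled and RT_MIN <= termsig <= RT_MAX,
    }


if __name__ == "__main__":
    print(decode_wait_status(-1073767752))
```

Fed the value from the log (-1073767752), this reports signal 56 with the core flag set, inside the real-time range, which agrees with the analysis above. It only interprets the bits; it says nothing about whether the value was a genuine wait status in the first place.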
Given that I can't reproduce it and it seems it can only be reproduced on a certain configuration, it's probably best solved by Red Hat Support: http://www.redhat.com/apps/support/

On RT... The RT-queue was just a question based on the fact that the death signal was 56, which is between SIGRTMIN and SIGRTMAX.

On the core file... Normally, programs only drop core on SIGQUIT, SIGILL, SIGSEGV, SIGABRT, SIGBUS, etc... All of them are below SIGRTMIN. The indication that there might be a core is the fact that the WCOREDUMP() macro claimed that the bit was set indicating a core file was made. Given the odd exit signal (56), I figured there may be some indicative reason in the core -- if a core file was actually made.

I think the proper place to call ulimit in this case is probably in clumanager's 'service' script (/usr/lib/clumanager/services/service, IIRC) and not in your Oracle script.

What's interesting is that the clumanager service script is very probably calling 'exit'. Calling 'exit 0' is the very next thing it does after successfully logging the following message:

Oct 28 19:14:21 BRA01 clusvcmgrd: [17056]: <debug> service debug: Checked service Oracle ...

This is at the bottom of the 'service' script, and occurs long after your specific Oracle script has been run. I think the odd return code occurs while we're trying to clean up the process. That is, I think it's in the exit path, which is outside of clumanager.

Unfortunately, strange things can and do happen when third party drivers are loaded. If possible, please try to reproduce without any third party drivers loaded.

I'm seeing a similar issue with a service...
It's a database process, not Oracle, and it's giving me a similar error:

<from cluster debug logfile>
Dec 14 15:24:57 emooc1 clusvcmgrd[8077]: <debug> Exec of script for service emooc returned -1073773400

This occurs even when I replace my startup script with a dummy that contains only the following:

#!/bin/bash
exit 0

I've added a ulimit -c 10000000 to /usr/lib/clumanager/services/service in hopes that I can get a corefile to help with the debugging process, but it does not appear that I've gotten one.

We think that there is no crash happening in /usr/lib/clumanager/services/service so there shouldn't be any core file created. If there is actually an attempt to create a core file, maybe the ulimit should go in /etc/rc.d/init.d/clumanager. (Our system is in production, so it is hard to schedule a restart of clumanager to test my theory.) And just to make sure we have all bases covered, we have modified the ulimit statement in /etc/profile, too.

I agree. It looks like the process is dying with an invalid exit signal. Can you confirm whether or not it's always the same number appearing in the logs?

Another person has reported this problem as well. They're using a different database solution (= not Oracle), so it's probably not Oracle-centric.

I didn't think it was Oracle-centric, and this confirms that thought. Anyway, I think I know what the problem is...

Created attachment 108627 [details]
Patch which handles other waitpid() conditions
local_ret wasn't initialized. If waitpid() returned -1 and errno wasn't
EINTR (e.g. it was ECHILD or EINVAL, though both are unlikely, the latter
more so), local_ret would contain whatever happened to be in its memory
location after waitpid() returned.
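The defensive pattern described here (give the result a defined value up front, and only trust the status when waitpid() actually succeeded) can be sketched in Python as below. This is an analogue for illustration only, not the attached patch, which is C code inside clusvcmgrd. `wait_for_script` and `FAILED` are hypothetical names, and the failure value 1 is an assumption rather than clumanager's actual return code.

```python
import os

FAILED = 1  # hypothetical "script failed" code, assumed for this sketch


def wait_for_script(pid):
    """Return the child's exit code, or FAILED for every other outcome."""
    result = FAILED  # defined before the wait, so no path returns garbage
    try:
        # Python 3.5+ retries EINTR internally, so only real failures raise.
        _, status = os.waitpid(pid, 0)
    except OSError:
        # Covers ECHILD (ChildProcessError) and EINVAL: the status was
        # never filled in, so report a clean failure instead of reading it.
        return result
    if os.WIFEXITED(status):
        result = os.WEXITSTATUS(status)
    # Death by signal or any other state leaves result at FAILED.
    return result
```

In Python the uninitialized read cannot happen as such, since a failed os.waitpid() raises instead of returning -1; the point of the sketch is only that every path out of the function returns a value that was set deliberately.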
Note - To be clear: In clusvcmgrd, waitpid() should _never_ return -1/ECHILD or -1/EINVAL; why it might be doing so is another issue.

It sounds like you (Lon) are beyond the question you asked, but yes, the number appearing in the log is different each time.

Hi, I made some packages with the patch which should fix this problem:

http://people.redhat.com/lhh/clumanager-1.2.23-0.4lhh.i386.rpm
http://people.redhat.com/lhh/clumanager-1.2.23-0.4lhh.src.rpm

Please see if they work for you. They're not official errata, but they should fix this problem.

1.2.23 packages have a bug which prevents IP addresses from properly starting in some cases. Consequently, I have fixed this problem and made a new test package:

http://people.redhat.com/lhh/clumanager-1.2.24-0.1.i386.rpm
http://people.redhat.com/lhh/clumanager-1.2.24-0.1.src.rpm

I have the same problem with my jboss service (I also have an Oracle service that has the same problem, but it does not happen as often). The jboss service is restarted nearly once a day. clusvcmgrd reports the same error, "<debug> Exec of script for service jboss returned -1073757400", and it is logged as <debug> and not <error> (this is the first strange thing). The number in the log changes sometimes, but not every time. The version of clumanager I'm using is 1.2.22. Can you tell me anything about this? Can I use that patch in a production environment? Thanks

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-047.html

Well, I have the same error:

<debug> Exec of script /usr/lib/clumanager/services/service, action stop, service oracle_cluster

and I have clumanager 1.2.26.1-1. My service is the Oracle database script.
N.B.: I use bonded Ethernet (active-passive mode).