From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)

Description of problem:
After upgrading the kernel from 2.4.21-9.0.3.ELhugemem to 2.4.21-15.ELhugemem the behavior of cluster manager has changed. Currently I have cluster manager managing an Oracle instance. Under the 2.4.21-9.0.3.ELhugemem kernel, the dba's used to be able to stop and start the managed oracle instances through sqldba. After upgrading the kernel to version 2.4.21-15.ELhugemem, the dba's are still able to shutdown the databases, but when they try and restart them it causes the machine to failover to the backup server.

Version-Release number of selected component (if applicable):
clumanager-1.2.9-1

How reproducible:
Always

Steps to Reproduce:
1. Manage an Oracle instance using clumanager
2. Shutdown the database using sqldba
3. Restart the instance using sqldba

Actual Results:
The clustermanager failed the processes over to the backup server

Expected Results:
The database would startup.

Additional info:
The server configuration is as follows:
1) HP DL-740, 8 processors, 35 GB Memory
2) Storage HP XP512 SAN connect with fibre through Emulex LP-9002 fibre channel cards

The only thing logged in /var/log/messages on the failover server is:

May 23 19:33:38 prod2-rh clusvcmgrd[1057]: <crit> Invalid reply!
May 23 19:33:43 prod2-rh clusvcmgrd[1057]: <crit> Couldn't connect to member #0: Connection timed out
May 23 19:34:07 prod2-rh cluquorumd[1012]: <crit> STONITH: Data integrity may be compromised!
May
The 'Invalid Reply' is a red herring; generally it means the locks timed out waiting for a response (typically due to slow I/O times). In the U2 version, this has been replaced with a <debug> level message and it properly retries; simply upgrading to the latest erratum may solve your problems. If it's reproducible on the latest erratum, you should add this to your /etc/syslog.conf:

local4.* /var/log/clumanager

and restart syslogd; then reproduce. /var/log/messages doesn't generally contain all of the cluster's log messages (if it did, it'd grow really fast). You may want to consider buying some power switches.
Additionally, you may want to increase your membership failure detection time by several seconds. You'll want to file a ticket with Red Hat Support as well: http://www.redhat.com/apps/support/ It may be a simple matter of re-tuning your failover time. Any additional information you could provide would be helpful, specifically logs during reproduction after following the instructions in the previous comment.
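When collecting logs during reproduction, it can help to pull out only the cluster daemon messages at a given level. The following is a minimal sketch, not part of clumanager: it assumes the log lines have the same shape as the /var/log/messages excerpt quoted in the original report, and the names parse_line and filter_level are hypothetical helpers introduced here for illustration.

```python
import re

# Assumed line shape, taken from the excerpt in the original report:
#   May 23 19:33:38 prod2-rh clusvcmgrd[1057]: <crit> Invalid reply!
LINE_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<daemon>\w+)\[(?P<pid>\d+)\]: "
    r"<(?P<level>\w+)> "
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields for a matching log line, or None otherwise."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def filter_level(lines, level):
    """Return the parsed entries whose <level> tag equals `level`."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["level"] == level]


# The three complete lines quoted in the report, used as sample input.
SAMPLE = [
    "May 23 19:33:38 prod2-rh clusvcmgrd[1057]: <crit> Invalid reply!",
    "May 23 19:33:43 prod2-rh clusvcmgrd[1057]: <crit> Couldn't connect "
    "to member #0: Connection timed out",
    "May 23 19:34:07 prod2-rh cluquorumd[1012]: <crit> STONITH: Data "
    "integrity may be compromised!",
]

if __name__ == "__main__":
    for entry in filter_level(SAMPLE, "crit"):
        print(entry["timestamp"], entry["daemon"], entry["message"])
```

To run it against a real file, pass the lines of /var/log/clumanager (or whichever file syslog is configured to write) instead of SAMPLE. Lines that do not match the assumed shape are skipped rather than raising an error.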
This has been in NEEDINFO for a month. Closing. Please reopen if there is additional information.
Fixing product name. Clumanager on RHEL3 was part of RHCS3, not RHEL3.