From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)
Description of problem:
After upgrading the kernel from 2.4.21-9.0.3.ELhugemem to
2.4.21-15.ELhugemem, the behavior of cluster manager has changed.
Currently I have cluster manager managing an Oracle instance. Under the
2.4.21-9.0.3.ELhugemem kernel, the DBAs were able to stop and start
the managed Oracle instances through sqldba. After upgrading the
kernel to version 2.4.21-15.ELhugemem, the DBAs are still able to
shut down the databases, but when they try to restart them, the
machine fails over to the backup server.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Manage an Oracle instance using clumanager
2. Shut down the database using sqldba
3. Restart the instance using sqldba
Actual Results: The clustermanager failed the processes over to the
Expected Results: The database would start up.
The server configuration is as follows:
1) HP DL-740, 8 processors, 35 GB Memory
2) Storage: HP XP512 SAN, connected over fibre through Emulex LP-9002
fibre channel cards
The only thing logged in /var/log/messages on the failover server is:
May 23 19:33:38 prod2-rh clusvcmgrd: <crit> Invalid reply!
May 23 19:33:43 prod2-rh clusvcmgrd: <crit> Couldn't connect to member #0: Connection timed out
May 23 19:34:07 prod2-rh cluquorumd: <crit> STONITH: Data integrity may be compromised!
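For anyone comparing these timestamps against a configured failover time, here is a rough Python sketch that pulls the <crit> entries out of syslog-style lines and reports the seconds between them. It is only an illustration, not part of clumanager: the function names (parse_crit, gaps_seconds) and the regular expression are made up for this example, the wrapped log lines are joined back onto one line each, and the year is an assumption because syslog timestamps carry none.

```python
import re
from datetime import datetime

# Matches lines shaped like the three quoted above:
#   "<Mon> <day> <HH:MM:SS> <host> <daemon>: <level> <message>"
# This pattern is an assumption based only on those sample lines.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<daemon>\w+): <(?P<level>\w+)> (?P<msg>.*)$"
)


def parse_crit(lines, year=2004):
    """Return (timestamp, daemon, message) for each <crit> line.

    `year` is a placeholder (assumed): syslog lines do not record one.
    """
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None or m.group("level") != "crit":
            continue
        ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
        events.append((ts, m.group("daemon"), m.group("msg")))
    return events


def gaps_seconds(events):
    """Seconds elapsed between each pair of consecutive events."""
    return [int((b[0] - a[0]).total_seconds()) for a, b in zip(events, events[1:])]


# The three messages from the failover server, one entry per line.
SAMPLE = [
    "May 23 19:33:38 prod2-rh clusvcmgrd: <crit> Invalid reply!",
    "May 23 19:33:43 prod2-rh clusvcmgrd: <crit> Couldn't connect to member #0: Connection timed out",
    "May 23 19:34:07 prod2-rh cluquorumd: <crit> STONITH: Data integrity may be compromised!",
]

if __name__ == "__main__":
    events = parse_crit(SAMPLE)
    for ts, daemon, msg in events:
        print(ts.time(), daemon, msg)
    print("gaps (s):", gaps_seconds(events))
```

On the three lines above, the gaps work out to 5 seconds between the first two messages and 24 seconds before the STONITH warning.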
The 'Invalid reply!' message is a red herring; it generally means the
locks timed out waiting for a response (typically due to slow I/O).
In the U2 version, this has been replaced with a <debug>-level message
and the request is properly retried; simply upgrading to the latest
erratum may solve your problem.
If it's reproducible on the latest erratum, you should add this to
and restart syslogd; then reproduce. /var/log/messages doesn't
generally contain all of the cluster's log messages (if it did, it'd
grow really fast).
You may want to consider buying some power switches.
Additionally, you may want to increase your membership failure
detection time by several seconds. You'll want to file a ticket with Red
Hat Support as well:
It may be a simple matter of re-tuning your failover time.
Any additional information you could provide would be helpful,
specifically logs during reproduction after following the instructions
in the previous comment.
This has been in NEEDINFO for a month. Closing. Please
reopen if there is additional information.
Fixing product name. Clumanager on RHEL3 was part of RHCS3, not RHEL3.