Bug 124237 - Clumanager behaving differently under 2.4.21-15.ELhugemem kernel
Summary: Clumanager behaving differently under 2.4.21-15.ELhugemem kernel
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: clumanager
Version: 3
Hardware: i686
OS: Linux
medium
high
Target Milestone: ---
Assignee: Lon Hohberger
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2004-05-24 22:16 UTC by Steve Pierce
Modified: 2009-04-16 20:14 UTC

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2004-07-15 20:32:32 UTC
Embargoed:



Description Steve Pierce 2004-05-24 22:16:33 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)

Description of problem:
After upgrading the kernel from 2.4.21-9.0.3.ELhugemem to
2.4.21-15.ELhugemem, the behavior of cluster manager has changed. I
currently have cluster manager managing an Oracle instance. Under the
2.4.21-9.0.3.ELhugemem kernel, the DBAs were able to stop and start
the managed Oracle instances through sqldba. After upgrading the
kernel to version 2.4.21-15.ELhugemem, the DBAs can still shut down
the databases, but when they try to restart them, the machine fails
over to the backup server.

Version-Release number of selected component (if applicable):
clumanager-1.2.9-1

How reproducible:
Always

Steps to Reproduce:
1. Manage an Oracle instance using clumanager
2. Shut down the database using sqldba
3. Restart the instance using sqldba
    

Actual Results:  The cluster manager failed the processes over to the
backup server.

Expected Results:  The database would start up.

Additional info:

The server configuration is as follows:
1) HP DL-740, 8 processors, 35 GB memory
2) Storage: HP XP512 SAN, connected over Fibre Channel through Emulex
LP-9002 cards

The only thing logged in /var/log/messages on the failover server is:

May 23 19:33:38 prod2-rh clusvcmgrd[1057]: <crit> Invalid reply!
May 23 19:33:43 prod2-rh clusvcmgrd[1057]: <crit> Couldn't connect to member #0: Connection timed out
May 23 19:34:07 prod2-rh cluquorumd[1012]: <crit> STONITH: Data integrity may be compromised!
May

Comment 1 Lon Hohberger 2004-05-25 13:46:07 UTC
The 'Invalid Reply' is a red herring; generally it means the locks
timed out waiting for a response (typically due to slow I/O times). 
In the U2 version, this has been replaced with a <debug> level message
and it properly retries; simply upgrading to the latest erratum may
solve your problems.

If it's reproducible on the latest erratum, you should add this to
your /etc/syslog.conf:

local4.* /var/log/clumanager

and restart syslogd; then reproduce.  /var/log/messages doesn't
generally contain all of the cluster's log messages (if it did, it'd
grow really fast).
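
The logging change described above could be applied roughly as
follows. This is only a sketch: the helper name add_cluster_log_rule
and the SYSLOG_CONF variable are made up for illustration, and the
"service syslog restart" step assumes the stock RHEL 3 init script
name. Only the local4 rule and the /var/log/clumanager path come from
this comment.

```shell
#!/bin/sh
# Sketch only: SYSLOG_CONF and add_cluster_log_rule are illustrative
# names, not part of clumanager.
SYSLOG_CONF="${SYSLOG_CONF:-/etc/syslog.conf}"

add_cluster_log_rule() {
    conf="$1"
    # Append the rule only if no local4.* rule is present yet, so that
    # running this twice does not duplicate the line.
    if ! grep -q '^local4\.\*[[:space:]]' "$conf"; then
        printf 'local4.*\t/var/log/clumanager\n' >> "$conf"
    fi
}

# Intended use, as root, on each cluster member:
#   add_cluster_log_rule "$SYSLOG_CONF"
#   service syslog restart    # assumption: RHEL 3 init script name
#   logger -p local4.info "clumanager logging test"
#   tail /var/log/clumanager
```

Sending a test message with logger before reproducing the failover is
a quick way to confirm that the new rule is active.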

You may want to consider buying some power switches.


Comment 2 Lon Hohberger 2004-06-23 18:54:23 UTC
Additionally, you may want to increase your membership failure
detection time by several seconds.  You'll want to file a ticket with
Red Hat Support as well:

http://www.redhat.com/apps/support/

It may be a simple matter of re-tuning your failover time.

Any additional information you could provide would be helpful,
specifically logs during reproduction after following the instructions
in the previous comment.

Comment 3 Suzanne Hillman 2004-07-15 20:32:32 UTC
This bug has been in NEEDINFO for a month. Closing. Please
reopen if there is additional information.

Comment 4 Lon Hohberger 2007-12-21 15:10:01 UTC
Fixing product name.  Clumanager on RHEL3 was part of RHCS3, not RHEL3.

