Red Hat Bugzilla – Bug 185484
Cluster service restarting Locally
Last modified: 2009-04-16 16:19:55 EDT
I have a 2 node cluster with RHAS3 update 3.
Kernel : 2.4.21-20.ELsmp
Clumanager : clumanager-1.2.16-1
Nodes : HP DL380 G3
SAN : MSA 1000
For more than a year everything had been fine. Suddenly it started logging the
following and restarting the service locally. This happened quite frequently:
clusvcmgrd: <err> Unable to obtain cluster lock: Connection timed out
clulockd: <warning> Denied A.B.C.D: Broken pipe
clulockd: <err> select error: Broken pipe
clusvcmgrd: : <notice> service notice: Stopping service postgresql ...
clusvcmgrd: : <notice> service notice: Running user script
clusvcmgrd: : <notice> service notice: Stopped service postgresql
clusvcmgrd: : <notice> service notice: Starting service postgresql ...
clusvcmgrd: : <notice> service notice: Running user script
clusvcmgrd: : <notice> service notice: Started service postgresql ...
*** Bug 185485 has been marked as a duplicate of this bug. ***
These messages are caused when a node fails to take a lock because the lock
master is not responding. The lock master failing to respond can be caused by a
variety of things.
Here are the major possible causes:
(a) Poor shared storage configuration. Multi-initiator parallel SCSI is known
(and documented) to not work reliably. Furthermore, host-RAID controllers
require extensive testing to certify, and I believe none currently are
certified. See section 2.1.1 of the Red Hat Cluster Suite 3 manual.
In this case, what happens is that a client connects to the lock master and
issues a lock request. At this point, the lock master writes out the lock
blocks (a combined 1024 bytes or 2 SCSI blocks) to shared storage. If the lock
master hangs in the write(2) system call (i.e. waiting for the write to
complete), it will not respond to messages from other clients.
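The blocking write described above can be sketched as follows. This is a minimal illustration, not clumanager's actual on-disk layout: the device path, offset, and payload format are all placeholders.

```python
import os

LOCK_BLOCK_SIZE = 512          # one SCSI block
LOCK_BLOCKS = 2                # 1024 bytes total, as described above

def write_lock_blocks(device_path, offset, payload):
    """Synchronously write the two lock blocks to shared storage.

    O_SYNC makes write() block until the data actually reaches the
    device -- which is exactly where the lock master can hang when the
    storage is slow, leaving it unable to answer other clients.
    """
    data = payload.ljust(LOCK_BLOCK_SIZE * LOCK_BLOCKS, b"\0")
    fd = os.open(device_path, os.O_WRONLY | os.O_SYNC)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        written = os.write(fd, data)   # may block waiting for the array
    finally:
        os.close(fd)
    return written
```

Against a real quorum setup the path would be the raw shared-storage partition; while the O_SYNC write is in flight, the calling process sits uninterruptibly in the kernel.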
Unfortunately, there is not much that can be done to eliminate this particular
scenario entirely short of purchasing a high-performance SAN array. The shared
storage is simply not powerful enough to handle the given load.
Possible solution: Make the timeouts in the code configurable so users may
account for the slow performance. With a longer timeout (up to, perhaps, the
75-second TCP connect timeout), the likelihood increases that the client will
not give up before the lock master has a chance to process its connection request.
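A configurable client-side timeout might look like the sketch below. The port number and wire format are invented for illustration; only the idea of passing the timeout through to the connect matches the suggestion above.

```python
import socket

def request_lock(master_addr, connect_timeout=75.0):
    """Ask the lock master for a lock, with a configurable timeout.

    The 75-second default mirrors the TCP connect timeout mentioned
    above; a cluster admin on slow storage could raise it instead of
    having the client abort and restart the service.
    """
    sock = socket.create_connection(master_addr, timeout=connect_timeout)
    try:
        sock.sendall(b"LOCK 0\n")     # placeholder lock-request message
        return sock.recv(64)          # placeholder reply
    finally:
        sock.close()
```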
(b) The lock master daemon is not getting scheduled due to heavy CPU load on the
system in which it resides.
Possible solution: In addition to (a)'s possible solution, it may be possible
to attain some benefit from putting the clulockd process into one of the
real-time scheduling queues, so that it has higher priority than other
applications on the system. The clulockd process does not consume many
resources, so this should not cause additional problems. (Side note: all of the
cluster daemons are locked in memory, and so they should never be paged out to disk.)
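Moving a daemon into a real-time queue can be sketched as below (Linux only; this is a generic illustration of the technique, not anything clumanager actually does). It needs root, since ordinary users may not raise a process to a real-time policy.

```python
import os

def boost_daemon(pid=0):
    """Move a process (default: ourselves) into the SCHED_RR queue.

    The lowest real-time priority is enough: the point is only to
    beat ordinary SCHED_OTHER tasks to the CPU under heavy load, not
    to outrank other real-time work.
    """
    prio = os.sched_get_priority_min(os.SCHED_RR)
    try:
        os.sched_setscheduler(pid, os.SCHED_RR, os.sched_param(prio))
        return True
    except PermissionError:
        return False   # not root: process stays in the default scheduler
```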
(c) Network saturation preventing lock traffic TCP messages from getting back
and forth in a timely manner, or the inability of the lock master to accept
the message due to low system resources.
In this case, the lock master either never gets the TCP connect messages from
the client, or fails to accept them correctly. This then causes the client to
eventually give up, aborting the lock attempt.
This may be at least partially solved in 1.2.31, in which communication with the
local instance of clulockd is performed using UNIX domain sockets instead of TCP.
On clusters with several services, the most obvious effect of the TCP vs. UNIX
socket communication is the distinct lack of several hundred (or more) sockets
in the TIME_WAIT state.
Potential solution: It may be further alleviated in the manner of (a). Also,
using a private network for cluster communications should eliminate this problem.
(d) Network saturation preventing clulockd from writing information to an iSCSI device.
This has the same potential solutions as (c), mostly.
(e) Kernel bug causing interruption on loopback network devices (see bugzilla
#168665, #155892). #168665 looks a whole lot like this bugzilla.
Basically, communication to the loopback interface stops working - causing errors.
Solution: This bug should be fully worked-around by 1.2.31 - since the loopback
connections are no longer made for locks or lock master queries.
After a connection is established at the TCP level, clumanager daemons
authenticate connections using a challenge/response MD5 mechanism to prevent
other clusters from erroneously connecting to the daemons. This
challenge/response mechanism also has a 5-second timeout, but the entire
sub-system may be disabled by running "clukey delete". In some cases, this
could help alleviate some of the messages.
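The general shape of such a challenge/response handshake is sketched below. The concatenation order and nonce size are assumptions for illustration; clumanager's actual wire format is not documented here.

```python
import hashlib
import os

def make_challenge():
    """Server side: generate a random nonce to send to the peer."""
    return os.urandom(16)

def respond(challenge, cluster_key):
    """Client side: prove knowledge of the shared cluster key.

    MD5(nonce || key) is the classic challenge/response shape: the key
    itself never crosses the wire, and each nonce makes replay useless.
    """
    return hashlib.md5(challenge + cluster_key).digest()

def verify(challenge, response, cluster_key):
    """Server side: recompute the expected digest and compare."""
    return response == hashlib.md5(challenge + cluster_key).digest()
```

Because the handshake adds a round trip with its own (here, 5-second) timeout, disabling it removes one more place where a slow lock master can make a client give up.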
Thanks for the reply.
I also think the problem has something to do with shared quorum communication.
But my shared storage is an MSA 1000 SAN, which is certified by Red Hat and
known to work well with Red Hat Cluster Suite 3.
About CPU load: the cluster doesn't have heavy CPU load. The nodes are
2-processor DL380 G3 machines and the load is 3 to 4 most of the time. But if
the load increases above 5 (this is only a spike in load), I can expect a
service restart.
Network saturation: In my case I don't think this is happening, as the network
is a high-speed 100/1000 Mbps network using Cisco switches. There is no
traffic congestion or packet collision.
Have you tried the latest update of clumanager, the 1.2.31-1 version? Can you
confirm that it solves your problem?
I cannot do trial-and-error, as it is a live cluster running critical services.
To avoid the cluster service restarts, I removed the second node from the
cluster, which solves the service restart problem. But this has the price of no
high availability in case the primary node fails.
When the cluster is running with a single node, there is no local service
restart. Therefore I think the problem is a concurrent-communication problem
with the quorum partition. I mean that somehow the metadata information in the
quorum is getting corrupted.
Under certain load situations, the loopback device on kernels pre 2.4.21-37.EL
had a problem where lo communication (e.g. connecting to 127.0.0.1) is disrupted
- this causes lock timeouts when talking to the *local* host. Clumanager 1.2.31
uses UNIX domain socket communications for local lock traffic, so this should
not be an issue.
Your hardware configuration looks fine.
I am fairly sure that in your case, upgrading the kernel + clumanager will solve
your problem. You can do a rolling upgrade during off-hours if you are worried
about downtime.
The cluster has an HBA card (FCA2214) with the driver from QLogic. The latest
QLogic driver version for Red Hat Linux 3.0 32-bit is 7.07.03, which supports
kernel 2.4.21-32.ELsmp (i.e. Update 5) and not 2.4.21-37.EL. Because of this,
I think an update to 2.4.21-37.ELsmp will not be possible.
I would like to know your opinion.
Is there a reason why you aren't able to use the default qlogic driver that is
included in kernel-2.4.21-40.smp ?
The default driver, 7.00.03 (if my knowledge is correct), is very old and it
used to generate a lot of errors in /var/log/messages. That's why I upgraded
the driver in my cluster to 7.05.00 around one year back.
I would like to know whether qla 7.07.03 will support Red Hat AS3 U7. Is
anyone currently using it?
kernel-2.4.21-40.EL in RHEL3 U7 contains the 7.05.00 version of the qlogic driver.
I updated the cluster to kernel-2.4.21-40.ELsmp and clumanager-1.2.31-1.
It seems that the problem is FIXED, with no more service restarts to date (it
is now 3 days after the update).
Please consider this initial confirmation of the fix. I will monitor the
cluster for some more days and update this bug.
I really appreciate the efforts from Mr. Lon and Mr. David in solving this problem.
Now it is more than one month after the update and the cluster is quite stable.
Thanks everyone for your support.
I think you can close this bug report.