Red Hat Bugzilla – Bug 185484
Cluster service restarting Locally
Last modified: 2009-04-16 16:19:55 EDT
I have a 2 node cluster with RHAS3 update 3.
Kernel : 2.4.21-20.ELsmp
Clumanager : clumanager-1.2.16-1
Nodes : HP DL380 G3
SAN : MSA 1000
For more than a year everything had been fine. Suddenly it started logging the
following and restarting the service locally. This happened quite frequently:
clusvcmgrd: <err> Unable to obtain cluster lock: Connection timed out
clulockd: <warning> Denied A.B.C.D: Broken pipe
clulockd: <err> select error: Broken pipe
clusvcmgrd: : <notice> service notice: Stopping service postgresql ...
clusvcmgrd: : <notice> service notice: Running user script
clusvcmgrd: : <notice> service notice: Stopped service postgresql
clusvcmgrd: : <notice> service notice: Starting service postgresql ...
clusvcmgrd: : <notice> service notice: Running user script
clusvcmgrd: : <notice> service notice: Started service postgresql ...
*** Bug 185485 has been marked as a duplicate of this bug. ***
These messages are caused when a node fails to take a lock because the lock
master is not responding. The lock master failing to respond can be caused by a
variety of things.
Here are the major possible causes:
(a) Poor shared storage configuration. Multi-initiator parallel SCSI is known
(and documented) to not work reliably. Furthermore, host-RAID controllers
require extensive testing to certify, and I believe none currently are
certified. See section 2.1.1 of the Red Hat Cluster Suite 3 manual.
In this case, what happens is that a client connects to the lock master and
issues a lock request. At this point, the lock master writes out the lock
blocks (a combined 1024 bytes or 2 SCSI blocks) to shared storage. If the lock
master hangs in the write(2) system call (i.e. waiting for the write to
complete), it will not respond to messages from other clients.
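The blocking write described above can be sketched as follows. This is a minimal illustration, not clumanager's actual on-disk layout: the device path, offset, and payload format are all placeholders.

```python
import os

LOCK_BLOCK_SIZE = 512          # one SCSI block
LOCK_BLOCKS = 2                # 1024 bytes total, as described above

def write_lock_blocks(device_path, offset, payload):
    """Synchronously write the two lock blocks to shared storage.

    O_SYNC makes write() block until the data actually reaches the
    device -- which is exactly where the lock master can hang when the
    storage is slow, leaving it unable to answer other clients.
    """
    data = payload.ljust(LOCK_BLOCK_SIZE * LOCK_BLOCKS, b"\0")
    fd = os.open(device_path, os.O_WRONLY | os.O_SYNC)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        written = os.write(fd, data)   # may block waiting for the array
    finally:
        os.close(fd)
    return written
```

Against a real quorum setup the path would be the raw shared-storage partition; while the O_SYNC write is in flight, the calling process sits uninterruptibly in the kernel.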
Unfortunately, there is not much that can be done to eliminate this particular
scenario entirely short of purchasing a high-performance SAN array. The shared
storage is simply not powerful enough to handle the given load.
Possible solution: Make the timeouts in the code configurable so users may
account for the slow performance. With a longer timeout (up to, perhaps, the
75-second TCP connect timeout), the likelihood increases that the client will
not give up before the lock master has a chance to process its connection request.
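A configurable client-side timeout might look like the sketch below. The port number and wire format are invented for illustration; only the idea of passing the timeout through to the connect matches the suggestion above.

```python
import socket

def request_lock(master_addr, connect_timeout=75.0):
    """Ask the lock master for a lock, with a configurable timeout.

    The 75-second default mirrors the TCP connect timeout mentioned
    above; a cluster admin on slow storage could raise it instead of
    having the client abort and restart the service.
    """
    sock = socket.create_connection(master_addr, timeout=connect_timeout)
    try:
        sock.sendall(b"LOCK 0\n")     # placeholder lock-request message
        return sock.recv(64)          # placeholder reply
    finally:
        sock.close()
```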
(b) The lock master daemon is not getting scheduled due to heavy CPU load on the
system in which it resides.
Possible solution: In addition to (a)'s possible solution, it may be possible
to attain some benefit from putting the clulockd process into one of the
real-time scheduling queues, so that it has higher priority than other
applications on the system. The clulockd process does not consume many
resources, so this should not cause additional problems. (Side note: all of the
cluster daemons are locked in memory, and so they should never be paged out to disk.)
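Moving a daemon into a real-time queue can be sketched as below (Linux only; this is a generic illustration of the technique, not anything clumanager actually does). It needs root, since ordinary users may not raise a process to a real-time policy.

```python
import os

def boost_daemon(pid=0):
    """Move a process (default: ourselves) into the SCHED_RR queue.

    The lowest real-time priority is enough: the point is only to
    beat ordinary SCHED_OTHER tasks to the CPU under heavy load, not
    to outrank other real-time work.
    """
    prio = os.sched_get_priority_min(os.SCHED_RR)
    try:
        os.sched_setscheduler(pid, os.SCHED_RR, os.sched_param(prio))
        return True
    except PermissionError:
        return False   # not root: process stays in the default scheduler
```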
(c) Network saturation preventing lock traffic TCP messages from getting back
and forth in a timely manner, or the inability of the lock master to accept
the message due to low system resources.
In this case, the lock master either never gets the TCP connect messages from
the client, or fails to accept them correctly. This then causes the client to
eventually give up, aborting the lock attempt.
This may be at least partially solved in 1.2.31, in which communication with the
local instance of clulockd is performed using UNIX domain sockets instead of TCP.
On clusters with several services, the most obvious effect of the TCP vs. UNIX
socket communication is the distinct lack of several hundred (or more) sockets
in the TIME_WAIT state.
Potential solution: It may be further alleviated in the manner of (a). Also,
using a private network for cluster communications should eliminate this problem.
(d) Network saturation preventing clulockd from writing information to an iSCSI device.
This has the same potential solutions as (c), mostly.
(e) Kernel bug causing interruption on loopback network devices (see bugzilla
#168665, #155892). #168665 looks a whole lot like this bugzilla.
Basically, communication to the loopback interface stops working - causing errors.
Solution: This bug should be fully worked-around by 1.2.31 - since the loopback
connections are no longer made for locks or lock master queries.
After a connection is established at the TCP level, clumanager daemons
authenticate connections using a challenge/response MD5 mechanism to prevent
other clusters from erroneously connecting to the daemons. This
challenge/response mechanism also has a 5-second timeout, but the entire
sub-system may be disabled by running "clukey delete". In some cases, this
could help alleviate some of the messages.
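The general shape of such a challenge/response handshake is sketched below. The concatenation order and nonce size are assumptions for illustration; clumanager's actual wire format is not documented here.

```python
import hashlib
import os

def make_challenge():
    """Server side: generate a random nonce to send to the peer."""
    return os.urandom(16)

def respond(challenge, cluster_key):
    """Client side: prove knowledge of the shared cluster key.

    MD5(nonce || key) is the classic challenge/response shape: the key
    itself never crosses the wire, and each nonce makes replay useless.
    """
    return hashlib.md5(challenge + cluster_key).digest()

def verify(challenge, response, cluster_key):
    """Server side: recompute the expected digest and compare."""
    return response == hashlib.md5(challenge + cluster_key).digest()
```

Because the handshake adds a round trip with its own (here, 5-second) timeout, disabling it removes one more place where a slow lock master can make a client give up.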
Thanks for the reply.
I also think the problem has something to do with shared quorum communication.
But my shared storage is an MSA 1000 SAN, which is certified by Red Hat and
known to work well with Red Hat Cluster Suite 3.
About CPU load: the cluster doesn't have heavy CPU load. The nodes are
2-processor DL380 G3 machines and the load is 3 to 4 most of the time. But if
the load increases above 5 (this is only a spike in load), I can expect a
service restart.
Network saturation: In my case I don't think this is happening, as the network
is a high-speed 100/1000 Mbps network using Cisco switches. There is no
traffic congestion or packet collision.
Have you tried the latest update of clumanager, the 1.2.31-1 version? Can you
confirm that it solves your problem?
I cannot do trial-and-error, as it is a live cluster running critical services.
To avoid the cluster service restarts, I removed the second node from the
cluster, which solves the service restart problem. But this has the price of no
high availability in case the primary node fails.
When the cluster is running with a single node, there is no local service
restart. Therefore I think the problem is a concurrent-communication problem
with the quorum partition. I mean that somehow the metadata information in the
quorum is getting corrupted.
Under certain load situations, the loopback device on kernels pre 2.4.21-37.EL
had a problem where lo communication (e.g. connecting to 127.0.0.1) is disrupted
- this causes lock timeouts when talking to the *local* host. Clumanager 1.2.31
uses UNIX domain socket communications for local lock traffic, so this should
not be an issue.
Your hardware configuration looks fine.
I am fairly sure that in your case, upgrading the kernel + clumanager will solve
your problem. You can do a rolling upgrade during off-hours if you are worried
about downtime.
The cluster has an HBA card (FCA2214) with the driver from QLogic. The latest
QLogic driver version for Red Hat Linux 3.0 32-bit is 7.07.03, which supports
kernel 2.4.21-32.ELsmp (i.e. Update 5) and not 2.4.21-37.EL. Because of this,
I think an update to 2.4.21-37.ELsmp will not be possible.
I would like to know your opinion.
Is there a reason why you aren't able to use the default qlogic driver that is
included in kernel-2.4.21-40.smp ?
The default driver, 7.00.03 (if my knowledge is correct), is very old and it
used to generate a lot of errors in /var/log/messages. That's why I upgraded
the driver in my cluster to 7.05.00 around one year back.
I would like to know whether qla 7.07.03 will support Red Hat AS3 U7. Is
anyone currently using it?
kernel-2.4.21-40.EL in RHEL3 U7 contains the 7.05.00 version of the qlogic driver.
I updated the cluster to kernel-2.4.21-40.ELsmp and clumanager-1.2.31-1.
It seems that the problem is FIXED, with no more service restarts to date (it
is now 3 days after the update).
Please consider this initial confirmation of the fix. I will monitor the
cluster for some more days and update this bug.
I really appreciate the efforts from Mr. Lon and Mr. David in solving this problem.
Now it is more than one month after the update and the cluster is quite stable.
Thanks everyone for your support.
I think you can close this bug report.