Bug 185484
Summary: | Cluster service restarting Locally | |
---|---|---|---
Product: | [Retired] Red Hat Cluster Suite | Reporter: | saju john <saju8in>
Component: | clumanager | Assignee: | Lon Hohberger <lhh>
Status: | CLOSED CURRENTRELEASE | QA Contact: | Cluster QE <mspqa-list>
Severity: | high | Docs Contact: |
Priority: | medium | |
Version: | 3 | CC: | cluster-maint, djuran, paul.langedijk, saju8
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | i386 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | 1.2.31-1 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2006-07-20 17:12:13 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
saju john
2006-03-15 05:02:18 UTC
*** Bug 185485 has been marked as a duplicate of this bug. ***

These messages are caused when a node fails to take a lock because the lock master is not responding. The lock master failing to respond can be caused by a variety of things. Here are the major possible causes:

(a) Poor shared storage configuration. Multi-initiator parallel SCSI is known (and documented) not to work reliably. Furthermore, host-RAID controllers require extensive testing to certify, and I believe none are currently certified. See section 2.1.1 of the Red Hat Cluster Suite 3 manual: http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en-3/ch-hardware.html#S1-HARDWARE-CHOOSING

In this case, what happens is that a client connects to the lock master and issues a lock request. At this point, the lock master writes out the lock blocks (a combined 1024 bytes, or 2 SCSI blocks) to shared storage. If the lock master hangs in the write(2) system call (i.e. waiting for the write to complete), it will not respond to messages from other clients. Unfortunately, there is not much that can be done to eliminate this particular scenario entirely, short of purchasing a high-performance SAN array; the shared storage is simply not powerful enough to handle the given load.

Possible solution: Make the timeouts in the code configurable so users may account for the slow performance. With a longer timeout (up to, perhaps, the 75-second TCP connect timeout), the client is less likely to give up before the lock master has a chance to process its connection request.

(b) The lock master daemon is not getting scheduled due to heavy CPU load on the system on which it resides.

Possible solution: In addition to (a)'s possible solution, some benefit may be gained by putting the clulockd process into one of the real-time scheduling queues, so that it has higher priority than other applications on the system. The clulockd process does not consume many resources, so this should not cause additional problems. (Side note: all of the cluster daemons are locked in memory, so they should never be paged out to disk.)
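As an illustration of the real-time-queue idea in (b), the sketch below shows how a daemon could be moved into the SCHED_FIFO scheduling class and locked into memory with mlockall(). This is not clumanager's actual code; the priority value is a placeholder, and both calls require root privileges (which the cluster daemons already run with).

```c
/* Sketch: give a lock-master daemon real-time priority and pin it in RAM.
 * Illustrative only; clumanager's real implementation may differ. */
#include <sched.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

static void make_realtime_and_locked(void)
{
    struct sched_param sp;

    /* A modest SCHED_FIFO priority outranks normal (SCHED_OTHER) processes
     * without starving other real-time tasks. The value is a placeholder. */
    sp.sched_priority = 10;
    if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0) {
        perror("sched_setscheduler");
        exit(1);
    }

    /* Lock current and future pages so the daemon is never paged out
     * (the comment above notes the cluster daemons are already memory-locked). */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0) {
        perror("mlockall");
        exit(1);
    }
}

int main(void)
{
    make_realtime_and_locked();
    /* ... daemon main loop would go here ... */
    return 0;
}
```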
(c) Network saturation preventing lock-traffic TCP messages from getting back and forth in a timely manner, or an inability of the lock master to accept the message due to low system resources. In this case, the lock master either never gets the TCP connect messages from the client or fails to accept them correctly; the client then eventually gives up, aborting the lock attempt. This may be at least partially solved in 1.2.31, in which communication with the local instance of clulockd is performed using UNIX domain sockets instead of TCP. On clusters with several services, the most obvious effect of the TCP vs. UNIX socket change is the distinct lack of several hundred (or more) sockets in the TIME_WAIT state.

Potential solution: It may be further alleviated in the manner of (a). Also, using a private network for cluster communications should eliminate this problem.

(d) Network saturation preventing clulockd from writing information to an iSCSI target. This has mostly the same potential solutions as (c).

(e) Kernel bug causing interruption on loopback network devices (see bugzilla #168665, #155892). #168665 looks a whole lot like this bugzilla: basically, communication over the loopback interface stops working, causing errors.

Solution: This bug should be fully worked around by 1.2.31, since loopback connections are no longer made for locks or lock-master queries.
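For context on the UNIX-domain-socket change mentioned in (c) and (e), the sketch below shows the general shape of connecting to a local lock daemon over an AF_UNIX socket rather than over TCP to 127.0.0.1: no IP stack or loopback device is involved, and no TIME_WAIT sockets are left behind. The socket path and function name are hypothetical, not taken from clumanager.

```c
/* Sketch: local lock traffic over an AF_UNIX socket instead of TCP to
 * 127.0.0.1. The socket path below is a hypothetical example. */
#include <sys/socket.h>
#include <sys/un.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

static int connect_local_lockd(void)
{
    struct sockaddr_un sun;
    int fd;

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return -1;
    }

    memset(&sun, 0, sizeof(sun));
    sun.sun_family = AF_UNIX;
    strncpy(sun.sun_path, "/var/run/example-lockd.sock",  /* hypothetical path */
            sizeof(sun.sun_path) - 1);

    /* No loopback TCP connection, so the pre-2.4.21-37.EL lo bug and the
     * TIME_WAIT socket buildup described above do not apply. */
    if (connect(fd, (struct sockaddr *)&sun, sizeof(sun)) < 0) {
        perror("connect");
        close(fd);
        return -1;
    }
    return fd;
}

int main(void)
{
    int fd = connect_local_lockd();
    if (fd >= 0) {
        /* ... send a lock request and read the reply here ... */
        close(fd);
    }
    return 0;
}
```

Remote lock traffic between cluster members is unaffected by this; the 1.2.31 change described above applies only to communication with the local clulockd instance.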
Other considerations: After a connection is established at the TCP level, clumanager daemons authenticate connections using a challenge/response MD5 mechanism to prevent other clusters from erroneously connecting to the daemons. This challenge/response mechanism also has a 5-second timeout, but the entire sub-system may be disabled by running "clukey delete". In some cases, this could help alleviate some of the messages.

Dear Lon,

Thanks for the reply. I also think the problem has something to do with shared quorum communication, but my shared storage is an MSA SAN 1000, which is certified by Red Hat and known to work well with Red Hat Cluster Suite 3.

About CPU load: the cluster does not have heavy CPU load. The nodes are 2-processor DL380 G3 machines and the load is 3 to 4 most of the time, but if the load rises above 5 (and this is only a spike in load), I can expect a service restart.

Network saturation: in my case I don't think this is happening, as the network is a fast 100/1000 Mbps network using Cisco switches. There is no traffic congestion or packet collision.

Saju John

Saju: Have you tried the latest update of clumanager, the 1.2.31-1 version? Can you confirm that it solves your problem? /David

Dear David,

I cannot do trial-and-error, as it is a live cluster running a critical database service. To avoid the cluster service restarts, I removed the second node from the cluster, which solves the service-restart problem, but at the price of no high availability in case the primary node fails. When the cluster is running with a single node, there are no local service restarts. Therefore I think the problem is a concurrent-communication problem with the quorum partition; I mean that somehow the metadata information on the quorum is getting corrupted.

Saju John

Saju,

Under certain load situations, the loopback device on kernels before 2.4.21-37.EL had a problem where lo communication (e.g. connecting to 127.0.0.1) is disrupted; this causes lock timeouts when talking to the *local* host. Clumanager 1.2.31 uses UNIX domain socket communication for local lock traffic, so this should not be an issue. Your hardware configuration looks fine. I am fairly sure that in your case, upgrading the kernel + clumanager will solve your problem. You can do a rolling upgrade during off-hours if you are worried about downtime.

Dear Lon,

The cluster has an HBA (FCA2214) with a driver from QLogic. QLogic's latest driver version for Red Hat Linux 3.0 32-bit is 7.07.03, which supports kernel 2.4.21-32.ELsmp (i.e. Update 5) and not 2.4.21-37.EL. Because of this, I think an update to 2.4.21-37.ELsmp will not be possible. I would like to know your opinion.

Thank you, Saju John

Hello Saju. Is there a reason why you aren't able to use the default qlogic driver that is included in kernel-2.4.21-40.smp? /David

Dear David,

The default driver, 7.00.03 (if my knowledge is correct), is very old and it used to produce a lot of errors in /var/log/messages. That is why I upgraded the driver in my cluster to 7.05.00 about a year ago. I would like to know whether qla 7.07.03 will support Red Hat AS3 U7, and whether anyone is currently using it.

Thank you, Saju John

Hello Saju. kernel-2.4.21-40.EL in RHEL3 U7 contains the 7.05.00 version of the qlogic driver. /David

Dear David,

I updated the cluster to kernel-2.4.21-40.ELsmp and clumanager-1.2.31-1. It seems that the problem is FIXED, with no more service restarts up to today (it is now 3 days after the update). Please consider this an initial confirmation of the fix. I will monitor the cluster for some more days and report back. I really appreciate the efforts from Mr. Lon and Mr. David in solving this problem.

Thank you, Saju John

Dear David,

It is now more than one month after the update and the cluster is quite stable. Thanks to everyone for their support. I think you can close this bug report.

Thank you, Saju John