Description of problem:
I've been seeing a lockup for the last week and a half that I am unable to track down. At the moment, I am at a loss as to where the problem might be. So far, the symptom is that a node becomes unresponsive to everything except ping. I am unable to log onto the node through serial connections, gettys, or ssh; the only thing I can do is ping it.

I have ccs, cman, dlm, clvm, and gfs all running on 8 nodes of a nine-node cluster. (The ninth node is in pieces at the moment.) I am not running any load on the system except whatever default system cron jobs are running (slocate/updatedb).

Once the node locks up, I see a message printed on the console of one of the other nodes:

CMAN: no HELLO from trin-02, removing from the cluster

The other nodes then try to fence the node. I am using a network-aware manual fencing script that requires me to acknowledge when nodes have been fenced (mm_fence). If I don't acknowledge the node as being fenced after a while (~5 mins), the fencing subsystem becomes totally unresponsive on the master node attempting to do the fence operation. That is probably a behavior deserving its own bug, but since I am unable to determine what is going on in the cluster, I am logging it here for now.

Version-Release number of selected component (if applicable):
[root@trin-06 ~]# rpm -qa | grep -E "(gfs|ccs|cman|dlm|fence|kernel)"
ccs-0.9-0
kernel-2.6.9-1.906_EL
dlm-kernel-2.6.9-3.1
kernel-2.6.9-1.641_EL
kernel-utils-2.4-13.1.37
cman-kernel-2.6.9-3.3
fence-1.3-1
dlm-1.0-0.pre9.1
cman-1.0-0.pre5.0
GFS-kernel-2.6.9-4.2

How reproducible:
At least twice a day for me.

Steps to Reproduce:
1. I wish I knew. It seems more likely to be triggered a few minutes after a node has stopped gfs/clvmd (i.e. a cluster change event).

Actual results:

Expected results:

Additional info:
Assigning this to the dlm component for now, since I've not seen this behavior with gulm yet.
I saw this again with the code from last night's build (built Dec 22 2004 17:15:58). The cluster wasn't under any load other than the RHEL4 defaults (cron/slocate/etc.).
Could you try getting more info using kdb, or point me at the machines that do this so I can try it myself? I've not seen anything like this before.
I can reproduce this pretty readily on a three-node cluster by untarring a kernel source tree and then running 'ls -lR <srctree>' in a loop, where the srctree is on a GFS filesystem (the gfs/clvm/dlm/cman/ccs stack). It happens within a couple of minutes: ping responds, but there is no ssh or console access. The nodes are link-10, link-11, and link-12 if you'd like to use them to see it happen.
I had link-10 and link-11 looping ls -lR on the Linux source tree for about a day. I then got link-12 added to the mix, and all three ran this for a couple of hours. I then had all three run through an iteration of time-tar with the Linux source tree. I'll let them continue running time-tar indefinitely until someone takes the nodes back. Let me know if there's more I should do for this one.
I'd like to see it run with the kernel we're going to ship for RHEL4 (2.6.9-5.EL). I booted back to this kernel and saw the problem happen immediately.
*** Bug 144140 has been marked as a duplicate of this bug. ***
It looks like dlm_sendd gets scheduled while another DLM process is filling a buffer. dlm_sendd notices the buffer and ignores it, but because an outstanding buffer still exists, it loops around and tries to send it again, since the socket says it can. Putting a schedule() in the else branch of "if (len)" gives the other process a chance to commit the buffer and make it sendable.

Checking in lowcomms.c;
/cvs/cluster/cluster/dlm-kernel/src/lowcomms.c,v  <--  lowcomms.c
new revision: 1.27; previous revision: 1.26
done
Checking in lowcomms.c;
/cvs/cluster/cluster/dlm-kernel/src/lowcomms.c,v  <--  lowcomms.c
new revision: 1.22.2.4; previous revision: 1.22.2.3
done
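For anyone reading along, here is a minimal sketch of the loop shape being described and where the schedule() lands. This is an illustrative assumption of the structure, not the actual lowcomms.c source; the names writequeue_entry, connection, send_entry_to_sock(), and dlm_sendd_scan() are made up for the example:

/*
 * Minimal sketch, assuming simplified names: writequeue_entry,
 * connection, and send_entry_to_sock() are stand-ins, not the
 * real lowcomms.c definitions.
 */
#include <linux/list.h>
#include <linux/sched.h>

struct writequeue_entry {
	struct list_head list;
	int len;	/* stays 0 while another process is still filling it */
};

struct connection {
	struct list_head writequeue;
};

/* Assumed to dequeue the entry and write it to the socket. */
static void send_entry_to_sock(struct connection *con,
			       struct writequeue_entry *e);

static void dlm_sendd_scan(struct connection *con)
{
	struct writequeue_entry *e;

	while (!list_empty(&con->writequeue)) {
		e = list_entry(con->writequeue.next,
			       struct writequeue_entry, list);
		if (e->len) {
			/* Buffer is committed; push it to the socket. */
			send_entry_to_sock(con, e);
		} else {
			/*
			 * Another DLM process is still filling this
			 * buffer. The socket stays writable, so without
			 * yielding here the daemon spins on the same
			 * uncommitted entry forever. schedule() lets the
			 * other process run and commit the buffer.
			 */
			schedule();
		}
	}
}

If the pre-fix code simply continued the loop in that else branch, that would be consistent with the symptoms above: a kernel thread hard-spinning the CPU, while ping (handled in interrupt context) still gets answered and anything needing to be scheduled hangs.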
This fix WORKSFORME. The simple ls test ran overnight, so I can get on to some real load on these nodes. Adam opened this bug, so I'll leave it in its current state for a while to see if it fails again for him.