Bug 143448
| Summary: | ccs/cman/dlm/clvm/gfs combination lock up a node -- possible spinlock error | | |
| --- | --- | --- | --- |
| Product: | [Retired] Red Hat Cluster Suite | Reporter: | Adam "mantis" Manthei <amanthei> |
| Component: | dlm | Assignee: | David Teigland <teigland> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4 | CC: | cluster-maint, djansa |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2005-02-22 06:25:49 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 144795 | | |
Description
Adam "mantis" Manthei
2004-12-21 00:12:05 UTC
Assigning this to the dlm component for now, since I've not seen this behavior with gulm yet. I saw this again with the code from last night (built Dec 22 2004 17:15:58). The cluster wasn't doing any load other than what is default on the RHEL4 system (cron/slocate/etc.).

Could you try getting more info using kdb, or point me at the machines which do this so I can try it myself? I've not seen anything like this myself.

I can reproduce this pretty readily in a three-node cluster by untarring a kernel source tree and then running 'ls -lR <srctree>' in a loop, where the srctree is on a gfs/clvm/dlm/cman/ccs stack. It happens within a couple of minutes. Ping responds; no ssh or console access. The nodes are link-10, link-11, and link-12 if you'd like to use them to see it happen.

I had link-10 and link-11 looping ls -lR on the Linux source tree for about a day. I then added link-12 to the mix, and all three were running this for a couple of hours. I then had all three run through an iteration of time-tar with the Linux source tree. I'll let them continue running time-tar indefinitely until someone takes the nodes back. Let me know if there's more I should do for this one.

I'd like to see it run with the kernel we're going to release on for RHEL4 (2.6.9-5.EL).

I booted back to this kernel and saw the problem happen immediately.

*** Bug 144140 has been marked as a duplicate of this bug. ***

It looks like dlm_sendd gets scheduled while another DLM process is filling a buffer. dlm_sendd notices the buffer and ignores it, but because an outstanding buffer still exists, it loops round and tries to send it because the socket says it can. Putting a schedule() in the else branch of "if (len)" gives the other process a chance to commit the buffer and make it sendable.
Checking in lowcomms.c;
/cvs/cluster/cluster/dlm-kernel/src/lowcomms.c,v <-- lowcomms.c
new revision: 1.27; previous revision: 1.26
done

Checking in lowcomms.c;
/cvs/cluster/cluster/dlm-kernel/src/lowcomms.c,v <-- lowcomms.c
new revision: 1.22.2.4; previous revision: 1.22.2.3
done

This fix WORKSFORME. The simple ls test ran overnight, so I can get on to some real load on these nodes.

Adam opened this bug, so I'll leave it in its current state for a while to see if it fails again for him.