Bug 143448

Summary: ccs/cman/dlm/clvm/gfs combination locks up a node -- possible spinlock error
Product: [Retired] Red Hat Cluster Suite
Reporter: Adam "mantis" Manthei <amanthei>
Component: dlm
Assignee: David Teigland <teigland>
Status: CLOSED CURRENTRELEASE
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: medium
Version: 4
CC: cluster-maint, djansa
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2005-02-22 06:25:49 UTC
Bug Blocks: 144795

Description Adam "mantis" Manthei 2004-12-21 00:12:05 UTC
Description of problem:
I've been seeing a lockup for the last week and a half that I am unable
to track down.  At the moment, I am at a loss as to where the problem
might be.  So far, the symptom is that a node becomes unresponsive to
everything except ping: I am unable to log onto the node through serial
connections, gettys, or ssh.  The only thing I can do is ping it.

I have ccs, cman, dlm, clvm and gfs all running on eight nodes of a
nine-node cluster.  (The ninth node is in pieces at the moment.)  I am
not running any load on the system beyond the default system cron jobs
(slocate/updatedb).

Once the node locks up, I see a message printed on the console of one
of the other nodes:
CMAN: no HELLO from trin-02, removing from the cluster

The other nodes then try to fence the node.  I am using a network-aware
manual fencing script (mm_fence) that requires me to acknowledge when
nodes have been fenced.  If I don't acknowledge the node as being fenced
after a while (~5 mins), the fencing subsystem becomes totally
unresponsive on the master node attempting the fence operation.  This is
probably its own bug, but since I am unable to determine what is going
on in the cluster, I am logging it here for now.


Version-Release number of selected component (if applicable):
[root@trin-06 ~]# rpm -qa | grep -E "(gfs|ccs|cman|dlm|fence|kernel)"
ccs-0.9-0
kernel-2.6.9-1.906_EL
dlm-kernel-2.6.9-3.1
kernel-2.6.9-1.641_EL
kernel-utils-2.4-13.1.37
cman-kernel-2.6.9-3.3
fence-1.3-1
dlm-1.0-0.pre9.1
cman-1.0-0.pre5.0
GFS-kernel-2.6.9-4.2


How reproducible:
at least twice a day for me

Steps to Reproduce:
1. I wish I knew.  It seems more likely to be triggered a few minutes
after a node has stopped gfs/clvmd (i.e., a cluster change event).
  
Actual results:


Expected results:


Additional info:

Comment 1 Adam "mantis" Manthei 2004-12-21 00:13:04 UTC
Assigning this to the dlm component for now, since I've not seen this
behavior with gulm yet.

Comment 2 Adam "mantis" Manthei 2004-12-23 15:06:01 UTC
I saw this again with the code from last night (built Dec 22 2004
17:15:58).  The cluster wasn't under any load other than what is
default on a RHEL4 system (cron/slocate/etc.).

Comment 3 David Teigland 2005-01-04 07:48:01 UTC
Could you try getting more info using kdb, or point me at the machines
that do this so I can try it myself?  I've not seen anything like this.


Comment 4 Derek Anderson 2005-01-07 22:04:17 UTC
I can reproduce this pretty readily in a three-node cluster by
untarring a kernel source tree and then running 'ls -lR <srctree>' in a
loop, where the source tree is on a gfs/clvm/dlm/cman/ccs stack.  It
happens within a couple of minutes.  Ping responds, but there is no ssh
or console access.  The nodes are link-10, link-11, and link-12 if
you'd like to use them to see it happen.

Comment 5 David Teigland 2005-01-14 08:10:05 UTC
I had link-10 and link-11 looping ls -lR on a linux src tree for about
a day.  I then got link-12 added to the mix and all three were running
this for a couple of hours.  I then had all three run through an
iteration of time-tar with the linux src tree.

I'll let them continue running time-tar indefinitely until someone
takes the nodes back.  Let me know if there's more I should do for
this one.

Comment 6 Derek Anderson 2005-01-14 17:22:48 UTC
I'd like to see it run with the kernel we're going to release for
RHEL4 (2.6.9-5.EL).  I booted back to this kernel and saw the problem
happen immediately.

Comment 7 David Teigland 2005-01-18 04:11:30 UTC
*** Bug 144140 has been marked as a duplicate of this bug. ***

Comment 8 Christine Caulfield 2005-01-28 11:09:46 UTC
It looks like dlm_sendd gets scheduled while another DLM process is
still filling a buffer.  dlm_sendd notices the buffer and ignores it,
but because an outstanding buffer still exists, it loops around and
tries to send it again because the socket says it can.

Putting a schedule() in the else branch of "if (len)" gives the other
process a chance to commit the buffer and make it sendable.

Checking in lowcomms.c;
/cvs/cluster/cluster/dlm-kernel/src/lowcomms.c,v  <--  lowcomms.c
new revision: 1.27; previous revision: 1.26
done

Checking in lowcomms.c;
/cvs/cluster/cluster/dlm-kernel/src/lowcomms.c,v  <--  lowcomms.c
new revision: 1.22.2.4; previous revision: 1.22.2.3
done
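
For illustration, here is a minimal sketch of the pattern described
above.  It is not the actual lowcomms.c change; the structure and
function names (queue_entry, send_entry, process_writequeue) are
hypothetical stand-ins.  The point is only the shape of the fix: when
the send loop finds a buffer whose length has not been committed yet,
it yields with schedule() instead of spinning on a socket that claims
to be writable.

/*
 * Sketch only -- hypothetical names, not the real lowcomms.c code.
 */
#include <linux/list.h>
#include <linux/sched.h>        /* schedule() */

struct queue_entry {
        struct list_head list;
        int len;                /* bytes committed by the writer; still 0
                                 * while another DLM process is filling it */
};

/* stand-in for the real "push this entry down the socket" path */
static void send_entry(struct queue_entry *e)
{
        /* the real code would write the committed bytes to the socket;
         * here we only unlink the entry so the sketch terminates */
        list_del(&e->list);
}

static void process_writequeue(struct list_head *writequeue)
{
        struct queue_entry *e;

        while (!list_empty(writequeue)) {
                e = list_entry(writequeue->next, struct queue_entry, list);

                if (e->len) {
                        /* buffer has been committed and is sendable */
                        send_entry(e);
                } else {
                        /*
                         * Another process is still filling this buffer.
                         * The socket is writable and the entry exists, so
                         * without this the loop would spin here forever.
                         * Yielding gives the writer a chance to commit the
                         * buffer and make it sendable.
                         */
                        schedule();
                }
        }
}

Everything except the schedule() call in the else branch is scaffolding
for the sketch.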



Comment 9 Derek Anderson 2005-02-01 16:18:44 UTC
This fix WORKSFORME.  The simple ls test ran overnight, so I can get on
to some real load on these nodes.  Adam opened this bug, so I'll leave
it in its current state for a while to see if it fails again for him.