251562 – lowcomms_stop can't stop dlm_recvd

Summary: lowcomms_stop can't stop dlm_recvd

Keywords:
Status:	CLOSED DUPLICATE of bug 238490
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	dlm-kernel
Sub Component:
Version:	4.5
Hardware:	ia64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Christine Caulfield
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-08-09 19:01 UTC by Dean Jansa
Modified:	2009-04-16 23:09 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-08-13 14:39:17 UTC
Target Upstream Version:
Embargoed:

Attachments
link-13 sysreq-t (77.42 KB, text/plain) 2007-08-09 19:05 UTC, Dean Jansa	no flags
link-15 sysreq-t (68.69 KB, text/plain) 2007-08-09 19:06 UTC, Dean Jansa	no flags
link-16 sysreq-t (72.63 KB, text/plain) 2007-08-09 19:06 UTC, Dean Jansa	no flags

Description Dean Jansa 2007-08-09 19:01:44 UTC

Description of problem:

At the end of a revolver run our test suite attempted to stop the clvmd service
on a node, causing clvmd on the other nodes to hang.

link-13 was running 'service clvmd stop', which never completed.  All further
attempts to run any cluster lvm command hung.  I grabbed the stacks and will
attach them to this bug.

Dave took a look at the stacks and said, "I think we'll need to have pjc look at
this -- it does look like a dlm problem.  clvmd trying to shut down dlm_recvd,
but dlm_recvd is blocking on a connection semaphore."

Version-Release number of selected component (if applicable):

2.6.9-55.0.3.ELlargesmp
dlm-kernel-2.6.9-46.16.0.5

How reproducible:

Haven't tried.

Comment 1 Dean Jansa 2007-08-09 19:05:14 UTC

Created attachment 161006
link-13 sysreq-t

Comment 2 Dean Jansa 2007-08-09 19:06:32 UTC

Created attachment 161007
link-15 sysreq-t

Comment 3 Dean Jansa 2007-08-09 19:06:54 UTC

Created attachment 161008
link-16 sysreq-t

Comment 4 Christine Caulfield 2007-08-10 10:35:01 UTC

Oh good grief that "security bug" really opened a can of worms didn't it!

I suspect this is down to the 'othercon' structures being freed while the
receive thread is waiting to use them. I can't see any other way of getting into
the situation described in the sysreq-t dumps (receive_from_sock waiting
for a semaphore that no-one seems to be holding). And the fact that all three
nodes are showing the same symptoms reinforces that it's not really some odd
race condition. I'm going to move the close connection logic into the
receive_from_sock() code so that it always happens on the same thread. The send
queue should be fine because we can manage that ourselves, but incoming stuff is
a little more unpredictable.
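
For illustration only, here is a minimal user-space sketch of the idea of
deferring the close to the receive path. It is not the actual patch (the real
patches are tracked in bug 238490) and it is not kernel code. The names
conn_sketch, conn_sketch_init, request_close and receive_from_sock_sketch are
all hypothetical, and C11 atomics stand in for whatever synchronisation the
real lowcomms code uses. Other threads only flag the connection; the receive
function is the single place where teardown happens, so it can never be left
waiting on a connection that someone else has already torn down.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a lowcomms connection; NOT the real structure. */
struct conn_sketch {
    atomic_bool close_requested; /* may be set by any thread */
    bool sock_open;              /* touched only by the receive thread */
    int msgs_received;           /* touched only by the receive thread */
};

static void conn_sketch_init(struct conn_sketch *con)
{
    assert(con != NULL);
    atomic_init(&con->close_requested, false);
    con->sock_open = true;
    con->msgs_received = 0;
}

/* Callable from any thread: records the request and tears nothing down. */
static void request_close(struct conn_sketch *con)
{
    assert(con != NULL);
    atomic_store(&con->close_requested, true);
}

/*
 * Called only from the receive thread.
 * Returns true if a message was handled, false if the connection is
 * (now) closed. The teardown itself happens here, never in the caller
 * of request_close().
 */
static bool receive_from_sock_sketch(struct conn_sketch *con)
{
    assert(con != NULL);

    if (atomic_load(&con->close_requested)) {
        con->sock_open = false; /* stand-in for releasing the socket */
        return false;
    }
    if (!con->sock_open)
        return false;

    con->msgs_received++;       /* stand-in for reading one message */
    return true;
}
```

The point of the design is that request_close() leaves the connection state
untouched, so there is no window in which the receive side can be using
something another thread has just released.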

Comment 6 Christine Caulfield 2007-08-13 14:39:17 UTC

Make this a duplicate of 238490 as it has all the patches and detail.

*** This bug has been marked as a duplicate of 238490 ***
