Description of problem:
Same symptoms as bz 155729 for both DLM and GULM.

I ran revolver all weekend on the 4 node tank cluster (tank-01, 03, 04, 05) without allowing quorum to be lost (only shooting one node at a time) and never saw any issues. This morning I restarted revolver so that quorum gets lost (three nodes shot each time) and after 40 iterations I saw the hang on all three nodes that were shot. Two of the nodes were stuck starting clvmd and one was stuck doing a vgchange. CMAN on the node left up reported that everyone was part of the cluster. I then killed one of the hung nodes, which allowed the other two hung nodes to get past the deadlock and continue. The killed node then also came back up without problems.

##############################################################

Also hit this last night on a three node gulm cluster. One slave was shot and got stuck coming back up while doing a vgchange. Apparently you do not need to lose quorum in order for this to happen.
OK, I've spotted this in the lab now and am testing a fix. Apologies to everyone - it is a clvmd bug. Still, at least we got rid of some CMAN & DLM bugs in the process!
Created attachment 115349 [details] Don't defer closing of old FDs
That patch should fix the problem. It should be applied to the RPM after the current patch.
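A minimal sketch of the idea behind that patch, assuming a simplified client structure (the struct and function names below are illustrative, not the real clvmd code): instead of remembering a superseded descriptor and closing it on a later pass of the main loop, the old descriptor is closed as soon as it is replaced, so a stale FD can never be picked up again in the meantime.

/* Illustrative only -- not the actual clvmd data structures. */
#include <unistd.h>

struct client_conn {
        int fd;          /* descriptor currently in use for this client */
};

/* Replace a client's descriptor with a new one.
 * Deferred behaviour (the bug): stash the old FD and close it later,
 * leaving a window in which the stale FD can still be selected on.
 * Immediate behaviour (the fix): close the old descriptor right away. */
static void replace_client_fd(struct client_conn *c, int new_fd)
{
        if (c->fd >= 0 && c->fd != new_fd)
                close(c->fd);   /* don't defer: close the old FD now */
        c->fd = new_fd;
}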
I'm seeing this again lately. On a three node cluster (link-01, link-02, link-08) link-01 was shot by revolver; it joined the cluster but hangs when attempting to activate the VGs:

link-01:
Starting ccsd: ip_tables: (C) 2000-2002 Netfilter core team        [  OK  ]
Starting cman: CMAN 2.6.9-37.0 (built Jul 5 2005 12:20:39) installed
CMAN: quorum regained, resuming activity
DLM 2.6.9-35.0 (built Jul 5 2005 12:29:45) installed                [  OK  ]
Starting lock_gulmd: [WARNING]
Starting fence domain: [  OK  ]
Starting clvmd: [  OK  ]
Activating VGs: [HANG]

From another node in the cluster:

[root@link-02 ~]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    3   M   link-01
   2    1    3   M   link-08
   3    1    3   M   link-02

[root@link-02 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[3 2 1]

DLM Lock Space:  "clvmd"                             3   4 run       -
[3 2 1]

DLM Lock Space:  "gfs0"                              4   5 run       -
[3 2]

GFS Mount Group: "gfs0"                              5   6 run       -
[3 2]

I'll try and gather more info.
Interestingly I had this quite often on my Fedora Xen cluster, but it went away when I upgraded the RPMs to:

lvm2-2.01.12-1.0
lvm2-cluster-2.01.09-5.0
dlm-1.0-0.pre21.10
dlm-kernel-xenU-2.6.11.5-20050601.152643.FC4.2
cman-1.0-0.pre33.15
cman-kernel-xenU-2.6.11.5-20050601.152643.FC4.2

from:

cman-kernel-xenU-2.6.11.4-20050517.141233.FC4.3
dlm-kernel-xenU-2.6.11.3-20050425.154843.FC4.16

Though, to be honest, I'm not sure what the difference was (and I don't have the old lvm2-cluster package version, sorry).
Although I still have no new info for you :( (as any debugging attempt causes this not to appear), I'd like to bring attention to this issue again, as we are still seeing it regularly with the init scripts turned on.
I'm trying to figure out if this is in any way peculiar to any hardware type. I see link-01/08 are dual-proc AMD86 boxes. As I only have UP x86 boxes, it may not be surprising that I can't reproduce this. (I do have a single dual x86 box but it's currently too hot here to run it for any length of time.) Have you seen this on other machines?
Just in case (I'm off to the UKUUG Linux conference very soon) I've dropped a debugging version of clvmd in /root of link-08. You'll need to run it as

  clvmd -d 2>/root/clvmd.log &

to capture the output. If you can make it happen with this running, the log files (from all machines) should show what is going on.
The good news is that I've been able to reproduce this on my SMP machine with 3 SMP Xen VMs. The bad news is that it seems to be some strange pthread interaction. Anyway, I'm on the case.
OK, try this:

Checking in clvmd.c;
/cvs/lvm2/LVM2/daemons/clvmd/clvmd.c,v  <--  clvmd.c
new revision: 1.25; previous revision: 1.24
done

clvmd is now a little less cavalier in its signalling of sub-threads: if a thread is known to be waiting nicely, there's no need to signal it; just notifying the condition variable will do. You may have to wait for agk to include this patch in the lvm2-cluster package.
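To illustrate the distinction being made there, here is a hedged, self-contained sketch (the struct, field and function names are made up, not the real clvmd thread code): a worker that is already blocked in pthread_cond_wait() only needs the shared flag set and the condition variable signalled under the mutex; sending it an asynchronous signal (e.g. via pthread_kill) on top of that is unnecessary and is the kind of extra wake-up the checkin removes.

/* Illustrative sketch only -- not the actual clvmd code. */
#include <pthread.h>
#include <stdbool.h>

struct worker {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        bool            work_ready;   /* protected by lock */
};

/* Worker side: block "nicely" until told there is work to do. */
static void worker_wait(struct worker *w)
{
        pthread_mutex_lock(&w->lock);
        while (!w->work_ready)
                pthread_cond_wait(&w->cond, &w->lock);
        w->work_ready = false;
        pthread_mutex_unlock(&w->lock);
}

/* Dispatcher side: notifying the condition variable is enough;
 * no pthread_kill() is needed for a thread parked in worker_wait(). */
static void worker_wake(struct worker *w)
{
        pthread_mutex_lock(&w->lock);
        w->work_ready = true;
        pthread_cond_signal(&w->cond);
        pthread_mutex_unlock(&w->lock);
}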