
Bug 447799

Summary: clvmd init script hangs during lock_gulm startup
Product: [Retired] Red Hat Cluster Suite
Component: lvm2-cluster
Version: 4
Reporter: Corey Marthaler <cmarthal>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: Cluster QE <mspqa-list>
CC: agk, ccaulfie, dwysocha, edamato, jbrassow, mbroz, prockai
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-04-24 14:49:55 UTC
Attachments:
  log from the startup hang (flags: none)
  clvmd -d log (flags: none)
  hung vgscan strace (flags: none)

Description Corey Marthaler 2008-05-21 20:50:42 UTC
Description of problem:
It seems like we see this bug every release cycle. I rebooted all three of my
gulm nodes (with the init scripts enabled), and during startup one of the nodes
got stuck in the clvmd init script while trying to activate the VGs. There
should be some kind of timeout here.

grant-01:
Starting ccsd: [  OK  ]
Starting cman: [WARNING]
Starting clustered mirror log: [WARNING]
Starting lock_gulmd:[  OK  ]
Starting fence domain:[WARNING]
Starting clvmd: [  OK  ]
[DEADLOCK]

[root@grant-02 ~]# gulm_tool getstats grant-01
I_am = Slave
Master = grant-03.lab.msp.redhat.com
rank = 0
quorate = true
GenerationID = 1211401202380673
run time = 1176
pid = 3085
verbosity = Default
failover = enabled


[root@grant-03 ~]# gulm_tool getstats $(hostname)
I_am = Master
quorum_has = 3
quorum_needs = 2
rank = 2
quorate = true
GenerationID = 1211401202380673
run time = 601
pid = 3090
verbosity = Default
failover = enabled


If you search bz, there are countless bzs dealing with this exact issue, but all
appear to be "fixed" and closed. 

Version-Release number of selected component (if applicable):
2.6.9-70.ELsmp
lvm2-2.02.36-1.el4
lvm2-cluster-2.02.36-1.el4

Comment 1 Corey Marthaler 2008-05-21 21:27:22 UTC
Created attachment 306317 [details]
log from the startup hang

Comment 2 Christine Caulfield 2008-05-22 08:51:51 UTC
Can you get a log from clvmd started up as "clvmd -d", please?

I did try to reproduce this on my cluster, but it seems to work fine for me. The
dump we have seems to show clvmd waiting for gulm, but beyond that I can't tell.

Comment 3 Corey Marthaler 2008-06-04 19:22:26 UTC
I've reproduced this with the requested info. The hang is during the 'vgscan'.
I'll attach the clvmd -d log as well as an strace of the hung vgscan.

Comment 4 Corey Marthaler 2008-06-04 19:22:59 UTC
Created attachment 308383 [details]
clvmd -d log

Comment 5 Corey Marthaler 2008-06-04 19:23:47 UTC
Created attachment 308384 [details]
hung vgscan strace

Comment 6 Corey Marthaler 2008-06-04 19:31:42 UTC
*** Bug 444600 has been marked as a duplicate of this bug. ***

Comment 7 Christine Caulfield 2008-06-05 11:56:37 UTC
That's a really bizarre place for the log to end. It ends in the middle of a
loop around the nodes in the cluster; for clvmd to hang there, I think it would
have to be stuck in a dm_hash_* function, which seems VERY odd.

How easy is this for you to reproduce? I've tried quite hard on my 3-node roth
cluster with no luck.

Comment 8 Corey Marthaler 2008-06-05 14:27:52 UTC
The key to reproducing this is to not have clvmd running on the other nodes in
the cluster, just lock_gulmd. So when the clvmd init hangs, it's the only node
attempting to join that service.

Comment 9 Christine Caulfield 2008-06-05 14:31:41 UTC
Yes, I'd guessed that much from the logs; it still works for me, though. I'll
try repeating it ad nauseam.

Comment 10 Christine Caulfield 2008-06-20 11:00:49 UTC
Ah, I think I see what's happening. clvmd sees that the other nodes are down but
still waits for the command timeout to trigger. If you waited for a couple of
minutes, I suspect that vgscan would return. This patch fixes that so it's
consistent with cman in returning "clvmd not running" errors immediately.

Checking in daemons/clvmd/clvmd-gulm.c;
/cvs/lvm2/LVM2/daemons/clvmd/clvmd-gulm.c,v  <--  clvmd-gulm.c
new revision: 1.23; previous revision: 1.22
done