
Bug 447799

Summary: clvmd init script hangs during lock_gulm startup
Product: [Retired] Red Hat Cluster Suite
Component: lvm2-cluster
Version: 4
Reporter: Corey Marthaler <cmarthal>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: Cluster QE <mspqa-list>
CC: agk, ccaulfie, dwysocha, edamato, jbrassow, mbroz, prockai
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-04-24 14:49:55 UTC
Attachments:
  log from the startup hang (flags: none)
  clvmd -d log (flags: none)
  hung vgscan strace (flags: none)

Description Corey Marthaler 2008-05-21 20:50:42 UTC
Description of problem:
It seems like we see this bug every release cycle. I rebooted all three of my
gulm nodes (with the init scripts enabled), and during startup one of the nodes
got stuck in the clvmd init script while trying to activate the VGs. There
should be some kind of timeout here.

grant-01:
Starting ccsd: [  OK  ]
Starting cman: [WARNING]
Starting clustered mirror log: [WARNING]
Starting lock_gulmd:[  OK  ]
Starting fence domain:[WARNING]
Starting clvmd: [  OK  ]
[DEADLOCK]

[root@grant-02 ~]# gulm_tool getstats grant-01
I_am = Slave
Master = grant-03.lab.msp.redhat.com
rank = 0
quorate = true
GenerationID = 1211401202380673
run time = 1176
pid = 3085
verbosity = Default
failover = enabled


[root@grant-03 ~]# gulm_tool getstats $(hostname)
I_am = Master
quorum_has = 3
quorum_needs = 2
rank = 2
quorate = true
GenerationID = 1211401202380673
run time = 601
pid = 3090
verbosity = Default
failover = enabled


If you search bz, there are countless bzs dealing with this exact issue, but all
appear to be "fixed" and closed. 

Version-Release number of selected component (if applicable):
2.6.9-70.ELsmp
lvm2-2.02.36-1.el4
lvm2-cluster-2.02.36-1.el4

Comment 1 Corey Marthaler 2008-05-21 21:27:22 UTC
Created attachment 306317 [details]
log from the startup hang

Comment 2 Christine Caulfield 2008-05-22 08:51:51 UTC
Can you get a log from clvmd started up as "clvmd -d", please?

I did try to reproduce this on my cluster, but it seems to work fine for me. The
dump we have seems to show clvmd waiting for gulm, but beyond that I can't tell.

Comment 3 Corey Marthaler 2008-06-04 19:22:26 UTC
I've reproduced this with the requested info. The hang is during the 'vgscan'.
I'll attach the clvmd -d log as well as an strace of the hung vgscan.

Comment 4 Corey Marthaler 2008-06-04 19:22:59 UTC
Created attachment 308383 [details]
clvmd -d log

Comment 5 Corey Marthaler 2008-06-04 19:23:47 UTC
Created attachment 308384 [details]
hung vgscan strace

Comment 6 Corey Marthaler 2008-06-04 19:31:42 UTC
*** Bug 444600 has been marked as a duplicate of this bug. ***

Comment 7 Christine Caulfield 2008-06-05 11:56:37 UTC
That's a really bizarre place for the log to end. It ends in the middle of a
loop around the nodes in the cluster; for clvmd to hang there, I think it would
have to be stuck in a dm_hash_* function, which seems VERY odd.

How easy is this for you to reproduce? I've tried quite hard on my 3-node roth
cluster with no luck.

Comment 8 Corey Marthaler 2008-06-05 14:27:52 UTC
The key to reproducing this is to not have clvmd running on the other nodes in
the cluster, just lock_gulmd. So when the clvmd init hangs, it's the only node
attempting to join that service.

Comment 9 Christine Caulfield 2008-06-05 14:31:41 UTC
Yes, I'd guessed that much from the logs; it still works for me, though. I'll
try repeating it ad nauseam.

Comment 10 Christine Caulfield 2008-06-20 11:00:49 UTC
Ah, I think I see what's happening. clvmd sees that the other nodes are down but
still waits for the command timeout to trigger. If you waited for a couple of
minutes, I suspect that vgscan would return. This patch fixes that so it's
consistent with cman in returning "clvmd not running" errors immediately.

Checking in daemons/clvmd/clvmd-gulm.c;
/cvs/lvm2/LVM2/daemons/clvmd/clvmd-gulm.c,v  <--  clvmd-gulm.c
new revision: 1.23; previous revision: 1.22
done