Description of problem: I tried the test case in 157094 (propagate a conf file while a node is down) and then ended up getting the downed node (morph-03) back into the cluster:

[root@morph-03 ~]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    4   M   morph-01
   2    1    4   M   morph-02
   3    1    4   M   morph-04
   4    1    4   M   morph-03

[root@morph-03 ~]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 2
Cluster name: morph-cluster
Cluster ID: 41652
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 4
Expected_votes: 4
Total_votes: 4
Quorum: 3
Active subsystems: 1
Node name: morph-03
Node addresses: 10.15.84.63

[root@morph-03 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 3 1 4]

but now I can't get clvmd to start:

[root@morph-03 ~]# clvmd
clvmd could not connect to cluster manager
Consult syslog for more information

SYSLOG:
Jun 7 07:01:23 morph-03 clvmd: Unable to create lockspace for CLVM: No such file or directory

Another node's services:

[root@morph-01 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[3 1 2 4]
DLM Lock Space:  "clvmd"                             3   3 run       -
[3 1 2]
DLM Lock Space:  "gfs0"                              4   4 run       -
[3 1 2]
DLM Lock Space:  "Magma"                             7   7 run       -
[3 2 1]
GFS Mount Group: "gfs0"                              5   5 run       -
[3 1 2]
User:            "usrm::manager"                     6   6 run       -
[3 1 2]
what's in /proc/misc & /dev/misc ? Either the dlm.ko module isn't loaded or the /dev/misc/dlm-control device hasn't been created. libdlm is supposed to create this on-demand but it may not have been able to for some reason I can't see at the moment.
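A quick way to answer that question is to look for a dlm entry in /proc/misc. The sketch below is a hedged illustration, not part of the original report: it runs the check against the sample /proc/misc contents posted later in this thread (comment #2 of the follow-up), so the decision logic is self-contained.

```shell
# Illustrative diagnostic (assumed logic, not from the report): decide
# from /proc/misc text whether the dlm misc device is registered.
# Sample contents are the ones pasted in the follow-up comment.
proc_misc='183 hw_random
 63 device-mapper
135 rtc
227 mcelog'

if printf '%s\n' "$proc_misc" | grep -qw dlm; then
    echo "dlm misc device registered"
else
    echo "dlm misc device missing: dlm.ko is probably not loaded"
fi
```

On a live node you would run `grep -w dlm /proc/misc` directly; no dlm line means the module never registered, which matches the "No such file or directory" error from clvmd.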
I saw this again after hitting 159877. The dlm module did not get loaded like it should have. I loaded it by hand and then it worked just fine. There was nothing in /dev/misc, and /proc/misc contained:

[root@link-01 ~]# cat /proc/misc
183 hw_random
 63 device-mapper
135 rtc
227 mcelog
So this needs reassigning to whoever is responsible for the init scripts or the GUI (not sure which)?
clvmd will not load the dlm module; that is cman's job. Looking at the output you've provided in bug #159877, cman failed to start, which will prevent dlm.ko from getting loaded. Without cman properly working, how do you expect clvmd to work, even with dlm.ko loaded? Sounds like it's NOTABUG to me.
I'm not sure where in 159877 it mentions cman not starting. Cman was indeed started; I (by hand or with a script) would never try to start clvmd without first starting a lock manager, as it's required. In the original report, I posted the contents of /proc/cluster/nodes, status, and services, which show that cman was running and every node was in the cluster. I didn't post statuses for the second time I saw this since they were the same as the first. Again, after modprobing dlm by hand on the node where I saw this issue, clvmd started just fine, since the cluster was already up.
(In reply to comment #5)
> I'm not sure where in 159877 it mentions cman not starting.

Oops, you're right. I made a typo entering the bug number; bug #157094 demonstrates that the cman script failed to start.

> Cman was indeed started

But according to bug #157094, it failed.

> I (by hand or with a script) would never try to
> start clvmd without first starting a lock manager, as it's required.

Did you verify that it was running properly?
(In reply to comment #6)
> But according to bug #157094, it failed

Correct, it did fail, but at the end of that comment in that bug, I mention that I then tried it by hand and got it working. :)

> Did you verify that it was running properly?

Indeed.
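For reference, that verification can be scripted. The sketch below is a hedged illustration (not something from this report): it checks the "Membership state" line of /proc/cluster/status, using the sample text from the original description so the example is self-contained.

```shell
# Illustrative pre-flight check (assumed logic): only start clvmd if
# cman reports Cluster-Member state. Sample status text is taken from
# the original report; on a real node you would cat /proc/cluster/status.
status='Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 4
Total_votes: 4
Quorum: 3'

if printf '%s\n' "$status" | grep -q '^Membership state: Cluster-Member$'; then
    echo "cman membership OK, safe to start clvmd"
else
    echo "cman not a cluster member, do not start clvmd"
fi
```

Note that this only proves cman joined the cluster; as this bug shows, dlm.ko can still be missing even when membership looks healthy.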
I guess I just don't understand what your test case is then. Here is what I've been able to discern:

0. You have some nodes with some initial configuration.
1. You killed a node.
2. You changed something in cluster.conf.
3. You propagated the changes, but they didn't make it to the failed node (bug #157094).
4. Your killed node came back online.
5. The killed node was unable to join the cman cluster because its cluster.conf was not at the same revision level. This caused the cman script to fail (which prevents DLM from loading).
6. Because cman failed to successfully join, the dlm module was not loaded and clvmd failed to start, hence this bug report.
7. After the system came up, "everything" was done by hand and things worked.

Based on the above description, this would appear to depend on the REOPENED bug #157094. Once that is fixed, this should go away, since the cman init script will load the dlm module. As such, I'm marking this as DEPENDS ON bug #157094 rather than marking it NOTABUG. In the meantime, I'm placing this in NEEDINFO status until a reproducible test is written that demonstrates this bug ("I then tried everything on morph-01 by hand" just doesn't help me).
I'll try to gather more info for you. The simplest case in which I've seen this is in comment #2. There, no config changes were made at all. I merely had a cluster up and running; all the nodes panicked at the same time (due to 159877) and I power-cycled them. When they were back up, I started ccsd, cman, fenced, and clvmd. On one node clvmd failed to start; I then modprobed dlm, and it did.
For some odd reason, I thought that one of the cluster startup commands also loaded the dlm module; that is not true. Only the init script does a dlm module load (and only if cman joins the cluster first). So as the init script currently stands, Adam is right: without cman starting through the init script, dlm will not get loaded and will have to be loaded by hand. Jon and I discussed moving the dlm module load up next to the cman module load; then it wouldn't be dependent on cman actually joining the cluster, and it would be less confusing for users when clvmd fails.
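The reordering discussed above can be sketched roughly as follows. This is a hedged illustration, not the real cman init script: the function names are stubs (echo instead of modprobe/cman_tool) so the ordering is visible without root or cluster hardware.

```shell
# Illustrative init-script ordering (stubbed; not the actual script):
# load dlm right after the cman module load instead of gating it on a
# successful cluster join.
load_module()  { echo "modprobe $1"; }
join_cluster() { echo "cman_tool join"; }

start() {
    load_module cman || return 1
    # Moved up: dlm now loads regardless of whether the join below
    # succeeds, so a join later finished by hand still leaves clvmd
    # able to start.
    load_module dlm  || return 1
    join_cluster     || return 1
}
start
```

With the old ordering, a failed join short-circuits before the dlm load ever runs, which is exactly the state observed in this bug.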
Reassigning to Adam, as he seems to know what's going on and it doesn't look to be my problem ;-)
Fix verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0556.html