Bug 138491 - existence of dlm-control causes clvmd startup to fail
Summary: existence of dlm-control causes clvmd startup to fail
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: dlm
Version: 4
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: David Teigland
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2004-11-09 16:28 UTC by Corey Marthaler
Modified: 2009-04-16 20:29 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-02-28 21:45:01 UTC
Embargoed:


Attachments
check when opening the dlm-control device that its major/minor match the assignments in /proc/misc (1.52 KB, patch)
2004-12-20 23:28 UTC, Adam "mantis" Manthei

Description Corey Marthaler 2004-11-09 16:28:02 UTC
Description of problem:
Dean and I have seen this bug when attempting to start up clvmd.

[root@morph-01 root]# clvmd
clvmd could not connect to cluster
[root@morph-01 root]# ls -l /dev/misc/dlm-control
crw-------  1 root root 10, 63 Nov  8 18:42 /dev/misc/dlm-control
[root@morph-01 root]# rm /dev/misc/dlm-control
rm: remove character special file `/dev/misc/dlm-control'? y
[root@morph-01 root]# clvmd
[root@morph-01 root]# ls -l /dev/misc/dlm-control
crw-------  1 root root 10, 62 Nov  9 11:26 /dev/misc/dlm-control

Note the difference in major/minor numbers.

[root@morph-01 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 5 2 4 3]

DLM Lock Space:  "clvmd"                             2   3 run       -
[1 3 2 4 5]

How reproducible:
Sometimes

Comment 1 Christine Caulfield 2004-11-10 15:27:30 UTC
I would recommend that the startup script delete all the dlm*
devices.

I can put a check into libdlm to compare the entry in /proc/misc with
the minor number of the device but it's quite a large overhead on each
open.
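
A minimal shell sketch of the comparison described above, for checking a node by hand. It is not the libdlm code; the function names (misc_minor, node_minor, node_matches_misc) are hypothetical, and it assumes GNU stat and that the driver registers itself in /proc/misc under the name "dlm-control".

```shell
#!/bin/sh
# Sketch only: compare a misc device node's minor number with the
# minor the kernel currently lists in /proc/misc.

# misc_minor NAME [TABLE]
# Print the minor registered for NAME in a /proc/misc-format file
# (lines look like " 62 dlm-control"). Prints nothing if NAME is absent.
misc_minor() {
    awk -v name="$1" '$2 == name { print $1; exit }' "${2:-/proc/misc}"
}

# node_minor PATH
# Print the decimal minor of a device node. GNU stat reports it in hex (%T).
node_minor() {
    printf '%d\n' "0x$(stat -c %T "$1")"
}

# node_matches_misc PATH NAME [TABLE]
# Returns 0 if the node's minor matches the registered one, 1 if it is
# stale, 2 if NAME is not registered, 3 if PATH is not a character device.
node_matches_misc() {
    want=$(misc_minor "$2" "$3")
    [ -n "$want" ] || return 2
    [ -c "$1" ] || return 3
    have=$(node_minor "$1")
    [ "$have" -eq "$want" ]
}
```

For example, `node_matches_misc /dev/misc/dlm-control dlm-control || echo "node missing or stale"` would flag the 10,63 node from the description if the kernel had registered minor 62.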

Comment 2 Corey Marthaler 2004-11-10 16:59:52 UTC
If it goes into a startup script, then this should be documented
somewhere because a user who doesn't use a start up script will not
know to translate "clvmd could not connect to cluster" to mean they
need to run "rm -f /dev/misc/dlm-control".
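
For the record, the startup-script workaround under discussion could look roughly like the sketch below. The function name remove_dlm_nodes is hypothetical and this is not taken from any shipped init script; it relies on libdlm recreating the node on the next open, as the transcript in the description shows.

```shell
#!/bin/sh
# Sketch only: clear leftover dlm* device nodes before starting clvmd,
# so that the node is recreated with the current minor number.

# remove_dlm_nodes [DIR]
# Delete every entry named dlm* in DIR (default /dev/misc).
remove_dlm_nodes() {
    dir=${1:-/dev/misc}
    for node in "$dir"/dlm*; do
        # An unmatched glob stays literal; skip it.
        [ -e "$node" ] || continue
        rm -f "$node"
    done
}
```

An init script would call `remove_dlm_nodes` once, before launching clvmd.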

Comment 3 Christine Caulfield 2004-11-11 08:52:52 UTC
Quite so.

Also consider:

a) Anyone running this "supported" should use our startup scripts.
b) Anyone running this on a slightly recent OS will have udev mounted
over tmpfs so that redundant device nodes don't exist anyway.
c) Anyone who deliberately ignores a provided startup script really
ought to know what they are doing (but probably doesn't IME).
d) device-mapper has exactly the same problem - I've been through this
with Debian. Even in this environment, where people like to muck
around with their system, a startup script calmed everything.



Comment 4 Christine Caulfield 2004-11-12 10:41:09 UTC
To help the Udev case I've checked in a script into
dlm/scripts/51-dlm.rules

It goes into /etc/udev/rules.d

Comment 5 Christine Caulfield 2004-12-20 16:15:53 UTC
On Mon, 2004-12-20 at 09:51, Patrick Caulfield wrote:
> #138491 should be against the init scripts but I can't find a
> component for them.
>
Init scripts should be part of the component, in this case dlm.  You can
reassign the defect to Adam to fix however :).

Kevin


Comment 6 Adam "mantis" Manthei 2004-12-20 17:08:15 UTC
Why is this an init.d script bug?  This is something that should go in
the library.  libdlm knows how to create the device node, so it should
also know how to detect if the device node has got the wrong
major/minor numbers.  

Additionally, this isn't just a clvmd problem, but any user space
process using libdlm.  It's my understanding that dlm is a separate
entity so that programs other than clvm can use it.  As such, a more
correct approach should be taken to fixing the problem rather than
pushing it off onto every application that may need to use it.

I don't follow Patrick's argument in comment #1.  Looking at the code,
it seems that the device node check is only done in
open_control_device() which is called in dlm_create_lockspace() and
dlm_release_lockspace().  How often are these two functions called? 
If it's just once at start and stop of the application, the overhead
for doing the proper checking is quite insignificant.

I can apply the workaround to the init script as a temporary hack, but
I want to understand why we aren't properly fixing the bug before I
do.  If I do that, we will also then need to document this behavior as
Corey correctly points out in comment #2.

Lastly, if the dlm/scripts/51-dlm.rules script needs to be added to
the rpm, a new bug assigned to Chris Feist should be created.


Comment 7 Adam "mantis" Manthei 2004-12-20 20:23:04 UTC
Corey - How did you get differing minor numbers for /dev/misc/dlm-control?

Comment 8 Corey Marthaler 2004-12-20 20:28:53 UTC
I believe the differing minor numbers were left over from a past
running cluster. We'd then tear that cluster down and build it back up
fresh (nothing changed though, it was still the same nodes and storage
and such).

Comment 9 Adam "mantis" Manthei 2004-12-20 23:28:14 UTC
Created attachment 108924
check when opening the dlm-control device that its major/minor match the assignments in /proc/misc

Comment 10 Adam "mantis" Manthei 2004-12-21 17:52:46 UTC
I checked the patch from comment #9 into CVS.  The update is in

/cvs/cluster/cluster/dlm/lib/libdlm.c revision: 1.15

Comment 11 Corey Marthaler 2005-01-10 23:51:05 UTC
fix verified.

Comment 12 Dean Jansa 2005-01-20 20:48:49 UTC
Hmm, looks like this has popped up again:

Using the rpms from the cluster-i686-2005-01-18-1415 build:

[root@tank-02 misc]# ls -l
total 0
crw-------  1 root root 10, 60 Jan 18 16:15 dlm_clvmd
crw-------  1 root root 10, 63 Jan 18 14:58 dlm-control

[root@tank-02 misc]# clvmd
clvmd could not connect to cluster manager
Consult syslog for more information

[root@tank-02 misc]# rm ./dlm-control
rm: remove character special file `./dlm-control'? y

[root@tank-02 misc]# clvmd
[root@tank-02 misc]# ls -l
total 0
crw-------  1 root root 10, 60 Jan 18 16:15 dlm_clvmd
crw-------  1 root root 10, 62 Jan 20 14:47 dlm-control

[root@tank-02 misc]# rpm -qa | grep -i dlm
dlm-kernel-2.6.9-10.0
dlm-devel-1.0-0.pre18.1
dlm-kernheaders-2.6.9-10.0
dlm-1.0-0.pre18.1

from /var/log/messages: Jan 20 14:38:26 tank-02 kernel: Lock_DLM
(built Jan 18 2005 14:28:48) installed

[root@tank-02 misc]# rpm -qa | grep -i lvm
lvm2-cluster-2.00.32-1.0.RHEL4
lvm2-2.00.15-2



Comment 13 Adam "mantis" Manthei 2005-01-20 22:59:44 UTC
for future reference, could you also include cat /proc/misc for this
bug as well?  

Comment 14 Adam "mantis" Manthei 2005-01-25 16:40:21 UTC
Moving to NEEDINFO.

I've scanned the code and am not seeing anything jump out at me as
wrong.  I've also tried reproducing the bug but have not had success.

Before I invest any more time in this, I am going to need:
1. a testscript that reliably demonstrates the bug
2. an strace of clvmd when it fails to start
3. cat /proc/devices
4. cat /proc/misc
5. ls -l /dev/dlm*
6. whoami

I will also need to know whether this bug happens only on specific
nodes or on all nodes.

Was clvmd started as root? (the steps in comment #12 appear to
indicate that it was)
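
Items 2 through 6 above could be gathered in one pass with something like the sketch below. The function names and output file names are hypothetical, and /dev/misc/dlm* is listed in addition to /dev/dlm* because that is where the nodes appear in comment #12. The strace run is kept in its own function since it actually starts clvmd.

```shell
#!/bin/sh
# Sketch only: collect the requested diagnostics into one directory.

# collect_dlm_info [OUTDIR]
# Save whoami, /proc/devices, /proc/misc and the dlm device listing.
# Failures (for example a missing file) are recorded in the output
# files rather than aborting the collection.
collect_dlm_info() {
    out=${1:-./dlm-debug}
    mkdir -p "$out" || return 1
    whoami                         > "$out/whoami.txt"       2>&1 || true
    cat /proc/devices              > "$out/proc-devices.txt" 2>&1 || true
    cat /proc/misc                 > "$out/proc-misc.txt"    2>&1 || true
    ls -l /dev/dlm* /dev/misc/dlm* > "$out/dev-dlm.txt"      2>&1 || true
    return 0
}

# trace_clvmd [OUTDIR]
# Start clvmd under strace, following child processes, and write the
# trace to OUTDIR/clvmd.strace. Run this on a node where clvmd fails.
trace_clvmd() {
    out=${1:-./dlm-debug}
    mkdir -p "$out" || return 1
    strace -f -o "$out/clvmd.strace" clvmd
}
```

Running `collect_dlm_info /tmp/bz138491` followed by `trace_clvmd /tmp/bz138491` as root on the failing node would leave everything in one directory to attach here.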

Comment 15 David Teigland 2006-01-04 18:14:21 UTC
I don't think this has been a problem for a long time.


Comment 16 Corey Marthaler 2006-02-28 21:45:01 UTC
We've been running without the workaround for this bug for quite a while, closing...

