|Summary:||existence of dlm-control causes clvmd startup to fail|
|Product:||[Retired] Red Hat Cluster Suite||Reporter:||Corey Marthaler <cmarthal>|
|Component:||dlm||Assignee:||David Teigland <teigland>|
|Status:||CLOSED WORKSFORME||QA Contact:||Cluster QE <mspqa-list>|
|Fixed In Version:||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2006-02-28 21:45:01 UTC||Type:||---|
Description Corey Marthaler 2004-11-09 16:28:02 UTC
Description of problem: Dean and I have seen this bug when attempting to start up clvmd.

[root@morph-01 root]# clvmd
clvmd could not connect to cluster
[root@morph-01 root]# ls -l /dev/misc/dlm-control
crw------- 1 root root 10, 63 Nov 8 18:42 /dev/misc/dlm-control
[root@morph-01 root]# rm /dev/misc/dlm-control
rm: remove character special file `/dev/misc/dlm-control'? y
[root@morph-01 root]# clvmd
[root@morph-01 root]# ls -l /dev/misc/dlm-control
crw------- 1 root root 10, 62 Nov 9 11:26 /dev/misc/dlm-control

Note the difference in major/minor numbers.

[root@morph-01 root]# cat /proc/cluster/services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run - [1 5 2 4 3]
DLM Lock Space: "clvmd" 2 3 run - [1 3 2 4 5]

How reproducible: Sometimes
Comment 1 Christine Caulfield 2004-11-10 15:27:30 UTC
I would recommend that the startup script deletes all the dlm* devices. I can put a check into libdlm to compare the entry in /proc/misc with the minor number of the device, but it's quite a large overhead on each open.
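A minimal sketch of the check described in this comment, written in Python rather than libdlm's C. It assumes /proc/misc lists one "minor name" pair per line and that the control device registers under the name dlm-control; the function names are hypothetical and are not part of libdlm.

```python
import os
import stat

MISC_MAJOR = 10  # major number shared by Linux misc devices


def parse_proc_misc(text):
    """Map misc device names to minor numbers from /proc/misc-style text."""
    minors = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0].isdigit():
            minors[fields[1]] = int(fields[0])
    return minors


def node_is_stale(node_rdev, registered_minor):
    """True if a device number differs from (MISC_MAJOR, registered_minor)."""
    actual = (os.major(node_rdev), os.minor(node_rdev))
    return actual != (MISC_MAJOR, registered_minor)


def is_control_node_stale(path="/dev/misc/dlm-control",
                          proc_misc="/proc/misc",
                          name="dlm-control"):
    """Compare the node at path with the minor the kernel currently reports."""
    with open(proc_misc) as f:
        minors = parse_proc_misc(f.read())
    if name not in minors:
        raise LookupError(f"{name} is not registered in {proc_misc}")
    st = os.stat(path)
    if not stat.S_ISCHR(st.st_mode):
        return True  # not a character device at all, so it cannot be valid
    return node_is_stale(st.st_rdev, minors[name])
```

With the numbers from the description, a node created as 10, 63 would be reported stale once the kernel registers dlm-control under minor 62.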
Comment 2 Corey Marthaler 2004-11-10 16:59:52 UTC
If it goes into a startup script, then this should be documented somewhere because a user who doesn't use a start up script will not know to translate "clvmd could not connect to cluster" to mean they need to run "rm -f /dev/misc/dlm-control".
Comment 3 Christine Caulfield 2004-11-11 08:52:52 UTC
Quite so. Also consider:
a) Anyone running this "supported" should use our startup scripts.
b) Anyone running this on a slightly recent OS will have udev mounted over tmpfs so that redundant device nodes don't exist anyway.
c) Anyone who deliberately ignores a provided startup script really ought to know what they are doing (but probably doesn't IME).
d) device-mapper has exactly the same problem - I've been through this with Debian. Even in this environment, where people like to muck around with their system, a startup script calmed everything.
Comment 4 Christine Caulfield 2004-11-12 10:41:09 UTC
To help the Udev case I've checked in a script into dlm/scripts/51-dlm.rules It goes into /etc/udev/rules.d
Comment 5 Christine Caulfield 2004-12-20 16:15:53 UTC
On Mon, 2004-12-20 at 09:51, Patrick Caulfield wrote:
> #138491 should be against the init scripts but I can't find a component for
> them.

Init scripts should be part of the component, in this case dlm. You can reassign the defect to Adam to fix however :).

Kevin
Comment 6 Adam "mantis" Manthei 2004-12-20 17:08:15 UTC
Why is this an init.d script bug? This is something that should go in the library. libdlm knows how to create the device node, so it should also know how to detect if the device node has got the wrong major/minor numbers. Additionally, this isn't just a clvmd problem, but a problem for any user space process using libdlm. It's my understanding that dlm is a separate entity so that programs other than clvm can use it. As such, a more correct approach should be taken to fixing the problem rather than pushing it off onto every application that may need to use it.

I don't follow Patrick's argument in comment #1. Looking at the code, it seems that the device node check is only done in open_control_device(), which is called in dlm_create_lockspace() and dlm_release_lockspace(). How often are these two functions called? If it's just once at start and stop of the application, the overhead for doing the proper checking is quite insignificant.

I can apply the workaround to the init script as a temporary hack, but I want to understand why we aren't properly fixing the bug before I do. If I do that, we will also then need to document this behavior, as Corey correctly points out in comment #2.

Lastly, if the dlm/scripts/51-dlm.rules script needs to be added to the rpm, a new bug assigned to Chris Feist should be created.
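The library-side approach argued for here could look roughly like the following Python sketch. It is an illustration only, not the libdlm code: plan_node_action and ensure_control_node are hypothetical names, the registered minor is assumed to have been read from /proc/misc already, and creating the node requires root.

```python
import os
import stat

MISC_MAJOR = 10  # major number shared by Linux misc devices


def plan_node_action(existing_rdev, registered_minor):
    """Decide how to treat the control node.

    existing_rdev is the node's device number, or None if the node is absent.
    """
    if existing_rdev is None:
        return "create"
    actual = (os.major(existing_rdev), os.minor(existing_rdev))
    if actual == (MISC_MAJOR, registered_minor):
        return "use"
    return "recreate"


def ensure_control_node(path, registered_minor):
    """Leave path as a character device matching the registered minor."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        rdev = None
    else:
        # A non-character file at this path is treated like a stale node.
        rdev = st.st_rdev if stat.S_ISCHR(st.st_mode) else os.makedev(0, 0)
    action = plan_node_action(rdev, registered_minor)
    if action == "recreate":
        os.unlink(path)
    if action != "use":
        device = os.makedev(MISC_MAJOR, registered_minor)
        os.mknod(path, stat.S_IFCHR | 0o600, device)
    return action
```

Because the decision is a single stat plus a comparison, running it once per lockspace create or release, as the comment suggests, would add very little work.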
Comment 7 Adam "mantis" Manthei 2004-12-20 20:23:04 UTC
Corey - How did you get differing minor numbers for /dev/misc/dlm-control?
Comment 8 Corey Marthaler 2004-12-20 20:28:53 UTC
I believe the differing minor numbers were left over from a past running cluster. We'd then tear that cluster down and build it back up fresh (nothing changed though, it was still the same nodes and storage and such).
Comment 9 Adam "mantis" Manthei 2004-12-20 23:28:14 UTC
Created attachment 108924: check when opening the dlm-control device that its major/minor match the assignments in /proc/misc
Comment 10 Adam "mantis" Manthei 2004-12-21 17:52:46 UTC
I checked the patch from comment #9 into CVS. The update is in /cvs/cluster/cluster/dlm/lib/libdlm.c, revision 1.15.
Comment 11 Corey Marthaler 2005-01-10 23:51:05 UTC
Comment 12 Dean Jansa 2005-01-20 20:48:49 UTC
Hmm, looks like this has popped up again. Using the rpms from the cluster-i686-2005-01-18-1415 build:

[root@tank-02 misc]# ls -l
total 0
crw------- 1 root root 10, 60 Jan 18 16:15 dlm_clvmd
crw------- 1 root root 10, 63 Jan 18 14:58 dlm-control
[root@tank-02 misc]# clvmd
clvmd could not connect to cluster manager
Consult syslog for more information
[root@tank-02 misc]# rm ./dlm-control
rm: remove character special file `./dlm-control'? y
[root@tank-02 misc]# clvmd
[root@tank-02 misc]# ls -l
total 0
crw------- 1 root root 10, 60 Jan 18 16:15 dlm_clvmd
crw------- 1 root root 10, 62 Jan 20 14:47 dlm-control
[root@tank-02 misc]# rpm -qa | grep -i dlm
dlm-kernel-2.6.9-10.0
dlm-devel-1.0-0.pre18.1
dlm-kernheaders-2.6.9-10.0
dlm-1.0-0.pre18.1

from /var/log/messages:
Jan 20 14:38:26 tank-02 kernel: Lock_DLM (built Jan 18 2005 14:28:48) installed

[root@tank-02 misc]# rpm -qa | grep -i lvm
lvm2-cluster-2.00.32-1.0.RHEL4
lvm2-2.00.15-2
Comment 13 Adam "mantis" Manthei 2005-01-20 22:59:44 UTC
For future reference, could you also include the output of cat /proc/misc for this bug?
Comment 14 Adam "mantis" Manthei 2005-01-25 16:40:21 UTC
Moving to NEEDINFO. I've scanned the code and am not seeing anything jump out at me as wrong. I've also tried reproducing the bug but have not had success. Before I invest any more time in this, I am going to need:
1. a test script that reliably demonstrates the bug
2. an strace of clvmd when it fails to start
3. cat /proc/devices
4. cat /proc/misc
5. ls -l /dev/dlm*
6. whoami

I will also need to know whether this bug happens only on specific nodes or on all nodes. Was clvmd started as root? (The steps in comment #12 appear to indicate that it was.)
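Most of the requested items can be gathered in one pass. The following Python sketch covers items 3 to 6 only (the test script and the strace still have to be produced separately). The helper names and the section headings are hypothetical, and /dev/misc/dlm* is added to the search patterns on the assumption that the nodes live where the earlier comments show them.

```python
import glob
import os


def read_text(path):
    """Return a file's contents, or a marker line if it cannot be read."""
    try:
        with open(path) as f:
            return f.read()
    except OSError as exc:
        return f"<unreadable: {exc}>\n"


def list_nodes(patterns):
    """Return 'path major, minor' lines for every path matching the patterns."""
    lines = []
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            st = os.lstat(path)
            lines.append(f"{path} {os.major(st.st_rdev)}, {os.minor(st.st_rdev)}")
    return lines


def collect_report(patterns=("/dev/dlm*", "/dev/misc/dlm*")):
    """Gather /proc/devices, /proc/misc, dlm nodes and the euid as text."""
    parts = [
        "== /proc/devices ==", read_text("/proc/devices").rstrip(),
        "== /proc/misc ==", read_text("/proc/misc").rstrip(),
        "== dlm device nodes ==", *list_nodes(patterns),
        "== effective uid ==", str(os.geteuid()),
    ]
    return "\n".join(parts) + "\n"
```

Calling print(collect_report()) on the failing node right after clvmd refuses to start would capture the device numbers before anything is removed; an effective uid of 0 answers the "started as root" question.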
Comment 15 David Teigland 2006-01-04 18:14:21 UTC
I don't think this has been a problem for a long time.
Comment 16 Corey Marthaler 2006-02-28 21:45:01 UTC
We've been running without the workaround for this bug for quite a while; closing.