Description of problem:
Dean and I have seen this bug when attempting to start up clvmd.

[root@morph-01 root]# clvmd
clvmd could not connect to cluster
[root@morph-01 root]# ls -l /dev/misc/dlm-control
crw------- 1 root root 10, 63 Nov 8 18:42 /dev/misc/dlm-control
[root@morph-01 root]# rm /dev/misc/dlm-control
rm: remove character special file `/dev/misc/dlm-control'? y
[root@morph-01 root]# clvmd
[root@morph-01 root]# ls -l /dev/misc/dlm-control
crw------- 1 root root 10, 62 Nov 9 11:26 /dev/misc/dlm-control

Note the difference in major/minor numbers.

[root@morph-01 root]# cat /proc/cluster/services
Service          Name                    GID  LID  State  Code
Fence Domain:    "default"                 1    2  run    -
[1 5 2 4 3]

DLM Lock Space:  "clvmd"                   2    3  run    -
[1 3 2 4 5]

How reproducible:
Sometimes
I would recommend that the startup script simply delete all the dlm* devices. Alternatively, I could put a check into libdlm that compares the entry in /proc/misc with the minor number of the device node, but that is quite a large overhead on each open.
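For illustration only (this is not the actual libdlm code, and the program below is hypothetical): misc-device minors are assigned dynamically, so after a module reload the minor registered in /proc/misc can differ from a stale /dev/misc/dlm-control node. A check of the kind described above might look roughly like this:

    /* Sketch: compare the minor registered for dlm-control in /proc/misc
     * with the minor of the existing /dev/misc/dlm-control node. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/sysmacros.h>

    #define DLM_CONTROL "/dev/misc/dlm-control"
    #define MISC_MAJOR  10

    /* Return the minor number registered for `name' in /proc/misc, or -1. */
    static int proc_misc_minor(const char *name)
    {
            FILE *fp = fopen("/proc/misc", "r");
            char line[256], entry[64];
            int found = -1, num;

            if (!fp)
                    return -1;
            while (fgets(line, sizeof(line), fp)) {
                    if (sscanf(line, "%d %63s", &num, entry) == 2 &&
                        strcmp(entry, name) == 0) {
                            found = num;
                            break;
                    }
            }
            fclose(fp);
            return found;
    }

    int main(void)
    {
            struct stat st;
            int reg_minor = proc_misc_minor("dlm-control");

            if (reg_minor < 0) {
                    fprintf(stderr, "dlm-control not registered in /proc/misc\n");
                    return 1;
            }
            if (stat(DLM_CONTROL, &st) < 0) {
                    printf("%s missing: needs to be created with minor %d\n",
                           DLM_CONTROL, reg_minor);
                    return 1;
            }
            if (major(st.st_rdev) != MISC_MAJOR ||
                (int)minor(st.st_rdev) != reg_minor) {
                    printf("%s is stale: node is %u,%u but /proc/misc says %d,%d\n",
                           DLM_CONTROL, major(st.st_rdev), minor(st.st_rdev),
                           MISC_MAJOR, reg_minor);
                    return 1;
            }
            printf("%s matches /proc/misc\n", DLM_CONTROL);
            return 0;
    }

A stale node found this way could then be removed and recreated (e.g. via mknod) with the registered minor before opening it.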
If it goes into a startup script, then this should be documented somewhere, because a user who doesn't use the startup script will not know to translate "clvmd could not connect to cluster" into needing to run "rm -f /dev/misc/dlm-control".
Quite so. Also consider:
a) Anyone running this "supported" should use our startup scripts.
b) Anyone running this on a reasonably recent OS will have udev mounted over tmpfs, so redundant device nodes don't exist anyway.
c) Anyone who deliberately ignores a provided startup script really ought to know what they are doing (but probably doesn't, IME).
d) device-mapper has exactly the same problem - I've been through this with Debian. Even in that environment, where people like to muck around with their system, a startup script calmed everything down.
To help the udev case I've checked in a rules file, dlm/scripts/51-dlm.rules; it goes into /etc/udev/rules.d.
On Mon, 2004-12-20 at 09:51, Patrick Caulfield wrote:
> #138491 should be against the init scripts but I can't find a component for
> them.

Init scripts should be part of the component, in this case dlm. You can reassign the defect to Adam to fix, however :).

Kevin
Why is this an init.d script bug? This is something that should go in the library. libdlm knows how to create the device node, so it should also know how to detect when the device node has the wrong major/minor numbers.

Additionally, this isn't just a clvmd problem, but a problem for any user-space process using libdlm. It's my understanding that the dlm is a separate entity so that programs other than clvm can use it. As such, a more correct approach should be taken to fixing the problem rather than pushing it off onto every application that may need to use it.

I don't follow Patrick's argument in comment #1. Looking at the code, it seems that the device node check is only done in open_control_device(), which is called from dlm_create_lockspace() and dlm_release_lockspace(). How often are these two functions called? If it's just once at start and stop of the application, the overhead of doing the proper checking is quite insignificant.

I can apply the workaround to the init script as a temporary hack, but I want to understand why we aren't properly fixing the bug before I do. If I do that, we will also then need to document this behavior, as Corey correctly points out in comment #2.

Lastly, if the dlm/scripts/51-dlm.rules file needs to be added to the rpm, a new bug assigned to Chris Feist should be created.
Corey - how did you get differing minor numbers for /dev/misc/dlm-control?
I believe the differing minor numbers were left over from a past running cluster. We'd torn that cluster down and built it back up fresh (nothing changed, though; it was still the same nodes and storage and such).
Created attachment 108924 [details]
check when opening the dlm-control device that its major/minor match the assignments in /proc/misc
I checked the patch from comment #9 into CVS. The update is in /cvs/cluster/cluster/dlm/lib/libdlm.c, revision 1.15.
fix verified.
Hmm, looks like this has popped up again. Using the rpms from the cluster-i686-2005-01-18-1415 build:

[root@tank-02 misc]# ls -l
total 0
crw------- 1 root root 10, 60 Jan 18 16:15 dlm_clvmd
crw------- 1 root root 10, 63 Jan 18 14:58 dlm-control
[root@tank-02 misc]# clvmd
clvmd could not connect to cluster manager
Consult syslog for more information
[root@tank-02 misc]# rm ./dlm-control
rm: remove character special file `./dlm-control'? y
[root@tank-02 misc]# clvmd
[root@tank-02 misc]# ls -l
total 0
crw------- 1 root root 10, 60 Jan 18 16:15 dlm_clvmd
crw------- 1 root root 10, 62 Jan 20 14:47 dlm-control

[root@tank-02 misc]# rpm -qa | grep -i dlm
dlm-kernel-2.6.9-10.0
dlm-devel-1.0-0.pre18.1
dlm-kernheaders-2.6.9-10.0
dlm-1.0-0.pre18.1

from /var/log/messages:
Jan 20 14:38:26 tank-02 kernel: Lock_DLM (built Jan 18 2005 14:28:48) installed

[root@tank-02 misc]# rpm -qa | grep -i lvm
lvm2-cluster-2.00.32-1.0.RHEL4
lvm2-2.00.15-2
For future reference, could you also include the output of cat /proc/misc for this bug?
Moving to NEEDINFO. I've scanned the code and am not seeing anything jump out at me as wrong. I've also tried reproducing the bug but have not had success. Before I invest any more time in this, I am going to need:

1. a test script that reliably demonstrates the bug
2. an strace of clvmd when it fails to start
3. cat /proc/devices
4. cat /proc/misc
5. ls -l /dev/dlm*
6. whoami

I will also need to know whether this bug happens only on specific nodes or on all nodes. Was clvmd started as root? (The steps in comment #12 appear to indicate that it was.)
I don't think this has been a problem for a long time.
We've been running without the workaround for this bug for quite a while; closing...