Description of problem: I tried the test case in 157094 (propagate a conf file while a node is down) and then ended up getting the downed node (morph-03) back into the cluster:

[root@morph-03 ~]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    4   M   morph-01
   2    1    4   M   morph-02
   3    1    4   M   morph-04
   4    1    4   M   morph-03

[root@morph-03 ~]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 2
Cluster name: morph-cluster
Cluster ID: 41652
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 4
Expected_votes: 4
Total_votes: 4
Quorum: 3
Active subsystems: 1
Node name: morph-03
Node addresses: 10.15.84.63

[root@morph-03 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 3 1 4]

but now I can't get clvmd to start:

[root@morph-03 ~]# clvmd
clvmd could not connect to cluster manager
Consult syslog for more information

SYSLOG:
Jun 7 07:01:23 morph-03 clvmd: Unable to create lockspace for CLVM: No such file or directory

Another node's services:

[root@morph-01 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[3 1 2 4]
DLM Lock Space:  "clvmd"                             3   3 run       -
[3 1 2]
DLM Lock Space:  "gfs0"                              4   4 run       -
[3 1 2]
DLM Lock Space:  "Magma"                             7   7 run       -
[3 2 1]
GFS Mount Group: "gfs0"                              5   5 run       -
[3 1 2]
User:            "usrm::manager"                     6   6 run       -
[3 1 2]
what's in /proc/misc & /dev/misc ? Either the dlm.ko module isn't loaded or the /dev/misc/dlm-control device hasn't been created. libdlm is supposed to create this on-demand but it may not have been able to for some reason I can't see at the moment.
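A quick way to answer that question is to look for a dlm entry in /proc/misc. The sketch below is a hedged illustration, not part of the original report: it runs the check against the sample /proc/misc contents posted later in this thread (comment #2 of the follow-up), so the decision logic is self-contained.

```shell
# Illustrative diagnostic (assumed logic, not from the report): decide
# from /proc/misc text whether the dlm misc device is registered.
# Sample contents are the ones pasted in the follow-up comment.
proc_misc='183 hw_random
 63 device-mapper
135 rtc
227 mcelog'

if printf '%s\n' "$proc_misc" | grep -qw dlm; then
    echo "dlm misc device registered"
else
    echo "dlm misc device missing: dlm.ko is probably not loaded"
fi
```

On a live node you would run `grep -w dlm /proc/misc` directly; no dlm line means the module never registered, which matches the "No such file or directory" error from clvmd.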
I saw this again after hitting 159877. The dlm module did not get loaded like it should have. I loaded it by hand and then it worked just fine. There was nothing in /dev/misc, and /proc/misc contained:

[root@link-01 ~]# cat /proc/misc
183 hw_random
 63 device-mapper
135 rtc
227 mcelog
So this needs reassigning to whoever is responsible for the init scripts or the GUI (not sure which)?
clvmd will not load the dlm module; that is cman's job. Looking at the output you've provided in bug #159877, cman failed to start, which will prevent dlm.ko from getting loaded. Without cman properly working, how do you expect clvmd to work, even with dlm.ko loaded? Sounds like it's NOTABUG to me.
I'm not sure where in 159877 it mentions cman not starting. Cman was indeed started; I (by hand or with a script) would never try to start clvmd without first starting a lock manager, as it's required. In the original report, I posted the contents of /proc/cluster/nodes, status, and services, which show that cman was running and every node was in the cluster. I didn't post statuses for the second time I saw this since they were the same as the first. Again, after modprobing dlm by hand on the node where I saw this issue, clvmd started just fine, since the cluster was already up.
(In reply to comment #5)
> I'm not sure where in 159877 it mentions cman not starting.

Oops, you're right. I made a typo entering the bug number; bug #157094 demonstrates that the cman script failed to start.

> Cman was indeed started

But according to bug #157094, it failed.

> I (by hand or with a script) would never try to
> start clvmd without first starting a lock manager, as it's required.

Did you verify that it was running properly?
(In reply to comment #6)
> But according to bug #157094, it failed

Correct, it did fail, but at the end of that comment in that bug, I mention that I then tried it by hand and got it working. :)

> Did you verify that it was running properly?

Indeed.
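For reference, that verification can be scripted. The sketch below is a hedged illustration (not something from this report): it checks the "Membership state" line of /proc/cluster/status, using the sample text from the original description so the example is self-contained.

```shell
# Illustrative pre-flight check (assumed logic): only start clvmd if
# cman reports Cluster-Member state. Sample status text is taken from
# the original report; on a real node you would cat /proc/cluster/status.
status='Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 4
Total_votes: 4
Quorum: 3'

if printf '%s\n' "$status" | grep -q '^Membership state: Cluster-Member$'; then
    echo "cman membership OK, safe to start clvmd"
else
    echo "cman not a cluster member, do not start clvmd"
fi
```

Note that this only proves cman joined the cluster; as this bug shows, dlm.ko can still be missing even when membership looks healthy.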
I guess I just don't understand what your test case is then. Here is what I've been able to discern:

0. You have some nodes with some initial configuration.
1. You killed a node.
2. You changed something in cluster.conf.
3. You propagated the changes, but they didn't make it to the failed node (bug #157094).
4. Your killed node came back online.
5. The killed node was unable to join the cman cluster because its cluster.conf was not at the same revision level. This caused the cman script to fail (which prevents DLM from loading).
6. Because cman failed to successfully join, the dlm module was not loaded and clvmd failed to start, hence this bug report.
7. After the system came up, "everything" was done by hand and things worked.

Based on the above description, this would appear to depend on the REOPENED bug #157094. Once that is fixed, this should go away, since the cman init script will load the dlm module. As such, I'm marking this as DEPENDS ON bug #157094 rather than marking it NOTABUG. In the meantime, I'm placing this in NEEDINFO status until a reproducible test is written that demonstrates this bug ("I then tried everything on morph-01 by hand" just doesn't help me).
I'll try to gather more info for you. The simplest case in which I've seen this is in comment #2. There, no config changes were made at all. I merely had a cluster up and running; all the nodes panicked at the same time (due to 159877) and I power-cycled them. When they were back up, I started ccsd, cman, fenced, and clvmd. On one node clvmd failed to start; I then modprobed dlm, and it did.
For some odd reason, I thought that one of the cluster startup commands also loaded the dlm module; that is not true. Only the init script does a dlm module load (and only if cman joins the cluster first). So as the init script currently stands, Adam is right: without cman starting through the init script, dlm will not get loaded and will have to be loaded by hand. Jon and I discussed moving the dlm module load up next to the cman module load; then it wouldn't be dependent on cman actually joining the cluster, and it would be less confusing for users when clvmd fails.
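The reordering discussed above can be sketched roughly as follows. This is a hedged illustration, not the real cman init script: the function names are stubs (echo instead of modprobe/cman_tool) so the ordering is visible without root or cluster hardware.

```shell
# Illustrative init-script ordering (stubbed; not the actual script):
# load dlm right after the cman module load instead of gating it on a
# successful cluster join.
load_module()  { echo "modprobe $1"; }
join_cluster() { echo "cman_tool join"; }

start() {
    load_module cman || return 1
    # Moved up: dlm now loads regardless of whether the join below
    # succeeds, so a join later finished by hand still leaves clvmd
    # able to start.
    load_module dlm  || return 1
    join_cluster     || return 1
}
start
```

With the old ordering, a failed join short-circuits before the dlm load ever runs, which is exactly the state observed in this bug.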
Reassigning to Adam, as he seems to know what's going on and it doesn't look to be my problem ;-)
Fix verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0556.html