Red Hat Bugzilla – Bug 251179
clvmd startup times out and fails, but continues running
Last modified: 2007-11-16 20:14:59 EST
+++ This bug was initially created as a clone of Bug #238490 +++
Description of problem:
While doing recovery testing with revolver, sometimes clvmd will fail to start
through the init scripts, but clvmd is still running after the test fails to
find active volumes.
Apr 30 12:05:31 morph-05 fenced: startup succeeded
Apr 30 12:05:51 morph-05 clvmd: clvmd startup timed out
Apr 30 12:05:51 morph-05 clvmd: clvmd startup failed
Apr 30 12:05:56 morph-05 clvmd: Cluster LVM daemon started - connected to CMAN
[root@morph-05 ~]# vgs
VG #PV #LV #SN Attr VSize VFree
morph-cluster 3 3 0 wz--nc 712.21G 0
[root@morph-05 ~]# lvs
LV VG Attr LSize Origin Snap% Move Log Copy%
morph-cluster0 morph-cluster -wi--- 237.40G
morph-cluster1 morph-cluster -wi--- 237.40G
morph-cluster2 morph-cluster -wi--- 237.40G
[root@morph-05 ~]# ls /dev/mapper
control VolGroup00-LogVol00 VolGroup00-LogVol01
[root@morph-05 ~]# ps -fC clvmd
UID PID PPID C STIME TTY TIME CMD
root 2824 1 0 12:05 ? 00:00:00 clvmd -T20 -t 90
Version-Release number of selected component (if applicable):
once every 10 iterations.
Steps to Reproduce:
1. form a cluster with GFS and run an I/O load
2. reboot -fin nodes randomly
3. when a node returns, try to remount the file systems
a. clvmd should start up
b. clvmd should not be running if it can't start up
c. clvmd should log a useful error message when it can't start up.
-- Additional comment from firstname.lastname@example.org on 2007-05-01 04:10 EST --
It's more a question of how you want startup to work.
Are you prepared to wait indefinitely for clvmd to get running or
do you want it to timeout and let the startup continue?
It's simply a question of setting the right timeout surely.
OK, there are issues of why it takes this long to get going, but if the rest of
the cluster is very busy then that's a lot of network and CPU traffic for the
(lower priority) joining protocols to get through.
-- Additional comment from email@example.com on 2007-05-01 10:00 EST --
I think there is more here than just picking out the right timeout values.
We've never been able to get useful debug output or tracing from clvmd. It's a
black box that often fails and never helps to answer why. Why is clvmd still
running after it reported that it failed to start up? Why didn't it activate
clustered volume groups after it did connect to the cluster?
-- Additional comment from firstname.lastname@example.org on 2007-05-01 10:28 EST --
I don't see why it's any more of a black box than any other binary? Starting it
with -d will add debugging output from it and stop it forking, that's no secret.
What happens when you start clvmd is that the parent process forks the 'real'
clvmd and waits for it to acknowledge that it has started up: connected to the
cluster manager and created its lockspace (note, NOT activated logical volumes).
If this takes too long, the parent process can time out but the daemon will
carry on. If errors occur here then you will get output from clvmd saying what
they were. If the parent process exits non-zero then the volumes will not be
activated ... that's part of the init script and perfectly transparent.
The reason clvmd starts up in this "odd" way is to try to provde MORE feedback
to the console, and to allow some synchronisation so that no attempt to activate
the volumes happens when clvmd has failed to start up. If the daemon just
started up without then then it could fail and no-one would know (unless you
inspected syslog) and the script would be littered with "sleep" and check
commands to see if it had started up. And even so it would have the same effect
as what's there now if you view the init script as a black box - it would either
wait indefinitely or it would timeout.
So what happpened here is that clvmd took a long time to connect to the cluster
or create the lockspace so that the parent timed-out. Because the parent
timed-out the script did not activate the logical volumes. clvmd itself
continued to run and probably got going eventually.
-- Additional comment from email@example.com on 2007-05-01 11:01 EST --
By making the debug mode not fork it is really hard to run the debug version in
the same environment where we hit the bug. Whenever we have used the debug
option the output hasn't been any more useful. Can we separate the debug and
don't fork options?
So if the parent process times out, why doesn't it kill the "real" process?
It's counter-intuitive to me that it logs a failure yet continues to run.
Can the debug option add more output while it is trying to connect to the
cluster? Right now I don't know if clvmd is waiting for a join response or
creating the lock space or whatever.
-- Additional comment from firstname.lastname@example.org on 2007-05-01 11:17 EST --
Not forking when the debug flag is pretty standard daemon behaviour, all the
other cluster suite daemons do it, as do a great many others.
I suppose it could kill the parent process, but I'm not really in favour of
killing a process that's doing useful work. If the script wants to do that then
that's fair enough but I don't think it belongs in the binary.
Adding more debugging output is fair enough - though to be honest I think you'll
find that 99% of the time it's waiting for the lockspace to be created.
-- Additional comment from email@example.com on 2007-05-02 04:25 EST --
I've added some more debug logging to clvmd at startup so you can see where it's
waiting. It's in CVS but don't expect it to turn up in a package near you any
time soon though!
-- Additional comment from firstname.lastname@example.org on 2007-05-07 15:58 EST --
I'm running with a modified clvmd. The modified version always writes debug
statements to /tmp/clvmd.log. The delay seems to be DLM initialisation.
CLVMD[b7f318e0]: May 7 14:32:40 CMAN initialisation complete
CLVMD[b7f318e0]: May 7 14:33:00 DLM initialisation complete
I looked back through my failed run logs and every time startup fails, we're
coming back from shooting three out of five nodes. So the node that fails clvmd
startup is probably the first one to come up before the cluster regains quorum.
I don't know what the startup scripts can do to handle this. Perhaps we'll have
to put something in our tests to activate volume groups after recovery is verified.
-- Additional comment from email@example.com on 2007-05-08 11:00 EST --
If this is quorum-related, is it worth making clvmd (perhaps optionally) ignore
the timeout and wait for quorum instead?
Another solution is to put something that /does/ wait for quorum ahead of clvmd
in the startup sequence.
-- Additional comment from firstname.lastname@example.org on 2007-05-08 15:32 EST --
Doesn't the cman startup script wait for quroum? I does seem odd to me that
that could be the reason.
Here is a larger chunk of /var/log/messages from a failing case.
May 2 14:16:08 morph-05 kernel: CMAN: Waiting to join or form a Linux-cluster
May 2 14:16:08 morph-05 ccsd: Connected to cluster infrastruture via: CMA
N/SM Plugin v126.96.36.199
May 2 14:16:08 morph-05 ccsd: Initial status:: Inquorate
May 2 14:16:10 morph-05 kernel: CMAN: sending membership request
May 2 14:16:11 morph-05 kernel: CMAN: got node morph-04
May 2 14:16:11 morph-05 kernel: CMAN: got node morph-02
May 2 14:16:11 morph-05 kernel: CMAN: got node morph-01
May 2 14:16:11 morph-05 kernel: CMAN: got node morph-03
May 2 14:16:11 morph-05 kernel: CMAN: quorum regained, resuming activity
May 2 14:16:11 morph-05 ccsd: Cluster is quorate. Allowing connections.
May 2 14:16:11 morph-05 cman: startup succeeded
May 2 14:16:11 morph-05 kernel: dm-cmirror: dm-cmirror 0.2.0 (built Apr 26 2007
May 2 14:16:11 morph-05 cmirror: startup succeeded
May 2 14:16:11 morph-05 lock_gulmd: no <gulm> section detected in /etc/cluster/
May 2 14:16:26 morph-05 fenced: startup succeeded
May 2 14:16:46 morph-05 clvmd: clvmd startup timed out
May 2 14:16:46 morph-05 clvmd: clvmd startup failed
Are there other quorums that we need to wait for?
Dave, do you have any ideas?
-- Additional comment from email@example.com on 2007-07-31 09:15 EST --
Created an attachment (id=160316)
group_tool dump after the failure
I got this problem revolver forced recovery. Additional info:
[root@roth-03 ~]# group_tool -v
type level name id state node id local_done
fence 0 default 00010002 none
[1 2 3]
dlm 1 clvmd 00010003 none
[1 2 3]
[root@roth-03 ~]# ps ax -o pid,cmd,wchan | grep clvmd
2245 clvmd -T20 stext
An strace of clvmd showed this:
open("/dev/misc/dlm-control", O_RDWR) = 6
fcntl(6, F_SETFD, FD_CLOEXEC) = 0
read(6, "\6\0\0\0\0\0\0\0\0\0\0\0", 12) = 12
write(6, "\6\0\0\0\0\0\0\0\0\0\0\0\4\1\0\0\0\0\0\0\0\0\0\0clvmd\0"..., 125
Which, according to Patrick, means create_lockspace didn't respond.
The funny thing is that the clvmd lock space apparently already existed.
-- Additional comment from firstname.lastname@example.org on 2007-07-31 09:18 EST --
Created an attachment (id=160317)
sysrq-t at the time of failure.
-- Additional comment from email@example.com on 2007-08-01 09:53 EST --
Created an attachment (id=160411)
Patch to fix (I hope)
This seems to be to do with the DLM 'othercon' thing in lowcomms. The don't get
freed up (and pointers NULLed) as they should do. A previous patch addressed
this to some extent but missed some important areas.
Can you test this for me please bob? - it works on my systems. I've left some
debugging in there in case it still misbehaves.
-- Additional comment from firstname.lastname@example.org on 2007-08-02 09:41 EST --
The previously attached patch seems to work well.
-- Additional comment from email@example.com on 2007-08-02 09:59 EST --
OK, I've sent the patch to Steve.
-- Additional comment from firstname.lastname@example.org on 2007-08-03 09:10 EST --
Hmm, just spotted that all of Nate's comments refer to RHEL-4, whereas Bob's
(and my patch) are for RHEL-5.
-- Additional comment from email@example.com on 2007-08-07 12:15 EST --
Whatever, I think this patch needs to go into RHEL5.1. Can we have some acks ?
This is a clone to get ACKS for 5.1
Based on what I've seen in bz# 251562 there is a better (read simpler) fix
available for this. It needs a little more testing on RHEL4 then porting into
the RHEL5 lowcomms.
This issue will likely cause GFS2 recovery testing to fail, which will not allow us to meet our GA release
criteria. Marking blocker so we can get this fix into a build.
You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.