Bug 251179

Summary: clvmd startup times out and fails, but continues running
Product: Red Hat Enterprise Linux 5
Reporter: Christine Caulfield <ccaulfie>
Component: kernel
Assignee: Don Zickus <dzickus>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: medium
Version: 5.1
CC: agk, ccaulfie, dwysocha, jbrassow, mbroz, prockai, rmccabe, rpeterso, teigland
Hardware: All
OS: Linux
Fixed In Version: RHBA-2007-0959
Doc Type: Bug Fix
Last Closed: 2007-11-07 19:58:14 UTC
Bug Depends On: 238490

Description Christine Caulfield 2007-08-07 16:23:02 UTC
+++ This bug was initially created as a clone of Bug #238490 +++

Description of problem:

While doing recovery testing with revolver, clvmd sometimes fails to start
through the init scripts, yet the daemon is still running when the test later
fails to find any active volumes.

Apr 30 12:05:31 morph-05 fenced: startup succeeded
Apr 30 12:05:51 morph-05 clvmd: clvmd startup timed out
Apr 30 12:05:51 morph-05 clvmd: clvmd startup failed
...
Apr 30 12:05:56 morph-05 clvmd: Cluster LVM daemon started - connected to CMAN

[root@morph-05 ~]# vgs
  VG            #PV #LV #SN Attr   VSize   VFree
  morph-cluster   3   3   0 wz--nc 712.21G    0
[root@morph-05 ~]# lvs
  LV             VG            Attr   LSize   Origin Snap%  Move Log Copy%
  morph-cluster0 morph-cluster -wi--- 237.40G
  morph-cluster1 morph-cluster -wi--- 237.40G
  morph-cluster2 morph-cluster -wi--- 237.40G
[root@morph-05 ~]# ls /dev/mapper
control  VolGroup00-LogVol00  VolGroup00-LogVol01

[root@morph-05 ~]# ps -fC clvmd
UID        PID  PPID  C STIME TTY          TIME CMD
root      2824     1  0 12:05 ?        00:00:00 clvmd -T20 -t 90

Version-Release number of selected component (if applicable):
lvm2-cluster-2.02.21-7.el4

How reproducible:
once every 10 iterations.

Steps to Reproduce:
1. form a cluster with GFS and run an I/O load
2. reboot -fin nodes randomly
3. when a node returns, try to remount the file systems
  
Actual results:
see above

Expected results:
a. clvmd should start up
b. clvmd should not be running if it can't start up
c. clvmd should log a useful error message when it can't start up.

Additional info:

-- Additional comment from pcaulfie on 2007-05-01 04:10 EST --
It's more a question of how you want startup to work.

Are you prepared to wait indefinitely for clvmd to get running or
do you want it to timeout and let the startup continue?

It's simply a question of setting the right timeout surely.

OK, there is the question of why it takes this long to get going, but if the rest of
the cluster is very busy then that's a lot of network and CPU traffic for the
(lower-priority) joining protocols to get through.



-- Additional comment from nstraz on 2007-05-01 10:00 EST --
I think there is more here than just picking out the right timeout values. 
We've never been able to get useful debug output or tracing from clvmd.  It's a
black box that often fails and never helps to answer why.  Why is clvmd still
running after it reported that it failed to start up?  Why didn't it activate
clustered volume groups after it did connect to the cluster?  

-- Additional comment from pcaulfie on 2007-05-01 10:28 EST --
I don't see why it's any more of a black box than any other binary. Starting it
with -d will add debugging output from it and stop it forking; that's no secret.

What happens when you start clvmd is that the parent process forks the 'real'
clvmd and waits for it to acknowledge that it has started up: connected to the
cluster manager and created its lockspace (note, NOT activated logical volumes).
If this takes too long, the parent process can time out but the daemon will
carry on. If errors occur here then you will get output from clvmd saying what
they were. If the parent process exits non-zero then the volumes will not be
activated ... that's part of the init script and perfectly transparent.
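
For readers unfamiliar with that startup sequence, the handshake described above
looks roughly like this (a minimal userspace sketch, not the actual clvmd source;
the pipe-based acknowledgement and the hard-coded 20-second timeout are
illustrative assumptions):

/* Minimal sketch of the parent/child startup handshake described above.
 * This is NOT the real clvmd source: the pipe-based acknowledgement and
 * the hard-coded 20-second timeout are illustrative assumptions. */
#include <stdio.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

static int daemon_init(void)
{
    /* Stand-in for "connect to the cluster manager and create the
     * lockspace"; on a busy cluster this is the part that can stall. */
    sleep(5);
    return 0;
}

int main(void)
{
    int ack[2];
    pid_t pid;
    fd_set fds;
    struct timeval tv = { .tv_sec = 20, .tv_usec = 0 };
    char c;

    if (pipe(ack) < 0 || (pid = fork()) < 0)
        return 1;

    if (pid == 0) {
        /* Child: the "real" daemon.  Do the slow initialisation, then
         * tell the parent we are ready by writing one byte.  The daemon
         * carries on afterwards whether or not the parent waited. */
        close(ack[0]);
        if (daemon_init() == 0)
            (void)write(ack[1], "1", 1);
        close(ack[1]);
        pause();            /* daemon main loop would live here */
        return 0;
    }

    /* Parent: wait a bounded time for the acknowledgement and report the
     * result through the exit status, which is what the init script uses
     * to decide whether to go on and activate logical volumes. */
    close(ack[1]);
    FD_ZERO(&fds);
    FD_SET(ack[0], &fds);

    if (select(ack[0] + 1, &fds, NULL, NULL, &tv) == 1 &&
        read(ack[0], &c, 1) == 1) {
        printf("daemon started\n");
        return 0;
    }

    fprintf(stderr, "daemon startup timed out\n");
    return 1;               /* the daemon itself may still be running */
}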

The reason clvmd starts up in this "odd" way is to try to provide MORE feedback
to the console, and to allow some synchronisation so that no attempt to activate
the volumes happens when clvmd has failed to start up. If the daemon just
started up without that synchronisation then it could fail and no-one would know
(unless you inspected syslog), and the script would be littered with "sleep" and
check commands to see if it had started up. And even so it would have the same
effect as what's there now if you view the init script as a black box - it would
either wait indefinitely or it would time out.

So what happened here is that clvmd took a long time to connect to the cluster
or create the lockspace, so the parent timed out. Because the parent timed out,
the script did not activate the logical volumes. clvmd itself continued to run
and probably got going eventually.

-- Additional comment from nstraz on 2007-05-01 11:01 EST --
Because the debug mode doesn't fork, it is really hard to run the debug version
in the same environment where we hit the bug.  Whenever we have used the debug
option the output hasn't been any more useful.  Can we separate the debug and
don't-fork options?

So if the parent process times out, why doesn't it kill the "real" process? 
It's counter-intuitive to me that it logs a failure yet continues to run.

Can the debug option add more output while it is trying to connect to the
cluster?  Right now I don't know if clvmd is waiting for a join response or
creating the lock space or whatever.  

-- Additional comment from pcaulfie on 2007-05-01 11:17 EST --
Not forking when the debug flag is set is pretty standard daemon behaviour; all
the other cluster suite daemons do it, as do a great many others.

I suppose the parent could kill the daemon process, but I'm not really in favour
of killing a process that's doing useful work. If the script wants to do that
then that's fair enough, but I don't think it belongs in the binary.

Adding more debugging output is fair enough - though to be honest I think you'll
find that 99% of the time it's waiting for the lockspace to be created.

-- Additional comment from pcaulfie on 2007-05-02 04:25 EST --
I've added some more debug logging to clvmd at startup so you can see where it's
waiting. It's in CVS but don't expect it to turn up in a package near you any
time soon though!

-- Additional comment from nstraz on 2007-05-07 15:58 EST --
I'm running with a modified clvmd.  The modified version always writes debug
statements to /tmp/clvmd.log.  The delay seems to be DLM initialisation.

CLVMD[b7f318e0]: May  7 14:32:40 CMAN initialisation complete
CLVMD[b7f318e0]: May  7 14:33:00 DLM initialisation complete

I looked back through my failed run logs and every time startup fails, we're
coming back from shooting three out of five nodes.  So the node that fails clvmd
startup is probably the first one to come up before the cluster regains quorum.

I don't know what the startup scripts can do to handle this.  Perhaps we'll have
to put something in our tests to activate volume groups after recovery is verified.


-- Additional comment from pcaulfie on 2007-05-08 11:00 EST --
If this is quorum-related, is it worth making clvmd (perhaps optionally) ignore
the timeout and wait for quorum instead?

Another solution is to put something that /does/ wait for quorum ahead of clvmd
in the startup sequence.
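
A tiny standalone helper along those lines could be slotted in ahead of clvmd.
A sketch, assuming the libcman API of the time (cman_init()/cman_is_quorate(),
link with -lcman) and an arbitrary 60-second default timeout:

/* wait-for-quorum sketch: exit 0 once the cluster is quorate, non-zero
 * on timeout.  Illustrative only; assumes libcman is available. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <libcman.h>

int main(int argc, char **argv)
{
    int timeout = (argc > 1) ? atoi(argv[1]) : 60;   /* seconds */
    cman_handle_t ch = cman_init(NULL);

    if (!ch) {
        perror("cman_init");
        return 1;
    }

    while (timeout-- > 0) {
        if (cman_is_quorate(ch)) {
            cman_finish(ch);
            return 0;        /* quorate: safe to start clvmd */
        }
        sleep(1);
    }

    cman_finish(ch);
    fprintf(stderr, "timed out waiting for quorum\n");
    return 1;
}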

-- Additional comment from nstraz on 2007-05-08 15:32 EST --
Doesn't the cman startup script wait for quorum?  It does seem odd to me that
that could be the reason.

Here is a larger chunk of /var/log/messages from a failing case.

May  2 14:16:08 morph-05 kernel: CMAN: Waiting to join or form a Linux-cluster
May  2 14:16:08 morph-05 ccsd[2713]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.7.4
May  2 14:16:08 morph-05 ccsd[2713]: Initial status:: Inquorate 
May  2 14:16:10 morph-05 kernel: CMAN: sending membership request
May  2 14:16:11 morph-05 kernel: CMAN: got node morph-04
May  2 14:16:11 morph-05 kernel: CMAN: got node morph-02
May  2 14:16:11 morph-05 kernel: CMAN: got node morph-01
May  2 14:16:11 morph-05 kernel: CMAN: got node morph-03
May  2 14:16:11 morph-05 kernel: CMAN: quorum regained, resuming activity
May  2 14:16:11 morph-05 ccsd[2713]: Cluster is quorate.  Allowing connections. 
May  2 14:16:11 morph-05 cman: startup succeeded
May  2 14:16:11 morph-05 kernel: dm-cmirror: dm-cmirror 0.2.0 (built Apr 26 2007 16:33:46) installed
May  2 14:16:11 morph-05 cmirror: startup succeeded
May  2 14:16:11 morph-05 lock_gulmd: no <gulm> section detected in /etc/cluster/cluster.conf succeeded
May  2 14:16:26 morph-05 fenced: startup succeeded
May  2 14:16:46 morph-05 clvmd: clvmd startup timed out
May  2 14:16:46 morph-05 clvmd: clvmd startup failed

Are there other quorums that we need to wait for?

Dave, do you have any ideas?

-- Additional comment from rpeterso on 2007-07-31 09:15 EST --
Created an attachment (id=160316)
group_tool dump after the failure

I got this problem when revolver forced recovery.  Additional info:

[root@roth-03 ~]# group_tool -v
type		 level name	id	 state node id local_done
fence		 0     default	00010002 none	     
[1 2 3]
dlm		 1     clvmd	00010003 none	     
[1 2 3]
[root@roth-03 ~]# ps ax -o pid,cmd,wchan | grep clvmd
 2245 clvmd -T20		  stext

An strace of clvmd showed this:
open("/dev/misc/dlm-control", O_RDWR)	= 6
fcntl(6, F_SETFD, FD_CLOEXEC)		= 0
read(6, "\6\0\0\0\0\0\0\0\0\0\0\0", 12) = 12
write(6, "\6\0\0\0\0\0\0\0\0\0\0\0\4\1\0\0\0\0\0\0\0\0\0\0clvmd\0"..., 125
<unfinished ...>

Which, according to Patrick, means create_lockspace didn't respond.
The funny thing is that the clvmd lock space apparently already existed.
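
For reference, that blocked write corresponds to the lockspace call at the
libdlm level, roughly as follows (an illustrative sketch, not clvmd's actual
locking module; the fallback to dlm_open_lockspace() is an assumption; link
with -ldlm):

/* Sketch of creating/opening the "clvmd" lockspace via libdlm.  The
 * dlm_create_lockspace() call writes the request to the dlm-control
 * device (the write seen in the strace) and does not return until the
 * kernel has finished joining the lockspace, which is where clvmd was
 * stuck here. */
#include <stdio.h>
#include <libdlm.h>

int main(void)
{
    dlm_lshandle_t ls;

    ls = dlm_create_lockspace("clvmd", 0600);   /* blocks in the kernel */
    if (!ls)
        ls = dlm_open_lockspace("clvmd");       /* lockspace may already exist */
    if (!ls) {
        perror("clvmd lockspace");
        return 1;
    }

    printf("lockspace ready\n");
    dlm_release_lockspace("clvmd", ls, 0);
    return 0;
}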


-- Additional comment from rpeterso on 2007-07-31 09:18 EST --
Created an attachment (id=160317)
sysrq-t at the time of failure.


-- Additional comment from pcaulfie on 2007-08-01 09:53 EST --
Created an attachment (id=160411)
Patch to fix (I hope)

This seems to be to do with the DLM 'othercon' thing in lowcomms. They don't get
freed up (and their pointers NULLed) as they should do. A previous patch
addressed this to some extent but missed some important areas.

Can you test this for me please, Bob? It works on my systems. I've left some
debugging in there in case it still misbehaves.
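
For anyone reading along without the lowcomms source to hand, the shape of the
problem is roughly this (a heavily simplified userspace sketch, not the attached
patch; 'othercon' is the real field name in lowcomms, everything else here is
made up for illustration):

/* Sketch of the 'othercon' cleanup problem described above.  Not the
 * attached RHEL5 patch: in the real lowcomms code these are kernel
 * structures and sockets, simplified here to plain fds and malloc. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct connection {
    int fd;                       /* stand-in for the node's socket */
    struct connection *othercon;  /* extra connection created when two
                                     nodes connect to each other at once */
};

static void close_fd(struct connection *con)
{
    if (con->fd >= 0) {
        close(con->fd);
        con->fd = -1;
    }
}

/* Before the fix, only 'con' itself was cleaned up on teardown; the
 * othercon was left allocated and its pointer dangling, tripping up the
 * next lockspace join after recovery. */
static void free_connection(struct connection *con)
{
    if (con->othercon) {
        close_fd(con->othercon);
        free(con->othercon);
        con->othercon = NULL;
    }
    close_fd(con);
    free(con);
}

int main(void)
{
    struct connection *con = calloc(1, sizeof(*con));
    con->fd = -1;
    con->othercon = calloc(1, sizeof(*con->othercon));
    con->othercon->fd = -1;

    free_connection(con);   /* no leak, no stale othercon pointer */
    printf("connection torn down cleanly\n");
    return 0;
}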

-- Additional comment from rpeterso on 2007-08-02 09:41 EST --
The previously attached patch seems to work well.


-- Additional comment from pcaulfie on 2007-08-02 09:59 EST --
OK, I've sent the patch to Steve.

-- Additional comment from pcaulfie on 2007-08-03 09:10 EST --
Hmm, just spotted that all of Nate's comments refer to RHEL-4, whereas Bob's
(and my patch) are for RHEL-5.

-- Additional comment from pcaulfie on 2007-08-07 12:15 EST --
Whatever, I think this patch needs to go into RHEL5.1. Can we have some acks?

Comment 1 Christine Caulfield 2007-08-07 16:23:41 UTC
This is a clone to get ACKS for 5.1

Comment 2 Christine Caulfield 2007-08-13 09:18:12 UTC
Based on what I've seen in bz#251562 there is a better (read: simpler) fix
available for this. It needs a little more testing on RHEL4, then porting into
the RHEL5 lowcomms.

Comment 3 Dean Jansa 2007-08-13 14:42:38 UTC
This issue will likely cause GFS2 recovery testing to fail, which will not allow us to meet our GA release 
criteria.  Marking blocker so we can get this fix into a build.

Comment 4 Don Zickus 2007-08-21 18:35:34 UTC
in 2.6.18-42.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 8 errata-xmlrpc 2007-11-07 19:58:14 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html