Bug 238490

Summary: clvmd startup times out and fails, but continues running
Product: [Retired] Red Hat Cluster Suite
Component: dlm-kernel
Version: 4
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Nate Straz <nstraz>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: Cluster QE <mspqa-list>
CC: agk, ccaulfie, cluster-maint, djansa, dwysocha, jbrassow, mbroz, prockai, rpeterso, teigland
Keywords: ZStream
Fixed In Version: RHBA-2007-0995
Doc Type: Bug Fix
Last Closed: 2007-11-21 21:55:42 UTC
Bug Blocks: 251179, 267641

Attachments:
- group_tool dump after the failure
- sysrq-t at the time of failure
- Patch to fix (I hope)
- New Patch
- RHEL-5 patch
- RHEL-4 patch

Description Nate Straz 2007-04-30 20:48:18 UTC
Description of problem:

While doing recovery testing with revolver, clvmd sometimes fails to start
through the init scripts, yet the daemon is still running after the test fails
to find active volumes.

Apr 30 12:05:31 morph-05 fenced: startup succeeded
Apr 30 12:05:51 morph-05 clvmd: clvmd startup timed out
Apr 30 12:05:51 morph-05 clvmd: clvmd startup failed
...
Apr 30 12:05:56 morph-05 clvmd: Cluster LVM daemon started - connected to CMAN

[root@morph-05 ~]# vgs
  VG            #PV #LV #SN Attr   VSize   VFree
  morph-cluster   3   3   0 wz--nc 712.21G    0
[root@morph-05 ~]# lvs
  LV             VG            Attr   LSize   Origin Snap%  Move Log Copy%
  morph-cluster0 morph-cluster -wi--- 237.40G
  morph-cluster1 morph-cluster -wi--- 237.40G
  morph-cluster2 morph-cluster -wi--- 237.40G
[root@morph-05 ~]# ls /dev/mapper
control  VolGroup00-LogVol00  VolGroup00-LogVol01

[root@morph-05 ~]# ps -fC clvmd
UID        PID  PPID  C STIME TTY          TIME CMD
root      2824     1  0 12:05 ?        00:00:00 clvmd -T20 -t 90

Version-Release number of selected component (if applicable):
lvm2-cluster-2.02.21-7.el4

How reproducible:
Once every 10 iterations.

Steps to Reproduce:
1. form a cluster with GFS and run an I/O load
2. reboot -fin nodes randomly
3. when a node returns, try to remount the file systems
  
Actual results:
see above

Expected results:
a. clvmd should start up
b. clvmd should not be running if it can't start up
c. clvmd should log a useful error message when it can't start up.

Additional info:

Comment 1 Christine Caulfield 2007-05-01 08:10:26 UTC
It's more a question of how you want startup to work.

Are you prepared to wait indefinitely for clvmd to get running, or do you want
it to time out and let the startup continue?

Surely it's simply a question of setting the right timeout.

OK, there is the question of why it takes this long to get going, but if the
rest of the cluster is very busy then that's a lot of network and CPU traffic
for the (lower-priority) joining protocols to get through.

Comment 2 Nate Straz 2007-05-01 14:00:32 UTC
I think there is more here than just picking out the right timeout values. 
We've never been able to get useful debug output or tracing from clvmd.  It's a
black box that often fails and never helps to answer why.  Why is clvmd still
running after it reported that it failed to start up?  Why didn't it activate
clustered volume groups after it did connect to the cluster?  

Comment 3 Christine Caulfield 2007-05-01 14:28:49 UTC
I don't see why it's any more of a black box than any other binary. Starting it
with -d adds debugging output and stops it forking; that's no secret.

What happens when you start clvmd is that the parent process forks the 'real'
clvmd and waits for it to acknowledge that it has started up: connected to the
cluster manager and created its lockspace (note, NOT activated logical volumes).
If this takes too long, the parent process can time out but the daemon will
carry on. If errors occur here then you will get output from clvmd saying what
they were. If the parent process exits non-zero then the volumes will not be
activated ... that's part of the init script and perfectly transparent.
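
As an illustration of that pattern (a minimal sketch with hypothetical names,
not clvmd's actual source): the parent forks the daemon, waits on a pipe for an
acknowledgement byte, and exits non-zero if none arrives in time:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/select.h>

int main(void)
{
    int fds[2];
    if (pipe(fds) < 0)
        exit(1);

    if (fork() == 0) {                /* the "real" daemon */
        close(fds[0]);
        /* ... connect to the cluster manager, create the lockspace ... */
        write(fds[1], "1", 1);        /* acknowledge successful startup */
        /* ... daemon main loop runs on regardless of the parent ... */
        exit(0);
    }

    close(fds[1]);
    fd_set set;
    struct timeval tv = { .tv_sec = 20 };   /* cf. clvmd's -T20 */
    FD_ZERO(&set);
    FD_SET(fds[0], &set);

    /* If no acknowledgement arrives in time, report failure; the
       child is left running, which is the behaviour seen here. */
    if (select(fds[0] + 1, &set, NULL, NULL, &tv) <= 0) {
        fprintf(stderr, "clvmd startup timed out\n");
        exit(1);
    }
    return 0;                         /* startup acknowledged */
}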

The reason clvmd starts up in this "odd" way is to try to provide MORE feedback
to the console, and to allow some synchronisation so that no attempt is made to
activate the volumes when clvmd has failed to start up. If the daemon just
started up on its own, it could fail and no-one would know (unless you
inspected syslog), and the script would be littered with "sleep" and check
commands to see whether it had started up. And even then it would have the same
effect as what's there now if you view the init script as a black box: it would
either wait indefinitely or it would time out.

So what happened here is that clvmd took so long to connect to the cluster or
create the lockspace that the parent timed out. Because the parent timed out,
the script did not activate the logical volumes. clvmd itself continued to run
and probably got going eventually.

Comment 4 Nate Straz 2007-05-01 15:01:31 UTC
Because debug mode doesn't fork, it is really hard to run the debug version in
the same environment where we hit the bug. Whenever we have used the debug
option, the output hasn't been any more useful. Can we separate the debug and
don't-fork options?

So if the parent process times out, why doesn't it kill the "real" process? 
It's counter-intuitive to me that it logs a failure yet continues to run.

Can the debug option add more output while it is trying to connect to the
cluster? Right now I don't know whether clvmd is waiting for a join response,
creating the lockspace, or something else.

Comment 5 Christine Caulfield 2007-05-01 15:17:03 UTC
Not forking when the debug flag is set is pretty standard daemon behaviour; all
the other cluster suite daemons do it, as do a great many others.

I suppose the parent could kill the daemon process, but I'm not really in
favour of killing a process that's doing useful work. If the script wants to do
that then that's fair enough, but I don't think it belongs in the binary.

Adding more debugging output is fair enough - though to be honest I think you'll
find that 99% of the time it's waiting for the lockspace to be created.
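
For reference, that step corresponds to a libdlm call along these lines (a
sketch assuming libdlm's dlm_create_lockspace/dlm_release_lockspace API; link
with -ldlm):

#include <stdio.h>
#include <libdlm.h>

int main(void)
{
    /* This call blocks until the kernel DLM has created (or joined)
       the lockspace; while the cluster is still recovering it can
       take a long time, which is where the startup timeout bites. */
    dlm_lshandle_t ls = dlm_create_lockspace("clvmd", 0600);
    if (!ls) {
        perror("dlm_create_lockspace");
        return 1;
    }
    dlm_release_lockspace("clvmd", ls, 1);  /* tidy up for the demo */
    return 0;
}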

Comment 6 Christine Caulfield 2007-05-02 08:25:21 UTC
I've added some more debug logging to clvmd at startup so you can see where
it's waiting. It's in CVS, but don't expect it to turn up in a package near you
any time soon!

Comment 7 Nate Straz 2007-05-07 19:58:36 UTC
I'm running with a modified clvmd.  The modified version always writes debug
statements to /tmp/clvmd.log.  The delay seems to be DLM initialisation.

CLVMD[b7f318e0]: May  7 14:32:40 CMAN initialisation complete
CLVMD[b7f318e0]: May  7 14:33:00 DLM initialisation complete

I looked back through my failed run logs and every time startup fails, we're
coming back from shooting three out of five nodes.  So the node that fails clvmd
startup is probably the first one to come up before the cluster regains quorum.

I don't know what the startup scripts can do to handle this.  Perhaps we'll have
to put something in our tests to activate volume groups after recovery is verified.


Comment 8 Christine Caulfield 2007-05-08 15:00:19 UTC
If this is quorum-related, is it worth making clvmd (perhaps optionally) ignore
the timeout and wait for quorum instead?

Another solution is to put something that /does/ wait for quorum ahead of clvmd
in the startup sequence.
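
One hypothetical shape for such a helper, sketched against the libcman API
(the polling loop and one-second interval are illustrative; link with -lcman):

#include <stdio.h>
#include <unistd.h>
#include <libcman.h>

int main(void)
{
    cman_handle_t h = cman_init(NULL);
    if (!h) {
        perror("cman_init");
        return 1;
    }

    /* Block until the cluster regains quorum so that services
       started after this point (clvmd among them) don't hit
       their startup timeouts during recovery. */
    while (!cman_is_quorate(h))
        sleep(1);

    cman_finish(h);
    return 0;
}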

Comment 9 Nate Straz 2007-05-08 19:32:01 UTC
Doesn't the cman startup script wait for quorum? It does seem odd to me that
that could be the reason.

Here is a larger chunk of /var/log/messages from a failing case.

May  2 14:16:08 morph-05 kernel: CMAN: Waiting to join or form a Linux-cluster
May  2 14:16:08 morph-05 ccsd[2713]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.7.4
May  2 14:16:08 morph-05 ccsd[2713]: Initial status:: Inquorate
May  2 14:16:10 morph-05 kernel: CMAN: sending membership request
May  2 14:16:11 morph-05 kernel: CMAN: got node morph-04
May  2 14:16:11 morph-05 kernel: CMAN: got node morph-02
May  2 14:16:11 morph-05 kernel: CMAN: got node morph-01
May  2 14:16:11 morph-05 kernel: CMAN: got node morph-03
May  2 14:16:11 morph-05 kernel: CMAN: quorum regained, resuming activity
May  2 14:16:11 morph-05 ccsd[2713]: Cluster is quorate.  Allowing connections.
May  2 14:16:11 morph-05 cman: startup succeeded
May  2 14:16:11 morph-05 kernel: dm-cmirror: dm-cmirror 0.2.0 (built Apr 26 2007 16:33:46) installed
May  2 14:16:11 morph-05 cmirror: startup succeeded
May  2 14:16:11 morph-05 lock_gulmd: no <gulm> section detected in /etc/cluster/cluster.conf succeeded
May  2 14:16:26 morph-05 fenced: startup succeeded
May  2 14:16:46 morph-05 clvmd: clvmd startup timed out
May  2 14:16:46 morph-05 clvmd: clvmd startup failed

Are there other quorums that we need to wait for?

Dave, do you have any ideas?

Comment 10 Robert Peterson 2007-07-31 13:15:00 UTC
Created attachment 160316 [details]
group_tool dump after the failure

I got this problem when revolver forced recovery.  Additional info:

[root@roth-03 ~]# group_tool -v
type		 level name	id	 state node id local_done
fence		 0     default	00010002 none	     
[1 2 3]
dlm		 1     clvmd	00010003 none	     
[1 2 3]
[root@roth-03 ~]# ps ax -o pid,cmd,wchan | grep clvmd
 2245 clvmd -T20		  stext

An strace of clvmd showed this:
open("/dev/misc/dlm-control", O_RDWR)	= 6
fcntl(6, F_SETFD, FD_CLOEXEC)		= 0
read(6, "\6\0\0\0\0\0\0\0\0\0\0\0", 12) = 12
write(6, "\6\0\0\0\0\0\0\0\0\0\0\0\4\1\0\0\0\0\0\0\0\0\0\0clvmd\0"..., 125
<unfinished ...>

Which, according to Patrick, means create_lockspace didn't respond.
The funny thing is that the clvmd lock space apparently already existed.

Comment 11 Robert Peterson 2007-07-31 13:18:13 UTC
Created attachment 160317 [details]
sysrq-t at the time of failure.

Comment 12 Christine Caulfield 2007-08-01 13:53:45 UTC
Created attachment 160411 [details]
Patch to fix (I hope)

This seems to be to do with the DLM 'othercon' structures in lowcomms. They
don't get freed up (and their pointers NULLed) as they should. A previous patch
addressed this to some extent but missed some important areas.

Can you test this for me please, Bob? It works on my systems. I've left some
debugging in there in case it still misbehaves.
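
As a rough userspace model of the kind of cleanup involved (illustrative only;
the real fix is the attached kernel patch, and the names here only loosely
follow lowcomms.c):

#include <stdlib.h>

/* Simplified model of a lowcomms connection and its companion
   "othercon", the second struct used when two nodes both accept
   and initiate connections to each other. */
struct connection {
    int sock;                        /* stand-in for the real socket */
    struct connection *othercon;
};

/* Tear down a connection. The failure mode was of this kind: the
   companion was not freed and its pointer not NULLed, so a later
   reconnect found stale state. */
static void close_connection(struct connection *con)
{
    if (con->othercon) {
        free(con->othercon);
        con->othercon = NULL;        /* clear the pointer as well */
    }
    con->sock = -1;
}

int main(void)
{
    struct connection *con = calloc(1, sizeof(*con));
    con->othercon = calloc(1, sizeof(*con));
    close_connection(con);
    free(con);
    return 0;
}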

Comment 13 Robert Peterson 2007-08-02 13:41:04 UTC
The previously attached patch seems to work well.


Comment 14 Christine Caulfield 2007-08-02 13:59:03 UTC
OK, I've sent the patch to Steve.

Comment 15 Christine Caulfield 2007-08-03 13:10:29 UTC
Hmm, I've just spotted that all of Nate's comments refer to RHEL-4, whereas
Bob's (and my patch) are for RHEL-5.

Comment 16 Christine Caulfield 2007-08-07 16:15:09 UTC
Whatever the case, I think this patch needs to go into RHEL5.1. Can we have some acks?

Comment 17 Christine Caulfield 2007-08-07 16:24:53 UTC
Bug #251179 cloned for 5.1


Comment 18 Christine Caulfield 2007-08-13 10:31:01 UTC
Created attachment 161164 [details]
New Patch

This is a much neater patch that also addresses a problem with the earlier
patch, which caused bz#251562.

Would you test this for me please, Bob?

Comment 19 Christine Caulfield 2007-08-13 10:49:10 UTC
Created attachment 161165 [details]
RHEL-5 patch

Here's a patch that applies to a RHEL-5 kernel.

Comment 20 Christine Caulfield 2007-08-13 14:27:34 UTC
Created attachment 161176 [details]
RHEL-4 patch

Just to complete the set, here's the RHEL-4 patch :-)

Comment 21 Christine Caulfield 2007-08-13 14:39:24 UTC
*** Bug 251562 has been marked as a duplicate of this bug. ***

Comment 22 Kiersten (Kerri) Anderson 2007-08-13 14:55:30 UTC
Setting flags and also requesting this for the 4.5.z release - we need to
respin dlm-kernel for the errata release.

Comment 24 Christine Caulfield 2007-08-13 20:09:54 UTC
On RHEL4:
Checking in lowcomms.c;
/cvs/cluster/cluster/dlm-kernel/src/Attic/lowcomms.c,v  <--  lowcomms.c
new revision: 1.22.2.17; previous revision: 1.22.2.16
done


And on RHEL45:
Checking in lowcomms.c;
/cvs/cluster/cluster/dlm-kernel/src/Attic/lowcomms.c,v  <--  lowcomms.c
new revision: 1.22.2.13.2.3; previous revision: 1.22.2.13.2.2
done


Comment 31 errata-xmlrpc 2007-11-21 21:55:42 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0995.html