Bug 204347 - clvmd can continue to appear as a dlm service after it has been stopped
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: David Teigland
QA Contact: Red Hat Kernel QE team
Reported: 2006-08-28 17:08 UTC by Corey Marthaler
Modified: 2009-09-03 16:51 UTC
CC List: 6 users

Fixed In Version: beta2
Doc Type: Bug Fix
Last Closed: 2006-12-23 00:05:32 UTC


Attachments
daemon logs; nodes 14 & 16 failed to remove the lockspace (150.33 KB, application/x-gzip)
2006-08-29 15:08 UTC, Christine Caulfield
dmesg logs & sysfs dlm files (9.45 KB, application/x-gzip)
2006-08-29 15:46 UTC, Christine Caulfield

Description Corey Marthaler 2006-08-28 17:08:53 UTC
Description of problem:
Ran 'service clvmd stop'; even though it reported that clvmd was stopped, it
wasn't.
Deactivating VG vg:   0 logical volume(s) in volume group "vg" now active
                                                           [  OK  ]
Stopping clvm:                                             [  OK  ]

type             level name     id       state
fence            0     default  00010004 none
[1 2 3 4 5]
dlm              1     clvmd    00020004 none
[3 4]

This will then cause the cman init script to fail as well, since the clvmd
service is still registered.
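As a sketch of a pre-flight check an init script could make before 'service cman stop', the listing from cman_tool services can be grepped for a leftover clvmd lockspace. The output below is simulated with a heredoc (copied from the listing above) so the parsing can be demonstrated without a live cluster; on a real node you would pipe cman_tool services directly.

```shell
#!/bin/sh
# Sketch only: look for a leftover clvmd lockspace before stopping cman.
# services_output simulates `cman_tool services`; on a live node, replace
# the function body with a direct call to cman_tool.
services_output() {
cat <<'EOF'
type             level name     id       state
fence            0     default  00010004 none
[1 2 3 4 5]
dlm              1     clvmd    00020004 none
[3 4]
EOF
}

# Column 1 is the service type, column 3 the service name.
if services_output | awk '$1 == "dlm" && $3 == "clvmd"' | grep -q .; then
    echo "clvmd lockspace still registered; cman stop is likely to fail"
else
    echo "no clvmd lockspace; safe to stop cman"
fi
```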


Version-Release number of selected component (if applicable):
[root@morph-04 ~]# uname -ar
Linux morph-04 2.6.17-1.2519.4.14.el5 #1 SMP Thu Aug 24 19:31:14 EDT 2006 i686
i686 i386 GNU/Linux
[root@morph-04 ~]# rpm -q lvm2-cluster
lvm2-cluster-2.02.09-1.0.RHEL5


How reproducible:
Fairly reproducible.

Comment 1 Abhijith Das 2006-08-28 18:37:33 UTC
'service clvmd stop' does stop the clvmd daemon.
Try the following after 'service clvmd stop'

[root@morph-05 ~]# ps -ef | grep clvmd
root      2542  2520  0 13:22 pts/1    00:00:00 grep clvmd
[root@morph-05 ~]# service clvmd status
clvmd is stopped
active volumes: (none)

But 'cman_tool services' still shows clvmd bound to the dlm.

[root@morph-05 ~]# cman_tool services
type             level name     id       state
fence            0     default  00010004 none
[1 2 3 4 5]
dlm              1     clvmd    00020004 none
[5]


Comment 2 Corey Marthaler 2006-08-28 20:42:43 UTC
FWIW, also reproduced this on x86_64.

Comment 3 David Teigland 2006-08-28 22:04:11 UTC
There's a FIXME in the dlm to add AUTOFREE like in RHEL4.
Not sure if we need to implement that or if libdlm can do
without it.  [I've largely worked out how to add it but would
like to avoid it if possible since it's a little unpleasant
adding more special cases for user lockspaces to the normal
paths.]


Comment 4 Christine Caulfield 2006-08-29 09:14:39 UTC
Under most circumstances clvmd should close the lockspace when it shuts down. If
this isn't happening then we need to know how clvmd is being stopped. Killing it
with -9 will obviously leave the lockspace behind!

The library can't implement autofree because it doesn't know who else has the
lockspace open (only the kernel does) - libraries live in the same process
address space as the calling program and are no more privileged (what do you
think this is? VMS?).

In any case, that wouldn't help, as the clvmd lockspace is not marked AUTOFREE
anyway; that flag is for the default lockspace and is never used by anything else.
One reason why it hasn't been added: it's not really needed.

There's nothing wrong with leaving the clvmd lockspace open when the daemon
isn't running (or there shouldn't be); it can happen under the same
circumstances on RHCS4.

If it's preventing shutdown then we need to look at why it's doing so - because
it shouldn't.

Comment 5 David Teigland 2006-08-29 13:42:28 UTC
Oops, I should have checked clvmd more carefully before assuming it
wanted to use autofree.

It looks like doing a 'cman_tool leave' should cause clvmd to release
the lockspace.  The question is then how the clvmd/cman init scripts
should shut down a node.  It looks like we're currently doing:

> service clvmd stop
> service cman stop

the clvmd daemon exits in the first step, so it never gets the shutdown
callback from cman_tool leave in the second step, and so it never releases
the lockspace, unless I'm still confused about all this.

Perhaps the clvmd daemon should remain running after the first step and
then when clvmd gets a shutdown callback from the second step, it would
release the lockspace and exit.


Comment 6 Nate Straz 2006-08-29 13:54:44 UTC
Dave, that sounds completely backwards.  I would expect to see no traces of clvmd
after running "service clvmd stop."  It should release its lockspace before
clvmd exits.

Comment 7 Christine Caulfield 2006-08-29 14:00:29 UTC
If clvmd is sent a SIGTERM or SIGINT then it should shut down cleanly taking the
lockspace with it. Only a SIGKILL (or a crash) should leave the lockspace open.

Comment 8 Christine Caulfield 2006-08-29 14:18:08 UTC
What I'm saying here, I suppose, is that if clvmd is being killed with -KILL, then don't,

If it's crashing then I want to know where (either a core file, strace or similar),

If it's being killed -TERM but still leaving the lockspace open then there might
be something in the DLM that's failing the remove_lockspace call - I'm not sure.

Looking at the information in comment #1 it seems that this is not that common
(on a 5 node cluster it has removed the lockspace correctly on 3 or 4 nodes).

My additional comment in #4 also still stands: why is this a problem?

Comment 9 Christine Caulfield 2006-08-29 14:36:57 UTC
OK that wasn't hard, so far.

clvmd is calling dlm_release_lockspace, and I can see in the log for
dlm_controld that it is making an attempt to remove the lockspace, but it
doesn't go away:

1156861126 groupd callback: finish clvmd (unused)
1156861129 groupd callback: stop clvmd
1156861129 write "0" to "/sys/kernel/dlm/clvmd/control"
1156861129 groupd callback: start clvmd count 3 members 19 16 14
1156861129 dir_member 19
1156861129 dir_member 18
1156861129 dir_member 14
1156861129 dir_member 16
1156861129 set_members rmdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/18"
1156861129 write "1" to "/sys/kernel/dlm/clvmd/control"
1156861129 groupd callback: finish clvmd (unused)
1156861130 groupd callback: stop clvmd
1156861130 write "0" to "/sys/kernel/dlm/clvmd/control"
1156861130 groupd callback: start clvmd count 2 members 16 14
1156861130 dir_member 19
1156861130 dir_member 14
1156861130 dir_member 16
1156861130 set_members rmdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/19"
1156861130 write "1" to "/sys/kernel/dlm/clvmd/control"
1156861130 groupd callback: finish clvmd (unused)


group_tool shows the lockspace still in existence, and the directory
/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes still exists, with node
numbers 14 and 16 (us) in it.
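The leftover members can be enumerated straight out of configfs. A small sketch (the configfs tree is simulated with a temporary directory here, since the real /sys/kernel/config/dlm/cluster/spaces only exists on a cluster node):

```shell
#!/bin/sh
# Sketch: list lockspaces and their member-node directories, mirroring the
# layout of /sys/kernel/config/dlm/cluster/spaces.  A temp dir stands in
# for configfs so this runs anywhere; on a cluster node, point $spaces at
# the real path instead.
spaces=$(mktemp -d)
mkdir -p "$spaces/clvmd/nodes/14" "$spaces/clvmd/nodes/16"

for space in "$spaces"/*; do
    echo "lockspace: $(basename "$space")"
    for node in "$space"/nodes/*; do
        echo "  member node: $(basename "$node")"
    done
done

rm -rf "$spaces"
```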

Comment 10 Christine Caulfield 2006-08-29 15:08:39 UTC
Created attachment 135144 [details]
daemon logs. nodes 14 & 16 failed to remove the lockspace

Comment 11 David Teigland 2006-08-29 15:29:08 UTC
It looks like the dlm was probably doing recovery on 14 and 16.
We'd want to look at kernel messages and info under /sys/kernel/dlm
for hints about why dlm recovery wasn't completing.


Comment 12 Christine Caulfield 2006-08-29 15:46:04 UTC
Created attachment 135151 [details]
dmesg logs & sysfs dlm files

Comment 13 David Teigland 2006-08-29 16:51:41 UTC
sequence of events that should be taking place:

- clvmd calls libdlm: dlm_release_lockspace()
- libdlm writes DLM_USER_REMOVE_LOCKSPACE to dlm control device
- dlm-kernel: device_remove_lockspace()
- dlm-kernel: dlm_release_lockspace()
- dlm-kernel: do_uevent() does "offline" uevent that should be seen by
  dlm_controld and reported in debug output: uevent: offline@/kernel/dlm/clvmd
- dlm-kernel: do_uevent() waits for ack from dlm_controld through sysfs
- dlm_controld leaves the group, after which no "clvmd" group should appear
  in group_tool's output
- dlm_controld writes to /sys/kernel/dlm/clvmd/event to tell the kernel
  that the lockspace has been left
- do_uevent() returns
- dlm_release_lockspace() cleans up the lockspace in the kernel
- device_remove_lockspace() returns
- libdlm write on control device returns
- libdlm dlm_release_lockspace() returns
- clvmd continues

Looking at the logs, it's evident that dlm_controld on 14 & 16 have not
yet gotten a uevent from the kernel.  It would be interesting to see
where the clvmd process is waiting and whether libdlm has called into
the kernel yet for removing the lockspace.
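To see where a stuck process is waiting, /proc is usually enough. A sketch, run here against the current shell's own pid since there is no clvmd process to inspect; on a hung node you would substitute the pid of clvmd:

```shell
#!/bin/sh
# Sketch: show a process's scheduler state and kernel wait channel via
# /proc.  $$ (this shell) stands in for the clvmd pid.
pid=$$
state=$(awk '/^State:/ {print $2, $3}' "/proc/$pid/status")
wchan=$(cat "/proc/$pid/wchan" 2>/dev/null || echo unavailable)
echo "pid $pid state: $state"
echo "pid $pid wchan: $wchan"
```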


Comment 14 Christine Caulfield 2006-08-30 08:07:37 UTC
clvmd has finished up cleanly and gone home. 

I'm 98% certain that the release lockspace call has completed to the library's
satisfaction, at least, because I can see no abnormal finish in the clvmd logs
and all the locks have been tidily released.

Comment 15 David Teigland 2006-08-30 15:53:53 UTC
dlm-kernel was ignoring the FORCEFREE flag from the lib and
not sending a force value to dlm_release_lockspace(), so any
locks in the lockspace were causing dlm_release_lockspace()
to fail and return EBUSY.  Patch sent.
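As an illustration only (the real logic lives in dlm-kernel's dlm_release_lockspace(), not reproduced here), the broken behaviour amounts to release failing with EBUSY whenever locks remain, because the force value from the library was being dropped:

```shell
#!/bin/sh
# Illustrative model, not kernel code: releasing a lockspace that still
# holds locks must fail with EBUSY (16) unless the caller asks for force.
EBUSY=16

# release_lockspace <nr_locks> <force>
release_lockspace() {
    if [ "$1" -gt 0 ] && [ "$2" -eq 0 ]; then
        return "$EBUSY"   # locks remain and no force: refuse to free
    fi
    return 0              # no locks, or force requested: free the lockspace
}

if release_lockspace 3 0; then
    echo "no force: freed"
else
    echo "no force: refused with rc=$?"
fi
if release_lockspace 3 1; then
    echo "force: freed"
fi
```

The bug was that the kernel always behaved as if force were 0, so clvmd's forced release of a lockspace with locks still in it got EBUSY back.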


Comment 16 Christine Caulfield 2006-08-31 14:10:33 UTC
I can confirm that this fixes the bug for me, at least.

Comment 17 Benjamin Kahn 2006-09-14 19:10:18 UTC
The cluster-5.0 flag is going away....

Comment 19 RHEL Program Management 2006-12-23 00:05:32 UTC
A package has been built which should help the problem described in 
this bug report. This report is therefore being closed with a resolution 
of CURRENTRELEASE. You may reopen this bug report if the solution does 
not work for you.


Comment 20 Nate Straz 2007-12-13 17:40:45 UTC
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5 which never existed.

