Bug 204347 - clvmd can continue to appear as a dlm service after it has been stopped
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: David Teigland
Red Hat Kernel QE team
Depends On:
Reported: 2006-08-28 13:08 EDT by Corey Marthaler
Modified: 2009-09-03 12:51 EDT
CC List: 6 users

See Also:
Fixed In Version: beta2
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2006-12-22 19:05:32 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments
daemon logs. nodes 14 & 16 failed to remove the lockspace (150.33 KB, application/x-gzip)
2006-08-29 11:08 EDT, Christine Caulfield
dmesg logs & sysfs dlm files (9.45 KB, application/x-gzip)
2006-08-29 11:46 EDT, Christine Caulfield

Description Corey Marthaler 2006-08-28 13:08:53 EDT
Description of problem:
Ran 'service clvmd stop' and even though it reported that it was stopped, the
clvmd lockspace still appears in the 'cman_tool services' output:
Deactivating VG vg:   0 logical volume(s) in volume group "vg" now active
                                                           [  OK  ]
Stopping clvm:                                             [  OK  ]

type             level name     id       state
fence            0     default  00010004 none
[1 2 3 4 5]
dlm              1     clvmd    00020004 none
[3 4]

This will then cause the cman init script to fail as well, since the clvmd
service is still registered.

Version-Release number of selected component (if applicable):
[root@morph-04 ~]# uname -ar
Linux morph-04 2.6.17-1.2519.4.14.el5 #1 SMP Thu Aug 24 19:31:14 EDT 2006 i686
i686 i386 GNU/Linux
[root@morph-04 ~]# rpm -q lvm2-cluster

How reproducible:
Fairly reproducible
Comment 1 Abhijith Das 2006-08-28 14:37:33 EDT
'service clvmd stop' does stop the clvmd daemon.
Try the following after 'service clvmd stop'

[root@morph-05 ~]# ps -ef | grep clvmd
root      2542  2520  0 13:22 pts/1    00:00:00 grep clvmd
[root@morph-05 ~]# service clvmd status
clvmd is stopped
active volumes: (none)

But 'cman_tool services' still shows clvmd bound to the dlm.

[root@morph-05 ~]# cman_tool services
type             level name     id       state
fence            0     default  00010004 none
[1 2 3 4 5]
dlm              1     clvmd    00020004 none
Comment 2 Corey Marthaler 2006-08-28 16:42:43 EDT
FWIW, also reproduced this on x86_64.
Comment 3 David Teigland 2006-08-28 18:04:11 EDT
There's a FIXME in the dlm to add AUTOFREE like in RHEL4.
Not sure if we need to implement that or if libdlm can do
without it.  [I've largely worked out how to add it, but would
like to avoid it if possible since it's a little unpleasant
adding more special cases for user lockspaces to the normal
code paths.]
Comment 4 Christine Caulfield 2006-08-29 05:14:39 EDT
Under most circumstances clvmd should close the lockspace when it shuts down. If
this isn't happening then we need to know how clvmd is being stopped. Killing it
-9 will obviously do this!

The library can't implement autofree because it doesn't know who else has the
lockspace open (only the kernel does) - libraries live in the same process
address space as the calling program and are no more privileged (what do you
think this is? VMS?).

In any case that wouldn't help, as the clvmd lockspace is not marked AUTOFREE
anyway - that's for the default lockspace and is never used by anything else. One
reason why it hasn't been added: it's not really needed.

There's nothing wrong with leaving the clvmd lockspace open when the daemon 
isn't running (or there shouldn't be), it can happen under the same
circumstances on RHCS4.

If it's preventing shutdown then we need to look at why it's doing so - because
it shouldn't.
Comment 5 David Teigland 2006-08-29 09:42:28 EDT
Oops, I should have checked clvmd more carefully before assuming it
wanted to use autofree.

It looks like doing a 'cman_tool leave' should cause clvmd to release
the lockspace.  The question is then how the clvmd/cman init scripts
should shut down a node.  It looks like we're currently doing:

> service clvmd stop
> service cman stop

The clvmd daemon exits in the first step, so it never gets the shutdown
callback from cman_tool leave in the second step, and so it never releases
the lockspace - unless I'm still confused about all this.

Perhaps the clvmd daemon should remain running after the first step and
then when clvmd gets a shutdown callback from the second step, it would
release the lockspace and exit.
Comment 6 Nate Straz 2006-08-29 09:54:44 EDT
Dave, that sounds completely backwards.  I would expect to see no traces of clvmd
after running "service clvmd stop."  It should release its lockspace before
clvmd exits.
Comment 7 Christine Caulfield 2006-08-29 10:00:29 EDT
If clvmd is sent a SIGTERM or SIGINT then it should shut down cleanly taking the
lockspace with it. Only a SIGKILL (or a crash) should leave the lockspace open.
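The clean-shutdown behaviour described above can be sketched as a small model: on SIGTERM or SIGINT the daemon sets a flag from the signal handler, falls out of its loop, and releases the lockspace before exiting; only SIGKILL bypasses this path. This is a hypothetical Python illustration, not clvmd's actual code - `release_lockspace` here stands in for the real libdlm `dlm_release_lockspace()` call.

```python
import signal

lockspace_open = True      # stand-in for the clvmd dlm lockspace
shutdown_requested = False

def release_lockspace():
    """Stand-in for libdlm's dlm_release_lockspace()."""
    global lockspace_open
    lockspace_open = False

def handle_term(signum, frame):
    # Async-signal-safe work only: just note that we were asked to stop.
    global shutdown_requested
    shutdown_requested = True

signal.signal(signal.SIGTERM, handle_term)
signal.signal(signal.SIGINT, handle_term)

signal.raise_signal(signal.SIGTERM)   # simulate 'service clvmd stop'

if shutdown_requested:
    release_lockspace()   # a clean exit takes the lockspace with it
```

A SIGKILL never reaches a handler, which is why only `kill -9` (or a crash) should leave the lockspace behind.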
Comment 8 Christine Caulfield 2006-08-29 10:18:08 EDT
What I'm saying here, I suppose, is: if clvmd is being killed with -KILL, then don't.
If it's crashing then I want to know where (either a core file, strace or similar),

If it's being killed -TERM but still leaving the lockspace open then there might
be something in the DLM that's failing the remove_lockspace call - I'm not sure.

Looking at the information in comment #1 it seems that this is not that common
(on a 5 node cluster it has removed the lockspace correctly on 3 or 4 nodes).

My additional comment in #4 also still stands. Why is this a problem?
Comment 9 Christine Caulfield 2006-08-29 10:36:57 EDT
OK that wasn't hard, so far.

clvmd is calling dlm_release_lockspace, and I can see in the log for
dlm_controld that it is making an attempt to remove the lockspace, but it
doesn't go away:

1156861126 groupd callback: finish clvmd (unused)
1156861129 groupd callback: stop clvmd
1156861129 write "0" to "/sys/kernel/dlm/clvmd/control"
1156861129 groupd callback: start clvmd count 3 members 19 16 14
1156861129 dir_member 19
1156861129 dir_member 18
1156861129 dir_member 14
1156861129 dir_member 16
1156861129 set_members rmdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/
1156861129 write "1" to "/sys/kernel/dlm/clvmd/control"
1156861129 groupd callback: finish clvmd (unused)
1156861130 groupd callback: stop clvmd
1156861130 write "0" to "/sys/kernel/dlm/clvmd/control"
1156861130 groupd callback: start clvmd count 2 members 16 14
1156861130 dir_member 19
1156861130 dir_member 14
1156861130 dir_member 16
1156861130 set_members rmdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/
1156861130 write "1" to "/sys/kernel/dlm/clvmd/control"
1156861130 groupd callback: finish clvmd (unused)

group_tool shows the lockspace still in existence, and the directory
/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes still exists, with node
numbers 14 and 16 (us) in it.
Comment 10 Christine Caulfield 2006-08-29 11:08:39 EDT
Created attachment 135144 [details]
daemon logs. nodes 14 & 16 failed to remove the lockspace
Comment 11 David Teigland 2006-08-29 11:29:08 EDT
It looks like the dlm was probably doing recovery on 14 and 16.
We'd want to look at kernel messages and info under /sys/kernel/dlm
for hints about why dlm recovery wasn't completing.
Comment 12 Christine Caulfield 2006-08-29 11:46:04 EDT
Created attachment 135151 [details]
dmesg logs & sysfs dlm files
Comment 13 David Teigland 2006-08-29 12:51:41 EDT
sequence of events that should be taking place:

- clvmd calls libdlm: dlm_release_lockspace()
- libdlm writes DLM_USER_REMOVE_LOCKSPACE to dlm control device
- dlm-kernel: device_remove_lockspace()
- dlm-kernel: dlm_release_lockspace()
- dlm-kernel: do_uevent() does "offline" uevent that should be seen by
  dlm_controld and reported in debug output: uevent: offline@/kernel/dlm/clvmd
- dlm-kernel: do_uevent() waits for ack from dlm_controld through sysfs
- dlm_controld leaves the group, after which no "clvmd" group should appear
  in group_tool's output
- dlm_controld writes to /sys/kernel/dlm/clvmd/event to tell the kernel
  that the lockspace has been left
- do_uevent() returns
- dlm_release_lockspace() cleans up the lockspace in the kernel
- device_remove_lockspace() returns
- libdlm write on control device returns
- libdlm dlm_release_lockspace() returns
- clvmd continues

Looking at the logs, it's evident that dlm_controld on 14 & 16 have not
yet gotten a uevent from the kernel.  It would be interesting to see
where the clvmd process is waiting and whether libdlm has called into
the kernel yet for removing the lockspace.
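The step list in this comment can be modelled as a tiny handshake between the kernel side and dlm_controld. This is a hypothetical Python sketch of the sequencing only - the `Kernel` and `Daemon` classes are invented for illustration and stand in for `device_remove_lockspace()`/`do_uevent()` and the daemon's uevent handling.

```python
class Daemon:
    """Models dlm_controld: reacts to the kernel's 'offline' uevent."""
    def __init__(self):
        self.groups = {"clvmd"}

    def on_offline_uevent(self, kernel, name):
        self.groups.discard(name)   # dlm_controld leaves the group
        kernel.ack_event(name)      # write to /sys/kernel/dlm/<name>/event

class Kernel:
    """Models the dlm kernel side of removing a lockspace."""
    def __init__(self, daemon):
        self.daemon = daemon
        self.lockspaces = {"clvmd"}
        self.acked = set()

    def remove_lockspace(self, name):
        # do_uevent(): emit "offline" and wait for the daemon's ack
        self.daemon.on_offline_uevent(self, name)
        assert name in self.acked       # ack arrived before teardown
        self.lockspaces.discard(name)   # dlm_release_lockspace() cleanup
        return 0                        # write on the control device returns

    def ack_event(self, name):
        self.acked.add(name)

daemon = Daemon()
kernel = Kernel(daemon)
rc = kernel.remove_lockspace("clvmd")   # clvmd -> libdlm -> kernel
```

The bug report's symptom corresponds to the handshake never starting on nodes 14 and 16: dlm_controld there never saw the offline uevent, so the group and the sysfs directory stayed behind.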
Comment 14 Christine Caulfield 2006-08-30 04:07:37 EDT
clvmd has finished up cleanly and gone home. 

I'm 98% certain that the release-lockspace call has completed to the library's
satisfaction, at least, because I can see no abnormal finish in the clvmd logs and
all the locks have been tidily released.
Comment 15 David Teigland 2006-08-30 11:53:53 EDT
dlm-kernel was ignoring the FORCEFREE flag from the lib and
not sending a force value to dlm_release_lockspace(), so any
locks in the lockspace were causing dlm_release_lockspace()
to fail and return EBUSY.  Patch sent.
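The root cause can be illustrated with a toy model of the force semantics: releasing a lockspace that still holds locks fails with EBUSY unless a force value is passed, and the pre-patch kernel dropped the library's FORCEFREE flag, so force was effectively always zero. This is a hypothetical sketch, not the actual kernel patch.

```python
import errno

def release_lockspace(locks_remaining, force):
    """Toy model of dlm_release_lockspace(ls, force) semantics."""
    if locks_remaining and not force:
        return -errno.EBUSY   # the pre-patch failure clvmd was hitting
    return 0                  # lockspace freed

# Pre-patch: the kernel ignored FORCEFREE, so force was effectively 0
buggy = release_lockspace(locks_remaining=2, force=0)

# With FORCEFREE passed through as a force value, the release succeeds
fixed = release_lockspace(locks_remaining=2, force=1)
```

This matches the intermittent symptom in comment #8: nodes whose lockspaces happened to be empty at shutdown released cleanly, while nodes with residual locks got EBUSY and left the lockspace registered.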
Comment 16 Christine Caulfield 2006-08-31 10:10:33 EDT
I can confirm that this fixes the bug for me, at least.
Comment 17 Benjamin Kahn 2006-09-14 15:10:18 EDT
The cluster-5.0 flag is going away....
Comment 19 RHEL Product and Program Management 2006-12-22 19:05:32 EST
A package has been built which should help the problem described in 
this bug report. This report is therefore being closed with a resolution 
of CURRENTRELEASE. You may reopen this bug report if the solution does 
not work for you.
Comment 20 Nate Straz 2007-12-13 12:40:45 EST
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5 which never existed.
