174547 – slab error in kmem_cache_destroy() after stopping gfs service

Keywords:
Status:	CLOSED DEFERRED
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	dlm
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Robert Peterson
QA Contact:	GFS Bugs
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	174548
Depends On:
Blocks:	180185

Reported:	2005-11-29 21:48 UTC by Corey Marthaler
Modified:	2018-10-19 23:27 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2007-12-11 15:09:17 UTC
Embargoed:

Attachments
Possible patch to fix the problem (614 bytes, patch) 2007-11-09 16:09 UTC, Robert Peterson

Description Corey Marthaler 2005-11-29 21:48:05 UTC

Description of problem:
I was taking down my link cluster, and 'service cman stop' failed even
though no services were listed in /proc/cluster/services (that file
still existed).  Because the last steps of the cman init script unload the dlm
modules, I checked which modules were loaded: the dlm modules were
still loaded and being used by gfs.  I then ran 'service gfs stop', which
caused this kernel error:

slab error in kmem_cache_destroy(): cache `dlm_lvb/range': Can't free all objects

Call Trace:<ffffffff801607bb>{kmem_cache_destroy+202} <ffffffffa0244c8d>{:dlm:cleanup_module+23}
       <ffffffff8014cdfd>{sys_delete_module+479} <ffffffff801e77bd>{__up_write+20}
       <ffffffff8016ac6c>{sys_munmap+94} <ffffffff80110052>{system_call+126}

Nov 29 12:44:51 link-06 kernel: slab error in kmem_cache_destroy(): cache `dlm_lvb/range': Can't free all objects
Nov 29 12:44:51 link-06 kernel:
Nov 29 12:44:51 link-06 kernel: Call Trace:<ffffffff801607bb>{kmem_cache_destroy+202} <ffffffffa0244c8d>{:dlm:cleanup_module+23}
Nov 29 12:44:51 link-06 kernel:        <ffffffff8014cdfd>{sys_delete_module+479} <ffffffff801e77bd>{__up_write+20}
Nov 29 12:44:51 link-06 kernel:        <ffffffff8016ac6c>{sys_munmap+94} <ffffffff80110052>{system_call+126}
Nov 29 12:44:51 link-06 kernel:
Nov 29 12:44:51 link-06 kernel: NET: Unregistered protocol family 30

After that I was actually able to stop cman services and everything appeared fine.


ON ALL NODES:

[root@link-02 ~]# service fenced stop
Stopping fence domain:                                     [  OK  ]
[root@link-02 ~]# service cman stop
Stopping cman:                                             [FAILED]
[root@link-02 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
[root@link-02 ~]# vi /etc/init.d/cman
[root@link-02 ~]# lsmod
Module                  Size  Used by
lock_dlm               46068  0
gfs                   322060  0
lock_harness            6960  2 lock_dlm,gfs
dlm                   130180  1 lock_dlm
cman                  136480  2 lock_dlm,dlm
md5                     5697  1
ipv6                  282657  18
parport_pc             29185  1
lp                     15089  0
parport                43981  2 parport_pc,lp
autofs4                23241  0
i2c_dev                13633  0
i2c_core               28481  1 i2c_dev
sunrpc                170425  1
ds                     21449  0
yenta_socket           22977  0
pcmcia_core            69329  2 ds,yenta_socket
button                  9057  0
battery                11209  0
ac                      6729  0
ohci_hcd               24273  0
hw_random               7137  0
tg3                    91717  0
floppy                 65809  0
dm_snapshot            18561  0
dm_zero                 3649  0
dm_mirror              28889  0
ext3                  137681  2
jbd                    68849  1 ext3
dm_mod                 66433  6 dm_snapshot,dm_zero,dm_mirror
qla2300               126017  0
qla2xxx               178849  7 qla2300
scsi_transport_fc      11201  1 qla2xxx
sd_mod                 19393  12
scsi_mod              140177  3 qla2xxx,scsi_transport_fc,sd_mod
[root@link-02 ~]# service cman stop
Stopping cman:                                             [FAILED]
[root@link-02 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
[root@link-02 ~]# service gfs stop
[root@link-02 ~]# service gfs stop
[root@link-02 ~]#
[root@link-02 ~]#
[root@link-02 ~]# service cman stop
Stopping cman:                                             [  OK  ]
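
The lsmod listing above suggests why 'service cman stop' could not finish until gfs was stopped: cman was still listed as used by lock_dlm and dlm. As a rough illustration only, the following C sketch derives an unload order from the "Used by" names. The struct and function names (struct mod, name_in_list, unload_order) are hypothetical, and the sketch assumes that only the named users matter; it ignores the numeric use count, which can also reflect non-module references such as mounted filesystems.

```c
/* Sketch: derive a module unload order from lsmod-style "Used by" lists.
 * All names here are hypothetical and exist only for illustration. */
#include <string.h>

struct mod {
    const char *name;     /* module name as shown by lsmod */
    const char *used_by;  /* comma-separated users, "" if none */
    int loaded;           /* 1 while the module is still loaded */
};

/* Return 1 if name appears as a whole token in a comma-separated list.
 * Whole-token matching matters: "dlm" must not match inside "lock_dlm". */
static int name_in_list(const char *list, const char *name)
{
    size_t len = strlen(name);
    const char *p = list;

    while (*p) {
        const char *end = strchr(p, ',');
        size_t tok = end ? (size_t)(end - p) : strlen(p);

        if (tok == len && strncmp(p, name, len) == 0)
            return 1;
        if (!end)
            break;
        p = end + 1;
    }
    return 0;
}

/* Return 1 if any still-loaded module is listed as a user of m. */
static int has_loaded_user(const struct mod *mods, int n, const struct mod *m)
{
    for (int i = 0; i < n; i++)
        if (mods[i].loaded && name_in_list(m->used_by, mods[i].name))
            return 1;
    return 0;
}

/* Repeatedly pick the first loaded module that no loaded module uses.
 * Writes the chosen indices to order[] and returns how many were picked;
 * a result smaller than n means the remaining modules are still in use. */
static int unload_order(struct mod *mods, int n, int *order)
{
    int count = 0;

    for (;;) {
        int picked = -1;

        for (int i = 0; i < n && picked < 0; i++)
            if (mods[i].loaded && !has_loaded_user(mods, n, &mods[i]))
                picked = i;
        if (picked < 0)
            break;
        mods[picked].loaded = 0;
        order[count++] = picked;
    }
    return count;
}
```

Fed the five cluster modules from the listing (lock_dlm, gfs, lock_harness, dlm, cman), the sketch never places cman ahead of lock_dlm or dlm, which matches the sequence that eventually worked above: stop gfs first, then cman.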



Version-Release number of selected component (if applicable):
kernel: GFS 2.6.9-45.0 (built Nov 28 2005 11:39:41) installed

Comment 1 Christine Caulfield 2005-11-30 08:54:19 UTC

*** Bug 174548 has been marked as a duplicate of this bug. ***

Comment 2 David Teigland 2005-11-30 17:11:54 UTC

It seems unwise to me for the init scripts to unload modules
at all.  I don't understand why we'd even want them to in the
first place.  Should we add a new bugzilla about that?
Unloading modules is never an entirely safe thing to do;
as I understand it there are some unavoidable races involved.

Aside from that, I'll see if I can track down what's causing
this error.

Comment 4 Robert Peterson 2006-04-18 17:05:20 UTC

We haven't seen this problem again since it was reported.
Corey said it was okay to close it "Works For Me."
If the problem occurs again, it can be reopened and addressed
at that time.

Comment 7 Robert Peterson 2007-11-09 16:09:30 UTC

Created attachment 252981
Possible patch to fix the problem

Revisiting this problem, I discovered a place in the dlm code that
MAY account for it.  It would also explain why the occurrence
is rare.  In the function deserialise_lkb(), a section of code
allocates a small chunk of slab memory for a new lkb.
That new lkb may later be discarded if the conditions are just right,
but the new chunk of slab memory is never freed.  This would produce
exactly the symptoms of this bugzilla.  I think it would show up more
during recovery testing (like our "revolver test") than anywhere else.
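
To make the suspected pattern concrete, here is a minimal userspace sketch in C. It is not the dlm code and not the attached patch: a small fixed pool stands in for the slab cache, and every name (toy_cache, unpack_lock, the free_on_discard flag) is hypothetical. It only shows how an object allocated on a path that is later abandoned leaves the cache non-empty, so that a destroy-time check like the one behind "Can't free all objects" fails.

```c
/* Userspace sketch: a fixed pool stands in for a kernel slab cache.
 * Every name below is hypothetical; none is taken from the dlm source. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS 8

struct toy_cache {
    char slots[TOY_SLOTS][32];  /* backing storage for the objects */
    bool used[TOY_SLOTS];       /* true while a slot is allocated */
};

/* Hand out the first free slot, or NULL if the pool is exhausted. */
static void *toy_cache_alloc(struct toy_cache *c)
{
    for (int i = 0; i < TOY_SLOTS; i++) {
        if (!c->used[i]) {
            c->used[i] = true;
            return c->slots[i];
        }
    }
    return NULL;
}

/* Return a slot to the pool; the object must come from this cache. */
static void toy_cache_free(struct toy_cache *c, void *obj)
{
    for (int i = 0; i < TOY_SLOTS; i++) {
        if (obj == (void *)c->slots[i]) {
            assert(c->used[i]);
            c->used[i] = false;
            return;
        }
    }
    assert(!"object does not belong to this cache");
}

/* Count the objects that were allocated but never freed. */
static int toy_cache_outstanding(const struct toy_cache *c)
{
    int n = 0;

    for (int i = 0; i < TOY_SLOTS; i++)
        n += c->used[i] ? 1 : 0;
    return n;
}

/* Destroy succeeds only when no objects remain outstanding. */
static bool toy_cache_destroy(const struct toy_cache *c)
{
    return toy_cache_outstanding(c) == 0;
}

/* Stand-in for the recovery path: allocate a chunk for a new lock,
 * then possibly discard the lock. With free_on_discard false, the
 * chunk is dropped without being freed, which is the suspected leak. */
static void *unpack_lock(struct toy_cache *c, bool discard, bool free_on_discard)
{
    void *chunk = toy_cache_alloc(c);

    if (chunk == NULL)
        return NULL;
    if (discard) {
        if (free_on_discard)
            toy_cache_free(c, chunk);  /* release before dropping the pointer */
        return NULL;
    }
    return chunk;
}
```

Calling unpack_lock() with discard set and free_on_discard clear leaves one object outstanding, so toy_cache_destroy() reports failure; with free_on_discard set, the cache is empty again and the destroy check passes.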

This patch fixes that problem, but it is completely untested.

I spoke with Dave Teigland about this and he agreed that it's a
problem, but a very rare one.  So the question becomes how to proceed.
Given (a) the problem's extreme rarity, (b) we can't recreate the
problem, and (c) there's no good way to test it other than our normal
regression testing, do we want to ship the fix?  How does the customer
want to proceed?  Do they want to try the patch?

I'll set the status to NEEDINFO to get those answers.

Comment 8 Robert Peterson 2007-11-09 16:13:29 UTC

I'm asking for direction from support, based on comment #7, so I'm
setting status to NEEDINFO.  I'm also adding Dave Teigland to the cc
list to keep him in the loop.

Comment 10 Robert Peterson 2007-12-11 15:09:17 UTC

The problem does not exist in RHEL5, so this is a RHEL4 problem only.
Due to lack of demand, I'm going to close this bug as deferred.
It is apparently a bug, and we can certainly fix it if any customers
want us to.  The fact of the matter is that it is rare and nobody cares
if we do the fix at this point.
