Description of problem:
I was taking down my link cluster, and 'service cman stop' failed even though no services were listed in /proc/cluster/services (though that file did still exist). I checked which modules were loaded (the last steps of the cman init script unload the dlm modules) and found that the dlm modules were still loaded and in use by gfs, so I ran 'service gfs stop'. This caused the following kernel error:

slab error in kmem_cache_destroy(): cache `dlm_lvb/range': Can't free all objects

Call Trace:<ffffffff801607bb>{kmem_cache_destroy+202} <ffffffffa0244c8d>{:dlm:cleanup_module+23}
       <ffffffff8014cdfd>{sys_delete_module+479} <ffffffff801e77bd>{__up_write+20}
       <ffffffff8016ac6c>{sys_munmap+94} <ffffffff80110052>{system_call+126}

Nov 29 12:44:51 link-06 kernel: slab error in kmem_cache_destroy(): cache `dlm_lvb/range': Can't free all objects
Nov 29 12:44:51 link-06 kernel:
Nov 29 12:44:51 link-06 kernel: Call Trace:<ffffffff801607bb>{kmem_cache_destroy+202} <ffffffffa0244c8d>{:dlm:cleanup_module+23}
Nov 29 12:44:51 link-06 kernel:        <ffffffff8014cdfd>{sys_delete_module+479} <ffffffff801e77bd>{__up_write+20}
Nov 29 12:44:51 link-06 kernel:        <ffffffff8016ac6c>{sys_munmap+94} <ffffffff80110052>{system_call+126}
Nov 29 12:44:51 link-06 kernel:
Nov 29 12:44:51 link-06 kernel: NET: Unregistered protocol family 30

After that I was able to stop the cman service, and everything appeared fine.
ON ALL NODES:
[root@link-02 ~]# service fenced stop
Stopping fence domain:                                     [  OK  ]
[root@link-02 ~]# service cman stop
Stopping cman:                                             [FAILED]
[root@link-02 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
[root@link-02 ~]# vi /etc/init.d/cman
[root@link-02 ~]# lsmod
Module                  Size  Used by
lock_dlm               46068  0
gfs                   322060  0
lock_harness            6960  2 lock_dlm,gfs
dlm                   130180  1 lock_dlm
cman                  136480  2 lock_dlm,dlm
md5                     5697  1
ipv6                  282657  18
parport_pc             29185  1
lp                     15089  0
parport                43981  2 parport_pc,lp
autofs4                23241  0
i2c_dev                13633  0
i2c_core               28481  1 i2c_dev
sunrpc                170425  1
ds                     21449  0
yenta_socket           22977  0
pcmcia_core            69329  2 ds,yenta_socket
button                  9057  0
battery                11209  0
ac                      6729  0
ohci_hcd               24273  0
hw_random               7137  0
tg3                    91717  0
floppy                 65809  0
dm_snapshot            18561  0
dm_zero                 3649  0
dm_mirror              28889  0
ext3                  137681  2
jbd                    68849  1 ext3
dm_mod                 66433  6 dm_snapshot,dm_zero,dm_mirror
qla2300               126017  0
qla2xxx               178849  7 qla2300
scsi_transport_fc      11201  1 qla2xxx
sd_mod                 19393  12
scsi_mod              140177  3 qla2xxx,scsi_transport_fc,sd_mod
[root@link-02 ~]# service cman stop
Stopping cman:                                             [FAILED]
[root@link-02 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
[root@link-02 ~]# service gfs stop
[root@link-02 ~]# service gfs stop
[root@link-02 ~]#
[root@link-02 ~]#
[root@link-02 ~]# service cman stop
Stopping cman:                                             [  OK  ]

Version-Release number of selected component (if applicable):
kernel: GFS 2.6.9-45.0 (built Nov 28 2005 11:39:41) installed
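
For anyone reproducing this, the lsmod output above is what shows why 'service cman stop' failed: cman still had a use count of 2, held by lock_dlm and dlm. The following is a minimal sketch in C, not part of the cman init script, of checking a single lsmod-style line for remaining users before attempting an unload. The helper name module_use_count() and its interface are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not from any cluster package): parse one lsmod-style
 * line of the form "name size use_count [users]" and, if it describes
 * `target`, return its use count and copy the comma-separated list of
 * holders into `users` (empty string if there are none).
 * Returns -1 if the line is not a module entry for `target`.
 */
static int module_use_count(const char *line, const char *target,
                            char *users, size_t users_len)
{
    char name[64];
    char held_by[256] = "";
    unsigned long size;
    int refs;

    /* The fourth field is optional; sscanf returns 3 when it is absent. */
    int n = sscanf(line, "%63s %lu %d %255s", name, &size, &refs, held_by);
    if (n < 3 || strcmp(name, target) != 0)
        return -1;

    snprintf(users, users_len, "%s", held_by);
    return refs;
}
```

Given the line "cman 136480 2 lock_dlm,dlm" and the target "cman", this returns 2 and fills in "lock_dlm,dlm", which matches the state in which the stop failed. The header line "Module Size Used by" is rejected because its second field is not a number.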
*** Bug 174548 has been marked as a duplicate of this bug. ***
It seems unwise to me for the init scripts to unload modules at all. I don't understand why we'd even want them to in the first place. Should we add a new bugzilla about that? Unloading modules is never an entirely safe thing to do; as I understand it there are some unavoidable races involved. Aside from that, I'll see if I can track down what's causing this error.
We haven't seen this problem again since it was reported. Corey said it was okay to close it "Works For Me." If the problem occurs again, it can be reopened and addressed at that time.
Created attachment 252981
Possible patch to fix the problem

Revisiting this problem, I discovered a place in the dlm code that MAY account for it, and that would also explain why it occurs so rarely. In the function deserialise_lkb() there is a section of code that allocates a small chunk of slab memory for a new lkb. That new lkb may later be discarded if the conditions are just right, but the slab memory allocated for it is never freed. This would produce exactly the symptoms described in this bugzilla. I think it would show up more during recovery testing (like our "revolver test") than anywhere else.

This patch fixes that problem, but it is completely untested. I spoke with Dave Teigland about this and he agreed that it's a problem, but a very rare one.

So the question becomes how to proceed. Given that (a) the problem is extremely rare, (b) we can't recreate it, and (c) there's no good way to test the fix other than our normal regression testing, do we want to ship the fix? How does the customer want to proceed? Do they want to try the patch? I'll set the status to NEEDINFO to get those answers.
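
To make the suspected failure mode concrete, here is a minimal userspace sketch in C of the pattern described above. It is not the dlm code and not the attached patch: toy_lkb, toy_alloc(), receive_leaky(), receive_fixed() and toy_cache_destroy() are all hypothetical names, and a plain counter stands in for the slab cache. The assumption being illustrated is only that an object allocated and then discarded without being freed leaves the cache non-empty at teardown.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a slab cache: counts objects allocated but not yet freed. */
static int live_objects;

struct toy_lkb {
    int id;
};

static struct toy_lkb *toy_alloc(void)
{
    struct toy_lkb *lkb = calloc(1, sizeof(*lkb));
    if (lkb)
        live_objects++;
    return lkb;
}

static void toy_free(struct toy_lkb *lkb)
{
    if (lkb) {
        free(lkb);
        live_objects--;
    }
}

/* Leaky variant: the new object is dropped on the discard path. */
static struct toy_lkb *receive_leaky(int discard)
{
    struct toy_lkb *lkb = toy_alloc();
    if (!lkb)
        return NULL;
    if (discard)
        return NULL;        /* BUG: lkb is abandoned but never freed */
    return lkb;
}

/* Fixed variant: the discard path releases the object first. */
static struct toy_lkb *receive_fixed(int discard)
{
    struct toy_lkb *lkb = toy_alloc();
    if (!lkb)
        return NULL;
    if (discard) {
        toy_free(lkb);
        return NULL;
    }
    return lkb;
}

/* Teardown succeeds (0) only when no objects remain, and fails (-1) otherwise. */
static int toy_cache_destroy(void)
{
    return live_objects == 0 ? 0 : -1;
}
```

In this model, a single call to receive_leaky(1) is enough to make toy_cache_destroy() fail, while any number of calls to receive_fixed(1) leave the cache empty. Because the leak only happens on the discard path, a leak of this shape would go unnoticed until module unload, consistent with how rarely the error was reported.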
I'm asking for direction from support, based on comment #7, so I'm setting status to NEEDINFO. I'm also adding Dave Teigland to the cc list to keep him in the loop.
The problem does not exist in RHEL5, so this is a RHEL4 problem only. Due to lack of demand, I'm going to close this bug as deferred. It is apparently a bug, and we can certainly fix it if any customers want us to. The fact of the matter is that it is rare and nobody cares if we do the fix at this point.