Bug 154262

Summary: slab error in kmem_cache_destroy(): cache `dlm_conn': Can't free all objects when clvmd exits
Product: [Retired] Red Hat Cluster Suite
Component: dlm
Version: 4
Hardware: All
OS: Linux
Status: CLOSED NEXTRELEASE
Severity: medium
Priority: medium
Reporter: Dean Jansa <djansa>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint
Doc Type: Bug Fix
Last Closed: 2005-11-29 21:55:48 UTC

Description Dean Jansa 2005-04-08 20:02:40 UTC
Description of problem:

At times while stopping clvmd I hit:
slab error in kmem_cache_destroy(): cache `dlm_conn': Can't free all objects
 [<c0142cff>] kmem_cache_destroy+0x99/0x132
 [<f8a9934b>] lowcomms_stop+0xd4/0xdb [dlm]
 [<f8a9704e>] threads_stop+0x5/0xa [dlm]
 [<f8a97147>] dlm_release+0x83/0xa0 [dlm]
 [<f8a97983>] release_lockspace+0x199/0x1cf [dlm]
 [<f8a912f3>] unregister_lockspace+0xa/0x5c [dlm]
 [<f8a91a9c>] do_user_remove_lockspace+0x7d/0x94 [dlm]
 [<f8a92574>] dlm_write+0x169/0x1ae [dlm]
 [<c01561a8>] vfs_write+0xb6/0xe2
 [<c0156272>] sys_write+0x3c/0x62
 [<c02c746b>] syscall_call+0x7/0xb
kmem_cache_create: duplicate cache dlm_conn
------------[ cut here ]------------
kernel BUG at mm/slab.c:1453!
invalid operand: 0000 [#1]
SMP
Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_dlm(U) dlm(U) cman(U) lock_harness(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc button battery ac uhci_hcd hw_random e1000 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    1
EIP:    0060:[<c0142a8e>]    Not tainted VLI
EFLAGS: 00010202   (2.6.9-6.37.ELsmp)
EIP is at kmem_cache_create+0x4b3/0x526


Upon restart I hit the duplicate cache BUG (which is reasonable, seeing as we
didn't clean it up above, but I thought the stack might help at any rate):
Mar 30 14:33:08 morph-04 kernel: kmem_cache_create: duplicate cache dlm_conn
Mar 30 14:33:08 morph-04 kernel: ------------[ cut here ]------------
Mar 30 14:33:08 morph-04 kernel: kernel BUG at mm/slab.c:1453!
Mar 30 14:33:08 morph-04 kernel: invalid operand: 0000 [#1]
Mar 30 14:33:08 morph-04 kernel: SMP
Mar 30 14:33:08 morph-04 kernel: Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_dlm(U) dlm(U) cman(U) lock_harness(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc button battery ac uhci_hcd hw_random e1000 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Mar 30 14:33:08 morph-04 kernel: CPU:    1
Mar 30 14:33:08 morph-04 kernel: EIP:    0060:[<c0142a8e>]    Not tainted VLI
Mar 30 14:33:08 morph-04 kernel: EFLAGS: 00010202   (2.6.9-6.37.ELsmp)
Mar 30 14:33:08 morph-04 kernel: EIP is at kmem_cache_create+0x4b3/0x526
Mar 30 14:33:08 morph-04 kernel: eax: 0000002c   ebx: f4554a74   ecx: c042530c   edx: c02dad97
Mar 30 14:33:08 morph-04 kernel: esi: f8aa282c   edi: f8aa2835   ebp: f4554880   esp: f2e4bec8
Mar 30 14:33:08 morph-04 kernel: ds: 007b   es: 007b   ss: 0068
Mar 30 14:33:08 morph-04 kernel: Process clvmd (pid: 14464, threadinfo=f2e4b000 task=f43f7730)
Mar 30 14:33:08 morph-04 kernel: Stack: c201cd60 c0000000 00000000 f8aa282c 00000050 00000000 fffffff4 f5f2ba00
Mar 30 14:33:08 morph-04 kernel:        f5f2b118 f8a994a1 00000000 00000000 00000000 00000000 f5f2b900 00000005
Mar 30 14:33:08 morph-04 kernel:        f8a9702b f8aab850 f8a9706a f8a97721 f5f2b900 00000000 f5f2b11e ffffffff
Mar 30 14:33:08 morph-04 kernel: Call Trace:
Mar 30 14:33:08 morph-04 kernel:  [<f8a994a1>] lowcomms_start+0x14f/0x1f6 [dlm]
Mar 30 14:33:08 morph-04 kernel:  [<f8a9702b>] threads_start+0x20/0x3e [dlm]
Mar 30 14:33:08 morph-04 kernel:  [<f8a9706a>] init_internal+0x17/0x30 [dlm]
Mar 30 14:33:08 morph-04 kernel:  [<f8a97721>] dlm_new_lockspace+0x39/0x61 [dlm]
Mar 30 14:33:08 morph-04 kernel:  [<f8a91242>] register_lockspace+0xa3/0x14a [dlm]
Mar 30 14:33:08 morph-04 kernel:  [<f8a91a0e>] do_user_create_lockspace+0x21/0x32 [dlm]
Mar 30 14:33:08 morph-04 kernel:  [<f8a92561>] dlm_write+0x156/0x1ae [dlm]
Mar 30 14:33:08 morph-04 kernel:  [<c01561a8>] vfs_write+0xb6/0xe2
Mar 30 14:33:08 morph-04 kernel:  [<c0156272>] sys_write+0x3c/0x62
Mar 30 14:33:08 morph-04 kernel:  [<c02c746b>] syscall_call+0x7/0xb
Mar 30 14:33:09 morph-04 kernel: Code: 04 19 c0 0c 01 85 c0 75 2a ff 74 24 0c 68 97 ad 2d c0 e8 50 ef fd ff 59 b9 0c 53 42 c0 5e f0 ff 05 0c 53 42 c0 0f 8e eb 14 00 00 <0f> 0b ad 05 14 ad 2d c0 8b 1b eb 84 8b 54 24 04 b8 00 f0 ff ff
Mar 30 14:33:09 morph-04 kernel:  <0>Fatal exception: panic in 5 seconds
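
The two traces fit the way the 2.6.9 slab allocator treats a failed destroy: when kmem_cache_destroy() cannot free every object it reports the error and leaves the cache registered, so the next kmem_cache_create() with the same name finds a duplicate and hits BUG(). The userspace C sketch below is only a toy model of that sequence. The names toy_cache, toy_cache_create and toy_cache_destroy are hypothetical and are not kernel APIs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_CACHES 8

/* Hypothetical stand-in for a slab cache: a name plus a live-object count. */
struct toy_cache {
    char name[32];
    int live_objects;
    int registered;
};

static struct toy_cache chain[MAX_CACHES];

/*
 * Register a cache by name. Returns NULL if the name is already registered
 * (where the 2.6.9 kernel would BUG()) or if the table is full.
 */
static struct toy_cache *toy_cache_create(const char *name)
{
    struct toy_cache *slot = NULL;

    assert(name != NULL);
    for (int i = 0; i < MAX_CACHES; i++) {
        if (chain[i].registered && strcmp(chain[i].name, name) == 0)
            return NULL;                /* duplicate cache name */
        if (!chain[i].registered && slot == NULL)
            slot = &chain[i];           /* remember the first free slot */
    }
    if (slot == NULL)
        return NULL;                    /* no room left */

    snprintf(slot->name, sizeof(slot->name), "%s", name);
    slot->live_objects = 0;
    slot->registered = 1;
    return slot;
}

/*
 * Unregister a cache. Returns 0 on success, or 1 if objects are still
 * allocated, in which case the cache stays registered.
 */
static int toy_cache_destroy(struct toy_cache *c)
{
    assert(c != NULL);
    if (c->live_objects > 0)
        return 1;                       /* "Can't free all objects" */
    c->registered = 0;
    return 0;
}
```

In this model a single leaked object is enough to make the destroy fail, and every later create of the same name then fails as well, which matches the order of the messages in the logs.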




Version-Release number of selected component (if applicable):

2.6.9-6.37.ELsmp

DLM 2.6.9-30.1 (built Mar 29 2005 18:29:33) installed
Lock_DLM (built Mar 29 2005 18:33:25) installed

How reproducible:

Sometimes


Steps to Reproduce:
1. start clvmd
2. create vols
3. tear down vols
4. stop clvmd

Comment 1 Christine Caulfield 2005-04-11 13:32:29 UTC
Grief, the locking in nodeid2con is well broken: there's a read lock protecting
a write! That explains why two connections to the same node can be created at
the same time. Of course, only one of them gets freed, hence this bug.

Replaced the rw_semaphore, which was upped and downed all over the place within
the one routine, with a simple semaphore that protects the whole operation.

Checking in lowcomms.c;
/cvs/cluster/cluster/dlm-kernel/src/lowcomms.c,v  <--  lowcomms.c
new revision: 1.22.2.8; previous revision: 1.22.2.7
done
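
The locking change described above can be sketched in userspace C as a lookup-or-create routine in which one lock covers both the check and the insert. This is a minimal illustration under stated assumptions, not the code from lowcomms.c: toy_conn, toy_nodeid2con, toy_free_all and the fixed-size table are hypothetical, and a pthread mutex stands in for the kernel semaphore.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MAX_NODES 16

/* Hypothetical connection record; the real structure in lowcomms.c differs. */
struct toy_conn {
    int nodeid;
};

static struct toy_conn *conns[MAX_NODES];
static int conns_allocated;             /* allocation counter, for illustration */
static pthread_mutex_t conns_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Return the connection for nodeid, creating it if it does not exist yet.
 * The lock is held across the lookup and the insert, so two callers cannot
 * both see "absent" and both allocate a connection for the same node.
 */
static struct toy_conn *toy_nodeid2con(int nodeid)
{
    struct toy_conn *con;

    assert(nodeid >= 0 && nodeid < MAX_NODES);

    pthread_mutex_lock(&conns_lock);
    con = conns[nodeid];
    if (con == NULL) {
        con = calloc(1, sizeof(*con));
        if (con != NULL) {
            con->nodeid = nodeid;
            conns[nodeid] = con;
            conns_allocated++;
        }
    }
    pthread_mutex_unlock(&conns_lock);
    return con;
}

/* Free every connection in the table; returns how many were freed. */
static int toy_free_all(void)
{
    int freed = 0;

    pthread_mutex_lock(&conns_lock);
    for (int i = 0; i < MAX_NODES; i++) {
        if (conns[i] != NULL) {
            free(conns[i]);
            conns[i] = NULL;
            freed++;
        }
    }
    pthread_mutex_unlock(&conns_lock);
    return freed;
}
```

With a read lock around the lookup and the insert done outside any exclusive section, two callers could each allocate a connection for the same node, and only the one left in the table would be freed at shutdown. Holding one lock for the whole routine keeps the number of allocations equal to the number of table entries.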


Comment 2 Dean Jansa 2005-11-29 21:55:48 UTC
Have not seen this after the fix went in.