241699 – dlm fails to unregister all lockspaces

Summary: dlm fails to unregister all lockspaces

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	dlm
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	David Teigland
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-05-29 17:04 UTC by Brad Walker
Modified:	2009-04-16 20:31 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2007-05-30 18:08:08 UTC
Embargoed:

Description Brad Walker 2007-05-29 17:04:47 UTC

Description of problem:

I was working on a fencing problem when I discovered this kernel crash.

Basically, the kernel crashes into kdb with a dlm problem.

The steps to reproduce this are:

service ccsd start
service cman start
service clvmd start
service clvmd stop
service cman stop

Starting portlock:  ip_tables: (C) 2000-2002 Netfilter core team
[  OK  ]
CMAN <CVS> (built May  9 2007 14:54:51) installed
CMAN: quorum regained, resuming activity
DLM <CVS> (built May  9 2007 14:55:00) installed
WARNING: dlm_emergency_shutdown
WARNING: dlm_emergency_shutdown
slab error in kmem_cache_destroy(): cache `dlm_lkb': Can't free all objects
Call Trace:
    <ffffffff8016191f>{kmem_cache_destroy+202}
    <ffffffffa02440ac>{:dlm:dlm_memory_exit+37}
    <ffffffffa024bbf9>{:dlm:cleanup_module+23}
    <ffffffff8014dc54>{sys_delete_module+479}
    <ffffffff80110c61>{error_exit+0}
    <ffffffff801101c6>{system_call+126}

CMAN <CVS> (built May  9 2007 14:54:51) installed
SLAB: cache with size 232 has lost its name
CMAN: quorum regained, resuming activity
kmem_cache_create: duplicate cache dlm_lkb

Kernel BUG at slab:1453
invalid operand: 0000 [1] SMP

Entering kdb (current=0x00000100e1b477f0, pid 8489) on processor 0 Oops: <NULL>
due to oops @ 0xffffffff801623b8
     r15 = 0xffffffffa024dd47      r14 = 0x0000010000000000
     r13 = 0x0000000000000000      r12 = 0xffffffff8048a0e0
     rbp = 0x00000100e3cec880      rbx = 0x00000100e3cecb70
     r11 = 0x0000000000000001      r10 = 0x0000000100000000
      r9 = 0x00000100e3cecb70       r8 = 0xffffffff803e5ac8
     rax = 0x000000000000002b      rcx = 0xffffffff803e5ac8
     rdx = 0xffffffff803e5ac8      rsi = 0x0000000000000246
     rdi = 0xffffffff8048a0e0 orig_rax = 0xffffffffffffffff
     rip = 0xffffffff801623b8       cs = 0x0000000000000010
  eflags = 0x0000000000010202      rsp = 0x00000100e15d7ec0
      ss = 0x00000100e15d6000 &regs = 0x00000100e15d7e28
[0]kdb>
[forced to `spy' mode by cwsupport]
[0]kdb> bt
Stack traceback for pid 8489
0x00000100e1b477f0     8489     8445  1    0   R  0x00000100e1b47bf0 *modprobe
RSP           RIP                Function (args)
0x100e15d7ec0 0xffffffff801623b8 kmem_cache_create+0x532
0x100e15d7f38 0xffffffffa0243fb0 [dlm]dlm_memory_init+0x80
0x100e15d7f48 0xffffffffa025c01a [dlm]init_module+0x1a
0x100e15d7f58 0xffffffff8014f739 sys_init_module+0x116
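
One reading of the console output above is that the first module unload could not destroy the dlm_lkb cache because objects were still allocated from it, so the cache stayed registered and the next module load tripped over the duplicate name. The following is a minimal user-space sketch of that sequence, not kernel code. All names in it (model_cache, cache_create, cache_destroy and so on) are hypothetical, and where the slab allocator of that era hits a BUG on a duplicate name, this model simply returns NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CACHES 8

/* Hypothetical stand-in for a slab cache: a name plus a live-object count. */
struct model_cache {
    const char *name;
    int live_objects;
    int registered;
};

/* Registry of caches, playing the role of the allocator's cache list. */
static struct model_cache cache_table[MAX_CACHES];

/*
 * Register a cache under `name`.
 * Returns NULL if the name is already registered or the table is full.
 * (Assumption: the real allocator treats a duplicate name as a BUG.)
 */
struct model_cache *cache_create(const char *name)
{
    struct model_cache *slot = NULL;

    for (size_t i = 0; i < MAX_CACHES; i++) {
        if (cache_table[i].registered) {
            if (strcmp(cache_table[i].name, name) == 0)
                return NULL;            /* duplicate cache name */
        } else if (slot == NULL) {
            slot = &cache_table[i];     /* remember the first free slot */
        }
    }
    if (slot == NULL)
        return NULL;                    /* table full */

    slot->name = name;
    slot->live_objects = 0;
    slot->registered = 1;
    return slot;
}

/* Account for one object allocated from the cache. */
void cache_alloc(struct model_cache *c)
{
    assert(c != NULL);
    c->live_objects++;
}

/* Account for one object returned to the cache. */
void cache_free(struct model_cache *c)
{
    assert(c != NULL && c->live_objects > 0);
    c->live_objects--;
}

/*
 * Try to destroy a cache.
 * Returns 0 on success. Returns 1 and leaves the cache registered
 * if objects are still outstanding ("Can't free all objects").
 */
int cache_destroy(struct model_cache *c)
{
    assert(c != NULL);
    if (c->live_objects > 0)
        return 1;
    c->registered = 0;
    return 0;
}
```

In this model, creating "dlm_lkb", allocating one object, and then calling cache_destroy() returns 1; a second cache_create("dlm_lkb") then returns NULL until the outstanding object has been freed and the cache destroyed.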

Comment 1 Brad Walker 2007-05-29 17:07:17 UTC

A suggested fix is to add the following at the end of release_lockspace() in lockspace.c:

                  }
        }

+       spin_lock(&ls->ls_trash_spin);
+       printk("release_lockspace: %d on the trash list\n", ls->ls_trash_count);
+       if (ls->ls_trash_count) {
+               struct dlm_lkb *lkb1, *lkb2;
+               list_for_each_entry_safe(lkb1, lkb2, &ls->ls_trash_list,
+                                        lkb_idtbl_list) {
+                       list_del(&lkb1->lkb_idtbl_list);
+                       free_lkb(lkb1);
+               }
+       }
+       spin_unlock(&ls->ls_trash_spin);
+
        astd_resume();
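
The loop in the patch uses list_for_each_entry_safe, which keeps a second cursor so the current entry can be deleted and freed without losing the way to the next one. A minimal user-space sketch of the same idea follows. It uses a hypothetical singly linked node type rather than the kernel's list_head, and the names node, push and drain are illustrative only.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical list node; the kernel code uses struct list_head instead. */
struct node {
    int id;
    struct node *next;
};

/* Prepend a node; returns the new head (or the old head if malloc fails). */
struct node *push(struct node *head, int id)
{
    struct node *n = malloc(sizeof(*n));

    if (n == NULL)
        return head;
    n->id = id;
    n->next = head;
    return n;
}

/*
 * Free every node on the list and return how many were freed.
 * The next pointer is saved before the current node is freed,
 * which is the point of the "safe" iteration variant.
 */
int drain(struct node **head)
{
    int freed = 0;
    struct node *cur;

    assert(head != NULL);
    cur = *head;
    while (cur != NULL) {
        struct node *next = cur->next;  /* read before free(cur) */
        free(cur);
        cur = next;
        freed++;
    }
    *head = NULL;
    return freed;
}
```

Reading cur->next after free(cur) would be a use-after-free, which is why the plain iteration macro is not suitable when entries are removed inside the loop.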

Comment 2 David Teigland 2007-05-29 17:19:31 UTC

You're evidently running with a debugging patch that was used while working
on bug 199673 (patch in comment 16 of that bug).  That debugging patch is
definitely not suitable for general usage and appears to be the cause of your
problems.  You should remove that patch and update to the most recent version
of the dlm.

Comment 3 David Teigland 2007-05-30 18:08:08 UTC

Please reopen this bug if there's still a problem after removing the patch.