Description of problem:
Running dlmlock (which attempts to get a lock in a lockspace, attempts to get the same lock again, then releases the lockspace) causes the dlmlock program to hang on an SMP kernel. Running the test again while the first copy is hung results in an Oops (note: I had to run dlmlock a few times to hit the Oops). On a UP kernel the machine just "reboots," with no stack trace.

Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
f8a88ad2
*pde = f23dc067
Oops: 0000 [#1]
SMP
Modules linked in: gnbd lock_gulm lock_nolock lock_dlm dlm cman gfs lock_harness lpfc scsi_transport_fc sd_mod ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy sg scsi_mod microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac ext3 jbd
CPU:    2
EIP:    0060:[<f8a88ad2>]    Not tainted
EFLAGS: 00010282   (2.6.7)
EIP is at dlm_close+0x162/0x270 [dlm]
eax: ffffffea   ebx: ffffffea   ecx: 00000017   edx: f632ec00
esi: 00000000   edi: ffffffd8   ebp: f7b43394   esp: f5b0dec4
ds: 007b   es: 007b   ss: 0068
Process dlmlock (pid: 3904, threadinfo=f5b0c000 task=f78d8bd0)
Stack: f5b0df2c f5b0df28 00000000 f63aaed8 ffffffd8 00000000 00000000 ffffffff
       ffffffff 00000000 f78d8bd0 c011f600 00000000 00000000 f5b0df4c f5b0df4c
       00000000 00000000 f78d8bd0 c011f600 f5b0df40 f5b0df40 f5905670 00000000
Call Trace:
 [<c011f600>] default_wake_function+0x0/0x10
 [<c011f600>] default_wake_function+0x0/0x10
 [<c015c7c6>] __fput+0xf6/0x110
 [<c015afcf>] filp_close+0x4f/0x80
 [<c0105e3d>] sysenter_past_esp+0x52/0x71

Code: 8b 47 28 83 e8 28 89 44 24 10 8d 47 28 39 e8 75 8d 8d 54 24
Segmentation fault

Version-Release number of selected component (if applicable):
DLM <CVS> (built Aug 17 2004 09:37:25) installed

How reproducible:
Always

Steps to Reproduce:
1. ./dlmlock
2. Above will hang at iteration #3.
3. ./dlmlock (repeat a few times and the Oops will trip)

Additional info:

dlmlock.c

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <limits.h>
    #include <assert.h>
    #include <pthread.h>
    #include <sys/types.h>
    #include <netdb.h>

    #define _REENTRANT
    #include <libdlm.h>

    #define NAMESPACE "dlmtest"

    void eat_locks(int attempts);

    static dlm_lshandle_t mylockspace;

    int main(void)
    {
        int i;

        /* Create the lockspace, take locks in it, then release it; 10 rounds. */
        for (i = 1; i <= 10; i++) {
            printf("eat_locks(%d)\n", i);
            mylockspace = NULL;
            if ((mylockspace = dlm_create_lockspace(NAMESPACE, 0600)) == NULL) {
                fprintf(stderr, "dlm_create_lockspace() failed.\n");
                exit(1);
            }
            dlm_ls_pthread_init(mylockspace);
            eat_locks(i);
            if (dlm_release_lockspace(NAMESPACE, mylockspace, 0) < 0) {
                fprintf(stderr, "dlm_release_lockspace() failed.\n");
                exit(1);
            }
            dlm_close_lockspace(mylockspace);
        }
        return 0;
    }

    void eat_locks(int attempts)
    {
        struct dlm_range range;
        struct dlm_lksb lksb;
        int retval;
        int i;

        for (i = 0; i < attempts; i+=2) {
            range.ra_start = i;
            range.ra_end = i+1;

            /* First request for the "bogus" lock. */
            retval = dlm_ls_lock_wait(mylockspace, LKM_EXMODE, &lksb,
                                      LKF_NOQUEUE, "bogus", strlen("bogus"),
                                      0, NULL, NULL, &range);
            fprintf(stderr, "eat_locks: (1)dlm_ls_lock_wait returned %d\n", retval);
            fprintf(stderr, "eat_locks: (1)lksb.sb_status is %d\n", lksb.sb_status);

            /* Second request for the same lock, same mode and range. */
            retval = dlm_ls_lock_wait(mylockspace, LKM_EXMODE, &lksb,
                                      LKF_NOQUEUE, "bogus", strlen("bogus"),
                                      0, NULL, NULL, &range);
            fprintf(stderr, "eat_locks: (2)dlm_ls_lock_wait returned %d\n", retval);
            fprintf(stderr, "eat_locks: (2)lksb.sb_status is %d\n", lksb.sb_status);
        }
        return;
    }
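One possible reading of the Oops above, not checked against the dlm source: assuming the Code bytes start at the faulting EIP, `8b 47 28` decodes to `mov 0x28(%edi),%eax`, and with edi = ffffffd8 that load touches address 00000000, which matches the reported fault address. The following `83 e8 28` (`sub $0x28,%eax`) is the kind of adjustment a list_entry()-style macro produces, so a list walk that followed a NULL link is a plausible cause. The sketch below only reproduces that 32-bit arithmetic; the function names and the idea that 0x28 is a list member offset are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Displacement taken from the faulting instruction in the Oops
 * (8b 47 28 = mov 0x28(%edi),%eax). Treating it as the offset of a
 * list member inside its containing struct is an assumption. */
#define LIST_MEMBER_OFFSET 0x28u

/* 32-bit equivalent of list_entry(): step back from the embedded
 * member to the start of the containing struct. */
uint32_t entry_from_member(uint32_t member_addr)
{
    return member_addr - LIST_MEMBER_OFFSET;
}

/* Address touched when reading the first word of entry->member
 * (for a list_head, that first word is the next pointer). */
uint32_t member_addr_of(uint32_t entry_addr)
{
    return entry_addr + LIST_MEMBER_OFFSET;
}

/* Check the numbers against the register dump in the report. */
void verify_oops_addresses(void)
{
    /* A NULL member pointer turns into the edi value in the dump. */
    assert(entry_from_member(0x00000000u) == 0xffffffd8u);
    /* Reading the member of that bogus entry wraps around to address 0. */
    assert(member_addr_of(0xffffffd8u) == 0x00000000u);
}
```

If this reading is right, the walk in dlm_close picked up an entry whose link had already been cleared or freed, which would fit a lifetime problem with whatever was on that list.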
Checking in device.c;
/cvs/cluster/cluster/dlm-kernel/src/device.c,v  <--  device.c
new revision: 1.13; previous revision: 1.12
done
Checking in dlm_internal.h;
/cvs/cluster/cluster/dlm-kernel/src/dlm_internal.h,v  <--  dlm_internal.h
new revision: 1.17; previous revision: 1.16
done

Don't hang lkbs off the ownerqueue list, as we don't have any control over their lifetime. Now that LKBs are destroyed before the ASTs are run, this causes real problems. The ownerqueue is now strung through the lock_info structs themselves, and we free those up when we can see that the lkb has been removed by the DLM core.
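The idea in the commit message can be illustrated with a small userspace sketch: the per-file owner queue links through bookkeeping structs that the device layer allocates and frees itself, so nothing on the queue can be destroyed behind its back. This is not the code from device.c. The struct layout, the `lkb_gone` flag, and the names `file_info`, `track_lock`, `reap_removed` and `queue_length` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal doubly linked list, in the style of the kernel's list_head. */
struct list_node { struct list_node *next, *prev; };

/* Hypothetical per-lock bookkeeping owned by the device layer. */
struct lock_info {
    struct list_node ownerqueue; /* link in the per-file owner queue */
    int lkid;                    /* lock id handed back by the core */
    int lkb_gone;                /* set once the core has removed the lkb */
};

/* Hypothetical per-open-file state holding the queue head. */
struct file_info { struct list_node locks; };

#define lock_info_of(ptr) \
    ((struct lock_info *)((char *)(ptr) - offsetof(struct lock_info, ownerqueue)))

void list_init(struct list_node *h) { h->next = h->prev = h; }

void list_add_tail(struct list_node *n, struct list_node *h)
{
    assert(n != NULL && h != NULL);
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

void list_del(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = n->prev = NULL;
}

/* Allocate a lock_info and queue it; the queue never points at an lkb. */
struct lock_info *track_lock(struct file_info *f, int lkid)
{
    struct lock_info *li = calloc(1, sizeof(*li));
    if (li == NULL)
        return NULL;
    li->lkid = lkid;
    list_add_tail(&li->ownerqueue, &f->locks);
    return li;
}

/* Free every entry whose lkb the core has removed; returns the count freed. */
int reap_removed(struct file_info *f)
{
    int freed = 0;
    struct list_node *pos = f->locks.next;

    while (pos != &f->locks) {
        struct list_node *next = pos->next; /* save before unlinking */
        struct lock_info *li = lock_info_of(pos);
        if (li->lkb_gone) {
            list_del(pos);
            free(li);
            freed++;
        }
        pos = next;
    }
    return freed;
}

/* Number of entries still on the owner queue. */
int queue_length(const struct file_info *f)
{
    int n = 0;
    for (const struct list_node *p = f->locks.next; p != &f->locks; p = p->next)
        n++;
    return n;
}
```

Because every node on the queue is memory the device layer allocated, a close-time walk only ever touches structs that are still alive, whatever the core has done with the underlying lkbs.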
Test case now passes.
Updating version to the right level in the defects. Sorry for the storm.