Description of problem:

Under certain conditions, dlm_close crashes with a NULL pointer dereference. This usually happens after a LUN/device scan operation is initiated to discover or rediscover drives.

The problem is in dlm_close, where lkb is used even though it is NULL. The key lines are:

    lkb = dlm_get_lkb(f->fi_ls->ls_lockspace, old_li->li_lksb.sb_lkid);

    /* Don't unlock persistent locks */
    if (lkb && lkb->lkb_flags & GDLM_LKFLG_PERSISTENT) {
        list_del(&old_li->li_ownerqueue);

        /* Update master copy */
        if (lkb->lkb_resource->res_nodeid) {
            li.li_lksb.sb_lkid = lkb->lkb_id;
            status = dlm_lock(f->fi_ls->ls_lockspace,
                              lkb->lkb_grmode, &li.li_lksb,
                              DLM_LKF_CONVERT|DLM_LKF_ORPHAN,
                              NULL, 0, 0, ast_routine, &li,
                              NULL, NULL);
            if (status == 0)
                wait_for_ast(&li);
        }
        lkb->lkb_flags |= GDLM_LKFLG_ORPHAN;

        /* But tidy our references in it */
        kfree(old_li);
        lkb->lkb_astparam = (long)NULL;
        put_file_info(f);
        continue;
    }

    clear_bit(LI_FLAG_COMPLETE, &li.li_flags);

    /* If it's not granted then cancel the request.
     * If the lock was WAITING then it will be dropped,
     * if it was converting then it will be reverted to GRANTED,
     * then we will unlock it.
     */
    lock_status = lkb->lkb_status;

The first use of lkb is guarded by a NULL check (the "if (lkb && ..." test), but the lock_status line at the end dereferences lkb unconditionally. It seems clear from the code that dlm_get_lkb may return NULL, but not all of the code in dlm_close handles this case correctly.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Set up a cluster.
2. Generate traffic using dt.
3. Perform a LUN rescan (using something like echo "- - -" > /sys/class/scsi_host/host1/scan).

Actual results:

The system crashes with an oops.
The following is a trace using KDB:

    Unable to handle kernel NULL pointer dereference at 0000000000000004 RIP:
    <ffffffffa020b9a6>{:dlm:dlm_close+469}
    PML4 2055f067 PGD 0
    Oops: 0000 [1] SMP

    Entering kdb (current=0x00000100bb54b030, pid 5129) on processor 0 Oops: <NULL>
    due to oops @ 0xffffffffa020b9a6
    r15 = 0x0000010078127e00 r14 = 0x00000100e20294c0 r13 = 0x0000000000000000
    r12 = 0x00000100dc1c73c0 rbp = 0x0000000000000000 rbx = 0x00000100c06a3c40
    r11 = 0x0000000000000000 r10 = 0xffffffff8040cf40 r9 = 0x0000000000000208
    r8 = 0x00000100d97f0000 rax = 0x0000000000000000 rcx = 0x0000000000000000
    rdx = 0x00000100d97f4100 rsi = 0x0000000000010208 rdi = 0x0000000000000000
    orig_rax = 0xffffffffffffffff rip = 0xffffffffa020b9a6 cs = 0x0000000000000010
    eflags = 0x0000000000010246 rsp = 0x00000100c06a3b70 ss = 0x00000100c06a2000
    &regs = 0x00000100c06a3ad8
    [0]kdb> bt
    Stack traceback for pid 5129
    0x00000100bb54b030 5129 3175 1 0 R 0x00000100bb54b430 *clvmd
    RSP RIP Function (args)
    0x100c06a3b70 0xffffffffa020b9a6 [dlm]dlm_close+0x1d5 (0x0, 0x1007d512ec0, 0x0, 0x100e7f20d40, 0x1)
    0x100c06a3d18 0xffffffff801753c7 __fput+0x63
    0x100c06a3f58 0xffffffff8010f50f ptregscall_common+0x67

Expected results:

Additional info:

The dlm location for "dlm_close+0x1d5" corresponds to the following line in dlm_close:

    lock_status = lkb->lkb_status;
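The fault address in the oops (0x4) is what one would expect when a small-offset field is read through a NULL pointer: the CPU faults at NULL plus the field offset. The sketch below illustrates that arithmetic. Note that struct sketch_lkb is a hypothetical two-field layout chosen for illustration; the real lkb structure in dlm-kernel is not shown in this report, so the offset of lkb_status there is an assumption. fault_address is likewise an illustrative helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, NOT the real dlm-kernel structure: a 32-bit
 * flags word followed by a 16-bit status places status at offset 4. */
struct sketch_lkb {
    uint32_t lkb_flags;   /* offset 0 */
    uint16_t lkb_status;  /* offset 4 */
};

/* Compile-time check that the illustrative layout has the offset we claim. */
static_assert(offsetof(struct sketch_lkb, lkb_status) == 4,
              "status is expected at offset 4 in this sketch");

/* Address the CPU would touch when reading a field at field_offset
 * through a pointer whose value is base. */
static uintptr_t fault_address(uintptr_t base, size_t field_offset)
{
    return base + field_offset;
}
```

With base set to 0 (a NULL lkb) and the offset of lkb_status in this sketch, fault_address yields 0x4, matching the address reported in the oops. The two forms of the instruction offset in the trace also agree: dlm_close+469 in decimal is dlm_close+0x1d5 in hexadecimal.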
Created attachment 127599
Proposed fix

Yes, that looks like a fair assessment. If the lkb doesn't exist then there's no point in attempting to do the unlock at all (see patch). I imagine this could happen where close is called very soon after an unlock.
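The idea in the comment above (skip the unlock entirely when the lookup returns NULL) can be sketched in user space as follows. This is not the attached patch, whose contents are not reproduced in this report; it is a minimal illustration of the guard. All names here (mock_lkb, mock_lock_info, mock_get_lkb, close_one_lock, and the close_action values) are hypothetical stand-ins for the kernel structures and for dlm_get_lkb.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct mock_lkb {
    uint32_t flags;
    uint16_t status;
};

struct mock_lock_info {
    uint32_t lkid;   /* lock ID, 1-based in this sketch */
};

enum close_action { CLOSE_SKIPPED, CLOSE_UNLOCKED };

/* Hypothetical lookup: returns NULL when the lock ID is not in the
 * table, mirroring the fact that dlm_get_lkb may return NULL. */
static struct mock_lkb *mock_get_lkb(struct mock_lkb *table, size_t n,
                                     uint32_t lkid)
{
    if (lkid == 0 || lkid > n)
        return NULL;
    return &table[lkid - 1];
}

/* Handle one lock during close. If the lkb no longer exists there is
 * nothing to unlock, so return before any dereference. */
static enum close_action close_one_lock(struct mock_lkb *table, size_t n,
                                        const struct mock_lock_info *li,
                                        uint16_t *status_out)
{
    struct mock_lkb *lkb;

    assert(li != NULL && status_out != NULL);

    lkb = mock_get_lkb(table, n, li->lkid);
    if (lkb == NULL)
        return CLOSE_SKIPPED;      /* lock already gone: skip the unlock */

    *status_out = lkb->status;     /* safe: lkb is known to be non-NULL */
    return CLOSE_UNLOCKED;
}
```

Calling close_one_lock with an ID that is present copies the status and reports CLOSE_UNLOCKED; calling it with an ID that is absent returns CLOSE_SKIPPED and leaves status_out untouched. In the kernel loop the equivalent of the early return would presumably also need to release whatever per-lock bookkeeping the loop owns, which this sketch does not model.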
Fixed for RHEL4

Checking in device.c;
/cvs/cluster/cluster/dlm-kernel/src/device.c,v  <--  device.c
new revision: 1.24.2.8; previous revision: 1.24.2.7
done
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0558.html