Bug 188525 - dlm_close crashes when initiating a LUN rescan with I/O present.
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: dlm
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Christine Caulfield
QA Contact: Cluster QE
Reported: 2006-04-10 16:57 EDT by Henry Harris
Modified: 2009-04-16 16:00 EDT
Fixed In Version: RHBA-2006-0558
Doc Type: Bug Fix
Last Closed: 2006-08-10 17:27:18 EDT

Attachments
Proposed fix (2.66 KB, patch)
2006-04-11 03:32 EDT, Christine Caulfield

Description Henry Harris 2006-04-10 16:57:58 EDT
Description of problem:
Under certain conditions, dlm_close crashes with a NULL pointer dereference.  This
usually happens after a LUN/device scan operation is initiated to discover or
rediscover drives.

The problem is in dlm_close, where lkb is used even though it is NULL.

The key lines are:

		lkb = dlm_get_lkb(f->fi_ls->ls_lockspace, old_li->li_lksb.sb_lkid);

		/* Don't unlock persistent locks */
		if (lkb && lkb->lkb_flags & GDLM_LKFLG_PERSISTENT) {

			/* Update master copy */
			if (lkb->lkb_resource->res_nodeid) {
				li.li_lksb.sb_lkid = lkb->lkb_id;
				status = dlm_lock(f->fi_ls->ls_lockspace,
						lkb->lkb_grmode, &li.li_lksb,
						NULL, 0, 0, ast_routine, &li,
						NULL, NULL);
				if (status == 0)
					[...]
			}
			lkb->lkb_flags |= GDLM_LKFLG_ORPHAN;

			/* But tidy our references in it */
			lkb->lkb_astparam = (long)NULL;
			[...]
		}

		clear_bit(LI_FLAG_COMPLETE, &li.li_flags);

		/* If it's not granted then cancel the request.
		 * If the lock was WAITING then it will be dropped,
		 *    if it was converting then it will be reverted to GRANTED,
		 *    then we will unlock it.
		 */
		lock_status = lkb->lkb_status;

The problem is that lkb is used even though it may be NULL (see the lock_status
line above).  It seems clear from the code that dlm_get_lkb may return NULL,
but not all of the code in dlm_close handles this case correctly.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. set up a cluster
2. generate traffic using dt
3. perform a LUN rescan (using something like echo "- - -" >
Actual results:
The system crashes with an oops.  The following is a trace using KDB:

Unable to handle kernel NULL pointer dereference at 0000000000000004 RIP:
<ffffffffa020b9a6>{:dlm:dlm_close+469}
PML4 2055f067 PGD 0
Oops: 0000 [1] SMP
Entering kdb (current=0x00000100bb54b030, pid 5129) on processor 0 Oops: <NULL>
due to oops @ 0xffffffffa020b9a6
r15 = 0x0000010078127e00
r14 = 0x00000100e20294c0
r13 = 0x0000000000000000
r12 = 0x00000100dc1c73c0
rbp = 0x0000000000000000
rbx = 0x00000100c06a3c40
r11 = 0x0000000000000000
r10 = 0xffffffff8040cf40
r9 = 0x0000000000000208
r8 = 0x00000100d97f0000
rax = 0x0000000000000000
rcx = 0x0000000000000000
rdx = 0x00000100d97f4100
rsi = 0x0000000000010208
rdi = 0x0000000000000000
orig_rax = 0xffffffffffffffff
rip = 0xffffffffa020b9a6
cs = 0x0000000000000010
eflags = 0x0000000000010246
rsp = 0x00000100c06a3b70
ss = 0x00000100c06a2000
&regs = 0x00000100c06a3ad8
[0]kdb> bt
Stack traceback for pid 5129
0x00000100bb54b030 5129 3175 1 0 R 0x00000100bb54b430 *clvmd
RSP RIP Function (args)
0x100c06a3b70 0xffffffffa020b9a6 [dlm]dlm_close+0x1d5 (0x0, 0x1007d512ec0, 0x0, 0x100e7f20d40, 0x1)
0x100c06a3d18 0xffffffff801753c7 __fput+0x63
0x100c06a3f58 0xffffffff8010f50f ptregscall_common+0x67

Expected results:

Additional info:
The dlm location for "dlm_close+0x1d5" corresponds to the line:
lock_status = lkb->lkb_status;
in dlm_close.
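
The numbers in the trace are consistent with that reading: the faulting address is 0x4 rather than 0, which is what a read of a field at offset 4 through a NULL base pointer would produce, and dlm_close+469 (decimal) is the same location as dlm_close+0x1d5. A minimal sketch of that arithmetic follows. The struct layout and the names sketch_lkb, status_field_address and matches_report are assumptions made up for illustration; they are not taken from the dlm headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, NOT copied from the dlm headers: it is chosen only
 * so that lkb_status lands at offset 4, which is what the oops suggests. */
struct sketch_lkb {
	uint32_t lkb_flags;   /* offset 0 */
	uint16_t lkb_status;  /* offset 4 */
};

/* Address that a read of base->lkb_status would touch, computed without
 * dereferencing anything. */
static uintptr_t status_field_address(uintptr_t base)
{
	return base + offsetof(struct sketch_lkb, lkb_status);
}

/* Returns 1 when the sketch agrees with the numbers quoted in the report. */
static int matches_report(void)
{
	return status_field_address(0) == 0x4  /* faulting address in the oops */
	    && 469 == 0x1d5;                   /* dlm_close+469 vs dlm_close+0x1d5 */
}
```

Calling status_field_address(0) models lkb == NULL (rax is 0 in the register dump); under the assumed layout it yields 0x4.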
Comment 1 Christine Caulfield 2006-04-11 03:32:31 EDT
Created attachment 127599 [details]
Proposed fix

Yes, that looks like a fair assessment.
If the lkb doesn't exist then there's no point in attempting to do the unlock
at all (see patch).

I imagine this could happen where close is called very soon after an unlock.
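
For readers without access to the attachment, the idea of skipping the unlock when the lookup finds nothing can be sketched in userspace C as below. This is an illustration only, not the attached patch: sketch_lkb, sketch_table, sketch_get_lkb and sketch_close are hypothetical stand-ins for the dlm structures and functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a lock block; not the real dlm structure. */
struct sketch_lkb {
	unsigned int id;
	int status;
};

#define SKETCH_TABLE_SIZE 4

/* Hypothetical lock table; a NULL slot means "no such lock any more". */
static struct sketch_lkb *sketch_table[SKETCH_TABLE_SIZE];

/* Like the lookup quoted in the report, this may return NULL. */
static struct sketch_lkb *sketch_get_lkb(unsigned int id)
{
	for (size_t i = 0; i < SKETCH_TABLE_SIZE; i++)
		if (sketch_table[i] && sketch_table[i]->id == id)
			return sketch_table[i];
	return NULL;
}

/* Walk the lock ids still recorded at close time and return how many
 * were processed. Ids whose lkb has already gone away are skipped. */
static int sketch_close(const unsigned int *ids, size_t n)
{
	int processed = 0;

	for (size_t i = 0; i < n; i++) {
		struct sketch_lkb *lkb = sketch_get_lkb(ids[i]);

		if (!lkb)
			continue;	/* nothing to unlock, so never read lkb->status */

		int lock_status = lkb->status;
		(void)lock_status;	/* real code would cancel or unlock based on this */
		processed++;
	}
	return processed;
}
```

With one lock present in sketch_table and a second id that has already been unlocked, sketch_close processes the first and skips the second instead of dereferencing NULL.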
Comment 2 Christine Caulfield 2006-04-12 05:30:17 EDT
Fixed for RHEL4

Checking in device.c;
/cvs/cluster/cluster/dlm-kernel/src/device.c,v  <--  device.c
new revision:; previous revision:
Comment 5 Red Hat Bugzilla 2006-08-10 17:27:20 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

