Bug 188525 - dlm_close crashes when initiating a LUN rescan with I/O present.
Status: CLOSED ERRATA
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: dlm
Version: 4
Hardware: All Linux
Priority: medium
Severity: medium
Assigned To: Christine Caulfield
QA Contact: Cluster QE
Reported: 2006-04-10 16:57 EDT by Henry Harris
Modified: 2009-04-16 16:00 EDT

Fixed In Version: RHBA-2006-0558
Doc Type: Bug Fix
Last Closed: 2006-08-10 17:27:18 EDT


Attachments
Proposed fix (2.66 KB, patch)
2006-04-11 03:32 EDT, Christine Caulfield
Description Henry Harris 2006-04-10 16:57:58 EDT
Description of problem:
Under certain conditions, dlm_close crashes with a NULL pointer reference.  This
usually happens after a LUN/device scan operation is initiated to discover, or
rediscover drives.

The problem is in dlm_close, where lkb is dereferenced even though it may be NULL.

The key lines are:

		lkb = dlm_get_lkb(f->fi_ls->ls_lockspace, old_li->li_lksb.sb_lkid);

		/* Don't unlock persistent locks */
		if (lkb && lkb->lkb_flags & GDLM_LKFLG_PERSISTENT) {
			list_del(&old_li->li_ownerqueue);

			/* Update master copy */
			if (lkb->lkb_resource->res_nodeid) {
				li.li_lksb.sb_lkid = lkb->lkb_id;
				status = dlm_lock(f->fi_ls->ls_lockspace,
						lkb->lkb_grmode, &li.li_lksb,
						DLM_LKF_CONVERT|DLM_LKF_ORPHAN,
						NULL, 0, 0, ast_routine, &li,
						NULL, NULL);
				if (status == 0)
					wait_for_ast(&li);
			}
			lkb->lkb_flags |= GDLM_LKFLG_ORPHAN;

			/* But tidy our references in it */
			kfree(old_li);
			lkb->lkb_astparam = (long)NULL;
			put_file_info(f);

			continue;
		}

		clear_bit(LI_FLAG_COMPLETE, &li.li_flags);

		/* If it's not granted then cancel the request.
		 * If the lock was WAITING then it will be dropped,
		 *    if it was converting then it will be reverted to GRANTED,
		 *    then we will unlock it.
		 */
		lock_status = lkb->lkb_status;

The problem is that lkb is dereferenced even though it may be NULL (see the
lock_status line above).  It is clear from the code that dlm_get_lkb may return
NULL, but not all of the code in dlm_close handles this case correctly.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Set up a cluster.
2. Generate I/O traffic using dt.
3. Perform a LUN rescan (using something like
   echo "- - -" > /sys/class/scsi_host/host1/scan).
  
Actual results:
The system crashes with an oops.  The following is a trace captured using KDB:

Unable to handle kernel NULL pointer dereference at 0000000000000004 RIP:
<ffffffffa020b9a6>{:dlm:dlm_close+469}
PML4 2055f067 PGD 0
Oops: 0000 [1] SMP
Entering kdb (current=0x00000100bb54b030, pid 5129) on processor 0 Oops: <NULL>
due to oops @ 0xffffffffa020b9a6
r15 = 0x0000010078127e00 r14 = 0x00000100e20294c0
r13 = 0x0000000000000000 r12 = 0x00000100dc1c73c0
rbp = 0x0000000000000000 rbx = 0x00000100c06a3c40
r11 = 0x0000000000000000 r10 = 0xffffffff8040cf40
r9 = 0x0000000000000208 r8 = 0x00000100d97f0000
rax = 0x0000000000000000 rcx = 0x0000000000000000
rdx = 0x00000100d97f4100 rsi = 0x0000000000010208
rdi = 0x0000000000000000 orig_rax = 0xffffffffffffffff
rip = 0xffffffffa020b9a6 cs = 0x0000000000000010
eflags = 0x0000000000010246 rsp = 0x00000100c06a3b70
ss = 0x00000100c06a2000 &regs = 0x00000100c06a3ad8
[0]kdb> bt
Stack traceback for pid 5129
0x00000100bb54b030 5129 3175 1 0 R 0x00000100bb54b430 *clvmd
RSP RIP Function (args)
0x100c06a3b70 0xffffffffa020b9a6 [dlm]dlm_close+0x1d5 (0x0, 0x1007d512ec0, 0x0, 0x100e7f20d40, 0x1)
0x100c06a3d18 0xffffffff801753c7 __fput+0x63
0x100c06a3f58 0xffffffff8010f50f ptregscall_common+0x67



Expected results:


Additional info:
The faulting location "dlm_close+0x1d5" (the same offset as the "+469" shown in
the oops line) corresponds to this line in dlm_close:
		lock_status = lkb->lkb_status;
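
The fault address 0000000000000004 in the trace is consistent with a field a few bytes into a structure being read through a NULL pointer. The sketch below illustrates that arithmetic only. The struct example_lkb layout and the helper example_status_address are hypothetical and are not the real struct dlm_lkb, whose layout is not shown in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only; NOT the real struct dlm_lkb. */
struct example_lkb {
	uint32_t flags;   /* offset 0 */
	uint16_t status;  /* offset 4 in this made-up layout */
};

/* Compile-time check that the made-up layout places status at offset 4. */
static_assert(offsetof(struct example_lkb, status) == 4,
	      "status is expected at offset 4 in this example layout");

/*
 * Address that would be accessed when reading the status field through a
 * pointer whose value is `base`. With base == 0 (a NULL pointer) the result
 * is just the field offset, which is what shows up as the fault address.
 */
static uintptr_t example_status_address(uintptr_t base)
{
	return base + offsetof(struct example_lkb, status);
}
```

With this assumed layout, example_status_address(0) evaluates to 4, matching the form of the address reported in the oops.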
Comment 1 Christine Caulfield 2006-04-11 03:32:31 EDT
Created attachment 127599 [details]
Proposed fix

Yes, that looks like a fair assessment.
If the lkb doesn't exist then there's no point in attempting to do the unlock
at all (see patch).

I imagine this could happen where close is called very soon after an unlock.
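
As a rough illustration of the idea described above (skip the unlock when the lkb no longer exists), here is a small user-space model. It is not the attached patch: struct model_lkb, model_get_lkb and model_close are hypothetical stand-ins for the kernel structures and functions, and whatever per-entry cleanup the real fix performs is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a lock block. */
struct model_lkb {
	uint32_t id;
	int status;
};

/*
 * Look up a lock by id. Returns NULL when no such lock exists, in the same
 * way the report says dlm_get_lkb may return NULL.
 */
static struct model_lkb *model_get_lkb(struct model_lkb *table, size_t n,
				       uint32_t id)
{
	assert(table != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (table[i].id == id)
			return &table[i];
	return NULL;
}

/*
 * Walk the lock ids owned by a closing file and count how many would be
 * unlocked. Ids whose lkb has already gone are skipped instead of being
 * dereferenced.
 */
static size_t model_close(struct model_lkb *table, size_t n,
			  const uint32_t *owned, size_t n_owned)
{
	size_t unlocked = 0;

	for (size_t i = 0; i < n_owned; i++) {
		struct model_lkb *lkb = model_get_lkb(table, n, owned[i]);

		if (!lkb)
			continue;	/* lock already gone: nothing to unlock */

		int lock_status = lkb->status;	/* safe: lkb is non-NULL here */
		(void)lock_status;
		unlocked++;
	}
	return unlocked;
}
```

For example, with a table holding ids 1 and 3 and a file that owns ids 1, 2 and 3, model_close skips id 2 and reports two unlocks rather than crashing on the missing entry.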
Comment 2 Christine Caulfield 2006-04-12 05:30:17 EDT
Fixed for RHEL4

Checking in device.c;
/cvs/cluster/cluster/dlm-kernel/src/device.c,v  <--  device.c
new revision: 1.24.2.8; previous revision: 1.24.2.7
done
Comment 5 Red Hat Bugzilla 2006-08-10 17:27:20 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0558.html
