188525 – dlm_close crashes when initiating a LUN rescan with I/O present.

Summary: dlm_close crashes when initiating a LUN rescan with I/O present.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	dlm
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Christine Caulfield
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2006-04-10 20:57 UTC by Henry Harris
Modified:	2009-04-16 20:00 UTC
CC List:	2 users
Fixed In Version:	RHBA-2006-0558
Clone Of:
Environment:
Last Closed:	2006-08-10 21:27:18 UTC
Embargoed:

Attachments
Proposed fix (2.66 KB, patch) 2006-04-11 07:32 UTC, Christine Caulfield	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2006:0558	0	normal	SHIPPED_LIVE	dlm-kernel bug fix update	2006-08-10 04:00:00 UTC

Description Henry Harris 2006-04-10 20:57:58 UTC

Description of problem:
Under certain conditions, dlm_close crashes with a NULL pointer dereference.  This
usually happens after a LUN/device scan operation is initiated to discover or
rediscover drives.

The problem can be found in dlm_close, where lkb is used even though it is NULL.

The key lines are:

		lkb = dlm_get_lkb(f->fi_ls->ls_lockspace, old_li->li_lksb.sb_lkid);

		/* Don't unlock persistent locks */
		if (lkb && lkb->lkb_flags & GDLM_LKFLG_PERSISTENT) {
			list_del(&old_li->li_ownerqueue);

			/* Update master copy */
			if (lkb->lkb_resource->res_nodeid) {
				li.li_lksb.sb_lkid = lkb->lkb_id;
				status = dlm_lock(f->fi_ls->ls_lockspace,
						lkb->lkb_grmode, &li.li_lksb,
						DLM_LKF_CONVERT|DLM_LKF_ORPHAN,
						NULL, 0, 0, ast_routine, &li,
						NULL, NULL);
				if (status == 0)
					wait_for_ast(&li);
			}
			lkb->lkb_flags |= GDLM_LKFLG_ORPHAN;

			/* But tidy our references in it */
			kfree(old_li);
			lkb->lkb_astparam = (long)NULL;
			put_file_info(f);

			continue;
		}

		clear_bit(LI_FLAG_COMPLETE, &li.li_flags);

		/* If it's not granted then cancel the request.
		 * If the lock was WAITING then it will be dropped,
		 *    if it was converting then it will be reverted to GRANTED,
		 *    then we will unlock it.
		 */
		lock_status = lkb->lkb_status;

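The problem is that lkb is used even though it may be NULL (see the lock_status
line above).  The persistent-lock check tests lkb before using it, but when lkb
is NULL that branch is skipped and execution falls through to the lock_status
line, which dereferences it unconditionally.  It seems clear from the code that
dlm_get_lkb may return NULL, but not all of the code in dlm_close handles this
case correctly.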

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Set up a cluster.
2. Generate traffic using dt.
3. Perform a LUN rescan (using something like
   echo "- - -" > /sys/class/scsi_host/host1/scan)
  
Actual results:
The system crashes with an oops.  The following is a trace using KDB:

Unable to handle kernel NULL pointer dereference at 0000000000000004 RIP:
<ffffffffa020b9a6>{:dlm:dlm_close+469}
PML4 2055f067 PGD 0
Oops: 0000 [1] SMP
Entering kdb (current=0x00000100bb54b030, pid 5129) on processor 0 Oops: <NULL>
due to oops @ 0xffffffffa020b9a6
r15 = 0x0000010078127e00 r14 = 0x00000100e20294c0
r13 = 0x0000000000000000 r12 = 0x00000100dc1c73c0
rbp = 0x0000000000000000 rbx = 0x00000100c06a3c40
r11 = 0x0000000000000000 r10 = 0xffffffff8040cf40
r9 = 0x0000000000000208 r8 = 0x00000100d97f0000
rax = 0x0000000000000000 rcx = 0x0000000000000000
rdx = 0x00000100d97f4100 rsi = 0x0000000000010208
rdi = 0x0000000000000000 orig_rax = 0xffffffffffffffff
rip = 0xffffffffa020b9a6 cs = 0x0000000000000010
eflags = 0x0000000000010246 rsp = 0x00000100c06a3b70
ss = 0x00000100c06a2000 &regs = 0x00000100c06a3ad8
[0]kdb> bt
Stack traceback for pid 5129
0x00000100bb54b030 5129 3175 1 0 R 0x00000100bb54b430 *clvmd
RSP RIP Function (args)
0x100c06a3b70 0xffffffffa020b9a6 [dlm]dlm_close+0x1d5 (0x0, 0x1007d512ec0, 0x0, 0x100e7f20d40, 0x1)
0x100c06a3d18 0xffffffff801753c7 __fput+0x63
0x100c06a3f58 0xffffffff8010f50f ptregscall_common+0x67



Expected results:


Additional info:
The dlm location for "dlm_close+0x1d5" corresponds to the line:
lock_status = lkb->lkb_status;
in dlm_close.
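
The faulting address in the oops (0000000000000004) is a small value rather than
zero, which is what one would expect if a structure member at a small offset is
read through a NULL base pointer.  The sketch below illustrates only that
arithmetic.  The struct demo_lkb_layout type, its field order, and the
demo_status_addr helper are hypothetical and are not taken from the dlm source;
the real layout of the lkb may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only; not the real dlm lkb. */
struct demo_lkb_layout {
	uint32_t flags;		/* offset 0 in this sketch */
	uint16_t status;	/* offset 4 in this sketch */
};

/*
 * Returns the address that a load of ->status would touch for a given
 * base pointer value.  With base == 0 (a NULL pointer) the result is
 * just the member offset.
 */
static uintptr_t demo_status_addr(uintptr_t base)
{
	return base + offsetof(struct demo_lkb_layout, status);
}
```

Under this assumed layout, demo_status_addr(0) evaluates to the member offset,
so a read of the status field through a NULL pointer would fault at a small
non-zero address.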

Comment 1 Christine Caulfield 2006-04-11 07:32:31 UTC

Created attachment 127599
Proposed fix

Yes, that looks like a fair assessment.
If the lkb doesn't exist then there's no point in attempting to do the unlock
at all (see patch).

I imagine this could happen where close is called very soon after an unlock.
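
The attached patch is not reproduced in this report.  As a rough illustration
of the idea described above (skip the unlock when the lookup finds nothing),
here is a small userspace sketch.  Everything in it is hypothetical: the
demo_lkb type, the demo_table contents, and the demo_get_lkb and demo_close_all
helpers are stand-ins and do not correspond to the real device.c code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a lock block; not the real dlm structure. */
struct demo_lkb {
	uint32_t id;
	int status;
};

#define DEMO_TABLE_SIZE 4

static struct demo_lkb demo_table[DEMO_TABLE_SIZE] = {
	{ 1, 10 }, { 2, 20 }, { 3, 30 }, { 4, 40 },
};

/* Models a lookup that returns NULL when the id is no longer present. */
static struct demo_lkb *demo_get_lkb(uint32_t id)
{
	size_t i;

	for (i = 0; i < DEMO_TABLE_SIZE; i++)
		if (demo_table[i].id == id)
			return &demo_table[i];
	return NULL;
}

/*
 * Walks a list of lock ids the way a close path might and sums the
 * status of each lock it finds.  Returns the number of ids skipped
 * because the lookup failed.
 */
static int demo_close_all(const uint32_t *ids, size_t n, int *status_sum)
{
	int skipped = 0;
	size_t i;

	assert(status_sum != NULL);
	*status_sum = 0;

	for (i = 0; i < n; i++) {
		struct demo_lkb *lkb = demo_get_lkb(ids[i]);

		/* Guard: the lock is already gone, so there is nothing to unlock. */
		if (!lkb) {
			skipped++;
			continue;
		}

		/* Safe: lkb is known to be non-NULL from here on. */
		*status_sum += lkb->status;
	}
	return skipped;
}
```

Calling demo_close_all with an id list that contains a stale id exercises the
guard: the stale entry is counted as skipped and the loop carries on with the
remaining ids instead of dereferencing a NULL pointer.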

Comment 2 Christine Caulfield 2006-04-12 09:30:17 UTC

Fixed for RHEL4

Checking in device.c;
/cvs/cluster/cluster/dlm-kernel/src/device.c,v  <--  device.c
new revision: 1.24.2.8; previous revision: 1.24.2.7
done

Comment 5 Red Hat Bugzilla 2006-08-10 21:27:20 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0558.html
