From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040116

Description of problem:

Unable to handle kernel NULL pointer dereference at virtual address 00000004
 printing eip:
c01fbb68
*pde = 00000000
Oops: 0000
nbd raid1 mousedev input parport_pc lp parport autofs audit e100 floppy sg micr
CPU:    0
EIP:    0060:[<c01fbb68>]    Not tainted
EFLAGS: 00010292
EIP is at md_do_recovery [kernel] 0x78 (2.4.21-15.EL/i686)
eax: 00000016   ebx: de634d80   ecx: c171e000   edx: 00000073
esi: 00000000   edi: d54ca000   ebp: fffffffc   esp: c171ff6c
ds: 0068   es: 0068   ss: 0068
Process mdrecoveryd (pid: 8, stackpage=c171f000)
Stack: d7154580 c171ff7c 00000000 d71545d4 d54ca300 c171e000 c25aadc0 c171ffb4
       c25aadc8 c01fa877 00000000 c029b4aa 00000000 00000000 c171e000 00000000
       00000000 00000000 c171e000 c0341820 dffedfb0 00000000 c171e000 00000000
Call Trace:   [<c01fa877>] md_thread [kernel] 0xe7 (0xc171ff90)
[<c01fa790>] md_thread [kernel] 0x0 (0xc171ffe0)
[<c010945d>] kernel_thread_helper [kernel] 0x5 (0xc171fff0)
Code: 81 7e 04 10 33 38 c0 75 9f 83 c4 14 5b 5e 5f 5d c3 8d b4 26
Kernel panic: Fatal exception

The above panic occurs at linux-2.4.21-15.EL/drivers/md/md.c, line 3694:

void md_do_recovery(void *data)
{
	int err;
	mddev_t *mddev;
	mdp_super_t *sb;
	mdp_disk_t *spare;
	struct md_list_head *tmp;

	dprintk(KERN_INFO "md: recovery thread got woken up ...\n");
restart:
	ITERATE_MDDEV(mddev,tmp) {
	^^^^^^^^^^^^^^^^^^^^^^^^^^

ITERATE_MDDEV does not contain the necessary locking to ensure that the
device list does not change while it's being iterated over.

The panic is easily reproducible if 2 md devices are configured on a system
and they are stopped while one of the devices is in recovery. There are, no
doubt, other related panics/oopses that can occur due to the lack of locking
around access to the device lists in the 2.4 kernel md driver.

Version-Release number of selected component (if applicable):
kernel-2.4

How reproducible:
Always

Steps to Reproduce:
1. create 2 md raid1 devices
2. allow the first device to complete recovery
3. allow the second device to begin recovery
4. stop the first device while the second device is in recovery
5. stop the second device while it is in recovery
6. the kernel panics in md_do_recovery()

Actual Results:  kernel panic in md_do_recovery()

Expected Results:  md device stops

Additional info:

The 2.6 md driver does not have this problem. I believe there have also been
patches circulated on the linux kernel/raid mailing lists that attempt to
correct this problem.
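To make the race concrete, here is a minimal user-space analogue (pthreads,
not kernel code; every name in it is invented for illustration). One thread
walks a shared list the way mdrecoveryd walks all_mddevs, while another
unlinks and frees a node the way do_md_stop() does. Without the rwlock the
walker can chase a freed pointer; holding it across the whole walk is the
shape of the fix that 2.4's ITERATE_MDDEV lacks.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct node { struct node *next; int id; };

static struct node *head;
static pthread_rwlock_t list_lock = PTHREAD_RWLOCK_INITIALIZER;

static void *recovery_thread(void *arg)
{
	for (int pass = 0; pass < 1000; pass++) {
		pthread_rwlock_rdlock(&list_lock);	/* held across the whole walk */
		for (struct node *n = head; n; n = n->next)
			(void)n->id;			/* "recover" the device */
		pthread_rwlock_unlock(&list_lock);
	}
	return NULL;
}

static void stop_first_device(void)
{
	pthread_rwlock_wrlock(&list_lock);	/* writer excludes the walker */
	struct node *dead = head;
	head = head->next;
	pthread_rwlock_unlock(&list_lock);
	free(dead);				/* safe: no walker can still see it */
}

int main(void)
{
	for (int i = 0; i < 2; i++) {		/* two "md devices" */
		struct node *n = malloc(sizeof(*n));
		n->id = i;
		n->next = head;
		head = n;
	}

	pthread_t t;
	pthread_create(&t, NULL, recovery_thread, NULL);
	stop_first_device();
	pthread_join(t, NULL);
	puts("list walk and removal were serialized");
	return 0;
}

Build with gcc -pthread. Deleting the rdlock/wrlock calls gives the unlocked
iteration pattern; tools like helgrind then flag the use-after-free that
shows up in the kernel as the oops above.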
David, it's not entirely clear -- has the patch at http://marc.theaimsgroup.com/?l=linux-raid&m=106393738529573&w=2 been tested at the customer site, or another patch of your own?
David, no, I only tried to add locking to ITERATE_MDDEV(); I was thinking that the patch didn't apply cleanly to RHEL3 and there were some fix-ups to be made.
We've applied the fix in Neil Brown's email to a Red Hat 8 kernel (2.4.20-28.7) and ran into a locking failure in the seq_file interface. The down_read(&all_mddevs_sem) call needs to be moved from md_seq_next() to md_seq_start() in order to avoid a proc file bug that will cause the system to hang. I've attached our complete patch. With these locking changes, the system is stable for us and no longer oopses.
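For reference, the shape of the corrected locking looks roughly like this (a
sketch, not the exact patch; the semaphore name follows Neil Brown's posting
and the list walk is a stand-in for the real md code). The seq_file core
guarantees that ->start and ->stop are called as a pair on every read()
cycle, while ->next runs zero or more times in between, so a lock taken in
->next is left unbalanced on any cycle that ends without one; a lock taken in
->start is always released by ->stop.

#include <linux/seq_file.h>
#include <linux/rwsem.h>
#include <linux/list.h>

static DECLARE_RWSEM(all_mddevs_sem);	/* name as in Neil Brown's patch */
static LIST_HEAD(all_mddevs);		/* stand-in for the real device list */

static void *md_seq_start(struct seq_file *seq, loff_t *pos)
{
	struct list_head *p;
	loff_t n = *pos;

	down_read(&all_mddevs_sem);	/* moved here from md_seq_next() */
	list_for_each(p, &all_mddevs)
		if (!n--)
			return p;	/* the *pos'th device */
	return NULL;			/* past the end; ->stop still runs */
}

static void *md_seq_next(struct seq_file *seq, void *v, loff_t *pos)
{
	struct list_head *p = ((struct list_head *)v)->next;

	++*pos;				/* no down_read() here any more */
	return p == &all_mddevs ? NULL : p;
}

static void md_seq_stop(struct seq_file *seq, void *v)
{
	up_read(&all_mddevs_sem);	/* pairs with ->start on every cycle */
}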
Created attachment 122629 [details]
locking patch against 2.4.20-28.7

This patch is modified from Neil Brown's original to apply against 2.4.20-28.7 and also has the proc file locking problem fixed.
I modified the patch to work with a RHEL3 kernel and with the md event interface we have in RHEL3. It then passed my testing, and I've submitted it internally for review.
A fix for this problem has just been committed to the RHEL3 U8 patch pool this evening (in kernel version 2.4.21-40.10.EL).
Adding a couple dozen bugs to the CanFix list so I can complete the stupid advisory.
A kernel has been released that contains a patch for this problem. Please verify if your problem is fixed with the latest available kernel from the RHEL3 public beta channel at rhn.redhat.com.
Reverting to ON_QA.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0437.html