Bug 134736

Summary: kernel panic in md driver (md lacks proper locking of device lists)
Product: Red Hat Enterprise Linux 3 Reporter: Paul Clements <paul.clements>
Component: kernel    Assignee: Doug Ledford <dledford>
Status: CLOSED ERRATA QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: medium    
Version: 3.0    CC: dwmw2, james.bottomley, petrides, riel
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHSA-2006-0437 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2006-07-20 13:17:21 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 170417, 181405    
Attachments:
Description Flags
locking patch against 2.4.20-28.7 none

Description Paul Clements 2004-10-05 20:30:32 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040116

Description of problem:
Unable to handle kernel NULL pointer dereference at virtual address
00000004
 printing eip:
c01fbb68
*pde = 00000000
Oops: 0000
nbd raid1 mousedev input parport_pc lp parport autofs audit e100 floppy sg micr
CPU:    0
EIP:    0060:[<c01fbb68>]    Not tainted
EFLAGS: 00010292
EIP is at md_do_recovery [kernel] 0x78 (2.4.21-15.EL/i686)
eax: 00000016   ebx: de634d80   ecx: c171e000   edx: 00000073
esi: 00000000   edi: d54ca000   ebp: fffffffc   esp: c171ff6c
ds: 0068   es: 0068   ss: 0068
Process mdrecoveryd (pid: 8, stackpage=c171f000)
Stack: d7154580 c171ff7c 00000000 d71545d4 d54ca300 c171e000 c25aadc0 c171ffb4
       c25aadc8 c01fa877 00000000 c029b4aa 00000000 00000000 c171e000 00000000
       00000000 00000000 c171e000 c0341820 dffedfb0 00000000 c171e000 00000000
Call Trace:   [<c01fa877>] md_thread [kernel] 0xe7 (0xc171ff90)
[<c01fa790>] md_thread [kernel] 0x0 (0xc171ffe0)
[<c010945d>] kernel_thread_helper [kernel] 0x5 (0xc171fff0)
Code: 81 7e 04 10 33 38 c0 75 9f 83 c4 14 5b 5e 5f 5d c3 8d b4 26
Kernel panic: Fatal exception


The above panic occurs at linux-2.4.21-15.EL/drivers/md/md.c, line 3694:

void md_do_recovery(void *data)
{
        int err;
        mddev_t *mddev;
        mdp_super_t *sb;
        mdp_disk_t *spare;
        struct md_list_head *tmp;

        dprintk(KERN_INFO "md: recovery thread got woken up ...\n");
restart:
        ITERATE_MDDEV(mddev,tmp) {
        ^^^^^^^^^^^^^^^^^^^^^^^^^^

ITERATE_MDDEV does not contain the necessary locking to ensure that
the device list does not change while it's being iterated over.

The panic is easily reproducible if 2 md devices are configured on a
system and they are stopped while one of the devices is in recovery.
There are, no doubt, other related panics/oopses that can occur due to
the lack of locking around access to the device lists in the 2.4
kernel md driver. 
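
For illustration, the general shape of this kind of fix (using the
all_mddevs_sem read/write semaphore mentioned in the later comments) is: the
recovery thread holds a reader lock for the whole walk, while starting or
stopping an array takes the writer lock, so a stop can never unlink an entry
out from under the iterator. Below is a minimal userspace sketch of that
pattern with a POSIX rwlock as a stand-in; the list, type, and function names
are illustrative and not the actual RHEL3 patch.

/*
 * Minimal userspace sketch of the locking pattern described above, using a
 * POSIX rwlock as a stand-in for the all_mddevs_sem read/write semaphore.
 * The list, type, and function names are illustrative only.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct mddev {                              /* stand-in for mddev_t */
        int unit;
        struct mddev *next;
};

static struct mddev *all_mddevs;            /* stand-in for the global device list */
static pthread_rwlock_t all_mddevs_sem = PTHREAD_RWLOCK_INITIALIZER;

/* What the recovery thread needs: iterate with the reader lock held. */
static void do_recovery_pass(void)
{
        pthread_rwlock_rdlock(&all_mddevs_sem);
        for (struct mddev *m = all_mddevs; m != NULL; m = m->next)
                printf("checking md%d for pending recovery\n", m->unit);
        pthread_rwlock_unlock(&all_mddevs_sem);
}

/* What stopping an array needs: unlink only while holding the writer lock. */
static void stop_device(int unit)
{
        pthread_rwlock_wrlock(&all_mddevs_sem);
        for (struct mddev **pp = &all_mddevs; *pp != NULL; pp = &(*pp)->next) {
                if ((*pp)->unit == unit) {
                        struct mddev *dead = *pp;
                        *pp = dead->next;
                        free(dead);
                        break;
                }
        }
        pthread_rwlock_unlock(&all_mddevs_sem);
}

static void add_device(int unit)
{
        struct mddev *m = malloc(sizeof(*m));
        m->unit = unit;
        pthread_rwlock_wrlock(&all_mddevs_sem);
        m->next = all_mddevs;
        all_mddevs = m;
        pthread_rwlock_unlock(&all_mddevs_sem);
}

int main(void)
{
        add_device(0);
        add_device(1);
        do_recovery_pass();     /* safe even if another thread stops a device */
        stop_device(1);
        do_recovery_pass();
        stop_device(0);
        return 0;
}

The real patch works on the kernel's own md_list_head lists and semaphore
primitives, but the invariant is the same: no traversal of the device list
without the reader side held, no insertion or removal without the writer side.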

Version-Release number of selected component (if applicable):
kernel-2.4

How reproducible:
Always

Steps to Reproduce:
1. create 2 md raid1 devices
2. allow the first device to complete recovery
3. allow the second device to begin recovery
4. stop the first device while the second device is in recovery
5. stop the second device while it is in recovery
6. the kernel panics in md_do_recovery()

Actual Results:  kernel panic in md_do_recovery()

Expected Results:  md device stops

Additional info:

The 2.6 md driver does not have this problem. I believe there have
also been patches circulated on the linux kernel/raid mailing lists
that attempt to correct this problem.

Comment 2 David Woodhouse 2005-12-21 18:03:42 UTC
David, it's not entirely clear -- has the patch at
http://marc.theaimsgroup.com/?l=linux-raid&m=106393738529573&w=2 been tested at
the customer site, or another patch of your own?

Comment 3 David Milburn 2005-12-21 18:21:37 UTC
David, no, I only tried to add locking to ITERATE_MDDEV(); I was thinking that
the patch didn't apply cleanly to RHEL3 and there were some fixups to be made.

Comment 5 James Bottomley 2005-12-29 17:38:56 UTC
We've applied the fix in Neil Brown's email to a Red Hat 8 kernel 2.4.20-28.7
and run into a locking failure in the seq_file interface.

The down_read(&all_mddevs_sem) call needs to be moved from md_seq_next() to
md_seq_start() in order to avoid a proc file bug which will cause the system to
hang.  I've attached our complete patch.

With these locking changes, the system is stable for us and no longer oopses.
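
For context, a plausible reading of that hang: the seq_file core calls
->start() and ->stop() exactly once per traversal but ->next() once per entry,
so a read-lock taken in next() can be acquired more times than it is released,
leaving all_mddevs_sem read-held and blocking the next writer (for example an
array being stopped). The following minimal userspace sketch shows the balanced
placement; the rwlock and the device table are stand-ins, not the real kernel
objects, and only the names mirror this report.

/*
 * Userspace sketch of the seq_file calling order and the balanced lock
 * placement.  all_mddevs_sem and the md_seq_* names mirror this report;
 * the rwlock and the device table are stand-ins, not kernel objects.
 */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

static const char *devices[] = { "md0", "md1" };
static const int ndevices = 2;
static pthread_rwlock_t all_mddevs_sem = PTHREAD_RWLOCK_INITIALIZER;

/* ->start(): called once per traversal; take the reader lock here. */
static const char *md_seq_start(int *pos)
{
        pthread_rwlock_rdlock(&all_mddevs_sem);
        return (*pos < ndevices) ? devices[*pos] : NULL;
}

/* ->next(): called once per entry; advance only, no locking here. */
static const char *md_seq_next(int *pos)
{
        ++*pos;
        return (*pos < ndevices) ? devices[*pos] : NULL;
}

/* ->stop(): always paired once with start(); drop the lock here. */
static void md_seq_stop(void)
{
        pthread_rwlock_unlock(&all_mddevs_sem);
}

int main(void)
{
        /* This loop mimics the order in which the seq_file core calls the hooks. */
        int pos = 0;
        for (const char *v = md_seq_start(&pos); v != NULL; v = md_seq_next(&pos))
                printf("%s : active raid1\n", v);
        md_seq_stop();

        /* Balanced locking: a writer (e.g. an array being stopped) is not blocked. */
        assert(pthread_rwlock_trywrlock(&all_mddevs_sem) == 0);
        pthread_rwlock_unlock(&all_mddevs_sem);
        puts("reader lock released; writer can proceed");
        return 0;
}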

Comment 6 James Bottomley 2005-12-29 17:42:36 UTC
Created attachment 122629 [details]
locking patch against 2.4.20-28.7

This patch is modified from Neil Brown's original to apply against 2.4.20-28.7
and also has the proc file locking problem fixed.

Comment 7 Doug Ledford 2006-04-18 21:15:12 UTC
I modified the patch to work with a RHEL3 kernel and with the md event interface
we have in RHEL3.  It then passed my testing, and I've submitted it internally
for review.

Comment 8 Ernie Petrides 2006-04-25 03:30:36 UTC
A fix for this problem has just been committed to the RHEL3 U8
patch pool this evening (in kernel version 2.4.21-40.10.EL).


Comment 9 Ernie Petrides 2006-04-28 21:54:04 UTC
Adding a couple dozen bugs to CanFix list so I can complete the stupid advisory.

Comment 11 Joshua Giles 2006-05-30 15:24:33 UTC
A kernel has been released that contains a patch for this problem.  Please
verify if your problem is fixed with the latest available kernel from the RHEL3
public beta channel at rhn.redhat.com.

Comment 12 Ernie Petrides 2006-05-30 20:25:40 UTC
Reverting to ON_QA.

Comment 14 Red Hat Bugzilla 2006-07-20 13:17:22 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0437.html