Red Hat Bugzilla – Bug 134736
kernel panic in md driver (md lacks proper locking of device lists)
Last modified: 2007-11-30 17:07:04 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040116
Description of problem:
Unable to handle kernel NULL pointer dereference at virtual address
*pde = 00000000
nbd raid1 mousedev input parport_pc lp parport autofs audit e100
floppy sg micr
EIP: 0060:[<c01fbb68>] Not tainted
EIP is at md_do_recovery [kernel] 0x78 (2.4.21-15.EL/i686)
eax: 00000016 ebx: de634d80 ecx: c171e000 edx: 00000073
esi: 00000000 edi: d54ca000 ebp: fffffffc esp: c171ff6c
ds: 0068 es: 0068 ss: 0068
Process mdrecoveryd (pid: 8, stackpage=c171f000)
Stack: d7154580 c171ff7c 00000000 d71545d4 d54ca300 c171e000 c25aadc0
c25aadc8 c01fa877 00000000 c029b4aa 00000000 00000000 c171e000
00000000 00000000 c171e000 c0341820 dffedfb0 00000000 c171e000
Call Trace: [<c01fa877>] md_thread [kernel] 0xe7 (0xc171ff90)
[<c01fa790>] md_thread [kernel] 0x0 (0xc171ffe0)
[<c010945d>] kernel_thread_helper [kernel] 0x5 (0xc171fff0)
Code: 81 7e 04 10 33 38 c0 75 9f 83 c4 14 5b 5e 5f 5d c3 8d b4 26
Kernel panic: Fatal exception
The above panic occurs at linux-2.4.21-15.EL/drivers/md/md.c, line 3694:
void md_do_recovery(void *data)
struct md_list_head *tmp;
dprintk(KERN_INFO "md: recovery thread got woken up ...\n");
ITERATE_MDDEV does not contain the necessary locking to ensure that
the device list does not change while it's being iterated over.
The panic is easily reproducible if 2 md devices are configured on a
system and they are stopped while one of the devices is in recovery.
There are, no doubt, other related panics/oopses that can occur due to
the lack of locking around access to the device lists in the 2.4
kernel md driver.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. create 2 md raid1 devices
2. allow the first device to complete recovery
3. allow the second device to begin recovery
4. stop the first device while the second device is in recovery
5. stop the second device while it is in recovery
6. the kernel panics in md_do_recovery()
Actual Results: kernel panic in md_do_recovery()
Expected Results: md device stops
The 2.6 md driver does not have this problem. I believe there have
also been patches circulated on the linux kernel/raid mailing lists
that attempt to correct this problem.
David, it's not entirely clear -- has the patch at
http://marc.theaimsgroup.com/?l=linux-raid&m=106393738529573&w=2 been tested at
the customer site, or another patch of your own?
David, no. I only tried adding locking to ITERATE_MDDEV(); I think the
patch didn't apply cleanly to RHEL3 and some fix-ups were still needed.
We've applied the fix in Neil Brown's email to a Red Hat 8 kernel 2.4.20-28.7
and run into a locking failure in the seq_file interface.
The down_read(&all_mddevs_sem) call needs to be moved from md_seq_next() to
md_seq_start() in order to avoid a proc file bug which will cause the system to
hang. I've attached our complete patch.
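One plausible reading of why the placement matters: in the seq_file protocol, start() is called once, next() is called once per further entry, and stop() is called once after start() whether or not any entry was returned. A lock taken in next() and released in stop() is therefore unbalanced. The sketch below is a user-space simulation only, assuming simplified signatures; `lock_read()`, `unlock_read()`, `seq_walk()` and `seq_lock_depth()` are hypothetical helpers, and a counter stands in for the read-held semaphore.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a read-held semaphore: counts current holders. */
static int lock_depth;
static void lock_read(void)   { lock_depth++; }
static void unlock_read(void) { assert(lock_depth > 0); lock_depth--; }

#define NDEVS 2
static int devs[NDEVS] = { 0, 1 };

/* start(): take the lock exactly once, then position the iterator. */
static int *seq_start(size_t *pos)
{
    lock_read();
    return *pos < NDEVS ? &devs[*pos] : NULL;
}

/* next(): only advances; it must not touch the lock. */
static int *seq_next(size_t *pos)
{
    ++*pos;
    return *pos < NDEVS ? &devs[*pos] : NULL;
}

/* stop(): runs once per start(), even when start() returned NULL. */
static void seq_stop(void)
{
    unlock_read();
}

/* Mimics one pass of a reader: start, zero or more next, stop.
 * Returns the number of entries visited. */
int seq_walk(size_t pos)
{
    int shown = 0;
    for (int *v = seq_start(&pos); v != NULL; v = seq_next(&pos))
        shown++;
    seq_stop();
    return shown;
}

/* Exposes the holder count so callers can check the lock is balanced. */
int seq_lock_depth(void)
{
    return lock_depth;
}
```

With the lock in start(), `seq_lock_depth()` is 0 after every walk. If `lock_read()` were moved into `seq_next()`, a two-entry walk would acquire twice and release once, and a walk starting past the end would release a lock it never took.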
With these locking changes, the system is stable for us and no longer oopses.
Created attachment 122629 [details]
locking patch against 2.4.20-28.7
This patch is modified from Neil Brown's original to apply against 2.4.20-28.7
and also has the proc file locking problem fixed.
I modified the patch to work with a RHEL3 kernel and with the md event interface
we have in RHEL3. It then passed my testing, and I've submitted it internally.
A fix for this problem has just been committed to the RHEL3 U8
patch pool this evening (in kernel version 2.4.21-40.10.EL).
Adding a couple dozen bugs to CanFix list so I can complete the stupid advisory.
A kernel has been released that contains a patch for this problem. Please
verify if your problem is fixed with the latest available kernel from the RHEL3
public beta channel at rhn.redhat.com.
Reverting to ON_QA.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.