From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040116

Description of problem:

Unable to handle kernel NULL pointer dereference at virtual address 00000004
 printing eip:
c01fbb68
*pde = 00000000
Oops: 0000
nbd raid1 mousedev input parport_pc lp parport autofs audit e100 floppy sg micr
CPU:    0
EIP:    0060:[<c01fbb68>]    Not tainted
EFLAGS: 00010292
EIP is at md_do_recovery [kernel] 0x78 (2.4.21-15.EL/i686)
eax: 00000016   ebx: de634d80   ecx: c171e000   edx: 00000073
esi: 00000000   edi: d54ca000   ebp: fffffffc   esp: c171ff6c
ds: 0068   es: 0068   ss: 0068
Process mdrecoveryd (pid: 8, stackpage=c171f000)
Stack: d7154580 c171ff7c 00000000 d71545d4 d54ca300 c171e000 c25aadc0 c171ffb4
       c25aadc8 c01fa877 00000000 c029b4aa 00000000 00000000 c171e000 00000000
       00000000 00000000 c171e000 c0341820 dffedfb0 00000000 c171e000 00000000
Call Trace:   [<c01fa877>] md_thread [kernel] 0xe7 (0xc171ff90)
[<c01fa790>] md_thread [kernel] 0x0 (0xc171ffe0)
[<c010945d>] kernel_thread_helper [kernel] 0x5 (0xc171fff0)
Code: 81 7e 04 10 33 38 c0 75 9f 83 c4 14 5b 5e 5f 5d c3 8d b4 26
Kernel panic: Fatal exception

The above panic occurs at linux-2.4.21-15.EL/drivers/md/md.c, line 3694:

void md_do_recovery(void *data)
{
	int err;
	mddev_t *mddev;
	mdp_super_t *sb;
	mdp_disk_t *spare;
	struct md_list_head *tmp;

	dprintk(KERN_INFO "md: recovery thread got woken up ...\n");
restart:
	ITERATE_MDDEV(mddev,tmp) {
	^^^^^^^^^^^^^^^^^^^^^^^^^^

ITERATE_MDDEV does not contain the necessary locking to ensure that the
device list does not change while it's being iterated over.

The panic is easily reproducible if 2 md devices are configured on a system
and they are stopped while one of the devices is in recovery. There are, no
doubt, other related panics/oopses that can occur due to the lack of locking
around access to the device lists in the 2.4 kernel md driver.

Version-Release number of selected component (if applicable):
kernel-2.4

How reproducible:
Always

Steps to Reproduce:
1. create 2 md raid1 devices
2. allow the first device to complete recovery
3. allow the second device to begin recovery
4. stop the first device while the second device is in recovery
5. stop the second device while it is in recovery
6. the kernel panics in md_do_recovery()

Actual Results:  kernel panic in md_do_recovery()

Expected Results:  md device stops

Additional info:

The 2.6 md driver does not have this problem. I believe there have also been
patches circulated on the linux kernel/raid mailing lists that attempt to
correct this problem.
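To make the race concrete, here is a minimal user-space analogue (pthreads,
not kernel code; every name in it is invented for illustration). One thread
walks a shared list the way mdrecoveryd walks all_mddevs, while another
unlinks and frees a node the way do_md_stop() does. Without the rwlock the
walker can chase a freed pointer; holding it across the whole walk is the
shape of the fix that 2.4's ITERATE_MDDEV lacks.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct node { struct node *next; int id; };

static struct node *head;
static pthread_rwlock_t list_lock = PTHREAD_RWLOCK_INITIALIZER;

static void *recovery_thread(void *arg)
{
	for (int pass = 0; pass < 1000; pass++) {
		pthread_rwlock_rdlock(&list_lock);	/* held across the whole walk */
		for (struct node *n = head; n; n = n->next)
			(void)n->id;			/* "recover" the device */
		pthread_rwlock_unlock(&list_lock);
	}
	return NULL;
}

static void stop_first_device(void)
{
	pthread_rwlock_wrlock(&list_lock);	/* writer excludes the walker */
	struct node *dead = head;
	head = head->next;
	pthread_rwlock_unlock(&list_lock);
	free(dead);				/* safe: no walker can still see it */
}

int main(void)
{
	for (int i = 0; i < 2; i++) {		/* two "md devices" */
		struct node *n = malloc(sizeof(*n));
		n->id = i;
		n->next = head;
		head = n;
	}

	pthread_t t;
	pthread_create(&t, NULL, recovery_thread, NULL);
	stop_first_device();
	pthread_join(t, NULL);
	puts("list walk and removal were serialized");
	return 0;
}

Build with gcc -pthread. Deleting the rdlock/wrlock calls gives the unlocked
iteration pattern; tools like helgrind then flag the use-after-free that
shows up in the kernel as the oops above.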
David, it's not entirely clear -- has the patch at http://marc.theaimsgroup.com/?l=linux-raid&m=106393738529573&w=2 been tested at the customer site, or another patch of your own?
David, no, I only tried to add locking to ITERATE_MDDEV(); I was thinking that the patch didn't apply cleanly to RHEL3 and there were some fix-ups to be made.
We've applied the fix in Neil Brown's email to a Red Hat 8 kernel (2.4.20-28.7) and ran into a locking failure in the seq_file interface. The down_read(&all_mddevs_sem) call needs to be moved from md_seq_next() to md_seq_start() in order to avoid a proc file bug that will cause the system to hang. I've attached our complete patch. With these locking changes, the system is stable for us and no longer oopses.
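For reference, the shape of the corrected locking looks roughly like this (a
sketch, not the exact patch; the semaphore name follows Neil Brown's posting
and the list walk is a stand-in for the real md code). The seq_file core
guarantees that ->start and ->stop are called as a pair on every read()
cycle, while ->next runs zero or more times in between, so a lock taken in
->next is left unbalanced on any cycle that ends without one; a lock taken in
->start is always released by ->stop.

#include <linux/seq_file.h>
#include <linux/rwsem.h>
#include <linux/list.h>

static DECLARE_RWSEM(all_mddevs_sem);	/* name as in Neil Brown's patch */
static LIST_HEAD(all_mddevs);		/* stand-in for the real device list */

static void *md_seq_start(struct seq_file *seq, loff_t *pos)
{
	struct list_head *p;
	loff_t n = *pos;

	down_read(&all_mddevs_sem);	/* moved here from md_seq_next() */
	list_for_each(p, &all_mddevs)
		if (!n--)
			return p;	/* the *pos'th device */
	return NULL;			/* past the end; ->stop still runs */
}

static void *md_seq_next(struct seq_file *seq, void *v, loff_t *pos)
{
	struct list_head *p = ((struct list_head *)v)->next;

	++*pos;				/* no down_read() here any more */
	return p == &all_mddevs ? NULL : p;
}

static void md_seq_stop(struct seq_file *seq, void *v)
{
	up_read(&all_mddevs_sem);	/* pairs with ->start on every cycle */
}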
Created attachment 122629 [details]
locking patch against 2.4.20-28.7

This patch is modified from Neil Brown's original to apply against 2.4.20-28.7 and also has the proc file locking problem fixed.
I modified the patch to work with a RHEL3 kernel and with the md event interface we have in RHEL3. It then passed my testing, and I've submitted it internally for review.
A fix for this problem has just been committed to the RHEL3 U8 patch pool this evening (in kernel version 2.4.21-40.10.EL).
Adding a couple dozen bugs to the CanFix list so I can complete the stupid advisory.
A kernel has been released that contains a patch for this problem. Please verify if your problem is fixed with the latest available kernel from the RHEL3 public beta channel at rhn.redhat.com.
Reverting to ON_QA.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0437.html