Problem:
When hot-inserting /dev/sda in a root mirror (software raid1 with md), and doing "cat /proc/mdstat", the following Oops occurs in md_update_sb/mdrecoveryd.

Proposed Solution:
There is a patch for this from Neil Brown in 2.5. I'll attach it, formatted for 2.4.9-e.3, so that it can be included in RHN.

Oops: 0000
Kernel 2.4.9-e.3smp
CPU: 1
EIP: 0010:[<c01cca4b>]
EFLAGS: 00010206
EIP is at md_update_sb [kernel] 0xfb
eax: c1624620 ebx: 00000001 ecx: c02661a2 edx: c1624620
esi: c16251a0 edi: c16251b4 ebp: 0000002d esp: c1613f64
ds: 0018 es: 0018 ss: 0018
Process mdrecoveryd (pid: 14, stackpage=c1613000)
Stack: 00000064 c16251a0 c1517000 00000000 00000001 c01d0875 c16251a0 c1625334
       00000000 c1612000 c1624f60 00000000 00000001 c01cf917 00000000 00000000
       c1612000 00000000 00000000 00000000 c1612000 c1641eac c010710e 00000000
Call Trace:
[<c01d0875>] md_do_recovery [kernel] 0x45
[<c01cf917>] md_thread [kernel] 0x147
[<c010710e>] ret_from_fork [kernel] 0x6
[<c0100000>] kernel_thread [kernel] 0x26
[<c0100000>] md_thread [kernel] 0x0
Code: 39 7b 04 75 e0 85 ed 74 27 ff 0c 24 74 17 68 a0 8a 25 c0 e8
<0>Kernel panic: not continuing

---------- Action by: andrewrcress
Issue Registered

---------- Action by: andrewrcress
raid_oops patch for 2.4.9-e.3 (ported from 2.5 Neil Brown patch)
Status set to: Waiting on Tech
File uploaded: raid_oops-as21.patch

ISSUE TRACKER 17379 opened by Intel as sev 3
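For triage, the call trace in an oops like the one above can be split into fields mechanically. The following is only a minimal sketch, assuming the trace entries follow the "[<address>] symbol [module] 0xoffset" layout seen in this report; the name parse_call_trace and the shortened SAMPLE text are hypothetical and exist only for illustration.

```python
import re

# One trace entry looks like: [<c01d0875>] md_do_recovery [kernel] 0x45
# The "[module] 0xoffset" part is optional because some entries omit it.
ENTRY = re.compile(
    r"\[<([0-9a-f]{8})>\]\s+(\w+)(?:\s+\[(\w+)\]\s+(0x[0-9a-f]+))?"
)


def parse_call_trace(oops_text):
    """Return (address, symbol, module, offset) tuples found after 'Call Trace:'."""
    _, _, trace = oops_text.partition("Call Trace:")
    # Cut at the code dump so nothing after it is scanned for entries.
    trace = trace.split("Code:", 1)[0]
    return [
        (int(addr, 16), sym, mod or None, int(off, 16) if off else None)
        for addr, sym, mod, off in ENTRY.findall(trace)
    ]


# Shortened excerpt of the trace quoted in this report.
SAMPLE = """Call Trace:
[<c01d0875>] md_do_recovery [kernel] 0x45
[<c01cf917>] md_thread [kernel] 0x147
[<c010710e>] ret_from_fork [kernel] 0x6
Code: 39 7b 04 75 e0"""

for addr, sym, mod, off in parse_call_trace(SAMPLE):
    offset = f"{off:#x}" if off is not None else "?"
    print(f"{addr:08x} {sym} [{mod}] {offset}")
```

Having the frames as tuples makes it easier to compare traces from different kernel builds, for example to check whether md_do_recovery still appears after a patch is applied.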
Created attachment 90622 raid_oops-as21.patch
FROM ISSUE TRACKER
Event posted 10-20-2003 12:17pm by andrewrcress with duration of 0.00

After 7 iterations with e.24, here are the results:
1) Panicked somewhere while doing a SCSI I/O during the md_recovery (about 9% through). I couldn't save the output because I didn't have a serial console hooked up. At this point I hooked up a serial console.
2) Some delay during remove of sda, but insert & rebuild worked fine.
3 - 6) Ditto.
7) All disk I/Os hung while doing raidhotremove of sda partitions.

Note that this test scenario works fine on Red Hat kernels built on 2.4.19 or greater. There is apparently still something wrong with this test scenario (hotplugging sda in a software root mirror) using 2.4.9-e.24.

Status set to: Waiting on Tech
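When repeating a hot-remove/hot-insert cycle like this, it can help to record the array state from /proc/mdstat on each iteration. The following is only a rough sketch, assuming the usual mdstat layout with a "[total/active]" counter and a "recovery = N%" progress line; the function name parse_mdstat is hypothetical, and the SAMPLE text (device names, block counts, speeds) is made up for illustration and is not output from the failing machine.

```python
import re

# Illustrative /proc/mdstat excerpt; every value here is invented for the sketch.
SAMPLE = """\
Personalities : [raid1]
read_ahead 1024 sectors
md1 : active raid1 sda2[2] sdb2[1]
      8385856 blocks [2/1] [_U]
      [=>...................]  recovery =  9.0% (754727/8385856) finish=4.2min speed=30000K/sec
md0 : active raid1 sda1[0] sdb1[1]
      104320 blocks [2/2] [UU]
unused devices: <none>
"""


def parse_mdstat(text):
    """Map each md device to its state, degraded flag, and recovery percentage."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"(md\d+) : (\w+)", line)
        if header:
            current = {"state": header.group(2), "degraded": False, "recovery": None}
            arrays[header.group(1)] = current
            continue
        if current is None:
            continue  # still in the preamble before the first array
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if counts:
            # [total/active]: fewer active members than total means degraded.
            current["degraded"] = int(counts.group(2)) < int(counts.group(1))
        progress = re.search(r"recovery\s*=\s*([\d.]+)%", line)
        if progress:
            current["recovery"] = float(progress.group(1))
    return arrays


for name, info in sorted(parse_mdstat(SAMPLE).items()):
    print(name, info)
```

On a live system the same function could be fed the contents of /proc/mdstat between iterations, so that a hang or panic can be tied to how far the rebuild had progressed.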
FROM ISSUE TRACKER
Event posted 10-24-2003 03:08pm by andrewrcress with duration of 0.00

After a few more iterations, I did reproduce the panic. It is in scsi_reset. Serial console output is below. In this test, I hot-removed sdb, then did "cat /proc/mdstat". The I/O that is on the stack was an sg command in this case (it does inquiry, test-unit-ready, and get capacity commands). I understand that later kernels have a good bit of rework in the scsi_reset area.

Unable to handle kernel NULL pointer dereference at virtual address 00000204
*pde = 00000000
Oops: 0000
Kernel 2.4.9-e.24smp
CPU: 0
EIP: 0010:[<c883843c>] Tainted: P
EFLAGS: 00010006
Process sgraidmon (pid: 10250, stackpage=c4171000)
Stack: 00000009 00000200 00000000 00000000 c7ce2000 c4171c48 c8807638 c4171c48
       00000006 c4171c48 00000286 c8806a60 00000046 c8806b0a c4171c48 00000006
       c8812920 00000000 00000000 00000000 c038e1c0 c4171c48 c01249d1 c4171c48
Call Trace:
[<c8807638>] scsi_reset
[<c883890e>] aic7xxx_reset [aic7xxx] 0x59e
[<c8807638>] scsi_reset [scsi_mod] 0xe8 set md1 8
[<c88078f1>] scsi_old_reset [scsi_mod] 0x41
[<c013da56>] _wrapped_alloc_pages [kernel] 0x76
[<c012d14d>] do_anonymous_page [kernel] 0x18d
[<c8802ca0>] scsi_reset_provider_done_command [scsi_mod] 0x0
[<c8803a25>] scsi_ioctl_Rsmp_914b0d65 [scsi_mod] 0x25
[<c88cdbaf>] sg_ioctl [sg] 0xb2f
[<c8836bea>] aic7xxx_queue [aic7xxx] 0x15a
[<c88006a1>] scsi_dispatch_cmd [scsi_mod] 0x161
[<c880878e>] scsi_request_fn [scsi_mod] 0x31e
[<c8807a34>] __scsi_insert_special [scsi_mod] 0x74
[<c8807a9a>] scsi_insert_special_req [scsi_mod] 0x1a
[<c880097b>] scsi_do_req_Rsmp_bdc72156 [scsi_mod] 0x14b
[<c88cc53c>] sg_read [sg] 0xfc
[<c88cde40>] sg_cmd_done_bh [sg] 0x0
[<c88cca4a>] sg_write [sg] 0x12a
[<c0145d26>] sys_write [kernel] 0x96
[<c0145d9e>] sys_write [kernel] 0x10e
[<c0155887>] sys_ioctl [kernel] 0x257
[<c01073c3>] system_call [kernel] 0x33
Code: 39 78 04 74 53 f6 05 3d 29 84 c8 10 74 29 8b 47 4c 83 e0 07
<0>Kernel panic: not continuing
In interrupt handler - not syncing
The patch as attached to this bugzilla is broken. It has two specific problems. First, it doesn't add an sb_page item to the md_k.h header file, so it won't even compile. I fixed that. Second, it doesn't set the BH_Lock bit on the buffer head when it's being constructed, whereas the md.c file in RHEL3 does. I added that lock to the buffer head in my updated version of the patch. I'll attach the new patch to this bugzilla and also submit it for review upon successful testing.
Created attachment 111189 Redone version of the raid-oops-as21.patch file
Patch tested and submitted for review.
A fix for this problem has just been committed to the RHEL2.1 U7 patch pool this evening (in kernel version 2.4.9-61).
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-283.html