Bug 86222

Summary:

[PATCH] AS 2.1 gives oops in md_update_sb

Product:

Red Hat Enterprise Linux 2.1

Reporter:

Larry Troan <ltroan>

Component:

kernel

Assignee:

Doug Ledford <dledford>

Status:

CLOSED ERRATA

QA Contact:

Brian Brock <bbrock>

Severity:

medium

Priority:

medium

Version:

2.1

CC:

fhirtz, ichute, jparadis, tao, tburke

Hardware:

All

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2005-04-28 15:05:07 UTC

Bug Blocks:

132992

Attachments:

Description	Flags
raid_oops-as21.patch	none
Redone version of the raid-oops-as21.patch file	none

Description Larry Troan 2003-03-17 15:00:08 UTC

Problem:
When /dev/sda is hot-inserted into a root mirror (software RAID1 with md) and
"cat /proc/mdstat" is run, the following Oops occurs in md_update_sb, in the
mdrecoveryd process.
Proposed Solution:
Neil Brown has a patch for this in 2.5.  I'll attach it, formatted for
2.4.9-e.3, so that it can be included in RHN.

Oops: 0000
Kernel 2.4.9-e.3smp
CPU:  1
EIP:  0010:[<c01cca4b>]
EFLAGS: 00010206
EIP is at md_update_sb [kernel] 0xfb
eax: c1624620  ebx: 00000001  ecx: c02661a2  edx: c1624620
esi: c16251a0  edi: c16251b4  ebp: 0000002d  esp: c1613f64
ds: 0018    es: 0018   ss: 0018
Process mdrecoveryd (pid: 14, stackpage=c1613000)
Stack: 00000064 c16251a0 c1517000 00000000 00000001 c01d0875 c16251a0 c1625334
       00000000 c1612000 c1624f60 00000000 00000001 c01cf917 00000000 00000000
       c1612000 00000000 00000000 00000000 c1612000 c1641eac c010710e 00000000
Call Trace:  [<c01d0875>] md_do_recovery [kernel] 0x45
[<c01cf917>] md_thread [kernel] 0x147
[<c010710e>] ret_from_fork [kernel] 0x6
[<c0100000>] kernel_thread [kernel] 0x26
[<c0100000>] md_thread [kernel] 0x0

Code: 39 7b 04 75 e0 85 ed 74 27 ff 0c 24 74 17 68 a0 8a 25 c0 e8
<0>Kernel panic: not continuing
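
A hedged reading of the oops above, added for illustration and not part of the original report: assuming the Code: bytes start at the faulting EIP (the report does not say so), the first instruction, 39 7b 04, decodes as cmp %edi,0x4(%ebx). With ebx = 00000001 that would be a read from address 0x5, which would fit a bad pointer in ebx. The Python sketch below handles only this one encoding; decode_cmp_disp8 and effective_address are illustrative names, not tools used in this bug.

```python
# x86 register names in ModRM encoding order (0 = eax ... 7 = edi).
REG_NAMES = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

# Register values copied from the oops above.
REGS = {
    "eax": 0xC1624620, "ebx": 0x00000001, "ecx": 0xC02661A2, "edx": 0xC1624620,
    "esi": 0xC16251A0, "edi": 0xC16251B4, "ebp": 0x0000002D, "esp": 0xC1613F64,
}


def decode_cmp_disp8(code):
    """Decode `39 /r disp8` (CMP r/m32, r32 with an 8-bit displacement).

    Returns (base_register, displacement, compared_register).
    Raises ValueError for any other encoding; this is not a disassembler.
    """
    opcode, modrm, disp = code[0], code[1], code[2]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    # mod == 1 means "register + disp8"; rm == 4 would mean a SIB byte follows.
    if opcode != 0x39 or mod != 1 or rm == 4:
        raise ValueError("unsupported encoding")
    if disp >= 0x80:
        disp -= 0x100  # sign-extend the 8-bit displacement
    return REG_NAMES[rm], disp, REG_NAMES[reg]


def effective_address(code, regs):
    """Memory address the decoded instruction would read, as a 32-bit value."""
    base, disp, _ = decode_cmp_disp8(code)
    return (regs[base] + disp) & 0xFFFFFFFF


if __name__ == "__main__":
    code = bytes.fromhex("39 7b 04")  # first three bytes of the Code: line
    base, disp, reg = decode_cmp_disp8(code)
    addr = effective_address(code, REGS)
    print(f"cmp %{reg},{disp:#x}(%{base}) reads {addr:#010x}")
    # prints: cmp %edi,0x4(%ebx) reads 0x00000005
```

The same helper decodes the 39 78 04 sequence in the later e.24 oops as cmp %edi,0x4(%eax); that oops does not list register values, so no address is computed for it here.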

----------
Action by: andrewrcress
Issue Registered
----------
Action by: andrewrcress
raid_oops patch for 2.4.9-e.3 (ported from 2.5 Neil Brown patch)

Status set to: Waiting on Tech
File uploaded: raid_oops-as21.patch

ISSUE TRACKER 17379 opened by Intel as sev 3

Comment 1 Larry Troan 2003-03-17 15:01:18 UTC

Created attachment 90622 [details]
raid_oops-as21.patch

Comment 2 Larry Troan 2003-10-27 02:07:34 UTC

FROM ISSUE TRACKER
Event posted 10-20-2003 12:17pm by andrewrcress with duration of 0.00
After 7 iterations with e.24, here are the results:
1) Panicked somewhere during a SCSI I/O in the md_recovery (about 9% through).
   I couldn't save the output because no serial console was hooked up.
   At this point I hooked up a serial console.
2) Some delay during the remove of sda, but insert and rebuild worked fine.
3 - 6) Ditto.
7) All disk I/Os hung while doing raidhotremove of the sda partitions.

Note that this test scenario works fine on Red Hat kernels built on 2.4.19 or
later.
There is apparently still something wrong with this test scenario (hotplugging
sda in a software root mirror) on 2.4.9-e.24.

Status set to: Waiting on Tech

Comment 3 Larry Troan 2003-10-27 02:09:25 UTC

FROM ISSUE TRACKER
Event posted 10-24-2003 03:08pm by andrewrcress with duration of 0.00       
After a few more iterations, I did reproduce the panic.  It is in scsi_reset. 
Serial console output is below.
In this test, I hot-removed sdb, then did "cat /proc/mdstat".
The I/O that is on the stack was an sg command in this case (it does inquiry,
test-unit-ready, and get capacity commands).
I understand that later kernels have a good bit of rework in the scsi_reset area.

Unable to handle kernel NULL pointer dereference at virtual address 00000204
*pde = 00000000
Oops: 0000
Kernel 2.4.9-e.24smp
CPU:    0
EIP:    0010:[<c883843c>]    Tainted: P

EFLAGS: 00010006
Process sgraidmon (pid: 10250, stackpage=c4171000)
Stack: 00000009 00000200 00000000 00000000 c7ce2000 c4171c48 c8807638 c4171c48
      00000006 c4171c48 00000286 c8806a60 00000046 c8806b0a c4171c48 00000006
      c8812920 00000000 00000000 00000000 c038e1c0 c4171c48 c01249d1 c4171c48
Call Trace: [<c8807638>] scsi_reset                                 
[<c883890e>] aic7xxx_reset [aic7xxx] 0x59e
[<c8807638>] scsi_reset [scsi_mod] 0xe8 set md1       8
[<c88078f1>] scsi_old_reset [scsi_mod] 0x41      
[<c013da56>] _wrapped_alloc_pages [kernel] 0x76
[<c012d14d>] do_anonymous_page [kernel] 0x18d
[<c8802ca0>] scsi_reset_provider_done_command [scsi_mod] 0x0
[<c8803a25>] scsi_ioctl_Rsmp_914b0d65 [scsi_mod] 0x25
[<c88cdbaf>] sg_ioctl [sg] 0xb2f
[<c8836bea>] aic7xxx_queue [aic7xxx] 0x15a
[<c88006a1>] scsi_dispatch_cmd [scsi_mod] 0x161
[<c880878e>] scsi_request_fn [scsi_mod] 0x31e
[<c8807a34>] __scsi_insert_special [scsi_mod] 0x74
[<c8807a9a>] scsi_insert_special_req [scsi_mod] 0x1a
[<c880097b>] scsi_do_req_Rsmp_bdc72156 [scsi_mod] 0x14b
[<c88cc53c>] sg_read [sg] 0xfc
[<c88cde40>] sg_cmd_done_bh [sg] 0x0
[<c88cca4a>] sg_write [sg] 0x12a
[<c0145d26>] sys_write [kernel] 0x96
[<c0145d9e>] sys_write [kernel] 0x10e
[<c0155887>] sys_ioctl [kernel] 0x257
[<c01073c3>] system_call [kernel] 0x33


Code: 39 78 04 74 53 f6 05 3d 29 84 c8 10 74 29 8b 47 4c 83 e0 07
<0>Kernel panic: not continuing
In interrupt handler - not syncing
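
For illustration only (this is not from the issue tracker): a minimal Python sketch that splits call-trace lines of the form shown above into address, symbol, module, and offset, to make it easier to see which modules are on the stack. The line format is assumed from this report alone, the sample is a subset of the frames in the trace, and parse_trace and modules_in_order are hypothetical helper names. Lines that lack a [module] field, such as the truncated first scsi_reset entry, are skipped.

```python
import re

# One frame as printed in this report: [<address>] symbol [module] 0xoffset
TRACE_LINE = re.compile(
    r"\[<([0-9a-f]{8})>\]\s+(\S+)\s+\[(\w+)\]\s+0x([0-9a-f]+)"
)

# A few frames copied from the trace above, including the truncated first one.
SAMPLE = """\
Call Trace: [<c8807638>] scsi_reset
[<c883890e>] aic7xxx_reset [aic7xxx] 0x59e
[<c88078f1>] scsi_old_reset [scsi_mod] 0x41
[<c88cdbaf>] sg_ioctl [sg] 0xb2f
[<c0155887>] sys_ioctl [kernel] 0x257
"""


def parse_trace(text):
    """Return (address, symbol, module, offset) for each well-formed frame."""
    frames = []
    for line in text.splitlines():
        match = TRACE_LINE.search(line)
        if match is None:
            continue  # truncated or unrelated line
        addr, symbol, module, offset = match.groups()
        frames.append((int(addr, 16), symbol, module, int(offset, 16)))
    return frames


def modules_in_order(frames):
    """Module names in first-seen order, innermost frame first."""
    seen = []
    for _, _, module, _ in frames:
        if module not in seen:
            seen.append(module)
    return seen


if __name__ == "__main__":
    for addr, symbol, module, offset in parse_trace(SAMPLE):
        print(f"{addr:08x} {module}:{symbol}+{offset:#x}")
    print(modules_in_order(parse_trace(SAMPLE)))
```

On the sample frames this lists aic7xxx, scsi_mod, sg, and kernel, which matches the statement that the I/O on the stack was an sg command.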

Comment 9 Doug Ledford 2005-02-18 00:42:29 UTC

The patch as attached to this bugzilla is broken.  It has two specific
problems.  First, it doesn't add an sb_page item to the md_k.h header
file, so it won't even compile.  I fixed that.  Second, it doesn't
set the BH_Lock bit on the buffer head when it's being constructed,
whereas the md.c file in RHEL3 does.  I added that lock to the buffer
head in my updated version of the patch.  I'll attach the new patch to
this bugzilla and also submit it for review upon successful testing.

Comment 10 Doug Ledford 2005-02-18 00:43:36 UTC

Created attachment 111189 [details]
Redone version of the raid-oops-as21.patch file

Comment 11 Doug Ledford 2005-02-18 07:11:06 UTC

Patch tested and submitted for review.

Comment 12 Jim Paradis 2005-03-03 03:57:18 UTC

A fix for this problem was committed to the RHEL2.1 U7
patch pool this evening (in kernel version 2.4.9-61).

Comment 13 John Flanagan 2005-04-28 15:05:07 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-283.html