465265 – mpt 3.12.19.00rh on RHEL4.7 causes panic if a RAID 1 is configured.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.7
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Tomas Henzl
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:	452163 469236
Blocks:

Reported:	2008-10-02 14:04 UTC by Olivier Fourdan
Modified:	2018-10-19 23:52 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-11-19 13:45:04 UTC
Target Upstream Version:
Embargoed:

Attachments
This patch adds the test "if (pTarget) { /.../ }" as shown above. (1.03 KB, patch) 2008-10-02 14:04 UTC, Olivier Fourdan

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2008:0972	0	normal	SHIPPED_LIVE	Important: kernel security and bug fix update	2008-11-19 13:44:42 UTC

Description Olivier Fourdan 2008-10-02 14:04:34 UTC

Created attachment 319236
This patch adds the test "if (pTarget) { /.../ }" as shown above.

Description of problem:

On Primergy systems with LSI SCSI IME 53C1020/1030 the native driver mpt 3.12.19.00rh of RHEL 4.7 x86 and x86_64 panics during boot if a RAID 1 is configured.

This problem is new in RHEL 4.7 and was not present in the MPT driver shipped with RHEL 4.6, so it is a regression.

Version-Release number of selected component (if applicable):

kernel 2.6.9-78
mpt driver 3.12.19.00rh 
Firmware: MPT BIOS 5.13.08, MPT 1.03.48

How reproducible:

100% reproducible by customer

Steps to Reproduce:
1. Configure a RAID1 array on LSI SCSI IME 53C1020/1030
2. Boot the system with RHEL 4.7 kernel 2.6.9-78
  
Actual results:

mptbase: ioc0: RAID STATUS CHANGE for PhysDisk 1
mptbase: ioc0:   PhysDisk is now online, out of sync
mptbase: ioc0: RAID STATUS CHANGE for PhysDisk 1
mptbase: ioc0:   PhysDisk is now initializing, out of sync
mptbase: ioc0: RAID STATUS CHANGE for PhysDisk 1
mptbase: ioc0:   PhysDisk is now online, out of sync
Unable to handle kernel NULL pointer dereference at 0000000000000040 RIP: 
<ffffffffa0044379>{:mptscsi:mptscsih_event_process+183}
PML4 3855c067 PGD 0 
Oops: 0000 [1] SMP 
CPU 0 
Modules linked in: mptctl md5 ipv6 parport_pc lp parport netconsole netdump autofs4 i2c_dev i2c_core ipmi(U) smbus(U) sunrpc ds yenta_socket pcmcia_core cpufreq_powersave dm_mirror joydev dm_multipath dm_mod
 button battery ac uhci_hcd ehci_hcd e1000 floppy sg ext3 jbd ata_piix libata mptscsih mptsas mptspi mptfc mptscsi mptbase sd_mod scsi_mod
Pid: 0, comm: swapper Not tainted 2.6.9-78.ELsmp
RIP: 0010:[<ffffffffa0044379>] <ffffffffa0044379>{:mptscsi:mptscsih_event_process+183}
RSP: 0018:ffffffff80473118  EFLAGS: 00010202
RAX: 0000000000000042 RBX: 000001003e9d02a8 RCX: 000001003e8f2520
RDX: 0000000000000008 RSI: 0000000000000042 RDI: 000001003e9d0000
RBP: 0000000000000008 R08: 0000000000000008 R09: 0000000000000003
R10: 000001003f219a48 R11: 000000000000000a R12: 0000000000000000
R13: 0000000000000001 R14: 000001003e9d0000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff8050d280(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000040 CR3: 0000000000101000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff80510000, task ffffffff803e8880)
Stack: ffffffffa003f138 000000000000000b 000001003eac164c ffffffffa0034ffe 
       0000000100025558 000000010b000246 0000000100030246 000001003eac1630 
       000001003eac2800 000000003f222940 
Call Trace:<IRQ> <ffffffffa0034ffe>{:mptbase:mpt_base_reply+2262} <ffffffffa002d638>{:mptbase:mpt_interrupt+1211} 
       <ffffffff80112ff2>{handle_IRQ_event+41} <ffffffff8011326c>{do_IRQ+197} 
       <ffffffff801108bf>{ret_from_intr+0}  <EOI> <ffffffff8010e88c>{mwait_idle+86} 
       <ffffffff8010e81c>{cpu_idle+26} <ffffffff8051367b>{start_kernel+470} 
       <ffffffff805131d5>{_sinittext+469} 

Code: 49 8b 0c ec 80 4c 1d 24 1a 48 85 c9 75 12 45 31 c9 45 31 c0 
RIP <ffffffffa0044379>{:mptscsi:mptscsih_event_process+183} RSP <ffffffff80473118>
CR2: 0000000000000040

Expected results:

The system boots without a panic.

Additional info:

In drivers/message/fusion/mptbase.c

static int
mpt_base_reply(MPT_ADAPTER *ioc, MPT_FRAME_HDR *mf, MPT_FRAME_HDR *reply)
{
        int freereq = 1;
        u8 func;

        func = reply->u.hdr.Function;

        if (func == MPI_FUNCTION_EVENT_NOTIFICATION) {
                EventNotificationReply_t *pEvReply = (EventNotificationReply_t *) reply;
                int evHandlers = 0;
                int results;

                results = ProcessEventNotification(ioc, pEvReply, &evHandlers);
        [...]


static int
ProcessEventNotification(MPT_ADAPTER *ioc, EventNotificationReply_t *pEventReply, int *evHandlers)
{
        [...]
        case MPI_EVENT_INTEGRATED_RAID:
                mptbase_raid_process_event_data(ioc,
                    (MpiEventDataRaid_t *)pEventReply->Data);
                break;

The problem seems to come from mptbase_raid_process_event_data() in drivers/message/fusion/mptbase.c 

static void
mptbase_raid_process_event_data(MPT_ADAPTER *ioc,
    MpiEventDataRaid_t * pRaidEventData)
{
        [...]
        case MPI_EVENT_RAID_RC_VOLUME_STATUS_CHANGED:
                bus     = pRaidEventData->VolumeBus;
                pMptTarget = ioc->Target_List[bus];
                pTarget = (VirtDevice *)pMptTarget->Target[id];
                if (pTarget) {
                        if ((state == MPI_RAIDVOL0_STATUS_STATE_FAILED) ||
                            (state == MPI_RAIDVOL0_STATUS_STATE_MISSING)) {
                                pTarget->tflags |= MPT_TARGET_FLAGS_DELETED;
                        } else {
                                pTarget->tflags &= ~MPT_TARGET_FLAGS_DELETED;
                        }
                }
                printk(MYIOC_s_INFO_FMT "  volume is now %s%s%s%s\n",
                        ioc->name,


In kernel 2.6.9-67 in RHEL 4.6 (known to work), the code was:
        
        [...]
        case MPI_EVENT_RAID_RC_VOLUME_STATUS_CHANGED:
                printk(MYIOC_s_INFO_FMT "  volume is now %s%s%s%s\n",
                        ioc->name,
        [...]


In RHEL4.7, kernel 2.6.9-78, that portion of code was changed to:

        [...]
        case MPI_EVENT_RAID_RC_VOLUME_STATUS_CHANGED:
                bus     = pRaidEventData->VolumeBus;
                pMptTarget = ioc->Target_List[bus];
                pTarget = (VirtDevice *)pMptTarget->Target[id];
                if ((state == MPI_RAIDVOL0_STATUS_STATE_FAILED) ||
                    (state == MPI_RAIDVOL0_STATUS_STATE_MISSING)) {
                        pTarget->tflags |= MPT_TARGET_FLAGS_DELETED;
                } else {
                        pTarget->tflags &= ~MPT_TARGET_FLAGS_DELETED;
                }
                printk(MYIOC_s_INFO_FMT "  volume is now %s%s%s%s\n",
                        ioc->name,
        [...]
                
This was added as part of Bug #308341 (see attachment https://bugzilla.redhat.com/attachment.cgi?id=228701)

However, MPT Fusion driver 3.12.29.00rh, scheduled for RHEL 4.8, changes this code to:

        [...]
        case MPI_EVENT_RAID_RC_VOLUME_STATUS_CHANGED:
                bus     = pRaidEventData->VolumeBus;
                pMptTarget = ioc->Target_List[bus];
                pTarget = (VirtDevice *)pMptTarget->Target[id];
                if (pTarget) {
                        if ((state == MPI_RAIDVOL0_STATUS_STATE_FAILED) ||
                            (state == MPI_RAIDVOL0_STATUS_STATE_MISSING)) {
                                pTarget->tflags |= MPT_TARGET_FLAGS_DELETED;
                        } else {
                                pTarget->tflags &= ~MPT_TARGET_FLAGS_DELETED;
                        }
                }
                printk(MYIOC_s_INFO_FMT "  volume is now %s%s%s%s\n",
                        ioc->name,
        [...]

[LSI-F 4.8 feat] Update MPT Fusion to version 3.12.29.00rh

    https://bugzilla.redhat.com/show_bug.cgi?id=452163

    https://bugzilla.redhat.com/attachment.cgi?id=314310

i.e. it adds a test that pTarget is not NULL before its flags are modified.

Backporting that patch to 2.6.9-78 fixes the issue as reported by the customer.
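
For illustration only, the guard can be reduced to a small standalone sketch. Everything in it is a simplified stand-in and not the driver's code: struct virt_device, struct target_table, TARGET_FLAGS_DELETED, MAX_TARGETS and mark_volume_state() are hypothetical names. The bounds and table checks are extra precautions added for the sketch; the actual patch only adds the test on pTarget.

```c
#include <stddef.h>

/* Hypothetical stand-in for MPT_TARGET_FLAGS_DELETED. */
#define TARGET_FLAGS_DELETED 0x01u
/* Hypothetical table size, chosen only for this sketch. */
#define MAX_TARGETS 16

/* Simplified stand-in for VirtDevice: only the flags word matters here. */
struct virt_device {
        unsigned int tflags;
};

/* Simplified stand-in for the per-bus target table; empty slots are NULL. */
struct target_table {
        struct virt_device *target[MAX_TARGETS];
};

/*
 * Update the "deleted" flag of target `id` according to the volume state.
 * Returns 1 if a target was updated, 0 if there was nothing to update.
 * Without the NULL test on `dev`, an empty slot would be dereferenced,
 * which is the kind of failure the patch avoids.
 */
int mark_volume_state(struct target_table *tbl, unsigned int id,
                      int failed_or_missing)
{
        struct virt_device *dev;

        /* Extra precautions for the sketch, not part of the patch. */
        if (tbl == NULL || id >= MAX_TARGETS)
                return 0;

        dev = tbl->target[id];
        if (dev == NULL)        /* the guard corresponding to "if (pTarget)" */
                return 0;

        if (failed_or_missing)
                dev->tflags |= TARGET_FLAGS_DELETED;
        else
                dev->tflags &= ~TARGET_FLAGS_DELETED;
        return 1;
}
```

Calling mark_volume_state() on a table whose slot is still NULL returns 0 and touches nothing, which mirrors a RAID event arriving before the volume's target structure exists.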

Comment 4 Martin Wilck 2008-10-08 11:14:07 UTC

Unfortunately, this doesn't seem to be the final solution. While the original problem has been solved, another panic has been seen with a degraded RAID under I/O load. More information to follow.

Comment 5 Martin Wilck 2008-10-08 15:14:19 UTC

Please ignore the last comment for the moment. The new problem is unrelated (though a similar NULL pointer dereference), and is not a regression wrt EL4.6 (occurs with EL4.6 as well).

We will open a new issue for that problem once we have all data.

Comment 6 RHEL Program Management 2008-10-09 12:58:04 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 7 Tomas Henzl 2008-10-09 13:45:49 UTC

Posted today on RHKL.

Comment 8 Larry Troan 2008-10-22 13:43:33 UTC

This bug may be a DUP of bug 452163 in that the refreshed driver planned for 4.8 apparently fixes this problem. 

If we do close this bug as a DUP of the other, we need to propagate the z-stream request and the two issue trackers to that bug.

Comment 9 Andrius Benokraitis 2008-10-22 13:51:27 UTC

I say we keep this bug specific for the z-stream and use bug 452163 that includes this fix for 4.8 (and has the wholesale update of the driver in 4.8). There is no DUP in RHEL 4, remember.

Comment 11 RHEL Program Management 2008-10-22 16:29:27 UTC

This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

Comment 19 Vitaly Mayatskikh 2008-11-05 12:47:03 UTC

Committed in 78.0.8

Comment 23 errata-xmlrpc 2008-11-19 13:45:04 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0972.html
