Created attachment 319236
This patch adds the test "if (pTarget) { /.../ }" as shown below.

Description of problem:
On Primergy systems with LSI SCSI IME 53C1020/1030 the native driver mpt 3.12.19.00rh of RHEL 4.7 x86 and x86_64 panics during boot if a RAID 1 is configured. This problem is new in RHEL 4.7 and was not present in the previous version of the MPT driver for RHEL 4.6, so this is a regression.

Version-Release number of selected component (if applicable):
kernel 2.6.9-78
mpt driver 3.12.19.00rh
Firmware: MPT BIOS 5.13.08, MPT 1.03.48

How reproducible:
100% reproducible by customer

Steps to Reproduce:
1. Configure a RAID1 array on LSI SCSI IME 53C1020/1030
2. Boot the system with RHEL 4.7 kernel 2.6.9-78

Actual results:
mptbase: ioc0: RAID STATUS CHANGE for PhysDisk 1
mptbase: ioc0: PhysDisk is now online, out of sync
mptbase: ioc0: RAID STATUS CHANGE for PhysDisk 1
mptbase: ioc0: PhysDisk is now initializing, out of sync
mptbase: ioc0: RAID STATUS CHANGE for PhysDisk 1
mptbase: ioc0: PhysDisk is now online, out of sync
Unable to handle kernel NULL pointer dereference at 0000000000000040 RIP:
<ffffffffa0044379>{:mptscsi:mptscsih_event_process+183}
PML4 3855c067 PGD 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in: mptctl md5 ipv6 parport_pc lp parport netconsole netdump autofs4 i2c_dev i2c_core ipmi(U) smbus(U) sunrpc ds yenta_socket pcmcia_core cpufreq_powersave dm_mirror joydev dm_multipath dm_mod button battery ac uhci_hcd ehci_hcd e1000 floppy sg ext3 jbd ata_piix libata mptscsih mptsas mptspi mptfc mptscsi mptbase sd_mod scsi_mod
Pid: 0, comm: swapper Not tainted 2.6.9-78.ELsmp
RIP: 0010:[<ffffffffa0044379>] <ffffffffa0044379>{:mptscsi:mptscsih_event_process+183}
RSP: 0018:ffffffff80473118 EFLAGS: 00010202
RAX: 0000000000000042 RBX: 000001003e9d02a8 RCX: 000001003e8f2520
RDX: 0000000000000008 RSI: 0000000000000042 RDI: 000001003e9d0000
RBP: 0000000000000008 R08: 0000000000000008 R09: 0000000000000003
R10: 000001003f219a48 R11: 000000000000000a R12: 0000000000000000
R13: 0000000000000001 R14: 000001003e9d0000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffffffff8050d280(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000040 CR3: 0000000000101000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff80510000, task ffffffff803e8880)
Stack: ffffffffa003f138 000000000000000b 000001003eac164c ffffffffa0034ffe
       0000000100025558 000000010b000246 0000000100030246 000001003eac1630
       000001003eac2800 000000003f222940
Call Trace:<IRQ> <ffffffffa0034ffe>{:mptbase:mpt_base_reply+2262}
       <ffffffffa002d638>{:mptbase:mpt_interrupt+1211}
       <ffffffff80112ff2>{handle_IRQ_event+41}
       <ffffffff8011326c>{do_IRQ+197}
       <ffffffff801108bf>{ret_from_intr+0}
       <EOI> <ffffffff8010e88c>{mwait_idle+86}
       <ffffffff8010e81c>{cpu_idle+26}
       <ffffffff8051367b>{start_kernel+470}
       <ffffffff805131d5>{_sinittext+469}
Code: 49 8b 0c ec 80 4c 1d 24 1a 48 85 c9 75 12 45 31 c9 45 31 c0
RIP <ffffffffa0044379>{:mptscsi:mptscsih_event_process+183} RSP <ffffffff80473118>
CR2: 0000000000000040

Expected results:
System boots without panic.

Additional info:
In drivers/message/fusion/mptbase.c:

    static int
    mpt_base_reply(MPT_ADAPTER *ioc, MPT_FRAME_HDR *mf, MPT_FRAME_HDR *reply)
    {
        int freereq = 1;
        u8 func;

        func = reply->u.hdr.Function;

        if (func == MPI_FUNCTION_EVENT_NOTIFICATION) {
            EventNotificationReply_t *pEvReply = (EventNotificationReply_t *) reply;
            int evHandlers = 0;
            int results;

            results = ProcessEventNotification(ioc, pEvReply, &evHandlers);
    [...]

    static int
    ProcessEventNotification(MPT_ADAPTER *ioc, EventNotificationReply_t *pEventReply, int *evHandlers)
    {
    [...]
        case MPI_EVENT_INTEGRATED_RAID:
            mptbase_raid_process_event_data(ioc,
                (MpiEventDataRaid_t *)pEventReply->Data);
            break;

The problem seems to come from mptbase_raid_process_event_data() in drivers/message/fusion/mptbase.c:

    static void
    mptbase_raid_process_event_data(MPT_ADAPTER *ioc,
        MpiEventDataRaid_t * pRaidEventData)
    {
    [...]
        case MPI_EVENT_RAID_RC_VOLUME_STATUS_CHANGED:
            bus = pRaidEventData->VolumeBus;
            pMptTarget = ioc->Target_List[bus];
            pTarget = (VirtDevice *)pMptTarget->Target[id];
            if (pTarget) {
                if ((state == MPI_RAIDVOL0_STATUS_STATE_FAILED) ||
                    (state == MPI_RAIDVOL0_STATUS_STATE_MISSING)) {
                    pTarget->tflags |= MPT_TARGET_FLAGS_DELETED;
                } else {
                    pTarget->tflags &= ~MPT_TARGET_FLAGS_DELETED;
                }
            }
            printk(MYIOC_s_INFO_FMT " volume is now %s%s%s%s\n",
                ioc->name,

Kernel 2.6.9-67 in RHEL 4.6 (known to work) was:

    [...]
        case MPI_EVENT_RAID_RC_VOLUME_STATUS_CHANGED:
            printk(MYIOC_s_INFO_FMT " volume is now %s%s%s%s\n",
                ioc->name,
    [...]

In RHEL 4.7, kernel 2.6.9-78, that portion of code was changed to:

    [...]
        case MPI_EVENT_RAID_RC_VOLUME_STATUS_CHANGED:
            bus = pRaidEventData->VolumeBus;
            pMptTarget = ioc->Target_List[bus];
            pTarget = (VirtDevice *)pMptTarget->Target[id];
            if ((state == MPI_RAIDVOL0_STATUS_STATE_FAILED) ||
                (state == MPI_RAIDVOL0_STATUS_STATE_MISSING)) {
                pTarget->tflags |= MPT_TARGET_FLAGS_DELETED;
            } else {
                pTarget->tflags &= ~MPT_TARGET_FLAGS_DELETED;
            }
            printk(MYIOC_s_INFO_FMT " volume is now %s%s%s%s\n",
                ioc->name,
    [...]

This was added as part of Bug #308341 (see attachment https://bugzilla.redhat.com/attachment.cgi?id=228701).

But MPT Fusion driver 3.12.29.00rh scheduled for 4.8 has this code changed to:

    [...]
        case MPI_EVENT_RAID_RC_VOLUME_STATUS_CHANGED:
            bus = pRaidEventData->VolumeBus;
            pMptTarget = ioc->Target_List[bus];
            pTarget = (VirtDevice *)pMptTarget->Target[id];
            if (pTarget) {
                if ((state == MPI_RAIDVOL0_STATUS_STATE_FAILED) ||
                    (state == MPI_RAIDVOL0_STATUS_STATE_MISSING)) {
                    pTarget->tflags |= MPT_TARGET_FLAGS_DELETED;
                } else {
                    pTarget->tflags &= ~MPT_TARGET_FLAGS_DELETED;
                }
            }
            printk(MYIOC_s_INFO_FMT " volume is now %s%s%s%s\n",
                ioc->name,
    [...]

[LSI-F 4.8 feat] Update MPT Fusion to version 3.12.29.00rh
https://bugzilla.redhat.com/show_bug.cgi?id=452163
https://bugzilla.redhat.com/attachment.cgi?id=314310

That is, it adds a check that pTarget is not NULL before it is dereferenced. Backporting that patch to 2.6.9-78 fixes the issue as reported by the customer.
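For illustration only, here is a minimal stand-alone sketch of the guard pattern that the 3.12.29.00rh code uses. It is not driver code: the sketch_* types, the SKETCH_* constants and the function name are hypothetical stand-ins for VirtDevice, the target list and MPT_TARGET_FLAGS_DELETED, and it assumes only that a target slot may still be NULL when the volume status event is handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's types; the real structures in
 * drivers/message/fusion carry many more fields. */
#define SKETCH_FLAG_DELETED 0x01u  /* stands in for MPT_TARGET_FLAGS_DELETED */
#define SKETCH_MAX_TARGETS  16     /* arbitrary size chosen for the sketch */

struct sketch_target {
    unsigned int tflags;
};

struct sketch_target_list {
    /* A slot is NULL when no device is attached for that id. */
    struct sketch_target *target[SKETCH_MAX_TARGETS];
};

/*
 * Set or clear the "deleted" flag for one target id.
 * Returns 1 if a target was present and updated, 0 if the id was out of
 * range or the slot was empty. The empty slot is the case that must not
 * be dereferenced.
 */
static int sketch_volume_status_changed(struct sketch_target_list *list,
                                        unsigned int id, int volume_gone)
{
    struct sketch_target *t;

    assert(list != NULL);
    if (id >= SKETCH_MAX_TARGETS)
        return 0;

    t = list->target[id];
    if (!t)                 /* the NULL test that 2.6.9-78 was missing */
        return 0;

    if (volume_gone)
        t->tflags |= SKETCH_FLAG_DELETED;
    else
        t->tflags &= ~SKETCH_FLAG_DELETED;
    return 1;
}
```

Calling sketch_volume_status_changed() on an empty slot simply returns 0 and touches no memory, whereas the unguarded 2.6.9-78 code would have read tflags through the NULL pointer.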
Unfortunately, this doesn't seem to be the final solution. While the original problem has been solved, another panic has been seen with a degraded RAID under I/O load. More information to follow.
Please ignore the last comment for the moment. The new problem is unrelated (though a similar NULL pointer dereference), and is not a regression wrt EL4.6 (occurs with EL4.6 as well). We will open a new issue for that problem once we have all data.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Posted today on RHKL.
This bug may be a DUP of bug 452163 in that the refreshed driver planned for 4.8 apparently fixes this problem. If we do close this bug as a DUP of the other, we need to propagate the z-stream request and the two issue trackers to this bug.
I say we keep this bug specific for the z-stream and use bug 452163 that includes this fix for 4.8 (and has the wholesale update of the driver in 4.8). There is no DUP in RHEL 4, remember.
This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.
Committed in 78.0.8
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2008-0972.html