577178 – megaraid_sas: fix physical disk handling

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.8
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	rc
Target Release:	4.9
Assignee:	Tomas Henzl
QA Contact:	Gris Ge
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	563086
Depends On:
Blocks:	631903

Reported:	2010-03-26 11:55 UTC by Bryn M. Reeves
Modified:	2018-10-27 13:58 UTC (History)
CC List:	15 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	A bug was found in the way the megaraid_sas driver handled physical disks and management IOCTLs. All physical disks were exported to the disk layer, allowing an oops in megasas_complete_cmd_dpc() when completing the IOCTL command if a timeout occurred.
Clone Of:
Clones:	602714
Environment:
Last Closed:	2011-02-16 15:24:53 UTC
Target Upstream Version:
Embargoed:

Attachments
backport commit 147aab6aa22ce7775be944f8fb9932aa000dda61 to RHEL4 (1.32 KB, patch) 2010-03-26 12:12 UTC, Bryn M. Reeves
patch from LSI in Comment #40, refreshed against RHEL4.8.z (2.6.9-89.0.27.EL) (933 bytes, patch) 2010-07-19 07:27 UTC, Mark Goodwin
add OCR support to megaraid sas driver in rhel4.8.z (51.17 KB, patch) 2010-08-31 04:59 UTC, bo yang

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2011:0263	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 4.9 kernel security and bug fix update	2011-02-16 15:14:55 UTC

Description Bryn M. Reeves 2010-03-26 11:55:29 UTC

Description of problem:
The megaraid_sas driver in RHEL4 has a problem with handling physical disks and management ioctls; all physical disks are exported to the disk layer allowing an oops in megasas_complete_cmd_dpc when completing the ioctl command if a timeout occurs.

The megasas_mgmt_fw_ioctl() function constructs a megasas_cmd struct with a null cmd->scmd field and hands this to the adapter via megasas_issue_blocked_cmd() (setting cmd->sync_cmd to 1 to prevent the ISR from completing the command to the mid-layer), e.g.:

crash-4.0-6.3> struct megasas_cmd 0000010037dd4980 | less
 struct megasas_cmd {
   frame = 0x10000051800,
   frame_phys_addr = 333824,
   sense = 0x1000004cb00 "",
   sense_phys_addr = 314112,
   index = 566,
   sync_cmd = 0 '\0', <-- cleared by megasas_mgmt_fw_ioctl() after timeout
   cmd_status = 61 '=',
   abort_aen = 0,
   list = {
     next = 0x10037dd4228,
     prev = 0x100052916a8
   },
   scmd = 0x0, <-- scmd is NULL 
   instance = 0x10237439248,
   frame_count = 2
 }

Once submitted the driver uses wait_event_timeout to wait for the command to complete. If the timeout fires sync_cmd is cleared and megasas_complete_cmd() is called to complete the command.

megasas_complete_cmd(struct megasas_instance *instance, struct megasas_cmd *cmd,
                      u8 alt_status)
 {
[...]
                 /*
                  * MFI_CMD_PD_SCSI_IO and MFI_CMD_LD_SCSI_IO could have been
                  * issued either through an IO path or an IOCTL path. If it
                  * was via IOCTL, we will send it to internal completion.
                  */
                 if (cmd->sync_cmd) {
                         cmd->sync_cmd = 0;
                         megasas_complete_int_cmd(instance, cmd);
                         break;
                 }

                 /*
                  * Don't export physical disk devices to mid-layer.
                  */
                 if (!MEGASAS_IS_LOGICAL(cmd->scmd) && ***** crash *****
                     (hdr->cmd_status == MFI_STAT_OK) &&
                     (cmd->scmd->cmnd[0] == INQUIRY)) {

                         if (((*(u8 *) cmd->scmd->request_buffer) & 0x1F) ==
                             TYPE_DISK) {
                                 cmd->scmd->result = DID_BAD_TARGET << 16;
                                 exception = 1;
                         }
                 }
[...]

Since sync_cmd is already 0 the code proceeds to MEGASAS_IS_LOGICAL(cmd->scmd) and oopses on the NULL cmd->scmd member.
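The sequence above can be modelled in user space. The following is a minimal sketch, not driver code: the `mock_` structures and functions are hypothetical stand-ins that keep only the two fields involved (`sync_cmd` and `scmd`), and the NULL check is shown purely to mark the point where the unguarded driver dereferences `cmd->scmd`. It is not the fix that was adopted for this bug.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the driver structures. */
struct mock_scsi_cmnd {
    int result;
};

struct mock_megasas_cmd {
    unsigned char sync_cmd;       /* 1 while an ioctl waiter owns the command */
    struct mock_scsi_cmnd *scmd;  /* NULL for commands issued via ioctl */
};

enum mock_path {
    MOCK_PATH_INTERNAL,  /* handed back to the ioctl waiter */
    MOCK_PATH_MIDLAYER,  /* completed through cmd->scmd */
    MOCK_PATH_DROPPED    /* late completion with no owner, ignored */
};

/* Models the ioctl path giving up after the timeout: sync_cmd is cleared. */
void mock_ioctl_timeout(struct mock_megasas_cmd *cmd)
{
    assert(cmd != NULL);
    cmd->sync_cmd = 0;
}

/* Models the completion path for a PD/LD SCSI I/O frame. */
enum mock_path mock_complete_cmd(struct mock_megasas_cmd *cmd)
{
    assert(cmd != NULL);

    if (cmd->sync_cmd) {
        /* Normal ioctl completion: wake the waiter. */
        cmd->sync_cmd = 0;
        return MOCK_PATH_INTERNAL;
    }

    /*
     * Illustrative guard only. Without it, a completion that arrives
     * after the timeout reaches the line below with scmd == NULL,
     * which corresponds to the oops described in this report.
     */
    if (cmd->scmd == NULL)
        return MOCK_PATH_DROPPED;

    cmd->scmd->result = 0;
    return MOCK_PATH_MIDLAYER;
}
```

Calling mock_ioctl_timeout() and then mock_complete_cmd() on a command whose scmd is NULL reproduces the ordering that leads to the crash; in this model the guard turns it into a dropped completion.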

Upstream deleted much of the above code in the following commit:

Chandra_Nelogal noticed that megaraid_sas currently exports all physical
disks normally to the disk layer, which is obviously quite bad.

The problems is that megaraid_sas is doing inquiry sniffing, and since
2.6.15 inquiry commands are sent down as one-element scatterlists on
which the code in the driver doesn't work anymore.  The right place to
keep the scsi midlayer from attaching to a device is the slave_alloc
method in the host template.  To completely prevent attaching the method
needs to return -ENXIO, but the patch below sets the no_uld_attach flag
instead which prevents upper level drivers from attaching while still
allowing scsi generic access to it, as in other raid HBA drivers.

commit 147aab6aa22ce7775be944f8fb9932aa000dda61
Author: Christoph Hellwig <hch>
Date:   Fri Feb 17 12:13:48 2006 +0100

    [SCSI] megaraid_sas: fix physical disk handling
    
    This patch hides the devices completely from the midlayer instead.
    It requires the patch to handle the slave_configure failure I posted
    earlier.
    
    Signed-off-by: Christoph Hellwig <hch>
    Signed-off-by: James Bottomley <James.Bottomley>
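The two options named in the commit message (returning -ENXIO to hide the device entirely, or setting no_uld_attach so that only scsi generic access remains) can be sketched in user space as follows. This is an assumption-laden model, not the upstream hunk: `mock_scsi_device`, `mock_slave_hook()`, the policy enum and the `MOCK_MAX_PD_CHANNELS` channel split are all hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MOCK_TYPE_DISK        0x00  /* SCSI peripheral device type: direct access */
#define MOCK_MAX_PD_CHANNELS  2     /* assumed: channels below this carry physical disks */

/* Hypothetical stand-in for the fields of struct scsi_device used here. */
struct mock_scsi_device {
    unsigned int channel;
    int type;
    unsigned int no_uld_attach;  /* 1: upper-level drivers must not attach */
};

enum mock_pd_policy {
    MOCK_PD_HIDE,          /* refuse the device: return -ENXIO */
    MOCK_PD_NO_ULD_ATTACH  /* keep the device, block upper-level drivers */
};

/* Models a host-template hook that decides what to do with a new device. */
int mock_slave_hook(struct mock_scsi_device *sdev, enum mock_pd_policy policy)
{
    assert(sdev != NULL);

    int is_physical_disk = sdev->channel < MOCK_MAX_PD_CHANNELS &&
                           sdev->type == MOCK_TYPE_DISK;

    if (!is_physical_disk)
        return 0;                 /* logical drives are exported normally */

    if (policy == MOCK_PD_HIDE)
        return -ENXIO;            /* device is not attached at all */

    sdev->no_uld_attach = 1;      /* still reachable through scsi generic */
    return 0;
}
```

Either policy keeps the disk driver away from physical disks at attach time, which is what makes the inquiry sniffing in the completion path unnecessary.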



Version-Release number of selected component (if applicable):
2.6.9-89.EL

How reproducible:
Difficult - requires the megasas_issue_blocked_cmd timeout to fire. The system where this was observed was under severe memory pressure at the time of the crash.

Steps to Reproduce:
1. Issue management ioctls to disk devices on megaraid_sas controller
2. Generate high system / I/O load to try to provoke the timeout (may be triggerable with the SCSI fault injection framework, not tested).

Actual results:
Oops in megasas_complete_cmd_dpc

Expected results:
No oops even under severe load

Additional info:
Fixed in commit 147aab6aa22ce7775be944f8fb9932aa000dda61

Comment 1 Bryn M. Reeves 2010-03-26 12:03:27 UTC

Current RHEL4 version of the driver seems to have part of the changes from commit 147aab6aa22ce7775be944f8fb9932aa000dda61 applied (along with some later changes) but does not include the 3rd hunk that removes the code from megasas_complete_cmd():

$ grep 'export physical' ../../*megar*
../../linux-2.6.9-megaraid-sas.patch:+           * Don't export physical disk devices to mid-layer.

Comment 2 Bryn M. Reeves 2010-03-26 12:12:34 UTC

Created attachment 402822 [details]
backport commit 147aab6aa22ce7775be944f8fb9932aa000dda61 to RHEL4

The attached patch adds the missing hunks from the upstream commit. The RHEL4 driver already has a megasas_configure_slave(), but it was only used to set the extended timeouts for the RAID fw (backport of commit e5b3a65fd7244e662691cf617145983ecde28cc9) and was missing the corresponding changes to megasas_complete_cmd() that prevent the export of physical disks and the oops reported in the management ioctl functions.

There have been several later changes to the driver upstream in particular:

commit 044833b572b96afe91506a0edec42efd84ba4939
Author: Yang, Bo <Bo.Yang>
Date:   Tue Oct 6 14:33:06 2009 -0600

    [SCSI] megaraid_sas: report system PDs to OS
    
    When OS issue inquiry, it will check driver's internal pd_list.
    
    Signed-off-by Bo Yang<bo.yang>
    Signed-off-by: James Bottomley <James.Bottomley>

It may be desirable to take these in RHEL4 as well, although I'm not familiar enough with the megaraid code to say for sure, and we have no reports of problems resulting from this.

Comment 5 Bryn M. Reeves 2010-04-28 16:14:33 UTC

With the changes in comment #2 the reporter is getting a different crash:

Unable to handle kernel NULL pointer dereference at 00000000000001d0 RIP:
<ffffffffa002de41>{:megaraid_sas:megasas_complete_cmd_dpc+204}
PML4 79dd0067 PGD 7a24c067 PMD 776c0067 PTE 0
Oops: 0002 [1] SMP
CPU 0
Modules linked in: md5 ipv6 parport_pc lp parport mptctl mptbase autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_co
re cpufreq_powersave ide_dump scsi_dump diskdump zlib_deflate dm_mirror dm_mod button battery ac joydev uhci_hcd ehci_hcd
hw_random igb inet_lro sr_mod sg ext3 jbd ata_piix libata megaraid_sas(U) sd_mod scsi_mod
Pid: 0, comm: swapper Not tainted 2.6.9-89.EL.0.it604473smp
RIP: 0010:[<ffffffffa002de41>] <ffffffffa002de41>{:megaraid_sas:megasas_complete_cmd_dpc+204}
RSP: 0018:ffffffff8046dde8  EFLAGS: 00010046
RAX: 0000000000000000 RBX: 000001007c151800 RCX: 0000000000000246
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 000001007d3624b8
RBP: 000001007c0c7600 R08: ffffffff80508000 R09: 0000000000008000
R10: 0000000000008000 R11: 0000000000000000 R12: 000001007d362448
R13: 0000000000000167 R14: 0000000000000168 R15: 0000000000000246
FS:  0000000000000000(0000) GS:ffffffff80504500(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000000001d0 CR3: 0000000000101000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff80508000, task ffffffff803e1f00)
Stack: 0000000000000000 000001007d362580 0000000000000000 000000000000000a
      0000000000000000 ffffffff80509f08 0000000000000000 ffffffff8013dbc4
      ffffffff80504500 0000000000000001
Call Trace:<IRQ> <ffffffff8013dbc4>{tasklet_action+103} <ffffffff8013d864>{__do_softirq+88}
      <ffffffff8013d90d>{do_softirq+49} <ffffffff801132f3>{do_IRQ+328}
      <ffffffff801108c3>{ret_from_intr+0}  <EOI> <ffffffff8010e88c>{mwait_idle+86}
      <ffffffff8010e81c>{cpu_idle+26} <ffffffff8050b687>{start_kernel+470}
      <ffffffff8050b1e1>{_sinittext+481}

Code: c7 80 d0 01 00 00 00 00 00 00 e9 99 00 00 00 0f b6 43 03 48
RIP <ffffffffa002de41>{:megaraid_sas:megasas_complete_cmd_dpc+204} RSP <ffffffff8046dde8>
CR2: 00000000000001d0

crash> struct megasas_cmd 0x000001007c0c7600
struct megasas_cmd {
 frame = 0x1007c151800,
 frame_phys_addr = 2081757184,
 sense = 0x1007c14f500 "",
 sense_phys_addr = 2081748224,
 index = 362,
 sync_cmd = 0 '\0',
 cmd_status = 61 '=',
 abort_aen = 0,
 list = {
   next = 0x1007c0c75a8,
   prev = 0x1007c0c76a8
 },
 scmd = 0x0,
 instance = 0x1007d362448,
 frame_count = 1
}

crash> union megasas_frame 0x1007c151800
union megasas_frame {
 hdr = {
   cmd = 4 '\004',
   sense_len = 32 ' ',
   cmd_status = 0 '\0',
   scsi_status = 0 '\0',
   target_id = 15 '\017',
   lun = 0 '\0',
   cdb_len = 6 '\006',
   sge_count = 1 '\001',
   context = 362,
   pad_0 = 0,
   flags = 16,
   timeout = 180,
   data_xferlen = 312
 },

static void
megasas_complete_cmd(struct megasas_instance *instance, struct megasas_cmd *cmd,
                    u8 alt_status)
{
   :
       case MFI_CMD_PD_SCSI_IO:
       case MFI_CMD_LD_SCSI_IO:

               /*
                * MFI_CMD_PD_SCSI_IO and MFI_CMD_LD_SCSI_IO could have been
                * issued either through an IO path or an IOCTL path. If it
                * was via IOCTL, we will send it to internal completion.
                */
               if (cmd->sync_cmd) {
                       cmd->sync_cmd = 0;
                       megasas_complete_int_cmd(instance, cmd);
                       break;
               }

       case MFI_CMD_LD_READ:
       case MFI_CMD_LD_WRITE:
   :
               switch (hdr->cmd_status) {

               case MFI_STAT_OK:
                       cmd->scmd->result = DID_OK << 16; ***** crash *****
                       break;

               case MFI_STAT_SCSI_IO_FAILED:
               case MFI_STAT_LD_INIT_IN_PROGRESS:
   :
}

Comment 6 Bryn M. Reeves 2010-04-28 16:23:37 UTC

Created attachment 409895 [details]
proposed patch from partner for panic in comment #5

Fix timeout handling for megasas_mgmt_fw_ioctl():

- make megasas_issue_blocked_cmd check for a timeout condition and set cmd->cmd_status = ETIME

- ignore completions in megasas_complete_cmd() with cmd->cmd_status == ETIME

- have megasas_mgmt_fw_ioctl() check for cmd_status == ETIME and return -ETIME

This is working in the reporter's tests but has not yet been submitted upstream. The patch seems fine to me although I did wonder if -ETIME was the best return value for ioctl(2)?
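The three steps listed above can be sketched in user space as follows. This is a minimal model under stated assumptions, not the partner's patch: the `etime_` names are hypothetical, the firmware response is passed in as a parameter instead of arriving asynchronously, and ENODATA is used as the "still pending" marker only because the driver's wait condition compares cmd_status against it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields of struct megasas_cmd used here. */
struct etime_cmd {
    int cmd_status;            /* ENODATA while pending, ETIME after a timeout */
    int completions_delivered; /* how many completions were acted upon */
};

/*
 * Step 1: issue the command and wait. fw_answered == 0 models the
 * wait timing out before the firmware responds.
 */
void etime_issue_blocked_cmd(struct etime_cmd *cmd, int fw_answered, int fw_status)
{
    assert(cmd != NULL);
    cmd->cmd_status = ENODATA;

    if (fw_answered)
        cmd->cmd_status = fw_status;

    if (cmd->cmd_status == ENODATA)
        cmd->cmd_status = ETIME;  /* mark the command as timed out */
}

/* Step 2: the completion path ignores commands already marked ETIME. */
int etime_complete_cmd(struct etime_cmd *cmd)
{
    assert(cmd != NULL);
    if (cmd->cmd_status == ETIME)
        return 0;                 /* late completion, ignored */

    cmd->completions_delivered++;
    return 1;
}

/* Step 3: the ioctl entry point reports the timeout to the caller. */
int etime_mgmt_fw_ioctl(struct etime_cmd *cmd, int fw_answered, int fw_status)
{
    etime_issue_blocked_cmd(cmd, fw_answered, fw_status);
    if (cmd->cmd_status == ETIME)
        return -ETIME;
    return 0;
}
```

In this model a completion that arrives after the timeout never touches the command again, which is the property the proposed patch relies on.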

Comment 9 Andrius Benokraitis 2010-05-20 14:25:24 UTC

Bo, can you take a look at this issue and the proposed patch? Let us know if this is upstream already and a known issue...

Comment 10 Bryn M. Reeves 2010-05-20 14:43:09 UTC

Last I heard from the author of the patch in comment #6 was that they were about to submit it upstream but I've not seen it go by on the lists yet.

Comment 11 Andrius Benokraitis 2010-05-20 14:50:16 UTC

Still would like to hear from LSI on this directly...

Comment 12 Bryn M. Reeves 2010-05-20 16:01:32 UTC

Actually this was posted upstream a couple of weeks ago:

http://www.spinics.net/lists/linux-scsi/msg43355.html

Comment 14 Tomas Henzl 2010-05-24 11:12:00 UTC

Bo,
this issue was discussed with LSI (probably with you), with the result being a backport of these postings:
> [PATCH 1/7] scsi: megaraid_sas - Online controller Reset Support
http://marc.info/?l=linux-scsi&m=127315824530949&w=2
> [PATCH 2/7] scsi: megaraid_sas - Online controller Reset Support
http://marc.info/?l=linux-scsi&m=127316050902626&w=2
> [PATCH 3/7] scsi: megaraid_sas - Online COntroller Reset (OCR)
http://marc.info/?l=linux-scsi&m=127316118904104&w=2
> [PATCH 4/7] scsi: megaraid_sas - support devices update flag
http://marc.info/?l=linux-scsi&m=127316186005556&w=2
> [PATCH 5/7] scsi: megaraid_sas - Add input parameter for
http://marc.info/?l=linux-scsi&m=127316264807259&w=2
> [PATCH 6/7] scsi: megaraid_sas - Add three times Online controller
http://marc.info/?l=linux-scsi&m=127316366808895&w=2
> [PATCH 7/7] scsi: megaraid_sas - Version and documentation update
http://marc.info/?l=linux-scsi&m=127316389509269&w=2

The patch has changed wait_event_timeout() in megasas_issue_blocked_cmd() to wait_event(), so the phenomenon we saw in this issue won't occur anymore.  And, megaraid_sas firmware takes care of ioctl commands and makes sure that they will be returned, so waiting-forever-at-the-wait_event() situation can be avoided.  If something wrong happens to the firmware, the megaraid_sas driver will do OCR (Online Controller Reset), which also was introduced in the patch series, to recover it.
-------------------------
From my point of view the patches to RHEL4 should be as small as possible, so I'm also fine with the solution from comment#12:
http://www.spinics.net/lists/linux-scsi/msg43355.html
-------------------------

With two possibilities, which one do you (LSI) prefer? And please port the preferred one to RHEL4. Thanks.

Comment 15 bo yang 2010-05-24 14:46:08 UTC

The latest patches we submitted to the kernel already fix this issue. In the OCR support patches, the driver changed wait_event_timeout to wait_event, which fixes this issue.  The root cause is that, for some special encl and HD, the application cmds take too long.  By the time the fw returns those application cmds to the driver, the driver has already timed out and returned the cmds to the cmd pool, which causes one cmd to be used twice.  Changing to wait_event, together with OCR support, fixes this double use of cmds.

Regards,

Bo Yang
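The double use described in this comment can be illustrated with a small user-space model of a command pool. It is only a sketch: `pool_sim` and its helpers are hypothetical, slots are identified by index, and an integer owner id stands in for whoever currently holds the command.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4
#define POOL_FREE (-1)

/* Hypothetical command pool: owner[i] is the current holder of slot i. */
struct pool_sim {
    int owner[POOL_SIZE];
};

void pool_init(struct pool_sim *p)
{
    assert(p != NULL);
    for (int i = 0; i < POOL_SIZE; i++)
        p->owner[i] = POOL_FREE;
}

/* Hands out the first free slot, or -1 if the pool is exhausted. */
int pool_get(struct pool_sim *p, int owner_id)
{
    assert(p != NULL);
    for (int i = 0; i < POOL_SIZE; i++) {
        if (p->owner[i] == POOL_FREE) {
            p->owner[i] = owner_id;
            return i;
        }
    }
    return -1;
}

/* Returns a slot to the pool, as the driver does after a timeout. */
void pool_put(struct pool_sim *p, int idx)
{
    assert(p != NULL && idx >= 0 && idx < POOL_SIZE);
    p->owner[idx] = POOL_FREE;
}

/* Which owner receives a firmware completion that names slot idx? */
int pool_completion_target(const struct pool_sim *p, int idx)
{
    assert(p != NULL && idx >= 0 && idx < POOL_SIZE);
    return p->owner[idx];
}
```

If an ioctl takes a slot, times out and returns it, and a second request then takes the same slot, a late firmware completion for the first command lands on the second owner. Waiting without a timeout avoids this because the slot is never returned while the firmware still holds it.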

Comment 18 Andrius Benokraitis 2010-06-10 13:53:41 UTC

This looks to be included in bug 564249, which is a wholesale driver update for RHEL 5.6.

Comment 19 Bryn M. Reeves 2010-06-10 14:01:55 UTC

Note that this is for RHEL4

Comment 20 bo yang 2010-06-10 17:02:10 UTC

I am preparing the rhel4.9 patches.  It will be submitted in next few days.

Comment 21 bo yang 2010-06-14 15:01:53 UTC

Our FTS engineer is talking with FTS to find out what needs to be submitted.  We will submit the changes as soon as this is finalized.

Comment 22 Martin Wilck 2010-06-22 14:01:46 UTC

(In reply to comment #15)
> The latest patches we submited to kernel already fixed this issue. In the
> patches of OCR support, driver changed wait_event_timeout to wait_event which
> will fix this issue. 

Does that mean that this changeset from PATCH 1/7 would be sufficient to fix the problem?

@@ -599,8 +789,7 @@ megasas_issue_blocked_cmd(struct megasas
        instance->instancet->fire_cmd(instance,
                        cmd->frame_phys_addr, 0, instance->reg_set);

-       wait_event_timeout(instance->int_cmd_wait_q, (cmd->cmd_status != ENODATA),
-               MEGASAS_INTERNAL_CMD_WAIT_TIME*HZ);
+       wait_event(instance->int_cmd_wait_q, cmd->cmd_status != ENODATA);

        return 0;
 }
@@ -648,8 +837,8 @@ megasas_issue_blocked_abort_cmd(struct m
        /*
         * Wait for this cmd to complete
         */
-       wait_event_timeout(instance->abort_cmd_wait_q, (cmd->cmd_status != 0xFF),
-               MEGASAS_INTERNAL_CMD_WAIT_TIME*HZ);
+       wait_event(instance->abort_cmd_wait_q, cmd->cmd_status != 0xFF);
+       cmd->sync_cmd = 0;

        megasas_return_cmd(instance, cmd);
        return 0;

Comment 23 Mark Goodwin 2010-07-02 06:01:02 UTC

(In reply to comment #22)
> (In reply to comment #15)
> > The latest patches we submited to kernel already fixed this issue. In the
> > patches of OCR support, driver changed wait_event_timeout to wait_event which
> > will fix this issue. 
> 
> Does that mean that this changeset from PATCH 1/7 would be sufficient to fix
> the problem?
> 
> -       wait_event_timeout(instance->int_cmd_wait_q, (cmd->cmd_status !=
> ENODATA),
> -               MEGASAS_INTERNAL_CMD_WAIT_TIME*HZ);
> +       wait_event(instance->int_cmd_wait_q, cmd->cmd_status != ENODATA);

Looks like a minimal patch to change from wait_event_timeout()
to wait_event() should avoid the crash later on where the driver
dereferences a NULL cmd->scmd in megasas_complete_cmd(). However,
it might hang too for all the cases where wait_event_timeout()
has been timing out after 180 seconds, e.g. the reported test case
when processing an ioctl under heavy load.

So is a hang more or less evil than a crash? :)

For 5.3.z, IMO the full driver update (as per BZ 564249 for 5.6)
would be a better solution - and more supportable by LSI. But a
full driver update is not something that can be done in a z-stream
patch.
 
Maybe we can keep the timeout, but make it longer, and then protect the
code that currently derefs the NULL pointer in megasas_complete_cmd()?
Either way, this would need a reproducible test case to demonstrate
that the chosen fix is viable.

Cheers
-- Mark Goodwin (GSS/SEG)

Comment 24 Dwight (Bud) Brown 2010-07-02 23:52:14 UTC

Pulled 5.6 4.17 version of the driver back into 5.3z - won't compile as it uses new pci enablement routines(?) that are not present in 5.3/5.3z so pull-back is not an option without also adding in new pci functionality or removing it from the 5.6 driver.

Received the following compile-time errors; did a search for pci_enable_device_mem, for example. It's not present in 5.3, but is in 5.6, etc.

drivers/scsi/megaraid/megaraid_sas.c: In function 'megasas_init_mfi':
drivers/scsi/megaraid/megaraid_sas.c:2698: error: implicit declaration of function 'pci_request_selected_regions'
drivers/scsi/megaraid/megaraid_sas.c:2699: error: implicit declaration of function 'pci_select_bars'
drivers/scsi/megaraid/megaraid_sas.c:2849: error: implicit declaration of function 'pci_release_selected_regions'
drivers/scsi/megaraid/megaraid_sas.c: In function 'megasas_probe_one':
drivers/scsi/megaraid/megaraid_sas.c:3179: error: implicit declaration of function 'pci_enable_device_mem'
make[3]: *** [drivers/scsi/megaraid/megaraid_sas.o] Error 1
make[2]: *** [drivers/scsi/megaraid] Error 2

Bug 602714 contains a 5.3z patch from LSI that has been merged into a current 5.3z test sandbox and is going through brew at the moment.

Comment 30 bo yang 2010-07-15 12:44:34 UTC

Can you apply only the following changes to see whether they fix this issue, if the customer uses only the PPC controller?

@@ -599,8 +789,7 @@ megasas_issue_blocked_cmd(struct megasas
        instance->instancet->fire_cmd(instance,
                        cmd->frame_phys_addr, 0, instance->reg_set);

-       wait_event_timeout(instance->int_cmd_wait_q, (cmd->cmd_status != ENODATA),
-               MEGASAS_INTERNAL_CMD_WAIT_TIME*HZ);
+       wait_event(instance->int_cmd_wait_q, cmd->cmd_status != ENODATA);

        return 0;
 }
@@ -648,8 +837,8 @@ megasas_issue_blocked_abort_cmd(struct m
        /*
         * Wait for this cmd to complete
         */
-       wait_event_timeout(instance->abort_cmd_wait_q, (cmd->cmd_status != 0xFF),
-               MEGASAS_INTERNAL_CMD_WAIT_TIME*HZ);
+       wait_event(instance->abort_cmd_wait_q, cmd->cmd_status != 0xFF);
+       cmd->sync_cmd = 0;

Thanks,

Bo Yang

Comment 31 Mark Goodwin 2010-07-19 07:27:32 UTC

Created attachment 432780 [details]
patch from LSI in Comment #40, refreshed against RHEL4.8.z (2.6.9-89.0.27.EL)


This is a minimal patch for RHEL4.8.z and RHEL4.9, as proposed by LSI in Comment #40

Comment 32 Mark Goodwin 2010-07-19 07:34:59 UTC

[sorry, in Comment 31, I meant to refer to Comment 30 (not Comment 40)]

A test kernel built from RHEL4.8.z + the patch in Comment 31 is available from:
http://people.redhat.com/mgoodwin/BZ577178/
This test kernel is version 2.6.9-89.0.27.EL.BZ577178. I've booted it up
in a RHEL4.8 VM and modprobed the megaraid_sas kmod, but don't have the
h/w to test it any further.

Cheers
-- Mark Goodwin

Comment 33 Issue Tracker 2010-07-22 03:21:40 UTC

Event posted on 07-22-2010 12:21pm JST by moshiro

Hi Mark,
Following comment is from FJ:

==============================================================
Mark,

> http://people.redhat.com/mgoodwin/BZ577178/

Thank you for the test packages.  We believe they don't have the OCR
feature.  Actually, Fujitsu talked with LSI and requested a fix including
the OCR feature for 4.9 and 4.8.z.  (Please see bug 602714.)  Therefore, a
new fix must be provided to you from LSI soon.  Fujitsu will be waiting for
new test packages with the OCR.

Kei Tokunaga
============================================================== 

Best Regards,
Moritoshi


This event sent from IssueTracker by moshiro 
 issue 604473

Comment 34 Issue Tracker 2010-07-27 03:43:33 UTC

Event posted on 2010-07-27 12:43 JST by myamazak

Hi all,

I'll forward a comment on I-T from FJ.
----------------------------------------------------------------------
Here is a summary of this issue.

- Fujitsu requested LSI to provide a fix with OCR and they acknowledged
it.

Kei Tokunaga
----------------------------------------------------------------------

Regards,
M Yamazaki



This event sent from IssueTracker by myamazak 
 issue 604473

Comment 35 bo yang 2010-07-28 14:27:22 UTC

To implement the OCR support for rhel5.4, 5.5, 4.8 and 4.9.  We are waiting for the feedback from FJ.

Comment 36 Mark Goodwin 2010-07-30 22:43:56 UTC

(In reply to comment #35)
> To implement the OCR support for rhel5.4, 5.5, 4.8 and 4.9.  We are waiting for
> the feedback from FJ.    

Hi Bo, can you please elaborate - exactly what feed-back from FJ
do you need?

Thanks
-- Mark

Comment 37 bo yang 2010-08-04 18:20:14 UTC

The latest message I get from our program management (PM) team: 

Our PM team is talking with Fujitsu for the porting as well as testing.  I believe our PM team will provide the schedule to Fujitsu and our team (dev team) to finish all the requests for Fujitsu.


Regards,

Bo Yang

Comment 38 Moritoshi Oshiro 2010-08-05 12:33:11 UTC

Hi Bo-san,

Fujitsu have questions. Could you please reply?

---
> Our PM team is talking with Fujitsu for the porting as well as testing.

Do you mean that the PM team has already been talking with some
Fujitsu people, or will talk to Fujitsu in the future?

If the former is the case, who are they talking with?  Fujitsu
Japan?  FTS?


> I believe our PM team will provide the schedule to Fujitsu and our team (dev team) to finish all the requests for Fujitsu.

What does the schedule cover?
Is it a schedule for completing the ports to 5.5.z, 5.4.z, 4.8.z and 4.9?

Best Regards,
Masahiro Maeda
---
Best Regards,
Moritoshi Oshiro

Comment 39 bo yang 2010-08-05 14:04:03 UTC

I was told they are talking with Fujitsu people to schedule the back porting to 5.4z, 5.5z, 4.8z and 4.9z.

I will find out who they are talking with and keep the post.

Our team is waiting for the timeline for the porting.

Bo Yang

Comment 40 Issue Tracker 2010-08-10 02:11:49 UTC

Event posted on 08-10-2010 11:11am JST by moshiro

Dear Bo-san,

Here is the reply from Fujitsu:
---
Thank you for the information.

---
I will find out who they are talking with and keep the post.
---

Thank you.


---
Our team is waiting for the timeline for the porting.
---

Will the backporting be on hold until the timeline is presented?

Best Regards,
Masahiro Maeda 
---

Best Regards,
Moritoshi Oshiro


This event sent from IssueTracker by moshiro 
 issue 604473

Comment 41 Issue Tracker 2010-08-11 05:00:05 UTC

Event posted on 08-11-2010 02:00pm JST by moshiro

Dear Bo-san,

We would like to make sure about the current status. Could you please
answer the questions below?

Our understanding is that Fujitsu and Redhat are waiting for you to update
regarding your last two comments:

comment 1:
---
Our PM team is talking with Fujitsu for the porting as well as testing.  I
believe our PM team will provide the schedule to Fujitsu and our team (dev
team) to finish all the requests for Fujitsu.
---
You are waiting for your LSI PM team, not Redhat's PM, right? 

comment 2:
---
I was told they are talking with Fujitsu people to schedule the back
porting to 5.4z, 5.5z, 4.8z and 4.9z.

I will find out who they are talking with and keep the post.

Our team is waiting for the timeline for the porting.
---
Will the backporting be on hold until the timeline is presented? 

It sounds like LSI people are directly discussing with Fujitsu. Could you
please update every time here in the bz ticket as well?

We are trying to make sure that we are all on the right track.

Best Regards,
Moritoshi Oshiro 


This event sent from IssueTracker by moshiro 
 issue 604473

Comment 42 bo yang 2010-08-11 13:21:50 UTC

The engineer from Fujitsu (he is onsite at LSI and works closely with the LSI MegaRAID group) came to me and asked how much testing we did for the port.  He would like to do the verification before we submit the patches.

I will find out what decision is made and give an update.

Bo Yang

Comment 43 Martin Wilck 2010-08-12 16:20:09 UTC

Bo,

I am confused. Please verify if the following statements are correct, and respond to the questions under 2.) below.

Fujitsu has 2 current problems with megaraid_sas. 

1. Recovery from the "fatal firmware error" condition on certain controllers, first reported in bug #563083. This problem is solved with the OCR patch set.

2. Panic if ioctl times out (the problem reported here). This one is *not* solved by OCR but by not exporting physical disks to the SCSI midlayer (upstream commit 147aab6aa22ce7775be944f8fb9932aa000dda61, as mentioned in the problem description). In comment #15, you suggested a different solution (replacing wait_event_timeout() by wait_event()), but from the analysis here and on bug #602714, that change wouldn't be necessary if the upstream solution was applied. Why do we need a different solution here? Or don't we?

[There is one more problem, panic if disk pulled in enclosure (bug #607930). But that one affects only kernels with enclosure support (i.e. RHEL6)]

Fujitsu would like to see both these critical bugs solved in the relevant RHEL releases.

Both problems are solved in the latest 4.31 driver. Both will be solved in 5.6, too (bug #564249). A backport to 5.3.z which solves both is also available (bug #602714). Backports to 5.4.z, 5.5.z, and 4.9 are currently being discussed here.

Comment 48 Tomas Henzl 2010-08-20 13:51:16 UTC

Bo,
please can you update the status here, when do you expect you can post the patch, and answer the comment#43

Comment 49 bo yang 2010-08-20 13:58:45 UTC

Does #607930 also seeing in rhel4x and rhel5x?

To port the changes to rhel5.5z, Tomas said rhel4.x do have the high priority?  If this is the case, I would like to do rhel4x first.  Please confirm.

Comment 51 Martin Wilck 2010-08-20 14:16:46 UTC

(In reply to comment #49)
> Does #607930 also seeing in rhel4x and rhel5x?

I think not, because bug #607930 depends on SCSI enclosure support in the kernel, which RHEL5 and RHEL4 do not have. (Tomas, can you verify that?)

> To port the changes to rhel5.5z, Tomas said rhel4.x do have the high priority? 
> If this is the case, I would like to do rhel4x first.  Please confirm.

The following is the Fujitsu priority list consolidated between Fujitsu Japan (Tokunaga-san) and FTS (myself and Manfred Graeder).

1) 5.6 (done)
2) 5.3.z (done, official errata not yet released)
3) 4.9 
4) 5.5.z
5) 4.8.z

5.4.z is not on this list because only Red Hat requires it.

Martin

Comment 52 Tomas Henzl 2010-08-20 14:33:08 UTC

(In reply to comment #49)
> Does #607930 also seeing in rhel4x and rhel5x?
I think it is not needed.

Martin was faster with the response; his priority list matches ours, only the deadline for 4.8.z is sooner than for 5.5.z. 

Please look also on bz#563086, I hope both can be solved at once.

You can find the RHEL4 sources here -> http://people.redhat.com/vgoyal/rhel4/

Comment 53 Martin Wilck 2010-08-20 14:36:50 UTC

(In reply to comment #52)
> Please look also on bz#563086, I hope both can be solved at once.

They will, because Bo's fix for this bug includes the OCR fix which solves 563086.

Comment 54 Issue Tracker 2010-08-23 00:47:35 UTC

Event posted on 08-23-2010 09:47am JST by moshiro

Comment from FJ in terms of security 

----
Dear Oshiro-san,

---
Our engineer would like to find out about the issue in terms of security.
Can this be triggered by a unprivileged user, i.e. non-root, normal user,
etc?
---

It is necessary to issue ioctl to the SES chip of PRIMERGY SX35 to
cause this trouble, and the ioctl needs the special character device.

We think that there is no problem of security because this character
device cannot be accessed from the unprivileged user usually.

Best Regards,
Masahiro Maeda
----

Best Regards,
M oshiro


This event sent from IssueTracker by moshiro 
 issue 604473

Comment 55 Tomas Henzl 2010-08-26 13:20:45 UTC

(In reply to comment #49)
> If this is the case, I would like to do rhel4x first.  Please confirm.

Bo,
thanks for posting the fix for 5.5.z. 
Please don't forget that the priority list below is still valid
and this has still the highest priority.

> The following is the Fujitsu priority list consolidated between Fujitsu Japan
> (Tokunaga-san) and FTS (myself and Manfred Graeder).
> 1) 5.6 (done)
> 2) 5.3.z (done, official errata not yet released)
> 3) 4.9 
> 4) 5.5.z
> 5) 4.8.z

Comment 56 RHEL Program Management 2010-08-27 09:49:09 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 57 bo yang 2010-08-27 13:41:27 UTC

Tomas,

I was told differently by Fujitsu.  They told me:

 1) 5.6 (done)
 2) 5.3.z (done, official errata not yet released)
 3) 5.5.z
 4) 4.8/9.z

This is why I submitted 5.5.z first on Wednesday.  I am doing the rhel4.8/9.  It should be done by Monday.

Regards,

Bo Yang

Comment 59 bo yang 2010-08-31 04:59:48 UTC

Created attachment 442089 [details]
add OCR support to megaraid sas driver in rhel4.8.z

Tomas,

Please find attached patch for rhel4.8.z.  This patch should apply to rhel4.9 also.

Please let me know if you have any question.  Also please let me know if there are changes needed.

Comment 60 Tomas Henzl 2010-09-02 09:16:04 UTC

(In reply to comment #59)
> Please let me know if you have any question.  Also please let me know if there
> are changes needed.

Bo, 
the questions I have here are almost the same as in bz#619365, 
please could you post an answer?

Comment 62 Vivek Goyal 2010-09-13 20:22:00 UTC

Committed in 89.34.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Comment 63 Moritoshi Oshiro 2010-09-27 07:17:14 UTC

Got email from Fujitsu: 
---
Dear Oshiro-san,

    We've verified the test package. It fixed this bug correctly.

Best Regards,
TARUISI Hiroaki
---

Comment 64 Tomas Henzl 2010-09-29 13:31:39 UTC

*** Bug 563086 has been marked as a duplicate of this bug. ***

Comment 70 Gris Ge 2011-01-11 05:20:31 UTC

Failed to reproduce this issue.

Fujitsu verified the fix, as confirmed in comment #63.

Code reviewed, the patch linux-2.6.9-megaraid_sas-fix-physical-disk-handling-and-manageme.patch was applied into kernel-2.6.9-95.EL

Comment 71 Douglas Silas 2011-01-31 00:11:43 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
A bug was found in the way the megaraid_sas driver handled physical disks and management IOCTLs. All physical disks were exported to the disk layer, allowing an oops in megasas_complete_cmd_dpc() when completing the IOCTL command if a timeout occurred.

Comment 72 Martin Wilck 2011-02-01 12:26:22 UTC

(In reply to comment #70)

> Code reviewed, the patch
> linux-2.6.9-megaraid_sas-fix-physical-disk-handling-and-manageme.patch was
> applied into kernel-2.6.9-95.EL

it is already in kernel-2.6.9-92.EL (RHEL4.9), too, AFAICS.

Comment 73 errata-xmlrpc 2011-02-16 15:24:53 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html
