Description of problem:
The mptscsi driver in linux version 2.6.9-34 converts a SCSI return status of target busy into a combined host status of host busy and SCSI (target) status of busy before returning the status to the Linux I/O stack.

Version-Release number of selected component (if applicable):
The code change to the mptscsi driver that appends the DID_BUS_BUSY host status to the SCSI status was made between linux versions 2.6.9-22 and 2.6.9-34.

How reproducible:
100% reproducible

Steps to Reproduce:
1. Create a condition where the LSI hardware returns a SCSI status of MPI_SCSI_STATUS_BUSY to the driver. This is easy to do running RHEL 4 AS U3 as a virtual machine with the VMware ESX product.
2. The mptscsi driver converts the SCSI status of MPI_SCSI_STATUS_BUSY into a host status of DID_BUS_BUSY combined with the SCSI status of MPI_SCSI_STATUS_BUSY.
3. When the SCSI I/O command is returned to the Linux SCSI midlayer with a host status of DID_BUS_BUSY, the command is retried five times; if the target continues to return a status of BUSY, the I/O command is then marked as failed. If the SCSI I/O command originated within the ext3 file system, the I/O failure causes the ext3 file system to be marked read-only.

This behavior differs from the mptscsi driver in linux version 2.6.9-22. In the *22 version, the mptscsi driver did not append the host status of DID_BUS_BUSY to a SCSI command failure of MPI_SCSI_STATUS_BUSY. Without the DID_BUS_BUSY host status, the Linux SCSI midlayer retries the failed I/O command an unlimited number of times. In the case of the ext3 file system, a brief period of busy status from the SCSI target would not cause the file system to be marked read-only.

Actual results:
When an LSI SCSI device returns a status of target busy for a very brief period of time, I/Os fail after a few retries.

Expected results:
When an LSI SCSI device returns a status of target busy for a very brief period of time, I/Os are retried.
Additional info:
The problematic change to the mptscsih.c file is in the mptscsih_io_done() routine:

	case MPI_IOCSTATUS_SUCCESS:			/* 0x0000 */
		if (scsi_status == MPI_SCSI_STATUS_BUSY)
			sc->result = (DID_BUS_BUSY << 16) | scsi_status;
		else
			sc->result = (DID_OK << 16) | scsi_status;

This modification was clearly made for a reason, but it may be out of compliance with the LSI device specification. It is unexpected that a SCSI status of BUSY would be treated the same as a host status of busy.
Created attachment 133859 [details] The hunk in question This shows the difference between the RHEL-4-U2 and RHEL-4-U3 versions that is in question here. It certainly looks like the change was intentional. Chip
From coldwell Wed Aug 9 14:20:52 2006 Date: Wed, 9 Aug 2006 14:20:57 -0400 (EDT) From: Chip Coldwell <coldwell> To: Eric.Moore Subject: DID_BUS_BUSY semantics Hi Eric, I'm a programmer in the kernel storage group at Red Hat, looking into an issue that was raised by another one of our partners concerning a change that was made between versions 3.02.18 (RHEL-4 update 2) and 3.02.62.01rh (RHEL-4 update 3) of the mptscsi driver. The hunk in question looks like this: --- RHEL-4-U2/kernel-2.6.9/linux-2.6.9/drivers/message/fusion/mptscsih.c 2006-08-09 13:34:48.000000000 -0400 +++ RHEL-4-U3/kernel-2.6.9/linux-2.6.9/drivers/message/fusion/mptscsi.c 2006-08-09 13:32:48.000000000 -0400 @@ -800,10 +769,14 @@ mptscsih_io_done(MPT_ADAPTER *ioc, MPT_F break; + case MPI_IOCSTATUS_SCSI_DATA_OVERRUN: /* 0x0044 */ + sc->resid=0; case MPI_IOCSTATUS_SCSI_RECOVERED_ERROR: /* 0x0040 */ case MPI_IOCSTATUS_SUCCESS: /* 0x0000 */ - scsi_status = pScsiReply->SCSIStatus; - sc->result = (DID_OK << 16) | scsi_status; + if (scsi_status == MPI_SCSI_STATUS_BUSY) + sc->result = (DID_BUS_BUSY << 16) | scsi_status; + else + sc->result = (DID_OK << 16) | scsi_status; if (scsi_state == 0) { ; } else if (scsi_state & MPI_SCSI_STATE_AUTOSENSE_VALID) { The question here is whether it is correct to return DID_BUS_BUSY instead of DID_OK when the target reports BUSY. In the old version, the driver returned (DID_OK << 16) | MPI_SCSI_STATUS_BUSY; in the new version it returns (DID_BUS_BUSY << 16) | MPI_SCSI_STATUS_BUSY. Can you shed any light? Thanks, Chip -- Charles M. "Chip" Coldwell Senior Software Engineer Red Hat, Inc 978-392-2426
From Eric.Moore Tue Sep 5 14:36:16 2006 Date: Tue, 5 Sep 2006 12:36:04 -0600 From: "Moore, Eric" <Eric.Moore> To: Chip Coldwell <coldwell> Subject: RE: DID_BUS_BUSY semantics On Wednesday, August 09, 2006 12:21 PM, Chip Coldwell wrote: > Hi Eric, > > I'm a programmer in the kernel storage group at Red Hat, looking into > an issue that was raised by another one of our partners concerning a > change that was made between versions 3.02.18 (RHEL-4 update 2) and > 3.02.62.01rh (RHEL-4 update 3) of the mptscsi driver. The hunk in > question looks like this: > > --- > RHEL-4-U2/kernel-2.6.9/linux-2.6.9/drivers/message/fusion/mpt > scsih.c 2006-08-09 13:34:48.000000000 -0400 > +++ > RHEL-4-U3/kernel-2.6.9/linux-2.6.9/drivers/message/fusion/mp > tscsi.c 2006-08-09 13:32:48.000000000 -0400 > @@ -800,10 +769,14 @@ mptscsih_io_done(MPT_ADAPTER *ioc, MPT_F > > break; > > + case MPI_IOCSTATUS_SCSI_DATA_OVERRUN: > /* 0x0044 */ > + sc->resid=0; > case MPI_IOCSTATUS_SCSI_RECOVERED_ERROR: > /* 0x0040 */ > case MPI_IOCSTATUS_SUCCESS: > /* 0x0000 */ > - scsi_status = pScsiReply->SCSIStatus; > - sc->result = (DID_OK << 16) | scsi_status; > + if (scsi_status == MPI_SCSI_STATUS_BUSY) > + sc->result = (DID_BUS_BUSY << > 16) | scsi_status; > + else > + sc->result = (DID_OK << 16) | > scsi_status; > if (scsi_state == 0) { > ; > } else if (scsi_state & > MPI_SCSI_STATE_AUTOSENSE_VALID) { > > > The question here is whether it is correct to return DID_BUS_BUSY > instead of DID_OK when the target reports BUSY. In the old version, > the driver returned (DID_OK << 16) | MPI_SCSI_STATUS_BUSY; in the new > version it returns (DID_BUS_BUSY << 16) | MPI_SCSI_STATUS_BUSY. Can > you shed any light? > > Thanks, > Sorry for the delay in responding. Here is snip of release notes relevant to this change. It occurred in the 3.02.18 release, and was requested by Engenio, for fixing a problem with RDAC/MPP failover driver. 
* When a target device responds with BUSY status, the MPT driver was sending DID_OK to the SCSI mid layer, which caused the IO to be retried indefinitely between the mid layer and the driver. By changing the driver return status to DID_BUS_BUSY, the target BUSY status can now flow through the mid layer to an upper layer Failover driver, which will manage the I/O timeout.
(In reply to comment #4) > From Eric.Moore Tue Sep 5 14:36:16 2006 > > Here is snip of release notes relevant to this change. It occurred in > the 3.02.18 release, > and was requested by Engenio, for fixing a problem with RDAC/MPP > failover driver. > > * When a target device responds with BUSY status, the MPT driver > was sending DID_OK to the SCSI mid layer, which caused the IO to be > retried indefinitely between the mid layer and the driver. By changing > the driver return status to DID_BUS_BUSY, the target BUSY status can now > flow through the mid layer to an upper layer Failover driver, which will > manage the I/O timeout. It seems that LSI Logic agrees with VMWare: this change prevents the mid-layer and driver from retrying commands indefinitely. Unfortunately, VMWare wants that, and Engenio does not. Chip
(In reply to comment #5) > (In reply to comment #4) > > From Eric.Moore Tue Sep 5 14:36:16 2006 > > > > Here is snip of release notes relevant to this change. It occurred in > > the 3.02.18 release, > > and was requested by Engenio, for fixing a problem with RDAC/MPP > > failover driver. > > > > * When a target device responds with BUSY status, the MPT driver > > was sending DID_OK to the SCSI mid layer, which caused the IO to be > > retried indefinitely between the mid layer and the driver. By changing > > the driver return status to DID_BUS_BUSY, the target BUSY status can now > > flow through the mid layer to an upper layer Failover driver, which will > > manage the I/O timeout. > > It seems that LSI Logic agrees with VMWare: this change prevents the > mid-layer and driver from retrying commands indefinitely. Unfortunately, > VMWare wants that, and Engenio does not. Under the circumstances, we feel it would be best if VMware and LSI Logic find a mutually acceptable solution to this problem, and Red Hat will adopt whatever goes in the upstream kernel. Chip > > Chip >
Why not make DID_BUS_BUSY (or better yet, some new host status value like DID_NOMPIO_RETRY) retry commands indefinitely IFF REQ_FAILFAST is not set in the I/O request's cmd_flags? These retries would not be impacted by the maximum command retry count. This host status would also differ from DID_IMM_RETRY in that no retries would be done by scsi-ml if REQ_FAILFAST was set. Using a new host status would also require changing mptscsih.c to use the new host status instead of DID_BUS_BUSY. The approach also requires that I/O from a VMware ESX Server guest OS not have REQ_FAILFAST set, while I/O managed by the RDAC/MPP driver would have REQ_FAILFAST set. The only Linux I/O which has REQ_FAILFAST set is MPIO and readahead requests. Since there is no logical reason for a VMware ESX Server guest to run multi-pathing services (since MPIO is provided by ESX Server itself), this restriction should be fine for them. This approach may require changing the RDAC/MPP driver to set REQ_FAILFAST for its I/O, similar to the dm-multipath driver dm-mpath.c, if it doesn't do so already.
(In reply to comment #7) > Why not make DID_BUS_BUSY (or better yet, some new host status value like > DID_NOMPIO_RETRY) command retries indefinite IFF REQ_FAILFAST is not set in the > io request's cmd_flags? These retries would not be impacted by the maximum > command retry count. This host status would also differ from DID_IMM_RETRY in > that no retries would be done by scsi-ml if REQ_FAILFAST was set. Using a new > host status would also require changing mptscsih.c to use the new host status > instead of DID_BUS_BUSY. > > The approach also requires that io from a VMware ESX Server guest OS would not > have REQ_FAILFAST set while io managed by the RDAC/MPP driver would have > REQ_FAILFAST set. The only Linux io which has REQ_FAILFAST set is MPIO and > readahead requests. Since there is no logical reason for a VMware ESX Server > guest to run multi-pathing services (since MPIO is provided in by ESX Server > itself), this restriction should be fine for them. This approach may require > changing the RDAC/MPP driver to set REQ_FAILFAST for their io, similar to the > dm-multipath driver dm-mpath.c, if they don't do so already. I agree with Ed's comments. It should really be up to the LSI RDAC/MPP driver to set the failfast bit here. Last I looked it was doing some weird hook into the scsi mid layer, but it should still be able to set the failfast bit. I do not think we can change the behavior of DID_BUS_BUSY at this point, so the new host byte value sounds better; or it might be better to take the case to linux-scsi and see if we can change the behavior of DID_IMM_RETRY. I think we can change the behavior of DID_IMM_RETRY because it is used by ipr, which is not going to be used with something like multipath or raid, so the failfast bit does not really come into play in such a way that it changes what ipr wanted to use it for.
And for drivers like qla2xxx and drivers using the fc transport class helpers, I think it would be beneficial if, during periods where the rport is transitioning to blocked or changing state, requests with failfast set were returned to some layer like dm-multipath which can retry the request on another path.

One question I had was: if failfast is set and the status byte returns busy, does it make sense to just check the failfast bit there? If it is set, return the command to the upper layer. If not, continue to retry the command in the SCSI layer. So here in scsi_error.c:

	switch (status_byte(scmd->result)) {
	case QUEUE_FULL:
		/*
		 * the case of trying to send too many commands to a
		 * tagged queueing device.
		 */
	case BUSY:
		/*
		 * device can't talk to us at the moment.  Should only
		 * occur (SAM-3) when the task queue is empty, so will cause
		 * the empty queue handling to trigger a stall in the
		 * device.
		 */
		if (blk_noretry_request(scmd->request))
			return SUCCESS; /* send to upper layers */
		else
			return ADD_TO_MLQUEUE;

This is basically what we are doing with the new host byte (or with changing the behavior of DID_IMM_RETRY) if we use it when the status returns busy, but we do not have to change every driver.
That last bit is not true. Some drivers return DID_BUS_BUSY when they get busy status and others return DID_OK | status, so we have a mixed bag :) Some drivers would need to be converted and others would not, but for once we could provide a consistent behavior :) I think I have a bugzilla from NEC mentioning that they do not like how we retry forever on busy status.
Question for Eric @ LSI: Do you know if there was a patch submitted/approved upstream for this issue? Although this bug was closed notabug, a fix is being requested from a mutual customer. Question for Chip/Mike: Why was this closed as notabug?
Ed Goggin just pinged me this morning, and I've sent an email over to Engenio wrt using the REQ_FAILFAST flag. Mike Christie - have you seen this thread on lsml? I'm not clear where James Bottomley stands on this. My take is he wants the driver merely to return the SAM status as is, however that is not going to work for mpp rdac, as you know. Here's the thread in case you've not seen it: http://marc.theaimsgroup.com/?l=linux-scsi&w=2&r=1&s=vmware+bug+fix+prevent+inifinite+retries&q=b
Keeping this open until VMware and LSI Logic find a mutually acceptable solution to this problem, and Red Hat will then adopt whatever goes in the upstream kernel. This is looking more like a RHEL 4.6 issue given the discussions on linux-scsi.
I'm awaiting feedback from Engenio.
Ed (VMWare) - Any progress between you and LSI on this matter? Any later and this will have to be a RHEL 4.6 issue.
Andrius, Engenio and LSI Logic both seem OK with the general idea of the approach and its implementation for the RHEL 4 Update 5 or 6 release. They are concerned about upstream code since the RDAC/MPP code uses scsi_execute_async in upstream kernels instead of using scsi_do_req. Unfortunately, scsi_execute_async currently provides no way to set the cmd_flags (and thereby the REQ_FAILFAST bit) for an IO request. I've been assuming that Red Hat will require a solution for the problem submitted upstream before the solution is fixed in RHEL 4. Is that the case, or can we simply work out a simpler patch now for RHEL 4 and one later for upstream? Please advise. Ed
Andrius - Ed Goggin is going to post a patch in this bugzilla to address his and Engenio's MPP driver concerns for RHEL4.5. We are not going to post upstream for this particular fix today; as you know, I'm trying to get patches together mainstream with regard to RHEL5.1 stuff which is due on Monday. We can handle upstreams later next week, OK?
Created attachment 146697 [details] Patch made against scsi-misc to (1) create the new scsi host status DID_COND_REQUEUE and (2) return this status when the mptscsih.c iodone handler detects a scsi status of SAM_STAT_BUSY. Created an attachment containing a patch for this problem. The patch is built against the scsi-misc git tree. This patch provides the framework for solving the problem described in this bugzilla by (1) introducing a new host status called DID_COND_REQUEUE, (2) adding code to scsi_decide_disposition to service a DID_COND_REQUEUE host status by indefinitely requeuing the scsi cmnd IFF REQ_FAILFAST is not set on the scsi cmnd's i/o request, otherwise returning SCSI_SUCCESS, and (3) changing the mptscsih_io_done handler to return DID_COND_REQUEUE instead of the DID_BUS_BUSY host status in the event of a SAM_STAT_BUSY scsi status. VMware guests, having no reason to run multi-pathing software in the guest OS, will therefore requeue the i/o request indefinitely in this use case. RDAC/MPP multi-pathing software will be changed to set the REQ_FAILFAST bit of the i/o requests it manages so that these i/o requests will not be requeued at all in this use case.
Andrius, I've not yet tested the patch attached with comment #20. While it appears simple enough, nothing is beyond possible regression. I was hoping that you would be OK with us testing the patch in parallel with its submission. I also expect that both LSI Logic and Engenio have a vested interest in regression testing this patch.
Ed, I will have to defer to Heath, who is a RH customer TAM that originally experienced this problem. Maybe he or his customer can help test... Heath? I'd also like Chip/Tom to comment with code review if possible...
(In reply to comment #23) > Ed, I will have to defer to Heath, who is a RH customer TAM that originally > experienced this problem. Maybe he or his customer can help test... Heath? > > I'd also like Chip/Tom to comment with code review if possible... I'm looking at the patch. It seems that there are three cases that it is important to test -- MPIO, no-MPIO, and VMWare guest. Are we getting coverage for all of them? Chip
Chip - this BZ is only for RHEL 4, but I assume this patch would also have to be proposed/backported for 5.1 as well, so yes, that may be a good idea to test.
Chip, Not sure why you think there are 3 test cases -- I think the last two overlap since a VMware guest gets no value from running multipathing -- since the underlying ESX micro-kernel provides multi-pathing and only exports a single path to each logical unit. If a VMware guest runs multi-pathing, this patch will actually make it worse for the guest since the scsi cmnd will not be retransmitted at all whereas it is now being retransmitted up to the scmd->allowed value.
(In reply to comment #27) > > Not sure why you think there are 3 test cases -- I think the last two overlap > since a VMware guest gets no value from running multipathing -- since the > underlying ESX micro-kernel provides multi-pathing and only exports a single > path to each logical unit. Right, but I guess I don't want to close the bug until we confirm that the VMware problem is fixed. Otherwise, we're just making an arbitrary change for no reason. Chip
Chip, VMware will test the VMware guest use case you mention above in comment #25. I think this also covers the non-MPIO use case. Could Engenio confirm that they will be testing the (RDAC/MPP) MPIO use case? Ed
(In reply to comment #30) > Chip, > > VMware will test the VMware guest use case you mention above in comment #25. I > think this also covers the non-MPIO use case. > > Could Engenio confirm that they will be testing the (RDAC/MPP) MPIO use case? Actually, I think Engenio are an LSI customer; it might be up to LSI to test the MPIO case. Chip
Chip, Also, will Red Hat be providing patched binary versions of the MPT driver for both RHEL Updates 3 and 4? If so, will these be available on RHN? Thanks, Ed
(In reply to comment #32) > Chip, > > Also, will Red Hat be providing patched binary versions of the MPT driver for > both RHEL Updates 3 and 4? If so, will these be available on RHN? If the testing goes well, the binary MPT driver that comes with RHEL4.5 (4.6?) will include the patch. I don't think RHEL3 has this problem, right? It's been some time since we touched the fusion driver in RHEL3. Chip
I'm referring to fixes for RHEL4 customers running (and wanting to stay on) Update 3 or 4 (not wanting to upgrade to update 5 or 6). Not even considering RHEL3. Is it looking like this will slide to RHEL4 U6? Then having a more readily available solution for these RHEL4 Update 3/4/5 customers will be even more important.
(In reply to comment #34) > I'm referring to fixes for RHEL4 customers running (and wanting to stay on) > Update 3 or 4 (not wanting to upgrade to update 5 or 6). Not even considering > RHEL3. I see. That is a question above my pay grade -- Tom? > Is it looking like this will slide to RHEL4 U6? I don't know. There are two issues with this patch: it isn't tested and it isn't in the upstream kernel. Since it touches the SCSI midlayer, that second point is going to stir up some controversy in our internal engineering review process. Chip
Is something missing from this hunk?

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 2ecb6ff..59eb561 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1216,6 +1216,14 @@ int scsi_decide_disposition(struct scsi_
 	case DID_IMM_RETRY:
 		return NEEDS_RETRY;
 
+	case DID_COND_REQUEUE:
+		/*
+		 * Return immediately w/o requeue if the request
+		 * indicates no retry.
+		 */
+		if (blk_noretry_request(scmd->request)) {
+			return SUCCESS;
+		}
 	case DID_REQUEUE:
 		return ADD_TO_MLQUEUE;

What if the host byte is DID_COND_REQUEUE and the REQ_FAILFAST bit is cleared? Shouldn't we return NEEDS_RETRY? I.e.

	if (blk_noretry_request(scmd->request))
		return SUCCESS;
	else
		return NEEDS_RETRY;

Chip
Isn't that case covered by falling into the return ADD_TO_MLQUEUE for the DID_REQUEUE host status? If that coding is too subtle or dependent on the ordering of the particular host status handlers, we can simply change it to

	if (blk_noretry_request(scmd->request))
		return SUCCESS;
	else
		return ADD_TO_MLQUEUE;
This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.
With regard to Comment #31: Engenio is a division within LSI. They are a separate division from the one I'm in, which currently is called SCG (Storage Components Group). And SCG will be renamed soon as a result of the merger with Agere. At one time they considered spinning off Engenio from LSI, but that never happened after we got a new CEO a couple years ago. This past year, the megaraid group was merged into Engenio. I believe they are totally dropping the Engenio name, and that division is part of LSI.

With regard to Comment #35: I asked Ed to hold off a day or two, b'cause I posted a major mpt fusion driver release on lsml, which occurred this morning. I think in a couple days, we can post that.

With regard to Comment #36: I see no need to change the patch, b'cause doesn't NEEDS_RETRY translate to ADD_TO_MLQUEUE?
They both add the command's request back to the device queue via scsi_queue_insert, but ADD_TO_MLQUEUE also sets the device_busy count (defaults to 3) in order to stall the device for a bit. I thought returning ADD_TO_MLQUEUE instead of NEEDS_RETRY might make it easier to get the patch accepted upstream, since I thought that James Bottomley was in favor of treating this simply as SAM_STAT_BUSY -- and the SAM_STAT_BUSY handler returns ADD_TO_MLQUEUE.
(In reply to comment #34) > I'm referring to fixes for RHEL4 customers running (and wanting to stay on) > Update 3 or 4 (not wanting to upgrade to update 5 or 6). Not even considering > RHEL3. We generally do not release patches for older versions, once they have been superseded. Customers may be able to work out an exception to this rule by working with their Red Hat support representative.
(In reply to comment #37) > Isn't that case covered by falling into the return ADD_TO_MLQUEUE for the > DID_REQUEUE host status? If that coding is too subtle or dependent on ordering > of the particular host status handlers, we can simply change to > > if (blk_noretry_request(scmd->request)) > return SUCCESS; > else > return ADD_TO_MLQUEUE; The issue is that the fall-through falls through to a different case in RHEL-4. The next case below DID_IMM_RETRY in RHEL-4 is DID_ERROR, not DID_REQUEUE. I'll touch up the patch so it applies to RHEL-4. Chip
Created attachment 146944 [details] touchup of the previous patch to apply to the RHEL-4 kernel.
OK. Updated patch looks good. Thanks!
committed in stream U5 build 45. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
This whole approach (see patch from comment #20) has been rejected upstream, because since 2.6.14 infinite retry for even BUSY and QUEUE_FULL SCSI status errors is prevented by retry-limiting code in scsi_softirq_done (or scsi_softirq). As a result, a reasonable course of action for the RDAC/MPP driver case is for the fusion MPT driver to simply not interpret the SCSI status when the ioc status is success (as was the case before the patch to return DID_BUS_BUSY was applied), and for the RDAC/MPP driver to simply wait out the 180 second retry limit before getting hold of the failed I/O. Unfortunately, the effect on this use case for a VMware Linux guest OS is an I/O failure after 180 seconds. At the moment it is not clear what to do with the RHEL 4 Update 5 code.
Is Red Hat open to porting the SCSI command retry limiting code in scsi_softirq_done from upstream kernels (2.6.14 at earliest) and changing the mpt fusion driver to no longer interpret the SCSI status in cases where the ioc status is success in the RHEL4 U5 (or U6) release? I suspect that porting the retry limiting code will require a change to the SCSI command structure and thereby require changing the RHEL4 kABI.
(In reply to comment #48) > Is Red Hat open to porting the SCSI command retry limiting code in > scsi_softirq_done from upstream kernels (2.6.14 at earliest) This has kABI implications since it would change the size of the scsi_cmnd struct (have to add a jiffies_at_alloc field to it). I don't think it's an option for 4.5 at this point. Chip
We've decided to carry the patch attached to this bugzilla instead of trying to backport the upstream code. This seems like the safest way to fix the problem. Chip
(In reply to comment #30) > Chip, > > VMware will test the VMware guest use case you mention above in comment #25. I > think this also covers the non-MPIO use case. > > Could Engenio confirm that they will be testing the (RDAC/MPP) MPIO use case? We are depending on Engenio (or EMC) to test the patch to this driver in RHEL 4. This must include the problematic case in vmware, as well as regression testing with and without vmware. Please post a summary of the testing that has been done so far, as well as what you plan to do during Beta.
http://marc.theaimsgroup.com/?l=linux-scsi&m=117045392500159&w=2
RH and LSI will be leaning on VMWare to help test this, as we don't have the right QA setups... Expect this to be added to a post-Beta snapshot of RHEL 4.5. VMWare - can you lead the testing on this?
VMware can commit to testing the original "problematic case in vmware" (see comment #55) using the test kernel from comment #46 in order to verify that the problem is fixed. Is this sufficient for VMware's portion of the testing?
(In reply to comment #58) > VMware can commit to testing the original "problematic case in vmware" (see > comment #55) using the test kernel from comment #46 in order to verify that the > problem is fixed. Is this sufficient for VMware's portion of the testing? Yes, thanks. Chip
Needinfo from VMWARE: has this patch been tested with the RHEL 4.5 Beta?
The RHEL 4.5 Beta test kernel is actively being tested by VMware SQA. We hope to have feedback by the end of this week. Do you need the feedback sooner?
Hi Ed, the sooner the better, we are already spinning post-Beta kernels and the feedback period ends 13-Mar-07 for partners.
Andrius, VMware QA has acked this patch in the RHEL 4.5 Beta test kernel. Thanks, Ed
Patch is in the -52.EL kernel
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0304.html
I installed a new kernel (2.6.9-55.ELsmp) and when doing a restore from TSM we still get the file system changed to Read-Only. This is in a VM (ESX 3.0.1).
Robert - we relied exclusively on VMWare to test the patch prior to inclusion into RHEL 4.5. Ed - can you follow up with Robert? If this is still an issue we'll need to create a new bugzilla that continues where this bug left off.
Andrius, sorry for the delay ... I've been out for 3 days due to a family emergency. Robert, could you describe your host/SAN configuration in detail. I'm particularly interested in type of storage array, whether or not there is physical path redundancy, whether or not the problem is repeatable, how quickly the problem occurs.
How about forward-porting this fix to RHEL5? The latest RHEL5 kernel exhibits the same problem.
(In reply to comment #72) > How about forward-porting this fix to RHEL5? The latest RHEL5 kernel exhibits > the same problem. Bug 228108 has already been opened to address this issue.
According to the other bugzilla (228108), the fix from Ed Goggin has already been applied to the RHEL5 branch (i.e., don't return DID_BUS_BUSY for SAM_STAT_BUSY).