Description of problem:
(See Red Hat bugzilla #197158 for the original version of this bug for RHEL4. I cloned Tom Phelan's bug entry for this bugzilla.)

The mptscsi driver in linux version 2.6.9-34 converts a SCSI return status of busy into a combined status of host busy and SCSI (target) busy before returning the status to the Linux I/O stack.

Version-Release number of selected component (if applicable):
The change to the mptscsi driver that appends the DID_BUS_BUSY status to the SCSI status was made between linux versions 2.6.9-22 and 2.6.9-34.

How reproducible:
100% reproducible

Steps to Reproduce:
1. Create a condition where the LSI hardware returns a SCSI status of MPI_SCSI_STATUS_BUSY to the driver. This is easy to do running RHEL 4 AS U3 as a virtual machine under the VMware ESX product.
2. The mptscsi driver converts the SCSI status of MPI_SCSI_STATUS_BUSY into a host status of DID_BUS_BUSY and a SCSI status of MPI_SCSI_STATUS_BUSY.
3. When the SCSI I/O command is returned to the Linux SCSI midlayer with a host status of DID_BUS_BUSY, the command will be retried five times; if the target continues to return a status of BUSY, the I/O command will then be marked as failed. If the SCSI I/O command originated within the ext3 file system, the I/O failure will cause the ext3 file system to be marked read-only.

This behavior differs from the mptscsi driver in linux version 2.6.9-22. In the *-22 version, the mptscsi driver did not append the host status of DID_BUS_BUSY to a SCSI command failure of MPI_SCSI_STATUS_BUSY. Without the host status of DID_BUS_BUSY, the Linux SCSI midlayer will retry the failed I/O command an unlimited number of times. In the case of the ext3 file system, a brief period of busy status from the SCSI target would not cause the file system to be marked read-only.

Actual results:
When an LSI SCSI device returns a status of target busy for a very brief period of time, I/Os will fail after a few retries.
Expected results:
When an LSI SCSI device returns a status of target busy for a very brief period of time, I/Os will be retried.

Additional info:
The problematic change to the mptscsi.c file is in the mptscsih_io_done() routine:

    case MPI_IOCSTATUS_SUCCESS:                 /* 0x0000 */
        if (scsi_status == MPI_SCSI_STATUS_BUSY)
            sc->result = (DID_BUS_BUSY << 16) | scsi_status;
        else
            sc->result = (DID_OK << 16) | scsi_status;

This modification was clearly made for a reason, but it may be out of compliance with the LSI device specification. It is unexpected that a SCSI status of BUSY would be treated the same as a host status of busy.
A reasonable course of action may be to simply remove the interpretation of SCSI status when IOC status is success -- do not return DID_BUS_BUSY in this case. Doing so will limit SCSI busy retries to 180 seconds, which may be a reasonable compromise solution for the needs of both Engenio's RDAC/MPP multipathing driver and VMware's ESX Server. Please advise.
Can this change be included in RHEL5? It is a simple change to return a host status of DID_OK instead of DID_BUS_BUSY when the MPT fusion driver's IOC status is success but the scsi status is SAM_STAT_BUSY.
Ed, can you please attach the patch so that this can be proposed for RHEL 5.1?
Created attachment 149407 [details] no-DID_BUS_BUSY Change the mpt driver (mptscsih_io_done in mptscsih.c) to not parse the SCSI status and to return DID_OK instead of DID_BUS_BUSY when the IOC status is SUCCESS.
Created attachment 149408 [details] no-DID_BUS_BUSY Sorry about the line wrap for the previous submission. Here it is again, hopefully without the line wrap.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 155002 [details] Patch we made against kernel-2.6.18-8.1.4.el5 from kernel-2.6.9-55.EL Attached is a patch we forward-ported from kernel-2.6.9-55.EL, which is the same patch provided in bug 197158.
Ed @ VMWare - can you review the patch in Comment #7? Red Hat will need to know which patch to use, since you will be our main source for QA. Thanks!
Andrius, use the patch from comment #5. Since 2.6.14, code submitted by Christoph Hellwig enables the upstream kernel to limit retransmissions of the same SCSI command to 180 seconds' worth of retries. This is true for all cases without exception, even for a SCSI status of SAM_STAT_BUSY or QUEUE_FULL, both of which previously caused indefinite retries unless a driver status of DID_BUS_BUSY was also returned to limit them. Because of this new limit, Engenio is OK with this case simply returning the SCSI status (SAM_STAT_BUSY) with a driver status of DID_OK. See http://marc.info/?l=linux-scsi&m=117432237504253&w=2 for basically this same patch, submitted by Eric Moore to the upstream kernel. This overall simpler solution could not be used for the RHEL4 (or SLES9) kernels because it would likely have introduced a binary incompatibility into the module KABI: the scsi_cmnd structure would have changed to gain a field storing the start tick count when the SCSI command was initially dispatched to queuecommand, in order to track the 180-second limit on a per-command basis.
Comment on attachment 155002 [details] Patch we made against kernel-2.6.18-8.1.4.el5 from kernel-2.6.9-55.EL Based on Comments from VMWare, obsoleting this patch in favor of the original patch created by VMWare/LSI.
I put some test kernels that should include this fix at http://people.redhat.com/coldwell/kernel/bugs/225177/ (yes, that's a different bug number). Could someone from VMWare please verify that the problem is fixed? Thanks, Chip
POSTed 04-Jun-2007
This request was evaluated by Red Hat Kernel Team for inclusion in a Red Hat Enterprise Linux maintenance release, and has moved to bugzilla status POST.
We need confirmation that the driver update mentioned in comment 11 fixes this bug.
in 2.6.18-27.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5
Chip, Tom: I'll get back to you by the end of this week about this. Thanks, Ed
Chip, Tom: VMware QA is verifying the test kernel from http://people.redhat.com/dzickus/el5. I hope to be able to confirm successful verification of the fix by mid (Wednesday hopefully) next week. Ed
Has anyone from VMware QA verified the -38.el5 from the URL specified in comment #17? I'm having the read-only filesystem issue on a RHEL5 guest on ESX 3.0.1 and just wondering if one of the kernels at that URL will potentially resolve the issue. If so, which release should I try? I'm assuming -38.el5, but wanted to verify. Thanks, Andy.
A fix for this issue should have been included in the packages contained in the RHEL5.1-Snapshot3 on partners.redhat.com. Requested action: Please verify that your issue is fixed as soon as possible to ensure that it is included in this update release. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following: 1) Change the *status* of this bug to VERIFIED. 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified) If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to FAILS_QA. More assistance: If you cannot access bugzilla, please reply with a message to Issue Tracker and I will change the status for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.
A fix for this issue should have been included in the packages contained in the RHEL5.1-Snapshot4 on partners.redhat.com. Requested action: Please verify that your issue is fixed *as soon as possible* to ensure that it is included in this update release. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following: 1) Change the *status* of this bug to VERIFIED. 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified) If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to FAILS_QA. If you cannot access bugzilla, please reply with a message to Issue Tracker and I will change the status for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.
A fix for this issue should have been included in the packages contained in the RHEL5.1-Snapshot6 on partners.redhat.com. Requested action: Please verify that your issue is fixed ASAP to confirm that it will be included in this update release. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following: 1) Change the *status* of this bug to VERIFIED. 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified) If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to FAILS_QA. If you cannot access bugzilla, please reply with a message to Issue Tracker and I will change the status for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.
A fix for this issue should have been included in the packages contained in the RHEL5.1-Snapshot7 on partners.redhat.com. Requested action: Please verify that your issue is fixed ASAP to confirm that it will be included in this update release. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following: 1) Change the *status* of this bug to VERIFIED. 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified) If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to FAILS_QA. If you cannot access bugzilla, please reply with a message to Issue Tracker and I will change the status for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.
A fix for this issue should be included in the packages contained in RHEL5.1-Snapshot8--available now on partners.redhat.com. IMPORTANT: This is the last opportunity to confirm that your issue is fixed in the RHEL5.1 update release. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following: 1) Change the *status* of this bug to VERIFIED. 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified) If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to FAILS_QA. If you cannot access bugzilla, please reply with a message to Issue Tracker and I will change the status for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.
(In reply to comment #25)
> A fix for this issue should be included in the packages contained in
> RHEL5.1-Snapshot8--available now on partners.redhat.com.
> IMPORTANT: This is the last opportunity to confirm that your issue is fixed in
> the RHEL5.1 update release.
> After you (Red Hat Partner) have verified that this issue has been addressed,
> please perform the following:
> 1) Change the *status* of this bug to VERIFIED.
> 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)
> If this issue is not fixed, please add a comment describing the most recent
> symptoms of the problem you are having and change the status of the bug to FAILS_QA.
> If you cannot access bugzilla, please reply with a message to Issue Tracker and
> I will change the status for you. If you need assistance accessing
> ftp://partners.redhat.com, please contact your Partner Manager.

VMware has verified that this solution addresses the problem in this bugzilla.

Thanks,
Ed Goggin
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0959.html
(In reply to comment #27)
> VMware has verified that this solution addresses the problem in this bugzilla.
>
> Thanks,
>
> Ed Goggin

Ed,

Have you seen any failures since the release of RHEL 5.1? I am working with a customer that is still seeing the same errors despite having upgraded their RHEL 5 ESX guest to RHEL 5.1. Are there any caveats that you know of that we need to take into consideration, in addition to updating their kernel? If you would like any info from the virtual machine in question, let me know. They are still not seeing this on RHEL 4 guests.
Chris, the one caveat I am aware of has nothing to do with virtualization. Christoph Hellwig put a change into the mid-layer at 2.6.14 that causes the Linux SCSI mid-layer to error out a SCSI IO after 180 seconds. This will affect RHEL5.x (including RHEL5.1) but not the RHEL4.x releases. Is the customer having the problem with a non-virtualized RHEL5.1 configuration? Also, what are the use case(s) associated with the problem? The one use case that I'm aware of which could possibly extend for 180 seconds involves flooding a target with so much IO (from other initiators or VMs or both) that it consistently returns device busy due to starvation. The other two use cases that I'm aware of shouldn't reasonably go on for 180 seconds.
Ed,

We don't have many details yet (or a large subset of machines that are behaving like this yet), but this is what we do know:

"""
We still have experienced only one such failure since the 5.1 upgrade (but have double- and triple-checked that the machine was at 5.1 at the time of the failure).

We are periodically (10 or so times a day across a 10-node cluster) seeing this message from ESX:

Nov 16 16:30:00 vi05 vmkernel: 11:03:26:05.074 cpu6:1032)SCSI: 3731: AsyncIO timeout (5000); aborting cmd w/ sn 417503, handle 6055/0x3d2027d0

I'm not sure if that message indicates what I think it means (a timeout of a path) but the subsequent messages tend to indicate something like that:

Nov 16 16:30:00 vi05 vmkernel: 11:03:26:05.074 cpu6:1032)LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x3d2027d0, originSN 417503 from vmhba0:0:0
Nov 16 16:30:00 vi05 vmkernel: 11:03:26:05.074 cpu6:1032)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK
Nov 16 16:30:00 vi05 vmkernel: 11:03:26:05.074 cpu6:1032)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

At the time of the ext3 failure and SCSI timeout of the VM, we did *NOT* see any such error in the ESX log. In fact, we saw nothing interesting at all in the ESX log.

We have not seen these messages associated with ext3 failures (although I haven't exhaustively tried to correlate them), but only about 15% of our VMs are RHEL5, so again, not enough data to indicate anything in particular. I have not seen this on hardware 5.1 hosts, but we only have 2, versus about a dozen VMs, and given the rate we're seeing problems that doesn't seem to mean much.
"""

I think at this point we're going to need more data to proceed, but I wanted to first check with you to see if we missed something obvious. If you have any other comments, I'd appreciate them.
I'm wondering if they're hitting a new set of issues that may be causing a similar error. --chris
Chris, Any update on this? I've heard some noise in the field also with configurations involving heavy IO load. Ed
Hi,

This bug is still happening on RHEL-5.3 installed on ESXi 3.5U4. Installing a fresh RHEL-5.3 fails every single time (I tried 10 times); ext3 snaps into read-only mode.

I've contacted some people from VMware, who mentioned this bug unfortunately is not yet fixed, and suggested I apply the unofficial patch found at

http://www.tuxyturvy.com/blog/index.php?/archives/31-VMware-ESX-and-ext3-journal-aborts.html

Since I really don't like applying external patches, can someone please reopen this bug and perhaps apply that patch to the official kernel?
(In reply to comment #35)
> Hi,
> This bug is still happening on RHEL-5.3 installed on ESXi 3.5U4. Installing a
> fresh rhel-5.3 fails every single time (I tried 10 times). ext3 snaps into
> read-only mode.
>
> I've contacted some guys from vmware, who mentioned this bug unfortunately is
> not yet fixed, and suggested I apply the unofficial patch found on
>
> http://www.tuxyturvy.com/blog/index.php?/archives/31-VMware-ESX-and-ext3-journal-aborts.html
>
> since I really dont like applying external patches, can someone please reopen
> this bug, and perhaps apply that patch to the official kernel ?

Ahmed, you had mentioned that this is happening on RHEL5 update 3. Did you resolve this issue? If so, can you share the solution?
I have an issue on an ESXi VM, guest OS Red Hat 5.5, and file systems are turning read-only. I don't see a solution for this mentioned anywhere.

uname -r
2.6.18-238.9.1.el5

Write protecting the kernel read-only data: 520k
lp0: console ready
vmmemctl: started kernel thread pid=2328
 [<ffffffff800a26db>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800a26db>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032afc>] kthread+0xfe/0x132
 [<ffffffff800a26db>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800329fe>] kthread+0x0/0x132
 [<ffffffff8000c5f5>] do_generic_mapping_read+0x347/0x359
 [<ffffffff8000d279>] file_read_actor+0x0/0x159
 [<ffffffff8000c753>] __generic_file_aio_read+0x14c/0x198
 [<ffffffff80016efc>] generic_file_aio_read+0x34/0x39
 [<ffffffff8000cfa2>] do_sync_read+0xc7/0x104
 [<ffffffff8000b78d>] vfs_read+0xcb/0x171
 [<ffffffff80011d34>] sys_read+0x45/0x6e
 [<ffffffff800a26db>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032afc>] kthread+0xfe/0x132
 [<ffffffff800a26db>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800329fe>] kthread+0x0/0x132
 [<ffffffff800a26db>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800a26db>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032afc>] kthread+0xfe/0x132
 [<ffffffff800a26db>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800329fe>] kthread+0x0/0x132
 [<ffffffff800a26db>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800a26db>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032afc>] kthread+0xfe/0x132
 [<ffffffff800a26db>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800329fe>] kthread+0x0/0x132
 [<ffffffff800a26db>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800a26db>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032afc>] kthread+0xfe/0x132
 [<ffffffff800a26db>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800329fe>] kthread+0x0/0x132
 [<ffffffff8000c5f5>] do_generic_mapping_read+0x347/0x359
 [<ffffffff8000d279>] file_read_actor+0x0/0x159
 [<ffffffff8000c753>] __generic_file_aio_read+0x14c/0x198
 [<ffffffff80016efc>] generic_file_aio_read+0x34/0x39
 [<ffffffff8000cfa2>] do_sync_read+0xc7/0x104
 [<ffffffff8000b78d>] vfs_read+0xcb/0x171
 [<ffffffff80011d34>] sys_read+0x45/0x6e
 [<ffffffff8000c5f5>] do_generic_mapping_read+0x347/0x359
 [<ffffffff8000d279>] file_read_actor+0x0/0x159
 [<ffffffff8000c753>] __generic_file_aio_read+0x14c/0x198
 [<ffffffff80016efc>] generic_file_aio_read+0x34/0x39
 [<ffffffff8000cfa2>] do_sync_read+0xc7/0x104
 [<ffffffff8000b78d>] vfs_read+0xcb/0x171
 [<ffffffff80011d34>] sys_read+0x45/0x6e
 [<ffffffff8000c5f5>] do_generic_mapping_read+0x347/0x359
 [<ffffffff8000d279>] file_read_actor+0x0/0x159
 [<ffffffff8000c753>] __generic_file_aio_read+0x14c/0x198
 [<ffffffff80016efc>] generic_file_aio_read+0x34/0x39
 [<ffffffff8000cfa2>] do_sync_read+0xc7/0x104
 [<ffffffff8000b78d>] vfs_read+0xcb/0x171
 [<ffffffff80011d34>] sys_read+0x45/0x6e
 [<ffffffff8000c5f5>] do_generic_mapping_read+0x347/0x359
 [<ffffffff8000d279>] file_read_actor+0x0/0x159
 [<ffffffff8000c753>] __generic_file_aio_read+0x14c/0x198
 [<ffffffff80016efc>] generic_file_aio_read+0x34/0x39
 [<ffffffff8000cfa2>] do_sync_read+0xc7/0x104
 [<ffffffff8000b78d>] vfs_read+0xcb/0x171
 [<ffffffff80011d34>] sys_read+0x45/0x6e
Remounting filesystem read-only
Remounting filesystem read-only
Remounting filesystem read-only
Remounting filesystem read-only
EXT3-fs error (device sda1): ext3_find_entry: reading directory #1639281 offset 0
command: Read(10): 28 00 00 76 af 90 00 00 20 00
EXT3-fs error (device sda1): ext3_find_entry: reading directory #3080195 offset 0
With RHEL 5.8 we have the same error. Kernel 2.6.18-308.el5

hdc: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
hdc: drive_cmd: error=0x04 { AbortedCommand }
ide: failed opcode was: 0xec
mtrr: type mismatch for d8000000,400000 old: uncachable new: write-combining
Process accounting paused
Process accounting paused
Process accounting paused
Process accounting paused
Process accounting paused
Process accounting paused
Process accounting paused
Process accounting paused
sd 0:0:0:0: timing out command, waited 1080s
sd 0:0:0:0: Unhandled error code
sd 0:0:0:0: SCSI error: return code = 0x06000008
Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
Buffer I/O error on device dm-2, logical block 864990
lost page write due to I/O error on dm-2
Buffer I/O error on device dm-2, logical block 864991
lost page write due to I/O error on dm-2
Buffer I/O error on device dm-2, logical block 864992
lost page write due to I/O error on dm-2
Buffer I/O error on device dm-2, logical block 864993
lost page write due to I/O error on dm-2
Buffer I/O error on device dm-2, logical block 864994
lost page write due to I/O error on dm-2
Buffer I/O error on device dm-2, logical block 864995
lost page write due to I/O error on dm-2
Buffer I/O error on device dm-2, logical block 864996
lost page write due to I/O error on dm-2
Buffer I/O error on device dm-2, logical block 864997
lost page write due to I/O error on dm-2
Buffer I/O error on device dm-2, logical block 864998
lost page write due to I/O error on dm-2
Buffer I/O error on device dm-2, logical block 864999
lost page write due to I/O error on dm-2
sd 0:0:0:0: timing out command, waited 1080s
sd 0:0:0:0: Unhandled error code
sd 0:0:0:0: SCSI error: return code = 0x06000008
Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
sd 0:0:0:0: timing out command, waited 1080s
sd 0:0:0:0: Unhandled error code
sd 0:0:0:0: SCSI error: return code = 0x06000008
Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
Aborting journal on device dm-2.
ext3_abort called.
EXT3-fs error (device dm-2): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days