Bug 1666912
| Summary: | smartpqi takes device offline after reset | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Jon Magrini <jmagrini> |
| Component: | kernel | Assignee: | Don Brace (Microchip) <dbrace> |
| kernel sub component: | Storage Drivers | QA Contact: | guazhang <guazhang> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | abeausol, afox, akaiser, bubrown, dbrace, guazhang, james.hofmeister, jaylee1230, ldigby, loberman, nkshirsa, olim, pdwyer, revers, rhandlin, rmadhuso, sbenesh, ssaner, tonay, toracat |
| Version: | 7.6 | | |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-09-18 10:41:33 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Jon Magrini
2019-01-16 22:22:27 UTC
Created attachment 1521145
rhel-7.6.z compatible patch file
There is a patch that corrects this issue. This patch is in the patch series for RHEL7.7.

commit 329b1669ac50a9420c5bdd44f649371e3fa0cb28
Author: Kevin Barnett <kevin.barnett>
Date: Fri Dec 7 16:29:51 2018 -0600

scsi: smartpqi: correct lun reset issues

Problem:
The Linux kernel takes a logical volume offline after a LUN reset. This is generally accompanied by this message in the dmesg output:

Device offlined - not ready after error recovery

Root Cause:
The root cause is a "quirk" in the timeout handling in the Linux SCSI layer. The Linux kernel places a 30-second timeout on most media access commands (reads and writes) that it sends to device drivers. When a media access command times out, the Linux kernel goes into error recovery mode for the LUN that was the target of the command that timed out. Every command that timed out is kept on a list inside of the Linux kernel to be retried later. The kernel attempts to recover the command(s) that timed out by issuing a LUN reset followed by a TEST UNIT READY. If the LUN reset and TEST UNIT READY commands are successful, the kernel retries the command(s) that timed out.

Each SCSI command issued by the kernel has a result field associated with it. This field indicates the final result of the command (success or error). When a command times out, the kernel places a value in this result field indicating that the command timed out.

The "quirk" is that after the LUN reset and TEST UNIT READY commands are completed, the kernel checks each command on the timed-out command list before retrying it. If the result field is still "timed out", the kernel treats that command as not having been successfully recovered for a retry. If the number of commands that are in this state is greater than two, the kernel takes the LUN offline.

Fix:
When our RAIDStack receives a LUN reset, it simply waits until all outstanding commands complete. Generally, all of these outstanding commands complete successfully.
Therefore, the fix in the smartpqi driver is to always set the command result field to indicate success when a request completes successfully. This normally isn’t necessary because the result field is always initialized to success when the command is submitted to the driver. So when the command completes successfully, the result field is left untouched. But in this case, the kernel changes the result field behind the driver’s back and then expects the field to be changed by the driver as the commands that timed out complete.

Reviewed-by: Dave Carroll <david.carroll>
Reviewed-by: Scott Teel <scott.teel>
Signed-off-by: Kevin Barnett <kevin.barnett>
Signed-off-by: Don Brace <don.brace>
Signed-off-by: Martin K. Petersen <martin.petersen>
(cherry picked from commit 2ba55c9851d74eb015a554ef69ddf2ef061d5780)
Signed-off-by: Don Brace <dbrace>

(In reply to Don Brace from comment #3)
Do you plan to post this to rhkl?

(In reply to Rob Evers from comment #4)
> (In reply to Don Brace from comment #3)
>
> Do you plan to post this to rhkl?

I can post this patch to 7.6? If so, sure.

(In reply to Don Brace from comment #5)
> (In reply to Rob Evers from comment #4)
> > (In reply to Don Brace from comment #3)
> >
> > Do you plan to post this to rhkl?
>
> I can post this patch to 7.6? If so, sure.

rhel7.6 went ga last fall. Was the patch already posted for rhel7.7 as part of the patchset for https://bugzilla.redhat.com/show_bug.cgi?id=1641112 ?

The fix needs to be posted and accepted into rhel7.7 release and then backported by Red Hat into a rhel7.6 errata kernel.

Hello,
QE requests OtherQA if the issue occurs only in the customer environment. Could someone check whether the customer can provide test results? Thanks in advance.

The customer did not respond to the OtherQA request, so QE will do sanity testing.

The patch needed for this BZ is being added to 7.7 via RHBZ 1641112, which is ON_QA.
I think here we just need to cherry pick http://patchwork.lab.bos.redhat.com/patch/237396/ to resolve the cases attached to this BZ.
-Jon

Hi Microsemi & Redhat Eng,
My customer has been hitting the issue. Do you have a patched driver for 7.4 as well?

Hi Don,
What do you need from us to make progress here? We have customers starting to ask for the fix.
Regards
Laurence

I can patch 7.4 driver. I'll let you know soon.

That's what is needed...correct?

(In reply to Don Brace from comment #21)
> I can patch 7.4 driver. I'll let you know soon.
>
> That's what is needed...correct?

I say this because the reset patch went into rhel77 (along with a lot of other patches).

Re: [RHEL 7.7 e-stor V2 PATCH 00/32] smartpqi updates

(In reply to Don Brace from comment #22)
> (In reply to Don Brace from comment #21)
> > I can patch 7.4 driver. I'll let you know soon.
> >
> > That's what is needed...correct?
>
> I say this because the reset patch went into rhel77 (along with a lot of
> other patches).
>
>
> Re: [RHEL 7.7 e-stor V2 PATCH 00/32] smartpqi updates

I have two brew-builds going, 1 for rhel4 and 1 for rhel6. Do you just want the patch, or the brew-build?

I cherry-picked the patch into RHEL7.4 and RHEL7.6 and did a brew-build.

Brew build for 7.6:
Task info: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=21449701

Brew build for 7.4:
Task info: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=21449700

Hi Don,
HPE's customer is looking for 7.5, in addition to 7.4 and 7.6.
Regards,
Andy.

(In reply to Rob Evers from comment #6)
> (In reply to Don Brace from comment #5)
> > (In reply to Rob Evers from comment #4)
> > > (In reply to Don Brace from comment #3)
> > >
> > > Do you plan to post this to rhkl?
> >
> > I can post this patch to 7.6? If so, sure.
>
> rhel7.6 went ga last fall. Was the patch already posted for rhel7.7 as part
> of the patchset for https://bugzilla.redhat.com/show_bug.cgi?id=1641112 ?
>
> The fix needs to be posted and accepted into rhel7.7 release and then
> backported by Red Hat into a rhel7.6 errata kernel.

Rob,
Do I need to create a BZ for RHEL7.6z and submit this patch to 7.6z?

(In reply to Don Brace from comment #26)
> (In reply to Rob Evers from comment #6)
> > (In reply to Don Brace from comment #5)
> > > (In reply to Rob Evers from comment #4)
> > > > (In reply to Don Brace from comment #3)
> > > >
> > > > Do you plan to post this to rhkl?
> > >
> > > I can post this patch to 7.6? If so, sure.
> >
> > rhel7.6 went ga last fall. Was the patch already posted for rhel7.7 as part
> > of the patchset for https://bugzilla.redhat.com/show_bug.cgi?id=1641112 ?
> >
> > The fix needs to be posted and accepted into rhel7.7 release and then
> > backported by Red Hat into a rhel7.6 errata kernel.
>
> Rob,
>
> Do I need to create a BZ for RHEL7.6z and submit this patch to 7.6z?

Andre, will the patch to fix this issue be cherry-picked from RHEL7.7 into 7.6z?

Hi Don,
The BZ where the original rhel7.7 patch went in needs to have z-stream requests made. I have done that (BZ 1641112) and added a reference to your patch above there.

Hello,
I see the patch [1] has been applied in rhel7.7. May I move it to verified?

[1] https://patchwork.kernel.org/patch/10718979/

Hello,
I have a customer requesting this patch for 7.6z. Who is doing the needful to get the patch into 7.6z? What is the schedule like?
Thanks and Regards
Oonkwee Lim
Enterprise Cloud Support

Can this be closed as current-release?

(In reply to Rob Evers from comment #33)
> Can this be closed as current-release?

I'm good with that. Thanks.

Close this based on above comments.
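
For readers following the root-cause description in the commit message above, here is a minimal sketch of the result-field behaviour it describes. This is a simplified user-space model, not kernel or smartpqi code: every name (`model_cmd`, `model_timeout`, `model_complete_ok`, `model_run`, and the result codes) is hypothetical, and the "greater than two" threshold is taken from the commit message rather than from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes for this model only (not the kernel's values). */
enum model_result { MODEL_OK = 0, MODEL_TIMED_OUT = 1 };

#define MODEL_MAX_CMDS 16

/* A command reduced to the one field the commit message talks about. */
struct model_cmd {
    enum model_result result;
};

/* Kernel side: the result field starts out as success at submission. */
void model_submit(struct model_cmd *cmd)
{
    cmd->result = MODEL_OK;
}

/* Kernel side: on timeout the result is overwritten behind the driver's back. */
void model_timeout(struct model_cmd *cmd)
{
    cmd->result = MODEL_TIMED_OUT;
}

/*
 * Driver side: a request completes successfully.
 * Without the fix the result field is left untouched; with the fix it is
 * always set to success.
 */
void model_complete_ok(struct model_cmd *cmd, bool with_fix)
{
    if (with_fix)
        cmd->result = MODEL_OK;
}

/* Kernel side: after LUN reset + TEST UNIT READY, count unrecovered commands. */
bool model_lun_offlined(const struct model_cmd *cmds, size_t n)
{
    size_t still_timed_out = 0;

    for (size_t i = 0; i < n; i++) {
        if (cmds[i].result == MODEL_TIMED_OUT)
            still_timed_out++;
    }
    /* Threshold as stated in the commit message: more than two. */
    return still_timed_out > 2;
}

/* Run the whole scenario for n timed-out commands; returns true if offlined. */
bool model_run(size_t n, bool with_fix)
{
    struct model_cmd cmds[MODEL_MAX_CMDS];

    assert(n <= MODEL_MAX_CMDS);

    for (size_t i = 0; i < n; i++) {
        model_submit(&cmds[i]);
        model_timeout(&cmds[i]);
    }
    /* LUN reset: the controller waits for outstanding commands to complete. */
    for (size_t i = 0; i < n; i++)
        model_complete_ok(&cmds[i], with_fix);

    return model_lun_offlined(cmds, n);
}
```

In this model, `model_run(3, false)` reports the LUN as offlined because three commands still carry the timed-out result, while `model_run(3, true)` does not, which mirrors the before and after behaviour the commit message attributes to the patch.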