Bug 1666912

Summary: smartpqi takes device offline after reset
Product: Red Hat Enterprise Linux 7 Reporter: Jon Magrini <jmagrini>
Component: kernel Assignee: Don Brace (Microchip) <dbrace>
kernel sub component: Storage Drivers QA Contact: guazhang <guazhang>
Status: CLOSED CURRENTRELEASE Docs Contact:
Severity: high    
Priority: unspecified CC: abeausol, afox, akaiser, bubrown, dbrace, guazhang, james.hofmeister, jaylee1230, ldigby, loberman, nkshirsa, olim, pdwyer, revers, rhandlin, rmadhuso, sbenesh, ssaner, tonay, toracat
Version: 7.6   
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-09-18 10:41:33 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
rhel-7.6.z compatible patch file (Flags: none)

Description Jon Magrini 2019-01-16 22:22:27 UTC
Description of problem:

After a command timeout, a device presented by smartpqi is taken offline once the driver's error handler (EH) performs a reset.


Version-Release number of selected component (if applicable):
3.10.0-957.1.3.el7.x86_64

How reproducible:
Resets appear to be load/workload induced. The end customer can reproduce the problem in testing; unfortunately, it only occurs when one specific job is run, and even then only about once every couple of hours.

Steps to Reproduce:
1. tbd
2.
3.

Actual results:

Jan  8 17:21:34 hostname kernel: smartpqi 0000:5c:00.0: resetting scsi 2:1:0:0
Jan  8 17:22:07 hostname kernel: smartpqi 0000:5c:00.0: reset of scsi 2:1:0:0: SUCCESS
Jan  8 17:22:07 hostname kernel: sd 2:1:0:0: [sdb] Medium access timeout failure. Offlining disk!
Jan  8 17:22:07 hostname kernel: sd 2:1:0:0: Device offlined - not ready after error recovery
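
For context: the "Medium access timeout failure. Offlining disk!" message comes from the sd driver's medium-access timeout handling, which counts commands that still carry a timeout result after error recovery has otherwise succeeded. A simplified sketch of that logic (not verbatim kernel source; names follow the upstream sd driver):

/* Called by the SCSI error handler for each recovered command. */
static int sd_eh_action_sketch(struct scsi_cmnd *scmd, int eh_disp)
{
	struct scsi_disk *sdkp = scsi_disk(scmd->request->rq_disk);

	/* Only count reads/writes whose result is still "timed out"
	 * even though error recovery itself reported success. */
	if (!scsi_medium_access_command(scmd) ||
	    host_byte(scmd->result) != DID_TIME_OUT ||
	    eh_disp != SUCCESS)
		return eh_disp;

	sdkp->medium_access_timed_out++;

	/* Once the per-disk counter reaches the limit (2 by default),
	 * the disk is taken offline. */
	if (sdkp->medium_access_timed_out >= sdkp->max_medium_access_timeouts) {
		scmd_printk(KERN_ERR, scmd,
			    "Medium access timeout failure. Offlining disk!\n");
		scsi_device_set_state(scmd->device, SDEV_OFFLINE);
		return FAILED;
	}

	return eh_disp;
}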

After applying the following patch to a test kernel: 
---
https://patchwork.kernel.org/patch/10718979/

Expected results are observed and the device is not offlined:
---
Jan 15 14:51:57 hostname kernel: smartpqi 0000:5c:00.0: resetting scsi 2:1:0:0
Jan 15 14:51:57 hostname kernel: smartpqi 0000:5c:00.0: reset of scsi 2:1:0:0: SUCCESS

Comment 2 Jon Magrini 2019-01-16 22:29:25 UTC
Created attachment 1521145 [details]
rhel-7.6.z compatible patch file

Comment 3 Don Brace (Microchip) 2019-01-18 19:28:18 UTC
There is a patch that corrects this issue.
This patch is in the patch series for RHEL7.7.

commit 329b1669ac50a9420c5bdd44f649371e3fa0cb28
Author: Kevin Barnett <kevin.barnett>
Date:   Fri Dec 7 16:29:51 2018 -0600

    scsi: smartpqi: correct lun reset issues
    
    Problem:
    The Linux kernel takes a logical volume offline after a LUN reset.  This is
    generally accompanied by this message in the dmesg output:
    
    Device offlined - not ready after error recovery
    
    Root Cause:
    The root cause is a "quirk" in the timeout handling in the Linux SCSI
    layer. The Linux kernel places a 30-second timeout on most media access
    commands (reads and writes) that it sends to device drivers.  When a media
    access command times out, the Linux kernel goes into error recovery mode
    for the LUN that was the target of the command that timed out. Every
    command that timed out is kept on a list inside of the Linux kernel to be
    retried later. The kernel attempts to recover the command(s) that timed out
    by issuing a LUN reset followed by a TEST UNIT READY. If the LUN reset and
    TEST UNIT READY commands are successful, the kernel retries the command(s)
    that timed out.
    
    Each SCSI command issued by the kernel has a result field associated with
    it. This field indicates the final result of the command (success or
    error). When a command times out, the kernel places a value in this result
    field indicating that the command timed out.
    
    The "quirk" is that after the LUN reset and TEST UNIT READY commands are
    completed, the kernel checks each command on the timed-out command list
    before retrying it. If the result field is still "timed out", the kernel
    treats that command as not having been successfully recovered for a
    retry. If the number of commands that are in this state is greater than
    two, the kernel takes the LUN offline.
    
    Fix:
    When our RAIDStack receives a LUN reset, it simply waits until all
    outstanding commands complete. Generally, all of these outstanding commands
    complete successfully. Therefore, the fix in the smartpqi driver is to
    always set the command result field to indicate success when a request
    completes successfully. This normally isn’t necessary because the result
    field is always initialized to success when the command is submitted to the
    driver. So when the command completes successfully, the result field is
    left untouched. But in this case, the kernel changes the result field
    behind the driver’s back and then expects the field to be changed by the
    driver as the commands that timed-out complete.
    
    Reviewed-by: Dave Carroll <david.carroll>
    Reviewed-by: Scott Teel <scott.teel>
    Signed-off-by: Kevin Barnett <kevin.barnett>
    Signed-off-by: Don Brace <don.brace>
    Signed-off-by: Martin K. Petersen <martin.petersen>
    (cherry picked from commit 2ba55c9851d74eb015a554ef69ddf2ef061d5780)
    Signed-off-by: Don Brace <dbrace>
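
For illustration, a minimal sketch of the approach the fix describes (the function and parameter names here are made up, not the actual smartpqi source): on a successful completion, the driver explicitly rewrites the result field so a stale "timed out" value written by the SCSI error handler does not survive into the post-reset retry check.

/* Hypothetical completion path in a SCSI low-level driver. */
static void example_io_complete(struct scsi_cmnd *scmd, bool fw_success)
{
	if (fw_success)
		scmd->result = DID_OK << 16;	/* clear any stale DID_TIME_OUT */

	scmd->scsi_done(scmd);			/* return the command to the midlayer */
}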

Comment 4 Rob Evers 2019-01-18 21:41:45 UTC
(In reply to Don Brace from comment #3)

Do you plan to post this to rhkl?

Comment 5 Don Brace (Microchip) 2019-01-18 21:55:16 UTC
(In reply to Rob Evers from comment #4)
> (In reply to Don Brace from comment #3)
> 
> Do you plan to post this to rhkl?

I can post this patch to 7.6? If so, sure.

Comment 6 Rob Evers 2019-01-21 15:53:54 UTC
(In reply to Don Brace from comment #5)
> (In reply to Rob Evers from comment #4)
> > (In reply to Don Brace from comment #3)
> > 
> > Do you plan to post this to rhkl?
> 
> I can post this patch to 7.6? If so, sure.

rhel7.6 went ga last fall.  Was the patch already posted for rhel7.7 as part of the patchset for https://bugzilla.redhat.com/show_bug.cgi?id=1641112 ?

The fix needs to be posted and accepted into rhel7.7 release and then backported by Red Hat into a rhel7.6 errata kernel.

Comment 7 guazhang@redhat.com 2019-01-22 01:13:24 UTC
Hello

QE requests OtherQA. Since the issue only occurs in the customer's environment, could someone check whether the customer can help provide test results?
Thanks in advance.

Comment 12 guazhang@redhat.com 2019-02-14 02:10:10 UTC
The customer has not responded to the OtherQA request, so QE will do sanity testing.

Comment 17 Jon Magrini 2019-03-28 20:35:51 UTC
The patch needed for this BZ is being added to 7.7 via RHBZ 1641112, which is ON_QA.  I think here we just need to cherry pick http://patchwork.lab.bos.redhat.com/patch/237396/ to resolve the cases attached to this BZ.  

-Jon

Comment 18 jaylee1230 2019-04-16 04:28:00 UTC
Hi Microsemi & Redhat Eng,

My customer has been hitting this issue. Do you have a patched driver for 7.4 as well?

Comment 19 loberman 2019-04-30 18:35:38 UTC
Hi Don

What do you need from us to make progress here?
We have customers starting to ask for the fix.

Regards
Laurence

Comment 21 Don Brace (Microchip) 2019-05-02 20:37:46 UTC
I can patch 7.4 driver. I'll let you know soon.

That's what is needed...correct?

Comment 22 Don Brace (Microchip) 2019-05-02 21:09:57 UTC
(In reply to Don Brace from comment #21)
> I can patch 7.4 driver. I'll let you know soon.
> 
> That's what is needed...correct?

I say this because the reset patch went into rhel77 (along with a lot of other patches).


	Re: [RHEL 7.7 e-stor V2 PATCH 00/32] smartpqi updates

Comment 23 Don Brace (Microchip) 2019-05-02 22:33:44 UTC
(In reply to Don Brace from comment #22)
> (In reply to Don Brace from comment #21)
> > I can patch 7.4 driver. I'll let you know soon.
> > 
> > That's what is needed...correct?
> 
> I say this because the reset patch went into rhel77 (along with a lot of
> other patches).
> 
> 
> 	Re: [RHEL 7.7 e-stor V2 PATCH 00/32] smartpqi updates

I have two brew-builds going, one for RHEL7.4 and one for RHEL7.6.

Do you just want the patch, or the brew-build?

Comment 24 Don Brace (Microchip) 2019-05-03 13:57:25 UTC
I cherry-picked the patch into RHEL7.4 and RHEL7.6 and did a brew-build.


Brew build for 7.6: Task info: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=21449701

Brew build for 7.4: Task info: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=21449700

Comment 25 afox@redhat.com 2019-05-03 14:10:39 UTC
Hi Don, 

HPE's customer is looking for 7.5, in addition to 7.4 and 7.6. 

Regards,

Andy.

Comment 26 Don Brace (Microchip) 2019-05-03 19:12:16 UTC
(In reply to Rob Evers from comment #6)
> (In reply to Don Brace from comment #5)
> > (In reply to Rob Evers from comment #4)
> > > (In reply to Don Brace from comment #3)
> > > 
> > > Do you plan to post this to rhkl?
> > 
> > I can post this patch to 7.6? If so, sure.
> 
> rhel7.6 went ga last fall.  Was the patch already posted for rhel7.7 as part
> of the patchset for https://bugzilla.redhat.com/show_bug.cgi?id=1641112 ?
> 
> The fix needs to be posted and accepted into rhel7.7 release and then
> backported by Red Hat into a rhel7.6 errata kernel.

Rob, 

Do I need to create a BZ for RHEL7.6z and submit this patch to 7.6z?

Comment 27 Don Brace (Microchip) 2019-05-06 19:11:42 UTC
(In reply to Don Brace from comment #26)
> (In reply to Rob Evers from comment #6)
> > (In reply to Don Brace from comment #5)
> > > (In reply to Rob Evers from comment #4)
> > > > (In reply to Don Brace from comment #3)
> > > > 
> > > > Do you plan to post this to rhkl?
> > > 
> > > I can post this patch to 7.6? If so, sure.
> > 
> > rhel7.6 went ga last fall.  Was the patch already posted for rhel7.7 as part
> > of the patchset for https://bugzilla.redhat.com/show_bug.cgi?id=1641112 ?
> > 
> > The fix needs to be posted and accepted into rhel7.7 release and then
> > backported by Red Hat into a rhel7.6 errata kernel.
> 
> Rob, 
> 
> Do I need to create a BZ for RHEL7.6z and submit this patch to 7.6z?

Andre, will the patch to fix this issue be cherry-picked from RHEL7.7 into 7.6z?

Comment 28 Rob Evers 2019-05-07 14:31:27 UTC
Hi Don,  The BZ where the original rhel7.7 patch went in needs to have z-stream requests made.  I have done that (BZ 1641112) and added a reference to your patch above there.

Comment 30 guazhang@redhat.com 2019-07-08 13:19:34 UTC
Hello

I see the patch [1] has been applied in rhel7.7; may I move this to VERIFIED?

[1]https://patchwork.kernel.org/patch/10718979/

Comment 31 Oonkwee Lim 2019-07-13 16:45:11 UTC
Hello,

I have a customer requesting this patch for 7.6z.

Who is responsible for getting the patch into 7.6z?

What is the schedule like?


Thanks and Regards

Oonkwee Lim
Enterprise Cloud Support

Comment 33 Rob Evers 2019-08-22 14:54:21 UTC
Can this be closed as current-release?

Comment 34 Jon Magrini 2019-08-23 21:14:42 UTC
(In reply to Rob Evers from comment #33)
> Can this be closed as current-release?

I'm good with that. Thanks.

Comment 35 Yongcheng Yang 2019-09-18 10:41:33 UTC
Closing this based on the above comments.