Bug 1471302 - 4.12 renders mtx unable to manipulate a tape library
4.12 renders mtx unable to manipulate a tape library
Status: ON_QA
Product: Fedora
Classification: Fedora
Component: kernel (Show other bugs)
rawhide
Unspecified Unspecified
unspecified Severity unspecified
: ---
: ---
Assigned To: Kernel Maintainer List
Fedora Extras Quality Assurance
: Reopened
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2017-07-14 19:23 EDT by Jason Tibbitts
Modified: 2017-07-28 20:54 EDT (History)
8 users (show)

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-07-27 09:04:19 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Upstream fix manually applied against 4.12. (1.12 KB, patch)
2017-07-26 16:49 EDT, Jason Tibbitts
no flags Details | Diff
Upstream patch fixing this issue (2.56 KB, patch)
2017-07-27 09:02 EDT, Jason Tibbitts
no flags Details | Diff


External Trackers
Tracker ID Priority Status Summary Last Updated
Linux Kernel 196375 None None None 2017-07-17 09:46 EDT
Launchpad 1704512 None None None 2017-07-17 09:46 EDT

  None (edit)
Description Jason Tibbitts 2017-07-14 19:23:43 EDT
This is an odd one.  I'm actually running Fedora 25, but to dig into a problem someone reported I pulled a rawhide kernel to check and then have worked through several random kernel builds from koji to try and figure out where the problem crept in.

Basically, this is what should happen:

> mtx -f /dev/sg7 next 0
Unloading drive 0 into Storage Element 46...done
Loading media from Storage Element 47 into drive 0...

But this is what does happen:

> mtx -f /dev/sg7 next 0
Unloading drive 0 into Storage Element 46...mtx: Request Sense: Long Report=yes
mtx: Request Sense: Valid Residual=no
mtx: Request Sense: Error Code=0 (Unknown?!)
mtx: Request Sense: Sense Key=No Sense
mtx: Request Sense: FileMark=no
mtx: Request Sense: EOM=no
mtx: Request Sense: ILI=no
mtx: Request Sense: Additional Sense Code = 00
mtx: Request Sense: Additional Sense Qualifier = 00
mtx: Request Sense: BPV=no
mtx: Request Sense: Error in CDB=no
mtx: Request Sense: SKSV=no
MOVE MEDIUM from Element Address 1 to 1046 Failed

Trying random kernel builds from koji, I've found that kernel-4.12.0-0.rc0.git2.1.fc27.x86_64 is good and kernel-4.12.0-0.rc0.git9.1.fc27.x86_64 is bad.  Will try some more and report back.

This matches up with what an Ubuntu user on IRC has found; the 4.12 series is bad but 4.11.10 is OK.  We have quote different libraries and HBAs (him a Dell ML6000 library and qla2462 FC board and me an Overland Neo T48 with an LSI 3008SAS).  mtx for both of us is 1.3.12 (in Fedora that's mtx-1.3.12-14.fc24.x86_64).
Comment 1 Jason Tibbitts 2017-07-14 19:56:08 EDT
So the last good build was 4.12.0-0.rc0.git2.1.fc27 which corresponds to v4.11-4395-g89c9fea.  The next successful koji build is 4.12.0-0.rc0.git4.1.fc27 which corresponds to v4.11-8539-gaf82455.  Sadly there are three days of merge window in between.

I know I need to either bisect down further in there or manually examine the changes between those two commits but sadly I don't really have any idea how.  I only every learned how to use mock to build kernel packages, so if you give me a patch I can easily test it but I don't know how to take the vanilla upstream source and do a proper bisect.
Comment 2 Jason Tibbitts 2017-07-14 19:59:26 EDT
In retrospect, that was rather poor wording.

The last good build is 4.12.0-0.rc0.git2.1.fc27.  The next koji build after that which actually produced packages is 4.12.0-0.rc0.git4.1.fc27 which I have tried and found to be bad.  So the problem came in between those two releases, which correspond to v4.11-4395-g89c9fea (good) and v4.11-8539-gaf82455 (bad).
Comment 3 Jason Tibbitts 2017-07-17 17:38:37 EDT
I spent the whole day building kernels and git bisect gives me:

5c66d9393f583778e8dc1ee6a69c5bbe9ab28eaa is the first bad commit
commit 5c66d9393f583778e8dc1ee6a69c5bbe9ab28eaa
Author: NeilBrown <neilb@suse.com>
Date:   Mon Apr 10 12:15:13 2017 +1000

    scsi: ibmvfc: don't check for failure from mempool_alloc()

    mempool_alloc() cannot fail when passed GFP_NOIO or any other gfp
    setting that is permitted to sleep.  So remove this pointless code.

    Signed-off-by: NeilBrown <neilb@suse.com>
    Acked-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

:040000 040000 2ade2de52266bc7b55c0be0d0b1c371b4dc72a1a 223c5a3fb34016c7dc230177773ff63185126442 M      drivers

That doesn't feel related to me, but I'll revert, rebuild and see what happens.
Comment 4 Jason Tibbitts 2017-07-17 19:59:02 EDT
I redid the bisection, hopefully without screwing it up, and this time it ended at:

commit 28676d869bbb5257b5f14c0c95ad3af3a7019dd5
Author: Johannes Thumshirn <jthumshirn@suse.de>
Date:   Fri Apr 7 09:34:15 2017 +0200

    scsi: sg: check for valid direction before starting the request
    
    Check for a valid direction before starting the request, otherwise we
    risk running into an assertion in the scsi midlayer checking for valid
    requests.
    
    [mkp: fixed typo]
    
    Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
    Link: http://www.spinics.net/lists/linux-scsi/msg104400.html
    Reported-by: Dmitry Vyukov <dvyukov@google.com>
    Signed-off-by: Hannes Reinecke <hare@suse.com>
    Tested-by: Johannes Thumshirn <jthumshirn@suse.de>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

which does look like it could be related.
Comment 5 Jason Tibbitts 2017-07-18 12:51:10 EDT
Supposedly fixed by the following, but I'll do some more builds to test.  If this does fix things I'll send it to stable and ask that it be included in our 4.12 kernels so my backups don't break.

commit 68c59fcea1f2c6a54c62aa896cc623c1b5bc9b47
Author: Johannes Thumshirn <jthumshirn@suse.de>
Date:   Fri Jul 7 10:56:38 2017 +0200

    scsi: sg: fix SG_DXFER_FROM_DEV transfers

    SG_DXFER_FROM_DEV transfers do not necessarily have a dxferp as we set
    it to NULL for the old sg_io read/write interface, but must have a
    length bigger than 0. This fixes a regression introduced by commit
    28676d869bbb ("scsi: sg: check for valid direction before starting the
    request")

    Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
    Fixes: 28676d869bbb ("scsi: sg: check for valid direction before starting the request")
    Reported-by: Chris Clayton <chris2553@googlemail.com>
    Tested-by: Chris Clayton <chris2553@googlemail.com>
    Cc: Douglas Gilbert <dgilbert@interlog.com>
    Reviewed-by: Hannes Reinecke <hare@suse.com>
    Tested-by: Chris Clayton <chris2553@googlemail.com>
    Acked-by: Douglas Gilbert <dgilbert@interlog.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index 21225d6..1e82d41 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -758,8 +758,11 @@ static bool sg_is_valid_dxfer(sg_io_hdr_t *hp)
                if (hp->dxferp || hp->dxfer_len > 0)
                        return false;
                return true;
-       case SG_DXFER_TO_DEV:
        case SG_DXFER_FROM_DEV:
+               if (hp->dxfer_len < 0)
+                       return false;
+               return true;
+       case SG_DXFER_TO_DEV:
        case SG_DXFER_TO_FROM_DEV:
                if (!hp->dxferp || hp->dxfer_len == 0)
                        return false;
Comment 6 Jason Tibbitts 2017-07-18 14:53:20 EDT
Turns out that patch doesn't help as far as I can tell.  A build of current master HEAD still has the problem, and cherry picking that patch on top of a clean v4.12 also still has the problem.  Reverting 28676d869bbb5257b5f14c0c95ad3af3a7019dd5 on top of a clean 4.12 checkout works just fine.
Comment 7 Jason Tibbitts 2017-07-26 16:49 EDT
Created attachment 1305032 [details]
Upstream fix manually applied against 4.12.

The original author decided to mostly revert the original patch and go with a much simpler check.  I manually applied that against 4.12 and it's also fine.  The resulting patch is attached.
Comment 8 Jason Tibbitts 2017-07-27 09:02 EDT
Created attachment 1305357 [details]
Upstream patch fixing this issue

The attached patch was committed to the scsi-fixes tree so it should appear in 4.13.  It was also sent to stable so hopefully it will appear in a 4.12 point release soon.
Comment 9 Jason Tibbitts 2017-07-27 09:04:19 EDT
I should add that this has already been committed to Fedora's 4.12 series, so I'll go ahead and close this ticket out.  My thanks to everyone who helped me to get this bisected and fixed.
Comment 10 Fedora Update System 2017-07-28 02:12:26 EDT
kernel-4.12.4-300.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-14ad2c5d17
Comment 11 Fedora Update System 2017-07-28 20:54:40 EDT
kernel-4.12.4-300.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-14ad2c5d17

Note You need to log in before you can comment on or make changes to this bug.