Bug 1471302 - 4.12 renders mtx unable to manipulate a tape library
Summary: 4.12 renders mtx unable to manipulate a tape library
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Depends On:
TreeView+ depends on / blocked
Reported: 2017-07-14 23:23 UTC by Jason Tibbitts
Modified: 2018-04-06 18:44 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2018-04-06 18:39:39 UTC

Attachments (Terms of Use)
Upstream fix manually applied against 4.12. (1.12 KB, patch)
2017-07-26 20:49 UTC, Jason Tibbitts
no flags Details | Diff
Upstream patch fixing this issue (2.56 KB, patch)
2017-07-27 13:02 UTC, Jason Tibbitts
no flags Details | Diff

System ID Priority Status Summary Last Updated
Linux Kernel 196375 None None None 2017-07-17 13:46:06 UTC
Launchpad 1704512 None None None 2017-07-17 13:46:39 UTC

Description Jason Tibbitts 2017-07-14 23:23:43 UTC
This is an odd one.  I'm actually running Fedora 25, but to dig into a problem someone reported I pulled a rawhide kernel to check and then have worked through several random kernel builds from koji to try and figure out where the problem crept in.

Basically, this is what should happen:

> mtx -f /dev/sg7 next 0
Unloading drive 0 into Storage Element 46...done
Loading media from Storage Element 47 into drive 0...

But this is what does happen:

> mtx -f /dev/sg7 next 0
Unloading drive 0 into Storage Element 46...mtx: Request Sense: Long Report=yes
mtx: Request Sense: Valid Residual=no
mtx: Request Sense: Error Code=0 (Unknown?!)
mtx: Request Sense: Sense Key=No Sense
mtx: Request Sense: FileMark=no
mtx: Request Sense: EOM=no
mtx: Request Sense: ILI=no
mtx: Request Sense: Additional Sense Code = 00
mtx: Request Sense: Additional Sense Qualifier = 00
mtx: Request Sense: BPV=no
mtx: Request Sense: Error in CDB=no
mtx: Request Sense: SKSV=no
MOVE MEDIUM from Element Address 1 to 1046 Failed

Trying random kernel builds from koji, I've found that kernel-4.12.0-0.rc0.git2.1.fc27.x86_64 is good and kernel-4.12.0-0.rc0.git9.1.fc27.x86_64 is bad.  Will try some more and report back.

This matches up with what an Ubuntu user on IRC has found; the 4.12 series is bad but 4.11.10 is OK.  We have quote different libraries and HBAs (him a Dell ML6000 library and qla2462 FC board and me an Overland Neo T48 with an LSI 3008SAS).  mtx for both of us is 1.3.12 (in Fedora that's mtx-1.3.12-14.fc24.x86_64).

Comment 1 Jason Tibbitts 2017-07-14 23:56:08 UTC
So the last good build was 4.12.0-0.rc0.git2.1.fc27 which corresponds to v4.11-4395-g89c9fea.  The next successful koji build is 4.12.0-0.rc0.git4.1.fc27 which corresponds to v4.11-8539-gaf82455.  Sadly there are three days of merge window in between.

I know I need to either bisect down further in there or manually examine the changes between those two commits but sadly I don't really have any idea how.  I only every learned how to use mock to build kernel packages, so if you give me a patch I can easily test it but I don't know how to take the vanilla upstream source and do a proper bisect.

Comment 2 Jason Tibbitts 2017-07-14 23:59:26 UTC
In retrospect, that was rather poor wording.

The last good build is 4.12.0-0.rc0.git2.1.fc27.  The next koji build after that which actually produced packages is 4.12.0-0.rc0.git4.1.fc27 which I have tried and found to be bad.  So the problem came in between those two releases, which correspond to v4.11-4395-g89c9fea (good) and v4.11-8539-gaf82455 (bad).

Comment 3 Jason Tibbitts 2017-07-17 21:38:37 UTC
I spent the whole day building kernels and git bisect gives me:

5c66d9393f583778e8dc1ee6a69c5bbe9ab28eaa is the first bad commit
commit 5c66d9393f583778e8dc1ee6a69c5bbe9ab28eaa
Author: NeilBrown <neilb@suse.com>
Date:   Mon Apr 10 12:15:13 2017 +1000

    scsi: ibmvfc: don't check for failure from mempool_alloc()

    mempool_alloc() cannot fail when passed GFP_NOIO or any other gfp
    setting that is permitted to sleep.  So remove this pointless code.

    Signed-off-by: NeilBrown <neilb@suse.com>
    Acked-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

:040000 040000 2ade2de52266bc7b55c0be0d0b1c371b4dc72a1a 223c5a3fb34016c7dc230177773ff63185126442 M      drivers

That doesn't feel related to me, but I'll revert, rebuild and see what happens.

Comment 4 Jason Tibbitts 2017-07-17 23:59:02 UTC
I redid the bisection, hopefully without screwing it up, and this time it ended at:

commit 28676d869bbb5257b5f14c0c95ad3af3a7019dd5
Author: Johannes Thumshirn <jthumshirn@suse.de>
Date:   Fri Apr 7 09:34:15 2017 +0200

    scsi: sg: check for valid direction before starting the request
    Check for a valid direction before starting the request, otherwise we
    risk running into an assertion in the scsi midlayer checking for valid
    [mkp: fixed typo]
    Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
    Link: http://www.spinics.net/lists/linux-scsi/msg104400.html
    Reported-by: Dmitry Vyukov <dvyukov@google.com>
    Signed-off-by: Hannes Reinecke <hare@suse.com>
    Tested-by: Johannes Thumshirn <jthumshirn@suse.de>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

which does look like it could be related.

Comment 5 Jason Tibbitts 2017-07-18 16:51:10 UTC
Supposedly fixed by the following, but I'll do some more builds to test.  If this does fix things I'll send it to stable and ask that it be included in our 4.12 kernels so my backups don't break.

commit 68c59fcea1f2c6a54c62aa896cc623c1b5bc9b47
Author: Johannes Thumshirn <jthumshirn@suse.de>
Date:   Fri Jul 7 10:56:38 2017 +0200

    scsi: sg: fix SG_DXFER_FROM_DEV transfers

    SG_DXFER_FROM_DEV transfers do not necessarily have a dxferp as we set
    it to NULL for the old sg_io read/write interface, but must have a
    length bigger than 0. This fixes a regression introduced by commit
    28676d869bbb ("scsi: sg: check for valid direction before starting the

    Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
    Fixes: 28676d869bbb ("scsi: sg: check for valid direction before starting the request")
    Reported-by: Chris Clayton <chris2553@googlemail.com>
    Tested-by: Chris Clayton <chris2553@googlemail.com>
    Cc: Douglas Gilbert <dgilbert@interlog.com>
    Reviewed-by: Hannes Reinecke <hare@suse.com>
    Tested-by: Chris Clayton <chris2553@googlemail.com>
    Acked-by: Douglas Gilbert <dgilbert@interlog.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index 21225d6..1e82d41 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -758,8 +758,11 @@ static bool sg_is_valid_dxfer(sg_io_hdr_t *hp)
                if (hp->dxferp || hp->dxfer_len > 0)
                        return false;
                return true;
-       case SG_DXFER_TO_DEV:
        case SG_DXFER_FROM_DEV:
+               if (hp->dxfer_len < 0)
+                       return false;
+               return true;
+       case SG_DXFER_TO_DEV:
        case SG_DXFER_TO_FROM_DEV:
                if (!hp->dxferp || hp->dxfer_len == 0)
                        return false;

Comment 6 Jason Tibbitts 2017-07-18 18:53:20 UTC
Turns out that patch doesn't help as far as I can tell.  A build of current master HEAD still has the problem, and cherry picking that patch on top of a clean v4.12 also still has the problem.  Reverting 28676d869bbb5257b5f14c0c95ad3af3a7019dd5 on top of a clean 4.12 checkout works just fine.

Comment 7 Jason Tibbitts 2017-07-26 20:49:39 UTC
Created attachment 1305032 [details]
Upstream fix manually applied against 4.12.

The original author decided to mostly revert the original patch and go with a much simpler check.  I manually applied that against 4.12 and it's also fine.  The resulting patch is attached.

Comment 8 Jason Tibbitts 2017-07-27 13:02:03 UTC
Created attachment 1305357 [details]
Upstream patch fixing this issue

The attached patch was committed to the scsi-fixes tree so it should appear in 4.13.  It was also sent to stable so hopefully it will appear in a 4.12 point release soon.

Comment 9 Jason Tibbitts 2017-07-27 13:04:19 UTC
I should add that this has already been committed to Fedora's 4.12 series, so I'll go ahead and close this ticket out.  My thanks to everyone who helped me to get this bisected and fixed.

Comment 10 Fedora Update System 2017-07-28 06:12:26 UTC
kernel-4.12.4-300.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-14ad2c5d17

Comment 11 Fedora Update System 2017-07-29 00:54:40 UTC
kernel-4.12.4-300.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-14ad2c5d17

Comment 12 Laura Abbott 2018-04-06 18:39:39 UTC
This somehow got marked as reopened?

Comment 13 Jason Tibbitts 2018-04-06 18:44:43 UTC
It was reopened when the update was submitted, and I'm thinking that particular update was superseded by another one at some point and never actually got pushed to stable.  The problem with mtx has certainly not reoccurred.

Note You need to log in before you can comment on or make changes to this bug.