Bug 1478201 - kernel runs out of memory with 256 virtio-scsi disks
Status: CLOSED UPSTREAM
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Blocks: TRACKER-bugs-affecting-libguestfs
Reported: 2017-08-03 17:44 EDT by Richard W.M. Jones
Modified: 2018-05-02 04:55 EDT
CC List: 10 users

Last Closed: 2018-05-02 04:55:16 EDT
Type: Bug

Attachments
find-max-disks.pl (785 bytes, text/plain)
2017-08-04 17:02 EDT, Richard W.M. Jones
Description Richard W.M. Jones 2017-08-03 17:44:22 EDT
Description of problem:

A recent change to the Rawhide kernel has made it consume
much more RAM when scanning virtio-scsi disks.  Now it cannot
add 256 disks without failing with:

[    1.266507] scsi_alloc_sdev: Allocation failure during SCSI scanning, some SCSI devices might not be configured
[    1.272271] scsi_alloc_sdev: Allocation failure during SCSI scanning, some SCSI devices might not be configured
[    1.277880] scsi_alloc_sdev: Allocation failure during SCSI scanning, some SCSI devices might not be configured

(This happens after 238 disks in this case).  The VM has 500 MB
of RAM and nothing else is running.

Version-Release number of selected component (if applicable):

kernel-4.13.0-0.rc2.git3.1.fc27.x86_64

Didn't fail with kernel-4.11.9-300.fc26.x86_64

How reproducible:

100%

Steps to Reproduce:
1. Add 256 virtio-scsi disks to a VM.
Comment 1 Richard W.M. Jones 2017-08-04 16:07:58 EDT
I bisected this to:

5c279bd9e40624f4ab6e688671026d6005b066fa is the first bad commit
commit 5c279bd9e40624f4ab6e688671026d6005b066fa
Author: Christoph Hellwig <hch@lst.de>
Date:   Fri Jun 16 10:27:55 2017 +0200

    scsi: default to scsi-mq
    
    Remove the SCSI_MQ_DEFAULT config option and default to the blk-mq I/O
    path now that we had plenty of testing, and have I/O schedulers for
    blk-mq.  The module option to disable the blk-mq path is kept around for
    now.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

:040000 040000 57ec7d5d2ba76592a695f533a69f747700c31966 c79f6ecb070acc4fadf6fc05ca9ba32bc9c0c665 M	drivers
Comment 2 Richard W.M. Jones 2017-08-04 16:15:40 EDT
To bisect this I used the following libguestfs script which adds
1 appliance disk + 255 scratch disks (all virtio-scsi) to a VM,
and checks that it boots up to userspace.  The crash happens before
we reach userspace.

#!/usr/bin/perl -w

use strict;
use Sys::Guestfs;

my $g = Sys::Guestfs->new ();
$g->set_trace (1);
$g->set_verbose (1);

# The appliance provides 1 disk of its own; adding 255 scratch disks of
# 1 MB each brings the total seen by the guest to 256 virtio-scsi disks.
for (my $i = 0; $i < 255; ++$i) {
    $g->add_drive_scratch (1024*1024);
}

# launch () boots the appliance; on affected kernels the allocation
# failure happens during the SCSI scan, before we reach userspace.
$g->launch ();
$g->shutdown ();
print "PASSED\n";
Comment 3 Richard W.M. Jones 2017-08-04 16:59:48 EDT
I wrote a script that uses a binary search to find the maximum number of
disks that can be added to our guest, which has 1 vCPU and 500 MB RAM (no swap):

With scsi-mq enabled:   175 disks
With scsi-mq disabled: 1755 disks
Comment 4 Richard W.M. Jones 2017-08-04 17:02 EDT
Created attachment 1309205
find-max-disks.pl

This is the test I used for comment 3.  It requires supermin >= 5.1.18 and
a patched libguestfs: https://github.com/rwmjones/libguestfs/tree/max-disks
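
The attachment itself is not reproduced in this report, but the approach is
a plain binary search over the disk count.  A minimal sketch of that idea
follows (not the actual find-max-disks.pl; boots_with() is a hypothetical
helper that launches an appliance with $n scratch disks and returns true if
it reaches userspace):

#!/usr/bin/perl -w
# Sketch only -- illustrates the binary search, not the real attachment.
use strict;
use Sys::Guestfs;

# Hypothetical helper: does an appliance with $n scratch disks boot?
sub boots_with {
    my ($n) = @_;
    my $g = Sys::Guestfs->new ();
    for (my $i = 0; $i < $n; ++$i) {
        $g->add_drive_scratch (1024*1024);
    }
    my $ok = eval { $g->launch (); 1 } ? 1 : 0;
    eval { $g->shutdown () };
    return $ok;
}

# Assumed invariant: $lo disks always boot, $hi disks never do.
my ($lo, $hi) = (1, 4096);
while ($hi - $lo > 1) {
    my $mid = int (($lo + $hi) / 2);
    if (boots_with ($mid)) { $lo = $mid } else { $hi = $mid }
}
print "maximum number of disks: $lo\n";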
Comment 5 Richard W.M. Jones 2017-08-05 02:27:46 EDT
I started a thread on LKML.  No takers at present ...
https://lkml.org/lkml/2017/8/4/601
Comment 6 Richard W.M. Jones 2017-08-10 13:17:29 EDT
Patches posted to the kernel:
https://lkml.org/lkml/2017/8/10/708

and qemu:
https://lists.nongnu.org/archive/html/qemu-devel/2017-08/msg02085.html

If these are accepted then we will also need changes to libvirt and
libguestfs.
Comment 7 Laura Abbott 2018-04-06 14:42:48 EDT
Did these get picked up?
Comment 8 Richard W.M. Jones 2018-04-09 04:30:31 EDT
This is not fixed upstream.  Please leave this bug open.
Comment 9 Richard W.M. Jones 2018-04-20 10:21:36 EDT
I have some more news to report on this.  It was temporarily fixed
in 4.15/4.16, but it has regressed again in 4.17.0-rc1.

4.15.0-0.rc2.git2.1.fc28.x86_64: >= 256 virtio-scsi disks *
4.15.0-0.rc8.git0.1.fc28.x86_64: >= 256 virtio-scsi disks *
4.16.3-300.fc28.x86_64:          >= 256 virtio-scsi disks *
4.17.0-0.rc1.git1.1.fc29.x86_64: 191 virtio-scsi disks

Could this be something to do with Rawhide kernels & debug settings?
How do I find out if a Rawhide kernel has debug enabled?

* The version of libguestfs I'm using doesn't allow me to add more
than 256 disks.
Comment 10 Justin M. Forbes 2018-04-20 15:04:44 EDT
In general, rawhide kernels have debug enabled, other than the rc*-git0.1 versions.  If you want to test whether it is a debug vs. non-debug issue, you can always check the kernels from the rawhide-nodebug repository.
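
Regarding the question in comment 9 about telling whether an installed
kernel has debug enabled: one way is to inspect the installed config file
for heavyweight debug options.  This is a sketch only -- the exact options
Fedora's debug configs toggle vary between releases, and CONFIG_PROVE_LOCKING
is used purely as an example marker, not an authoritative test:

#!/usr/bin/perl -w
# Sketch: guess whether the running kernel is a debug build by looking
# for an example heavyweight debug option in its installed config.
use strict;

chomp (my $release = `uname -r`);
my $config = "/boot/config-$release";

open my $fh, '<', $config or die "cannot open $config: $!";
my $debug = 0;
while (my $line = <$fh>) {
    $debug = 1 if $line =~ /^CONFIG_PROVE_LOCKING=y/;
}
close $fh;

printf "%s looks like a %s kernel\n", $release,
       $debug ? "debug" : "non-debug";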
Comment 11 Richard W.M. Jones 2018-05-02 04:55:16 EDT
kernel-4.17.0-0.rc3.git1.2.fc29.x86_64 (nodebug): >= 256 virtio-scsi disks

So yes, it looks like enabling debug reduces the number of virtio-scsi
disks that can be added, for whatever reason.

Since this is now working, I'm going to close this bug as fixed upstream.
