Description of problem:

A recent change to the Rawhide kernel has made it consume much more RAM when scanning virtio-scsi disks.  It can no longer add 256 disks without failing with:

[ 1.266507] scsi_alloc_sdev: Allocation failure during SCSI scanning, some SCSI devices might not be configured
[ 1.272271] scsi_alloc_sdev: Allocation failure during SCSI scanning, some SCSI devices might not be configured
[ 1.277880] scsi_alloc_sdev: Allocation failure during SCSI scanning, some SCSI devices might not be configured

(This happens after 238 disks in this case.)  The VM has 500 MB of RAM and nothing else is running.

Version-Release number of selected component (if applicable):
kernel-4.13.0-0.rc2.git3.1.fc27.x86_64
Didn't fail with kernel-4.11.9-300.fc26.x86_64.

How reproducible:
100%

Steps to Reproduce:
1. Add 256 virtio-scsi disks to a VM.
I bisected this to:

5c279bd9e40624f4ab6e688671026d6005b066fa is the first bad commit
commit 5c279bd9e40624f4ab6e688671026d6005b066fa
Author: Christoph Hellwig <hch>
Date:   Fri Jun 16 10:27:55 2017 +0200

    scsi: default to scsi-mq

    Remove the SCSI_MQ_DEFAULT config option and default to the blk-mq
    I/O path now that we had plenty of testing, and have I/O schedulers
    for blk-mq.  The module option to disable the blk-mq path is kept
    around for now.

    Signed-off-by: Christoph Hellwig <hch>
    Signed-off-by: Martin K. Petersen <martin.petersen>

:040000 040000 57ec7d5d2ba76592a695f533a69f747700c31966 c79f6ecb070acc4fadf6fc05ca9ba32bc9c0c665 M drivers
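For anyone else testing this: here is a rough sketch of how one might check whether the running kernel is actually on the scsi-mq path, assuming the scsi_mod.use_blk_mq parameter mentioned in the commit message is exposed under sysfs (it is on kernels of this vintage):

#!/usr/bin/perl -w
# Sketch: report whether the SCSI midlayer is using the blk-mq
# (scsi-mq) path by reading the scsi_mod module parameter.
use strict;

my $param = "/sys/module/scsi_mod/parameters/use_blk_mq";
open my $fh, "<", $param or die "cannot read $param: $!";
my $val = <$fh>;
close $fh;
chomp $val;
print "scsi-mq is ",
      ($val eq "Y" || $val eq "1" ? "enabled" : "disabled"),
      " ($val)\n";

Booting with scsi_mod.use_blk_mq=0 on the kernel command line is the corresponding way to force the legacy I/O path.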
To bisect this I used the following libguestfs script which adds 1 appliance disk + 255 scratch disks (all virtio-scsi) to a VM, and checks that it boots up to userspace.  The crash happens before we reach userspace.

#!/usr/bin/perl -w

use Sys::Guestfs;

my $g = Sys::Guestfs->new ();
$g->set_trace (1);
$g->set_verbose (1);
my $i;
for ($i = 0; $i < 255; ++$i) {
  $g->add_drive_scratch (1024*1024);
}
$g->launch ();
$g->shutdown ();
print "PASSED\n";
I wrote a script which uses a binary search to find the maximum number of disks that can be added to our guest, which has 1 vCPU and 500 MB RAM (no swap):

With scsi-mq enabled:   175 disks
With scsi-mq disabled: 1755 disks
Created attachment 1309205 [details]
find-max-disks.pl

The test I used for comment 3.  This requires supermin >= 5.1.18 and a patched libguestfs:
https://github.com/rwmjones/libguestfs/tree/max-disks
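The attachment is the authoritative test; for reference, a rough sketch of the binary-search idea (with a hypothetical boots_with helper and assumed search bounds) looks something like this:

#!/usr/bin/perl -w
# Sketch of the binary-search approach; the attached find-max-disks.pl
# is the real test.  The search bounds are an assumption and the upper
# bound relies on the patched libguestfs with a raised disk limit.
use strict;
use Sys::Guestfs;

# Return true if a guest with $n scratch disks boots to userspace.
sub boots_with {
    my $n = shift;
    my $g = Sys::Guestfs->new ();
    eval {
        $g->add_drive_scratch (1024*1024) for 1 .. $n;
        $g->launch ();
        $g->shutdown ();
    };
    return $@ ? 0 : 1;
}

# Binary search for the largest disk count that still boots.
my ($lo, $hi) = (1, 4096);
while ($lo < $hi) {
    my $mid = int (($lo + $hi + 1) / 2);
    if (boots_with ($mid)) { $lo = $mid } else { $hi = $mid - 1 }
}
print "max disks: $lo\n";

Each probe boots a fresh appliance, so it is slow, but only O(log n) boots are needed.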
I started a thread on LKML. No takers at present ... https://lkml.org/lkml/2017/8/4/601
Patches posted to the kernel:
https://lkml.org/lkml/2017/8/10/708
and qemu:
https://lists.nongnu.org/archive/html/qemu-devel/2017-08/msg02085.html

If these are accepted then we will also need changes to libvirt and libguestfs.
Did these get picked up?
This is not fixed upstream. Please leave this bug open.
I have some more news to report on this.  It was temporarily fixed in 4.15/4.16, but it has regressed again in 4.17.0-rc1.

4.15.0-0.rc2.git2.1.fc28.x86_64:  >= 256 virtio-scsi disks *
4.15.0-0.rc8.git0.1.fc28.x86_64:  >= 256 virtio-scsi disks *
4.16.3-300.fc28.x86_64:           >= 256 virtio-scsi disks *
4.17.0-0.rc1.git1.1.fc29.x86_64:  191 virtio-scsi disks

Could this be something to do with Rawhide kernels & debug settings?  How do I find out if a Rawhide kernel has debug enabled?

* The version of libguestfs I'm using doesn't allow me to add more than 256 disks.
In general, Rawhide kernels have debug enabled, other than the rc*-git0.1 versions.  If you want to test whether it is a debug vs non-debug issue, you can always check the kernels from the rawhide-nodebug repository.
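To spot-check a particular installed kernel, one rough approach is to grep its config file for lock-debugging options; the options below are only examples of settings that usually differ between the debug and nodebug builds (an assumption, not an official list):

#!/usr/bin/perl -w
# Sketch: grep the installed kernel config for a few options that
# typically differ between Fedora debug and nodebug builds.
use strict;

my $release = `uname -r`;
chomp $release;
my $config = "/boot/config-$release";
open my $fh, "<", $config or die "cannot read $config: $!";
my @debug_opts = qw(CONFIG_LOCKDEP CONFIG_PROVE_LOCKING CONFIG_DEBUG_MUTEXES);
while (my $line = <$fh>) {
    foreach my $opt (@debug_opts) {
        print $line if $line =~ /\b\Q$opt\E\b/;
    }
}
close $fh;

If those options show up as =y, the kernel is most likely a debug build.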
kernel-4.17.0-0.rc3.git1.2.fc29.x86_64 (nodebug): >= 256 virtio-scsi disks

So yes, it looks like enabling debug reduces the number of virtio-scsi disks that can be added, for whatever reason.  Since this is now working, I'm going to close this bug as fixed upstream.
Is it really fixed? I am having a similar problem with the scsi_debug driver, bz1675071. The bug does not show up in Fedora, but this seems to be simply because scsi-mq is off by default.