Bug 1675071
| Summary: | scsi_debug consumes lots of memory | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Pavel Cahyna <pcahyna> |
| Component: | kernel | Assignee: | Ewan D. Milne <emilne> |
| kernel sub component: | Storage Drivers | QA Contact: | guazhang <guazhang> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | unspecified | ||
| Priority: | low | CC: | brueckner, djez, emilne, guazhang, hannsj_uhl, minlei |
| Version: | 8.0 | Keywords: | Regression |
| Target Milestone: | rc | ||
| Target Release: | 8.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Last Closed: | 2019-05-30 12:31:58 UTC | Type: | Bug |
Description
Pavel Cahyna
2019-02-11 19:02:48 UTC
I was able to bisect it a bit:
it did not occur on 4.18.0-0.rc8.1.el8+7.x86_64 (RHEL-8.0-20180807.n.0 snapshot)
it did occur on 4.18.0-32.el8.x86_64 (RHEL-8.0-20181107.0 snapshot)

I bisected the problem between 4.18.0-7.el8.x86_64 and 4.18.0-8.el8.x86_64.

Booting with scsi_mod.use_blk_mq=0 is a workaround.

In Fedora:
grep SCSI_MQ_DEFAULT /boot/config-4.19.13-300.fc29.x86_64
# CONFIG_SCSI_MQ_DEFAULT is not set
which explains why the problem does not occur there.

The problem also occurs on RHEL 7 when booting with scsi_mod.use_blk_mq=y (technology preview). Kernel version 3.10.0-993.el7

It turns out that reducing the queue length (the max_queue parameter) reduces the memory usage significantly. (The default is 192, I am changing it to 8 in our test now.) So I am reducing priority on this. We will try to investigate whether it affects other scsi drivers as well, like iSCSI.

OK, that's good. Yes, now that the legacy request path has been removed, it is
no longer possible to disable the use of SCSI-MQ. We removed it from RHEL8
because the code is no longer going to be supported upstream.
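A minimal sketch of the Fedora config check quoted earlier, wrapped in a shell function. The name `scsi_mq_default` is hypothetical; it only greps a kernel config file for the `CONFIG_SCSI_MQ_DEFAULT` line and says nothing about what the running kernel was booted with.

```shell
#!/bin/sh
# Hypothetical helper: report the build-time SCSI-MQ default recorded in a
# kernel config file (for example /boot/config-$(uname -r)).
scsi_mq_default() {
    cfg="$1"
    if grep -q '^CONFIG_SCSI_MQ_DEFAULT=y' "$cfg" 2>/dev/null; then
        echo enabled
    elif grep -q '^# CONFIG_SCSI_MQ_DEFAULT is not set' "$cfg" 2>/dev/null; then
        echo disabled
    else
        # File missing, unreadable, or the option is not mentioned at all.
        echo unknown
    fi
}
```

Called as `scsi_mq_default /boot/config-4.19.13-300.fc29.x86_64`, this would print `disabled` for the Fedora config shown in the description. The queue-length workaround is a module parameter, so it would be given at load time, e.g. `modprobe scsi_debug max_queue=8`.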
--
I would like to understand the blk-MQ structure your openstack environment generates.
For any one of your scsi_debug SCSI devices, could you please provide the output of:
find /sys/kernel/debug/block/sd<X> (where <X> is e.g. "sdb", "sdc" or whatever
one of your scsi_debug devices is)
Thanks.
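For a quick summary of that tree rather than the full listing, something like the following could be used. This is only a sketch: `blkmq_summary` is a hypothetical name, and it assumes debugfs is mounted and that hardware queues appear as `hctxN` directories with per-CPU software contexts as `cpuM` subdirectories.

```shell
#!/bin/sh
# Hypothetical helper: summarise the blk-mq debugfs tree of one block device.
# $1 is the device's debugfs directory, e.g. /sys/kernel/debug/block/sdb
blkmq_summary() {
    dir="$1"
    # Hardware queue contexts sit directly under the device directory.
    hctx=$(find "$dir" -maxdepth 1 -type d -name 'hctx*' | wc -l)
    # Software (per-CPU) contexts sit one level below each hctx directory.
    ctx=$(find "$dir" -mindepth 2 -maxdepth 2 -type d -path '*/hctx*/cpu*' | wc -l)
    # $(( )) strips any padding that wc -l may add on some systems.
    echo "hctx=$((hctx)) ctx=$((ctx))"
}
```

Running `blkmq_summary /sys/kernel/debug/block/sdb` as root prints one line with both counts, which is usually enough to tell how many request pools a device has.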
# modprobe scsi_debug
# find /sys/kernel/debug/block/sda
/sys/kernel/debug/block/sda
/sys/kernel/debug/block/sda/hctx0
/sys/kernel/debug/block/sda/hctx0/cpu0
/sys/kernel/debug/block/sda/hctx0/cpu0/completed
/sys/kernel/debug/block/sda/hctx0/cpu0/merged
/sys/kernel/debug/block/sda/hctx0/cpu0/dispatched
/sys/kernel/debug/block/sda/hctx0/cpu0/poll_rq_list
/sys/kernel/debug/block/sda/hctx0/cpu0/read_rq_list
/sys/kernel/debug/block/sda/hctx0/cpu0/default_rq_list
/sys/kernel/debug/block/sda/hctx0/type
/sys/kernel/debug/block/sda/hctx0/dispatch_busy
/sys/kernel/debug/block/sda/hctx0/active
/sys/kernel/debug/block/sda/hctx0/run
/sys/kernel/debug/block/sda/hctx0/queued
/sys/kernel/debug/block/sda/hctx0/dispatched
/sys/kernel/debug/block/sda/hctx0/io_poll
/sys/kernel/debug/block/sda/hctx0/sched_tags_bitmap
/sys/kernel/debug/block/sda/hctx0/sched_tags
/sys/kernel/debug/block/sda/hctx0/tags_bitmap
/sys/kernel/debug/block/sda/hctx0/tags
/sys/kernel/debug/block/sda/hctx0/ctx_map
/sys/kernel/debug/block/sda/hctx0/busy
/sys/kernel/debug/block/sda/hctx0/dispatch
/sys/kernel/debug/block/sda/hctx0/flags
/sys/kernel/debug/block/sda/hctx0/state
/sys/kernel/debug/block/sda/sched
/sys/kernel/debug/block/sda/sched/dispatch
/sys/kernel/debug/block/sda/sched/starved
/sys/kernel/debug/block/sda/sched/batching
/sys/kernel/debug/block/sda/sched/write_next_rq
/sys/kernel/debug/block/sda/sched/write_fifo_list
/sys/kernel/debug/block/sda/sched/read_next_rq
/sys/kernel/debug/block/sda/sched/read_fifo_list
/sys/kernel/debug/block/sda/zone_wlock
/sys/kernel/debug/block/sda/write_hints
/sys/kernel/debug/block/sda/state
/sys/kernel/debug/block/sda/pm_only
/sys/kernel/debug/block/sda/requeue_list
/sys/kernel/debug/block/sda/poll_stat

(In reply to Pavel Cahyna from comment #9)
> It turns out that reducing the queue length (the max_queue parameter)
> reduces the memory usage significantly. (The default is 192, I am changing
> it to 8 in our test now.) So I am reducing priority on this. We will try to
> investigate whether it affects other scsi drivers as well, like iSCSI.

blk-mq allocates the request pool statically to save the runtime allocation cost. Worse, two request pools are allocated (one for 'none', another for the scheduler). Another workaround is to use 'none'; then you can save lots of memory too.

Only 1 CPU and 1 hctx though.

1200 sdevs, 257 scmds per sdev, the slab allocation is on the order of 124M
Something is allocating ~10x that amount.

(In reply to Ewan D. Milne from comment #13)
> Only 1 CPU and 1 hctx though.
>
> 1200 sdevs, 257 scmds per sdev, the slab allocation is on the order of 124M
> Something is allocating ~10x that amount.

1200 * 257 * 4k (one request with its SCSI payload) is about 1204M, and I guess the actual allocation may be 1204*2M.

The blk-mq code is potentially allocating a very large amount of memory for the request structures; it varies depending upon which driver is in use. The megaraid driver appears to also use a lot. I also had a report that the lpfc driver allocated a large amount for certain cards (e.g. 64Gb) but I was not able to reproduce this here. Dependent on the number of CPUs, perhaps? This is exacerbated when a vport is created and a whole new set of structures are allocated for it.

Ming Lei has patches staged for upstream 5.3 that remove the large SG table from
the preallocated request structures. Applying those to RHEL8, and loading the
scsi_debug driver with the same options, I see:
[root@rhel-storage-44 ~]# free -m
total used free shared buff/cache available
Mem: 64011 584 62130 34 1297 62790
Swap: 11447 0 11447
[root@rhel-storage-44 ~]# modprobe scsi_debug max_luns=2 num_tgts=200 add_host=3 vpd_use_hostno=0
[root@rhel-storage-44 ~]# free -m
total used free shared buff/cache available
Mem: 64011 1064 61477 38 1469 62262
Swap: 11447 0 11447
[root@rhel-storage-44 ~]# bc
bc 1.07.1
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006, 2008, 2012-2017 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
62790-62262
528
This compares with this usage on the same system, without the changes:
[root@rhel-storage-44 ~]# free -m
total used free shared buff/cache available
Mem: 64011 539 63011 10 459 62905
Swap: 11447 0 11447
[root@rhel-storage-44 ~]# modprobe scsi_debug max_luns=2 num_tgts=200 add_host=3 vpd_use_hostno=0
[root@rhel-storage-44 ~]# free -m
total used free shared buff/cache available
Mem: 64011 2296 60862 22 852 61046
Swap: 11447 0 11447
[root@rhel-storage-44 ~]# bc
bc 1.07.1
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006, 2008, 2012-2017 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
62905-61046
1859
It is not clear yet what the performance impact is of requiring I/Os larger
than can fit in 2 SG entries to allocate an SG table, instead of I/Os with up
to the SG chunk size. However I intend to put these changes into 8.1 to mitigate
the memory usage by various drivers.
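The back-of-the-envelope estimate discussed in this thread (devices × requests per device × bytes per request, possibly doubled for a second pool) can be sketched in shell arithmetic. The function name is hypothetical, and the 257 requests per device and roughly 4 KiB per request are the figures quoted in the comments, not measured values.

```shell
#!/bin/sh
# Rough estimate of memory preallocated for blk-mq requests, in whole MiB.
# Arguments: hosts targets luns requests_per_device bytes_per_request pools
estimate_request_pool_mib() {
    devices=$(( $1 * $2 * $3 ))
    bytes=$(( devices * $4 * $5 * $6 ))
    echo $(( bytes / 1048576 ))
}

# add_host=3 num_tgts=200 max_luns=2 gives 1200 devices.
estimate_request_pool_mib 3 200 2 257 4096 1   # one request pool
estimate_request_pool_mib 3 200 2 257 4096 2   # tag pool plus scheduler pool
```

With these inputs the two calls print 1204 and 2409, in line with the "about 1204M" and "1204*2M" guesses and in the same range as the 1859 MB measured without the patches.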
(In reply to Ewan D. Milne from comment #16)
> It is not clear yet what the performance impact is of requiring I/Os larger
> than can fit in 2 SG entries to allocate an SG table, instead of I/Os with up
> to the SG chunk size. However I intend to put these changes into 8.1 to
> mitigate the memory usage by various drivers.
Using runtime allocation for the scatterlist shouldn't be an issue:
1) this approach has been used in the non-mq IO path for dozens of years
2) slab (slub) is supposed to work well enough for sub-page size allocation (GFP_ATOMIC)

I will put the patches in to 8.1 when Martin makes the 5.3/scsi-queue tree visible and we can see in our testing. I think there will be a benefit on large configurations because of the reduced memory consumption, which can then be used elsewhere.

Changes posted to 8.1, see bug 1698297. Closing as duplicate as the fix is the same.

*** This bug has been marked as a duplicate of bug 1698297 ***

Here's how it looks now in the OpenStack instance (image 1MT-RHEL-8.2.0-20191219.0):
# free
total used free shared buff/cache available
Mem: 1870820 143080 1496964 10192 230776 1569828
Swap: 0 0 0
# modprobe scsi_debug max_luns=2 num_tgts=200 add_host=3 vpd_use_hostno=0
after udev settles:
# free
total used free shared buff/cache available
Mem: 1870820 658440 844528 31376 367852 993696
Swap: 0 0 0
so, there is some non-negligible memory used by lots of scsi_debug devices, but it is a big improvement from the previous state and the 2GB VM does not crash anymore.
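The before/after comparison used throughout this bug (subtracting the "available" column of two `free` runs) can be sketched as below. The helper names are hypothetical, and the sketch assumes the procps `free` layout shown in the transcripts, where "available" is the seventh field of the `Mem:` row.

```shell
#!/bin/sh
# Print the "available" column (7th field of the Mem: row) from free(1)
# output read on stdin.
mem_available() {
    awk '/^Mem:/ { print $7 }'
}

# Difference between two "available" readings, in the unit of the input.
available_delta() {
    echo $(( $1 - $2 ))
}

# Readings from the OpenStack instance above (plain free reports KiB).
before=$(printf 'Mem: 1870820 143080 1496964 10192 230776 1569828\n' | mem_available)
after=$(printf 'Mem: 1870820 658440 844528 31376 367852 993696\n' | mem_available)
echo "$(( $(available_delta "$before" "$after") / 1024 )) MiB"
```

For the readings above this prints `562 MiB`. Fed the `free -m` values from the earlier transcripts, `available_delta` gives the same 528 and 1859 figures that were computed with `bc`.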