Bug 1675071

Summary:	scsi_debug consumes lots of memory
Product:	Red Hat Enterprise Linux 8	Reporter:	Pavel Cahyna <pcahyna>
Component:	kernel	Assignee:	Ewan D. Milne <emilne>
kernel sub component:	Storage Drivers	QA Contact:	guazhang <guazhang>
Status:	CLOSED DUPLICATE	Docs Contact:
Severity:	unspecified
Priority:	low	CC:	brueckner, djez, emilne, guazhang, hannsj_uhl, minlei
Version:	8.0	Keywords:	Regression
Target Milestone:	rc
Target Release:	8.0
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2019-05-30 12:31:58 UTC	Type:	Bug

Description Pavel Cahyna 2019-02-11 19:02:48 UTC

Description of problem:

I am trying to use scsi_debug to write a test case that needs a large number (about 1000) of SCSI devices (bz1672938). The 2 GB RAM VMs that the QA OpenStack instance offers by default do not have enough memory for this. The command
modprobe scsi_debug max_luns=2 num_tgts=200 add_host=3 vpd_use_hostno=0
crashes them reliably.
Here is a session on a 4 GB machine showing that used memory increases when scsi_debug hosts are added and decreases when they are removed (i.e. the memory is not permanently leaked).

# systemctl stop systemd-udevd
Warning: Stopping systemd-udevd.service, but it can still be activated by:
  systemd-udevd-kernel.socket
  systemd-udevd-control.socket
# systemctl mask systemd-udevd-kernel.socket systemd-udevd-control.socket
Created symlink /etc/systemd/system/systemd-udevd-kernel.socket → /dev/null.
Created symlink /etc/systemd/system/systemd-udevd-control.socket → /dev/null.


# free
              total        used        free      shared  buff/cache   available
Mem:        3871732      158156     3644648         340       68928     3558964
Swap:       4169724       87384     4082340

# modprobe scsi_debug max_luns=2 num_tgts=200 add_host=3 vpd_use_hostno=0
# free
              total        used        free      shared  buff/cache   available
Mem:        3871732     1836440     1917844         340      117448     1856420
Swap:       4169724       87128     4082596
# cd /sys/bus/pseudo/drivers/scsi_debug
# echo 3 > add_host
# free
              total        used        free      shared  buff/cache   available
Mem:        3871732     3509608      204656         340      157468      163236
Swap:       4169724       87128     4082596
# slabtop -s c -o
 Active / Total Objects (% used)    : 3188739 / 3236912 (98.5%)
 Active / Total Slabs (% used)      : 73395 / 73395 (100.0%)
 Active / Total Caches (% used)     : 97 / 126 (77.0%)
 Active / Total Size (% used)       : 470431.99K / 481630.39K (97.7%)
 Minimum / Average / Maximum Object : 0.01K / 0.15K / 8.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
142631 134125  94%    0.62K   5706       25     91296K inode_cache            
618016 618016 100%    0.12K  19313       32     77252K scsi_sense_cache       
562230 562230 100%    0.13K  18741       30     74964K kernfs_node_cache      
  4849   4849 100%    8.00K   1213        4     38816K biovec-max             
 31652  31621  99%    1.00K   1979       16     31664K kmalloc-1024           
162183 147590  91%    0.19K   7723       21     30892K dentry                 
 12900  12900 100%    2.00K    815       16     26080K kmalloc-2048           
  2432   2432 100%    8.00K    608        4     19456K kmalloc-8192           
 10455   9299  88%    0.72K    714       22     11424K shmem_inode_cache      
293248 277485  94%    0.03K   2291      128      9164K kmalloc-32             
378420 378420 100%    0.02K   2226      170      8904K avtab_node             
 12894  12813  99%    0.57K    921       14      7368K radix_tree_node        
 12776  12776 100%    0.50K    800       16      6400K kmalloc-512            
143310 143310 100%    0.04K   1405      102      5620K Acpi-Namespace         
 29085  29065  99%    0.19K   1385       21      5540K kmalloc-192            
  2415   2415 100%    2.12K    161       15      5152K request_queue          
438272 438272 100%    0.01K    856      512      3424K kmalloc-8              
 35700  35506  99%    0.09K    850       42      3400K kmalloc-96             
 46720  46103  98%    0.06K    730       64      2920K kmalloc-64             
   657    633  96%    4.00K     83        8      2656K kmalloc-4096           
166144 166144 100%    0.02K    649      256      2596K kmalloc-16             
  2500   1759  70%    0.94K    157       17      2512K xfs_inode              
  9299   9059  97%    0.23K    547       17      2188K vm_area_struct         
   301    286  95%    5.56K     61        5      1952K task_struct            
  2991   2935  98%    0.38K    188       21      1504K kmem_cache             
 11232  11011  98%    0.12K    351       32      1404K kmalloc-128            
  1857   1709  92%    0.69K     87       23      1392K proc_inode_cache       
  5120   4611  90%    0.25K    320       16      1280K filp                   
 20288  20288 100%    0.06K    317       64      1268K ebitmap_node           
  4928   4928 100%    0.25K    308       16      1232K pool_workqueue         
  3984   3984 100%    0.25K    249       16       996K kmalloc-256            
 15872  14596  91%    0.06K    248       64       992K pid                    
   405    400  98%    2.06K     27       15       864K sighand_cache          
  8004   7868  98%    0.09K    174       46       696K anon_vma               
   577    577 100%    1.12K     42       14       672K UNIX                   
   851    851 100%    0.69K     37       23       592K sock_inode_cache       
  2457   2397  97%    0.19K    117       21       468K cred_jar               
   435    435 100%    1.06K     29       15       464K signal_cache           
   322    322 100%    1.12K     23       14       368K mm_struct              
   934    774  82%    0.38K     45       21       360K mnt_cache              
  1701    756  44%    0.19K     81       21       324K dmaengine-unmap-16     
   437    437 100%    0.69K     19       23       304K files_cache            
  1072   1008  94%    0.25K     67       16       268K skbuff_head_cache      
   184    184 100%    1.38K      8       23       256K UDPv6                  
  1298   1298 100%    0.18K     59       22       236K xfs_log_ticket         
   918    918 100%    0.23K     54       17       216K xfs_trans              
  2968   2968 100%    0.07K     53       56       212K Acpi-Operand           
  2499   2499 100%    0.08K     49       51       196K Acpi-State             
    48     48 100%    4.00K      6        8       192K names_cache            
    84     84 100%    2.25K      6       14       192K TCP                    
    78     78 100%    2.38K      6       13       192K TCPv6                  
  3008   3008 100%    0.06K     47       64       188K kmem_cache_node        
  1408   1408 100%    0.12K     44       32       176K seq_file               
  3485   3485 100%    0.05K     41       85       164K ftrace_event_field     
  5120   2697  52%    0.03K     40      128       160K avc_xperms_data        
   130    130 100%    1.19K     10       13       160K UDP                    
   360    288  80%    0.43K     20       18       160K xfs_efd_item           
  1794   1794 100%    0.09K     39       46       156K trace_event_file       
  2555   2555 100%    0.05K     35       73       140K Acpi-Parse             
   544    544 100%    0.25K     34       16       136K proc_dir_entry         
   348    329  94%    0.31K     29       12       116K ip6_dst_cache          
   126     76  60%    0.88K      7       18       112K bdev_cache             
   621    621 100%    0.17K     27       23       108K xfs_rud_item           
     6      6 100%    6.19K      2        5        64K net_namespace          
    32     32 100%    2.00K      2       16        64K biovec-128             
  3584   3584 100%    0.02K     14      256        56K selinux_file_security  
   728    728 100%    0.07K     13       56        52K eventpoll_pwq          
   234    234 100%    0.21K     13       18        52K xfs_bui_item           
    63     63 100%    0.75K      3       21        48K task_group             
    20     20 100%    1.12K      3       14        48K RAW                    
   180    150  83%    0.27K     12       15        48K xfs_buf_item           
  1020   1020 100%    0.04K     10      102        40K pde_opener             
    15     15 100%    2.06K      1       15        32K dmaengine-unmap-256    
    32     32 100%    1.00K      2       16        32K biovec-64              
    38     20  52%    0.81K      2       19        32K dax_cache              
     8      8 100%    4.00K      1        8        32K sgpool-128             
    24     24 100%    1.31K      2       12        32K RAWv6                  
    80     80 100%    0.20K      4       20        16K file_lock_cache        
    32     32 100%    0.50K      2       16        16K skbuff_fclone_cache    
    15     15 100%    1.06K      1       15        16K dmaengine-unmap-128    
   156    156 100%    0.10K      4       39        16K blkdev_ioc             
    24     24 100%    0.64K      2       12        16K hugetlbfs_inode_cache  
    24     24 100%    0.62K      2       12        16K dio                    
    17     17 100%    0.94K      1       17        16K mqueue_inode_cache     
    34     34 100%    0.47K      2       17        16K xfs_da_state           
    23     23 100%    0.69K      1       23        16K rpc_inode_cache        
    78     78 100%    0.10K      2       39         8K buffer_head            
    50     50 100%    0.16K      2       25         8K sigqueue               
    24     24 100%    0.32K      2       12         8K taskstats              
    64     64 100%    0.12K      2       32         8K secpath_cache          
    34     34 100%    0.23K      2       17         8K posix_timers_cache     
    26     26 100%    0.30K      2       13         8K request_sock_TCPv6     
    34     34 100%    0.23K      2       17         8K tw_sock_TCPv6          
    36     36 100%    0.22K      2       18         8K xfs_btree_cur          
    15     15 100%    0.26K      1       15         4K numa_policy            
    13     13 100%    0.30K      1       13         4K request_sock_TCP       
    17     17 100%    0.23K      1       17         4K tw_sock_TCP            
     0      0   0%    0.09K      0       42         0K dma-kmalloc-96         
     0      0   0%    0.19K      0       21         0K dma-kmalloc-192        
     0      0   0%    0.01K      0      512         0K dma-kmalloc-8          
     0      0   0%    0.02K      0      256         0K dma-kmalloc-16         
     0      0   0%    0.03K      0      128         0K dma-kmalloc-32         
     0      0   0%    0.06K      0       64         0K dma-kmalloc-64         
     0      0   0%    0.12K      0       32         0K dma-kmalloc-128        
     0      0   0%    0.25K      0       16         0K dma-kmalloc-256        
     0      0   0%    0.50K      0       16         0K dma-kmalloc-512        
     0      0   0%    1.00K      0       16         0K dma-kmalloc-1024       
     0      0   0%    2.00K      0       16         0K dma-kmalloc-2048       
     0      0   0%    4.00K      0        8         0K dma-kmalloc-4096       
     0      0   0%    8.00K      0        4         0K dma-kmalloc-8192       
     0      0   0%    0.43K      0       18         0K uts_namespace          
     0      0   0%    0.12K      0       34         0K iint_cache             
     0      0   0%    0.25K      0       16         0K dquot                  
     0      0   0%    0.81K      0       19         0K xfrm_state             
     0      0   0%    0.44K      0       18         0K xfrm_dst_cache         
     0      0   0%    0.20K      0       19         0K ip4-frags              
     0      0   0%    0.19K      0       21         0K userfaultfd_ctx_cache  
     0      0   0%    0.45K      0       17         0K bfq_queue              
     0      0   0%    1.31K      0       12         0K PINGv6                 
     0      0   0%    0.12K      0       34         0K dm_rq_target_io        
     0      0   0%    0.29K      0       13         0K dm_old_clone_request   
     0      0   0%    2.57K      0       12         0K dm_uevent              
     0      0   0%    3.23K      0        9         0K kcopyd_job             
     0      0   0%    0.68K      0       23         0K xfs_rui_item           
     0      0   0%    0.49K      0       16         0K xfs_dquot              
     0      0   0%    0.52K      0       15         0K xfs_dqtrx              

# echo -3 > add_host
# free
              total        used        free      shared  buff/cache   available
Mem:        3871732     1838808     1914284         340      118640     1853456
Swap:       4169724       87128     4082596
# cat /proc/meminfo 
MemTotal:        3871732 kB
MemFree:         1879920 kB
MemAvailable:    1835248 kB
Buffers:            1044 kB
Cached:            64176 kB
SwapCached:         3676 kB
Active:            36000 kB
Inactive:          64444 kB
Active(anon):      18368 kB
Inactive(anon):    17216 kB
Active(file):      17632 kB
Inactive(file):    47228 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       4169724 kB
SwapFree:        4082596 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:         33456 kB
Mapped:            19108 kB
Shmem:               344 kB
Slab:             295076 kB
SReclaimable:      85740 kB
SUnreclaim:       209336 kB
KernelStack:        2176 kB
PageTables:         4744 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     6105588 kB
Committed_AS:     400428 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:     12288 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      679928 kB
DirectMap2M:     3514368 kB
# rmmod scsi_debug
# free
              total        used        free      shared  buff/cache   available
Mem:        3871732      160456     3608300         344      102976     3539640
Swap:       4169724       87128     4082596
# cat /proc/meminfo 
MemTotal:        3871732 kB
MemFree:         3608508 kB
MemAvailable:    3539848 kB
Buffers:            1044 kB
Cached:            64212 kB
SwapCached:         3676 kB
Active:            36036 kB
Inactive:          64476 kB
Active(anon):      18384 kB
Inactive(anon):    17216 kB
Active(file):      17652 kB
Inactive(file):    47260 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       4169724 kB
SwapFree:        4082596 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:         33488 kB
Mapped:            19056 kB
Shmem:               344 kB
Slab:             113588 kB
SReclaimable:      37720 kB
SUnreclaim:        75868 kB
KernelStack:        2080 kB
PageTables:         4736 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     6105588 kB
Committed_AS:     399568 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:     12288 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      679928 kB
DirectMap2M:     3514368 kB
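
As a rough cross-check of the session above, the following sketch parses two of the `free` outputs and divides the growth in used memory by the number of devices created. The helper name `used_kib` is made up for illustration, and the device count assumes that each scsi_debug host gets num_tgts targets with max_luns LUNs each.

```python
def used_kib(free_output: str) -> int:
    """Return the 'used' column of the 'Mem:' row from `free` output (in KiB)."""
    for line in free_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "Mem:":
            # Columns after the label: total, used, free, shared, buff/cache, available
            return int(fields[2])
    raise ValueError("no 'Mem:' row found")


# Outputs copied from the session: before loading scsi_debug and after the first modprobe.
BEFORE = """\
              total        used        free      shared  buff/cache   available
Mem:        3871732      158156     3644648         340       68928     3558964
Swap:       4169724       87384     4082340
"""
AFTER = """\
              total        used        free      shared  buff/cache   available
Mem:        3871732     1836440     1917844         340      117448     1856420
Swap:       4169724       87128     4082596
"""

# Assumption: devices = add_host * num_tgts * max_luns (values from the modprobe line).
DEVICES = 3 * 200 * 2

delta_kib = used_kib(AFTER) - used_kib(BEFORE)
per_device_kib = delta_kib / DEVICES
print(f"used grew by {delta_kib} KiB, about {per_device_kib:.0f} KiB per device")
```

This only looks at the "used" column, so it ignores changes in buff/cache; it is meant as an order-of-magnitude figure, not an exact accounting.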


Version-Release number of selected component (if applicable):

4.18.0-64.el8.x86_64

How reproducible:

Reliably reproducible on RHEL 8. Not reproducible on RHEL 7 or on Fedora 29 (kernel 4.19.13-300.fc29.x86_64).

Steps to Reproduce:
1. modprobe scsi_debug max_luns=2 num_tgts=200 add_host=3
   Scale add_host up appropriately if the machine has more than 2 GB of RAM.
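
To make the scaling concrete, here is a small sketch of how the module parameters translate into a device count. It assumes every host gets num_tgts targets and every target gets max_luns LUNs; the function names are hypothetical and not part of scsi_debug.

```python
import math


def device_count(add_host: int, num_tgts: int, max_luns: int) -> int:
    """Number of SCSI devices created, assuming hosts * targets * LUNs."""
    return add_host * num_tgts * max_luns


def hosts_needed(target_devices: int, num_tgts: int, max_luns: int) -> int:
    """Smallest add_host value that yields at least target_devices devices."""
    return math.ceil(target_devices / (num_tgts * max_luns))


# Parameters from the reproducer: 3 hosts, 200 targets per host, 2 LUNs per target.
print(device_count(add_host=3, num_tgts=200, max_luns=2))
# Hosts required for the roughly 1000 devices the test case needs.
print(hosts_needed(target_devices=1000, num_tgts=200, max_luns=2))
```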

Actual results:

Free memory goes down. If too many devices are added, the OOM killer is triggered and the machine eventually reboots.

Expected results:

It should be possible to add this number of SCSI devices without problems. The virtual disks share the same RAM storage (8 MB by default), so they should not consume much memory.

Additional info:

I suspected systemd-udevd, but the problem occurs even when it is not running. multipathd was not running either. Adding swap does not help; it does not get used.

The problem occurs even when using the no_uld=1 module parameter, which avoids creating the sd* devices (creates only sg* devices).

Comment 2 Pavel Cahyna 2019-02-11 19:59:23 UTC

I was able to bisect it a bit:
it did not occur on 4.18.0-0.rc8.1.el8+7.x86_64 (RHEL-8.0-20180807.n.0 snapshot)
it did occur on 4.18.0-32.el8.x86_64 (RHEL-8.0-20181107.0 snapshot)

Comment 3 Pavel Cahyna 2019-02-11 21:28:50 UTC

I bisected the problem to the changes between 4.18.0-7.el8.x86_64 and 4.18.0-8.el8.x86_64.

Booting with scsi_mod.use_blk_mq=0 is a workaround.

Comment 4 Pavel Cahyna 2019-02-11 21:35:40 UTC

In Fedora:
grep SCSI_MQ_DEFAULT /boot/config-4.19.13-300.fc29.x86_64 
# CONFIG_SCSI_MQ_DEFAULT is not set

which explains why the problem does not occur there.
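
The same check can be scripted. The sketch below parses kernel config text and reports whether an option is set; the function name `config_value` is hypothetical, and the sample strings mirror the grep output above plus an assumed `=y` line for a kernel that enables the option.

```python
from typing import Optional


def config_value(config_text: str, option: str) -> Optional[str]:
    """Return the option's value ('y', 'm', ...) or None if it is not set."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {option} is not set":
            return None
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    # Option not mentioned at all: treat it as not set.
    return None


fedora_sample = "# CONFIG_SCSI_MQ_DEFAULT is not set\n"
enabled_sample = "CONFIG_SCSI_MQ_DEFAULT=y\n"  # assumed form on a kernel with the default on

print(config_value(fedora_sample, "CONFIG_SCSI_MQ_DEFAULT"))
print(config_value(enabled_sample, "CONFIG_SCSI_MQ_DEFAULT"))
```

On a real system the text would come from the file under /boot that matches the running kernel, as in the grep command above.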

Comment 5 Pavel Cahyna 2019-02-12 10:37:02 UTC

The problem also occurs on RHEL 7 when booting with scsi_mod.use_blk_mq=y (technology preview). Kernel version: 3.10.0-993.el7.

Comment 9 Pavel Cahyna 2019-02-12 17:37:04 UTC

It turns out that reducing the queue length (the max_queue parameter) reduces the memory usage significantly. (The default is 192; I am changing it to 8 in our test now.) So I am reducing priority on this. We will try to investigate whether it affects other SCSI drivers as well, like iSCSI.

Comment 10 Ewan D. Milne 2019-02-12 19:17:33 UTC

OK, that's good.  Yes, now that the legacy request path has been removed, it is
no longer possible to disable the use of SCSI-MQ.  We removed it from RHEL8
because the code is no longer going to be supported upstream.

--

I would like to understand the blk-MQ structure your OpenStack environment generates.

For any one of your scsi_debug SCSI devices, could you please provide the output of:

find /sys/kernel/debug/block/sd<X>   (where <X> is e.g. "sdb", "sdc" or whatever
                                      one of your scsi_debug devices is)

Thanks.

Comment 11 Pavel Cahyna 2019-02-12 19:22:09 UTC

# modprobe scsi_debug
# find /sys/kernel/debug/block/sda
/sys/kernel/debug/block/sda
/sys/kernel/debug/block/sda/hctx0
/sys/kernel/debug/block/sda/hctx0/cpu0
/sys/kernel/debug/block/sda/hctx0/cpu0/completed
/sys/kernel/debug/block/sda/hctx0/cpu0/merged
/sys/kernel/debug/block/sda/hctx0/cpu0/dispatched
/sys/kernel/debug/block/sda/hctx0/cpu0/poll_rq_list
/sys/kernel/debug/block/sda/hctx0/cpu0/read_rq_list
/sys/kernel/debug/block/sda/hctx0/cpu0/default_rq_list
/sys/kernel/debug/block/sda/hctx0/type
/sys/kernel/debug/block/sda/hctx0/dispatch_busy
/sys/kernel/debug/block/sda/hctx0/active
/sys/kernel/debug/block/sda/hctx0/run
/sys/kernel/debug/block/sda/hctx0/queued
/sys/kernel/debug/block/sda/hctx0/dispatched
/sys/kernel/debug/block/sda/hctx0/io_poll
/sys/kernel/debug/block/sda/hctx0/sched_tags_bitmap
/sys/kernel/debug/block/sda/hctx0/sched_tags
/sys/kernel/debug/block/sda/hctx0/tags_bitmap
/sys/kernel/debug/block/sda/hctx0/tags
/sys/kernel/debug/block/sda/hctx0/ctx_map
/sys/kernel/debug/block/sda/hctx0/busy
/sys/kernel/debug/block/sda/hctx0/dispatch
/sys/kernel/debug/block/sda/hctx0/flags
/sys/kernel/debug/block/sda/hctx0/state
/sys/kernel/debug/block/sda/sched
/sys/kernel/debug/block/sda/sched/dispatch
/sys/kernel/debug/block/sda/sched/starved
/sys/kernel/debug/block/sda/sched/batching
/sys/kernel/debug/block/sda/sched/write_next_rq
/sys/kernel/debug/block/sda/sched/write_fifo_list
/sys/kernel/debug/block/sda/sched/read_next_rq
/sys/kernel/debug/block/sda/sched/read_fifo_list
/sys/kernel/debug/block/sda/zone_wlock
/sys/kernel/debug/block/sda/write_hints
/sys/kernel/debug/block/sda/state
/sys/kernel/debug/block/sda/pm_only
/sys/kernel/debug/block/sda/requeue_list
/sys/kernel/debug/block/sda/poll_stat
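
For many devices, the hardware-queue layout can be summarized instead of listing every file. This is a minimal sketch that walks one debugfs directory and maps each hctxN directory to the cpuM entries inside it; the function name `blk_mq_layout` is hypothetical, and it assumes the directory naming seen in the listing above.

```python
import os
import re


def blk_mq_layout(debug_dir: str) -> dict:
    """Map each hctxN directory under debug_dir to the cpuM entries it contains."""
    layout = {}
    for entry in sorted(os.listdir(debug_dir)):
        if re.fullmatch(r"hctx\d+", entry):
            hctx_path = os.path.join(debug_dir, entry)
            layout[entry] = sorted(
                name
                for name in os.listdir(hctx_path)
                if re.fullmatch(r"cpu\d+", name)
            )
    return layout
```

Calling `blk_mq_layout("/sys/kernel/debug/block/sda")` as root with debugfs mounted should, for the listing above, yield a single hctx0 with a single cpu0.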

Comment 12 Ming Lei 2019-02-13 02:44:05 UTC

(In reply to Pavel Cahyna from comment #9)
> It turns out that reducing the queue length (the max_queue parameter)
> reduces the memory usage significantly. (The default is 192, I am changing
> it to 8 in our test now.) So I am reducing priority on this. We will try to
> investigate whether it affects other scsi drivers as well, like iSCSI.


blk-mq allocates the request pool statically to save the runtime allocation cost.

Worse, two request pools are allocated (one for 'none', another for the scheduler).

Another workaround is to use the 'none' scheduler, which also saves a lot of memory.
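
A minimal sketch of that workaround, assuming the usual sysfs layout where each disk exposes queue/scheduler and the active scheduler is shown in square brackets. The helper names are hypothetical, and the `sysfs_root` parameter exists only so the code can be tried against a scratch directory.

```python
import os


def current_scheduler(text: str) -> str:
    """Extract the active scheduler from text such as '[mq-deadline] kyber bfq none'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError("no active scheduler marked with brackets")


def set_scheduler(dev: str, scheduler: str = "none", sysfs_root: str = "/sys/block") -> None:
    """Write the scheduler name to <sysfs_root>/<dev>/queue/scheduler (needs root on real sysfs)."""
    path = os.path.join(sysfs_root, dev, "queue", "scheduler")
    with open(path, "w") as f:
        f.write(scheduler)
```

For a test with many scsi_debug disks, `set_scheduler` would be called once per device name. How much memory this frees was not measured in this bug.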

Comment 13 Ewan D. Milne 2019-02-13 17:29:33 UTC

Only 1 CPU and 1 hctx though.

1200 sdevs, 257 scmds per sdev, the slab allocation is on the order of 124M
Something is allocating ~10x that amount.

Comment 14 Ming Lei 2019-02-14 10:47:18 UTC

(In reply to Ewan D. Milne from comment #13)
> Only 1 CPU and 1 hctx though.
> 
> 1200 sdevs, 257 scmds per sdev, the slab allocation is on the order of 124M
> Something is allocating ~10x that amount.

1200 * 257 * 4k (one request with its SCSI payload) is about 1204M, and I guess the
actual allocation may be 2 * 1204M.
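
The estimate can be written out as a short calculation. This is only a sketch of the arithmetic in this comment: the 4 KiB per request and the factor of two are the guesses stated above, and `request_pool_mib` is a hypothetical helper name.

```python
def request_pool_mib(devices: int, requests_per_device: int,
                     bytes_per_request: int, pools: int = 1) -> float:
    """Estimated size in MiB of statically preallocated requests."""
    total_bytes = devices * requests_per_device * bytes_per_request * pools
    return total_bytes / (1024 * 1024)


# 1200 sdevs, 257 requests per sdev, roughly 4 KiB per request including its SCSI payload.
one_pool = request_pool_mib(1200, 257, 4096)
# Guess from this comment: the allocation happens twice.
two_pools = request_pool_mib(1200, 257, 4096, pools=2)

print(f"one pool: {one_pool:.1f} MiB, two pools: {two_pools:.1f} MiB")
```

The same function can be used to see how the estimate shrinks with fewer requests per device, although the exact relation between the max_queue parameter and the per-device request count is not established in this bug.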

Comment 15 Ewan D. Milne 2019-04-05 13:11:32 UTC

The blk-mq code is potentially allocating a very large amount of memory for the
request structures; the amount varies depending on which driver is in use.  The
megaraid driver also appears to use a lot.

I also had a report that the lpfc driver allocated a large amount for certain cards
(e.g. 64Gb) but I was not able to reproduce this here.  Dependent on the number of
CPUs, perhaps?  This is exacerbated when a vport is created and a whole new set of
structures are allocated for it.

Comment 16 Ewan D. Milne 2019-05-16 22:04:28 UTC

Ming Lei has patches staged for upstream 5.3 that remove the large SG table from
the preallocated request structures.  Applying those to RHEL8, and loading the
scsi_debug driver with the same options, I see:

[root@rhel-storage-44 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          64011         584       62130          34        1297       62790
Swap:         11447           0       11447
[root@rhel-storage-44 ~]# modprobe scsi_debug max_luns=2 num_tgts=200 add_host=3 vpd_use_hostno=0
[root@rhel-storage-44 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          64011        1064       61477          38        1469       62262
Swap:         11447           0       11447
[root@rhel-storage-44 ~]# bc
bc 1.07.1
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006, 2008, 2012-2017 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'. 
62790-62262
528

This compares with this usage on the same system, without the changes:

[root@rhel-storage-44 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          64011         539       63011          10         459       62905
Swap:         11447           0       11447
[root@rhel-storage-44 ~]# modprobe scsi_debug max_luns=2 num_tgts=200 add_host=3 vpd_use_hostno=0
[root@rhel-storage-44 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          64011        2296       60862          22         852       61046
Swap:         11447           0       11447
[root@rhel-storage-44 ~]# bc
bc 1.07.1
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006, 2008, 2012-2017 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'. 
62905-61046
1859

It is not clear yet what the performance impact is of requiring I/Os larger
than can fit in 2 SG entries to allocate an SG table, instead of I/Os with up
to the SG chunk size.  However I intend to put these changes into 8.1 to mitigate
the memory usage by various drivers.
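
For reference, the two measurements above can be compared directly. The sketch below only redoes the subtraction on the "available" column (values in MiB from `free -m`) and derives the relative saving; the variable names are illustrative.

```python
def available_drop(before_mib: int, after_mib: int) -> int:
    """Drop in the 'available' column of `free -m` caused by loading the module."""
    return before_mib - after_mib


patched = available_drop(62790, 62262)    # with the patches applied
unpatched = available_drop(62905, 61046)  # same system without the patches

saved = unpatched - patched
reduction = saved / unpatched

print(f"patched: {patched} MiB, unpatched: {unpatched} MiB")
print(f"saved: {saved} MiB ({reduction:.1%} less)")
```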

Comment 17 Ming Lei 2019-05-20 00:57:59 UTC

(In reply to Ewan D. Milne from comment #16)
> It is not clear yet what the performance impact is of requiring I/Os larger
> than can fit in 2 SG entries to allocate an SG table, instead of I/Os with up
> to the SG chunk size.  However I intend to put these changes into 8.1 to
> mitigate the memory usage by various drivers.


Using runtime allocation for the scatterlist should not be an issue:

1) this approach has been used in the non-mq I/O path for many years

2) slab (SLUB) is supposed to work well enough for sub-page-size allocations (GFP_ATOMIC)

Comment 18 Ewan D. Milne 2019-05-20 13:05:06 UTC

I will put the patches into 8.1 when Martin makes the 5.3/scsi-queue tree visible
and we can see the effect in our testing.  I think there will be a benefit on large
configurations because the memory saved can then be used elsewhere.

Comment 19 Ewan D. Milne 2019-05-22 13:10:32 UTC

Changes posted to 8.1, see bug 1698297.

Comment 20 Ewan D. Milne 2019-05-30 12:31:58 UTC

Closing as duplicate as the fix is the same.

*** This bug has been marked as a duplicate of bug 1698297 ***

Comment 21 Pavel Cahyna 2020-01-03 14:29:47 UTC

Here's how it looks now in the OpenStack instance (image 1MT-RHEL-8.2.0-20191219.0):
# free
              total        used        free      shared  buff/cache   available
Mem:        1870820      143080     1496964       10192      230776     1569828
Swap:             0           0           0

# modprobe scsi_debug max_luns=2 num_tgts=200 add_host=3 vpd_use_hostno=0

after udev settles:

# free
              total        used        free      shared  buff/cache   available
Mem:        1870820      658440      844528       31376      367852      993696
Swap:             0           0           0

So there is still some non-negligible memory used by the many scsi_debug devices, but it is a big improvement over the previous state, and the 2 GB VM no longer crashes.