+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1749630 +++
Description of problem:
- Admin Portal shows very high memory usage for a host with no VMs running
- Cannot start/migrate VMs to that host
Before running/migrating a VM, in addition to the scheduling memory logic, the engine checks whether the host has enough actual free memory for qemu to start. This free memory comes from vds.getMemFree(), which maps to mem_free in vds_statistics.
And mem_free comes from VDSM's Host.getStats, which parses /proc/meminfo on the host and adds MemFree + Cached + Buffers:
    # Return the actual free mem on host.
    meminfo = utils.readMemInfo()
    return (meminfo['MemFree'] +
            meminfo['Cached'] + meminfo['Buffers']) * Kbytes
Now consider a host like this, which has a ton of memory available under slab (and reclaimable):
$ cat /proc/meminfo
MemTotal: 131791980 kB
MemFree: 14674300 kB
MemAvailable: 99026600 kB
Buffers: 59496 kB
Cached: 1844900 kB
SwapCached: 48 kB
Active: 30116712 kB
Inactive: 1463068 kB
Active(anon): 29200612 kB
Inactive(anon): 971784 kB
Active(file): 916100 kB
Inactive(file): 491284 kB
Unevictable: 44208 kB
Mlocked: 46296 kB
SwapTotal: 4194300 kB
SwapFree: 4193780 kB
Dirty: 84636 kB
Writeback: 0 kB
AnonPages: 29721308 kB
Mapped: 192120 kB
Shmem: 485660 kB
Slab: 84235552 kB
SReclaimable: 83571480 kB <----- this is "free" memory
SUnreclaim: 664072 kB
KernelStack: 12848 kB
PageTables: 119536 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 70090288 kB
Committed_AS: 56864012 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 676288 kB
VmallocChunk: 34290711564 kB
HardwareCorrupted: 0 kB
AnonHugePages: 3100672 kB
CmaTotal: 0 kB
CmaFree: 0 kB
Hugepagesize: 2048 kB
DirectMap4k: 342636 kB
DirectMap2M: 12169216 kB
DirectMap1G: 123731968 kB
VDSM calculates free memory as: 14674300 + 1844900 + 59496 = 16578696 kB, i.e. ~15.81G
However, there are 80G+ available from slab:
Slab: 84235552 kB
SReclaimable: 83571480 kB
And the actual available memory is much higher than 15.81G; it is close to 100G, as MemAvailable above also shows.
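The gap can be reproduced with a quick back-of-the-envelope check (values in kB copied from the meminfo dump above; a sketch for illustration, not vdsm code):

```python
# Values in kB, copied from the /proc/meminfo dump above.
mem_free = 14674300
cached = 1844900
buffers = 59496
sreclaimable = 83571480

vdsm_free = mem_free + cached + buffers   # what VDSM currently reports
with_slab = vdsm_free + sreclaimable      # counting reclaimable slab too

print(vdsm_free)   # 16578696 kB, ~15.81 GiB
print(with_slab)   # 100150176 kB, ~95.5 GiB
```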
The VM fails to migrate with messages such as the following:
Cannot migrate VM. There is no host that satisfies current scheduling constraints. See below for details:, The host host.example.com did not satisfy internal filter Memory because its available memory is too low (xxx MB) to run the VM.] (VM: vm.example.com, Source: host.example.com).
So, shouldn't vdsm include things like 'SReclaimable' in memFree?
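A minimal sketch of what such a change could look like. Here read_mem_info is a stand-in for vdsm's utils.readMemInfo, which I'm assuming returns integer kB values; this is not the actual patch:

```python
KBYTES = 1024

def read_mem_info(path='/proc/meminfo'):
    """Parse /proc/meminfo into a dict of integer kB values."""
    info = {}
    with open(path) as f:
        for line in f:
            key, _, rest = line.partition(':')
            fields = rest.split()
            if fields:
                info[key] = int(fields[0])  # first field is the kB value
    return info

def mem_free(meminfo):
    """Free memory in bytes, counting reclaimable slab as free."""
    return (meminfo['MemFree'] + meminfo['Cached'] +
            meminfo['Buffers'] + meminfo['SReclaimable']) * KBYTES
```

With the meminfo values from this host, mem_free would report roughly 100G instead of 15.81G.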
Most of the slab is dentry:
# name active num objsize num*objsize
dentry 438074280 438074280 192 84110261760
kmalloc-64 8638916 8693248 64 556367872
ext4_inode_cache 40083 40083 1032 41365656
buffer_head 241097 351117 104 36516168
inode_cache 43038 43038 592 25478496
kmalloc-2048 4314 7072 2048 14483456
avc_node 142404 142968 72 10293696
kmalloc-512 8641 16704 512 8552448
radix_tree_node 13595 14448 584 8437632
kernfs_node_cache 59058 59092 120 7091040
kmalloc-1024 5592 6048 1024 6193152
According to analysis from our sbr-kernel team, this dentry growth is considered normal (https://access.redhat.com/solutions/55818) and can happen under intensive file operations.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
- Somehow get a lot of reclaimable memory on slab, perhaps by doing lots of operations with files.
Actual results:
- VM fails to run/migrate
- UI reports high memory usage, which may not be accurate
Expected results:
- VM runs/migrates
(Originally by Germano Veit Michel)
Looks like this getStats is also used by ovirt-hosted-engine-ha to determine free memory.
(Originally by Germano Veit Michel)
Verified on vdsm-4.30.34-1.el7ev.x86_64.
meminfo['SReclaimable'] is included in /usr/lib/python2.7/site-packages/vdsm/host/api.py.
Host memory usage in the UI looks correct. The migration tests in automation passed successfully on production setups.
Liran, could you please confirm if this is good for verification?
(In reply to Polina from comment #5)
> Verified on vdsm-4.30.34-1.el7ev.x86_64.
> meminfo['SReclaimable'] is included in
> /usr/lib/python2.7/site-packages/vdsm/host/api.py.
> Host memory usage in the UI looks correct. The migration tests in
> automation passed successfully on production setups.
> Liran, could you please confirm if this is good for verification?
Please verify that the memory report in the UI matches meminfo on the host, including SReclaimable.
Try to run a VM consuming this value.
For example, if the host had 2GB free before including SReclaimable and now has 3GB (SReclaimable being 1GB), try to run a VM of around 3GB on this host, or migrate a 1GB VM to it.
This way we can be sure we can use this value as expected and that the new report is correct.
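The sizing logic behind that example, with the hypothetical 2GB/1GB numbers from above (all values in MB):

```python
# Hypothetical numbers from the sizing example above (MB).
free_before = 2048        # free memory reported before the fix
sreclaimable = 1024       # reclaimable slab on the host
free_after = free_before + sreclaimable

assert free_after == 3072  # a ~3 GB VM should now be schedulable here
assert free_after >= 1024  # and a 1 GB VM can still migrate in
```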
Verification on vdsm-4.30.37-1.el7ev.x86_64 and ovirt-engine-18.104.22.168-0.1.el7.noarch.
meminfo['SReclaimable'] is included in the calculation in /usr/lib/python2.7/site-packages/vdsm/host/api.py.
Physical Memory: 127393 MB total, 8918 MB used, 118475 MB free
Max free Memory for scheduling new VMs: 127007 MB
MemTotal: 130450940 kB
MemFree: 113095328 kB
MemAvailable: 115297108 kB
Buffers: 24688 kB
Cached: 2913108 kB
SReclaimable: 323888 kB
Free + Cached + Buffers = 116033124 kB
Start VM with 118475 MB - ok
Start VM with 118475 MB (free) + 323 MB (SReclaimable) = 118798 MB - ok
Then successfully start the three VMs with memory size: 1024 MB, 4096 MB, 2048 MB
Please confirm if it looks ok.
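The sums in this comment check out (a quick arithmetic sanity check, numbers copied from the verification figures above):

```python
# kB values from the verification host's meminfo above.
free_cached_buffers = 113095328 + 2913108 + 24688
assert free_cached_buffers == 116033124  # matches the sum reported above

# MB figures from the Admin Portal numbers above (SReclaimable 323888 kB,
# rounded down to 323 MB in the comment).
assert 118475 + 323 == 118798
```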
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.