+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1749630 +++
Description of problem:
- Admin Portal shows very high memory usage for a host with no VMs running
- Cannot start/migrate VMs to that host
Before running/migrating a VM, in addition to the scheduling memory logic, the engine checks whether the host has enough actual free memory for qemu to start. This free memory comes from vds.getMemFree(), which maps to mem_free in vds_statistics.
And mem_free comes from VDSM's Host.getStats, which parses /proc/meminfo on the host and adds MemFree + Cached + Buffers:
    # Return the actual free mem on host.
    meminfo = utils.readMemInfo()
    return (meminfo['MemFree'] +
            meminfo['Cached'] + meminfo['Buffers']) * Kbytes
Now consider a host like this, which has a ton of memory available under slab (and reclaimable):
$ cat /proc/meminfo
MemTotal: 131791980 kB
MemFree: 14674300 kB
MemAvailable: 99026600 kB
Buffers: 59496 kB
Cached: 1844900 kB
SwapCached: 48 kB
Active: 30116712 kB
Inactive: 1463068 kB
Active(anon): 29200612 kB
Inactive(anon): 971784 kB
Active(file): 916100 kB
Inactive(file): 491284 kB
Unevictable: 44208 kB
Mlocked: 46296 kB
SwapTotal: 4194300 kB
SwapFree: 4193780 kB
Dirty: 84636 kB
Writeback: 0 kB
AnonPages: 29721308 kB
Mapped: 192120 kB
Shmem: 485660 kB
Slab: 84235552 kB
SReclaimable: 83571480 kB <----- this is "free" memory
SUnreclaim: 664072 kB
KernelStack: 12848 kB
PageTables: 119536 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 70090288 kB
Committed_AS: 56864012 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 676288 kB
VmallocChunk: 34290711564 kB
HardwareCorrupted: 0 kB
AnonHugePages: 3100672 kB
CmaTotal: 0 kB
CmaFree: 0 kB
Hugepagesize: 2048 kB
DirectMap4k: 342636 kB
DirectMap2M: 12169216 kB
DirectMap1G: 123731968 kB
VDSM calculates free memory as: 14674300 + 1844900 + 59496 = 16578696 kB, i.e. ~15.81G
However, there are 80G+ available from slab:
Slab: 84235552 kB
SReclaimable: 83571480 kB
And the actual available memory is much higher than 15.81G; it is close to 100G, as MemAvailable above also shows.
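The gap can be reproduced with a quick back-of-the-envelope check (values in kB copied from the meminfo dump above; a sketch for illustration, not vdsm code):

```python
# Values in kB, copied from the /proc/meminfo dump above.
mem_free = 14674300
cached = 1844900
buffers = 59496
sreclaimable = 83571480

vdsm_free = mem_free + cached + buffers   # what VDSM currently reports
with_slab = vdsm_free + sreclaimable      # counting reclaimable slab too

print(vdsm_free)   # 16578696 kB, ~15.81 GiB
print(with_slab)   # 100150176 kB, ~95.5 GiB
```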
The VM fails to migrate with messages such as the following:
Cannot migrate VM. There is no host that satisfies current scheduling constraints. See below for details:, The host host.example.com did not satisfy internal filter Memory because its available memory is too low (xxx MB) to run the VM.] (VM: vm.example.com, Source: host.example.com).
So, shouldn't vdsm include things like 'SReclaimable' in memFree?
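A minimal sketch of what such a change could look like. Here read_mem_info is a stand-in for vdsm's utils.readMemInfo, which I'm assuming returns integer kB values; this is not the actual patch:

```python
KBYTES = 1024

def read_mem_info(path='/proc/meminfo'):
    """Parse /proc/meminfo into a dict of integer kB values."""
    info = {}
    with open(path) as f:
        for line in f:
            key, _, rest = line.partition(':')
            fields = rest.split()
            if fields:
                info[key] = int(fields[0])  # first field is the kB value
    return info

def mem_free(meminfo):
    """Free memory in bytes, counting reclaimable slab as free."""
    return (meminfo['MemFree'] + meminfo['Cached'] +
            meminfo['Buffers'] + meminfo['SReclaimable']) * KBYTES
```

With the meminfo values from this host, mem_free would report roughly 100G instead of 15.81G.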
Most of the slab is dentry:
# name active num objsize num*objsize
dentry 438074280 438074280 192 84110261760
kmalloc-64 8638916 8693248 64 556367872
ext4_inode_cache 40083 40083 1032 41365656
buffer_head 241097 351117 104 36516168
inode_cache 43038 43038 592 25478496
kmalloc-2048 4314 7072 2048 14483456
avc_node 142404 142968 72 10293696
kmalloc-512 8641 16704 512 8552448
radix_tree_node 13595 14448 584 8437632
kernfs_node_cache 59058 59092 120 7091040
kmalloc-1024 5592 6048 1024 6193152
According to analysis from our sbr-kernel team, this dentry growth is considered normal (https://access.redhat.com/solutions/55818) and can happen under intensive file operations.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
- Somehow get a lot of reclaimable memory on slab, perhaps by doing lots of operations with files.
Actual results:
- VM fails to run/migrate
- UI reports high memory usage, which may not be accurate
Expected results:
- VM runs/migrates
(Originally by Germano Veit Michel)
Looks like this getStats is also used by ovirt-hosted-engine-ha to determine free memory.
(Originally by Germano Veit Michel)
Verified on vdsm-4.30.34-1.el7ev.x86_64.
meminfo['SReclaimable'] is included in /usr/lib/python2.7/site-packages/vdsm/host/api.py.
Host memory usage in the UI looks correct. The migration tests in automation passed successfully on production setups.
Liran, could you please confirm if this is good for verification?
(In reply to Polina from comment #5)
> Verified on vdsm-4.30.34-1.el7ev.x86_64.
> meminfo['SReclaimable'] is included in
> /usr/lib/python2.7/site-packages/vdsm/host/api.py.
> Host memory usage in the UI looks correct. The migration tests in
> automation passed successfully on production setups.
> Liran, could you please confirm if this is good for verification?
Please verify that the memory report in the UI matches meminfo on the host, including SReclaimable.
Try to run a VM consuming this value.
For example, if the host had 2GB free before including SReclaimable and now has 3GB (SReclaimable being 1GB), try to run a VM of around 3GB on this host, or migrate a 1GB VM to it.
This way we can be sure we can use this value as expected and that the new report is correct.
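The sizing logic behind that example, with the hypothetical 2GB/1GB numbers from above (all values in MB):

```python
# Hypothetical numbers from the sizing example above (MB).
free_before = 2048        # free memory reported before the fix
sreclaimable = 1024       # reclaimable slab on the host
free_after = free_before + sreclaimable

assert free_after == 3072  # a ~3 GB VM should now be schedulable here
assert free_after >= 1024  # and a 1 GB VM can still migrate in
```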
Verification on vdsm-4.30.37-1.el7ev.x86_64 and ovirt-engine-18.104.22.168-0.1.el7.noarch.
meminfo['SReclaimable'] is included in the calculation in /usr/lib/python2.7/site-packages/vdsm/host/api.py.
Physical Memory: 127393 MB total, 8918 MB used, 118475 MB free
Max free Memory for scheduling new VMs: 127007 MB
MemTotal: 130450940 kB
MemFree: 113095328 kB
MemAvailable: 115297108 kB
Buffers: 24688 kB
Cached: 2913108 kB
SReclaimable: 323888 kB
Free + Cached + Buffers = 116033124 kB
Start VM with 118475 MB - ok
Start VM with 118475 MB (free) + 323 MB (SReclaimable) = 118798 MB - ok
Then successfully start the three VMs with memory size: 1024 MB, 4096 MB, 2048 MB
Please confirm if it looks ok.
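The sums in this comment check out (a quick arithmetic sanity check, numbers copied from the verification figures above):

```python
# kB values from the verification host's meminfo above.
free_cached_buffers = 113095328 + 2913108 + 24688
assert free_cached_buffers == 116033124  # matches the sum reported above

# MB figures from the Admin Portal numbers above (SReclaimable 323888 kB,
# rounded down to 323 MB in the comment).
assert 118475 + 323 == 118798
```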
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.