Description of problem:

- Admin Portal shows very high memory usage for a host with no VMs running
- Cannot start/migrate VMs to that host

Before running/migrating a VM, in addition to the scheduling memory logic, the engine checks whether the host has enough actual free memory for qemu to be able to start [1]. This free memory comes from vds.getMemFree() [2], which maps to mem_free in vds_statistics. mem_free in turn comes from VDSM Host.getStats, which parses /proc/meminfo on the host [3] and adds MemFree + Cached + Buffers:

~~~
def _memFree():
    """
    Return the actual free mem on host.
    """
    meminfo = utils.readMemInfo()
    return (meminfo['MemFree'] +
            meminfo['Cached'] +
            meminfo['Buffers']) * Kbytes
~~~

Now consider a host like this one, which has a large amount of memory sitting in slab, nearly all of it reclaimable:

~~~
$ cat proc/meminfo
MemTotal:       131791980 kB
MemFree:         14674300 kB
MemAvailable:    99026600 kB
Buffers:            59496 kB
Cached:           1844900 kB
SwapCached:            48 kB
Active:          30116712 kB
Inactive:         1463068 kB
Active(anon):    29200612 kB
Inactive(anon):    971784 kB
Active(file):      916100 kB
Inactive(file):    491284 kB
Unevictable:        44208 kB
Mlocked:            46296 kB
SwapTotal:        4194300 kB
SwapFree:         4193780 kB
Dirty:              84636 kB
Writeback:              0 kB
AnonPages:       29721308 kB
Mapped:            192120 kB
Shmem:             485660 kB
Slab:            84235552 kB
SReclaimable:    83571480 kB   <----- this is "free" memory
SUnreclaim:        664072 kB
KernelStack:        12848 kB
PageTables:        119536 kB
NFS_Unstable:           0 kB
Bounce:                 0 kB
WritebackTmp:           0 kB
CommitLimit:     70090288 kB
Committed_AS:    56864012 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       676288 kB
VmallocChunk:   34290711564 kB
HardwareCorrupted:      0 kB
AnonHugePages:    3100672 kB
CmaTotal:               0 kB
CmaFree:                0 kB
HugePages_Total:        0
HugePages_Free:         0
HugePages_Rsvd:         0
HugePages_Surp:         0
Hugepagesize:        2048 kB
DirectMap4k:       342636 kB
DirectMap2M:     12169216 kB
DirectMap1G:    123731968 kB
~~~

VDSM calculates free memory as: 14674300 + 1844900 + 59496 = 15.81G

However, there are 80G+ sitting in slab, almost all of it reclaimable:

~~~
Slab:            84235552 kB
SReclaimable:    83571480 kB
~~~

So the actual available memory is much higher than 15.81G; it is close to 100G (see MemAvailable above).

The VM fails to migrate with messages such as the below:

~~~
Cannot migrate VM. There is no host that satisfies current scheduling constraints. See below for details:, The host host.example.com did not satisfy internal filter Memory because its available memory is too low (xxx MB) to run the VM.] (VM: vm.example.com, Source: host.example.com).
~~~

So, shouldn't vdsm include things like 'SReclaimable' in memFree? (A sketch of such a change follows the references below.)

Most of the slab is dentry:

~~~
# name             active       num  objsize  num*objsize
dentry          438074280 438074280      192  84110261760
kmalloc-64        8638916   8693248       64    556367872
ext4_inode_cache    40083     40083     1032     41365656
buffer_head        241097    351117      104     36516168
inode_cache         43038     43038      592     25478496
kmalloc-2048         4314      7072     2048     14483456
avc_node           142404    142968       72     10293696
kmalloc-512          8641     16704      512      8552448
radix_tree_node     13595     14448      584      8437632
kernfs_node_cache   59058     59092      120      7091040
kmalloc-1024         5592      6048     1024      6193152
~~~

After analysis from our sbr-kernel team, this dentry growth is considered normal (https://access.redhat.com/solutions/55818) and can happen when there are intensive operations on files.

Version-Release number of selected component (if applicable):
vdsm-4.30.13-4.el7ev.x86_64

How reproducible:
- Somehow get a lot of reclaimable memory in slab, for example by doing lots of operations with files.
Actual results:
- VM fails to run/migrate
- UI reports high memory usage, which may not reflect reality

Expected results:
- VM runs/migrates

Additional info:
[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/scheduling/policyunits/MemoryPolicyUnit.java#L71
[2] https://github.com/oVirt/ovirt-engine/blob/127dc1537c31e644894a2ad3cd20c8711b7713b8/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/scheduling/SlaValidator.java#L46
[3] https://github.com/oVirt/vdsm/blob/5b4f66826fe7692ef334feeb567cd3d9ae42fb59/lib/vdsm/host/api.py#L129
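For illustration, a minimal sketch of what including reclaimable slab in the calculation could look like, assuming the same utils.readMemInfo() helper and Kbytes constant used by the existing _memFree() above; the actual patch that landed in vdsm's host/api.py may differ in detail:

~~~
def _memFree():
    """
    Return the actual free mem on host, counting reclaimable slab
    (e.g. dentry/inode caches) as free, in addition to MemFree,
    page cache and buffers.
    """
    meminfo = utils.readMemInfo()
    return (meminfo['MemFree'] +
            meminfo['Cached'] +
            meminfo['Buffers'] +
            meminfo['SReclaimable']) * Kbytes
~~~

With the host above this would give 14674300 + 1844900 + 59496 + 83571480 = 100150176 kB (~95.5G), in line with the kernel's own MemAvailable estimate of 99026600 kB (~94.4G).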
Looks like this getStats is also used by ovirt-hosted-engine-ha to determine free memory.
Verified on http://bob-dr.lab.eng.brq.redhat.com/builds/4.4/rhv-4.4.0-23.

meminfo['SReclaimable'] is now included in the calculation in /usr/lib/python3.6/site-packages/vdsm/host/api.py on the host.

On the host:
  Physical Memory: 31948 MB total, 5751 MB used, 26197 MB free

On the engine:
  Max free Memory for scheduling new VMs: 31562 MB

After running a 12288 MB VM:
  Max free Memory for scheduling new VMs: 19106 MB

cat /proc/meminfo:
  SReclaimable:   404164 kB
  Hugetlb:       4194304 kB

Start VM with 19106 (free) - 6240 (hugepages) + 404 MB (SReclaimable) = 13270 MB - ok
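To cross-check figures like these on a host, a small standalone snippet (a hypothetical helper for illustration, not vdsm code) that parses /proc/meminfo and prints the free-memory value both with and without SReclaimable:

~~~
#!/usr/bin/env python3
# Standalone sanity check (not part of vdsm): compare the old and
# new vdsm free-memory formulas against the kernel's MemAvailable.

def read_meminfo(path='/proc/meminfo'):
    """Parse /proc/meminfo into a dict of integer values in kB."""
    info = {}
    with open(path) as f:
        for line in f:
            key, value = line.split(':', 1)
            info[key] = int(value.split()[0])  # drop the 'kB' suffix
    return info

if __name__ == '__main__':
    m = read_meminfo()
    old = m['MemFree'] + m['Cached'] + m['Buffers']
    new = old + m['SReclaimable']
    print('free, old formula:      %10d kB' % old)
    print('free, with SReclaimable:%10d kB' % new)
    print('kernel MemAvailable:    %10d kB' % m['MemAvailable'])
~~~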
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV RHEL Host (ovirt-host) 4.4), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:3246