Description of problem:

- Admin Portal shows very high memory usage for a host with no VMs running
- Cannot start/migrate VMs to that host

Before running/migrating a VM, in addition to the scheduling memory logic, the engine checks whether the host has enough actual free memory for qemu to be able to start [1]. This free memory comes from vds.getMemFree() [2], which maps to mem_free in vds_statistics. mem_free in turn comes from VDSM Host.getStats, which parses /proc/meminfo on the host [3] and adds MemFree + Cached + Buffers:

~~~
def _memFree():
    """
    Return the actual free mem on host.
    """
    meminfo = utils.readMemInfo()
    return (meminfo['MemFree'] +
            meminfo['Cached'] +
            meminfo['Buffers']) * Kbytes
~~~

Now consider a host like this one, which has a large amount of memory sitting in slab, nearly all of it reclaimable:

~~~
$ cat proc/meminfo
MemTotal:       131791980 kB
MemFree:         14674300 kB
MemAvailable:    99026600 kB
Buffers:            59496 kB
Cached:           1844900 kB
SwapCached:            48 kB
Active:          30116712 kB
Inactive:         1463068 kB
Active(anon):    29200612 kB
Inactive(anon):    971784 kB
Active(file):      916100 kB
Inactive(file):    491284 kB
Unevictable:        44208 kB
Mlocked:            46296 kB
SwapTotal:        4194300 kB
SwapFree:         4193780 kB
Dirty:              84636 kB
Writeback:              0 kB
AnonPages:       29721308 kB
Mapped:            192120 kB
Shmem:             485660 kB
Slab:            84235552 kB
SReclaimable:    83571480 kB   <----- this is "free" memory
SUnreclaim:        664072 kB
KernelStack:        12848 kB
PageTables:        119536 kB
NFS_Unstable:           0 kB
Bounce:                 0 kB
WritebackTmp:           0 kB
CommitLimit:     70090288 kB
Committed_AS:    56864012 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       676288 kB
VmallocChunk:   34290711564 kB
HardwareCorrupted:      0 kB
AnonHugePages:    3100672 kB
CmaTotal:               0 kB
CmaFree:                0 kB
HugePages_Total:        0
HugePages_Free:         0
HugePages_Rsvd:         0
HugePages_Surp:         0
Hugepagesize:        2048 kB
DirectMap4k:       342636 kB
DirectMap2M:     12169216 kB
DirectMap1G:    123731968 kB
~~~

VDSM calculates free memory as: 14674300 + 1844900 + 59496 = 15.81G

However, there are 80G+ sitting in slab, almost all of it reclaimable:

~~~
Slab:            84235552 kB
SReclaimable:    83571480 kB
~~~

So the actual available memory is much higher than 15.81G; it is close to 100G (see MemAvailable above).

The VM fails to migrate with messages such as the below:

~~~
Cannot migrate VM. There is no host that satisfies current scheduling constraints. See below for details:, The host host.example.com did not satisfy internal filter Memory because its available memory is too low (xxx MB) to run the VM.] (VM: vm.example.com, Source: host.example.com).
~~~

So, shouldn't vdsm include things like 'SReclaimable' in memFree? (A sketch of such a change follows the references below.)

Most of the slab is dentry:

~~~
# name             active       num  objsize  num*objsize
dentry          438074280 438074280      192  84110261760
kmalloc-64        8638916   8693248       64    556367872
ext4_inode_cache    40083     40083     1032     41365656
buffer_head        241097    351117      104     36516168
inode_cache         43038     43038      592     25478496
kmalloc-2048         4314      7072     2048     14483456
avc_node           142404    142968       72     10293696
kmalloc-512          8641     16704      512      8552448
radix_tree_node     13595     14448      584      8437632
kernfs_node_cache   59058     59092      120      7091040
kmalloc-1024         5592      6048     1024      6193152
~~~

After analysis from our sbr-kernel team, this dentry growth is considered normal (https://access.redhat.com/solutions/55818) and can happen when there are intensive operations on files.

Version-Release number of selected component (if applicable):
vdsm-4.30.13-4.el7ev.x86_64

How reproducible:
- Somehow get a lot of reclaimable memory in slab, for example by doing lots of operations with files.
Actual results:
- VM fails to run/migrate
- UI reports high memory usage, which may not reflect reality

Expected results:
- VM runs/migrates

Additional info:
[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/scheduling/policyunits/MemoryPolicyUnit.java#L71
[2] https://github.com/oVirt/ovirt-engine/blob/127dc1537c31e644894a2ad3cd20c8711b7713b8/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/scheduling/SlaValidator.java#L46
[3] https://github.com/oVirt/vdsm/blob/5b4f66826fe7692ef334feeb567cd3d9ae42fb59/lib/vdsm/host/api.py#L129
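For illustration, a minimal sketch of what including reclaimable slab in the calculation could look like, assuming the same utils.readMemInfo() helper and Kbytes constant used by the existing _memFree() above; the actual patch that landed in vdsm's host/api.py may differ in detail:

~~~
def _memFree():
    """
    Return the actual free mem on host, counting reclaimable slab
    (e.g. dentry/inode caches) as free, in addition to MemFree,
    page cache and buffers.
    """
    meminfo = utils.readMemInfo()
    return (meminfo['MemFree'] +
            meminfo['Cached'] +
            meminfo['Buffers'] +
            meminfo['SReclaimable']) * Kbytes
~~~

With the host above this would give 14674300 + 1844900 + 59496 + 83571480 = 100150176 kB (~95.5G), in line with the kernel's own MemAvailable estimate of 99026600 kB (~94.4G).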
Looks like this getStats is also used by ovirt-hosted-engine-ha to determine free memory.
Verified on http://bob-dr.lab.eng.brq.redhat.com/builds/4.4/rhv-4.4.0-23.

meminfo['SReclaimable'] is now included in the calculation in /usr/lib/python3.6/site-packages/vdsm/host/api.py on the host.

On the host:
  Physical Memory: 31948 MB total, 5751 MB used, 26197 MB free

On the engine:
  Max free Memory for scheduling new VMs: 31562 MB

After running a 12288 MB VM:
  Max free Memory for scheduling new VMs: 19106 MB

cat /proc/meminfo:
  SReclaimable:   404164 kB
  Hugetlb:       4194304 kB

Start VM with 19106 (free) - 6240 (hugepages) + 404 MB (SReclaimable) = 13270 MB - ok
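To cross-check figures like these on a host, a small standalone snippet (a hypothetical helper for illustration, not vdsm code) that parses /proc/meminfo and prints the free-memory value both with and without SReclaimable:

~~~
#!/usr/bin/env python3
# Standalone sanity check (not part of vdsm): compare the old and
# new vdsm free-memory formulas against the kernel's MemAvailable.

def read_meminfo(path='/proc/meminfo'):
    """Parse /proc/meminfo into a dict of integer values in kB."""
    info = {}
    with open(path) as f:
        for line in f:
            key, value = line.split(':', 1)
            info[key] = int(value.split()[0])  # drop the 'kB' suffix
    return info

if __name__ == '__main__':
    m = read_meminfo()
    old = m['MemFree'] + m['Cached'] + m['Buffers']
    new = old + m['SReclaimable']
    print('free, old formula:      %10d kB' % old)
    print('free, with SReclaimable:%10d kB' % new)
    print('kernel MemAvailable:    %10d kB' % m['MemAvailable'])
~~~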
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV RHEL Host (ovirt-host) 4.4), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:3246