+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1700725 +++

======================================================================

Description of problem:
The engine runs out of memory. The majority of the memory is consumed by instances of java.lang.String (almost 7 GB). These objects hold the list of applications running on the VMs. Each object is approximately 7 KB in size, and 100644 of them are allocated.

Version-Release number of selected component (if applicable):
rhvm 4.2.8

How reproducible:
Not known yet

Actual results:
Engine runs out of memory

Expected results:
Engine does not run out of memory

(Originally by Roman Hodain)
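A note on the heap profile: the dump shows one large application-list String kept per report, and many of those lists are likely identical across VMs. One common mitigation for that pattern is to canonicalize equal strings so every VM reporting the same list shares a single instance. The sketch below only illustrates the idea; the class and method names are hypothetical, and the actual fix in the referenced patch may take a different approach.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical illustration: keep one shared copy of each distinct
 * application-list string instead of one copy per VM report.
 */
public final class AppListCanonicalizer {

    // Canonical copies keyed by their own value. If old lists must become
    // collectable, a weak/soft-reference cache would be used instead.
    private static final Map<String, String> CANONICAL = new ConcurrentHashMap<>();

    /** Returns a shared instance equal to the reported application list. */
    public static String canonicalize(String reportedAppList) {
        if (reportedAppList == null) {
            return null;
        }
        // The first reporter's copy becomes the canonical one.
        String existing = CANONICAL.putIfAbsent(reportedAppList, reportedAppList);
        return existing != null ? existing : reportedAppList;
    }
}

If most of the ~100k reported copies are equal, the retained size drops from the order of 700 MB (100644 x ~7 KB) to a handful of distinct copies plus map overhead.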
didn't make it in time, need to wait for 4.3.5
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Open patch attached] For more info please contact: rhv-devops
Created attachment 1602803 [details] general heap view
Created attachment 1602804 [details] idle_vms_heap_with_forced_gc
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Tag 'ovirt-engine-4.3.6.2' doesn't contain patch 'https://gerrit.ovirt.org/99577'] gitweb: https://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=shortlog;h=refs/tags/ovirt-engine-4.3.6.2 For more info please contact: rhv-devops
INFO: Bug status (ON_QA) wasn't changed but the following should be fixed: [Tag 'ovirt-engine-4.3.6.3' doesn't contain patch 'https://gerrit.ovirt.org/99577'] gitweb: https://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=shortlog;h=refs/tags/ovirt-engine-4.3.6.3 For more info please contact: rhv-devops
Is the fix in the build delivered to QE already?
It is
sync2jira
INFO: Bug status (ON_QA) wasn't changed but the following should be fixed: [Tag 'ovirt-engine-4.3.6.4' doesn't contain patch 'https://gerrit.ovirt.org/99577'] gitweb: https://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=shortlog;h=refs/tags/ovirt-engine-4.3.6.4 For more info please contact: rhv-devops
Update on this BZ: we just noticed that the engine service failed with:

2019-09-03 13:29:02,579+0000 ovirt-engine: ERROR run:554 Error: process terminated with status code -9

With repeating ERRORs in the engine log:

2019-09-03 13:18:35,366Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-28) [] EVENT_ID: VM_MEMORY_UNDER_GUARANTEED_VALUE(148), VM HostedEngine on host f01-h02-000-r620.rdu2.scalelab.redhat.com was guaranteed 21845 MB but currently has 16384 MB

We think this crash is caused by the memory growth resulting from 578 running VMs with the 'hacked' agent (reporting lots of apps). Adding log collector logs for the engine and the SPM host of the relevant cluster in a private message below.
In general, we expect memory consumption to increase as more data is received. The more dynamic data (that is queried often) we get, the more memory the application allocates and the more data is cached by postgres.

The logs show that:
1. The engine runs in a VM (hosted-engine)
2. There was memory pressure on the node
3. The hosted-engine VM was set with (almost) 22G of guaranteed memory but was provided with only 16G (due to ballooning)
4. It is the OOM killer that killed the ovirt-engine service at 13:29

This raises a few questions:
1. What led to the memory pressure on the node - what is the cluster memory overcommitment, how many VMs ran on the node and what are their memory requirements, is MOM configured with the default policy? How much memory does the node have?
2. When not using the hacked agent, what is the memory consumption of the hosted engine VM? How much did it change when using the hacked agent?
3. Can you provide statistics on the memory consumption of the ovirt-engine process, to see which part of the application consumed more memory? Same for postgres?

My concern is that this failure is not really related to the applications list but may happen whenever more data is received by the engine (for instance by increasing the number of VMs from 600 to 1000).

I believe that without the hacked agent you can also reproduce it:
1. Take a node with 16G of memory
2. Run 32 VMs that consume 512M of memory with overcommitment=150% (so they consume the entire 16G, but from the scheduling perspective we still allow scheduling the hosted engine)
3. Run the hosted engine

I think that since the hosted engine consumes significantly more memory, the OOM killer would try to kill it to free memory for the other processes (VMs).
> I believe that without the hacked agent you can also reproduce it:
> 1. Take a node with 16G of memory
> 2. Run 32 VMs that consume 512M of memory with overcommitment=150% (so they
> consume the entire 16G, but from the scheduling perspective we still allow
> scheduling the hosted engine)
> 3. Run the hosted engine
> I think that since the hosted engine consumes significantly more memory, the
> OOM killer would try to kill it to free memory for the other processes (VMs)

Not in that order of course :)
Also this assumes a single node, otherwise I expect the balancer to move some of the VMs.
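To make the overcommitment math in the suggested reproducer concrete, here is a small, purely illustrative sketch. The numbers come from the comment above; the physical * overcommit% formula is just the usual scheduling-capacity rule, not engine code, and the class name is made up.

/** Purely illustrative: why the scheduler can still place a large
 *  hosted-engine VM on a node whose RAM the small VMs can already fill. */
public final class OvercommitMath {
    public static void main(String[] args) {
        long physicalMb = 16 * 1024;        // node memory from the suggested repro
        int overcommitPercent = 150;        // cluster memory overcommitment
        long smallVms = 32;
        long smallVmMb = 512;               // memory of each small VM

        long schedulableMb = physicalMb * overcommitPercent / 100;  // 24576 MB on paper
        long placedMb = smallVms * smallVmMb;                       // 16384 MB, i.e. all physical RAM

        System.out.printf("schedulable=%d MB, placed=%d MB, paper headroom=%d MB%n",
                schedulableMb, placedMb, schedulableMb - placedMb);
        // The remaining "paper" headroom lets the hosted engine be scheduled,
        // but physically it then competes for RAM the small VMs can consume,
        // which is what pushes the node toward the OOM killer.
    }
}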
Oh, and I wanted to emphasize one more thing - it doesn't seem like the ovirt-engine application itself experienced any memory pressure. I don't see any errors in the log prior to Sep 3 at 13:29 that indicate the engine actually consumed the 16G that the ballooning left for it. It may well be that ovirt-engine consumed something like 12G, but from the OOM killer's perspective it had the highest score, making it the preferred target for relieving the memory pressure on the node.
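If it helps to confirm the "highest score" theory, the kernel exposes the per-process score it uses. Below is a minimal sketch (hypothetical class name, sorting and error handling deliberately left out) that just reads the standard /proc/<pid>/oom_score and /proc/<pid>/comm files.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

/** Prints each process's kernel OOM score; the highest-scored process is the
 *  one the OOM killer prefers to terminate under memory pressure. */
public class OomScores {
    public static void main(String[] args) throws IOException {
        try (Stream<Path> procEntries = Files.list(Paths.get("/proc"))) {
            procEntries
                .filter(p -> p.getFileName().toString().matches("\\d+"))
                .forEach(pid -> {
                    try {
                        String score = read(pid.resolve("oom_score"));
                        String comm = read(pid.resolve("comm"));
                        System.out.printf("%-8s %-20s oom_score=%s%n",
                                pid.getFileName(), comm, score);
                    } catch (IOException ignored) {
                        // the process may have exited between listing and reading
                    }
                });
        }
    }

    private static String read(Path p) throws IOException {
        return new String(Files.readAllBytes(p), StandardCharsets.UTF_8).trim();
    }
}

Running something like this inside the hosted-engine VM around the time of the kill would show whether ovirt-engine indeed carried the highest score.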
Ok, VERIFIED then
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:3010