Bug 1700725 - [scale] RHV-M runs out of memory due to too much data reported by the guest agent
Summary: [scale] RHV-M runs out of memory due to too much data reported by the guest agent
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.2.8-4
Hardware: Unspecified
OS: Unspecified
high
urgent
Target Milestone: ovirt-4.4.0
Assignee: Tomasz Barański
QA Contact: mlehrer
URL:
Whiteboard:
Depends On:
Blocks: 1712437
 
Reported: 2019-04-17 09:08 UTC by Roman Hodain
Modified: 2020-11-05 07:05 UTC
11 users

Fixed In Version: rhv-4.4.0-29
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1712437 (view as bug list)
Environment:
Last Closed: 2020-08-04 13:17:36 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2020:3247 0 None None None 2020-08-04 13:19:18 UTC
oVirt gerrit 101077 0 master ABANDONED core: Store in memory application list the right way 2021-01-19 12:43:56 UTC
oVirt gerrit 101193 0 'None' MERGED core: cache only relevant data of last reported vms 2021-01-19 12:43:56 UTC

Description Roman Hodain 2019-04-17 09:08:51 UTC
Description of problem:
The engine runs out of memory. The majority of the memory is consumed by instances of java.lang.String (almost 7 GB). All of these objects hold information about the list of applications running on the VMs. The approximate size of each object is 7 KB, and the number of allocated objects is 100644.

Version-Release number of selected component (if applicable):
rhvm 4.2.8

How reproducible:
Not known yet

Actual results:
Engine runs out of memory

Expected results:
Engine does not run out of memory

Comment 20 Roy Golan 2019-05-26 12:05:58 UTC
Roman, just to be sure this is not a leak with a hard reference held somewhere: if you trigger a GC, do you get the memory back?


For a workaround:
- Dedup as Ryan suggested - needs some testing; it may be that we would trade memory for CPU consumption with the dedup (see the sketch after this list).
- More aggressive GC - pay with longer pauses or more processing, but at least it will remove them.
- Relax the interval of statistics collection (poll less often) - this would have a dramatic effect with little cost - for example relax it to 15s: `engine-config -s VdsRefreshRate=15`
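
A minimal sketch of the dedup idea, assuming the engine keeps each VM's reported application list as one large string; the class and method names below are invented for illustration and are not actual engine code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: a canonicalizing cache so that identical application-list
// strings reported by many VMs share a single String instance on the heap,
// instead of each VM holding its own ~7 KB copy.
public final class AppListDeduper {
    private static final Map<String, String> APP_LIST_CACHE = new ConcurrentHashMap<>();

    private AppListDeduper() {
    }

    // Returns a shared instance for equal application-list strings.
    public static String canonicalize(String reportedAppList) {
        if (reportedAppList == null) {
            return null;
        }
        // putIfAbsent keeps the first instance seen; later equal strings reuse it.
        String cached = APP_LIST_CACHE.putIfAbsent(reportedAppList, reportedAppList);
        return cached != null ? cached : reportedAppList;
    }
}
```

The memory/CPU trade-off mentioned above is the extra map lookup per VM per polling cycle, plus keeping one copy of every distinct list alive in the cache.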

For a reproducer:
Let's just track a single VM, tweak VDSM to return a huge list, like the one reported here, and see what happens over a long period.
It must not leak, meaning that if we cause a major GC it gets collected.
Later, increase the number of VMs (you can tweak VDSM to do the same for all the VMs) and apply the workarounds and suggestions.

For a solution, if we see it's a must:
We can go various ways: server side (VDSM), client side (engine), or both.
server - VDSM sends the list hash, and the full content only if it changed, i.e. hash the list and compare every time - similar to devices maybe? (a rough sketch follows below)
client - when the JSON is received, we selectively don't deserialize some fields (like the app list)
client + server - the engine negotiates which fields it is not interested in on every call
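
A minimal sketch of the server-side idea, written in Java purely for illustration (VDSM itself is Python, and the class and method names here are made up):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: report the full application list just once per change,
// and only a short hash on every other polling cycle.
public class AppListChangeTracker {
    private final Map<String, String> lastHashPerVm = new HashMap<>();

    // Returns the full list when it changed since the last report, otherwise null
    // (meaning "send only the hash, the receiver already has the content").
    public String fullListIfChanged(String vmId, String appList) {
        String hash = sha1Hex(appList == null ? "" : appList);
        String previous = lastHashPerVm.put(vmId, hash);
        return hash.equals(previous) ? null : appList;
    }

    private static String sha1Hex(String input) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            byte[] bytes = digest.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(bytes.length * 2);
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```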


Shmuel, Ryan: I suggest a few other mitigation strategies here, let's discuss them. Also, did anyone test the string-interning effect in any way on a large system?

Comment 21 Ryan Barry 2019-05-28 10:30:20 UTC
(In reply to Roy Golan from comment #20)
> For a workaround:
> - Dedup as Ryan suggested - needs some testing; it may be that we would
> trade memory for CPU consumption with the dedup.

Gajanan, did the customer test this on their environment?

> - More aggressive GC - pay with longer pauses or more processing, but at
> least it will remove them.
> - Relax the interval of statistics collection (poll less often) - this would
> have a dramatic effect with little cost - for example relax it to 15s:
> `engine-config -s VdsRefreshRate=15`

From the memory dump, the issue is not statistics collection; it's a VDI environment containing a large number of Windows VMs with long, identical application lists filling up the heap.

> For a solution, if we see it's a must:
> We can go various ways: server side (VDSM), client side (engine), or both.
> server - VDSM sends the list hash, and the full content only if it changed,
> i.e. hash the list and compare every time - similar to devices maybe?
> client - when the JSON is received, we selectively don't deserialize some
> fields (like the app list)
> client + server - the engine negotiates which fields it is not interested in
> on every call
> 
> 
> Shmuel, Ryan: I suggest a few other mitigation strategies here, let's discuss
> them. Also, did anyone test the string-interning effect in any way on a
> large system?

No access to the scale environment, but a test would be great. The other "good" fix is to do as suggested on the patch: split the list, then pull the entries back out of a set or hashmap (a rough sketch follows). This carries a high risk of regressions in a z-stream due to the number of changes needed, though.
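
A minimal sketch of that split-and-share idea, assuming the agent reports the application list as one comma-separated string; the class and names below are invented for illustration, not taken from the patch:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: split the reported list into individual application names
// and keep one shared String per distinct name, so hundreds of Windows VMs with
// identical application lists no longer hold hundreds of identical copies.
public final class AppListSplitter {
    private static final ConcurrentHashMap<String, String> SHARED_NAMES = new ConcurrentHashMap<>();

    private AppListSplitter() {
    }

    public static Set<String> parse(String reportedAppList) {
        Set<String> apps = new LinkedHashSet<>();
        if (reportedAppList == null || reportedAppList.isEmpty()) {
            return apps;
        }
        for (String name : reportedAppList.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                // computeIfAbsent returns the first instance stored for an equal name,
                // so later equal names from other VMs reuse the same String object.
                apps.add(SHARED_NAMES.computeIfAbsent(trimmed, k -> k));
            }
        }
        return apps;
    }
}
```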

Comment 22 Ilan Zuckerman 2019-06-04 10:04:41 UTC
Hello Roman,
Please specify the number of applications on one VM, so this issue can be reproduced and verified.

Comment 25 Michal Skrivanek 2019-06-26 09:43:54 UTC
As an alternative to the currently posted patches, https://gerrit.ovirt.org/#/c/101193/ could be significant as well. It should be tested.

Comment 26 Roy Golan 2019-07-15 11:39:55 UTC
That patch, https://gerrit.ovirt.org/#/c/101193/, is definitely the fix, not the string-interning one.

I'll update the trackers list accordingly.
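
For context, a minimal sketch of the general idea behind that patch ("cache only relevant data of last reported vms") - this is not the actual engine code, and the class and field names below are invented for illustration:

```java
// Illustrative only: when caching the last-reported state of a VM for change
// detection, keep just the fields the comparison needs and drop heavyweight
// ones such as the full application-list string.
public class CachedVmReport {
    private final String vmId;
    private final String status;
    private final int appListHash; // a hash instead of the ~7 KB application list

    public CachedVmReport(String vmId, String status, String appList) {
        this.vmId = vmId;
        this.status = status;
        this.appListHash = appList == null ? 0 : appList.hashCode();
    }

    // True when the newly reported list differs from the one cached earlier.
    public boolean appListChanged(String newAppList) {
        int newHash = newAppList == null ? 0 : newAppList.hashCode();
        return newHash != appListHash;
    }

    public String getVmId() {
        return vmId;
    }

    public String getStatus() {
        return status;
    }
}
```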

Comment 30 Daniel Gur 2019-07-31 07:48:13 UTC
How many VMs on the reported system?

Comment 31 Roy Golan 2019-08-11 13:33:25 UTC
Roman, one thing I didn't understand: if you run a GC, is this collected?

Comment 32 Daniel Gur 2019-08-28 13:13:38 UTC
sync2jira

Comment 33 Daniel Gur 2019-08-28 13:17:51 UTC
sync2jira

Comment 34 Ilan Zuckerman 2019-09-03 08:14:59 UTC
Roy, here is the latest heap dump, taken after starting 578 VMs from pools with a 'hacked' agent that reports many apps:

sudo -u ovirt jcmd $(pidof ovirt-engine) GC.heap_dump -all /tmp/heap_dump_$(date +"%d-%m-%y_%H-%M").out

https://drive.google.com/open?id=1ogLAtVRE1WuU2iyfNomYt2W0ePhr0190

Comment 35 RHV bug bot 2019-12-13 13:14:31 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 36 RHV bug bot 2019-12-20 17:44:22 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 37 RHV bug bot 2020-01-08 14:48:45 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 38 RHV bug bot 2020-01-08 15:15:31 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 39 RHV bug bot 2020-01-24 19:50:38 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 50 errata-xmlrpc 2020-08-04 13:17:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247

