Bug 1700725
| Summary: | [scale] RHV-M runs out of memory due to too much data reported by the guest agent | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Roman Hodain <rhodain> | |
| Component: | ovirt-engine | Assignee: | Tomasz Barański <tbaransk> | |
| Status: | CLOSED ERRATA | QA Contact: | mlehrer | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | high | |||
| Version: | 4.2.8-4 | CC: | dagur, gchakkar, izuckerm, lsvaty, michal.skrivanek, mlehrer, mperina, mtessun, rgolan, tbaransk, yoliynyk | |
| Target Milestone: | ovirt-4.4.0 | Keywords: | Performance, ZStream | |
| Target Release: | --- | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | rhv-4.4.0-29 | Doc Type: | No Doc Update | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1712437 (view as bug list) | Environment: | ||
| Last Closed: | 2020-08-04 13:17:36 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1712437 | |||
Description
Roman Hodain
2019-04-17 09:08:51 UTC
Roman, just to be sure this is not leaking via a hard reference somewhere: if you trigger a GC, do you get the memory back?

For a workaround:
- Dedup, as Ryan suggested - needs some testing; it may be that we would trade memory for CPU consumption with the dedup.
- More aggressive GC - we pay with longer pauses or more processing, but at least it will remove them.
- Decrease the interval of statistics collection - this would have a dramatic effect at little cost; for example, relax it to 15s: `engine-config -s VdsRefreshRate=15`

For a reproducer: let's just track a single VM, tweak VDSM to return a huge list like the one reported here, and see what happens over a long time. It must not leak, meaning that if we cause a major GC it gets collected. Later, increase the number of VMs (you can tweak VDSM to do the same for all the VMs) and apply the workarounds and suggestions.

For a solution, if we see it is a must, we can go various ways: server side (VDSM), client side (engine), or both.
- server: VDSM sends the list hash, and the full content only if it changed, i.e. it hashes the list and compares every time - similar to devices, maybe?
- client: when the JSON is received, we selectively don't de-serialize some fields (like the app list)
- client + server: the engine negotiates which fields it is not interested in on every call

Shmuel, Ryan: I suggest a few other mitigation strategies here, let's discuss those. Also, did anyone test the string-interning effect in any way on a large system?

(In reply to Roy Golan from comment #20)
> For a workaround:
> - Dedup, as Ryan suggested - needs some testing; it may be that we would
> trade memory for CPU consumption with the dedup.

Gajanan, did the customer test this on their environment?

> - More aggressive GC - we pay with longer pauses or more processing, but at
> least it will remove them.
> - Decrease the interval of statistics collection - this would have a
> dramatic effect at little cost; for example, relax it to 15s:
> `engine-config -s VdsRefreshRate=15`

From the memory dump, the issue is not statistics collection, but that it is a VDI environment containing a large number of Windows VMs whose long, identical application lists fill up the heap.

> For a solution, if we see it is a must, we can go various ways: server side
> (VDSM), client side (engine), or both.
> - server: VDSM sends the list hash, and the full content only if it changed,
> i.e. it hashes the list and compares every time - similar to devices, maybe?
> - client: when the JSON is received, we selectively don't de-serialize some
> fields (like the app list)
> - client + server: the engine negotiates which fields it is not interested
> in on every call
>
> Shmuel, Ryan: I suggest a few other mitigation strategies here, let's
> discuss those. Also, did anyone test the string-interning effect in any way
> on a large system?

No access to the scale environment, but a test would be great. The other "good" fix is to do as suggested on the patch: split the list, then pull the entries back out of a set or hashmap. This carries a high risk of regressions in a z-stream due to the number of changes needed, though.

Hello Roman,
Please specify the number of applications on one VM, in order to reproduce / verify this issue.

Alternatively to the currently posted patches, https://gerrit.ovirt.org/#/c/101193/ could be significant as well. It should be tested.

That patch, https://gerrit.ovirt.org/#/c/101193/, is definitely the fix, and not the intern-string one. I'll update the trackers list accordingly.

How many VMs are on the reported system? Roman, one thing I didn't understand: if you run GC, is this collected?
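The "dedup" / string-interning idea discussed above can be illustrated with a minimal Java sketch. The class and method names below (`AppListDeduplicator`, `canonical`, `dedup`) are invented for this illustration and are not part of ovirt-engine; the point is only that identical application strings reported by many VMs end up sharing one canonical instance instead of one copy per VM per poll, at the cost of an extra map lookup (the memory-vs-CPU trade-off mentioned above).

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

/**
 * Hypothetical sketch of the "dedup" mitigation: identical application
 * strings reported by many Windows VMs are mapped to a single shared
 * instance instead of one copy per VM per poll.
 */
public final class AppListDeduplicator {

    // Shared pool of application-name strings, one canonical copy each.
    private static final Map<String, String> POOL = new ConcurrentHashMap<>();

    /** Returns the canonical instance of the given application name. */
    public static String canonical(String appName) {
        return POOL.computeIfAbsent(appName, s -> s);
    }

    /** Replaces every entry of a VM's reported app list with the pooled copy. */
    public static List<String> dedup(List<String> reportedApps) {
        return reportedApps.stream()
                .map(AppListDeduplicator::canonical)
                .collect(Collectors.toList());
    }
}
```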
Roy, here is the latest heap dump:

```shell
sudo -u ovirt jcmd $(pidof ovirt-engine) GC.heap_dump -all /tmp/heap_dump_$(date +"%d-%m-%y_%H-%M").out
```

taken after starting 578 VMs from pools with a 'hacked' agent that reports many apps.

https://drive.google.com/open?id=1ogLAtVRE1WuU2iyfNomYt2W0ePhr0190
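To answer the earlier question of whether the app-list strings are actually reclaimable (i.e. not held by a hard reference), one possible check on the engine host is sketched below. It uses the standard JDK `jcmd` diagnostic commands `GC.class_histogram` and `GC.run` alongside the heap-dump command already shown above; the exact histogram output depends on the JVM in use, and this is a suggested procedure rather than anything from the bug itself.

```shell
# Take a class histogram, force a full GC, then take another histogram.
# If the char[] / String volume drops significantly after GC.run, the
# application lists are collectable and the growth is not a hard-reference leak.
sudo -u ovirt jcmd $(pidof ovirt-engine) GC.class_histogram | head -20
sudo -u ovirt jcmd $(pidof ovirt-engine) GC.run
sudo -u ovirt jcmd $(pidof ovirt-engine) GC.class_histogram | head -20
```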
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247