Description of problem:
The engine runs out of memory. Most of the heap (almost 7 GB) is consumed by instances of java.lang.String. All of these objects hold the list of applications running on the VMs. Each object is approximately 7 KB in size, and 100644 of them are allocated.

Version-Release number of selected component (if applicable):
rhvm 4.2.8

How reproducible:
Not known yet

Actual results:
Engine runs out of memory

Expected results:
Engine does not run out of memory
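As a quick sanity check on the figures in the description, the object count and the approximate per-object size can simply be multiplied out. This is only arithmetic on the numbers as reported; it assumes "7KB" means 7 * 1024 bytes of shallow size per object, which the report does not state.

```python
# Figures taken from the description above.
object_count = 100_644
approx_size_bytes = 7 * 1024  # assumption: "7KB" read as 7 KiB per object

total_bytes = object_count * approx_size_bytes
total_gib = total_bytes / 1024**3

print(f"{object_count} objects x 7 KiB = {total_gib:.2f} GiB")
```

The product comes out at well under 1 GB, so the "almost 7 GB" total and the per-object figures do not line up as written. Whoever reproduces this may want to confirm the per-object size and the instance count from the heap dump.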
Roman, just to be sure this is not a leak through a hard reference held somewhere: if you trigger a GC, do you get the memory back?

For a workaround:
- Dedup, as Ryan suggested. This needs some testing; we may end up trading memory for CPU consumption spent on the dedup.
- More aggressive GC. We pay with longer pauses or more processing, but at least the strings get removed.
- Collect statistics less often. This would have a dramatic effect at little cost, for example relax the interval to 15s: `engine-config -s VdsRefreshRate=15`

For a reproducer:
Let's track just a single VM, tweak VDSM to return a huge list like the one reported here, and watch what happens over a long time. It must not leak, meaning that if we cause a major GC the strings get collected. Later, increase the number of VMs (you can tweak VDSM to do the same for all the VMs) and apply the workarounds and suggestions.

For a solution, if we see it's a must:
We can go several ways: server side (vdsm), client side (engine), or both.
- server: vdsm sends a hash of the list, and the full content only if it changed, i.e. it hashes the list and compares every time. Similar to devices, maybe?
- client: when the JSON is received, we selectively skip deserializing some fields (like the app list).
- client + server: the engine negotiates on every call which fields it is not interested in.

Shmuel, Ryan: I've suggested a few other mitigation strategies here, let's discuss those. Also, did anyone test the effect of string interning in any way on a large system?
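The server-side idea above (send a hash every time, and the full list only when it changed) can be sketched roughly as follows. This is a minimal illustration and not VDSM code: `app_list_digest`, `build_stats_reply`, the `appsHash` field, and the `last_sent` cache are all hypothetical names, and sorting the list before hashing assumes the order of applications does not matter.

```python
import hashlib
import json


def app_list_digest(apps):
    """Return a stable hex digest for a list of application names."""
    # Sort first so the same applications reported in a different
    # order produce the same digest (assumption: order is irrelevant).
    payload = json.dumps(sorted(apps), separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build_stats_reply(vm_id, apps, last_sent):
    """Build a per-VM stats reply that carries the full application list
    only when its digest differs from the one last sent for this VM.

    last_sent is a dict {vm_id: digest} kept by the sender between calls.
    """
    digest = app_list_digest(apps)
    reply = {"vmId": vm_id, "appsHash": digest}  # hypothetical field names
    if last_sent.get(vm_id) != digest:
        reply["appsList"] = list(apps)
        last_sent[vm_id] = digest
    return reply


# Usage: the first reply carries the list, an unchanged poll carries only the hash.
sent = {}
apps = ["Guest Agent 1.0", "Some Application 2.3"]
first = build_stats_reply("vm-1", apps, sent)
second = build_stats_reply("vm-1", apps, sent)
```

With this shape the receiver would keep the last full list per digest and reuse it whenever a reply arrives without `appsList`. A real protocol change would also need a way for the receiver to request a resend after a restart, which the sketch leaves out.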
(In reply to Roy Golan from comment #20)
> For a workaround
> - Dedup as Ryan suggested - needs some testing, it may be that we would
> trade memory with cpu consumption for the dedup.

Gajanan, did the customer test this in their environment?

> - More aggressive GC, pay longer pauses or processing but at least will
> remove them.
> - Decrease the interval of statistics collection - this would have a
> dramatic effect with little cost - example relax it to 15s - `engine-config
> -s VdsRefreshRate=15`

From the memory dump, the issue is not statistics collection. It's a VDI environment with a large number of Windows VMs that have long, identical application lists, and those lists fill up the heap.

> For a solution, if we see its a must:
> We can go in various ways, server side(vdsm) or client side(engine) or both.
> server - vdsm sends the list hash, and the full content only if changed, i.e
> hashes the list and compare every time - similar to devices maybe?
> client - when the json is received, we selectively don't de-serialize some
> fields (like app list)
> client + server: engine negotiates which fields its not interested on every
> call
>
>
> Shmuel, Ryan: I suggest few other mitigation strategies here, lets discuss
> those. Also, did anyone test the string interning effect in any way on the a
> large system?

No access to the scale environment, but a test would be great.

The other "good" fix is to do as suggested on the patch: split the list, then pull the entries back out of a set or hashmap. This carries a high risk of regressions in a Z-stream, though, due to the number of changes needed.
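The "split the list, then pull the entries back out of a set or hashmap" idea can be illustrated with a small canonicalizing pool. This is a sketch of the general technique only, written in Python for brevity; the engine itself is Java and the actual patch may look quite different. `AppListPool` is a hypothetical name, and the comma-separated input format is an assumption.

```python
class AppListPool:
    """Keep one shared copy of every distinct application name and of
    every distinct application list, so that VMs reporting identical
    lists reference the same objects instead of private copies."""

    def __init__(self):
        self._names = {}  # name -> canonical name object
        self._lists = {}  # tuple of names -> canonical tuple object

    def canonical(self, raw):
        """Split a comma-separated application string (assumed format)
        and return a tuple built from pooled name objects."""
        parts = (p.strip() for p in raw.split(","))
        names = tuple(self._names.setdefault(n, n) for n in parts if n)
        # Identical lists collapse to one shared tuple as well.
        return self._lists.setdefault(names, names)

    def distinct_names(self):
        return len(self._names)


# Usage: two VMs reporting the same applications share one tuple.
pool = AppListPool()
vm_a = pool.canonical("Guest Agent 1.0,Some Application 2.3")
vm_b = pool.canonical("Guest Agent 1.0, Some Application 2.3")
```

In this sketch the pool grows without bound, so a long-running service would need some eviction or weak-reference scheme. That is part of why testing the memory versus CPU trade-off on a large system matters.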
Hello Roman, please specify the number of applications on one VM so that we can reproduce / verify this issue.
As an alternative to the currently posted patches, https://gerrit.ovirt.org/#/c/101193/ could be significant as well. It should be tested.
That patch, https://gerrit.ovirt.org/#/c/101193/, is definitely the fix, not the string interning one. I'll update the trackers list accordingly.
How many VMs are on the reported system?
Roman, one thing I didn't understand: if you run a GC, is this collected?
Roy, here is the latest heap dump, taken with

`sudo -u ovirt jcmd $(pidof ovirt-engine) GC.heap_dump -all /tmp/heap_dump_$(date +"%d-%m-%y_%H-%M").out`

after starting 578 VMs from pools with a 'hacked' agent that reports many apps.
https://drive.google.com/open?id=1ogLAtVRE1WuU2iyfNomYt2W0ePhr0190
WARN: Bug status (ON_QA) wasn't changed but the folowing should be fixed: [Found non-acked flags: '{}', ] For more info please contact: rhv-devops
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:3247