+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1700725 +++

======================================================================

Description of problem:
The engine runs out of memory. The majority of the memory is consumed by instances of java.lang.String (almost 7 GB). These objects hold the list of applications running on the VMs. Each object is approximately 7 KB in size, and 100644 of them are allocated.

Version-Release number of selected component (if applicable):
rhvm 4.2.8

How reproducible:
Not known yet

Actual results:
Engine runs out of memory

Expected results:
Engine does not run out of memory

(Originally by Roman Hodain)
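A note on the heap profile: the dump shows one large application-list String kept per report, and many of those lists are likely identical across VMs. One common mitigation for that pattern is to canonicalize equal strings so every VM reporting the same list shares a single instance. The sketch below only illustrates the idea; the class and method names are hypothetical, and the actual fix in the referenced patch may take a different approach.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical illustration: keep one shared copy of each distinct
 * application-list string instead of one copy per VM report.
 */
public final class AppListCanonicalizer {

    // Canonical copies keyed by their own value. If old lists must become
    // collectable, a weak/soft-reference cache would be used instead.
    private static final Map<String, String> CANONICAL = new ConcurrentHashMap<>();

    /** Returns a shared instance equal to the reported application list. */
    public static String canonicalize(String reportedAppList) {
        if (reportedAppList == null) {
            return null;
        }
        // The first reporter's copy becomes the canonical one.
        String existing = CANONICAL.putIfAbsent(reportedAppList, reportedAppList);
        return existing != null ? existing : reportedAppList;
    }
}

If most of the ~100k reported copies are equal, the retained size drops from the order of 700 MB (100644 x ~7 KB) to a handful of distinct copies plus map overhead.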
didn't make it in time, need to wait for 4.3.5
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Open patch attached] For more info please contact: rhv-devops
Created attachment 1602803 [details] general heap view
Created attachment 1602804 [details] idle_vms_heap_with_forced_gc
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Tag 'ovirt-engine-4.3.6.2' doesn't contain patch 'https://gerrit.ovirt.org/99577'] gitweb: https://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=shortlog;h=refs/tags/ovirt-engine-4.3.6.2 For more info please contact: rhv-devops
INFO: Bug status (ON_QA) wasn't changed but the following should be fixed: [Tag 'ovirt-engine-4.3.6.3' doesn't contain patch 'https://gerrit.ovirt.org/99577'] gitweb: https://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=shortlog;h=refs/tags/ovirt-engine-4.3.6.3 For more info please contact: rhv-devops
Is the fix in the build delivered to QE already?
It is
sync2jira
INFO: Bug status (ON_QA) wasn't changed but the following should be fixed: [Tag 'ovirt-engine-4.3.6.4' doesn't contain patch 'https://gerrit.ovirt.org/99577'] gitweb: https://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=shortlog;h=refs/tags/ovirt-engine-4.3.6.4 For more info please contact: rhv-devops
Update on this BZ: we just noticed that the engine service failed with:

2019-09-03 13:29:02,579+0000 ovirt-engine: ERROR run:554 Error: process terminated with status code -9

With repeating ERRORs in the engine log:

2019-09-03 13:18:35,366Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-28) [] EVENT_ID: VM_MEMORY_UNDER_GUARANTEED_VALUE(148), VM HostedEngine on host f01-h02-000-r620.rdu2.scalelab.redhat.com was guaranteed 21845 MB but currently has 16384 MB

We think this crash is caused by the memory growth resulting from 578 running VMs with the 'hacked' agent (reporting lots of apps). Adding log collector logs for the engine and the SPM host of the relevant cluster in a private message below.
In general, we expect memory consumption to increase as more data is received. The more dynamic data (that is queried often) we get, the more memory the application allocates and the more data is cached by postgres.

The logs show that:
1. The engine runs in a VM (hosted-engine)
2. There was memory pressure on the node
3. The hosted-engine VM was set with (almost) 22G of guaranteed memory but was provided with only 16G (due to ballooning)
4. It is the OOM killer that killed the ovirt-engine service at 13:29

This raises a few questions:
1. What led to the memory pressure on the node - what is the cluster memory overcommitment, how many VMs ran on the node and what are their memory requirements, is MOM configured with the default policy? How much memory does the node have?
2. When not using the hacked agent, what is the memory consumption of the hosted engine VM? How much did it change when using the hacked agent?
3. Can you provide statistics on the memory consumption of the ovirt-engine process, to see which part of the application consumed more memory? Same for postgres?

My concern is that this failure is not really related to the applications list but may happen whenever more data is received by the engine (for instance by increasing the number of VMs from 600 to 1000).

I believe that without the hacked agent you can also reproduce it:
1. Take a node with 16G of memory
2. Run 32 VMs that consume 512M of memory with overcommitment=150% (so they consume the entire 16G, but from the scheduling perspective we still allow scheduling the hosted engine)
3. Run the hosted engine

I think that since the hosted engine consumes significantly more memory, the OOM killer would try to kill it to free memory for the other processes (VMs).
> I believe that without the hacked agent you can also reproduce it:
> 1. Take a node with 16G of memory
> 2. Run 32 VMs that consume 512M of memory with overcommitment=150% (so they
> consume the entire 16G, but from the scheduling perspective we still allow
> scheduling the hosted engine)
> 3. Run the hosted engine
> I think that since the hosted engine consumes significantly more memory, the
> OOM killer would try to kill it to free memory for the other processes (VMs)

Not in that order of course :)
Also this assumes a single node, otherwise I expect the balancer to move some of the VMs.
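To make the overcommitment math in the suggested reproducer concrete, here is a small, purely illustrative sketch. The numbers come from the comment above; the physical * overcommit% formula is just the usual scheduling-capacity rule, not engine code, and the class name is made up.

/** Purely illustrative: why the scheduler can still place a large
 *  hosted-engine VM on a node whose RAM the small VMs can already fill. */
public final class OvercommitMath {
    public static void main(String[] args) {
        long physicalMb = 16 * 1024;        // node memory from the suggested repro
        int overcommitPercent = 150;        // cluster memory overcommitment
        long smallVms = 32;
        long smallVmMb = 512;               // memory of each small VM

        long schedulableMb = physicalMb * overcommitPercent / 100;  // 24576 MB on paper
        long placedMb = smallVms * smallVmMb;                       // 16384 MB, i.e. all physical RAM

        System.out.printf("schedulable=%d MB, placed=%d MB, paper headroom=%d MB%n",
                schedulableMb, placedMb, schedulableMb - placedMb);
        // The remaining "paper" headroom lets the hosted engine be scheduled,
        // but physically it then competes for RAM the small VMs can consume,
        // which is what pushes the node toward the OOM killer.
    }
}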
Oh, and I wanted to emphasize one more thing - it doesn't seem like the ovirt-engine application itself experienced any memory pressure. I don't see any errors in the log prior to Sep 3 at 13:29 that indicate the engine actually consumed the 16G that the ballooning left for it. It may well be that ovirt-engine consumed something like 12G, but from the OOM killer's perspective it had the highest score, making it the preferred target for relieving the memory pressure on the node.
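If it helps to confirm the "highest score" theory, the kernel exposes the per-process score it uses. Below is a minimal sketch (hypothetical class name, sorting and error handling deliberately left out) that just reads the standard /proc/<pid>/oom_score and /proc/<pid>/comm files.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

/** Prints each process's kernel OOM score; the highest-scored process is the
 *  one the OOM killer prefers to terminate under memory pressure. */
public class OomScores {
    public static void main(String[] args) throws IOException {
        try (Stream<Path> procEntries = Files.list(Paths.get("/proc"))) {
            procEntries
                .filter(p -> p.getFileName().toString().matches("\\d+"))
                .forEach(pid -> {
                    try {
                        String score = read(pid.resolve("oom_score"));
                        String comm = read(pid.resolve("comm"));
                        System.out.printf("%-8s %-20s oom_score=%s%n",
                                pid.getFileName(), comm, score);
                    } catch (IOException ignored) {
                        // the process may have exited between listing and reading
                    }
                });
        }
    }

    private static String read(Path p) throws IOException {
        return new String(Files.readAllBytes(p), StandardCharsets.UTF_8).trim();
    }
}

Running something like this inside the hosted-engine VM around the time of the kill would show whether ovirt-engine indeed carried the highest score.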
Ok, VERIFIED then
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:3010