Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1500739

Summary: Engine fails with java.lang.OutOfMemoryError making all hosts non responsive
Product: Red Hat Enterprise Virtualization Manager
Reporter: nijin ashok <nashok>
Component: vdsm-jsonrpc-java
Assignee: Piotr Kliczewski <pkliczew>
Status: CLOSED ERRATA
QA Contact: Petr Matyáš <pmatyas>
Severity: high
Priority: high
Version: 4.1.2
CC: lsurette, lsvaty, mgoldboi, mkalinin, mperina, nashok, pkliczew, rbalakri, Rhev-m-bugs, rhodain, srevivo, ykaul
Target Milestone: ovirt-4.2.0
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2018-05-15 17:56:18 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: Infra
Cloudforms Team: ---
Bug Blocks: 1504118

Description nijin ashok 2017-10-11 11:59:01 UTC
Description of problem:

The engine failed suddenly with a "java.lang.OutOfMemoryError: Java heap space" error. This made all hosts in the environment non-responsive. The following log entries are registered in the server log.

==
2017-10-11 07:04:45,285+05 ERROR [stderr] (ResponseWorker) Exception in thread "ResponseWorker" java.lang.OutOfMemoryError: Java heap space

2017-10-11 07:04:42,372+05 ERROR [io.undertow.servlet] (default task-84) Exception while dispatching incoming RPC call: com.google.gwt.user.client.rpc.SerializationException: Can't find the serialization policy file. This probably means that the user has an old version of the application loaded in the browser. To solve the issue the user needs to close the browser and open it again, so that the application is reloaded.
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
==

"GC overhead limit exceeded" is logged multiple times in the server log. At some point, I can see the engine is not even trying to check the status of the hosts.

The environment has 1 GB of heap configured for the engine. It is not a large environment: fewer than 100 VMs and about 20 hosts.
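For reference, assuming a standard RHV-M installation, the engine JVM heap limits are controlled by the ENGINE_HEAP_MIN/ENGINE_HEAP_MAX variables, which can be overridden with a drop-in file; the file name and values below are illustrative, not taken from this environment:

====
# /etc/ovirt-engine/engine.conf.d/90-heap.conf (hypothetical override)
ENGINE_HEAP_MIN=1g
ENGINE_HEAP_MAX=4g
====

Raising the heap would likely only delay the failure here, since the dump below shows the memory accumulating in a single tracker object.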

There is no specific event before the issue; the only operation I can see before it is a clone operation. According to the heap dump, most of the memory is held by org.ovirt.vdsm.jsonrpc.client.internal.ResponseTracker.

====
One instance of "org.ovirt.vdsm.jsonrpc.client.internal.ResponseTracker" loaded by "org.jboss.modules.ModuleClassLoader @ 0xc23ccf30" occupies 643,815,688 (62.78%) bytes. The memory is accumulated in one instance of "java.util.concurrent.ConcurrentHashMap$Node[]" loaded by "<system class loader>".

Class Name                                                                          | Shallow Heap | Retained Heap
-------------------------------------------------------------------------------------------------------------------
org.ovirt.vdsm.jsonrpc.client.internal.ResponseTracker @ 0xc4dfac10                 |           40 |   643,815,688
|- <class> class org.ovirt.vdsm.jsonrpc.client.internal.ResponseTracker @ 0xc4d957c8|            8 |             8
|- isTracking java.util.concurrent.atomic.AtomicBoolean @ 0xc4dfac38                |           16 |            16
|- runningCalls java.util.concurrent.ConcurrentHashMap @ 0xc4dfac48                 |           64 |           536
|- map java.util.concurrent.ConcurrentHashMap @ 0xc4dfac88                          |           64 |         2,456
|- hostToId java.util.concurrent.ConcurrentHashMap @ 0xc4dfacc8                     |           64 |   643,812,640
|- queue java.util.concurrent.ConcurrentLinkedQueue @ 0xc4dfad08                    |           24 |            24
|- lock java.util.concurrent.locks.ReentrantLock @ 0xc4dfad20                       |           16 |            16
'- Total: 7 entries                                                                 |              |              
-------------------------------------------------------------------------------------------------------------------
====
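The retained heap is concentrated in the hostToId map. Below is a minimal Java sketch of the failure pattern such a dump points to; this is not the actual vdsm-jsonrpc-java code, and the class and method names are illustrative. The point is that a per-host tracking map which is populated on every call but not reliably cleaned up on completion, timeout, or host removal keeps every entry alive, consistent with the dump above where hostToId retains almost the entire 643,815,688 bytes.

====
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative sketch only (not the ResponseTracker implementation):
// each JSON-RPC call id is added to a per-host list when the call is sent.
class LeakyTracker {
    private final Map<String, List<Long>> hostToId = new ConcurrentHashMap<>();

    void register(String host, long callId) {
        // An entry is added for every outgoing call.
        hostToId.computeIfAbsent(host, h -> new CopyOnWriteArrayList<>()).add(callId);
    }

    void remove(String host, long callId) {
        // If this path is skipped (e.g. on timeouts or when a host's queue is
        // cleared), the lists grow without bound and the map is never emptied.
        List<Long> ids = hostToId.get(host);
        if (ids != null) {
            ids.remove(Long.valueOf(callId));
        }
    }
}
====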

There was around 26 GB of memory still available on the RHV-M server at the time of the issue.
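For completeness, a heap dump like the one above can be captured from the engine JVM with standard JDK tooling; the pid and paths below are illustrative:

====
# One-off dump of the running engine JVM
jmap -dump:live,format=b,file=/tmp/engine-heap.hprof <engine_jvm_pid>

# Or have the JVM write a dump automatically on the next OutOfMemoryError
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/ovirt-engine/
====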

Version-Release number of selected component (if applicable):

rhevm-4.1.2.3-0.1.el7.noarch

Additional info:

Comment 4 Martin Perina 2017-10-11 13:36:33 UTC
Targeting to 4.2.0 for now; once we have investigated fully, we will reevaluate and retarget if needed.

Comment 17 Petr Matyáš 2017-12-11 16:47:09 UTC
Looks good on ovirt-engine-4.2.0-0.6.el7.noarch

Comment 20 errata-xmlrpc 2018-05-15 17:56:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1516

Comment 21 Franta Kust 2019-05-16 13:08:35 UTC
BZ<2>Jira Resync