Bug 1500739 - Engine fails with java.lang.OutOfMemoryError making all hosts non responsive
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm-jsonrpc-java
Version: 4.1.2
Hardware: All Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.2.0
Target Release: ---
Assigned To: Piotr Kliczewski
QA Contact: Petr Matyáš
Keywords: ZStream
Depends On:
Blocks: 1504118
Reported: 2017-10-11 07:59 EDT by nijin ashok
Modified: 2018-05-15 13:56 EDT

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1504118
Environment:
Last Closed: 2018-05-15 13:56:18 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


External Trackers
Tracker                           | ID             | Priority         | Status | Summary                                           | Last Updated
------------------------------------------------------------------------------------------------------------------------------------------------
Red Hat Knowledge Base (Solution) | 3223081        | None             | None   | None                                              | 2017-10-24 05:21 EDT
oVirt gerrit                      | 82731          | master           | MERGED | host to id mapping needs to be cleared on failure | 2017-10-19 08:08 EDT
oVirt gerrit                      | 82988          | None             | None   | None                                              | 2017-10-23 03:14 EDT
oVirt gerrit                      | 82990          | ovirt-4.1        | MERGED | host to id mapping needs to be cleared on failure | 2017-10-19 10:34 EDT
oVirt gerrit                      | 83896          | master           | MERGED | Release host level lock faster                    | 2017-11-20 06:30 EST
oVirt gerrit                      | 84374          | master           | MERGED | jsonrpc: version bump                             | 2017-11-20 08:16 EST
oVirt gerrit                      | 84379          | ovirt-4.1        | MERGED | Release host level lock faster                    | 2017-11-20 08:24 EST
oVirt gerrit                      | 84386          | ovirt-engine-4.1 | MERGED | jsonrpc: version bump                             | 2017-11-21 04:49 EST
Red Hat Product Errata            | RHEA-2018:1516 | None             | None   | None                                              | 2018-05-15 13:56 EDT
------------------------------------------------------------------------------------------------------------------------------------------------

Description nijin ashok 2017-10-11 07:59:01 EDT
Description of problem:

The engine failed suddenly with a "java.lang.OutOfMemoryError: Java heap space" error. This caused all hosts in the environment to become non-responsive. The following entries were recorded in the server log.

==
2017-10-11 07:04:45,285+05 ERROR [stderr] (ResponseWorker) Exception in thread "ResponseWorker" java.lang.OutOfMemoryError: Java heap space

2017-10-11 07:04:42,372+05 ERROR [io.undertow.servlet] (default task-84) Exception while dispatching incoming RPC call: com.google.gwt.user.client.rpc.SerializationException: Can't find the serialization policy file. This probably means that the user has an old version of the application loaded in the browser. To solve the issue the user needs to close the browser and open it again, so that the application is reloaded.
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
==

The "GC overhead limit exceeded" error is logged multiple times in the server log. At some point, the engine stops even trying to check the status of the hosts.

The environment has a 1 GB heap configured. It is not a large environment: it has fewer than 100 VMs and 20 hosts.

There was no specific event before the issue; the only activity I could see beforehand was a clone operation. According to the heap dump, most of the memory is held by org.ovirt.vdsm.jsonrpc.client.internal.ResponseTracker.

====
One instance of "org.ovirt.vdsm.jsonrpc.client.internal.ResponseTracker" loaded by "org.jboss.modules.ModuleClassLoader @ 0xc23ccf30" occupies 643,815,688 (62.78%) bytes. The memory is accumulated in one instance of "java.util.concurrent.ConcurrentHashMap$Node[]" loaded by "<system class loader>".

Class Name                                                                          | Shallow Heap | Retained Heap
-------------------------------------------------------------------------------------------------------------------
org.ovirt.vdsm.jsonrpc.client.internal.ResponseTracker @ 0xc4dfac10                 |           40 |   643,815,688
|- <class> class org.ovirt.vdsm.jsonrpc.client.internal.ResponseTracker @ 0xc4d957c8|            8 |             8
|- isTracking java.util.concurrent.atomic.AtomicBoolean @ 0xc4dfac38                |           16 |            16
|- runningCalls java.util.concurrent.ConcurrentHashMap @ 0xc4dfac48                 |           64 |           536
|- map java.util.concurrent.ConcurrentHashMap @ 0xc4dfac88                          |           64 |         2,456
|- hostToId java.util.concurrent.ConcurrentHashMap @ 0xc4dfacc8                     |           64 |   643,812,640
|- queue java.util.concurrent.ConcurrentLinkedQueue @ 0xc4dfad08                    |           24 |            24
|- lock java.util.concurrent.locks.ReentrantLock @ 0xc4dfad20                       |           16 |            16
'- Total: 7 entries                                                                 |              |              
-------------------------------------------------------------------------------------------------------------------
====
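
The heap dump points at the hostToId map, and the linked gerrit patches are titled "host to id mapping needs to be cleared on failure". The following is a minimal sketch of that general pattern, not the vdsm-jsonrpc-java source: the class TrackerSketch and its methods registerCall, removeCall, clearOnFailure and trackedIds are hypothetical names chosen for illustration. It shows how per-host request ids accumulate if a failure path never removes them, and how clearing the mapping on failure releases them.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical stand-in for a response tracker; not the vdsm-jsonrpc-java code. */
public class TrackerSketch {
    // host name -> ids of requests that are still waiting for a response
    private final Map<String, List<String>> hostToId = new ConcurrentHashMap<>();

    /** Record that a request with this id was sent to the host. */
    public void registerCall(String host, String requestId) {
        hostToId.computeIfAbsent(host, h -> new CopyOnWriteArrayList<>()).add(requestId);
    }

    /** Normal path: a response arrived, so the id is no longer needed. */
    public void removeCall(String host, String requestId) {
        List<String> ids = hostToId.get(host);
        if (ids != null) {
            ids.remove(requestId);
        }
    }

    /** Failure path: drop every id recorded for the host so nothing lingers. */
    public void clearOnFailure(String host) {
        hostToId.remove(host);
    }

    /** Total number of ids currently held across all hosts. */
    public int trackedIds() {
        return hostToId.values().stream().mapToInt(List::size).sum();
    }

    public static void main(String[] args) {
        TrackerSketch tracker = new TrackerSketch();
        // Simulate requests to an unreachable host that never answers.
        for (int i = 0; i < 1000; i++) {
            tracker.registerCall("host-1", "req-" + i);
        }
        System.out.println("tracked before failure handling: " + tracker.trackedIds());
        tracker.clearOnFailure("host-1");
        System.out.println("tracked after failure handling: " + tracker.trackedIds());
    }
}
```

Without the clearOnFailure call, every request sent to a host that never responds would leave its id in the map, so the map grows for as long as the engine keeps retrying.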

Around 26 GB was available on the RHV-M server at the time of the issue.
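
As a rough cross-check of the numbers above, the retained size and percentage from the heap dump can be turned back into an implied heap size. This is only an arithmetic sketch: the class HeapShare and the method impliedHeapBytes are hypothetical, and the max-heap line reports the JVM running the sketch, not the engine from this report.

```java
public class HeapShare {
    /** Heap size implied by a retained size and the share of the heap it represents. */
    static long impliedHeapBytes(long retainedBytes, double percentOfHeap) {
        return Math.round(retainedBytes / (percentOfHeap / 100.0));
    }

    public static void main(String[] args) {
        long retained = 643_815_688L; // ResponseTracker retained heap, from the dump
        double percent = 62.78;       // share of the heap, from the dump
        long implied = impliedHeapBytes(retained, percent);
        System.out.printf("implied heap: %d bytes (%.0f MiB)%n",
                implied, implied / (1024.0 * 1024.0));
        // Max heap of the JVM running this sketch, not of the engine in the report.
        System.out.printf("this JVM max heap: %d MiB%n",
                Runtime.getRuntime().maxMemory() / (1024 * 1024));
    }
}
```

The implied figure comes out close to the 1 GB heap mentioned in the description, which suggests the failure was bounded by the configured JVM heap rather than by the memory available on the server.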

Version-Release number of selected component (if applicable):

rhevm-4.1.2.3-0.1.el7.noarch

Additional info:
Comment 4 Martin Perina 2017-10-11 09:36:33 EDT
Targeting 4.2.0 for now; once we have investigated fully, we will reevaluate and retarget if needed.
Comment 17 Petr Matyáš 2017-12-11 11:47:09 EST
Looks good on ovirt-engine-4.2.0-0.6.el7.noarch
Comment 20 errata-xmlrpc 2018-05-15 13:56:18 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1516
