Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1504118

Summary: [downstream clone - 4.1.8] Engine fails with java.lang.OutOfMemoryError making all hosts non responsive
Product: Red Hat Enterprise Virtualization Manager
Reporter: rhev-integ
Component: vdsm-jsonrpc-java
Assignee: Piotr Kliczewski <pkliczew>
Status: CLOSED ERRATA
QA Contact: Petr Matyáš <pmatyas>
Severity: high
Priority: high
Docs Contact:
Version: 4.1.2
CC: apinnick, bazulay, lsurette, lsvaty, lveyde, mgoldboi, mkalinin, mperina, nashok, oourfali, pbrilla, pkliczew, rbalakri, Rhev-m-bugs, rhodain, srevivo, ykaul
Target Milestone: ovirt-4.1.8
Keywords: Rebase, ZStream
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: 1.3.15
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1500739
Environment:
Last Closed: 2017-12-12 09:21:18 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1500739
Bug Blocks:

Description rhev-integ 2017-10-19 14:04:52 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1500739 +++
======================================================================

Description of problem:

The engine failed suddenly with a "java.lang.OutOfMemoryError: Java heap space" error. This made all hosts in the environment non-responsive. The following errors are recorded in the server log.

==
2017-10-11 07:04:45,285+05 ERROR [stderr] (ResponseWorker) Exception in thread "ResponseWorker" java.lang.OutOfMemoryError: Java heap space

2017-10-11 07:04:42,372+05 ERROR [io.undertow.servlet] (default task-84) Exception while dispatching incoming RPC call: com.google.gwt.user.client.rpc.SerializationException: Can't find the serialization policy file. This probably means that the user has an old version of the application loaded in the browser. To solve the issue the user needs to close the browser and open it again, so that the application is reloaded.
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
==

The "GC overhead limit exceeded" error is logged multiple times in the server log. At some point, the engine stops even trying to check the status of the hosts.

The environment has a 1 GB heap size configured. It is not a large environment: fewer than 100 VMs and 20 hosts.
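
As a side note, the engine heap ceiling can usually be raised while the root cause is investigated. This is only a hedged sketch, not a recommendation from this bug: the drop-in file name below is hypothetical, and ENGINE_HEAP_MIN/ENGINE_HEAP_MAX are assumed to be the standard ovirt-engine service settings, so they should be verified against the installed defaults.

====
# /etc/ovirt-engine/engine.conf.d/99-heap.conf  (hypothetical drop-in file)
ENGINE_HEAP_MIN=2g
ENGINE_HEAP_MAX=2g
====

After editing such a file, the engine service has to be restarted (systemctl restart ovirt-engine) for the new heap limits to take effect.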

There is no specific event before the issue; the only operation I can see beforehand is a clone operation. According to the heap dump, most of the memory is held by org.ovirt.vdsm.jsonrpc.client.internal.ResponseTracker.

====
One instance of "org.ovirt.vdsm.jsonrpc.client.internal.ResponseTracker" loaded by "org.jboss.modules.ModuleClassLoader @ 0xc23ccf30" occupies 643,815,688 (62.78%) bytes. The memory is accumulated in one instance of "java.util.concurrent.ConcurrentHashMap$Node[]" loaded by "<system class loader>".

Class Name                                                                          | Shallow Heap | Retained Heap
-------------------------------------------------------------------------------------------------------------------
org.ovirt.vdsm.jsonrpc.client.internal.ResponseTracker @ 0xc4dfac10                 |           40 |   643,815,688
|- <class> class org.ovirt.vdsm.jsonrpc.client.internal.ResponseTracker @ 0xc4d957c8|            8 |             8
|- isTracking java.util.concurrent.atomic.AtomicBoolean @ 0xc4dfac38                |           16 |            16
|- runningCalls java.util.concurrent.ConcurrentHashMap @ 0xc4dfac48                 |           64 |           536
|- map java.util.concurrent.ConcurrentHashMap @ 0xc4dfac88                          |           64 |         2,456
|- hostToId java.util.concurrent.ConcurrentHashMap @ 0xc4dfacc8                     |           64 |   643,812,640
|- queue java.util.concurrent.ConcurrentLinkedQueue @ 0xc4dfad08                    |           24 |            24
|- lock java.util.concurrent.locks.ReentrantLock @ 0xc4dfad20                       |           16 |            16
'- Total: 7 entries                                                                 |              |              
-------------------------------------------------------------------------------------------------------------------
====
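
The retained heap is concentrated in the hostToId map inside ResponseTracker. As a minimal, purely illustrative sketch of how such a per-host tracking structure can exhaust a 1 GB heap (the class and method names below are assumptions for demonstration only, not the actual vdsm-jsonrpc-java sources; the fix shipped in vdsm-jsonrpc-java 1.3.15):

====
// Hypothetical stand-in for a per-host request tracker. Ids are added on
// every outgoing call but only removed when a response arrives, so a host
// that stops responding keeps accumulating entries until the heap is gone.
import java.util.Queue;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

public class LeakyTracker {
    private final ConcurrentHashMap<String, Queue<UUID>> hostToId = new ConcurrentHashMap<>();

    public void track(String host, UUID requestId) {
        // Called for every request sent to a host.
        hostToId.computeIfAbsent(host, h -> new ConcurrentLinkedQueue<>()).add(requestId);
    }

    public void onResponse(String host, UUID requestId) {
        // Entries are cleaned up only here; calls to a non-responsive
        // host never get a response and are never removed.
        Queue<UUID> ids = hostToId.get(host);
        if (ids != null) {
            ids.remove(requestId);
        }
    }

    public static void main(String[] args) {
        LeakyTracker tracker = new LeakyTracker();
        for (int i = 0; i < 1_000_000; i++) {
            tracker.track("host-1", UUID.randomUUID());   // no responses arrive
        }
        System.out.println("tracked ids: " + tracker.hostToId.get("host-1").size());
    }
}
====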

There was around 26 GB of memory available on the RHV-M server at the time of the issue.

Version-Release number of selected component (if applicable):

rhevm-4.1.2.3-0.1.el7.noarch

Additional info:

(Originally by Nijin Ashok)

Comment 5 rhev-integ 2017-10-19 14:05:19 UTC
Targeting to 4.2.0 for now; once we have investigated fully, we will reevaluate and retarget if needed.

(Originally by Martin Perina)

Comment 18 Pavol Brilla 2017-11-28 11:36:11 UTC
Tested:

Disaster Recovery Guide [PDF]
Ruby SDK Guide [PDF]

both online and pdf download working

Comment 19 Pavol Brilla 2017-11-28 11:36:43 UTC
Wrong bug

Comment 20 Petr Matyáš 2017-11-29 14:25:50 UTC
Looks good on ovirt-engine-4.1.8.1-0.1.el7.noarch with host on vdsm-4.19.40-1.el7ev

Comment 23 errata-xmlrpc 2017-12-12 09:21:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3411