Description of problem:

After running the engine for ~2 weeks, VMs freeze in the "waiting for launch" state, tasks pile up, and OutOfMemoryError errors start to appear in the engine logs.

System description (RHEV QE production instance):
- 9 hosts (RHEL 6.6)
- 298 VMs
- 5 networks
- 6 storage domains

Version-Release number of selected component (if applicable):
VT13.9
rhevm-3.5.0-0.31.el6ev.noarch

How reproducible:
100% after running for ~2 weeks

Steps to Reproduce:
1. Configure an engine with ~10 hosts / ~300 VMs (see full system description above)
2. Keep the engine running for ~2 weeks

Actual results:
The engine runs out of heap memory, tasks are stalled, and VMs freeze in the "waiting for launch" state.

Expected results:
The engine should not run out of memory.

Additional info:
The engine was running with the default settings when the failure was detected: -Xms1g -Xmx1g (it now runs with -Xms1g -Xmx2g).

Jmap output:

Attaching to process ID 8871, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 24.75-b04

using thread-local object allocation.
Parallel GC with 2 thread(s)

Heap Configuration:
   MinHeapFreeRatio = 0
   MaxHeapFreeRatio = 100
   MaxHeapSize      = 2147483648 (2048.0MB)
   NewSize          = 1310720 (1.25MB)
   MaxNewSize       = 17592186044415 MB
   OldSize          = 5439488 (5.1875MB)
   NewRatio         = 2
   SurvivorRatio    = 8
   PermSize         = 268435456 (256.0MB)
   MaxPermSize      = 268435456 (256.0MB)
   G1HeapRegionSize = 0 (0.0MB)

Heap Usage:
PS Young Generation
Eden Space:
   capacity = 342360064 (326.5MB)
   used     = 99419632 (94.81394958496094MB)
   free     = 242940432 (231.68605041503906MB)
   29.039494513004882% used
From Space:
   capacity = 7864320 (7.5MB)
   used     = 2985280 (2.84698486328125MB)
   free     = 4879040 (4.65301513671875MB)
   37.959798177083336% used
To Space:
   capacity = 7864320 (7.5MB)
   used     = 0 (0.0MB)
   free     = 7864320 (7.5MB)
   0.0% used
PS Old Generation
   capacity = 716177408 (683.0MB)
   used     = 305858544 (291.68943786621094MB)
   free     = 410318864 (391.31056213378906MB)
   42.70709192770292% used
PS Perm Generation
   capacity = 268435456 (256.0MB)
   used     = 122352264 (116.68421173095703MB)
   free     = 146083192 (139.31578826904297MB)
   45.57977020740509% used

52568 interned Strings occupying 5785464 bytes.

Log snippet:

Caused by: org.postgresql.util.PSQLException: Ran out of memory retrieving query results.
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1816)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:510)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:372)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:252)
        at org.jboss.jca.adapters.jdbc.WrappedStatement.executeQuery(WrappedStatement.java:344)
        at org.springframework.jdbc.core.JdbcTemplate$1QueryStatementCallback.doInStatement(JdbcTemplate.java:441) [spring-jdbc.jar:3.1.1.RELEASE]
        at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:396) [spring-jdbc.jar:3.1.1.RELEASE]
        ... 120 more
Caused by: java.lang.OutOfMemoryError: Java heap space
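Side note on the heap settings above: on an oVirt/RHEV-M 3.5 engine the JVM heap bounds are normally raised through a drop-in configuration file rather than by editing the shipped service defaults. A minimal sketch, assuming the stock ENGINE_HEAP_MIN/ENGINE_HEAP_MAX variables and the /etc/ovirt-engine/engine.conf.d/ override directory exist on this build (verify both against the installed defaults before relying on this):

# /etc/ovirt-engine/engine.conf.d/90-heap.conf   (file name is arbitrary)
ENGINE_HEAP_MIN=1g
ENGINE_HEAP_MAX=2g
# then restart the engine for the new limits to take effect:
# service ovirt-engine restart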
Created attachment 989353 [details] Jmap output (~2 days of monitoring)
Created attachment 989354 [details] engine log from the time of the failure
See also bug 1188971. If/when we fix this one, we might want to fine-tune that one.
Gil - the engine automatically creates a dump file when it gets an OutOfMemoryError. Please send me this dump file (it should be big...) so I can understand what's going on.
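(For reference, a HotSpot JVM only writes that dump automatically when -XX:+HeapDumpOnOutOfMemoryError is among its arguments, typically paired with -XX:HeapDumpPath. Whether and where the engine sets these should be checked in its JVM arguments on the affected machine; the path below is only a guess:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/ovirt-engine/dump
)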
A few things:

1. Performance-wise, we must not run the server with 1GB of heap. We talked about it - it's not enough for a performance environment, and you're running many VMs.

2. The query seems to originate from the UI. It's GetAllDiskSnapshotsByStorageDomainIdQuery.
2.1 Is that the URL your jmeter is activating? I can see it runs in 4 different threads, so either you have 4 simultaneous users or you're not waiting for a response to your URL call.
2.2 Eldad - correct me if I'm wrong, but isn't this the query whose performance we just fixed? That might help.
2.3 Gil - how many disk snapshots do you have?
2.4 In any case, I find it hard to believe that we suddenly have a bug around that area, or that the JDBC driver is leaking. So I don't think this is a real memory leak issue.

3. Piotr - I think the GWT you see originates from the same thread as the query. We have >800MB of byte[][], which is what the Postgres driver is bringing from the database. If we look into the objects, we can see that they are indeed the results of GetAllDiskSnapshotsByStorageDomainIdQuery.

So, in short, we need to understand the snapshot result size and how many concurrent queries are running. Quite likely this is not a memory leak issue, but rather a "not enough memory" issue.
I'm adding my findings to supplement Liran's information:

I briefly checked the heap snapshot and I can see a couple of local collections which consume most of the heap.

I can see that in the postgresql driver there is a Vector:
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1813)
        Local Variable: java.util.Vector#1663
There are 2 instances allocated, where one uses 527MB and the other 213MB. From the stack trace I can see one query: GetAllDiskSnapshotsByStorageDomainIdQuery

The other thing I see is an array of objects which takes 527MB:
        at com.google.gwt.rpc.server.RPC.invokeAndStreamResponse(RPC.java:196)
        Local Variable: java.lang.Object[]#233337
        Local Variable: com.google.gwt.rpc.server.WebModePayloadSink#3
        Local Variable: java.lang.reflect.Method#57084
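That Vector is consistent with the PostgreSQL JDBC driver's default behaviour: with autocommit on, it materializes the entire ResultSet in memory before returning it, so a query that selects tens of thousands of rows lands on the heap all at once. A minimal sketch of the streaming alternative (cursor-based fetching via setFetchSize with autocommit off), purely to illustrate the driver behaviour - the connection details and table name are placeholders, and this is not a claim about how the engine's DAO layer is or should be wired:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FetchSizeSketch {
    public static void main(String[] args) throws Exception {
        // Connection details are placeholders, not the engine's real datasource.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/engine", "engine", "secret")) {
            // Without these two calls the PostgreSQL driver buffers the whole
            // result set in memory (the java.util.Vector seen in the heap dump).
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT * FROM images")) {        // table name is illustrative only
                ps.setFetchSize(1000);                // stream rows in batches of 1000
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // process one row at a time; memory usage stays bounded
                    }
                }
            }
        }
    }
}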
Hi,
The function GetAllSnapshotsByStorageDomainId is not implemented correctly and returns 40K rows when only 200 are necessary. This can, of course, put massive stress on the engine's memory, as it tries to map all of these rows to objects. I'll create a patch.
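As a rough illustration of the defensive side of this (independent of the actual fix to the function above), Spring's JdbcTemplate can cap how many rows a single query is allowed to return, so a misbehaving query cannot exhaust the heap on its own. A hedged sketch only - the limits below are made up for the example and this is not the engine's code or the patch referenced above:

import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class BoundedQuerySketch {
    public static JdbcTemplate boundedTemplate(DataSource dataSource) {
        JdbcTemplate template = new JdbcTemplate(dataSource);
        // setMaxRows maps to java.sql.Statement#setMaxRows: the driver drops
        // anything beyond this count instead of handing it to the row mapper.
        template.setMaxRows(10000);
        // Optional guard against queries that hang rather than OOM.
        template.setQueryTimeout(30);
        return template;
    }
}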
The query now returns ~6K rows instead of 40K. After browsing to snapshots through the VM > Snapshots and Storage > Disk Snapshots tabs, I see a fast response, and the Postgres-related memory for QueryExecutorImpl is pretty low, ~8976 bytes.

Bug verified on top of 3.6.2.