Description of problem:

After running the engine for ~2 weeks, VMs freeze in the "waiting for launch" state, tasks pile up, and OutOfMemoryError errors start to appear in the engine logs.

System description (RHEV QE production instance):
- 9 hosts (RHEL 6.6)
- 298 VMs
- 5 networks
- 6 storage domains

Version-Release number of selected component (if applicable):
VT13.9
rhevm-3.5.0-0.31.el6ev.noarch

How reproducible:
100% after running for ~2 weeks

Steps to Reproduce:
1. Configure an engine with ~10 hosts / ~300 VMs (see full system description above)
2. Keep the engine running for ~2 weeks

Actual results:
The engine runs out of heap memory, tasks are stalled, and VMs freeze in the "waiting for launch" state.

Expected results:
The engine should not run out of memory.

Additional info:
The engine was running with the default settings when the failure was detected: -Xms1g -Xmx1g (it now runs with -Xms1g -Xmx2g).

Jmap output:

Attaching to process ID 8871, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 24.75-b04

using thread-local object allocation.
Parallel GC with 2 thread(s)

Heap Configuration:
   MinHeapFreeRatio = 0
   MaxHeapFreeRatio = 100
   MaxHeapSize      = 2147483648 (2048.0MB)
   NewSize          = 1310720 (1.25MB)
   MaxNewSize       = 17592186044415 MB
   OldSize          = 5439488 (5.1875MB)
   NewRatio         = 2
   SurvivorRatio    = 8
   PermSize         = 268435456 (256.0MB)
   MaxPermSize      = 268435456 (256.0MB)
   G1HeapRegionSize = 0 (0.0MB)

Heap Usage:
PS Young Generation
Eden Space:
   capacity = 342360064 (326.5MB)
   used     = 99419632 (94.81394958496094MB)
   free     = 242940432 (231.68605041503906MB)
   29.039494513004882% used
From Space:
   capacity = 7864320 (7.5MB)
   used     = 2985280 (2.84698486328125MB)
   free     = 4879040 (4.65301513671875MB)
   37.959798177083336% used
To Space:
   capacity = 7864320 (7.5MB)
   used     = 0 (0.0MB)
   free     = 7864320 (7.5MB)
   0.0% used
PS Old Generation
   capacity = 716177408 (683.0MB)
   used     = 305858544 (291.68943786621094MB)
   free     = 410318864 (391.31056213378906MB)
   42.70709192770292% used
PS Perm Generation
   capacity = 268435456 (256.0MB)
   used     = 122352264 (116.68421173095703MB)
   free     = 146083192 (139.31578826904297MB)
   45.57977020740509% used

52568 interned Strings occupying 5785464 bytes.

Log snippet:

Caused by: org.postgresql.util.PSQLException: Ran out of memory retrieving query results.
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1816)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:510)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:372)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:252)
        at org.jboss.jca.adapters.jdbc.WrappedStatement.executeQuery(WrappedStatement.java:344)
        at org.springframework.jdbc.core.JdbcTemplate$1QueryStatementCallback.doInStatement(JdbcTemplate.java:441) [spring-jdbc.jar:3.1.1.RELEASE]
        at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:396) [spring-jdbc.jar:3.1.1.RELEASE]
        ... 120 more
Caused by: java.lang.OutOfMemoryError: Java heap space
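Side note on the heap settings above: on an oVirt/RHEV-M 3.5 engine the JVM heap bounds are normally raised through a drop-in configuration file rather than by editing the shipped service defaults. A minimal sketch, assuming the stock ENGINE_HEAP_MIN/ENGINE_HEAP_MAX variables and the /etc/ovirt-engine/engine.conf.d/ override directory exist on this build (verify both against the installed defaults before relying on this):

# /etc/ovirt-engine/engine.conf.d/90-heap.conf   (file name is arbitrary)
ENGINE_HEAP_MIN=1g
ENGINE_HEAP_MAX=2g
# then restart the engine for the new limits to take effect:
# service ovirt-engine restart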
Created attachment 989353 [details] Jmap output (~2 days of monitoring)
Created attachment 989354 [details] engine log from the time of the failure
See also bug 1188971. If/when we fix this one, we might want to fine-tune that one.
Gil - the engine automatically creates a dump file when it gets an OutOfMemoryError. Please send me this dump file (it should be big...) so I can understand what's going on.
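(For reference, a HotSpot JVM only writes that dump automatically when -XX:+HeapDumpOnOutOfMemoryError is among its arguments, typically paired with -XX:HeapDumpPath. Whether and where the engine sets these should be checked in its JVM arguments on the affected machine; the path below is only a guess:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/ovirt-engine/dump
)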
A few things:

1. Performance-wise, we must not run the server with 1GB of heap. We talked about it - it's not enough for a performance environment, and you're running many VMs.

2. The query seems to originate from the UI. It's GetAllDiskSnapshotsByStorageDomainIdQuery.
2.1 Is that the URL your jmeter is activating? I can see it runs in 4 different threads, so either you have 4 simultaneous users or you're not waiting for a response to your URL call.
2.2 Eldad - correct me if I'm wrong, but isn't this the query whose performance we just fixed? That might help.
2.3 Gil - how many disk snapshots do you have?
2.4 In any case, I find it hard to believe that we suddenly have a bug around that area, or that the JDBC driver is leaking. So I don't think this is a real memory leak issue.

3. Piotr - I think the GWT you see originates from the same thread as the query. We have >800MB of byte[][], which is what the Postgres driver is bringing from the database. If we look into the objects, we can see that they are indeed the results of GetAllDiskSnapshotsByStorageDomainIdQuery.

So, in short, we need to understand the snapshot result size and how many concurrent queries are running. Quite likely this is not a memory leak issue, but rather a "not enough memory" issue.
I'm adding my findings to supplement Liran's information:

I briefly checked the heap snapshot and I can see a couple of local collections which consume most of the heap.

I can see that in the postgresql driver there is a Vector:
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1813)
        Local Variable: java.util.Vector#1663
There are 2 instances allocated, where one uses 527MB and the other 213MB. From the stack trace I can see one query: GetAllDiskSnapshotsByStorageDomainIdQuery

The other thing I see is an array of objects which takes 527MB:
        at com.google.gwt.rpc.server.RPC.invokeAndStreamResponse(RPC.java:196)
        Local Variable: java.lang.Object[]#233337
        Local Variable: com.google.gwt.rpc.server.WebModePayloadSink#3
        Local Variable: java.lang.reflect.Method#57084
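That Vector is consistent with the PostgreSQL JDBC driver's default behaviour: with autocommit on, it materializes the entire ResultSet in memory before returning it, so a query that selects tens of thousands of rows lands on the heap all at once. A minimal sketch of the streaming alternative (cursor-based fetching via setFetchSize with autocommit off), purely to illustrate the driver behaviour - the connection details and table name are placeholders, and this is not a claim about how the engine's DAO layer is or should be wired:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FetchSizeSketch {
    public static void main(String[] args) throws Exception {
        // Connection details are placeholders, not the engine's real datasource.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/engine", "engine", "secret")) {
            // Without these two calls the PostgreSQL driver buffers the whole
            // result set in memory (the java.util.Vector seen in the heap dump).
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT * FROM images")) {        // table name is illustrative only
                ps.setFetchSize(1000);                // stream rows in batches of 1000
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // process one row at a time; memory usage stays bounded
                    }
                }
            }
        }
    }
}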
Hi,
The function GetAllSnapshotsByStorageDomainId is not implemented correctly and returns 40K rows when only 200 are necessary. This can, of course, put massive stress on the engine's memory, as it tries to map all of these rows to objects. I'll create a patch.
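As a rough illustration of the defensive side of this (independent of the actual fix to the function above), Spring's JdbcTemplate can cap how many rows a single query is allowed to return, so a misbehaving query cannot exhaust the heap on its own. A hedged sketch only - the limits below are made up for the example and this is not the engine's code or the patch referenced above:

import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class BoundedQuerySketch {
    public static JdbcTemplate boundedTemplate(DataSource dataSource) {
        JdbcTemplate template = new JdbcTemplate(dataSource);
        // setMaxRows maps to java.sql.Statement#setMaxRows: the driver drops
        // anything beyond this count instead of handing it to the row mapper.
        template.setMaxRows(10000);
        // Optional guard against queries that hang rather than OOM.
        template.setQueryTimeout(30);
        return template;
    }
}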
The query now returns ~6K rows instead of 40K. After browsing to snapshots through the VM > Snapshots and Storage > Disk Snapshots tabs, I see a fast response, and the Postgres-related memory for QueryExecutorImpl is pretty low, ~8976 bytes.

Bug verified on top of 3.6.2.