Bug 1572604

Summary: Tomcat java.lang.OutOfMemoryError: GC overhead limit exceeded
Product: Red Hat Satellite
Component: Candlepin
Version: 6.2.14
Hardware: x86_64
OS: Linux
Severity: high
Priority: high
Status: CLOSED ERRATA
Reporter: Kenny Tordeurs <ktordeur>
Assignee: satellite6-bugs <satellite6-bugs>
QA Contact: Roman Plevka <rplevka>
CC: bcourt, cdonnell, cduryee, inecas, jhutar, jjansky, khowell, ktordeur, mmccune, molasaga, pcreech, rplevka
Keywords: Triaged
Target Milestone: Unspecified
Target Release: Unused
Fixed In Version: candlepin-2.1.18-1
Clones: 1584256 1596356 (view as bug list)
Bug Depends On: 1584256
Type: Bug
Last Closed: 2018-08-22 20:07:08 UTC
Attachments (all flags: none):
  cpu process
  database information
  gc_monitor
  heap old gen status
  start manifest import
  threads filter
  latest monitoring 1
  latest monitoring 2
  latest monitoring 3
  latest monitoring 4
  latest monitoring 5
  latest monitoring 6
  latest monitoring 7
  latest monitoring 8
  latest monitoring 9

Description Kenny Tordeurs 2018-04-27 11:57:41 UTC
Description of problem:
Tomcat is running out of memory; increasing the Java heap size and other parameters did not help.
We are facing an OOME in heap, and the shared GC logs indicate that Old gen is filling up, which is not expected, and an OOME (Java heap) occurs.

~~~
Apr 27, 2018 11:52:54 AM org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet [default] in context with path [/candlepin] threw exception
org.jboss.resteasy.spi.UnhandledException: java.lang.OutOfMemoryError: GC overhead limit exceeded
[...]
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.Arrays.copyOf(Arrays.java:2367)
[...]
~~~

Version-Release number of selected component (if applicable):
Satellite 6.2.14 with about 12k clients


Additional info:
Tried increasing the Java heap and disabling explicit GC, along with other parameters recommended by the JVM team:
~~~
 └─15529 java -Xms4096m -Xmx4096m -XX:+DisableExplicitGC -XX:InitialHeapSize=4294967296 -XX:MaxHeapSize=4294967296 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/satellite_backup/ -verbose:gc -Xloggc:/satellite_backup/gc_%p_%t.log -XX:+PrintGCCause -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -classpath /usr/share/tomcat/bin/bootstrap.jar:/usr/share/tomcat/bin/tomcat-juli.jar:/usr/share/java/commons-daemon.jar -Dcatalina.base=/usr/share/tomcat -Dcatalina.home=/usr/share/tomcat -Djava.endorsed.dirs= -Djava.io.tmpdir=/var/cache/tomcat/temp -Djava.util.logging.config.file=/usr/share/tomcat/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager org.apache.catalina.startup.Bootstrap start
~~~

It seems the application is building a Hibernate session that holds around 22 million objects, which makes the session very large.
Reducing the number of objects retained at the application end should help resolve the issue.

The heap dump shows that the majority of the heap is in use by nearly 22 million String objects stored in a single Hibernate session. That Hibernate session is locally referenced by a Quartz scheduler thread that is currently executing tasks and queries for Candlepin:
~~~
Thread 0x701f1b768
  at java.net.SocketInputStream.socketRead0(Ljava/io/FileDescriptor;[BIII)I (Native Method)
  at java.net.SocketInputStream.read([BIII)I (SocketInputStream.java:153)
  at java.net.SocketInputStream.read([BII)I (SocketInputStream.java:122)
  at java.net.ManagedSocketInputStreamAPSIHighPerformanceNew.read([BII)I (ManagedSocketInputStreamAPSIHighPerformanceNew.java:98)
  at org.postgresql.core.VisibleBufferedInputStream.readMore(I)Z (VisibleBufferedInputStream.java:145)
  at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(I)Z (VisibleBufferedInputStream.java:114)
  at org.postgresql.core.VisibleBufferedInputStream.read()I (VisibleBufferedInputStream.java:73)
  at org.postgresql.core.PGStream.ReceiveChar()I (PGStream.java:274)
  at org.postgresql.core.v3.QueryExecutorImpl.processResults(Lorg/postgresql/core/ResultHandler;I)V (QueryExecutorImpl.java:1660)
  at org.postgresql.core.v3.QueryExecutorImpl.execute(Lorg/postgresql/core/Query;Lorg/postgresql/core/ParameterList;Lorg/postgresql/core/ResultHandler;III)V (QueryExecutorImpl.java:257)
  at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(Lorg/postgresql/core/Query;Lorg/postgresql/core/ParameterList;I)V (AbstractJdbc2Statement.java:500)
  at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(I)Z (AbstractJdbc2Statement.java:388)
  at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery()Ljava/sql/ResultSet; (AbstractJdbc2Statement.java:273)
  at com.mchange.v2.c3p0.impl.NewProxyPreparedStatement.executeQuery()Ljava/sql/ResultSet; (NewProxyPreparedStatement.java:76)
  at org.hibernate.engine.jdbc.internal.ResultSetReturnImpl.extract(Ljava/sql/PreparedStatement;)Ljava/sql/ResultSet; (ResultSetReturnImpl.java:56)
  at org.hibernate.loader.Loader.getResultSet(Ljava/sql/PreparedStatement;Lorg/hibernate/engine/spi/RowSelection;Lorg/hibernate/dialect/pagination/LimitHandler;ZLorg/hibernate/engine/spi/SessionImplementor;)Ljava/sql/ResultSet; (Loader.java:2040)
  at org.hibernate.loader.Loader.executeQueryStatement(Ljava/lang/String;Lorg/hibernate/engine/spi/QueryParameters;ZLjava/util/List;Lorg/hibernate/engine/spi/SessionImplementor;)Lorg/hibernate/loader/Loader$SqlStatementWrapper; (Loader.java:1837)
  at org.hibernate.loader.Loader.executeQueryStatement(Lorg/hibernate/engine/spi/QueryParameters;ZLjava/util/List;Lorg/hibernate/engine/spi/SessionImplementor;)Lorg/hibernate/loader/Loader$SqlStatementWrapper; (Loader.java:1816)
  at org.hibernate.loader.Loader.doQuery(Lorg/hibernate/engine/spi/SessionImplementor;Lorg/hibernate/engine/spi/QueryParameters;ZLorg/hibernate/transform/ResultTransformer;)Ljava/util/List; (Loader.java:900)
  at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Lorg/hibernate/engine/spi/SessionImplementor;Lorg/hibernate/engine/spi/QueryParameters;ZLorg/hibernate/transform/ResultTransformer;)Ljava/util/List; (Loader.java:342)
  at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Lorg/hibernate/engine/spi/SessionImplementor;Lorg/hibernate/engine/spi/QueryParameters;Z)Ljava/util/List; (Loader.java:312)
  at org.hibernate.loader.Loader.loadCollection(Lorg/hibernate/engine/spi/SessionImplementor;Ljava/io/Serializable;Lorg/hibernate/type/Type;)V (Loader.java:2238)
  at org.hibernate.loader.collection.CollectionLoader.initialize(Ljava/io/Serializable;Lorg/hibernate/engine/spi/SessionImplementor;)V (CollectionLoader.java:65)
  at org.hibernate.persister.collection.AbstractCollectionPersister.initialize(Ljava/io/Serializable;Lorg/hibernate/engine/spi/SessionImplementor;)V (AbstractCollectionPersister.java:674)
  at org.hibernate.event.internal.DefaultInitializeCollectionEventListener.onInitializeCollection(Lorg/hibernate/event/spi/InitializeCollectionEvent;)V (DefaultInitializeCollectionEventListener.java:85)
  at org.hibernate.internal.SessionImpl.initializeCollection(Lorg/hibernate/collection/spi/PersistentCollection;Z)V (SessionImpl.java:1849)
  at org.hibernate.collection.internal.AbstractPersistentCollection$4.doWork()Ljava/lang/Object; (AbstractPersistentCollection.java:549)
  at org.hibernate.collection.internal.AbstractPersistentCollection.withTemporarySessionIfNeeded(Lorg/hibernate/collection/internal/AbstractPersistentCollection$LazyInitializationWork;)Ljava/lang/Object; (AbstractPersistentCollection.java:234)
  at org.hibernate.collection.internal.AbstractPersistentCollection.initialize(Z)V (AbstractPersistentCollection.java:545)
  at org.hibernate.collection.internal.AbstractPersistentCollection.read()V (AbstractPersistentCollection.java:124)
  at org.hibernate.collection.internal.AbstractPersistentCollection$1ExtraLazyElementByIndexReader.doWork()Ljava/lang/Object; (AbstractPersistentCollection.java:339)
  at org.hibernate.collection.internal.AbstractPersistentCollection.withTemporarySessionIfNeeded(Lorg/hibernate/collection/internal/AbstractPersistentCollection$LazyInitializationWork;)Ljava/lang/Object; (AbstractPersistentCollection.java:234)
  at org.hibernate.collection.internal.AbstractPersistentCollection.readElementByIndex(Ljava/lang/Object;)Ljava/lang/Object; (AbstractPersistentCollection.java:347)
  at org.hibernate.collection.internal.PersistentMap.get(Ljava/lang/Object;)Ljava/lang/Object; (PersistentMap.java:165)
  at org.candlepin.model.Consumer.getFact(Ljava/lang/String;)Ljava/lang/String; (Consumer.java:359)
  at org.candlepin.model.StatisticCurator.consumersPerSocketCount(Lorg/candlepin/model/Owner;)V (StatisticCurator.java:116)
  at org.candlepin.model.StatisticCurator.executeStatisticRun()V (StatisticCurator.java:83)
  at org.candlepin.pinsetter.tasks.StatisticHistoryTask.toExecute(Lorg/quartz/JobExecutionContext;)V (StatisticHistoryTask.java:54)
  at org.candlepin.pinsetter.tasks.KingpinJob.execute(Lorg/quartz/JobExecutionContext;)V (KingpinJob.java:81)
  at org.candlepin.pinsetter.core.GuiceJobFactory$JobScopeDecorator.execute(Lorg/quartz/JobExecutionContext;)V (GuiceJobFactory.java:93)
  at org.quartz.core.JobRunShell.run()V (JobRunShell.java:202)
~~~
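
For illustration only (a minimal sketch, not the actual Candlepin fix): the general mitigation for this kind of growth is to keep the Hibernate persistence context bounded while iterating over a large result set, flushing and clearing it periodically instead of keeping every loaded entity attached. The entity name, query, and method names below are assumptions for the sake of the example.

~~~
// Illustrative sketch only; not the Candlepin patch. Assumes Hibernate 4.x
// and a mapped Consumer entity. The point is that flush()/clear() every few
// hundred rows keeps the session from accumulating millions of objects.
import org.hibernate.ScrollMode;
import org.hibernate.ScrollableResults;
import org.hibernate.Session;

public class StatisticRunSketch {
    private static final int BATCH_SIZE = 500;

    public void countConsumers(Session session, String ownerId) {
        ScrollableResults results = session
            .createQuery("from Consumer c where c.owner.id = :oid")
            .setParameter("oid", ownerId)
            .setReadOnly(true)                    // the statistics run never modifies these rows
            .scroll(ScrollMode.FORWARD_ONLY);     // stream rows instead of loading the whole list

        int processed = 0;
        while (results.next()) {
            Object consumer = results.get(0);
            // ... read the facts needed for the statistic from 'consumer' here ...

            if (++processed % BATCH_SIZE == 0) {
                session.flush();                  // push any pending work to the database
                session.clear();                  // detach processed entities so the session stays small
            }
        }
        results.close();
    }
}
~~~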

Comment 8 Jan Jansky 2018-04-30 12:31:47 UTC
Created attachment 1428810 [details]
cpu process

Comment 9 Jan Jansky 2018-04-30 12:32:20 UTC
Created attachment 1428811 [details]
database information

Comment 10 Jan Jansky 2018-04-30 12:32:40 UTC
Created attachment 1428812 [details]
gc_monitor

Comment 11 Jan Jansky 2018-04-30 12:33:10 UTC
Created attachment 1428813 [details]
heap old gen status

Comment 12 Jan Jansky 2018-04-30 12:33:46 UTC
Created attachment 1428814 [details]
start manifest import

Comment 13 Jan Jansky 2018-04-30 12:34:06 UTC
Created attachment 1428815 [details]
threads filter

Comment 14 Jan Jansky 2018-04-30 12:50:23 UTC
From the consultant on site:

Just checked the table QRTZ_TRIGGERS to see the configured jobs.
All of them have the default values (I checked those values in the Candlepin source code, looking for DEFAULT_SCHEDULE).

One thing though:
There is one job (CancelJobJob) in ACQUIRED state; the rest of them are WAITING.
I can't see the ACQUIRED status in the state list used in candlepin.rb:

  while Time.now <= timeout_limit
    job = res.ok_content
    case job[:state]
    when "CREATED", "PENDING", "WAITING", "RUNNING"
      sleep 1
      res = get_job(:job_id => opts[:job_id])
    when "FINISHED", "CANCELED", "FAILED"
      return res
    else
      raise RuntimeError.new("Unknown job state returned.")
    end
  end
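
(Side note, a sketch only, assuming Quartz 2.x: ACQUIRED in QRTZ_TRIGGERS is a transient row state meaning a scheduler thread has picked the trigger up to fire it, and the JDBC job store reports it back through the API as a normal state; the candlepin.rb list above appears to track Candlepin job states rather than Quartz trigger states. Something like the following would dump what the scheduler itself reports.)

~~~
// Hedged sketch, assuming the Quartz 2.x API: list trigger states as the
// scheduler sees them, rather than reading QRTZ_TRIGGERS directly. The JDBC
// job store maps internal row states such as ACQUIRED to TriggerState.NORMAL.
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.Trigger;
import org.quartz.TriggerKey;
import org.quartz.impl.matchers.GroupMatcher;

public class TriggerStateSketch {
    public void dumpTriggerStates(Scheduler scheduler) throws SchedulerException {
        for (String group : scheduler.getTriggerGroupNames()) {
            for (TriggerKey key : scheduler.getTriggerKeys(GroupMatcher.triggerGroupEquals(group))) {
                Trigger.TriggerState state = scheduler.getTriggerState(key);
                System.out.println(key + " -> " + state);
            }
        }
    }
}
~~~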

Comment 19 Jan Jansky 2018-05-11 12:54:04 UTC
Hello Team,
we just got new information from the customer; I am uploading screenshots from their monitoring tool.

The most important finding is an SQL query that is taking about 10,000,000 ms.


SELECT POOL0_.ID AS ID1_36_0_, POOL0_.CREATED AS CREATED2_36_0_, POOL0_.UPDATED AS UPDATED3_36_0_, POOL0_.ACCOUNTNUMBER AS ACCOUNTN4_36_0_, POOL0_.ACTIVESUBSCRIPTION AS ACTIVESU5_36_0_, POOL0_.QUANTITY_CONSUMED AS QUANTITY6_36_0_, POOL0_.CONTRACTNUMBER AS CONTRACT7_36_0_, POOL0_.DERIVEDPRODUCTID AS DERIVEDP8_36_0_, POOL0_.DERIVEDPRODUCTNAME AS DERIVEDP9_36_0_, POOL0_.ENDDATE AS ENDDATE10_36_0_, POOL0_.QUANTITY_EXPORTED AS QUANTIT11_36_0_, POOL0_.ORDERNUMBER AS ORDERNU12_36_0_, POOL0_.OWNER_ID AS OWNER_I20_36_0_, POOL0_.PRODUCTID AS PRODUCT13_36_0_, POOL0_.PRODUCTNAME AS PRODUCT14_36_0_, POOL0_.QUANTITY AS QUANTIT15_36_0_, POOL0_.RESTRICTEDTOUSERNAME AS RESTRIC16_36_0_, POOL0_.SOURCEENTITLEMENT_ID AS SOURCEE21_36_0_, POOL0_.STARTDATE AS STARTDA17_36_0_, POOL0_.TYPE AS TYPE18_36_0_, POOL0_.VERSION AS VERSION19_36_0_ FROM CP_POOL POOL0_ WHERE POOL0_.ID = ? FOR UPDATE

Visible in image_08.png.

Can you please help identify this query? At first glance it looks like a Candlepin query checking a subscription, but I am not sure why it is being executed.

We need to identify the query and either disable it, if it is not required, or improve its behavior (execution time).

Comment 20 Jan Jansky 2018-05-11 12:55:41 UTC
Created attachment 1434877 [details]
latest monitoring 1

Comment 21 Jan Jansky 2018-05-11 12:56:02 UTC
Created attachment 1434878 [details]
latest monitoring 2

Comment 22 Jan Jansky 2018-05-11 12:56:29 UTC
Created attachment 1434879 [details]
latest monitoring 3

Comment 23 Jan Jansky 2018-05-11 12:56:46 UTC
Created attachment 1434880 [details]
latest monitoring 4

Comment 24 Jan Jansky 2018-05-11 12:57:07 UTC
Created attachment 1434881 [details]
latest monitoring 5

Comment 25 Jan Jansky 2018-05-11 12:57:33 UTC
Created attachment 1434882 [details]
latest monitoring 6

Comment 26 Jan Jansky 2018-05-11 12:57:53 UTC
Created attachment 1434883 [details]
latest monitoring 7

Comment 27 Jan Jansky 2018-05-11 12:58:09 UTC
Created attachment 1434884 [details]
latest monitoring 8

Comment 28 Jan Jansky 2018-05-11 12:59:45 UTC
Created attachment 1434885 [details]
latest monitoring 9

Comment 30 Barnaby Court 2018-05-14 19:31:51 UTC
Locking a pool for update can be blocked by any number of things: a hypervisor check-in, a manifest update, or other systems in the process of binding to that pool. If something else was holding the lock and never giving it up, I would have expected a lock-wait-timeout exception once any one thing held the lock for too long. Were the parameters of the Satellite system updated to change the default lock-wait timeout, which should be set to about 1 minute?
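
(For context, an illustrative sketch rather than Candlepin's actual code: a "SELECT ... FOR UPDATE" on cp_pool like the one in comment 19 is what Hibernate emits when the pool entity is loaded with a pessimistic write lock, and the statement then blocks until whichever transaction currently holds that row lock commits or rolls back.)

~~~
// Illustrative only: how a Hibernate pessimistic lock produces
// "SELECT ... FROM cp_pool WHERE id = ? FOR UPDATE". The entity name and
// call site are stand-ins; the real Candlepin code path differs.
import org.hibernate.LockMode;
import org.hibernate.LockOptions;
import org.hibernate.Session;

public class PoolLockSketch {
    public Object lockPoolForUpdate(Session session, String poolId) {
        // PESSIMISTIC_WRITE makes Hibernate append FOR UPDATE to the select,
        // taking a row-level lock that is held until the transaction ends.
        return session.get("org.candlepin.model.Pool", poolId,
                new LockOptions(LockMode.PESSIMISTIC_WRITE));
    }
}
~~~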

Comment 31 Kenny Tordeurs 2018-05-15 08:57:19 UTC
@bcourt

Are you talking about the PostgreSQL setting, which should be shown with something like:

# su - postgres -c "psql -c \"show all;\"" | grep lock

If not, can you specify where this timeout is set so we can follow up on it?

Thanks

Comment 33 Barnaby Court 2018-05-15 17:53:39 UTC
A statement timeout of 0 means infinite, per https://www.postgresql.org/docs/9.2/static/runtime-config-client.html.

I would expect you have a lot of items queued up and it is just taking time to work through them. Pool locks on Satellite 6.3 are held for dramatically shorter periods than in 6.2, so my expectation is that, unless you are trying to spin up hundreds of VMs concurrently using this activation key, things would be fixed by upgrading to Satellite 6.3.x.

Comment 34 Kenny Tordeurs 2018-05-16 07:27:46 UTC
@Barnaby

Not sure where we would see the improvement in the pool locks on Satellite 6.3: is it in the code base rather than a PostgreSQL setting?

[root@provisioning ~]# rpm -qa | grep satellite-6
~~~
satellite-6.3.1-3.el7sat.noarch
~~~

[root@provisioning ~]#  su - postgres -c "psql -c \"show all;\"" | grep statement_timeout
~~~
 statement_timeout               | 0                                   | Sets the maximum allowed duration of any statement.
~~~

[root@provisioning ~]#  su - postgres -c "psql -c \"show all;\"" | grep lock
~~~
 block_size                      | 8192                                | Shows the size of a disk block.
 deadlock_timeout                | 1s                                  | Sets the time to wait on a lock before checking for deadlock.
 log_lock_waits                  | off                                 | Logs long lock waits.
 max_locks_per_transaction       | 64                                  | Sets the maximum number of locks per transaction.
 max_pred_locks_per_transaction  | 64                                  | Sets the maximum number of predicate locks per transaction.
 wal_block_size                  | 8192                                | Shows the block size in the write ahead log.
~~~

#  su - postgres -c "psql -c \"show all;\"" | grep "timeout"
~~~
 archive_timeout                 | 0                                   | Forces a switch to the next xlog file if a new file has not been started within N seconds.
 authentication_timeout          | 1min                                | Sets the maximum allowed time to complete client authentication.
 checkpoint_timeout              | 5min                                | Sets the maximum time between automatic WAL checkpoints.
 deadlock_timeout                | 1s                                  | Sets the time to wait on a lock before checking for deadlock.
 replication_timeout             | 1min                                | Sets the maximum time to wait for WAL replication.
 statement_timeout               | 0                                   | Sets the maximum allowed duration of any statement.
~~~

Comment 35 Barnaby Court 2018-05-16 14:10:14 UTC
@Kenny The changes in 6.3 that hold locks for shorter periods of time are entirely code based. We put a lot of work into optimization, moving a lot of logic that was previously performed inside the lock to take place outside of it, specifically around binding to a pool and the case where many consumers are trying to attach to a single pool.
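
(An illustration of that design change, sketched with hypothetical helper methods, not the real Candlepin code: the expensive entitlement work moves out from under the pool row lock, so the lock is only held for the short quantity update.)

~~~
// Sketch of "do the expensive work outside the lock". lockPool(), readPool(),
// computeEntitlement() and applyQuantityChange() are hypothetical helpers
// stubbed out so the sketch compiles; they stand in for the real logic.
public class BindSketch {

    // 6.2-style: the pool row lock (SELECT ... FOR UPDATE) is held for the
    // whole computation, so every other consumer binding to the same pool waits.
    public void bindHoldingLock(String poolId, String consumerId) {
        Object pool = lockPool(poolId);                                  // row locked here
        Object entitlement = computeEntitlement(pool, consumerId);       // slow rules work under the lock
        applyQuantityChange(pool, entitlement);
    }

    // 6.3-style: compute first without the lock, then take the lock only long
    // enough to re-check and apply the quantity change before committing.
    public void bindWithShortLock(String poolId, String consumerId) {
        Object entitlement = computeEntitlement(readPool(poolId), consumerId); // no lock held yet
        Object pool = lockPool(poolId);                                  // row locked briefly
        applyQuantityChange(pool, entitlement);                          // quick update; commit releases lock
    }

    // Hypothetical stubs so the sketch is self-contained.
    private Object readPool(String poolId) { return new Object(); }
    private Object lockPool(String poolId) { return new Object(); }
    private Object computeEntitlement(Object pool, String consumerId) { return new Object(); }
    private void applyQuantityChange(Object pool, Object entitlement) { }
}
~~~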

Comment 57 Kevin Howell 2018-06-07 21:02:57 UTC
See the cloned BZs for the fixed-in versions of the various hotfix branches.

Comment 71 Mike McCune 2018-08-10 16:48:09 UTC
(scratch prior comment, attachment was too large for bugzilla)

*** Satellite 6.3.2 Hotfix Available ***

1) Download the hotfix RPMs:

wget http://people.redhat.com/~mmccune/hotfix/candlepin-2.1.18-1.el7.noarch.rpm
wget http://people.redhat.com/~mmccune/hotfix/candlepin-selinux-2.1.18-1.el7.noarch.rpm

2) Verify sha256:

sha256sum candlepin-*.rpm

3c4c308f3ec1dcd969222dad546d72f273fbb10cf596d72941fedc29812f6afc  candlepin-2.1.18-1.el7.noarch.rpm
76ccab8cdead86de727c1a81d40cddeec7902495517eec9f3295c5ce6bba0fb9  candlepin-selinux-2.1.18-1.el7.noarch.rpm


3) Copy the RPMs to the Satellite server and install them: rpm -Uvh candlepin*.rpm

4) Restart services: katello-service restart

5) Resume operations.

Comment 73 errata-xmlrpc 2018-08-22 20:07:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2550