Description of problem:
Tomcat is running out of memory. Tried raising the Java heap size and other parameters without success. Facing the OOME in heap, and the shared GC logs indicate that the old gen is getting full, which is not expected, and an OOME (Java heap) is occurring.
~~~
Apr 27, 2018 11:52:54 AM org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet [default] in context with path [/candlepin] threw exception
org.jboss.resteasy.spi.UnhandledException: java.lang.OutOfMemoryError: GC overhead limit exceeded
[...]
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOf(Arrays.java:2367)
[...]
~~~

Version-Release number of selected component (if applicable):
Satellite 6.2.14 with about 12k clients

Additional info:
Tried raising the Java heap and disabling explicit GC and other parameters as recommended by the JVM team:
~~~
└─15529 java -Xms4096m -Xmx4096m -XX:+DisableExplicitGC -XX:InitialHeapSize=4294967296 -XX:MaxHeapSize=4294967296 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/satellite_backup/ -verbose:gc -Xloggc:/satellite_backup/gc_%p_%t.log -XX:+PrintGCCause -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -classpath /usr/share/tomcat/bin/bootstrap.jar:/usr/share/tomcat/bin/tomcat-juli.jar:/usr/share/java/commons-daemon.jar -Dcatalina.base=/usr/share/tomcat -Dcatalina.home=/usr/share/tomcat -Djava.endorsed.dirs= -Djava.io.tmpdir=/var/cache/tomcat/temp -Djava.util.logging.config.file=/usr/share/tomcat/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager org.apache.catalina.startup.Bootstrap start
~~~

It seems the application on the customer end is creating a Hibernate session that accumulates around 22 million objects, which makes the session size very large. Reducing the number of objects held at the application end might help resolve the issue.
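With the -Xloggc flags in the command line above, old-gen pressure can be cross-checked from the GC log itself (a diagnostic sketch; the log path follows the -Xloggc option, with %p/%t expanded by the JVM at startup):

```shell
# Count Full GC events; frequent Full GCs that reclaim little space
# are consistent with "GC overhead limit exceeded".
grep -c "Full GC" /satellite_backup/gc_*.log

# Show the most recent Full GC entries to see how much of the old gen
# survives each collection.
grep "Full GC" /satellite_backup/gc_*.log | tail -n 5
```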
In the heap dump, the majority of the heap is in use by nearly 22 million String objects stored in a single Hibernate session. That Hibernate session is locally referenced by a Quartz scheduler thread currently executing tasks and queries for Candlepin:
~~~
Thread 0x701f1b768
  at java.net.SocketInputStream.socketRead0(Ljava/io/FileDescriptor;[BIII)I (Native Method)
  at java.net.SocketInputStream.read([BIII)I (SocketInputStream.java:153)
  at java.net.SocketInputStream.read([BII)I (SocketInputStream.java:122)
  at java.net.ManagedSocketInputStreamAPSIHighPerformanceNew.read([BII)I (ManagedSocketInputStreamAPSIHighPerformanceNew.java:98)
  at org.postgresql.core.VisibleBufferedInputStream.readMore(I)Z (VisibleBufferedInputStream.java:145)
  at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(I)Z (VisibleBufferedInputStream.java:114)
  at org.postgresql.core.VisibleBufferedInputStream.read()I (VisibleBufferedInputStream.java:73)
  at org.postgresql.core.PGStream.ReceiveChar()I (PGStream.java:274)
  at org.postgresql.core.v3.QueryExecutorImpl.processResults(Lorg/postgresql/core/ResultHandler;I)V (QueryExecutorImpl.java:1660)
  at org.postgresql.core.v3.QueryExecutorImpl.execute(Lorg/postgresql/core/Query;Lorg/postgresql/core/ParameterList;Lorg/postgresql/core/ResultHandler;III)V (QueryExecutorImpl.java:257)
  at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(Lorg/postgresql/core/Query;Lorg/postgresql/core/ParameterList;I)V (AbstractJdbc2Statement.java:500)
  at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(I)Z (AbstractJdbc2Statement.java:388)
  at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery()Ljava/sql/ResultSet; (AbstractJdbc2Statement.java:273)
  at com.mchange.v2.c3p0.impl.NewProxyPreparedStatement.executeQuery()Ljava/sql/ResultSet; (NewProxyPreparedStatement.java:76)
  at org.hibernate.engine.jdbc.internal.ResultSetReturnImpl.extract(Ljava/sql/PreparedStatement;)Ljava/sql/ResultSet; (ResultSetReturnImpl.java:56)
  at org.hibernate.loader.Loader.getResultSet(Ljava/sql/PreparedStatement;Lorg/hibernate/engine/spi/RowSelection;Lorg/hibernate/dialect/pagination/LimitHandler;ZLorg/hibernate/engine/spi/SessionImplementor;)Ljava/sql/ResultSet; (Loader.java:2040)
  at org.hibernate.loader.Loader.executeQueryStatement(Ljava/lang/String;Lorg/hibernate/engine/spi/QueryParameters;ZLjava/util/List;Lorg/hibernate/engine/spi/SessionImplementor;)Lorg/hibernate/loader/Loader$SqlStatementWrapper; (Loader.java:1837)
  at org.hibernate.loader.Loader.executeQueryStatement(Lorg/hibernate/engine/spi/QueryParameters;ZLjava/util/List;Lorg/hibernate/engine/spi/SessionImplementor;)Lorg/hibernate/loader/Loader$SqlStatementWrapper; (Loader.java:1816)
  at org.hibernate.loader.Loader.doQuery(Lorg/hibernate/engine/spi/SessionImplementor;Lorg/hibernate/engine/spi/QueryParameters;ZLorg/hibernate/transform/ResultTransformer;)Ljava/util/List; (Loader.java:900)
  at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Lorg/hibernate/engine/spi/SessionImplementor;Lorg/hibernate/engine/spi/QueryParameters;ZLorg/hibernate/transform/ResultTransformer;)Ljava/util/List; (Loader.java:342)
  at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Lorg/hibernate/engine/spi/SessionImplementor;Lorg/hibernate/engine/spi/QueryParameters;Z)Ljava/util/List; (Loader.java:312)
  at org.hibernate.loader.Loader.loadCollection(Lorg/hibernate/engine/spi/SessionImplementor;Ljava/io/Serializable;Lorg/hibernate/type/Type;)V (Loader.java:2238)
  at org.hibernate.loader.collection.CollectionLoader.initialize(Ljava/io/Serializable;Lorg/hibernate/engine/spi/SessionImplementor;)V (CollectionLoader.java:65)
  at org.hibernate.persister.collection.AbstractCollectionPersister.initialize(Ljava/io/Serializable;Lorg/hibernate/engine/spi/SessionImplementor;)V (AbstractCollectionPersister.java:674)
  at org.hibernate.event.internal.DefaultInitializeCollectionEventListener.onInitializeCollection(Lorg/hibernate/event/spi/InitializeCollectionEvent;)V (DefaultInitializeCollectionEventListener.java:85)
  at org.hibernate.internal.SessionImpl.initializeCollection(Lorg/hibernate/collection/spi/PersistentCollection;Z)V (SessionImpl.java:1849)
  at org.hibernate.collection.internal.AbstractPersistentCollection$4.doWork()Ljava/lang/Object; (AbstractPersistentCollection.java:549)
  at org.hibernate.collection.internal.AbstractPersistentCollection.withTemporarySessionIfNeeded(Lorg/hibernate/collection/internal/AbstractPersistentCollection$LazyInitializationWork;)Ljava/lang/Object; (AbstractPersistentCollection.java:234)
  at org.hibernate.collection.internal.AbstractPersistentCollection.initialize(Z)V (AbstractPersistentCollection.java:545)
  at org.hibernate.collection.internal.AbstractPersistentCollection.read()V (AbstractPersistentCollection.java:124)
  at org.hibernate.collection.internal.AbstractPersistentCollection$1ExtraLazyElementByIndexReader.doWork()Ljava/lang/Object; (AbstractPersistentCollection.java:339)
  at org.hibernate.collection.internal.AbstractPersistentCollection.withTemporarySessionIfNeeded(Lorg/hibernate/collection/internal/AbstractPersistentCollection$LazyInitializationWork;)Ljava/lang/Object; (AbstractPersistentCollection.java:234)
  at org.hibernate.collection.internal.AbstractPersistentCollection.readElementByIndex(Ljava/lang/Object;)Ljava/lang/Object; (AbstractPersistentCollection.java:347)
  at org.hibernate.collection.internal.PersistentMap.get(Ljava/lang/Object;)Ljava/lang/Object; (PersistentMap.java:165)
  at org.candlepin.model.Consumer.getFact(Ljava/lang/String;)Ljava/lang/String; (Consumer.java:359)
  at org.candlepin.model.StatisticCurator.consumersPerSocketCount(Lorg/candlepin/model/Owner;)V (StatisticCurator.java:116)
  at org.candlepin.model.StatisticCurator.executeStatisticRun()V (StatisticCurator.java:83)
  at org.candlepin.pinsetter.tasks.StatisticHistoryTask.toExecute(Lorg/quartz/JobExecutionContext;)V (StatisticHistoryTask.java:54)
  at org.candlepin.pinsetter.tasks.KingpinJob.execute(Lorg/quartz/JobExecutionContext;)V (KingpinJob.java:81)
  at org.candlepin.pinsetter.core.GuiceJobFactory$JobScopeDecorator.execute(Lorg/quartz/JobExecutionContext;)V (GuiceJobFactory.java:93)
  at org.quartz.core.JobRunShell.run()V (JobRunShell.java:202)
~~~
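The dominant objects can also be confirmed directly on the running JVM with standard JDK tools, without waiting for the OOME heap dump (a diagnostic sketch; 15529 is the Tomcat PID from the process listing above, and the dump path is an example):

```shell
# Print a histogram of live heap objects, largest classes first;
# with ~22M strings in one session, char[] and java.lang.String
# should dominate the top rows.
jmap -histo:live 15529 | head -n 20

# Alternatively, take a full heap dump for offline analysis
# (e.g. with Eclipse MAT).
jmap -dump:live,format=b,file=/satellite_backup/manual_heapdump.hprof 15529
```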
Created attachment 1428810 [details] cpu process
Created attachment 1428811 [details] database information
Created attachment 1428812 [details] gc_monitor
Created attachment 1428813 [details] heap old gen status
Created attachment 1428814 [details] start manifest import
Created attachment 1428815 [details] threads filter
From consultant on site: Just checked the table QRTZ_TRIGGERS to see the configured jobs. All of them have the default values (I checked those values in the Candlepin source code, looking for DEFAULT_SCHEDULE). One thing though: there is one job (CancelJobJob) in ACQUIRED state; the rest of them are WAITING. I can't see the ACQUIRED status in the status list written in candlepin.rb:
~~~
while Time.now <= timeout_limit
  job = res.ok_content
  case job[:state]
  when "CREATED", "PENDING", "WAITING", "RUNNING"
    sleep 1
    res = get_job(:job_id => opts[:job_id])
  when "FINISHED", "CANCELED", "FAILED"
    return res
  else
    raise RuntimeError.new("Unknown job state returned.")
  end
end
~~~
Hello Team, we just got new information from the customer; I am uploading pictures from their monitoring tool. The most important finding is that there is an SQL query which is taking 10,000,000 ms:
~~~
SELECT POOL0_.ID AS ID1_36_0_,
       POOL0_.CREATED AS CREATED2_36_0_,
       POOL0_.UPDATED AS UPDATED3_36_0_,
       POOL0_.ACCOUNTNUMBER AS ACCOUNTN4_36_0_,
       POOL0_.ACTIVESUBSCRIPTION AS ACTIVESU5_36_0_,
       POOL0_.QUANTITY_CONSUMED AS QUANTITY6_36_0_,
       POOL0_.CONTRACTNUMBER AS CONTRACT7_36_0_,
       POOL0_.DERIVEDPRODUCTID AS DERIVEDP8_36_0_,
       POOL0_.DERIVEDPRODUCTNAME AS DERIVEDP9_36_0_,
       POOL0_.ENDDATE AS ENDDATE10_36_0_,
       POOL0_.QUANTITY_EXPORTED AS QUANTIT11_36_0_,
       POOL0_.ORDERNUMBER AS ORDERNU12_36_0_,
       POOL0_.OWNER_ID AS OWNER_I20_36_0_,
       POOL0_.PRODUCTID AS PRODUCT13_36_0_,
       POOL0_.PRODUCTNAME AS PRODUCT14_36_0_,
       POOL0_.QUANTITY AS QUANTIT15_36_0_,
       POOL0_.RESTRICTEDTOUSERNAME AS RESTRIC16_36_0_,
       POOL0_.SOURCEENTITLEMENT_ID AS SOURCEE21_36_0_,
       POOL0_.STARTDATE AS STARTDA17_36_0_,
       POOL0_.TYPE AS TYPE18_36_0_,
       POOL0_.VERSION AS VERSION19_36_0_
FROM CP_POOL POOL0_
WHERE POOL0_.ID = ?
FOR UPDATE
~~~
(visible in image_08.png)

Can you please help with identification of the query? At first view it looks like a Candlepin query checking a subscription, but I am not sure why it is executed. We need to identify the query and disable it if it is not required, or improve its behavior (execution time).
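To see which database session is running the slow FOR UPDATE statement and how long it has been active, the PostgreSQL activity view can be queried on the Satellite's database (a diagnostic sketch for the PostgreSQL 9.2 that ships with this Satellite; the `waiting` boolean is true when a backend is blocked on a lock):

```shell
# List active queries ordered by runtime, longest first.
su - postgres -c "psql -c \"
  SELECT pid, now() - query_start AS runtime, waiting, query
  FROM pg_stat_activity
  WHERE state = 'active'
  ORDER BY runtime DESC;\""
```

If the FOR UPDATE query shows `waiting = t`, it is stuck behind another transaction's row lock rather than doing work itself.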
Created attachment 1434877 [details] latest monitoring 1
Created attachment 1434878 [details] latest monitoring 2
Created attachment 1434879 [details] latest monitoring 3
Created attachment 1434880 [details] latest monitoring 4
Created attachment 1434881 [details] latest monitoring 5
Created attachment 1434882 [details] latest monitoring 6
Created attachment 1434883 [details] latest monitoring 7
Created attachment 1434884 [details] latest monitoring 8
Created attachment 1434885 [details] latest monitoring 9
Locking a pool for update can be blocked by any number of things: hypervisor check-in, a manifest update, or other systems in the process of binding to that pool. If something else was holding the lock and never giving it up, I would have expected a lock-wait-timeout exception once any one thing held the lock for too long. Were the parameters of the Satellite system updated to change the default lock-wait timeout, which should be set to roughly 1 minute?
@bcourt Are you talking about the one for Postgres that should be shown with something like:
~~~
# su - postgres -c "psql -c \"show all;\"" | grep lock
~~~
If not, can you specify where this timeout is set so we can follow up on it? Thanks
A statement_timeout of 0 means infinite, per https://www.postgresql.org/docs/9.2/static/runtime-config-client.html. I would expect you have a lot of items getting queued up and it is just taking time to work through them. Pool locks on Satellite 6.3 are held for dramatically shorter periods than in 6.2, so my expectation is that unless you are trying to spin up hundreds of concurrent VMs using this activation key, things would be fixed by upgrading to Satellite 6.3.x.
@Barnaby Not sure where we would see the improvement in the pool locks on Satellite 6.3. Is the improvement in the code base rather than in a PostgreSQL setting?
~~~
[root@provisioning ~]# rpm -qa | grep satellite-6
satellite-6.3.1-3.el7sat.noarch

[root@provisioning ~]# su - postgres -c "psql -c \"show all;\"" | grep statement_timeout
statement_timeout | 0 | Sets the maximum allowed duration of any statement.

[root@provisioning ~]# su - postgres -c "psql -c \"show all;\"" | grep lock
block_size | 8192 | Shows the size of a disk block.
deadlock_timeout | 1s | Sets the time to wait on a lock before checking for deadlock.
log_lock_waits | off | Logs long lock waits.
max_locks_per_transaction | 64 | Sets the maximum number of locks per transaction.
max_pred_locks_per_transaction | 64 | Sets the maximum number of predicate locks per transaction.
wal_block_size | 8192 | Shows the block size in the write ahead log.

[root@provisioning ~]# su - postgres -c "psql -c \"show all;\"" | grep "timeout"
archive_timeout | 0 | Forces a switch to the next xlog file if a new file has not been started within N seconds.
authentication_timeout | 1min | Sets the maximum allowed time to complete client authentication.
checkpoint_timeout | 5min | Sets the maximum time between automatic WAL checkpoints.
deadlock_timeout | 1s | Sets the time to wait on a lock before checking for deadlock.
replication_timeout | 1min | Sets the maximum time to wait for WAL replication.
statement_timeout | 0 | Sets the maximum allowed duration of any statement.
~~~
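Since log_lock_waits shows as off in the output above, one option for making long lock waits visible in the PostgreSQL log is to enable it and reload (a sketch, assuming the default RHEL 7 data directory; a message is logged whenever a session waits on a lock longer than deadlock_timeout):

```shell
# In /var/lib/pgsql/data/postgresql.conf (path may vary):
#   log_lock_waits = on      # log sessions waiting longer than deadlock_timeout
#   deadlock_timeout = 1s    # already the value shown above

# Apply the change without restarting the server:
su - postgres -c "pg_ctl reload -D /var/lib/pgsql/data"
```

This would show which query is holding the row lock whenever the FOR UPDATE statement blocks.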
@Kenny The changes in 6.3 to hold locks for shorter periods of time are entirely code-based. We put a lot of work into optimization, moving much of the logic that was previously performed inside a lock to take place outside of it, specifically around binding to a pool and the case where many consumers are trying to attach to a single pool.
See cloned BZs for the fixed-in-version for various hotfix branches
(scratch prior comment, attachment was too large for Bugzilla)

*** Satellite 6.3.2 Hotfix Available ***

1) Download hotfix rpms:
~~~
wget http://people.redhat.com/~mmccune/hotfix/candlepin-2.1.18-1.el7.noarch.rpm
wget http://people.redhat.com/~mmccune/hotfix/candlepin-selinux-2.1.18-1.el7.noarch.rpm
~~~
2) Verify sha256:
~~~
sha256sum candlepin-*.rpm
3c4c308f3ec1dcd969222dad546d72f273fbb10cf596d72941fedc29812f6afc  candlepin-2.1.18-1.el7.noarch.rpm
76ccab8cdead86de727c1a81d40cddeec7902495517eec9f3295c5ce6bba0fb9  candlepin-selinux-2.1.18-1.el7.noarch.rpm
~~~
3) Copy to the Satellite server and install: rpm -Uvh candlepin*.rpm
4) katello-service restart
5) Resume operations
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2550