Bug 1572604
Summary: Tomcat java.lang.OutOfMemoryError: GC overhead limit exceeded

Product: Red Hat Satellite
Component: Candlepin
Version: 6.2.14
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Kenny Tordeurs <ktordeur>
Assignee: satellite6-bugs <satellite6-bugs>
QA Contact: Roman Plevka <rplevka>
CC: bcourt, cdonnell, cduryee, inecas, jhutar, jjansky, khowell, ktordeur, mmccune, molasaga, pcreech, rplevka
Target Milestone: Unspecified
Target Release: Unused
Keywords: Triaged
Fixed In Version: candlepin-2.1.18-1
Doc Type: If docs needed, set a value
Clones: 1584256 1596356 (view as bug list)
Bug Depends On: 1584256
Type: Bug
Last Closed: 2018-08-22 20:07:08 UTC
Description
Kenny Tordeurs
2018-04-27 11:57:41 UTC
Created attachment 1428810 [details]
cpu process
Created attachment 1428811 [details]
database information
Created attachment 1428812 [details]
gc_monitor
Created attachment 1428813 [details]
heap old gen status
Created attachment 1428814 [details]
start manifest import
Created attachment 1428815 [details]
threads filter
From the consultant on site:

Just checked the table QRTZ_TRIGGERS to see the configured jobs. All of them have the default values (I checked those values in the Candlepin source code, looking for DEFAULT_SCHEDULE). One thing though: there is one job (CancelJobJob) in ACQUIRED state; the rest of them are WAITING. I can't see the ACQUIRED status in the status list written in candlepin.rb:

~~~
while Time.now <= timeout_limit
  job = res.ok_content
  case job[:state]
  when "CREATED", "PENDING", "WAITING", "RUNNING"
    sleep 1
    res = get_job(:job_id => opts[:job_id])
  when "FINISHED", "CANCELED", "FAILED"
    return res
  else
    raise RuntimeError.new("Unknown job state returned.")
  end
end
~~~

Hello Team, we just got new information from the customer; I am uploading pictures from their monitoring tool. The most important finding is a SQL query which is taking 10,000,000 ms (visible in image_08.png):

~~~
SELECT POOL0_.ID AS ID1_36_0_, POOL0_.CREATED AS CREATED2_36_0_,
       POOL0_.UPDATED AS UPDATED3_36_0_, POOL0_.ACCOUNTNUMBER AS ACCOUNTN4_36_0_,
       POOL0_.ACTIVESUBSCRIPTION AS ACTIVESU5_36_0_, POOL0_.QUANTITY_CONSUMED AS QUANTITY6_36_0_,
       POOL0_.CONTRACTNUMBER AS CONTRACT7_36_0_, POOL0_.DERIVEDPRODUCTID AS DERIVEDP8_36_0_,
       POOL0_.DERIVEDPRODUCTNAME AS DERIVEDP9_36_0_, POOL0_.ENDDATE AS ENDDATE10_36_0_,
       POOL0_.QUANTITY_EXPORTED AS QUANTIT11_36_0_, POOL0_.ORDERNUMBER AS ORDERNU12_36_0_,
       POOL0_.OWNER_ID AS OWNER_I20_36_0_, POOL0_.PRODUCTID AS PRODUCT13_36_0_,
       POOL0_.PRODUCTNAME AS PRODUCT14_36_0_, POOL0_.QUANTITY AS QUANTIT15_36_0_,
       POOL0_.RESTRICTEDTOUSERNAME AS RESTRIC16_36_0_, POOL0_.SOURCEENTITLEMENT_ID AS SOURCEE21_36_0_,
       POOL0_.STARTDATE AS STARTDA17_36_0_, POOL0_.TYPE AS TYPE18_36_0_,
       POOL0_.VERSION AS VERSION19_36_0_
FROM CP_POOL POOL0_
WHERE POOL0_.ID = ?
FOR UPDATE
~~~

Can you please help identify this query? At first view it seems to be a Candlepin query checking a subscription, but I am not sure why it is executed. We need to identify the query and either disable it if it is not required or improve its behavior (execution time).

Created attachment 1434877 [details]
latest monitoring 1
Created attachment 1434878 [details]
latest monitoring 2
Created attachment 1434879 [details]
latest monitoring 3
Created attachment 1434880 [details]
latest monitoring 4
Created attachment 1434881 [details]
latest monitoring 5
Created attachment 1434882 [details]
latest monitoring 6
Created attachment 1434883 [details]
latest monitoring 7
Created attachment 1434884 [details]
latest monitoring 8
Created attachment 1434885 [details]
latest monitoring 9
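For illustration, the candlepin.rb wait loop quoted earlier can be sketched with Quartz's ACQUIRED state treated as "still in progress" rather than falling through to the unknown-state error. This is a minimal, self-contained sketch, not the actual Candlepin client: `get_job` here is a stub that returns states from an array instead of polling the REST API.

```ruby
# Sketch of the candlepin.rb wait loop, extended so that Quartz's
# ACQUIRED state is handled like the other in-progress states.
# get_job is a stub standing in for the real REST call (hypothetical).
def get_job(queue)
  { :state => queue.shift }
end

def wait_for_job(queue, timeout_seconds)
  timeout_limit = Time.now + timeout_seconds
  res = get_job(queue)
  while Time.now <= timeout_limit
    case res[:state]
    when "CREATED", "PENDING", "WAITING", "ACQUIRED", "RUNNING"
      sleep 0.01               # the real loop sleeps 1 second between polls
      res = get_job(queue)
    when "FINISHED", "CANCELED", "FAILED"
      return res
    else
      raise RuntimeError.new("Unknown job state returned: #{res[:state]}")
    end
  end
  raise RuntimeError.new("Timed out waiting for job")
end

puts wait_for_job(["ACQUIRED", "RUNNING", "FINISHED"], 10)[:state]  # => FINISHED
```

Whether ACQUIRED should be treated as in-progress is exactly the open question in the consultant's comment; the sketch only shows where such a state would slot into the loop.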
Locking a pool for update can be blocked by any number of things: a hypervisor check-in, a manifest update, or other systems in the process of binding to that pool. If something else was holding the lock and never giving it up, I would have expected a lock-wait-timeout exception once any one thing held the lock for too long. Were the parameters of the Satellite system updated to change the default lock-wait timeout, which should be set to roughly 1 minute?

@bcourt Are you talking about the PostgreSQL setting that should be shown with something like:

~~~
# su - postgres -c "psql -c \"show all;\"" | grep lock
~~~

If not, can you specify where this timeout is set so we can follow up on it? Thanks

A statement timeout of 0 means infinite, per https://www.postgresql.org/docs/9.2/static/runtime-config-client.html. I would expect you have a lot of items getting queued up and it is just taking time to work through them. Pool locks on Satellite 6.3 are held for dramatically shorter periods than in 6.2, so my expectation is that unless you are trying to spin up hundreds of VMs concurrently using this activation key, things would be fixed by upgrading to Satellite 6.3.x.

@Barnaby Not sure where we would see the improvement in the pool locks on Satellite 6.3: is it in the code base rather than a PostgreSQL setting?

~~~
[root@provisioning ~]# rpm -qa | grep satellite-6
satellite-6.3.1-3.el7sat.noarch

[root@provisioning ~]# su - postgres -c "psql -c \"show all;\"" | grep statement_timeout
statement_timeout | 0 | Sets the maximum allowed duration of any statement.

[root@provisioning ~]# su - postgres -c "psql -c \"show all;\"" | grep lock
block_size                     | 8192 | Shows the size of a disk block.
deadlock_timeout               | 1s   | Sets the time to wait on a lock before checking for deadlock.
log_lock_waits                 | off  | Logs long lock waits.
max_locks_per_transaction      | 64   | Sets the maximum number of locks per transaction.
max_pred_locks_per_transaction | 64   | Sets the maximum number of predicate locks per transaction.
wal_block_size                 | 8192 | Shows the block size in the write ahead log.

[root@provisioning ~]# su - postgres -c "psql -c \"show all;\"" | grep timeout
archive_timeout        | 0    | Forces a switch to the next xlog file if a new file has not been started within N seconds.
authentication_timeout | 1min | Sets the maximum allowed time to complete client authentication.
checkpoint_timeout     | 5min | Sets the maximum time between automatic WAL checkpoints.
deadlock_timeout       | 1s   | Sets the time to wait on a lock before checking for deadlock.
replication_timeout    | 1min | Sets the maximum time to wait for WAL replication.
statement_timeout      | 0    | Sets the maximum allowed duration of any statement.
~~~

@Kenny The changes in 6.3 to hold locks for smaller periods of time are entirely code based. We put a lot of work into optimization, moving a lot of logic that was previously performed inside of a lock to take place outside of it, specifically around binding to a pool and the case where many consumers are trying to attach to a single pool.

See the cloned BZs for the fixed-in version on the various hotfix branches. (Scratch the prior comment; the attachment was too large for Bugzilla.)

*** Satellite 6.3.2 Hotfix Available ***

1) Download the hotfix rpms:

~~~
wget http://people.redhat.com/~mmccune/hotfix/candlepin-2.1.18-1.el7.noarch.rpm
wget http://people.redhat.com/~mmccune/hotfix/candlepin-selinux-2.1.18-1.el7.noarch.rpm
~~~

2) Verify the sha256 sums:

~~~
sha256sum candlepin-*.rpm
3c4c308f3ec1dcd969222dad546d72f273fbb10cf596d72941fedc29812f6afc  candlepin-2.1.18-1.el7.noarch.rpm
76ccab8cdead86de727c1a81d40cddeec7902495517eec9f3295c5ce6bba0fb9  candlepin-selinux-2.1.18-1.el7.noarch.rpm
~~~

3) Copy the packages to the Satellite server and run: rpm -Uvh candlepin*.rpm
4) katello-service restart
5) Resume operations

Since the problem described in this bug report should be resolved by a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2550
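As an aside on step 2 of the hotfix instructions above: the SHA-256 comparison can be sketched in Ruby using only the standard library. The published sums are taken verbatim from the hotfix comment; `verify_rpm` is a hypothetical helper name, not part of any Satellite or Candlepin tooling.

```ruby
require 'digest'

# Published SHA-256 sums from the hotfix comment, keyed by file name.
EXPECTED_SUMS = {
  "candlepin-2.1.18-1.el7.noarch.rpm" =>
    "3c4c308f3ec1dcd969222dad546d72f273fbb10cf596d72941fedc29812f6afc",
  "candlepin-selinux-2.1.18-1.el7.noarch.rpm" =>
    "76ccab8cdead86de727c1a81d40cddeec7902495517eec9f3295c5ce6bba0fb9",
}

# Hypothetical helper: true when the file on disk matches the published
# sum for its base name; raises KeyError for an unlisted file name.
def verify_rpm(path)
  expected = EXPECTED_SUMS.fetch(File.basename(path))
  Digest::SHA256.file(path).hexdigest == expected
end
```

Running a check like this against each downloaded rpm before step 3 rejects a corrupted or tampered download instead of installing it.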