Bug 1193803
Summary: | manifest "imported successfully", but corresponding task ended with warning: RestClient::RequestTimeout: Katello::Resources::Candlepin::Owner: Request Timeout (POST /candlepin/owners/Default_Organization/imports) | ||
---|---|---|---|
Product: | Red Hat Satellite | Reporter: | Jan Hutař <jhutar> |
Component: | Candlepin | Assignee: | satellite6-bugs <satellite6-bugs> |
Status: | CLOSED ERRATA | QA Contact: | jcallaha |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 6.1.0 | CC: | alikins, bbuckingham, daviddavis, dgoodwin, jcallaha, jhutar, kbidarka, mmccune, pablo.iranzo, pmoravec, sreber |
Target Milestone: | Unspecified | Keywords: | Triaged |
Target Release: | Unused | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2016-07-27 09:13:33 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1313551 | ||
Bug Blocks: |
Description
Jan Hutař
2015-02-18 09:26:00 UTC
Since this issue was entered in Red Hat Bugzilla, the release flag has been set to ? to ensure that it is properly evaluated for this release. There are cases where manifest refresh causes traceback (and reports success (!)). Anyway, after this I do not see all the products. Is there a katello-debug output available for the sat6 system (in particular, the candlepin logs)? Looking on the beaker box atm, is there a timeframe this happened? Looking in logs, I see an import start at 2015-03-17 21:01:11,032 [req=6b754f9e-dbfd-4f5d-b64f-e144a1ae5d75, org=] INFO org.candlepin.common.filter.LoggingFilter - Request: verb=POST, uri=/candlepin/owners/Default_Organization/imports and then (roughly 100 secs later): 2015-03-17 21:02:49,190 [req=6b754f9e-dbfd-4f5d-b64f-e144a1ae5d75, org=Default_Organization] INFO org.candlepin.common.filter.LoggingFilter - Response: status=204, content-type="null", time=98167 Could the client RestClient::RequestTimeout: Katello::Resources::Candlepin::Owner: Request Timeout be timing out in 100 secs? and then eventually (just shy of 5 minutes later): 2015-03-17 21:03:49,707 [=, org=] ERROR org.candlepin.audit.AMQPBusPublisher - Unable to send event: Event [id=null, t arget=SUBSCRIPTION, type=MODIFIED, time=Tue Mar 17 21:02:45 EDT 2015, entity=8a90c2e94c2a5a4e014c2a63df19005a] javax.jms.JMSException: Exception when sending message:timed out waiting for completion at org.apache.qpid.client.BasicMessageProducer_0_10.sendMessage(BasicMessageProducer_0_10.java:260) ~[qpid-cli ent-0.30.redhat-1.jar:0.30.redhat-1] at org.apache.qpid.client.BasicMessageProducer.sendImpl(BasicMessageProducer.java:528) ~[qpid-client-0.30.redh at-1.jar:0.30.redhat-1] at org.apache.qpid.client.BasicMessageProducer.send(BasicMessageProducer.java:326) ~[qpid-client-0.30.redhat-1 .jar:0.30.redhat-1] at org.apache.qpid.client.TopicPublisherAdapter.send(TopicPublisherAdapter.java:120) ~[qpid-client-0.30.redhat -1.jar:0.30.redhat-1] at org.candlepin.audit.AMQPBusPublisher.onEvent(AMQPBusPublisher.java:60) ~[AMQPBusPublisher.class:na] at org.candlepin.audit.ListenerWrapper.onMessage(ListenerWrapper.java:52) [ListenerWrapper.class:na] at org.hornetq.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1114) [hornetq-core-c lient-2.3.5.Final-redhat-2.jar:2.3.5.Final-redhat-2] at org.hornetq.core.client.impl.ClientConsumerImpl.access$500(ClientConsumerImpl.java:57) [hornetq-core-client -2.3.5.Final-redhat-2.jar:2.3.5.Final-redhat-2] at org.hornetq.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1249) [hornetq-core-clie nt-2.3.5.Final-redhat-2.jar:2.3.5.Final-redhat-2] at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:106) [hornetq-co re-client-2.3.5.Final-redhat-2.jar:2.3.5.Final-redhat-2] The import related jobs seem to continue for a while though. 2015-03-17 21:03:49,707 [=, org=] ERROR org.candlepin.audit.AMQPBusPublisher - Unable to send event: Event [id=null, t arget=SUBSCRIPTION, type=MODIFIED, time=Tue Mar 17 21:02:45 EDT 2015, entity=8a90c2e94c2a5a4e014c2a63df19005a] The time referenced in the amqp errors corresponds to roughly the time subscriptions start getting merged and modified in the import 2015-03-17 21:01:11,032 [req=6b754f9e-dbfd-4f5d-b64f-e144a1ae5d75, org=] INFO org.candlepin.common.filter.LoggingFilter - Request: verb=POST, uri=/candlepin/owners/Default_Organization/imports 2015-03-17 21:01:11,087 [req=6b754f9e-dbfd-4f5d-b64f-e144a1ae5d75, org=Default_Organization] INFO org.candlepin.resource.OwnerResource - Importing archive /var/cache/tomcat6/temp/pfx416775695858992529sfx for owner Default Organization 2015-03-17 21:01:11,142 [req=6b754f9e-dbfd-4f5d-b64f-e144a1ae5d75, org=Default_Organization] INFO org.candlepin.sync.RulesImporter - Ignoring older rules in manifest, current version: 5.14 new version: 5.12 2015-03-17 21:02:45,754 [req=6b754f9e-dbfd-4f5d-b64f-e144a1ae5d75, org=Default_Organization] INFO org.candlepin.sync.EntitlementImporter - Merging subscription for incoming entitlement id [8a85f98744edf2cd0144f90252445a6c] into subscription with existing entitlement id [8a85f98744edf2cd0144f90252445a6c]. Entitlement id match. 2015-03-17 21:02:45,945 [req=6b754f9e-dbfd-4f5d-b64f-e144a1ae5d75, org=Default_Organization] INFO org.candlepin.sync.EntitlementImporter - Creating new subscription for incoming entitlement with id [8a85f98244800ff20144f90254613ac1] 2015-03-17 21:02:45,949 [req=6b754f9e-dbfd-4f5d-b64f-e144a1ae5d75, org=Default_Organization] INFO org.candlepin.sync.EntitlementImporter - Merging subscription for incoming entitlement id [8a85f98644ee0e4e0144f90256bd2168] into subscription with existing entitlement id [8a85f98644ee0e4e0144f90256bd2168]. Entitlement id match. Failed events shouldn't be the cause of this BZ. They should be queued up and resent once the message bus is back online. The Db transaction should not be affected. Yeah AMQP events shouldn't be related to this. (must be a separate issue, possibly caused by the tomcat/qpid restart issues mstead was working on recently. This is really easy to reproduce even with a standalone candlepin on a fresh db, it took over a minute here as well. The import API needs to be async, however, for this fresh manifest to take that long, there might be a very obvious performance fix available to correct this immediately. I can't immediately see what the issue is from logs they're quiet when the CPU is going nuts. I would get this hooked up to a profiler ASAP and see where the time is going, I bet it's an easy fix. We have been unable to isolate any quick fixes for the performance, importing/comparing all those products and content (due to the use of multiple employee subscriptions) is taking too long for the request timeout. We do not believe that this is an issue that should affect a customer as regular subscriptions are considerably smaller than this. We also have some efforts underway which may help the situation, or at the very least require us to fix it twice in two different ways. Given these points we're requesting the severity of this to be lowered and deferred to a future release. We will move the story back to our backlog and pick up once the very large "per org products" refactor is in place. Would this be acceptable? If this never is seen in front of customers we can consider deferring it. What we really don't want to do is block all internal testing of Satellite 6 because people can't work around this. That said, it doesn't seem to happen that often during our 6.1 beta testing. If it really starts to block forward progress I'd advocate for bumping the priority but for now we can put this into 6.2. https://github.com/candlepin/candlepin/tree/alikins/1193803_manifest_import_speed has an experimental patch that seems to help some. After doing some profiling of the server doing manifest imports, the main time sinks for this case are: 1. Importing Contents and Products. Contents go pretty quick, but products take about a second each, and in the test manifest, there are ~200 products. So about 2mins is spent there. Most of the time seems to be in hibernate book keeping, doing 'dirty' checks on the objects in the transaction. This seems to be caused by the db read mode, and the large transaction (~8000 statements for the attached manifest). The patch above adds some calls to *Curator.refresh() after creating new persisted objects, to force the session to know it has the most recent version of freshly created object and avoid the high number of dirty checks. This speeds up the Content and Product load times from about 1 sec each, to about 1 sec total. So that cuts total manifest import time by about 2 mins. 2. Pool refreshes and sub generation. Once the Products, Contents, and Subscriptions are imported, they still need to be refreshed, and that takes up most of the remaining time. Pablo, Are you able to duplicate this issue with Satellite 6.1.7? We added a number of changes specifically to speed up manifest import. I have been able to duplicate this issue. It will require converting an API endpoint from synchronous to async in the upstream Candlepin project and updating katello to use this new API endpoint. I have created an RFE for the new feature in Candlepin. *** Bug 1193835 has been marked as a duplicate of this bug. *** *** Bug 1212120 has been marked as a duplicate of this bug. *** We moved the default timeout to be much higher in 6.1.6, this is also in 6.2 I dont think this is fixed - see: https://bugzilla.redhat.com/show_bug.cgi?id=1329091 Every time I install Sat6.2 beta and import manifest with Employee SKU, I hit that bug. Verified in Satellite 6.2 beta Snap 15.2 After 15 manifest imports and 20 refreshes, this error has not presented itself. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1501 |