Bug 1193803

Summary:	manifest "imported successfully", but corresponding task ended with warning: RestClient::RequestTimeout: Katello::Resources::Candlepin::Owner: Request Timeout (POST /candlepin/owners/Default_Organization/imports)
Product:	Red Hat Satellite	Reporter:	Jan Hutař <jhutar>
Component:	Candlepin	Assignee:	satellite6-bugs <satellite6-bugs>
Status:	CLOSED ERRATA	QA Contact:	jcallaha
Severity:	high	Docs Contact:
Priority:	high
Version:	6.1.0	CC:	alikins, bbuckingham, daviddavis, dgoodwin, jcallaha, jhutar, kbidarka, mmccune, pablo.iranzo, pmoravec, sreber
Target Milestone:	Unspecified	Keywords:	Triaged
Target Release:	Unused
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-07-27 09:13:33 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1313551
Bug Blocks:

Description Jan Hutař 2015-02-18 09:26:00 UTC

Description of problem:
Although manifest was "imported successfully", but corresponding task ended with warning: "RestClient::RequestTimeout: Katello::Resources::Candlepin::Owner: Request Timeout  (POST /candlepin/owners/Default_Organization/imports)"


Version-Release number of selected component (if applicable):
Satellite-6.1.0-RHEL-6-20150217.0-Satellite-x86_64


How reproducible:
always


Steps to Reproduce:
1. Install Satellite and import manifest
   Content -> Red Hat Subscriptions -> [Manage Manifest] -> [Browse]+[Upload]
   and wait for manifest upload status being set to "imported successfully"
2. Check corresponding task result
   Monitor -> Tasks -> Import Manifest organization 'Default Organization'


Actual results:
Task have:
  State: stopped
  Result: warning
and when you click on "Errors" tab, you get:
  Action:
    Actions::Katello::Provider::ManifestImport
  Exception:
    RestClient::RequestTimeout: Katello::Resources::Candlepin::Owner: Request Timeout  (POST /candlepin/owners/Default_Organization/imports)


Expected results:
Task should pass, or manifest upload status should not be pass


Additional info:
System had this HW: 8 GB RAM, 8 CPU (Quad-Core AMD Opteron(tm) Processor 8389)

Comment 5 RHEL Program Management 2015-02-18 20:06:33 UTC

Since this issue was entered in Red Hat Bugzilla, the release flag has been
set to ? to ensure that it is properly evaluated for this release.

Comment 7 Jan Hutař 2015-02-25 13:18:33 UTC

There are cases where manifest refresh causes traceback (and reports success (!)). Anyway, after this I do not see all the products.

Comment 8 Adrian Likins 2015-03-18 17:02:40 UTC

Is there a katello-debug output available for the sat6 system (in particular, the candlepin logs)?

Comment 9 Adrian Likins 2015-03-18 17:08:29 UTC

Looking on the beaker box atm, is there a timeframe this happened?

Comment 10 Adrian Likins 2015-03-18 17:29:59 UTC

Looking in logs, I see an import start at 

2015-03-17 21:01:11,032 [req=6b754f9e-dbfd-4f5d-b64f-e144a1ae5d75, org=] INFO  org.candlepin.common.filter.LoggingFilter - Request: verb=POST, uri=/candlepin/owners/Default_Organization/imports


and then (roughly 100 secs later):
2015-03-17 21:02:49,190 [req=6b754f9e-dbfd-4f5d-b64f-e144a1ae5d75, org=Default_Organization] INFO  org.candlepin.common.filter.LoggingFilter - Response: status=204, content-type="null", time=98167

Could the client RestClient::RequestTimeout: Katello::Resources::Candlepin::Owner: Request Timeout be timing out in 100 secs?


and then eventually (just shy of 5 minutes later):


2015-03-17 21:03:49,707 [=, org=] ERROR org.candlepin.audit.AMQPBusPublisher - Unable to send event: Event [id=null, t
arget=SUBSCRIPTION, type=MODIFIED, time=Tue Mar 17 21:02:45 EDT 2015, entity=8a90c2e94c2a5a4e014c2a63df19005a]
javax.jms.JMSException: Exception when sending message:timed out waiting for completion
        at org.apache.qpid.client.BasicMessageProducer_0_10.sendMessage(BasicMessageProducer_0_10.java:260) ~[qpid-cli
ent-0.30.redhat-1.jar:0.30.redhat-1]
        at org.apache.qpid.client.BasicMessageProducer.sendImpl(BasicMessageProducer.java:528) ~[qpid-client-0.30.redh
at-1.jar:0.30.redhat-1]
        at org.apache.qpid.client.BasicMessageProducer.send(BasicMessageProducer.java:326) ~[qpid-client-0.30.redhat-1
.jar:0.30.redhat-1]
        at org.apache.qpid.client.TopicPublisherAdapter.send(TopicPublisherAdapter.java:120) ~[qpid-client-0.30.redhat
-1.jar:0.30.redhat-1]
        at org.candlepin.audit.AMQPBusPublisher.onEvent(AMQPBusPublisher.java:60) ~[AMQPBusPublisher.class:na]
        at org.candlepin.audit.ListenerWrapper.onMessage(ListenerWrapper.java:52) [ListenerWrapper.class:na]
        at org.hornetq.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1114) [hornetq-core-c
lient-2.3.5.Final-redhat-2.jar:2.3.5.Final-redhat-2]
        at org.hornetq.core.client.impl.ClientConsumerImpl.access$500(ClientConsumerImpl.java:57) [hornetq-core-client
-2.3.5.Final-redhat-2.jar:2.3.5.Final-redhat-2]
        at org.hornetq.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1249) [hornetq-core-clie
nt-2.3.5.Final-redhat-2.jar:2.3.5.Final-redhat-2]
        at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:106) [hornetq-co
re-client-2.3.5.Final-redhat-2.jar:2.3.5.Final-redhat-2]

The import related jobs seem to continue for a while though.

Comment 11 Adrian Likins 2015-03-18 17:57:48 UTC

2015-03-17 21:03:49,707 [=, org=] ERROR org.candlepin.audit.AMQPBusPublisher - Unable to send event: Event [id=null, t
arget=SUBSCRIPTION, type=MODIFIED, time=Tue Mar 17 21:02:45 EDT 2015, entity=8a90c2e94c2a5a4e014c2a63df19005a]

The time referenced in the amqp errors corresponds to roughly the time subscriptions start getting merged and modified in the import

2015-03-17 21:01:11,032 [req=6b754f9e-dbfd-4f5d-b64f-e144a1ae5d75, org=] INFO  org.candlepin.common.filter.LoggingFilter - Request: verb=POST, uri=/candlepin/owners/Default_Organization/imports
2015-03-17 21:01:11,087 [req=6b754f9e-dbfd-4f5d-b64f-e144a1ae5d75, org=Default_Organization] INFO  org.candlepin.resource.OwnerResource - Importing archive /var/cache/tomcat6/temp/pfx416775695858992529sfx for owner Default Organization
2015-03-17 21:01:11,142 [req=6b754f9e-dbfd-4f5d-b64f-e144a1ae5d75, org=Default_Organization] INFO  org.candlepin.sync.RulesImporter - Ignoring older rules in manifest, current version: 5.14 new version: 5.12
2015-03-17 21:02:45,754 [req=6b754f9e-dbfd-4f5d-b64f-e144a1ae5d75, org=Default_Organization] INFO  org.candlepin.sync.EntitlementImporter - Merging subscription for incoming entitlement id [8a85f98744edf2cd0144f90252445a6c] into subscription with existing entitlement id [8a85f98744edf2cd0144f90252445a6c]. Entitlement id match.
2015-03-17 21:02:45,945 [req=6b754f9e-dbfd-4f5d-b64f-e144a1ae5d75, org=Default_Organization] INFO  org.candlepin.sync.EntitlementImporter - Creating new subscription for incoming entitlement with id [8a85f98244800ff20144f90254613ac1]
2015-03-17 21:02:45,949 [req=6b754f9e-dbfd-4f5d-b64f-e144a1ae5d75, org=Default_Organization] INFO  org.candlepin.sync.EntitlementImporter - Merging subscription for incoming entitlement id [8a85f98644ee0e4e0144f90256bd2168] into subscription with existing entitlement id [8a85f98644ee0e4e0144f90256bd2168]. Entitlement id match.

Comment 12 Michael Stead 2015-03-18 18:42:04 UTC

Failed events shouldn't be the cause of this BZ. They should be queued up and resent once the message bus is back online. The Db transaction should not be affected.

Comment 13 Devan Goodwin 2015-03-20 18:05:00 UTC

Yeah AMQP events shouldn't be related to this. (must be a separate issue, possibly caused by the tomcat/qpid restart issues mstead was working on recently.

This is really easy to reproduce even with a standalone candlepin on a fresh db, it took over a minute here as well.

The import API needs to be async, however, for this fresh manifest to take that long, there might be a very obvious performance fix available to correct this immediately. I can't immediately see what the issue is from logs they're quiet when the CPU is going nuts. I would get this hooked up to a profiler ASAP and see where the time is going, I bet it's an easy fix.

Comment 15 Devan Goodwin 2015-03-25 13:34:11 UTC

We have been unable to isolate any quick fixes for the performance, importing/comparing all those products and content (due to the use of multiple employee subscriptions) is taking too long for the request timeout. 

We do not believe that this is an issue that should affect a customer as regular subscriptions are considerably smaller than this.

We also have some efforts underway which may help the situation, or at the very least require us to fix it twice in two different ways.

Given these points we're requesting the severity of this to be lowered and deferred to a future release. We will move the story back to our backlog and pick up once the very large "per org products" refactor is in place.

Would this be acceptable?

Comment 16 Mike McCune 2015-03-26 18:27:20 UTC

If this never is seen in front of customers we can consider deferring it. What we really don't want to do is block all internal testing of Satellite 6 because people can't work around this.

That said, it doesn't seem to happen that often during our 6.1 beta testing.

If it really starts to block forward progress I'd advocate for bumping the priority but for now we can put this into 6.2.

Comment 17 Adrian Likins 2015-04-02 14:10:54 UTC

https://github.com/candlepin/candlepin/tree/alikins/1193803_manifest_import_speed
has an experimental patch that seems to help some.

After doing some profiling of the server doing manifest imports, the main time sinks for this case are:

1. Importing Contents and Products. Contents go pretty quick, but products take about a second each, and in the test manifest, there are ~200 products. So about 2mins is spent there. Most of the time seems to be in hibernate book keeping, doing 'dirty' checks on the objects in the transaction. This seems to be caused by the db read mode, and the large transaction (~8000 statements for the attached manifest). 


     The patch above adds some calls to *Curator.refresh() after
     creating new persisted objects, to force the session to know 
     it has the most recent version of freshly created object and
     avoid the high number of dirty checks.

     This speeds up the Content and Product load times from about
     1 sec each, to about 1 sec total. So that cuts total manifest
     import time by about 2 mins.

2. Pool refreshes and sub generation. Once the Products, Contents, and Subscriptions are imported, they still need to be refreshed, and that takes
up most of the remaining time.

Comment 22 Barnaby Court 2016-03-01 19:36:27 UTC

Pablo, 

Are you able to duplicate this issue with Satellite 6.1.7? We added a number of changes specifically to speed up manifest import.

Comment 23 Barnaby Court 2016-03-01 21:37:39 UTC

I have been able to duplicate this issue. It will require converting an API endpoint from synchronous to async in the upstream Candlepin project and updating katello to use this new API endpoint. I have created an RFE for the new feature in Candlepin.

Comment 24 Barnaby Court 2016-03-04 18:21:13 UTC

*** Bug 1193835 has been marked as a duplicate of this bug. ***

Comment 25 David Davis 2016-03-11 13:10:40 UTC

*** Bug 1212120 has been marked as a duplicate of this bug. ***

Comment 27 Mike McCune 2016-03-31 23:31:57 UTC

We moved the default timeout to be much higher in 6.1.6, this is also in 6.2

Comment 28 Pavel Moravec 2016-04-29 07:54:18 UTC

I dont think this is fixed - see:

https://bugzilla.redhat.com/show_bug.cgi?id=1329091 

Every time I install Sat6.2 beta and import manifest with Employee SKU, I hit that bug.

Comment 29 jcallaha 2016-06-14 19:33:16 UTC

Verified in Satellite 6.2 beta Snap 15.2

After 15 manifest imports and 20 refreshes, this error has not presented itself.

Comment 31 errata-xmlrpc 2016-07-27 09:13:33 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1501