Bug 1269509 - Package installation via Satellite 6.1 is much slower than yum
Status: CLOSED ERRATA
Product: Red Hat Satellite 6
Classification: Red Hat
Component: WebUI
Version: 6.1.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Release: 6.1.5
Assigned To: Partha Aji
QA Contact: Tazim Kolhar
URL: http://projects.theforeman.org/issues...
Keywords: Triaged
Depends On: 1286066
Blocks: 1277292
Reported: 2015-10-07 09:01 EDT by Stuart Auchterlonie
Modified: 2017-02-23 14:44 EST (History)
CC: 17 users

Doc Type: Bug Fix
Cloned As: 1276443 1277292
Last Closed: 2015-12-15 04:20:01 EST
Type: Bug


Attachments
client hotfix packages (880.00 KB, application/x-tar), 2015-11-23 14:02 EST, Mike McCune


External Trackers
Red Hat Knowledge Base (Solution) 2327941, last updated 2016-05-18 03:16 EDT

Description Stuart Auchterlonie 2015-10-07 09:01:53 EDT
Description of problem:

When you attempt to install errata onto a host via the host's errata
management tab, and the operation takes longer than approximately 120
seconds to complete, the task fails with a "Request Timeout" error.

Version-Release number of selected component (if applicable):

Sat 6.1.2

How reproducible:

Fairly easily


Steps to Reproduce:
1. Select a large erratum such as kernel, or multiple errata, to create a transaction that will take at least 120 seconds to apply
2. Apply selected errata

Actual results:

Task state goes to Stopped / Warning with the error listed
as "Request Timeout"

Looking at the dynflow console for the task, I always get
something similar to

Actions::Pulp::Consumer::ContentInstall (skipped) [ 121.50s / 120.16s ]


Expected results:

Task continues to run until completion

Additional info:

Details from the dynflow task above
-----------
Started at: 2015-10-07 12:56:28 UTC

Ended at: 2015-10-07 12:58:29 UTC

Real time: 121.50s

Execution time (excluding suspended state): 120.16s

Input:

---
consumer_uuid: 636e2e99-4f04-4cfb-8690-1090042223cf
type: erratum
args:
- RHEA-2015:1593
remote_user: admin-5b9b1e3e
remote_cp_user: admin
locale: en-GB
Output:

--- {}
Error:

RestClient::RequestTimeout

Request Timeout

---
- /opt/rh/ruby193/root/usr/share/gems/gems/rbovirt-0.0.29/lib/restclient_ext/request.rb:56:in
  `rescue in transmit'
- /opt/rh/ruby193/root/usr/share/gems/gems/rbovirt-0.0.29/lib/restclient_ext/request.rb:11:in
  `transmit'
- /opt/rh/ruby193/root/usr/share/gems/gems/rest-client-1.6.7/lib/restclient/request.rb:64:in
  `execute'
- /opt/rh/ruby193/root/usr/share/gems/gems/rest-client-1.6.7/lib/restclient/request.rb:33:in
  `execute'
- /opt/rh/ruby193/root/usr/share/gems/gems/rest-client-1.6.7/lib/restclient/resource.rb:67:in
  `post'
- /opt/rh/ruby193/root/usr/share/gems/gems/runcible-1.3.5/lib/runcible/base.rb:91:in
  `get_response'
- /opt/rh/ruby193/root/usr/share/gems/gems/runcible-1.3.5/lib/runcible/base.rb:79:in
  `call'
- /opt/rh/ruby193/root/usr/share/gems/gems/runcible-1.3.5/lib/runcible/resources/consumer.rb:139:in
  `install_units'
- /opt/rh/ruby193/root/usr/share/gems/gems/runcible-1.3.5/lib/runcible/extensions/consumer.rb:85:in
  `install_content'
- /opt/rh/ruby193/root/usr/share/gems/gems/katello-2.2.0.67/app/lib/actions/pulp/consumer/content_install.rb:27:in
  `invoke_external_task'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/action/polling.rb:83:in
  `initiate_external_action'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/action/polling.rb:18:in
  `run'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/action/cancellable.rb:9:in
  `run'
- /opt/rh/ruby193/root/usr/share/gems/gems/katello-2.2.0.67/app/lib/actions/pulp/abstract_async_task.rb:57:in
  `run'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/action.rb:487:in
  `block (3 levels) in execute_run'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/middleware/stack.rb:26:in
  `call'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/middleware/stack.rb:26:in
  `pass'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/middleware.rb:16:in
  `pass'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/middleware.rb:25:in
  `run'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/middleware/stack.rb:22:in
  `call'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/middleware/stack.rb:26:in
  `pass'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/middleware.rb:16:in
  `pass'
- /opt/rh/ruby193/root/usr/share/gems/gems/katello-2.2.0.67/app/lib/actions/middleware/remote_action.rb:27:in
  `block in run'
- /opt/rh/ruby193/root/usr/share/gems/gems/katello-2.2.0.67/app/lib/actions/middleware/remote_action.rb:57:in
  `block (2 levels) in as_remote_user'
- /opt/rh/ruby193/root/usr/share/gems/gems/katello-2.2.0.67/app/lib/katello/util/thread_session.rb:84:in
  `pulp_config'
- /opt/rh/ruby193/root/usr/share/gems/gems/katello-2.2.0.67/app/lib/actions/middleware/remote_action.rb:43:in
  `as_pulp_user'
- /opt/rh/ruby193/root/usr/share/gems/gems/katello-2.2.0.67/app/lib/actions/middleware/remote_action.rb:56:in
  `block in as_remote_user'
- /opt/rh/ruby193/root/usr/share/gems/gems/katello-2.2.0.67/app/lib/katello/util/thread_session.rb:91:in
  `cp_config'
- /opt/rh/ruby193/root/usr/share/gems/gems/katello-2.2.0.67/app/lib/actions/middleware/remote_action.rb:38:in
  `as_cp_user'
- /opt/rh/ruby193/root/usr/share/gems/gems/katello-2.2.0.67/app/lib/actions/middleware/remote_action.rb:55:in
  `as_remote_user'
- /opt/rh/ruby193/root/usr/share/gems/gems/katello-2.2.0.67/app/lib/actions/middleware/remote_action.rb:27:in
  `run'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/middleware/stack.rb:22:in
  `call'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/middleware/stack.rb:26:in
  `pass'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/middleware.rb:16:in
  `pass'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/action/progress.rb:30:in
  `with_progress_calculation'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/action/progress.rb:16:in
  `run'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/middleware/stack.rb:22:in
  `call'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/middleware/stack.rb:26:in
  `pass'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/middleware.rb:16:in
  `pass'
- /opt/rh/ruby193/root/usr/share/gems/gems/katello-2.2.0.67/app/lib/actions/middleware/keep_locale.rb:23:in
  `block in run'
- /opt/rh/ruby193/root/usr/share/gems/gems/katello-2.2.0.67/app/lib/actions/middleware/keep_locale.rb:34:in
  `with_locale'
- /opt/rh/ruby193/root/usr/share/gems/gems/katello-2.2.0.67/app/lib/actions/middleware/keep_locale.rb:23:in
  `run'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/middleware/stack.rb:22:in
  `call'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/middleware/world.rb:30:in
  `execute'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/action.rb:486:in
  `block (2 levels) in execute_run'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/action.rb:485:in
  `catch'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/action.rb:485:in
  `block in execute_run'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/action.rb:402:in
  `call'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/action.rb:402:in
  `block in with_error_handling'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/action.rb:402:in
  `catch'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/action.rb:402:in
  `with_error_handling'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/action.rb:480:in
  `execute_run'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/action.rb:262:in
  `execute'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:9:in
  `block (2 levels) in execute'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/execution_plan/steps/abstract.rb:155:in
  `call'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/execution_plan/steps/abstract.rb:155:in
  `with_meta_calculation'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:8:in
  `block in execute'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:22:in
  `open_action'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:7:in
  `execute'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/executors/parallel/worker.rb:20:in
  `block in on_message'
- /opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:859:in
  `block in assigns'
- /opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:858:in
  `tap'
- /opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:858:in
  `assigns'
- /opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:138:in
  `match_value'
- /opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:116:in
  `block in match'
- /opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:115:in
  `each'
- /opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:115:in
  `match'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/executors/parallel/worker.rb:17:in
  `on_message'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/micro_actor.rb:82:in
  `on_envelope'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/micro_actor.rb:72:in
  `receive'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/micro_actor.rb:99:in
  `block (2 levels) in run'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/micro_actor.rb:99:in
  `loop'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/micro_actor.rb:99:in
  `block in run'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/micro_actor.rb:99:in
  `catch'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/micro_actor.rb:99:in
  `run'
- /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/micro_actor.rb:13:in
  `block in initialize'
- /opt/rh/ruby193/root/usr/share/gems/gems/logging-1.8.1/lib/logging/diagnostic_context.rb:323:in
  `call'
- /opt/rh/ruby193/root/usr/share/gems/gems/logging-1.8.1/lib/logging/diagnostic_context.rb:323:in
  `block in create_with_logging_context'
Comment 3 Mike McCune 2015-10-13 18:04:10 EDT
I've reproduced this in house fairly easily:

A) Register a single RHEL 6.5 host to the RHEL 6Server repo; this should end up with ~100+ errata needing application.

B) Attempt to update the host and apply all applicable errata. Note the timeout and the excessive time consumed to apply the errata.


This is a two-part issue:

1) We shouldn't be timing out at the API level.

2) The update is taking too long. Users are seeing these updates take 30-60 minutes to update a host from RHEL 6.5 -> 6.7 via the Satellite, whereas a 'yum update' run from the client itself finishes in ~5 minutes.
Comment 5 Justin Sherrill 2015-10-29 13:44:23 EDT
Splitting this up into two different bugzillas, as there really are two different issues.

This bug will track the fact that installation via Satellite 6 takes longer than via yum.

I will clone it to handle the '120 sec' timeout issue.
Comment 6 Sean Mullen 2015-11-02 12:38:17 EST
I've been looking at this in detail with our 6.1.3 installation.  We will require the ability to have 200 or more servers in a host collection and have them all patch within a reasonable amount of time (that being 2 hours or less) because of stringent change management controls and windows.

From what I've seen, Katello seems to throw tasks "over the wall" to pulp and then timeout / give up after X seconds.  The truly horrible points here are:

A: there are only a small number of celery threads processing these requests;

B: the pulp queue is allowed to grow unchecked, while client patching gets to compete with repository synchronization for the celery threads;

C: worst of all, THE PULP TASKS ARE NOT CANCELLED WHEN THE KATELLO TASK TIMES OUT ... from what I've seen, if there's a decent queue of activities built up in pulp (repo syncs, for example) and you try to patch servers, katello tells you it's timed out and failed.  Meanwhile, hours later, pulp can get around to your patching tasks that are queued in the mongodb task queue and patch your server ... long after you've given up, closed your change ticket as a failure and gone to bed.

The other directly impacting issue I've seen is that you're given the option to sync repos hourly (we need to do this with a few custom repos to pull in new development packages throughout the day).  The bad thing here is that the repo syncs / metadata refreshes / capsule syncs are all run regardless of whether anything has changed in the repo or not.  In our situation, this caused a huge backlog of pulp tasks which I had to manually cancel to get things back on track ... after disabling hourly syncs.  I had to manually cancel over 11,000 tasks that were waiting in pulp but Katello said were failed.

Suggestion: If a repo has no changes ... don't regenerate metadata and don't push update tasks to the capsules ... nothing has changed, don't waste the resources and back up the pulp queue further.
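Sean's suggestion could be sketched roughly like this (illustrative Ruby only; these helper names are hypothetical and not part of Katello): fingerprint the repo's current content, compare against the fingerprint recorded at the last sync, and skip the whole regenerate/push pipeline when they match.

```ruby
# Illustrative sketch of the suggestion above (hypothetical names, not
# Katello's API): skip metadata regeneration and capsule updates when a
# repo's package set has not changed since the last sync.
require 'digest'

def repo_fingerprint(package_list)
  # Sort first so the fingerprint is insensitive to listing order.
  Digest::SHA256.hexdigest(package_list.sort.join("\n"))
end

def sync_needed?(last_fingerprint, package_list)
  repo_fingerprint(package_list) != last_fingerprint
end
```

An hourly scheduler could then call `sync_needed?` as a cheap guard before enqueueing any pulp tasks, which is the resource saving being asked for.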
Comment 7 Brad Buckingham 2015-11-02 19:07:07 EST
Created redmine issue http://projects.theforeman.org/issues/12375 from this bug
Comment 8 Bryan Kearney 2015-11-02 22:01:41 EST
Moving to POST since upstream bug http://projects.theforeman.org/issues/12375 has been closed
-------------
Anonymous
Applied in changeset commit:katello-agent|5210a603a8281fb14dd1e9a23012baa123122762.
Comment 13 Sean Mullen 2015-11-20 08:04:39 EST
Seriously?? You DO see that the solution in the linked issue is "turn off progress reporting in the agent" ... that's not even close to a solution for the issue "Package installation via Satellite 6.1 is much slower than yum".

In fact it makes things far worse, since it masks the real issue: that gofer sucks at anything close to an enterprise load.  I'll be escalating this on the Red Hat side for the related cases I have open, since this is nowhere near an acceptable solution.
Comment 14 Justin Sherrill 2015-11-20 08:59:25 EST
Hi Sean, 

I wanted to address some of your original comments to clarify some things:

(In reply to Sean Mullen from comment #6)
> I've been looking at this in detail with our 6.1.3 installation.  We will
> require the ability to have 200 or more servers in a host collection and
> have them all patch within a reasonable amount of time(that being 2 hours or
> less) because of stringent change management controls and windows.
> 
> From what I've seen, Katello seems to throw tasks "over the wall" to pulp
> and then timeout / give up after X seconds.  The truly horrible points here
> are that A: there are only a small number of celery threads processing these
> requests, 

This is not true.  There are two types of tasks at play here, agent tasks and 'worker' tasks.  Agent tasks are not processed by the celery workers at all.  The workers can be completely busy and agent tasks will still be submitted to the agents. 

Also, there are two timeouts here: a 'pickup' timeout and a 'completion' timeout.  The pickup timeout triggers if the client has not picked up the task within 20 seconds.  Given that this is not prevented by busy celery workers, and that an agent will pick up tasks even while it is currently processing tasks, I'm not sure why this is a problem.  The 'completion' timeout handles the case where the system, or gofer running on the system, 'dies' after the agent has picked up the task.  This defaults to 3600 seconds, but you can easily increase it under Settings.

If you are seeing the 20 second timeout, you may have an issue with gofer on the client.
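The two timeouts described above can be summarized in a small decision function (an illustrative sketch, not Katello source; the names and keyword-argument shape are assumptions):

```ruby
# Hypothetical model of the two agent-task timeouts described above:
# a 20 s window for the client to pick up the task, and a 3600 s
# (configurable) window for the task to finish once picked up.
PICKUP_TIMEOUT     = 20    # seconds
COMPLETION_TIMEOUT = 3600  # seconds; adjustable under Settings

def task_outcome(picked_up_after:, finished_after:)
  return :pickup_timeout if picked_up_after.nil? ||
                            picked_up_after > PICKUP_TIMEOUT
  return :completion_timeout if finished_after.nil? ||
                                (finished_after - picked_up_after) > COMPLETION_TIMEOUT
  :success
end
```

For example, a task picked up in 5 seconds that runs for 15 minutes succeeds, while one never acknowledged within 20 seconds fails with the pickup timeout regardless of how busy the celery workers are.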


>B: the pulp queue is allowed to grow unchecked while the Client
> patching gets to compete with repository synchronization for the celery
> threads, 

As I said before, this isn't true.

> C: Worst, THE PULP TASKS ARE NOT CANCELLED WHEN THE KATELLO TASK
> TIMES OUT ... 

This is a perfectly valid RFE, please file an RFE for this.  We *should* be able to get this in as a zstream.

>from what I've seen, if there's a decent queue of activities
> built up in pulp (repo synch's for example) and you try to patch servers,
> katello tells you it's timed out and failed.  Meanwhile, hours later, pulp
> can get around to your patching tasks that are queued in the mongodb task
> queue and patch your server ... long after you've given up, closed your
> change ticket as a failure and gone to bed.
> 
> The other directly impacting issue I've seen is that you're given the option
> to synch repos hourly (We need to do this with a few custom repos to pull in
> new development packages throughout the day).  The bad thing here is that
> the repo synch's / metadata refreshes / capsule synch's are all ran
> regardless of whether anything has changed in the repo or not.  

There is an open BZ for that here:  https://bugzilla.redhat.com/show_bug.cgi?id=1264560

It has already been fixed upstream and should appear in the next z-stream, I believe.

>In our
> situation, this caused a huge backlog of pulp tasks which I had to manually
> cancel to get things back on track ... after disabling hourly synch's.  I
> had to manually cancel over 11,000 tasks that were waiting in pulp but
> Katello said were failed.
> 
> Suggestion: If a repo has no changes ... don't regenerate metadata and don't
> push update tasks to the capsules ... nothing has changed, don't waste the
> resources and back up the pulp queue further.

Hope that helps!
Comment 15 Stuart Auchterlonie 2015-11-23 06:05:33 EST
(In reply to Justin Sherrill from comment #14)
> > ...have them all patch within a reasonable amount of time (that being 2 hours or
> > less) because of stringent change management controls and windows.
> > 
> > From what I've seen, Katello seems to throw tasks "over the wall" to pulp
> > and then timeout / give up after X seconds.  The truly horrible points here
> > are that A: there are only a small number of celery threads processing these
> > requests, 
> 
> This is not true.  There are two types of tasks at play here, agent tasks
> and 'worker' tasks.  Agent tasks are not processed by the celery workers at
> all.  The workers can be completely busy and agent tasks will still be
> submitted to the agents. 
> 
> Also there are two timeouts here.  A 'pickup' timeout and a 'completion'
> timeout.  The pickup timeout triggers if the client has not picked up the
> task in 20 seconds.  Given that this is not prevented by busy celery workers
> and an agent will pick up tasks even if they are currently processing tasks
> I'm not sure why this is a problem?  The 'completion' timeout is to handle
> the case where a system or gofer running on the system 'dies' after the
> agent has picked up the task.  This defaults to 3600 seconds, but you can
> easily increase this under settings. 
> 
> If you are seeing the 20 second timeout you may be having an issue on gofer
> on the client.  
> 
> 

I'll give you a perfectly good example of where this is not working correctly.

If you have a capsule set up as part of your satellite system, and it is offline
for some reason.

The tasks will be submitted, they will hit the 20 second time out, and the
task itself will error out.

*HOWEVER* the task request remains on the message queue, so when you bring
your capsule back online a few hours later, it'll start up, go and
check for any items in its queue, find those task requests on the message
queue, and start pulling and actioning them.

I still believe we need better state management within the task system.
Tasks, once they finish planning, should enter a "queued" state, since they
are "queued" waiting for a pulp worker to pick them up and action them.

Tasks should not immediately enter "Running / Pending", since this isn't
really correct for them; they should only transition to "Running" once a
worker has picked them up and actually started processing them.
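Stuart's proposed state model might look like the following sketch (hypothetical; Katello's actual task states and transitions differ):

```ruby
# Sketch of the proposed task lifecycle (illustrative only): a task must
# pass through "queued" after planning, and only becomes "running" once a
# worker actually picks it up.
TRANSITIONS = {
  planning: [:queued],
  queued:   [:running],
  running:  [:stopped]
}.freeze

def valid_transition?(from, to)
  TRANSITIONS.fetch(from, []).include?(to)
end
```

Under this model, jumping straight from planning to running would be invalid, which encodes the complaint that tasks currently show "Running / Pending" while still waiting in a queue.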


Regards
Stuart
Comment 16 Justin Sherrill 2015-11-23 08:29:32 EST
> I'll give you a perfectly good example of where this is not working
> correctly.
> 
> If you have a capsule set up as part of your satellite system, and it is
> offline
> for some reason.
> 
> The tasks will be submitted, they will hit the 20 second time out, and the
> task itself will error out.
> 
> *HOWEVER* The task request remains on the message queue, so when you bring
> your capsule back online a few hours later, it'll start up, then go and
> check for any items in it's queue, finds those tasks requests on the message
> queue, and start pulling them and actioning them.

Right, and as I said to Sean, I believe this is a valid bug/RFE.  We can likely get this resolved in a z-stream if customers start requesting it via normal channels.  As of today no one has, as a comment in an unrelated bz is not a formal request :)

> 
> I still believe we need better state management within the task system.
> Task once they finish planning, should enter a "queued" state, since they
> are "queued" waiting for a pulp worker to pick them up and action them.
> 
> Tasks should not immediately enter "Running / Pending" since this isn't 
> really correct for them, they should only transition to "Running" once a
> worker has picked them up and actually started processing them.

When viewing the application as a whole I mostly disagree with this.  The tasking system you are looking at, which says Running/Pending, sits on top of the foreman/katello application, the pulp application (and its tasking system) and the candlepin application.  It is not merely dispatching tasks to pulp.  In many cases, performing an operation will perform some action in the foreman/katello database, kick off a pulp task, and then perform some other operation in the foreman/katello database (and that is a simple case).  To say that it isn't 'running' because pulp hasn't picked up a single part of the full action isn't really accurate.  But I can see the confusion when looking at the two different tasking systems (katello's and pulp's) and seeing conflicting data.
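A minimal sketch of that point (hypothetical step names, not Katello code): a single "running" katello task wraps local database work around the pulp dispatch, so the pulp sub-task sitting in pulp's own queue does not mean nothing is running.

```ruby
# Illustrative composite action: only one of the four steps is a pulp
# task; the rest is local foreman/katello work that runs before and after.
class InstallContentAction
  def run
    log = []
    log << :update_katello_db          # local DB work happens first
    log << :dispatch_pulp_task         # only this step enters pulp's queue
    log << :poll_pulp_until_done
    log << :update_katello_db_again    # more local work after pulp finishes
    log
  end
end
```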
Comment 17 Justin Sherrill 2015-11-23 08:35:13 EST
I've opened a bz here: https://bugzilla.redhat.com/show_bug.cgi?id=1284494 to cover task cancelling on timeout.
Comment 18 Mike McCune 2015-11-23 14:00:38 EST
We are releasing an immediate hotfix for this issue which will reduce the amount of time needed to apply large errata updates to end systems, as well as resolve some reliability issues with applying large numbers of errata updates to client systems.

== Server HOTFIX instructions ==

1) Edit /etc/foreman/plugins/katello.yaml, update rest_client_timeout to 3600
2) Save file
3) katello-service restart

If katello-installer --upgrade is re-run, you will need to re-apply this configuration change.
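For reference, the step 1 edit amounts to a one-line value change like the following (a sketch only; the exact key nesting inside katello.yaml can vary between installations, so locate the existing rest_client_timeout entry rather than pasting this blindly):

```yaml
# /etc/foreman/plugins/katello.yaml (fragment; surrounding keys omitted)
# Raise the REST client timeout from the 120-second default to one hour:
:rest_client_timeout: 3600
```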

== Client HOTFIX Instructions ==

1) Download attached bz-hotfix-1269509.tar.gz file or wget http://people.redhat.com/~mmccune/hotfix/1269509/bz-hotfix-1269509.tar.gz
2) Expand the archive, which contains katello-agent and gofer packages for Red Hat Enterprise Linux 5/6/7
3) Copy the packages for your version of Red Hat Enterprise Linux from the extracted subdirectory to a client system you wish to upgrade to the hotfix
4) yum localinstall *.rpm 
5) service goferd restart

This hotfix includes fixes for:

 - https://bugzilla.redhat.com/show_bug.cgi?id=1277269
 - https://bugzilla.redhat.com/show_bug.cgi?id=1269509
Comment 19 Mike McCune 2015-11-23 14:02 EST
Created attachment 1097826 [details]
client hotfix packages
Comment 20 Sean Mullen 2015-11-23 15:25:28 EST
Hi everyone,
   Thanks for the attention, clarification and hotfix; I'll pull it down and work with it ASAP.  Justin, specific to your clarification: thanks, that helps me understand.  I had to make some assumptions, since I can only dig through so much code.

   All that said, here is the reason I made a few assumptions ... things mentioned in my Red Hat case that are not mentioned here.  I had my Red Hat repos (many of them) set to auto sync weekly, and I have a few internal custom repos set to sync hourly (so devs can see / install new versions quickly after pushing them to the internal repo).  After allowing that to run for a while, I imported some servers and tried applying errata through Satellite.  I tried applying all security errata to 2 hosts.  The tasks timed out after 120 seconds or so ... I gave up that night and went home.  When I came in in the morning, I saw that the hosts finally did patch, several hours later.  This is what I'm saying is totally unacceptable in an enterprise ... if I pick 200 servers and say "patch them", the patching MUST be done within a couple of hours, and if something says it gave up / timed out, it had better not apply hours later.  In our organization, that's a career-limiting issue, since it would end up patching production servers during business hours.

   After seeing the patched servers the next day, I poked around a lot more and found that there were over 11,000 pulp tasks in pending status, and assumed that this was a good portion of the issue.  I was then told "you probably shouldn't set repos to sync hourly, the system isn't designed for that" ... even though the default sync timeframe is ... hourly.

   You guys mention properly opening RFEs ... I know what an RFE is, but apparently not how to open one.  I'm a customer, and I have several requests that I believe would be valid RFEs.  I only got into this thread because it was linked in my cases (01531872 and 01530408)
Comment 21 Mike McCune 2015-11-24 16:11:33 EST
Just to set some expectations on this bug around the performance improvements:

On an older RHEL 6.5 virtual machine that I updated with ~290 errata to get it fully updated to RHEL 6.7, applying them took approximately 1 hour before the fixes in this bug.

After the update to the katello-agent and gofer packages, that was reduced to 35 minutes.

We hope to improve this further in future releases, and to offer options that simplify the process by allowing remote execution of a simple 'yum update' command instead of specific errata installation instructions.
Comment 22 Tazim Kolhar 2015-11-30 10:06:27 EST
VERIFIED:

# rpm -qa | grep foreman
ruby193-rubygem-foreman_hooks-0.3.7-2.el7sat.noarch
rubygem-hammer_cli_foreman_docker-0.0.3.10-1.el7sat.noarch
foreman-gce-1.7.2.47-1.el7sat.noarch
rubygem-hammer_cli_foreman_discovery-0.0.1.10-1.el7sat.noarch
hp-bl490cg6-01.rhts.eng.bos.redhat.com-foreman-proxy-1.0-2.noarch
foreman-debug-1.7.2.47-1.el7sat.noarch
foreman-compute-1.7.2.47-1.el7sat.noarch
foreman-vmware-1.7.2.47-1.el7sat.noarch
ruby193-rubygem-foreman-redhat_access-0.2.4-1.el7sat.noarch
rubygem-hammer_cli_foreman_tasks-0.0.3.5-1.el7sat.noarch
ruby193-rubygem-foreman_bootdisk-4.0.2.13-1.el7sat.noarch
hp-bl490cg6-01.rhts.eng.bos.redhat.com-foreman-client-1.0-1.noarch
hp-bl490cg6-01.rhts.eng.bos.redhat.com-foreman-proxy-client-1.0-1.noarch
foreman-ovirt-1.7.2.47-1.el7sat.noarch
ruby193-rubygem-foreman_discovery-2.0.0.22-1.el7sat.noarch
rubygem-hammer_cli_foreman-0.1.4.14-1.el7sat.noarch
foreman-selinux-1.7.2.17-1.el7sat.noarch
ruby193-rubygem-foreman_gutterball-0.0.1.9-1.el7sat.noarch
foreman-postgresql-1.7.2.47-1.el7sat.noarch
ruby193-rubygem-foreman_docker-1.2.0.24-1.el7sat.noarch
rubygem-hammer_cli_foreman_bootdisk-0.1.2.7-1.el7sat.noarch
foreman-libvirt-1.7.2.47-1.el7sat.noarch
foreman-1.7.2.47-1.el7sat.noarch
ruby193-rubygem-foreman-tasks-0.6.15.7-1.el7sat.noarch
foreman-proxy-1.7.2.6-1.el7sat.noarch

steps:
1. Select multiple errata to create a transaction that will take at least 120 seconds to apply
2. Apply selected errata

 Id: d30951ae-aec0-41ca-b56b-e512bee7dff6
Label: Actions::Katello::System::Erratum::Install
Name: Install erratum
Owner: admin
Started at: 2015-11-30 14:53:10 UTC
Ended at: 2015-11-30 14:56:09 UTC
State: stopped
Result: success
Params: RHSA-2015:2504, RHBA-2015:2014, RHBA-2015:2018, RHBA-2015:2006, RHBA-2015:2011; system 'ibm-hs22-02.lab.bos.redhat.com'; organization 'Default Organization' 


perl-IO-Compress-Base-2.021-141.el6_7.1.x86_64
libreport-python-2.0.9-25.el6_7.x86_64
1:perl-Module-Pluggable-3.90-141.el6_7.1.x86_64
3:perl-version-0.77-141.el6_7.1.x86_64
4:perl-devel-5.10.1-141.el6_7.1.x86_64
4:perl-5.10.1-141.el6_7.1.x86_64
1:perl-Compress-Raw-Zlib-2.021-141.el6_7.1.x86_64
libreport-plugin-kerneloops-2.0.9-25.el6_7.x86_64
perl-CGI-3.51-141.el6_7.1.x86_64
libreport-plugin-mailx-2.0.9-25.el6_7.x86_64
zip-3.0-1.el6_7.1.x86_64
1:perl-Pod-Simple-3.13-141.el6_7.1.x86_64
1:perl-Pod-Escapes-1.04-141.el6_7.1.x86_64
libreport-compat-2.0.9-25.el6_7.x86_64
libreport-plugin-rhtsupport-2.0.9-25.el6_7.x86_64
selinux-policy-3.7.19-279.el6_7.7.noarch
4:perl-libs-5.10.1-141.el6_7.1.x86_64
libreport-2.0.9-25.el6_7.x86_64
perl-Test-Simple-0.92-141.el6_7.1.x86_64
libreport-cli-2.0.9-25.el6_7.x86_64
perl-ExtUtils-MakeMaker-6.55-141.el6_7.1.x86_64
1:perl-ExtUtils-ParseXS-2.2003.0-141.el6_7.1.x86_64
libreport-plugin-logger-2.0.9-25.el6_7.x86_64
perl-Test-Harness-3.17-141.el6_7.1.x86_64
grep-2.20-3.el6_7.1.x86_64
libreport-plugin-reportuploader-2.0.9-25.el6_7.x86_64
selinux-policy-targeted-3.7.19-279.el6_7.7.noarch
perl-IO-Compress-Zlib-2.021-141.el6_7.1.x86_64
perl-Compress-Zlib-2.021-141.el6_7.1.x86_64
json-c-0.11-12.el6.x86_64
libreport-filesystem-2.0.9-25.el6_7.x86_64
satyr-0.16-2.el6.x86_64
augeas-libs-1.0.0-10.el6.x86_64
Comment 24 errata-xmlrpc 2015-12-15 04:20:01 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2622
Comment 25 gabicr 2015-12-16 05:05:20 EST
Hello!

Updated yesterday to 6.1.5 on RHEL 7.  Synced last night; today I tried applying 612 errata to a RHEL 6 guest registered to Sat6.  Got the same timeout.  Checked /etc/foreman/plugins/katello.yaml on the satellite: rest_client_timeout is 120!
Comment 26 Karthick Murugadhas 2015-12-16 17:03:02 EST
I have a customer who upgraded to 6.1.5 and is still facing the timeout issue.