+++ This bug was initially created as a clone of Bug #1269509 +++
Description of problem:
When you attempt to install errata onto a host via
the errata management tab of that host, the operation
times out if it takes longer than approximately 120 seconds to complete.
We narrowed this down to an issue with qpid:
The difference in performance between Satellite and using YUM directly to install errata (as reported in
#01519801) can be attributed to a known issue with Qpid and AMQP 1.0. Further, it explains why this
performance is a regression in Satellite 6.1. In 6.0, AMQP-0-10 was used. In 6.1, AMQP-1.0 was introduced
to support Qpid Dispatch Router on Capsules. Using AMQP-0-10, the client can explicitly trigger the flush
by sending a sync control (or setting the sync flag on the transfer). There is no such explicit mechanism in
AMQP 1.0, so at present the flush happens after a short timeout on the broker. During the development and
testing of python-gofer-proton, I noted an approximate 1 second delay sending messages. After discussion
with Ted Ross, Gordon Sim and Ken Giusti, this finding was confirmed.
So, what this means to Satellite is that it takes 1 second to send every message when using AMQP 1.0 to
durable queues when Qpid persistence is enabled (which it is). The messaging flow during an errata install
looks something like this:
| request -------->|
|<------- accepted |
|<------- started |
|<------- result |
* plus A TON OF progress messages
For example: 549 (progress) messages were sent on a test system when installing 191 packages associated
with an Errata install. At 1 second per progress message this adds 549 seconds (9 minutes) to the install.
These numbers matched test results. With progress reporting enabled: ~18 min. Without progress reporting:
1. Fix the AMQP-1.0 issue in Qpid. The Qpid team has described this as difficult, so it is unlikely that this
will get fixed soon, or that the fixed version will be ported to RHEL6.
2. Disconnect progress reporting in the Katello agent plugin. This solution is simple but assumes Katello
can live without progress reported for agent tasks. This progress is reported to Pulp and is included in
the Pulp task.
3. A variation on #2. If Katello needs the reported progress information, we could rate-limit the reported
progress. This mitigates the issue but is not as good as #1 or #2.
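Option 3 could look roughly like the sketch below. It is only an illustration of rate-limiting: the class name, the `send` callable and the 10 second interval are assumptions and are not taken from katello-agent or gofer.

```python
import time


class RateLimitedProgress:
    """Forward at most one progress report per `min_interval` seconds.

    Hypothetical helper for illustration; not part of katello-agent or gofer.
    """

    def __init__(self, send, min_interval=10.0, clock=time.monotonic):
        self._send = send                  # callable that actually sends the report
        self._min_interval = min_interval  # assumed value, tune as needed
        self._clock = clock                # injectable so the limiter can be tested
        self._last_sent = None             # time of the last forwarded report
        self._pending = None               # most recent report that was suppressed

    def report(self, details):
        """Forward `details` if the interval has elapsed, otherwise remember it."""
        now = self._clock()
        if self._last_sent is None or now - self._last_sent >= self._min_interval:
            self._send(details)
            self._last_sent = now
            self._pending = None
        else:
            self._pending = details

    def flush(self):
        """Send the latest suppressed report, if any (call when the task ends)."""
        if self._pending is not None:
            self._send(self._pending)
            self._last_sent = self._clock()
            self._pending = None
```

With this approach the number of durable sends depends on how long the install runs rather than on how many packages it touches, and calling `flush()` at the end means the last known progress is still reported.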
The disconnect delays introduced by AMQP-1.0 heartbeats that I observed and reported earlier seem to have
no significant impact on the performance of installing Errata.
For 6.1.z we opted for (2) from above as merged into katello-2.2's agent:
but we need a more permanent solution for 6.2 (or we just need to carry this change forward in 6.2 if we decide to just keep it as-is).
If this bug is being looked at near the end of the 6.2 cycle and we are out of time, just get the above mentioned PR merged into 6.2 and avoid any regressions.
I flagged this as a regression and a blocker as we can't ship 6.2 without at least pulling that PR above into 6.2's code.
Suggestion on alternative fix #4
4) Report (interim) progress messages via a non durable queue.
The only progress message that we must receive is the "completed"
message. For any message that just provides an indication of
progress, it doesn't matter if the queue is not durable, because in
the event of a crash, the whole task will end up aborted, so any
"interim" progress reports are irrelevant.
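To illustrate what option 4 would buy us, here is a rough cost model. It assumes the ~1 second per durable send described earlier in this bug, treats non-durable sends as free (an assumption, not a measurement), and uses hypothetical message kind names based on the flow above. It does not use any real Qpid or gofer API.

```python
# Assumption from the analysis in this bug: about 1 second per message sent
# to a durable queue over AMQP 1.0. Non-durable sends are treated as free,
# which is a simplification and has not been measured.
DURABLE_SEND_DELAY = 1.0
NON_DURABLE_SEND_DELAY = 0.0

# Hypothetical message kinds, named after the flow described in this bug.
INTERIM_KINDS = {"progress"}


def needs_durable_queue(kind):
    """Only interim progress reports may be lost; everything else must survive."""
    return kind not in INTERIM_KINDS


def estimated_send_overhead(kinds, split_queues):
    """Estimate the total send delay in seconds for a sequence of message kinds."""
    total = 0.0
    for kind in kinds:
        durable = needs_durable_queue(kind) if split_queues else True
        total += DURABLE_SEND_DELAY if durable else NON_DURABLE_SEND_DELAY
    return total


# Agent-side replies for the 191-package test: 549 progress messages.
flow = ["accepted", "started"] + ["progress"] * 549 + ["result"]
print(estimated_send_overhead(flow, split_queues=False))  # everything durable
print(estimated_send_overhead(flow, split_queues=True))   # progress non-durable
```

Under these assumptions almost all of the send delay comes from the interim progress messages, so moving only those to a non-durable queue removes most of the overhead while the final result stays on the durable path.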
This has been fixed using option #2 and can be moved to MODIFIED, ON_QA or CLOSED, right? See: https://github.com/Katello/katello-agent/pull/33.
Actually, that commit already shipped with an errata as part of BZ1269509. Can we just close this as a dupe? I would think it should just get into 6.2's katello-agent when we branch it for 6.2, no?
In my opinion this bz cannot be closed until 6.2 is shipped.
BZ1269509 tracked the 6.1.z version of this issue.
Why? We generally only have one BZ per bug, not per Satellite version. Generally because it made a 6.1.z, the code will be in 6.2 when we branch the katello-agent repo.
Comment #1 from Mike indicates he wants to keep this open though, so it doesn't much matter to me.
Retested this scenario in Satellite 6.2-beta-snap-6.
1. Install 60+ errata in a rhel 7.2 content host from content host -> errata tab.
2. The task did not time out after 120 seconds as described in this bug.
3. Instead, it timed out after 3600 seconds because the installation did not complete within that time. Note that
Administer -> Settings -> Katello ->
content_action_accept_timeout = 20
content_action_finish_timeout = 3600
Also note that the content host got all the errata updates but took a little longer.
It looks to me that the bug is resolved as the timeout did not occur in 120 seconds.
I tested with another host applying 20+ errata, and the content host did not get the errata. I also tried with 3+ errata and hit the same problem. There is definitely a problem with Satellite connecting to katello-agent on content hosts.
moved https://bugzilla.redhat.com/show_bug.cgi?id=1323726 back ON_QA , moving this one as well.
https://bugzilla.redhat.com/show_bug.cgi?id=1323726 has failed again. Moving this back to ASSIGNED
ping pong ... back ON_QA :)
still blocked on https://bugzilla.redhat.com/show_bug.cgi?id=1323726
Moving to Assigned as per Comment 20
POST as indicated in https://bugzilla.redhat.com/show_bug.cgi?id=1323726
Verified in Satellite 6.2 Beta Snap 9. With the new gofer package, and the resolution of the bug above, package installation is now within a reasonable margin of yum.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.