Bug 1997225 - Noticed "event_queue_error: type=delete_host_agent_queue, object_id=XX" error logging during concurrent host build/rebuild/re-registration/deletion in Satellite 6.10
Summary: Noticed "event_queue_error: type=delete_host_agent_queue, object_id=XX" error...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Hosts - Content
Version: 6.10.0
Hardware: All
OS: Linux
unspecified
medium
Target Milestone: 6.10.0
Assignee: Lucy Fu
QA Contact: Stephen Wadeley
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-08-24 16:19 UTC by Sayan Das
Modified: 2021-11-16 14:13 UTC (History)
4 users (show)

Fixed In Version: tfm-rubygem-katello-4.1.1.22-1,tfm-rubygem-katello-4.1.1.28-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-16 14:13:28 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Foreman Issue Tracker 33348 0 None None None 2021-08-25 15:47:55 UTC
Red Hat Product Errata RHSA-2021:4702 0 None None None 2021-11-16 14:13:35 UTC

Description Sayan Das 2021-08-24 16:19:48 UTC
Description of problem:

While doing certain host build/rebuild/re-registration/deletion-related tasks, I have noticed the following errors in production.log at certain times.

2021-08-24T21:15:40 [E|kat|25e729cf] event_queue_error: type=delete_host_agent_queue, object_id=21
2021-08-24T21:15:40 [E|kat|25e729cf] Connection refused - connect(2) for "localhost" port 5671
2021-08-24T21:15:40 [E|kat|25e729cf] /opt/theforeman/tfm/root/usr/share/gems/gems/qpid_proton-0.33.0/lib/core/container.rb:130:in `initialize'
 25e729cf | /opt/theforeman/tfm/root/usr/share/gems/gems/qpid_proton-0.33.0/lib/core/container.rb:130:in `new'
 25e729cf | /opt/theforeman/tfm/root/usr/share/gems/gems/qpid_proton-0.33.0/lib/core/container.rb:130:in `connect'
 25e729cf | /opt/theforeman/tfm/root/usr/share/gems/gems/katello-4.1.1.11/app/lib/katello/qpid/connection.rb:18:in `on_container_start'


Version-Release number of selected component (if applicable):

Satellite 6.10 [ snap 13 ]


How reproducible:

Not always


Steps to Reproduce:
1. Install a satellite 6 and set it up for system deployment.
2. Deploy some RHEL 7 hosts concurrently from Satellite server and wait for their build to get completed.
3. Perform some re-registrations, host deletion etc
4. Execute --> grep 'Connection refused - connect(2) for "localhost"' /var/log/foreman/production.log -B3 -A3 


Actual results:

# grep 'Connection refused - connect(2) for "localhost"' /var/log/foreman/production.log -B3 -A3
2021-08-24T18:45:23 [I|app|312ab092] ForemanWebhooks::EventSubscriber: host_updated.event.foreman event received
2021-08-24T18:45:23 [I|app|312ab092] Completed 201 Created in 1144ms (ActiveRecord: 298.3ms | Allocations: 225772)
2021-08-24T18:50:19 [E|kat|25e729cf] event_queue_error: type=delete_host_agent_queue, object_id=20
2021-08-24T18:50:19 [E|kat|25e729cf] Connection refused - connect(2) for "localhost" port 5671
2021-08-24T18:50:19 [E|kat|25e729cf] /opt/theforeman/tfm/root/usr/share/gems/gems/qpid_proton-0.33.0/lib/core/container.rb:130:in `initialize'
 25e729cf | /opt/theforeman/tfm/root/usr/share/gems/gems/qpid_proton-0.33.0/lib/core/container.rb:130:in `new'
 25e729cf | /opt/theforeman/tfm/root/usr/share/gems/gems/qpid_proton-0.33.0/lib/core/container.rb:130:in `connect'
--
2021-08-24T20:02:54 [I|app|f39c138b] Processing by NotificationRecipientsController#index as JSON
2021-08-24T20:02:54 [I|app|f39c138b] Completed 200 OK in 46ms (Views: 0.2ms | ActiveRecord: 2.9ms | Allocations: 17943)
2021-08-24T20:05:13 [E|kat|25e729cf] event_queue_error: type=delete_host_agent_queue, object_id=20
2021-08-24T20:05:13 [E|kat|25e729cf] Connection refused - connect(2) for "localhost" port 5671
2021-08-24T20:05:13 [E|kat|25e729cf] /opt/theforeman/tfm/root/usr/share/gems/gems/qpid_proton-0.33.0/lib/core/container.rb:130:in `initialize'
 25e729cf | /opt/theforeman/tfm/root/usr/share/gems/gems/qpid_proton-0.33.0/lib/core/container.rb:130:in `new'
 25e729cf | /opt/theforeman/tfm/root/usr/share/gems/gems/qpid_proton-0.33.0/lib/core/container.rb:130:in `connect'



Expected results:

These should not occur at all.


Additional info:

* I don't have any qpidd service running and based on the Engineering team, this scenario is occurring when satellite attempts to delete qpid queue when katello agent is disabled

Comment 4 Pavel Moravec 2021-08-25 07:21:33 UTC
I was able to reproduce it *once* by (re-)registering a host via the curl Register Host feature. Just one Host was in charge, dont recall particular parameters used (few tests executed, hit just once).

Comment 5 Pavel Moravec 2021-08-25 08:01:21 UTC
IMHO deterministic reproducer: register a host (with AK, could be via WebUI Register Host), and unregister it (or forcefully re-register it).

Then wait 10 minutes (this is tricky, see https://github.com/Katello/katello/blob/master/app/services/katello/registration_manager.rb#L268 for reasoning).

Simply katello just has some leftovers from qpid katello event queues that need to be removed:

1) https://github.com/Katello/katello/blob/master/app/services/katello/registration_manager.rb#L130
- remove this command / line as there is no katello agent queue

2) remove whole https://github.com/Katello/katello/blob/master/app/services/katello/registration_manager.rb#L264-L270 as being dead code

3) remove whole https://github.com/Katello/katello/blob/master/app/models/katello/events/delete_host_agent_queue.rb as being dead code

4) remove https://github.com/Katello/katello/blob/master/lib/katello/engine.rb#L236 as being orphaned / not-existing action

(not sure if whole https://github.com/Katello/katello/blob/master/app/services/katello/agent/dispatcher.rb and some tests referring to delete_client_queue should not be removed as well)


I *think* the bug is not severe as the unregister call flow just triggers an event and continues successfully further. And processing the particular event fails (with the backtrace), but the  Katello::EventQueue itself isnt affected - it still can process further events.

Comment 6 Lucy Fu 2021-08-25 15:47:54 UTC
Created redmine issue https://projects.theforeman.org/issues/33348 from this bug

Comment 7 Jonathon Turel 2021-08-25 20:08:32 UTC
Hey Pavel! None of the code you referred to is dead yet. In new Satellite 6.10 installations the katello agent infrastructure is only _disabled_ by default. For users upgrading, it will still be enabled and we still support it. If users want, they can run the installer with a flag to enable the agent bits. In a few Satellite releases, once we have a proper replacement for katello agent, then we'll remove all of that old stuff you pointed out!

This bug is still valid => when the agent infrastructure is disabled we don't need to worry about deleting the agent queue.

Comment 8 Pavel Moravec 2021-08-25 20:50:26 UTC
(In reply to Jonathon Turel from comment #7)
> Hey Pavel! None of the code you referred to is dead yet. In new Satellite
> 6.10 installations the katello agent infrastructure is only _disabled_ by
> default. For users upgrading, it will still be enabled and we still support
> it. If users want, they can run the installer with a flag to enable the
> agent bits. In a few Satellite releases, once we have a proper replacement
> for katello agent, then we'll remove all of that old stuff you pointed out!
> 
> This bug is still valid => when the agent infrastructure is disabled we
> don't need to worry about deleting the agent queue.

Ah I see, thanks for the explanation. So the registration_manager.rb#L130 should be called conditionally when the feature is enabled, only, I guess.

Comment 9 Sayan Das 2021-08-26 11:18:13 UTC
So now I can see that 10mins delay and map my action with it.

* Yesterday I had created a system from my sat 6.10

* Today, at "2021-08-26T16:30:11" i deleted the same system and it finished at 2021-08-26T16:30:18

* The traceback reported in BZ, popped up approx after 10 mins i.e. at "2021-08-26T16:40:13"

Comment 15 errata-xmlrpc 2021-11-16 14:13:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Satellite 6.10 Release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4702


Note You need to log in before you can comment on or make changes to this bug.