Bug 2124419 - Jobs pushed in MQTT queue is not delivered if yggdrasild was not running and communicating with the right broker before the jobs were pushed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Remote Execution
Version: 6.12.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 6.13.0
Assignee: Adam Ruzicka
QA Contact: Peter Ondrejka
URL:
Whiteboard:
Depends On:
Blocks: 2124287
 
Reported: 2022-09-06 06:01 UTC by Sayan Das
Modified: 2023-05-03 13:21 UTC (History)
CC List: 8 users

Fixed In Version: rubygem-foreman_remote_execution-8.1.0, rubygem-smart_proxy_remote_execution_ssh-0.9.0
Doc Type: Known Issue
Doc Text:
Clone Of: 2124287
Environment:
Last Closed: 2023-05-03 13:21:46 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker | SAT-14272 | 0 | None | None | None | 2022-12-07 12:22:42 UTC
Red Hat Product Errata | RHSA-2023:2097 | 0 | None | None | None | 2023-05-03 13:21:59 UTC

Description Sayan Das 2022-09-06 06:01:01 UTC
+++ This bug was initially created as a clone of Bug #2124287 +++

The problem / reproducer:

1) stop yggdrasild on a client:
systemctl stop yggdrasild

2) invoke a pull-mqtt REX job to the client

3) start the service back:
systemctl start yggdrasild

4) check status of the job - it will be hung forever

(the reason is yggd start drops all incoming notifications)
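
The failure mode in the reproducer can be illustrated with a toy model. This is only a sketch, assuming that a notification published while the client is not subscribed is never stored for later delivery. `ToyBroker`, `reproduce`, and the topic name are hypothetical and are not taken from yggdrasil or mosquitto.

```python
class ToyBroker:
    """In-memory stand-in for a broker that keeps nothing for offline clients."""

    def __init__(self):
        self.subscribers = {}  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, payload):
        # Deliver only to clients subscribed right now; nothing is queued.
        listeners = self.subscribers.get(topic, [])
        for callback in listeners:
            callback(payload)
        return len(listeners)


def reproduce():
    broker = ToyBroker()
    topic = "hosts/client-1/jobs"  # hypothetical topic name
    received = []

    # 1) The client service is stopped, so there is no subscription yet.
    # 2) A job is invoked: exactly one notification is published.
    delivered = broker.publish(topic, "job-42")
    # 3) The service is started and subscribes, but only after the publish.
    broker.subscribe(topic, received.append)
    # 4) The client never saw the notification, so the job stays pending.
    return delivered, received


if __name__ == "__main__":
    print(reproduce())
```

In this model the single notification reaches zero listeners, which matches the observation that the job never progresses once the service is back.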


We *need* to document this *somehow*. Not sure about the most appropriate way, if:
- in install/config guide with pull-mqtt config description, as a warning
- in release notes (where I assume the pull-mqtt feature will be mentioned either way), again as a warning
- as a KCS (that I am happy to write, but I feel it is rather insufficient post-mortem documentation)

--- Additional comment from Pavel Moravec on 2022-09-05 14:16:57 UTC ---

See:

https://issues.redhat.com/browse/SAT-11349  As a user, I expect jobs which were scheduled when a host was down to be processed once the machine comes up
https://issues.redhat.com/browse/SAT-7337 Tasks delivered during yggd start are dropped 

for the "missing feature behaviour"

Recall that katello-agent works well here: its jobs survive goferd and also qpidd restarts, and customers would see this as a regression. Let us at least notify them about it.

Brad, who shall decide about the best place to document it?

--- Additional comment from Sayan Das on 2022-09-05 17:23:49 UTC ---

+1 for :

~~

3) start the service back:
systemctl start yggdrasild

4) check status of the job - it will be hung forever

(the reason is yggd start drops all incoming notifications)

~~


I think this is a major flaw in mosquitto + yggdrasild communication, i.e. if there is a disturbance and yggdrasild was down when the job was pushed to the smart-proxy, that job will remain hung forever even if yggdrasild is brought back online.

I faced this issue a number of times when I was playing around with "Change Content Source" + pull-mqtt, before and after.

The hung job is not even cancellable; it needs to be "Aborted". I feel that, apart from documentation, it should be considered a product bug as well, as big customers will bite us back later for this behavior.


CC'ing aruzicka

--- Additional comment from Brad Buckingham on 2022-09-05 22:19:53 UTC ---

I'd recommend this be documented as a Known Issue in the Release Notes, unless it is resolved prior to GA.  It could also be in a KCS for those that may miss the note.

Is there a related bugzilla to address the behavior with the workflow?

I agree that it would be viewed as a regression in behavior when moving to the REX Pull-Provider.  We had to go through similar growing pains with katello-agent to make it more resilient.

Comment 1 Sayan Das 2022-09-06 06:03:58 UTC
This bug should be treated as a product bug to fix the behavior described in https://bugzilla.redhat.com/show_bug.cgi?id=2124287 ( https://github.com/RedHatInsights/yggdrasil/issues/82 )

Comment 2 Adam Ruzicka 2022-09-07 14:23:32 UTC
I agree that this should get fixed eventually, but I don't see us fixing it in 6.12. If this does not get fixed in yggdrasil itself, we have a workaround in the works that should land in 6.13.

Comment 7 Adam Ruzicka 2022-12-07 12:20:44 UTC
This should be fixed in foreman_remote_execution-8.1.0 and smart_proxy_remote_execution_ssh-0.9.0 in the spirit of [1] and [2].

The MQTT notification is re-sent every 15 minutes (configurable in /etc/foreman-proxy/settings.d/remote_execution_ssh.yml under the mqtt_resend_interval key). There is also a Satellite-wide setting (overridable per job) for the time to pickup. If the host does not pick up the job within that time interval, the job fails.
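
The interplay between the resend interval and the time to pickup can be sketched on a simulated clock. This is an illustrative model only, not the plugin's code: `simulate_delivery` and its parameters are hypothetical, all values are assumed to be in seconds, the 900 default mirrors the 15 minutes mentioned above, and the 86400 default for time to pickup is an assumption.

```python
def simulate_delivery(host_online_at, resend_interval=900, time_to_pickup=86400):
    """Return ("picked_up", t) or ("failed", t) for a job created at t=0.

    All times are seconds on a simulated clock. The host can receive a
    notification only if it is already online when the notification is sent.
    """
    t = 0
    while t < time_to_pickup:
        # A notification is sent at t = 0, resend_interval, 2 * resend_interval, ...
        if t >= host_online_at:
            return ("picked_up", t)
        t += resend_interval
    # No notification reached the host before the pickup deadline.
    return ("failed", time_to_pickup)


if __name__ == "__main__":
    # Host comes back 1000 s after the job was created.
    print(simulate_delivery(host_online_at=1000))
    # Host comes back only after a one-hour pickup deadline.
    print(simulate_delivery(host_online_at=4000, time_to_pickup=3600))
```

Under these assumptions, a host that returns between two notifications waits until the next resend, and a host that returns after the deadline finds the job already failed.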

[1] - https://issues.redhat.com/browse/SAT-1668
[2] - https://issues.redhat.com/browse/SAT-11349

Comment 8 Peter Ondrejka 2023-01-06 14:33:22 UTC
Verified on Satellite 6.13 snap 5: pending jobs are successfully picked up after yggdrasil starts on the host, and the resend interval can be set as expected.

Comment 11 errata-xmlrpc 2023-05-03 13:21:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Satellite 6.13 Release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2097

