Bug 1440235
| Summary: | candlepin event listener does not acknowledge every 100th message | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Chris Duryee <cduryee> |
| Component: | Subscription Management | Assignee: | Justin Sherrill <jsherril> |
| Status: | CLOSED ERRATA | QA Contact: | Peter Ondrejka <pondrejk> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.2.8 | CC: | andrewrpalmer, andrew.schofield, bbuckingham, bkearney, cduryee, egolov, ehelms, inecas, jcallaha, jentrena, jorge_martinez, kabbott, lzap, mmccune, pmoravec, sthirugn, wpinheir, xdmoon |
| Target Milestone: | Unspecified | Keywords: | FieldEngineering, PrioBumpField, PrioBumpGSS, Triaged |
| Target Release: | Unused | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | rubygem-katello-3.0.0.161-1 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1530702 (view as bug list) | Environment: | |
| Last Closed: | 2018-02-05 13:54:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Chris Duryee
2017-04-07 16:04:30 UTC
Could you describe a scenario in which this behavior happens, in terms of the state of the system and an example of the messages that lead to messages not being processed?

It is difficult to reproduce outside of a production environment, but I think it can be triggered if you get candlepin to generate a large number of events and then, while the listener is working through them, restart foreman-tasks a few times. It may help to put a longer sleep statement in the listener loop, so there's a longer delay between picking the message up and ACKing it.

The delay between picking up and acking should not be the same thing as a missed release due to an error while processing (as the message would eventually get acked). What we need is a backtrace/error message from the case when the message is not processed by Katello; otherwise we are not able to help with this case. Putting needinfo back on the reporter until we get this info from this customer or somebody else running into the same issue. From what I've seen in the code, I've not found an obvious place where this could happen.

One other note: couldn't the cause (of the increasing backlog of messages) be the limited throughput of 1 message per second? See https://bugzilla.redhat.com/show_bug.cgi?id=1399877. That BZ was fixed in 6.2.7; if the customer behind this BZ is on an older Sat release and sending lots of candlepin events, they can be affected by bz1399877.

Created redmine issue http://projects.theforeman.org/issues/20532 from this bug.

FYI, a consequence of this bug is a qpidd memory leak (so far in any version): https://bugzilla.redhat.com/show_bug.cgi?id=1479579

(In reply to Chris Duryee from comment #5)
> It is difficult to reproduce outside of a production environment, but I
> think it can be triggered if you get candlepin to generate a large number
> of events and then, while the listener is working through them, restart
> foreman-tasks a few times. It may help to put a longer sleep statement in
> the listener loop, so there's a longer delay between picking the message up
> and ACKing it.

I confirm this reproducer. A somewhat more straightforward way (just an artificial repro, but worth it for developers wanting to emulate the bug; an illustrative sketch of the acquire/ack window these steps exercise follows after this comment thread):

1) stop the foreman-tasks service (to populate katello_event_queue a bit)
2) generate several hundred candlepin events (e.g. (un)register a Content Host with an activation key, or remove all subscriptions and attach a subscription pool back to another Host) - do that in a loop until katello_event_queue holds a few hundred messages
3) start the foreman-tasks service - leave step 2) _running_ (at least I did so; it might or might not be important)
4) once the LOCE task consumes the backlog, check whether katello_event_queue has a queue depth of zero (see #c7)
5) if some messages remain constantly acquired but not acknowledged, you have reproduced it; otherwise, go back to step 1)

Upstream bug assigned to jsherril

Upstream bug assigned to jsherril

Moving this bug to POST for triage into Satellite 6 since the upstream issue http://projects.theforeman.org/issues/20532 has been resolved.

Created attachment 1323878 [details]
patch
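The reproducer above turns on the window between a consumer acquiring a message and acknowledging it. Below is a minimal, self-contained Ruby sketch of such a fetch/process/ack loop; it is not Katello's actual listener code, and the `FakeQueue` class, its method names, and the `sleep`/`puts` stand-ins are assumptions introduced purely for illustration. It shows why an artificial sleep (as suggested in comment #5) widens the window in which restarting foreman-tasks leaves a message acquired but never acknowledged, so it stays stuck in katello_event_queue.

```ruby
# Hypothetical sketch only -- not Katello's real listener. FakeQueue stands in
# for the qpid queue so the acquire-vs-acknowledge window can be shown end to end.

class FakeQueue
  def initialize(messages)
    @pending  = messages.dup   # messages waiting to be delivered
    @acquired = []             # delivered to a consumer but not yet acknowledged
  end

  def receive
    msg = @pending.shift
    @acquired << msg if msg    # once received, the message counts as "acquired"
    msg
  end

  def acknowledge(msg)
    @acquired.delete(msg)      # only an explicit ack removes the message for good
  end

  def unacknowledged
    @acquired                  # what a crash or restart would leave stuck behind
  end
end

queue = FakeQueue.new(%w[event-1 event-2 event-3])

while (message = queue.receive)
  sleep 1                      # artificial delay, per the reproducer suggestion,
                               # widening the window where a restart hurts
  puts "processing #{message}" # stand-in for creating/updating Katello records
  queue.acknowledge(message)   # skipped entirely if the process is killed mid-loop
end

puts "stuck: #{queue.unacknowledged.inspect}"  # non-empty only if the loop died early
```

Run as a plain Ruby script this drains the fake queue cleanly; killing it between `receive` and `acknowledge` (or raising inside the loop) leaves entries in `unacknowledged`, which is the acquired-but-not-acknowledged state the reproducer checks for in step 5).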
Verified on 6.2.14 using steps from c#12; no stuck messages found in the queue after processing hundreds of requests.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0273