Bug 1440235 - candlepin event listener does not acknowledge every 100th message
Summary: candlepin event listener does not acknowledge every 100th message
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Subscription Management
Version: 6.2.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium (1 vote)
Target Milestone: Unspecified
Assignee: Justin Sherrill
QA Contact: Peter Ondrejka
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-04-07 16:04 UTC by Chris Duryee
Modified: 2022-07-09 09:21 UTC
CC: 18 users

Fixed In Version: rubygem-katello-3.0.0.161-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1530702 (view as bug list)
Environment:
Last Closed: 2018-02-05 13:54:34 UTC
Target Upstream Version:
Embargoed:


Attachments
patch (3.40 KB, patch)
2017-09-08 19:40 UTC, Justin Sherrill


Links
System ID Private Priority Status Summary Last Updated
Foreman Issue Tracker 20532 0 Normal Closed candlepin event listener does not release messages after error 2020-09-10 12:41:40 UTC
Red Hat Bugzilla 1479579 0 high CLOSED qpidd memory accumulation after ListenOnCandlepinEvents forgot to accept a message 2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution) 3145381 0 None None None 2017-08-09 08:51:08 UTC
Red Hat Product Errata RHSA-2018:0273 0 normal SHIPPED_LIVE Important: Red Hat Satellite 6 security, bug fix, and enhancement update 2018-02-08 00:35:29 UTC

Internal Links: 1479579

Description Chris Duryee 2017-04-07 16:04:30 UTC
Description of problem:

The candlepin event listener will call 'acknowledge' on messages that it processes, but does not call 'release' or 'reject' on messages that it is unable to process.

This can cause some messages to become stuck in the katello_event queue, since they are being held but will never be released.

The best behavior may be to just log the message and then reject it, so potentially bad messages do not get reprocessed over and over.
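The behavior the description asks for is that every consumed message reaches a terminal state: acknowledged on success, or logged and rejected on failure, so an unprocessable message is never left acquired-but-unacknowledged in the queue. A minimal Ruby sketch of that pattern follows; the class and method names are hypothetical and the broker calls are stubbed with arrays, not Katello's actual listener code (where acknowledge/reject would go to the qpid client).

```ruby
# Sketch: a listener that guarantees a terminal disposition per message.
class EventListener
  attr_reader :acked, :rejected

  def initialize
    @acked = []
    @rejected = []
  end

  # Consume one message; it is always either acked or rejected.
  def handle(message)
    process(message)
    acknowledge(message)
  rescue StandardError => e
    # Log and reject rather than release, so a poison message is not
    # redelivered and reprocessed over and over.
    warn "failed to process message #{message[:id]}: #{e.message}"
    reject(message)
  end

  private

  def process(message)
    raise "malformed body" if message[:body].nil?
  end

  # Stand-ins for the broker's acknowledge/reject calls.
  def acknowledge(message)
    @acked << message[:id]
  end

  def reject(message)
    @rejected << message[:id]
  end
end
```

With this structure a message that raises during processing is rejected instead of staying acquired, which is exactly the stuck-message symptom described above.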


Version-Release number of selected component (if applicable): 6.2.8


How reproducible: not sure how to reproduce yet


note: I put this under the 'hosts' component but I'm not sure if that is the best place for this to live. It is related to the ListenOnCandlepinEvents task.

Comment 4 Ivan Necas 2017-04-10 15:18:59 UTC
Could you describe a scenario in which this behavior happens, in terms of the state of the system, and give an example of the messages that lead to messages not being processed?

Comment 5 Chris Duryee 2017-04-10 15:22:44 UTC
It is difficult to reproduce outside of a production environment, but I think you can trigger it by getting candlepin to generate a large number of events and then, while the listener is working through them, restarting foreman-tasks a few times. It may help to put a longer sleep statement in the listener loop, so there is a longer delay between picking a message up and ACKing it.

Comment 8 Ivan Necas 2017-04-11 08:57:17 UTC
The delay between picking up and acking should not be the same as a missing release due to an error while processing (as it would eventually get acked). What we need is a backtrace/error message from the case when the message is not processed by Katello. Otherwise we are not able to help with this case. Putting needinfo back to the reporter, until we get this info from this customer or somebody else running into the same issue. From what I've seen in the code, I've not found an obvious place where this could happen.

Comment 9 Pavel Moravec 2017-04-11 09:43:46 UTC
One other note:
could the cause (of the increasing backlog of messages) be the limited throughput of 1 message per second? See https://bugzilla.redhat.com/show_bug.cgi?id=1399877 .

That BZ was fixed in 6.2.7; if the customer behind this BZ is on an older Sat release and sending lots of candlepin events, they can be affected by bz1399877 .

Comment 10 Justin Sherrill 2017-08-09 01:37:27 UTC
Created redmine issue http://projects.theforeman.org/issues/20532 from this bug

Comment 11 Pavel Moravec 2017-08-09 07:20:13 UTC
FYI, a consequence of this bug is a qpidd memory leak (present in all versions so far):

https://bugzilla.redhat.com/show_bug.cgi?id=1479579

Comment 12 Pavel Moravec 2017-08-09 07:54:17 UTC
(In reply to Chris Duryee from comment #5)
> It is difficult to reproduce outside of a production environment, but I
> think if you get candlepin to generate a large number of events and then
> while the listener is working through them, restart foreman-tasks a few
> times. It may help to put a longer sleep statement in the listener loop, so
> there's a longer delay between picking the message up and ACKing it.

I confirm this reproducer. A somewhat more straightforward way (an artificial reproduction, but useful for developers to emulate the bug):

1) stop the foreman-tasks service (so katello_event_queue builds up a backlog)
2) generate several hundred candlepin events (i.e. (un)register a Content Host with an activation key, or remove all subscriptions and attach a subscription pool back to another Host) - do that in a loop until katello_event_queue holds a few hundred messages
3) start the foreman-tasks service - leave step 2) _running_ (at least I did so; it may or may not be important)
4) once the LOCE task consumes the backlog, check whether katello_event_queue has zero queue depth (see #c7)
5) if some messages remain constantly acquired but never acknowledged, you have reproduced it. Otherwise, go back to step 1).
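The steps above can be sketched as a shell session. This is a hypothetical sketch, not a verified script: the organization name, activation key, and event count are placeholders, and the qpid-stat certificate path follows the layout commonly found on Satellite 6.2 servers, so adjust all of them to your installation.

```shell
# Step 1: stop foreman-tasks so katello_event_queue accumulates a backlog
systemctl stop foreman-tasks

# Step 2: generate a few hundred candlepin events by (un)registering a
# content host in a loop (org and activation key names are placeholders)
for i in $(seq 1 300); do
    subscription-manager register --org="Default_Organization" \
        --activationkey="test-ak" --force > /dev/null
    subscription-manager unregister > /dev/null
done

# Step 3: restart the listener so the LOCE task starts draining the backlog
systemctl start foreman-tasks

# Step 4: inspect the queue depth; stuck messages show up as a depth that
# never reaches zero even after the backlog has been consumed
qpid-stat -q -b "amqps://localhost:5671" \
    --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt \
    | grep katello_event_queue
```

Repeating the register/unregister loop while foreman-tasks is already running (step 3 above) matches the reporter's note that the race is easier to hit when events keep arriving during the drain.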

Comment 14 Satellite Program 2017-08-09 20:11:46 UTC
Upstream bug assigned to jsherril

Comment 15 Satellite Program 2017-08-09 20:11:53 UTC
Upstream bug assigned to jsherril

Comment 17 Satellite Program 2017-09-07 22:11:44 UTC
Moving this bug to POST for triage into Satellite 6 since the upstream issue http://projects.theforeman.org/issues/20532 has been resolved.

Comment 18 Justin Sherrill 2017-09-08 19:40:51 UTC
Created attachment 1323878 [details]
patch

Comment 24 Peter Ondrejka 2018-01-11 14:27:09 UTC
Verified on 6.2.14 using the steps from comment #12; no stuck messages were found in the queue after processing hundreds of requests.

Comment 27 errata-xmlrpc 2018-02-05 13:54:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0273

