1570534 – in case "event" exchange is missing, connection leak, not suspended mode and no error log

Red Hat Satellite engineering is moving the tracking of its product development work on Satellite to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "Satellite project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs will be migrated starting at the end of May. If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "Satellite project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/SAT-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1570534 - in case "event" exchange is missing, connection leak, not suspended mode and no error log

Summary: in case "event" exchange is missing, connection leak, not suspended mode and ...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Satellite
Classification:	Red Hat
Component:	Candlepin
Sub Component:
Version:	6.3.1
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	Unspecified
Assignee:	satellite6-bugs
QA Contact:	Perry Gagne
Docs Contact:
URL:
Whiteboard:
Depends On:	1631735
Blocks:
TreeView+	depends on / blocked

Reported:	2018-04-23 07:03 UTC by Pavel Moravec
Modified:	2021-12-10 16:01 UTC (History)
CC List:	14 users (show)
Fixed In Version:	candlepin-2.4.13-1, candlepin-2.5.10-1
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1570796 1631735 (view as bug list)
Environment:
Last Closed:	2019-07-12 14:07:28 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	3632401	0	None	None	None	2018-09-30 15:43:57 UTC
Red Hat Knowledge Base (Solution)	3830041	0	None	None	None	2019-01-25 09:09:08 UTC

Description Pavel Moravec 2018-04-23 07:03:55 UTC

Description of problem:
(this is based on a real customer story though on older Satellite version)

If "event" exchange is deleted (e.g. after purging away whole /var/lib/qpidd or due to whatever user error) and candlepin is about to send an event message to qpidd, the candlepin behaves wrongly in 3 ways:

1) it creates extra TCP connection(s) to qpidd (I bet old connections not closed properly when producer detects 'event' address is not resolvable to a queue or exchange) - that is a connection and FD leak that affects both candlepin and also qpidd (that runs out of 500 connections after a short time)

2) candlepin.log contains proper error that IMHO should be copied also to error.log (how a support associate can assume errors are logged in candlepin log but not in error log?)

3) candlepin should go to suspend mode/state. Since without 'event' exchange, it can't send messages to qpidd like in QPID_FLOW_STOPPED case.


Version-Release number of selected component (if applicable):
candlepin-2.1.15-1.el7.noarch
Satellite 6.3.1


How reproducible:
100%


Steps to Reproduce:
1. Have running Satellite
2. Delete 'event' exchange to mimic some qpid/config issue:
qpid-config --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 del exchange event --durable
3. Register a system, add a subscription etc.
4. Check /var/log/candlepin/error.log
5. Check the system status in WebUI
6. Check number of TCP connections between candlepin and qpidd:
netstat -anp | grep 5671 | grep ESTABLISHED | grep -c java


Actual results:
3. succeeds without any warning/error
4. no error log
5. subscription info about the system is not updated in WebUI (b'cos katello doesnt get updates via qpidd from candlepin)
6. more than 1 TCP connection created (each event message failed to be sent creates 2-3 leaked TCP connections, from experiments)


Expected results:
3. subscription requests should fail with "Candlepin is in Suspend mode"
4. error log to contain the same error / backtrace like candlepin.log has
5. subscription info isnt updated in WebUI as well, but because the action hasnt happened on the client (i.e. no mismatch of system's subscription and what WebUI shows)
6. no TCP connection leak, there should be 1 TCP connection between qpidd<->candlepin only


Additional info:
candlepin.log has this entry when trying to send a message to deleted 'event' exchange:

2018-04-23 08:57:57,940 [thread=Thread-16 (HornetQ-client-global-threads-315890771)] [=, org=, csid=] ERROR org.hornetq.core.client - HQ214000: Failed to call onMessage
java.lang.RuntimeException: Error sending event to message bus
	at org.candlepin.audit.AMQPBusPublisher.onEvent(AMQPBusPublisher.java:46) ~[AMQPBusPublisher.class:na]
	at org.candlepin.audit.ListenerWrapper.onMessage(ListenerWrapper.java:50) ~[ListenerWrapper.class:na]
	at org.hornetq.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1116) ~[hornetq-core-client-2.4.7.Final.jar:na]
	at org.hornetq.core.client.impl.ClientConsumerImpl.access$500(ClientConsumerImpl.java:56) ~[hornetq-core-client-2.4.7.Final.jar:na]
	at org.hornetq.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1251) ~[hornetq-core-client-2.4.7.Final.jar:na]
	at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:104) [hornetq-core-client-2.4.7.Final.jar:na]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152) [na:1.7.0_171]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622) [na:1.7.0_171]
	at java.lang.Thread.run(Thread.java:748) [na:1.7.0_171]
Caused by: java.lang.RuntimeException: Error sending event to message bus
	at org.candlepin.audit.QpidConnection.sendTextMessage(QpidConnection.java:168) ~[QpidConnection.class:na]
	at org.candlepin.audit.AMQPBusPublisher.onEvent(AMQPBusPublisher.java:43) ~[AMQPBusPublisher.class:na]
	... 8 common frames omitted
Caused by: javax.jms.JMSException: Exception when sending message:ch=0 id=0 ExecutionException(errorCode=NOT_FOUND, commandId=0, classCode=0, commandCode=0, fieldIndex=0, description=not-found: Exchange not found: event (/builddir/build/BUILD/qpid-cpp-1.36.0/src/qpid/broker/ExchangeRegistry.cpp:144), errorInfo={})
	at org.apache.qpid.client.BasicMessageProducer_0_10.sendMessage(BasicMessageProducer_0_10.java:260) ~[qpid-client-0.32.jar:0.32]
	at org.apache.qpid.client.BasicMessageProducer.sendImpl(BasicMessageProducer.java:528) ~[qpid-client-0.32.jar:0.32]
	at org.apache.qpid.client.BasicMessageProducer.send(BasicMessageProducer.java:326) ~[qpid-client-0.32.jar:0.32]
	at org.apache.qpid.client.TopicPublisherAdapter.send(TopicPublisherAdapter.java:120) ~[qpid-client-0.32.jar:0.32]
	at org.candlepin.audit.QpidConnection.sendTextMessage(QpidConnection.java:157) ~[QpidConnection.class:na]
	... 9 common frames omitted
Caused by: org.apache.qpid.transport.SessionException: ch=0 id=0 ExecutionException(errorCode=NOT_FOUND, commandId=0, classCode=0, commandCode=0, fieldIndex=0, description=not-found: Exchange not found: event (/builddir/build/BUILD/qpid-cpp-1.36.0/src/qpid/broker/ExchangeRegistry.cpp:144), errorInfo={})
	at org.apache.qpid.transport.Session.sync(Session.java:876) ~[qpid-common-0.32.jar:0.32]
	at org.apache.qpid.transport.Session.sync(Session.java:837) ~[qpid-common-0.32.jar:0.32]
	at org.apache.qpid.client.BasicMessageProducer_0_10.sendMessage(BasicMessageProducer_0_10.java:253) ~[qpid-client-0.32.jar:0.32]
	... 13 common frames omitted

Comment 1 Pavel Moravec 2018-04-23 08:00:13 UTC

Btw the connection is is present also in situation, when:

- candlepin went to suspended state (due to katello_event_queue almost full / messages sent to 'event' exchange are timeouting)
- recovering to resume normal operation
- sending new event - a new TCP connection is created

Again, I suspect the connection where AMQP exception/error happened (timeout here) was not closed properly.


Reproducer:
- to speedup reproducer, add

default-queue-limit=100000

to /etc/qpid/qpidd.conf

(such that katello_event_queue gets full much faster)

- netstat -anp | grep 5671 | grep java
- service foreman-tasks stop
- re-register a system 10 times (that generates 10 events where the queue will enqueue just 6 before candlepin goes to suspended state
- service foreman-tasks start
- wait till katello_event_queue is empty ( qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt  -b amqps://localhost:5671 -q katello_event_queue )
- wait till candlepin resumes operations / not in suspended state
- re-register a system (will work now)
- netstat shows extra TCP connection (candlepin forgot to drop the old one)

Comment 4 Pavel Moravec 2018-06-23 07:41:23 UTC

I was asked to clarify what is requested to change about logging.

This bug is not just about logging error, but:

- primarily about TCP/IP connection leak in 2 similar situations (when 'event' exchange is gone, and when qpidd doesnt accept messages due to queue full)

- minor issue is the error "HQ214000: Failed to call onMessage
java.lang.RuntimeException: Error sending event to message bus" should be printed not only to candlepin.log, but I expect to error.log as well (at least it is my understanding the error.log should contain *all* error logs, but I might be wrong with this expected specification)

Comment 6 Kevin Howell 2018-09-17 19:30:32 UTC

Some context for this, there are a number of changes to the HornetQ/Artemis setup for Candlepin:

 * Candlepin now better tracks Qpid connection state and clean up stale connections (via https://github.com/candlepin/candlepin/pull/1927 )
 * Candlepin now handles Qpid connection failure without message loss (via https://github.com/candlepin/candlepin/pull/1951 )
 * Candlepin now ensures order of messages when failure happens (via https://github.com/candlepin/candlepin/pull/1982 )

I believe that comment 0 has good reproducer steps to show changes in behavior for the first two changes (and the third is unrelated).

Comment 15 Pavel Moravec 2018-11-02 12:06:15 UTC

Another scenario: LOCE task down, candlepin sending event till katello_event_queue completely full:

2018-11-02 04:29:42 [Broker] error Execution exception: resource-limit-exceeded: Maximum depth exceeded on katello_event_queue: current=[count: 1041, size: 104735244], max=[size: 104857600] (/builddir/build/BUILD/qpid-cpp-1.36.0/src/qpid/broker/Queue.cpp:1662)

and those errors leave same connection leaks behind..

Comment 18 Pavel Moravec 2019-01-22 09:48:05 UTC

Expected but not confirmed reproducer:

- have many durable queues
- restart qpidd
  - during this restart, recovering the journal files for the durable queues would take more minutes
- meantime, candlepin will try to (re)connect to qpidd - such connection will succeed on TCP layer but not on AMQP level, with *expected* outcome in same connection leak like behind this BZ

Comment 25 Mike McCune 2019-06-12 14:04:34 UTC

looks like this is closed in 6.5 and going to CLOSED:CURRENTRELEASE with the inclusion of candlepin-2.5.15-1.el7sat.noarch

If this re-occurs in 6.5 or later, feel free to re-open.

Comment 26 Taft Sanders 2019-07-11 22:58:03 UTC

Reopening for backport request to Satellite 6.3.5

Note You need to log in before you can comment on or make changes to this bug.