Bug 1570796

Summary: in case "event" exchange is missing, connection leak, not suspended mode and no error log
Product: [Community] Candlepin Reporter: Barnaby Court <bcourt>
Component: candlepinAssignee: candlepin-bugs
Status: CLOSED CURRENTRELEASE QA Contact: Katello QA List <katello-qa-list>
Severity: high Docs Contact:
Priority: high    
Version: 0.9.54CC: crog, katello-qa-list, khowell, pmoravec, redakkan, skallesh
Target Milestone: ---Keywords: Triaged
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: candlepin-2.4.1-1 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1570534 Environment:
Last Closed: 2018-06-26 15:42:14 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Barnaby Court 2018-04-23 12:19:50 UTC
1 & 3 should be fixed in hornet-improvements branch only. #2 (logging) is not

+++ This bug was initially created as a clone of Bug #1570534 +++

Description of problem:
(this is based on a real customer story though on older Satellite version)

If "event" exchange is deleted (e.g. after purging away whole /var/lib/qpidd or due to whatever user error) and candlepin is about to send an event message to qpidd, the candlepin behaves wrongly in 3 ways:

1) it creates extra TCP connection(s) to qpidd (I bet old connections not closed properly when producer detects 'event' address is not resolvable to a queue or exchange) - that is a connection and FD leak that affects both candlepin and also qpidd (that runs out of 500 connections after a short time)

2) candlepin.log contains proper error that IMHO should be copied also to error.log (how a support associate can assume errors are logged in candlepin log but not in error log?)

3) candlepin should go to suspend mode/state. Since without 'event' exchange, it can't send messages to qpidd like in QPID_FLOW_STOPPED case.


Version-Release number of selected component (if applicable):
candlepin-2.1.15-1.el7.noarch
Satellite 6.3.1


How reproducible:
100%


Steps to Reproduce:
1. Have running Satellite
2. Delete 'event' exchange to mimic some qpid/config issue:
qpid-config --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 del exchange event --durable
3. Register a system, add a subscription etc.
4. Check /var/log/candlepin/error.log
5. Check the system status in WebUI
6. Check number of TCP connections between candlepin and qpidd:
netstat -anp | grep 5671 | grep ESTABLISHED | grep -c java


Actual results:
3. succeeds without any warning/error
4. no error log
5. subscription info about the system is not updated in WebUI (b'cos katello doesnt get updates via qpidd from candlepin)
6. more than 1 TCP connection created (each event message failed to be sent creates 2-3 leaked TCP connections, from experiments)


Expected results:
3. subscription requests should fail with "Candlepin is in Suspend mode"
4. error log to contain the same error / backtrace like candlepin.log has
5. subscription info isnt updated in WebUI as well, but because the action hasnt happened on the client (i.e. no mismatch of system's subscription and what WebUI shows)
6. no TCP connection leak, there should be 1 TCP connection between qpidd<->candlepin only


Additional info:
candlepin.log has this entry when trying to send a message to deleted 'event' exchange:

2018-04-23 08:57:57,940 [thread=Thread-16 (HornetQ-client-global-threads-315890771)] [=, org=, csid=] ERROR org.hornetq.core.client - HQ214000: Failed to call onMessage
java.lang.RuntimeException: Error sending event to message bus
	at org.candlepin.audit.AMQPBusPublisher.onEvent(AMQPBusPublisher.java:46) ~[AMQPBusPublisher.class:na]
	at org.candlepin.audit.ListenerWrapper.onMessage(ListenerWrapper.java:50) ~[ListenerWrapper.class:na]
	at org.hornetq.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1116) ~[hornetq-core-client-2.4.7.Final.jar:na]
	at org.hornetq.core.client.impl.ClientConsumerImpl.access$500(ClientConsumerImpl.java:56) ~[hornetq-core-client-2.4.7.Final.jar:na]
	at org.hornetq.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1251) ~[hornetq-core-client-2.4.7.Final.jar:na]
	at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:104) [hornetq-core-client-2.4.7.Final.jar:na]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152) [na:1.7.0_171]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622) [na:1.7.0_171]
	at java.lang.Thread.run(Thread.java:748) [na:1.7.0_171]
Caused by: java.lang.RuntimeException: Error sending event to message bus
	at org.candlepin.audit.QpidConnection.sendTextMessage(QpidConnection.java:168) ~[QpidConnection.class:na]
	at org.candlepin.audit.AMQPBusPublisher.onEvent(AMQPBusPublisher.java:43) ~[AMQPBusPublisher.class:na]
	... 8 common frames omitted
Caused by: javax.jms.JMSException: Exception when sending message:ch=0 id=0 ExecutionException(errorCode=NOT_FOUND, commandId=0, classCode=0, commandCode=0, fieldIndex=0, description=not-found: Exchange not found: event (/builddir/build/BUILD/qpid-cpp-1.36.0/src/qpid/broker/ExchangeRegistry.cpp:144), errorInfo={})
	at org.apache.qpid.client.BasicMessageProducer_0_10.sendMessage(BasicMessageProducer_0_10.java:260) ~[qpid-client-0.32.jar:0.32]
	at org.apache.qpid.client.BasicMessageProducer.sendImpl(BasicMessageProducer.java:528) ~[qpid-client-0.32.jar:0.32]
	at org.apache.qpid.client.BasicMessageProducer.send(BasicMessageProducer.java:326) ~[qpid-client-0.32.jar:0.32]
	at org.apache.qpid.client.TopicPublisherAdapter.send(TopicPublisherAdapter.java:120) ~[qpid-client-0.32.jar:0.32]
	at org.candlepin.audit.QpidConnection.sendTextMessage(QpidConnection.java:157) ~[QpidConnection.class:na]
	... 9 common frames omitted
Caused by: org.apache.qpid.transport.SessionException: ch=0 id=0 ExecutionException(errorCode=NOT_FOUND, commandId=0, classCode=0, commandCode=0, fieldIndex=0, description=not-found: Exchange not found: event (/builddir/build/BUILD/qpid-cpp-1.36.0/src/qpid/broker/ExchangeRegistry.cpp:144), errorInfo={})
	at org.apache.qpid.transport.Session.sync(Session.java:876) ~[qpid-common-0.32.jar:0.32]
	at org.apache.qpid.transport.Session.sync(Session.java:837) ~[qpid-common-0.32.jar:0.32]
	at org.apache.qpid.client.BasicMessageProducer_0_10.sendMessage(BasicMessageProducer_0_10.java:253) ~[qpid-client-0.32.jar:0.32]
	... 13 common frames omitted

--- Additional comment from Pavel Moravec on 2018-04-23 04:00:13 EDT ---

Btw the connection is is present also in situation, when:

- candlepin went to suspended state (due to katello_event_queue almost full / messages sent to 'event' exchange are timeouting)
- recovering to resume normal operation
- sending new event - a new TCP connection is created

Again, I suspect the connection where AMQP exception/error happened (timeout here) was not closed properly.


Reproducer:
- to speedup reproducer, add

default-queue-limit=100000

to /etc/qpid/qpidd.conf

(such that katello_event_queue gets full much faster)

- netstat -anp | grep 5671 | grep java
- service foreman-tasks stop
- re-register a system 10 times (that generates 10 events where the queue will enqueue just 6 before candlepin goes to suspended state
- service foreman-tasks start
- wait till katello_event_queue is empty ( qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt  -b amqps://localhost:5671 -q katello_event_queue )
- wait till candlepin resumes operations / not in suspended state
- re-register a system (will work now)
- netstat shows extra TCP connection (candlepin forgot to drop the old one)

Comment 1 Chris "Ceiu" Rog 2018-06-26 15:32:00 UTC
As of 2.5.5, Candlepin will output the following to error.log when a messaging or connection error occurs:

2018-06-20 15:08:31,834 [thread=IoReceiver - localhost/127.0.0.1:5671] [=, org=, csid=] ERROR org.apache.qpid.client.AMQConnection - Throwable Received but no listener set: org.apache.qpid.AMQException: ch=0 id=0 ExecutionExczeption(errorCode=NOT_FOUND, commandId=972, classCode=0, commandCode=0, fieldIndex=0, description=not-found: Exchange not found: event (/builddir/build/BUILD/qpid-cpp-1.36.0/src/qpid/broker/ExchangeRegistry.cpp:144), errorInfo={}) [error code 404: not found]
2018-06-20 15:08:31,841 [thread=Thread-14 (ActiveMQ-client-global-threads)] [=, org=, csid=] ERROR org.candlepin.audit.QpidEventMessageReceiver - Unable to process message 7781[7780]: Exception when sending message:ch=0 id=0 ExecutionException(errorCode=NOT_FOUND, commandId=972, classCode=0, commandCode=0, fieldIndex=0, description=not-found: Exchange not found: event (/builddir/build/BUILD/qpid-cpp-1.36.0/src/qpid/broker/ExchangeRegistry.cpp:144), errorInfo={})
2018-06-20 15:08:31,851 [thread=Thread-14 (ActiveMQ-client-global-threads)] [=, org=, csid=] ERROR org.candlepin.audit.QpidEventMessageReceiver - Unable to process message 7783[7782]: Session has been closed

While these are lacking the stack trace (by design), that can be pulled from candlepin.log in the event it is required.

Is this sufficient for addressing point #2?