Bug 1570534 - in case "event" exchange is missing, connection leak, not suspended mode and no error log
Summary: in case "event" exchange is missing, connection leak, not suspended mode and ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Candlepin
Version: 6.3.1
Hardware: x86_64
OS: Linux
high
high
Target Milestone: Unspecified
Assignee: satellite6-bugs
QA Contact: Perry Gagne
URL:
Whiteboard:
Depends On: 1631735
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-04-23 07:03 UTC by Pavel Moravec
Modified: 2019-07-12 14:07 UTC (History)
14 users (show)

Fixed In Version: candlepin-2.4.13-1, candlepin-2.5.10-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1570796 1631735 (view as bug list)
Environment:
Last Closed: 2019-07-12 14:07:28 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3632401 0 None None None 2018-09-30 15:43:57 UTC
Red Hat Knowledge Base (Solution) 3830041 0 None None None 2019-01-25 09:09:08 UTC

Description Pavel Moravec 2018-04-23 07:03:55 UTC
Description of problem:
(this is based on a real customer story though on older Satellite version)

If "event" exchange is deleted (e.g. after purging away whole /var/lib/qpidd or due to whatever user error) and candlepin is about to send an event message to qpidd, the candlepin behaves wrongly in 3 ways:

1) it creates extra TCP connection(s) to qpidd (I bet old connections not closed properly when producer detects 'event' address is not resolvable to a queue or exchange) - that is a connection and FD leak that affects both candlepin and also qpidd (that runs out of 500 connections after a short time)

2) candlepin.log contains proper error that IMHO should be copied also to error.log (how a support associate can assume errors are logged in candlepin log but not in error log?)

3) candlepin should go to suspend mode/state. Since without 'event' exchange, it can't send messages to qpidd like in QPID_FLOW_STOPPED case.


Version-Release number of selected component (if applicable):
candlepin-2.1.15-1.el7.noarch
Satellite 6.3.1


How reproducible:
100%


Steps to Reproduce:
1. Have running Satellite
2. Delete 'event' exchange to mimic some qpid/config issue:
qpid-config --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 del exchange event --durable
3. Register a system, add a subscription etc.
4. Check /var/log/candlepin/error.log
5. Check the system status in WebUI
6. Check number of TCP connections between candlepin and qpidd:
netstat -anp | grep 5671 | grep ESTABLISHED | grep -c java


Actual results:
3. succeeds without any warning/error
4. no error log
5. subscription info about the system is not updated in WebUI (b'cos katello doesnt get updates via qpidd from candlepin)
6. more than 1 TCP connection created (each event message failed to be sent creates 2-3 leaked TCP connections, from experiments)


Expected results:
3. subscription requests should fail with "Candlepin is in Suspend mode"
4. error log to contain the same error / backtrace like candlepin.log has
5. subscription info isnt updated in WebUI as well, but because the action hasnt happened on the client (i.e. no mismatch of system's subscription and what WebUI shows)
6. no TCP connection leak, there should be 1 TCP connection between qpidd<->candlepin only


Additional info:
candlepin.log has this entry when trying to send a message to deleted 'event' exchange:

2018-04-23 08:57:57,940 [thread=Thread-16 (HornetQ-client-global-threads-315890771)] [=, org=, csid=] ERROR org.hornetq.core.client - HQ214000: Failed to call onMessage
java.lang.RuntimeException: Error sending event to message bus
	at org.candlepin.audit.AMQPBusPublisher.onEvent(AMQPBusPublisher.java:46) ~[AMQPBusPublisher.class:na]
	at org.candlepin.audit.ListenerWrapper.onMessage(ListenerWrapper.java:50) ~[ListenerWrapper.class:na]
	at org.hornetq.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1116) ~[hornetq-core-client-2.4.7.Final.jar:na]
	at org.hornetq.core.client.impl.ClientConsumerImpl.access$500(ClientConsumerImpl.java:56) ~[hornetq-core-client-2.4.7.Final.jar:na]
	at org.hornetq.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1251) ~[hornetq-core-client-2.4.7.Final.jar:na]
	at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:104) [hornetq-core-client-2.4.7.Final.jar:na]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152) [na:1.7.0_171]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622) [na:1.7.0_171]
	at java.lang.Thread.run(Thread.java:748) [na:1.7.0_171]
Caused by: java.lang.RuntimeException: Error sending event to message bus
	at org.candlepin.audit.QpidConnection.sendTextMessage(QpidConnection.java:168) ~[QpidConnection.class:na]
	at org.candlepin.audit.AMQPBusPublisher.onEvent(AMQPBusPublisher.java:43) ~[AMQPBusPublisher.class:na]
	... 8 common frames omitted
Caused by: javax.jms.JMSException: Exception when sending message:ch=0 id=0 ExecutionException(errorCode=NOT_FOUND, commandId=0, classCode=0, commandCode=0, fieldIndex=0, description=not-found: Exchange not found: event (/builddir/build/BUILD/qpid-cpp-1.36.0/src/qpid/broker/ExchangeRegistry.cpp:144), errorInfo={})
	at org.apache.qpid.client.BasicMessageProducer_0_10.sendMessage(BasicMessageProducer_0_10.java:260) ~[qpid-client-0.32.jar:0.32]
	at org.apache.qpid.client.BasicMessageProducer.sendImpl(BasicMessageProducer.java:528) ~[qpid-client-0.32.jar:0.32]
	at org.apache.qpid.client.BasicMessageProducer.send(BasicMessageProducer.java:326) ~[qpid-client-0.32.jar:0.32]
	at org.apache.qpid.client.TopicPublisherAdapter.send(TopicPublisherAdapter.java:120) ~[qpid-client-0.32.jar:0.32]
	at org.candlepin.audit.QpidConnection.sendTextMessage(QpidConnection.java:157) ~[QpidConnection.class:na]
	... 9 common frames omitted
Caused by: org.apache.qpid.transport.SessionException: ch=0 id=0 ExecutionException(errorCode=NOT_FOUND, commandId=0, classCode=0, commandCode=0, fieldIndex=0, description=not-found: Exchange not found: event (/builddir/build/BUILD/qpid-cpp-1.36.0/src/qpid/broker/ExchangeRegistry.cpp:144), errorInfo={})
	at org.apache.qpid.transport.Session.sync(Session.java:876) ~[qpid-common-0.32.jar:0.32]
	at org.apache.qpid.transport.Session.sync(Session.java:837) ~[qpid-common-0.32.jar:0.32]
	at org.apache.qpid.client.BasicMessageProducer_0_10.sendMessage(BasicMessageProducer_0_10.java:253) ~[qpid-client-0.32.jar:0.32]
	... 13 common frames omitted

Comment 1 Pavel Moravec 2018-04-23 08:00:13 UTC
Btw the connection is is present also in situation, when:

- candlepin went to suspended state (due to katello_event_queue almost full / messages sent to 'event' exchange are timeouting)
- recovering to resume normal operation
- sending new event - a new TCP connection is created

Again, I suspect the connection where AMQP exception/error happened (timeout here) was not closed properly.


Reproducer:
- to speedup reproducer, add

default-queue-limit=100000

to /etc/qpid/qpidd.conf

(such that katello_event_queue gets full much faster)

- netstat -anp | grep 5671 | grep java
- service foreman-tasks stop
- re-register a system 10 times (that generates 10 events where the queue will enqueue just 6 before candlepin goes to suspended state
- service foreman-tasks start
- wait till katello_event_queue is empty ( qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt  -b amqps://localhost:5671 -q katello_event_queue )
- wait till candlepin resumes operations / not in suspended state
- re-register a system (will work now)
- netstat shows extra TCP connection (candlepin forgot to drop the old one)

Comment 4 Pavel Moravec 2018-06-23 07:41:23 UTC
I was asked to clarify what is requested to change about logging.

This bug is not just about logging error, but:

- primarily about TCP/IP connection leak in 2 similar situations (when 'event' exchange is gone, and when qpidd doesnt accept messages due to queue full)

- minor issue is the error "HQ214000: Failed to call onMessage
java.lang.RuntimeException: Error sending event to message bus" should be printed not only to candlepin.log, but I expect to error.log as well (at least it is my understanding the error.log should contain *all* error logs, but I might be wrong with this expected specification)

Comment 6 Kevin Howell 2018-09-17 19:30:32 UTC
Some context for this, there are a number of changes to the HornetQ/Artemis setup for Candlepin:

 * Candlepin now better tracks Qpid connection state and clean up stale connections (via https://github.com/candlepin/candlepin/pull/1927 )
 * Candlepin now handles Qpid connection failure without message loss (via https://github.com/candlepin/candlepin/pull/1951 )
 * Candlepin now ensures order of messages when failure happens (via https://github.com/candlepin/candlepin/pull/1982 )

I believe that comment 0 has good reproducer steps to show changes in behavior for the first two changes (and the third is unrelated).

Comment 15 Pavel Moravec 2018-11-02 12:06:15 UTC
Another scenario: LOCE task down, candlepin sending event till katello_event_queue completely full:

2018-11-02 04:29:42 [Broker] error Execution exception: resource-limit-exceeded: Maximum depth exceeded on katello_event_queue: current=[count: 1041, size: 104735244], max=[size: 104857600] (/builddir/build/BUILD/qpid-cpp-1.36.0/src/qpid/broker/Queue.cpp:1662)

and those errors leave same connection leaks behind..

Comment 18 Pavel Moravec 2019-01-22 09:48:05 UTC
Expected but not confirmed reproducer:

- have many durable queues
- restart qpidd
  - during this restart, recovering the journal files for the durable queues would take more minutes
- meantime, candlepin will try to (re)connect to qpidd - such connection will succeed on TCP layer but not on AMQP level, with *expected* outcome in same connection leak like behind this BZ

Comment 25 Mike McCune 2019-06-12 14:04:34 UTC
looks like this is closed in 6.5 and going to CLOSED:CURRENTRELEASE with the inclusion of candlepin-2.5.15-1.el7sat.noarch

If this re-occurs in 6.5 or later, feel free to re-open.

Comment 26 Taft Sanders 2019-07-11 22:58:03 UTC
Reopening for backport request to Satellite 6.3.5


Note You need to log in before you can comment on or make changes to this bug.