Bug 1564087

Summary: no way to recover from dead connection to hornetQ
Product: Red Hat Satellite
Reporter: Pavel Moravec <pmoravec>
Component: Candlepin
Assignee: satellite6-bugs <satellite6-bugs>
Status: CLOSED ERRATA
QA Contact: Perry Gagne <pgagne>
Severity: high
Docs Contact:
Priority: high
Version: 6.2.14
CC: bcourt, cdonnell, khowell, ktordeur, mmccune, pcreech, pmoravec
Target Milestone: 6.4.0
Keywords: PrioBumpGSS, Triaged
Target Release: Unused
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: candlepin-2.4.7-1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1570064 1571423 1571425 1571426
Environment:
Last Closed: 2018-10-16 19:26:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1570064, 1571423, 1571425, 1571426
Bug Blocks:    

Description Pavel Moravec 2018-04-05 10:38:00 UTC
Description of problem:
When candlepin thinks it has lost its connection to hornetQ (for whatever reason), it makes no attempt to recover. As a result, candlepin cannot send events to qpidd, so katello/foreman does not receive updates about subscription status.


Version-Release number of selected component (if applicable):
candlepin-0.9.54.26-1.el7.noarch


How reproducible:
???


Steps to Reproduce:
1. ??? (no idea how to trigger this) Get into the state where candlepin thinks it has lost its connection to hornetQ, logging:

2018-04-04 13:33:38,528 [thread=Thread-5137 (HornetQ-client-global-threads-246930626)] [=, org=] WARN  org.hornetq.core.client - HQ212037: Connection failure has been detected: HQ119015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
2018-04-04 13:35:06,837 [thread=IoReceiver - bsul0081.fs01.vwf.vwfs-ad/10.43.225.233:5671] [=, org=] WARN  org.apache.qpid.transport.network.security.ssl.SSLUtil - Exception received while trying to verify hostname
2018-04-04 13:35:08,999 [thread=localhost-startStop-1] [=, org=] WARN  org.hibernate.id.UUIDHexGenerator - HHH000409: Using org.hibernate.id.UUIDHexGenerator which does not generate IETF RFC 4122 compliant UUID values; consider using org.hibernate.id.UUIDGenerator instead
2018-04-04 13:36:14,106 [thread=hornetq-failure-check-thread] [=, org=] WARN  org.hornetq.core.client - HQ212037: Connection failure has been detected: HQ119014: Did not receive data from invm:0. It is likely the client has exited or crashed without closing its connection, or the network between the server and client has failed. You also might have configured connection-ttl and client-failure-check-period incorrectly. Please check user manual for more information. The connection will now be closed. [code=CONNECTION_TIMEDOUT]

2. Have an unentitled system and attach subscriptions to it
3. Check the WebUI / hammer for the status of the system
4. Check /var/log/candlepin/error.log

Actual results:
3. shows unentitled
4. error.log contains entries like:
2018-04-01 03:06:03,681 [thread=http-bio-8443-exec-10] [req=14931ec4-de5e-438f-abb3-6b39193a87b5, org=Default_Organization] ERROR org.candlepin.audit.HornetqEventDispatcher - Error while trying to send event: Event [id=null, target=COMPLIANCE, type=CREATED, time=Sun Apr 01 03:06:03 CEST 2018, entity=8aa538916211fe4001621ab99e1201d9]
java.lang.NullPointerException: null
        at org.hornetq.core.client.impl.ClientSessionFactoryImpl.createSessionInternal(ClientSessionFactoryImpl.java:940) ~[hornetq-core-client-2.3.5.Final.jar:na]
        at org.hornetq.core.client.impl.ClientSessionFactoryImpl.createSession(ClientSessionFactoryImpl.java:363) ~[hornetq-core-client-2.3.5.Final.jar:na]
        at org.candlepin.audit.HornetqEventDispatcher.getClientSession(HornetqEventDispatcher.java:82) ~[HornetqEventDispatcher.class:na]
        at org.candlepin.audit.HornetqEventDispatcher.sendEvent(HornetqEventDispatcher.java:111) ~[HornetqEventDispatcher.class:na]
        at org.candlepin.audit.EventSinkImpl.sendEvents(EventSinkImpl.java:79) [EventSinkImpl.class:na]
..


Expected results:
3. should show entitled (assuming proper/sufficient subscriptions were provided)
4. no such error logs


Additional info:
(It would be worth understanding why the connection fails in the first place.)

An attempt to disable TTL and connection checks:

diff -rup candlepin-0.9.54.26/src/main/java/org/candlepin/audit/HornetqEventDispatcher.java candlepin-0.9.54.26.2/src/main/java/org/candlepin/audit/HornetqEventDispatcher.java
--- candlepin-0.9.54.26/src/main/java/org/candlepin/audit/HornetqEventDispatcher.java	2017-12-07 17:36:23.000000000 +0100
+++ candlepin-0.9.54.26.2/src/main/java/org/candlepin/audit/HornetqEventDispatcher.java	2018-04-03 21:03:54.000000000 +0200
@@ -72,6 +72,8 @@ public class HornetqEventDispatcher  {
         ServerLocator locator = HornetQClient.createServerLocatorWithoutHA(
             new TransportConfiguration(InVMConnectorFactory.class.getName()));
         locator.setMinLargeMessageSize(largeMsgSize);
+        locator.setConnectionTTL(-1);
+        locator.setClientFailureCheckPeriod(-1);
         return locator.createSessionFactory();
     }

does not help. What I think could help instead (as a total noob on hornetQ, but knowing JMS and having read the hornetQ docs/API) is to configure reconnection, by calling in the same place:

locator.setReconnectAttempts(-1);  // default is 0, i.e. no reconnect
locator.setInitialConnectAttempts(-1);  // default number of connection attempts is 1, so no retry

(at least that is my deduction from:

https://activemq.apache.org/artemis/docs/javadocs/javadoc-1.4.0/org/apache/activemq/artemis/api/core/client/ServerLocator.html#setReconnectAttempts-int-
https://activemq.apache.org/artemis/docs/javadocs/javadoc-1.4.0/constant-values.html#org.apache.activemq.artemis.api.core.client.ActiveMQClient.DEFAULT_RECONNECT_ATTEMPTS

and around)
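
To make the intended semantics of those two settings concrete, here is a minimal, self-contained Java model of an attempt-limited connect loop. It is not HornetQ code: the class ReconnectModel and its methods connect and flaky are hypothetical names for illustration only. It assumes, as the javadoc linked above describes, that -1 means "retry forever" and that any other value caps the number of tries. A real client would also wait for a retry interval between tries, which is left out here to keep the sketch deterministic.

```java
import java.util.function.BooleanSupplier;

/** Toy model of an attempt-limited connect loop; NOT HornetQ code. */
final class ReconnectModel {

    /**
     * Calls tryConnect until it succeeds or the allowed tries are used up.
     * maxAttempts == -1 means retry forever (assumed semantics);
     * any other value is the maximum number of tries.
     * Returns the number of tries used on success, or -1 if every allowed try failed.
     */
    static int connect(BooleanSupplier tryConnect, int maxAttempts) {
        int tries = 0;
        while (maxAttempts == -1 || tries < maxAttempts) {
            tries++;
            if (tryConnect.getAsBoolean()) {
                return tries;
            }
        }
        return -1;
    }

    /** Hypothetical flaky endpoint: fails the first `failures` calls, then succeeds. */
    static BooleanSupplier flaky(int failures) {
        int[] calls = {0};
        return () -> ++calls[0] > failures;
    }

    public static void main(String[] args) {
        // An endpoint that is down for three tries before coming back.
        System.out.println(connect(flaky(3), 0));   // no tries allowed
        System.out.println(connect(flaky(3), 1));   // a single try, no retry
        System.out.println(connect(flaky(3), -1));  // keeps trying until it succeeds
    }
}
```

With a limit of 0 or 1 the model gives up while the endpoint is still down, which mirrors the "no reconnect" defaults described above; only the unlimited setting outlasts a temporary outage.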

Comment 2 Pavel Moravec 2018-04-19 14:36:05 UTC
The improved connection parameter settings helped the customer behind this BZ.

Could you please add this setting to future candlepin versions?

Comment 4 Kevin Howell 2018-06-07 21:00:57 UTC
See linked BZs for fixed-in-versions for various hotfix branches.

Comment 12 Patrick Creech 2018-09-24 14:50:34 UTC
snap 23, not 63

Comment 14 Bryan Kearney 2018-10-16 19:26:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2927