Description of problem:
Assume the following scenario:
- a goferd client connects to qdrouterd
- qdrouterd does a reverse DNS lookup (PTR query) for the client's IP address against a DNS server
- the DNS server is broken, such that it does not reply to the query, or replies only after, say, one minute
- while waiting for the response, qdrouterd stops sending any AMQP data to any other connection

Consequences:
- this causes inter-qdrouterd connection timeouts due to unanswered heartbeats
- any communication between Satellite and katello-agent is delayed, possibly causing task timeouts

Please backport https://issues.apache.org/jira/browse/DISPATCH-443 once the fix is available.

Version-Release number of selected component (if applicable):
qpid-dispatch-router-0.4-11.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
0. Set up Satellite with Capsule, either with external DNS server
1. Break your DNS server such that it does not respond (in time) to some DNS PTR queries (e.g. remove some IP range from its managed ranges)
2. Kick off goferd on a client such that the DNS PTR query for its IP address is answered after a long time or never.
3. Observe that no communication can flow through the qdrouterd that the goferd is connecting to, including inter-qdrouterd communication and new task (package install) requests.

Actual results:
- Package installs to other clients will time out (assuming the DNS query is still being "processed").
- inter-qdrouterd connection flapping (see https://access.redhat.com/solutions/2429011 for particular logs)

Expected results:
- other clients can communicate with the qdrouterd; they can accept and acknowledge tasks (to install a package) etc.
- inter-qdrouterd connection is stable

Additional info:
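To illustrate the scenario above, here is a minimal Python sketch of the general pattern: a reverse DNS lookup that is bounded by a timeout, so a slow or silent DNS server cannot stall the caller. This is not qdrouterd's code and not necessarily how DISPATCH-443 was fixed; `reverse_lookup`, `slow_resolver`, the host name and the IP address are hypothetical names chosen for illustration only.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Hypothetical helper, not part of qdrouterd: run a possibly slow reverse
# lookup in a worker thread so the caller can give up after `timeout` seconds.
_pool = ThreadPoolExecutor(max_workers=4)


def reverse_lookup(ip, timeout=1.0, resolver=socket.gethostbyaddr):
    """Return the PTR host name for `ip`, or `ip` itself if the lookup
    fails or does not finish within `timeout` seconds."""
    future = _pool.submit(resolver, ip)
    try:
        # gethostbyaddr-style resolvers return (hostname, aliases, addresses)
        return future.result(timeout=timeout)[0]
    except (FutureTimeout, OSError):
        return ip  # fall back to the numeric address


def slow_resolver(ip):
    """Stand-in for a broken DNS server: answers only after a delay."""
    time.sleep(0.5)
    return ("late.example.com", [], [ip])


if __name__ == "__main__":
    start = time.monotonic()
    name = reverse_lookup("192.0.2.10", timeout=0.05, resolver=slow_resolver)
    print(name, "after", round(time.monotonic() - start, 2), "s")
```

With the simulated broken server, the caller gets the numeric address back after roughly the timeout instead of waiting for the late answer. The worker thread still waits for the resolver in the background, so this only bounds the caller's wait; it does not cancel the query itself.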
A fix for this issue has been committed to the master branch upstream. https://git-wip-us.apache.org/repos/asf?p=qpid-dispatch.git;a=patch;h=cf3c874 This is a low-risk update and is ready for back-port to the product builds if approved.
Hi Matej, having Interconnect / qdrouterd knowledge, would you be able to reproduce (or even verify) this?
I am moving this to VERIFIED. We have not been able to reproduce the issue, and we have already deployed this code at certain customers with no negative impact. Therefore, we are marking this as verified to deliver with 6.2.4. If you are still seeing this issue after 6.2.4, please feel free to re-open and provide additional information on how to reproduce.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:2699
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days