Description of problem:
After a temporary network outage causes goferd to lose its TCP connection to the AMQP broker, it does not try to reconnect.

Version-Release number of selected component (if applicable):
gofer-1.3.0-1.el6sat.noarch

How reproducible:
100%

Steps to Reproduce:
1. To speed up the reproducer, lower the kernel tunable net.ipv4.tcp_retries2, e.g. to 4:
   echo 4 > /proc/sys/net/ipv4/tcp_retries2
2. With goferd running, check its TCP connections to the AMQP broker:
   netstat -anp | grep 5671
   (there should be 2 TCP connections)
3. Emulate a network outage via iptables:
   iptables -A OUTPUT -p tcp --dport 5671 -j REJECT
4. Monitor /var/log/messages; once it logs a "recoverable error" WARNING, flush iptables (iptables -F).
5. Wait a few seconds, during which gofer should reconnect.
6. Check gofer's TCP connections:
   netstat -anp | grep 5671

Actual results:
Step 6 shows just 1 TCP connection.

/var/log/messages repeatedly logs:

Dec 1 16:39:02 pmoravec-rhel6-3 goferd: [ERROR][pulp.agent.a726580c-5f1e-4a79-9f11-de0adc52c1e9] gofer.transport.qpid.consumer:117 - 046d2084-b0f1-4de4-a039-89499d9e680d
Dec 1 16:39:02 pmoravec-rhel6-3 goferd: [ERROR][pulp.agent.a726580c-5f1e-4a79-9f11-de0adc52c1e9] gofer.transport.qpid.consumer:117 - Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/gofer/transport/qpid/consumer.py", line 113, in get
    return self.__receiver.fetch(timeout=timeout)
  File "<string>", line 6, in fetch
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 1030, in fetch
    self._ecwait(lambda: self.linked)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
    result = self._ewait(lambda: self.closed or predicate(), timeout)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 993, in _ewait
    result = self.session._ewait(lambda: self.error or predicate(), timeout)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 580, in _ewait
    result = self.connection._ewait(lambda: self.error or predicate(), timeout)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 219, in _ewait
    self.check_error()
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 212, in check_error
    raise e
InternalError: Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/qpid/messaging/driver.py", line 660, in write
    op.dispatch(self)
  File "/usr/lib/python2.6/site-packages/qpid/ops.py", line 84, in dispatch
    getattr(target, handler)(self, *args)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/driver.py", line 877, in do_session_detached
    sst = self._sessions.pop(dtc.channel)
KeyError: 'pop(): dictionary is empty'

Expected results:
The 2nd TCP connection is re-established, no errors in /var/log/messages.

Additional info:
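The reproduction steps above can be sketched as a script. This is a hedged sketch, not part of the original report: by default it only prints the commands (DRY_RUN=1), since sysctl/iptables require root; set DRY_RUN=0 to actually run them. Port 5671 and the log path are taken from the report.

```shell
#!/bin/sh
# Sketch of the reproducer. With DRY_RUN=1 (default) commands are only
# printed; set DRY_RUN=0 and run as root to actually perform them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# 1. Shorten the TCP retransmission scheme to speed up the reproducer.
run sysctl -w net.ipv4.tcp_retries2=4

# 2. Baseline: goferd should hold 2 TCP connections to the broker.
run sh -c 'netstat -anp | grep 5671'

# 3. Emulate a network outage toward the AMQP broker.
run iptables -A OUTPUT -p tcp --dport 5671 -j REJECT

# 4. Watch /var/log/messages; after the "recoverable error" WARNING
#    appears, restore connectivity.
run iptables -F

# 5./6. Give gofer a few seconds to reconnect, then recheck: the bug
#    manifests as only 1 remaining connection plus the KeyError traceback.
run sleep 5
run sh -c 'netstat -anp | grep 5671'
```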
Reported to QPID jira: https://issues.apache.org/jira/browse/QPID-6297
Updated severity/priority to high since it reoccurred at customer site again. I believe this may affect all Sat 6 customers with capsules.
Workaround (if this occurs during a Sat 6 -> Capsule sync):
1) On Sat 6 server: cancel the sync
2) On capsule: service goferd restart
3) On Sat 6 server: restart the sync
Per my analysis of the tcpdump at [1], I think some code change is necessary in goferd as well. In particular, it seems to me that when it detects a disconnection, it tries to reconnect and re-establish everything again - even on the TCP connection not affected by the error. Shouldn't it call connection.close() prior to reconnecting, or similar? Or is this also a python-qpid problem?

[1] https://issues.apache.org/jira/browse/QPID-6297?focusedCommentId=14272920&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14272920
(In reply to Pavel Moravec from comment #7)
> Per my analysis of tcpdump at [1], I think some code change is necessary
> also in goferd. Esp. it seems to me like when it detects disconnection, it
> tries to re-connect and re-establish everything again - even on the TCP
> connection not affected by the error. Should not it call connection.close()
> prior reconnect, or so? Or is this issue also python-qpid problem?
>
> [1]
> https://issues.apache.org/jira/browse/QPID-
> 6297?focusedCommentId=14272920&page=com.atlassian.jira.plugin.system.
> issuetabpanels:comment-tabpanel#comment-14272920

Disregard - I am afraid this is done by the client itself.

Workaround: use heartbeats (ideally of a length similar to 1/2 of the TCP retries interval).
> Workaround: use heartbeats (ideally of length similar to 1/2 of TCP retries
> interval)

Workaround in python-gofer-qpid:

--- /usr/lib/python2.7/site-packages/gofer/transport/qpid/broker.py	2014-06-16 23:19:24.000000000 +0200
+++ /usr/lib/python2.7/site-packages/gofer/transport/qpid/broker.py.new	2015-01-12 15:11:38.031826407 +0100
@@ -77,7 +77,8 @@ class Qpid(Broker):
                 password=self.password,
                 ssl_trustfile=self.cacert,
                 ssl_certfile=self.clientcert,
-                ssl_skip_hostname_check=(not self.host_validation))
+                ssl_skip_hostname_check=(not self.host_validation),
+                heartbeat=5)
             con.attach()
             self.connection.cached = con
             log.info('{%s} connected to AMQP', self.id())

This is imho the recommended solution - see https://access.redhat.com/solutions/56487 for the reasoning. All the problems in python-qpid come from the fact that, after the network glitch, the broker rejects the session attach because it assumes the session is still attached. This rejection "confuses" the client. Surely the client should handle that in a better way, but in the end it would still need to inform the program (gofer) that it failed to reattach the dropped session.

Jeff, following the KCS, do you see some other solution, e.g. inside python-qpid? Note that the broker thinks the AMQP session is established while the client does not - that is the key problem here.
I went ahead and set heartbeat=10 for good measure in gofer 2.4, which is included in pulp 2.6 / Sat 6.1.

Any chance you can monkey patch on the customer site just to see if this helps? Do we need this backported to gofer 1.4 (included in Sat 6.0)?
(In reply to Jeff Ortel from comment #10)
> I went ahead and set heartbeat=10 for good measure in gofer 2.4 which is
> included in pulp 2.6 / Sat 6.1.

heartbeat=10 seems fine. It could be much higher while still satisfying the request "less than half of the TCP retry scheme", and would then cause less auxiliary activity on qpidd + network + goferd, but this particular value is still fine wrt. additional load or scaling (compared to the fact that the client invokes fetch every 10 seconds anyway, heartbeats should matter mainly during network issues).

> Any chance you can monkey patch on the customer site just to see if this
> helps? Do we need this back ported to gofer 1.4 (included in sat 6.0)

Monkey-patching at the customer is straightforward (apply the change in c#9, restart goferd), but it ends up in a changed-thus-unsupported goferd. If there are other changes planned for gofer 1.4 as well, I am in favour of backporting this - low risk, potentially bigger gain.
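To put "less than half of the TCP retry scheme" in numbers, here is a rough model (not from the original thread) of how long Linux keeps retransmitting on an established connection for a given net.ipv4.tcp_retries2: the RTO starts near the kernel's TCP_RTO_MIN (200 ms), doubles per retry, and is capped at TCP_RTO_MAX (120 s). The real timeout depends on the measured RTT, so this is only an estimate.

```python
def tcp_retrans_timeout(retries2, rto_min=0.2, rto_max=120.0):
    """Rough total time (seconds) before Linux gives up retransmitting
    on an established connection: RTO starts near rto_min, doubles per
    retry, capped at rto_max (kernel defaults assumed)."""
    total, rto = 0.0, rto_min
    for _ in range(retries2 + 1):
        total += rto
        rto = min(rto * 2.0, rto_max)
    return total

# Default tcp_retries2=15 gives the well-known ~924.6 s (~15 min):
print(round(tcp_retrans_timeout(15), 1))
# With the reproducer setting of 4, the connection aborts after ~6.2 s:
print(round(tcp_retrans_timeout(4), 1))
```

Against the default ~15-minute window, both heartbeat=5 and heartbeat=10 are comfortably below half the retry interval.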
(In reply to Pavel Moravec from comment #11)
> (In reply to Jeff Ortel from comment #10)
> > I went ahead and set heartbeat=10 for good measure in gofer 2.4 which is
> > included in pulp 2.6 / Sat 6.1.
> heartbeat=10 seems fine. It could be much higher to still satisfy the
> request "less than half of TCP retry scheme" and to have less auxiliary
> activity on qpidd+network+goferd, but this particular value is still fine
> wrt. additional load or scaling (compared to the fact the client invokes
> fetch every 10 seconds any time, so heartbeats should occur mainly during
> network issues).
>
> > Any chance you can monkey patch on the customer site just to see if this
> > helps? Do we need this back ported to gofer 1.4 (included in sat 6.0)
>
> Monkey-patching at customer is straightforward (apply the change in c#9,
> restart goferd), but ends up in changed-thus-unsupported goferd. If there
> are planned also other changes to gofer 1.4, I am in favour in backporting
> this - low risk, potential bigger gain.

I just meant monkey-patching long enough to see if the heartbeat helps. Nothing else is planned for gofer 1.4 ATM.
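For illustration, the monkey-patch idea discussed above could look roughly like the sketch below. This is purely hypothetical and not from the thread: it uses a stand-in Connection class instead of the real qpid.messaging.Connection (whose constructor does accept a heartbeat option), and wraps the constructor so a default heartbeat is injected without editing installed files.

```python
# Hypothetical sketch of forcing a default heartbeat by wrapping a
# connection constructor at runtime. "Connection" here is a stand-in
# for qpid.messaging.Connection; names are illustrative only.

class Connection:
    def __init__(self, url, **options):
        self.url = url
        self.options = options

_original_init = Connection.__init__

def _init_with_heartbeat(self, url, **options):
    # Inject heartbeat=10 (the gofer 2.4 default mentioned above)
    # unless the caller already passed one explicitly.
    options.setdefault('heartbeat', 10)
    _original_init(self, url, **options)

Connection.__init__ = _init_with_heartbeat

con = Connection('amqps://broker.example.com:5671')
print(con.options['heartbeat'])
```

An explicitly passed heartbeat still wins, since setdefault only fills in a missing value.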
Tested using the steps provided in the description, with gofer 2.5 (qpid.messaging heartbeat enabled since 2.4) on F20. Without the heartbeat, the KeyError was raised on reconnect. With the heartbeat enabled, the reconnect succeeded and did NOT raise the KeyError. I am not able to reproduce with the heartbeat enabled.
This should be resolved when we upgrade to pulp 2.6 which involves a gofer update for katello-agent.
*** Bug 1188253 has been marked as a duplicate of this bug. ***
*** Bug 1199967 has been marked as a duplicate of this bug. ***
*** Bug 1169397 has been marked as a duplicate of this bug. ***
Connection successfully re-established. Verified in Satellite 6.1 GA3.
This bug is slated to be released with Satellite 6.1.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2015:1592