Bug 1169416
| Summary: | gofer does not try to reconnect after network issue | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Pavel Moravec <pmoravec> |
| Component: | katello-agent | Assignee: | Justin Sherrill <jsherril> |
| Status: | CLOSED ERRATA | QA Contact: | jcallaha |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.0.6 | CC: | ahuchcha, bbuckingham, bchardim, chrobert, cwelton, egolov, fdacunha, jcallaha, jmontleo, jortel, jsherril, ktordeur, tbily, tcarlin, tkubota, tony |
| Target Milestone: | Unspecified | Keywords: | Triaged |
| Target Release: | Unused | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Last Closed: | 2015-08-12 05:20:07 UTC | Type: | Bug |
| Bug Depends On: | 1181005 | | |
| Bug Blocks: | 1171330 | | |

Description
Pavel Moravec
2014-12-01 15:40:40 UTC
Reported to QPID jira: https://issues.apache.org/jira/browse/QPID-6297

Updated severity/priority to high since it reoccurred at the customer site again. I believe this may affect all Sat 6 customers with capsules.

Workaround (if this occurs during a Sat 6 -> Capsule sync):
1) On Sat 6 server: Cancel sync
2) On capsule: service goferd restart
3) On Sat 6 server: Restart sync

Per my analysis of tcpdump at [1], I think some code change is also necessary in goferd. Esp. it seems to me that when it detects disconnection, it tries to re-connect and re-establish everything again - even on the TCP connection not affected by the error. Shouldn't it call connection.close() prior to reconnect, or so? Or is this also a python-qpid problem?

[1] https://issues.apache.org/jira/browse/QPID-6297?focusedCommentId=14272920&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14272920

(In reply to Pavel Moravec from comment #7)
> Per my analysis of tcpdump at [1], I think some code change is also
> necessary in goferd. Esp. it seems to me that when it detects disconnection,
> it tries to re-connect and re-establish everything again - even on the TCP
> connection not affected by the error. Shouldn't it call connection.close()
> prior to reconnect, or so? Or is this also a python-qpid problem?
>
> [1] https://issues.apache.org/jira/browse/QPID-6297?focusedCommentId=14272920&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14272920

Disregard - I am afraid this is done by the client itself.

Workaround: use heartbeats (ideally of length similar to 1/2 of TCP retries interval)

> Workaround: use heartbeats (ideally of length similar to 1/2 of TCP retries
> interval)

Workaround in python-gofer-qpid:

--- /usr/lib/python2.7/site-packages/gofer/transport/qpid/broker.py 2014-06-16 23:19:24.000000000 +0200
+++ /usr/lib/python2.7/site-packages/gofer/transport/qpid/broker.py.new 2015-01-12 15:11:38.031826407 +0100
@@ -77,7 +77,8 @@ class Qpid(Broker):
                 password=self.password,
                 ssl_trustfile=self.cacert,
                 ssl_certfile=self.clientcert,
-                ssl_skip_hostname_check=(not self.host_validation))
+                ssl_skip_hostname_check=(not self.host_validation),
+                heartbeat=5)
             con.attach()
             self.connection.cached = con
             log.info('{%s} connected to AMQP', self.id())

This is IMHO the recommended solution - see https://access.redhat.com/solutions/56487 for reasoning.

All the problems in python-qpid come from the fact that the broker rejects session attach (after the network glitch), assuming the session is attached. This rejection "confuses" the client. Surely it should handle that in a better way, but in the end it would need to inform the program (gofer) that it failed to reattach the dropped session.

Jeff, following the KCS, do you see some other solution, e.g. inside python-qpid?

Note the broker thinks the AMQP session is established while the client does not think so - the key problem here.

I went ahead and set heartbeat=10 for good measure in gofer 2.4 which is included in pulp 2.6 / Sat 6.1.

Any chance you can monkey patch on the customer site just to see if this helps? Do we need this back ported to gofer 1.4 (included in sat 6.0)?

(In reply to Jeff Ortel from comment #10)
> I went ahead and set heartbeat=10 for good measure in gofer 2.4 which is
> included in pulp 2.6 / Sat 6.1.

heartbeat=10 seems fine. It could be much higher and still satisfy the requirement "less than half of TCP retry scheme", with less auxiliary activity on qpidd+network+goferd, but this particular value is still fine wrt. additional load or scaling (the client invokes fetch every 10 seconds anyway, so heartbeats should occur mainly during network issues).

> Any chance you can monkey patch on the customer site just to see if this
> helps? Do we need this back ported to gofer 1.4 (included in sat 6.0)?

Monkey-patching at the customer is straightforward (apply the change in c#9, restart goferd), but ends up in a changed-thus-unsupported goferd. If other changes to gofer 1.4 are also planned, I am in favour of backporting this - low risk, potentially bigger gain.

(In reply to Pavel Moravec from comment #11)
> (In reply to Jeff Ortel from comment #10)
> > I went ahead and set heartbeat=10 for good measure in gofer 2.4 which is
> > included in pulp 2.6 / Sat 6.1.
>
> heartbeat=10 seems fine. It could be much higher and still satisfy the
> requirement "less than half of TCP retry scheme", with less auxiliary
> activity on qpidd+network+goferd, but this particular value is still fine
> wrt. additional load or scaling (the client invokes fetch every 10 seconds
> anyway, so heartbeats should occur mainly during network issues).
>
> > Any chance you can monkey patch on the customer site just to see if this
> > helps? Do we need this back ported to gofer 1.4 (included in sat 6.0)?
>
> Monkey-patching at the customer is straightforward (apply the change in c#9,
> restart goferd), but ends up in a changed-thus-unsupported goferd. If other
> changes to gofer 1.4 are also planned, I am in favour of backporting
> this - low risk, potentially bigger gain.

I just meant monkey-patch just long enough to see if the heartbeat helps. Nothing else planned for gofer 1.4 ATM.

Tested using the step provided in the description using gofer 2.5 (qpid.messaging heartbeat enabled in 2.4) on F20. Without the heartbeat, the KeyError was raised on reconnect. With the heartbeat enabled, the reconnect succeeded and did NOT raise the KeyError.

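For readers who want to experiment with the heartbeat sizing discussed in the comments above, here is a minimal sketch. It is not gofer's code: the helper names `pick_heartbeat`, `qpid_options` and `connect` are hypothetical, the option names are taken from the patch in this thread, and the 20-second retry window is an illustrative number rather than a measured TCP setting. The `connect` helper assumes python-qpid's `qpid.messaging.Connection` accepts these keyword options, as the patched `broker.py` suggests.

```python
def pick_heartbeat(tcp_retry_window, floor=1):
    """Return a heartbeat interval in whole seconds, roughly half of the
    given TCP retry window and never below `floor`."""
    if tcp_retry_window <= 0:
        raise ValueError("tcp_retry_window must be positive")
    return max(floor, int(tcp_retry_window // 2))


def qpid_options(password, cacert, clientcert, host_validation, heartbeat):
    """Build keyword options mirroring the ones visible in the patch above."""
    return {
        "password": password,
        "ssl_trustfile": cacert,
        "ssl_certfile": clientcert,
        "ssl_skip_hostname_check": not host_validation,
        "heartbeat": heartbeat,
    }


def connect(url, options):
    """Open a connection with the given options (assumes python-qpid)."""
    # Imported lazily so the helpers above work without python-qpid installed.
    from qpid.messaging import Connection
    con = Connection(url, **options)
    con.attach()
    return con


# Illustrative values only: a hypothetical 20 s retry window gives heartbeat=10.
options = qpid_options(
    password="secret",
    cacert="/path/to/ca.pem",
    clientcert="/path/to/client.pem",
    host_validation=True,
    heartbeat=pick_heartbeat(20),
)
```

Keeping the interval in one helper makes it easy to try both values mentioned in the thread (5 and 10) without touching the rest of the connection setup.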
I am not able to reproduce with the heartbeat enabled.

This should be resolved when we upgrade to pulp 2.6 which involves a gofer update for katello-agent.

*** Bug 1188253 has been marked as a duplicate of this bug. ***

*** Bug 1199967 has been marked as a duplicate of this bug. ***

*** Bug 1169397 has been marked as a duplicate of this bug. ***

Connection successfully re-established. Verified in Satellite 6.1 GA3.

This bug is slated to be released with Satellite 6.1.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:1592