Bug 1359894
| Summary: | openvswitch agents are being reported as down for 10 minutes after all reset controllers come back online | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Marian Krcmarik <mkrcmari> |
| Component: | openstack-neutron | Assignee: | John Schwarz <jschwarz> |
| Status: | CLOSED ERRATA | QA Contact: | Toni Freger <tfreger> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 9.0 (Mitaka) | CC: | amuller, chrisw, jjoyce, jschwarz, mburns, michele, mkrcmari, mlopes, nyechiel, oblaut, sclewis, srevivo, tvignaud |
| Target Milestone: | ga | Keywords: | ZStream |
| Target Release: | 9.0 (Mitaka) | | |
| Hardware: | Unspecified | OS: | Unspecified |
| Whiteboard: | | | |
| Fixed In Version: | openstack-neutron-8.1.2-2.el7ost | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1365378 (view as bug list) | Environment: | |
| Last Closed: | 2016-08-24 12:56:40 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1365378 | | |

Doc Text:

> Previously, restarting all controllers at the same time also restarted all of the AMQP (RabbitMQ) servers. This caused the connection between the neutron agents and the AMQP servers to appear to hang until it timed out, and the timeout grew with each retry (60 seconds, then 120, then 240, and so on, capped at 600). Consequently, an agent was considered `down` (it did not receive any events) until the timeout expired.
>
> With this update, the timeout for the specific call that connects the agents to the neutron-server is always 60 seconds. As a result, if the connection hangs because all of the controllers restarted, the agents recover more quickly (roughly 60 seconds after the controllers fully start again).
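The timeout progression in the Doc Text is easy to see with numbers. Below is a minimal Python sketch (illustrative only; the function names are ours, not neutron's or oslo.messaging's API) comparing the per-retry timeout that doubles up to the 600-second cap with the fixed 60-second timeout the fix applies:

```python
# Illustrative sketch only -- these helpers are not neutron's actual code.
# They model the per-retry RPC timeouts described in the Doc Text above.

BASE_TIMEOUT = 60    # seconds; initial RPC timeout
MAX_TIMEOUT = 600    # seconds; the cap mentioned in the Doc Text

def backoff_timeout(attempt):
    """Pre-fix behaviour: 60, 120, 240, 480, 600, 600, ... seconds."""
    return min(BASE_TIMEOUT * (2 ** attempt), MAX_TIMEOUT)

def fixed_timeout(attempt):
    """Post-fix behaviour for short calls such as the agent state report."""
    return BASE_TIMEOUT

for strategy in (backoff_timeout, fixed_timeout):
    waits = [strategy(n) for n in range(5)]
    print(strategy.__name__, waits, "worst case:", max(waits), "s")
```

The 600-second worst case is where the "10 minutes" in the summary comes from: an agent whose hung call had already reached the cap stayed `down` for the full 10 minutes before retrying.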
Description
Marian Krcmarik
2016-07-25 16:16:43 UTC
Please attach an SOS report from a controller and include the timestamps for the reboots, or give us access to the system. Assigned to John to chase the info down and to root cause the bug.

Logs from the initial report show that rabbitmq had restarted a number of times (connection refused followed by successful reconnections several times in a row), and then the connect finally succeeded and a report_state call was initiated. My guess is that rabbitmq then restarted again but oslo.messaging didn't hear about it. It would be interesting to see if this reproduces with the same Neutron versions but an earlier (rhos-8's) oslo.messaging version.

Either way, I talked to Marian and the setup was reprovisioned; they are trying to reproduce this on another setup. Marian - if this is reproduced again, I'd be happy if the oslo.messaging version could be posted in addition to the sosreport. Also, could you try downgrading the oslo.messaging package to rhos-8's on the problematic node and see if that fixes things?

Marian gave me another setup and we debugged it together. The fault is ultimately https://review.openstack.org/#/c/280595/, which introduces an exponential backoff (actually it's linear, but never mind). This patch was ported upstream to Mitaka, and thus we are encountering this in osp-9. In short, the patch was intended for potentially long-lasting RPC calls (like "get_routers" when there are a lot of routers), not for short RPC calls that don't carry a lot of calculation. The solution I have in mind is introducing a way for code to decide for itself whether it wants to use the exponential backoff. I'll begin work on this upstream in the coming days.

Added external trackers for the proposed upstream solution and bug report.

This only happens if you shut down all three controllers at the same time. I don't think that situation qualifies this bug as a blocker; my recommendation would be not to delay the release because of it.

The patch has been merged upstream and I've begun a backport for rhos-9. Once that is done, I'll package a new RPM.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1759.html
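The merged upstream solution lets each caller decide whether its RPC timeouts back off. As a rough sketch of that idea (the class and parameter names here are hypothetical, not neutron's real rpc module):

```python
# Hypothetical sketch of the opt-in backoff approach described above; the
# names are illustrative and do not match neutron's actual code.

class RpcClient:
    def __init__(self, base_timeout=60, max_timeout=600, backoff=True):
        self.base_timeout = base_timeout
        self.max_timeout = max_timeout
        self.backoff = backoff
        self._attempt = 0

    def next_timeout(self):
        """Return the timeout (in seconds) to use for the next call attempt."""
        if not self.backoff:
            return self.base_timeout        # short calls: always 60s
        timeout = min(self.base_timeout * (2 ** self._attempt),
                      self.max_timeout)
        self._attempt += 1
        return timeout

# Potentially long-lasting calls (e.g. "get_routers" on a large setup)
# keep the growing timeout:
sync_client = RpcClient(backoff=True)

# Short, frequent calls such as report_state opt out, so a simultaneous
# restart of all controllers costs at most ~60s of perceived agent downtime:
report_client = RpcClient(backoff=False)
```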