Bug 1340097

Summary:	Review fallback on RPC messaging
Product:	Red Hat OpenStack	Reporter:	Robin Cernin <rcernin>
Component:	openstack-neutron	Assignee:	Hynek Mlnarik <hmlnarik>
Status:	CLOSED ERRATA	QA Contact:	Toni Freger <tfreger>
Severity:	high	Docs Contact:
Priority:	high
Version:	8.0 (Liberty)	CC:	abehl, adahms, amuller, chrisw, felipe.alfaro, hrivero, jlibosva, ndahiya, nyechiel, ochalups, pgadiya, sguha, srevivo
Target Milestone:	async	Keywords:	ZStream
Target Release:	8.0 (Liberty)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	openstack-neutron-7.0.4-6.el7ost	Doc Type:	If docs needed, set a value
Doc Text:	Previously, when the metadata agent experienced an issue with RPC communication, it would start using the neutron client interface as a fallback. The only way to configure the metadata agent to use RPC again would then be to restart the agent. Due to this, metadata agent services would become unavailable after any disruption of RPC communication if neutron API credentials are not configured for the agent. With this update, the code that activates a fallback using neutron client has been removed from the metadata agent so that RPC is always used.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-06-29 13:59:39 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1339488

Description Robin Cernin 2016-05-26 12:44:19 UTC

Description of problem:

We created an upstream bug to review the fallback logic for RPC messaging in Liberty: https://bugs.launchpad.net/neutron/+bug/1586025

~~~
Would it be possible to review the fallback logic in Liberty? Do we really need to fall back on any RPC error?

We don't think it's optimal right now, as it falls back on all RPC errors. This makes the agent fragile in case the neutron client is misconfigured.
~~~

This bug is submitted to track the upstream bug in case we need to backport the fix downstream.

Thank you,
Kind Regards,
Robin Černín

Comment 2 Robin Cernin 2016-05-27 06:38:22 UTC

To reflect the description change from Ihar Hrachyshka:

Until Kilo, we used the neutronclient library to get data from neutron-server to the metadata agent. In Kilo we introduced RPC communication between the server and the metadata agent, with a fallback mechanism for those who upgrade agents first. Upgrading agents first could lead to a situation where the server is still on Juno, which doesn't have the RPC API needed by the Kilo agent. In such a situation, the agent goes back to using neutronclient.

https://github.com/openstack/neutron/blob/stable/liberty/neutron/agent/metadata/agent.py#L131

The fallback mechanism stayed in place and made it into Liberty, where it is no longer needed. There is also a problem with it: we fall back on any exception that comes from RPC communication. New Liberty deployments are not supposed to configure the metadata agent with credentials for the neutron API (it has not been used since Kilo). So on any error that happens on RPC, the agent switches to an unconfigured neutron client and stays there until the metadata agent is restarted.
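A minimal sketch of the failure mode described above may help. It is not the actual neutron code: the class and method names (FakeRpcPlugin, UnconfiguredApiClient, StickyFallbackAgent, get_ports) are hypothetical stand-ins. The only behavior modeled is that any exception on the RPC path permanently switches the agent to an API client that has no credentials.

```python
class FakeRpcPlugin:
    """Hypothetical RPC stub that fails once, then would work again."""

    def __init__(self):
        self.calls = 0

    def get_ports(self, filters):
        self.calls += 1
        if self.calls == 1:
            raise TimeoutError("transient messaging bus error")
        return [{"device_id": "instance-1"}]


class UnconfiguredApiClient:
    """Hypothetical API client with no credentials configured."""

    def get_ports(self, filters):
        raise RuntimeError("neutron API credentials are not configured")


class StickyFallbackAgent:
    """Models the flawed pattern: any RPC error disables RPC for good."""

    def __init__(self, rpc, api):
        self.rpc = rpc
        self.api = api
        self.use_rpc = True

    def get_ports(self, filters):
        if self.use_rpc:
            try:
                return self.rpc.get_ports(filters)
            except Exception:
                # Every exception, even a transient one, flips the switch.
                self.use_rpc = False
        return self.api.get_ports(filters)


def demo():
    """Issue three requests and record how each one ends."""
    agent = StickyFallbackAgent(FakeRpcPlugin(), UnconfiguredApiClient())
    results = []
    for _ in range(3):
        try:
            agent.get_ports({"ip_address": "10.0.0.5"})
            results.append("ok")
        except RuntimeError:
            results.append("failed")
    return results, agent.use_rpc
```

In this sketch the RPC stub would have answered the second and third requests, but the agent never asks it again, so all three requests fail and stay failing until the agent object is recreated (the equivalent of a restart).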

We should just remove the fallback mechanism as we did in Mitaka: https://review.openstack.org/#/c/231065/

Comment 7 Assaf Muller 2016-06-03 19:27:37 UTC

I think that backporting the patch to OSP 8 is the right thing to do as the rewards outweigh the risks.

Here's an explanation of the bug, the proposed solution, the concern with the solution, and why I think it makes sense to move forward.

The Neutron metadata agent is used in most deployments during the "spawn a VM" flow. No metadata means that OpenStack won't be able to inject SSH keys into VMs, rendering the VMs inaccessible in the common use case. The Neutron metadata agent needs to access information in the Neutron DB, which it used to obtain via the Neutron API. At some point, it was switched over to grab the same information via the messaging bus instead, keeping a fallback to the API in case the RPC implementation was found to be immature.

We very recently found out that the fallback logic was: if any error occurs (including intermittent or transient errors), fall back to using the API client. The issue is twofold. First, that doesn't make any sense. Second, TripleO doesn't configure the Neutron metadata agent to use the API client. That means that after any intermittent messaging bus error, the metadata agent would become as useful as a bag of bricks until the admin figured out that VMs cannot access metadata and restarted the metadata agent. In short, this bug is a ticking time bomb.

The solution that we are proposing is to backport a patch from Mitaka that removes the fallback logic entirely, leaving only the RPC client as a possibility, which will solve both problems outlined above.
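To illustrate the effect of the proposed change, here is a rough sketch of an agent with no fallback at all. It does not reproduce the Mitaka patch; FlakyRpcPlugin, RpcOnlyAgent, get_ports, and serve_requests are hypothetical names chosen for the example. The point is only that a transient messaging error costs one request and the agent recovers on its own.

```python
class FlakyRpcPlugin:
    """Hypothetical RPC stub: the first call fails, later calls succeed."""

    def __init__(self):
        self.calls = 0

    def get_ports(self, filters):
        self.calls += 1
        if self.calls == 1:
            raise TimeoutError("transient messaging bus error")
        return [{"device_id": "instance-1"}]


class RpcOnlyAgent:
    """Always uses RPC; there is no API client to fall back to."""

    def __init__(self, rpc):
        self.rpc = rpc

    def get_ports(self, filters):
        # An error propagates to the caller and affects this request only.
        return self.rpc.get_ports(filters)


def serve_requests(agent, count):
    """Issue `count` requests and record how each one ends."""
    outcomes = []
    for _ in range(count):
        try:
            agent.get_ports({"ip_address": "10.0.0.5"})
            outcomes.append("ok")
        except TimeoutError:
            outcomes.append("failed")
    return outcomes
```

Calling serve_requests(RpcOnlyAgent(FlakyRpcPlugin()), 3) fails the first request and serves the next two, with no restart and no API credentials involved.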

The outstanding concern is deployments that use both:
1) A core plugin other than ML2
2) The metadata agent

The RPC endpoint that is responsible for answering the metadata agent in the neutron-server process is implemented via the ML2 plugin. Any deployment with ML2, reference implementation or third party, will work fine. Core plugins that do not implement this RPC endpoint to answer metadata requests, and whose architecture includes the metadata agent, will no longer work if we backport the patch in question. The reason why I think it is reasonable to backport the patch is that I am not aware of any Neutron solutions in certification that meet both conditions outlined above. Furthermore, these solutions will have to adapt to the new world come OSP 9, as the patch was merged to Mitaka upstream. It is therefore likely that a patch to fix the problem on the vendor side already exists, and if this becomes a problem with OSP 8, that patch can simply be backported.

We will be moving forward with the backport in one week unless anyone objects loudly. I am keeping the needinfo's up for now.

Comment 13 Toni Freger 2016-06-23 10:42:26 UTC

Code tested in the latest OSP 8: openstack-neutron-7.0.4-7.el7ost.noarch

Comment 16 errata-xmlrpc 2016-06-29 13:59:39 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1353

Comment 17 Red Hat Bugzilla 2023-09-14 03:23:21 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days