Description of problem: We created an upstream bug to review the fallback logic for RPC messaging in Liberty: https://bugs.launchpad.net/neutron/+bug/1586025
~~~
Would it be possible to review the fallback logic in Liberty? Do we really need to fall back on any RPC error? We don't think it's optimal right now, as it falls back on all RPC errors. This makes the agent fragile in case the neutron client is misconfigured.
~~~
This bug is submitted to track the upstream bug in case we need to backport the fix downstream. Thank you, Kind Regards, Robin Černín
To reflect the description change from Ihar Hrachyshka: Until Kilo, we used the neutronclient library to get data from neutron-server to the metadata agent. In Kilo we introduced RPC communication between the server and the metadata agent, with a fallback mechanism for those who upgrade agents first. Upgrading agents first can lead to a situation where the server is still on Juno, which doesn't have the RPC API needed by the Kilo agent. In that situation, the agent goes back to using neutronclient. https://github.com/openstack/neutron/blob/stable/liberty/neutron/agent/metadata/agent.py#L131 The fallback mechanism stayed in place and made it into Liberty, where it's no longer needed. There is also a problem with it: we fall back on any exception that comes from RPC communication. New Liberty deployments are not supposed to configure the metadata agent with credentials for the neutron API (it has not been used since Kilo), so on any RPC error the agent switches to an unconfigured neutron client and stays there until the metadata agent is restarted. We should just remove the fallback mechanism, as we did in Mitaka: https://review.openstack.org/#/c/231065/
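The failure mode described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in agent.py: every class and method name here (TransientRpcError, FlakyRpc, UnconfiguredApiClient, FallbackAgent, RpcOnlyAgent, get_ports, list_ports) is a hypothetical stand-in, and the "RPC client" is a plain Python object that fails once and then recovers.

```python
class TransientRpcError(Exception):
    """Hypothetical stand-in for an intermittent messaging-bus failure."""


class FlakyRpc:
    """Hypothetical RPC client: the first call fails, later calls succeed."""

    def __init__(self):
        self.calls = 0

    def get_ports(self, filters):
        self.calls += 1
        if self.calls == 1:
            raise TransientRpcError("messaging bus hiccup")
        return [{"id": "port-1"}]


class UnconfiguredApiClient:
    """Hypothetical API client deployed without credentials."""

    def list_ports(self, filters):
        raise RuntimeError("no API credentials configured")


class FallbackAgent:
    """Sketch of the fragile pattern: any RPC error flips a sticky flag."""

    def __init__(self, rpc, api):
        self.rpc = rpc
        self.api = api
        self.use_rpc = True

    def get_ports(self, filters):
        if self.use_rpc:
            try:
                return self.rpc.get_ports(filters)
            except Exception:
                # Never reset: the agent stays on the API client
                # until the process is restarted.
                self.use_rpc = False
        return self.api.list_ports(filters)


class RpcOnlyAgent:
    """Sketch of the proposed behaviour: no fallback, errors just propagate."""

    def __init__(self, rpc):
        self.rpc = rpc

    def get_ports(self, filters):
        return self.rpc.get_ports(filters)
```

In this sketch, a FallbackAgent built with FlakyRpc and UnconfiguredApiClient keeps failing on every call after the first RPC error, even though the RPC side has recovered. An RpcOnlyAgent fails once and then works again on the next request, with no restart needed.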
I think that backporting the patch to OSP 8 is the right thing to do, as the rewards outweigh the risks. Here's an explanation of the bug, the proposed solution, the concern with the solution, and why I think it makes sense to move forward.

The Neutron metadata agent is used in most deployments during the "spawn a VM" flow. No metadata means that OpenStack won't be able to inject SSH keys into VMs, rendering the VMs inaccessible in the common use case. The Neutron metadata agent needs to access information in the Neutron DB, which it used to obtain via the Neutron API. At some point, it was switched over to grab the same information via the messaging bus instead, keeping a fallback to the API in case the RPC implementation was found to be immature.

We very recently found out that the fallback logic was: if any error occurs (including intermittent or transient errors), fall back to using the API client. The issue is twofold. First, falling back permanently on a transient error doesn't make any sense. Second, TripleO doesn't configure the Neutron metadata agent to use the API client. That means that after any intermittent messaging bus error, the metadata agent would become as useful as a bag of bricks until the admin figured out that VMs could not access metadata and restarted the metadata agent. In short, this bug is a ticking time bomb.

The solution that we are proposing is to backport a patch from Mitaka that removes the fallback logic entirely, leaving only the RPC client as a possibility, which will solve both problems outlined above.

The outstanding concern is deployments that use both:
1) A core plugin other than ML2
2) The metadata agent

The RPC endpoint that is responsible for answering the metadata agent in the neutron-server process is implemented via the ML2 plugin. Any deployment with ML2, reference implementation or third party, will work fine.
Core plugins that do not implement this RPC endpoint to answer metadata requests, and whose architecture includes the metadata agent, will no longer work if we backport the patch in question. The reason why I think it is reasonable to backport the patch is that I am not aware of any Neutron solutions in certification that meet both conditions outlined above. Furthermore, these solutions will have to adapt to the new world come OSP 9, as the patch was merged to Mitaka upstream, so it is likely that a patch to fix the problem on the vendor side already exists and, if it becomes a problem with OSP 8, will simply be backported. We will be moving forward with the backport in one week unless anyone objects loudly. I am keeping the needinfos up for now.
Code tested in latest OSP8 - openstack-neutron-7.0.4-7.el7ost.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1353
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days