Bug 1051047
| Summary: | neutron server doesn't reschedule routers when a neutron-l3-agent goes down | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Miguel Angel Ajo <majopela> |
| Component: | openstack-neutron | Assignee: | Miguel Angel Ajo <mangelajo> |
| Status: | CLOSED ERRATA | QA Contact: | yfried |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 4.0 | CC: | amoralej, chrisw, dnavale, fdinitto, javier.pena, lpeer, mangelajo, twilson, yeylon |
| Target Milestone: | z2 | Keywords: | OtherQA, ZStream |
| Target Release: | 4.0 | Flags: | majopela: needinfo- |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | openstack-neutron-2013.2.2-1.el6ost | Doc Type: | Known Issue |
| Doc Text: |
When a neutron-l3-agent is shut down (or dies) and another neutron-l3-agent is started on a different node, OpenStack Networking does not reschedule the virtual routers to the new agent; routing and metadata remain tied to the ID of the initial L3 agent. As a result, you cannot build an HA environment (Active/Active or Active/Passive) from several nodes running L3 agents with different IDs.
Workaround:
Set the same 'host=' value in the agent configuration file of both L3 agents so that they present the same logical ID to neutron-server (see the configuration sketch after this table).
Two hosts must never run neutron-l3-agent at the same time with the same 'host=' parameter, and when an L3 agent is brought down (service stop), run the 'neutron-netns-cleanup --forced' script to clean up any namespaces and running state left behind by the agent.
With this workaround, virtual routers can be taken over by a different neutron-l3-agent, as long as both agents share the same 'host=' logical ID. In 'neutron agent-list', the host field of the neutron-l3-agent shows the configured 'host=' value rather than the actual agent hostname.
| Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-03-04 20:13:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1061578, 1072381 | ||
| Bug Blocks: | 1080561 | ||
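The workaround described in the Doc Text above boils down to one setting. Here is a minimal sketch, assuming the placeholder logical name `l3-agent-hostgroup` and the default configuration path (adjust both for your deployment); the same value must be set on every network node that may run the L3 agent, and only one of those nodes may run it at any given time:

```ini
# /etc/neutron/l3_agent.ini -- identical on every node that can run neutron-l3-agent
[DEFAULT]
# Logical agent ID reported to neutron-server. Placeholder value: any stable
# name shared by the active/passive L3 agent nodes works, as long as only one
# agent using this name runs at a time.
host = l3-agent-hostgroup
```

After a failover, `neutron agent-list` shows this value in the host column instead of the node's real hostname.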
Javier, this made it work, although we have found situations like this during testing (it is just a bug/race condition) while switching the ACTIVE node: https://bugzilla.redhat.com/show_bug.cgi?id=1051615

I still have to confirm upstream that this setting is intended for what we are doing and that it has no side effects, but in my testing it does effectively work. Thank you very much, Miguel Ángel

We have a workaround per comment#3, but the general scheduling problem is not solved upstream and would be addressed in Icehouse - https://bugzilla.redhat.com/show_bug.cgi?id=1042396

Upstream confirmation of the intended usage of the host= parameter: http://lists.openstack.org/pipermail/openstack-dev/2014-January/026020.html

I've been testing this configuration with ML2 + OVS + vxlan and I've found that adding the host parameter to the l3_agent configuration causes problems. With ML2, when a router is assigned to an L3 agent whose host value differs from the hostname, the internal port of the router (the one connected to the br-int OVS bridge) is assigned VLAN 4095 and a flow is created to drop all packets from this port. I've seen this is done by the port_dead method in the openvswitch agent.

Another possible workaround is running something like this from a cron job, making sure it evacuates virtual routers from down L3 agents to live ones: https://github.com/stackforge/cookbook-openstack-network/blob/master/files/default/neutron-ha-tool.py

(In reply to Alfredo Moralejo from comment #6)
> I've been testing this configuration with ML2 + OVS + vxlan and I've found
> that adding the host parameter to the l3_agent configuration causes problems.
>
> With ML2, when a router is assigned to an L3 agent whose host value differs
> from the hostname, the internal port of the router (the one connected to the
> br-int OVS bridge) is assigned VLAN 4095 and a flow is created to drop all
> packets from this port. I've seen this is done by the port_dead method in
> the openvswitch agent.

The ml2 plugin uses the value of the binding:host_id port attribute in port binding. The binding:host_id of the l3-agent's ports is set from the host value in the l3-agent config. If this does not match the name the openvswitch-agent uses for the host, a binding cannot be created. See BZ 1061578. A solution for this particular use case may be to override host with the same value in the openvswitch-agent and l3-agent config files (see the configuration sketch after the verification output below).

Not enough baremetal resources at the moment. Miguel has volunteered to verify.

I can confirm that it works:

1) set up two network nodes and a controller
2) set host=l3-agent-name (or the desired logical name) in l3_agent.ini on both network nodes
3) start the L3 agent on network node A
4) ping from a VM to the external network: OK

-failover-

5) power off A (or /etc/init.d/neutron-l3-agent stop + neutron-netns-forced-cleanup from bz#1051036)
6) start the L3 agent on network node B
7) ping from the same VM to the external network: OK

-failback-

8) power on A
9) power off B (or stop the L3 agent + use the cleanup script)
10) start the L3 agent on network node A
11) ping from the same VM to the external network: OK

Checked with 2013.2.2-1 on RHEL6.5 with the 2014-02-17.1 build.
node A:

```
[root@rhos4-neutron-n1 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.5 (Santiago)
[root@rhos4-neutron-n1 ~]# rpm -qa | grep neutron
python-neutron-2013.2.2-1.el6ost.noarch
openstack-neutron-2013.2.2-1.el6ost.noarch
python-neutronclient-2.3.1-3.el6ost.noarch
openstack-neutron-openvswitch-2013.2.2-1.el6ost.noarch
```

node B:

```
[root@rhos4-neutron-n2 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.5 (Santiago)
[root@rhos4-neutron-n2 ~]# rpm -qa | grep neutron
python-neutron-2013.2.2-1.el6ost.noarch
openstack-neutron-2013.2.2-1.el6ost.noarch
python-neutronclient-2.3.1-3.el6ost.noarch
openstack-neutron-openvswitch-2013.2.2-1.el6ost.noarch
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-0213.html
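To illustrate the reply about ML2 port binding in the comments above, here is a hedged configuration sketch: the idea is that the binding:host_id written by the L3 agent must match the host the openvswitch agent reports, so both agents on a network node override 'host' with the same value. The logical name is a placeholder and the file paths are the usual RHEL OSP 4 openvswitch defaults; treat this as a sketch of the suggestion, not a verified recipe.

```ini
# /etc/neutron/l3_agent.ini (on each network node)
[DEFAULT]
host = l3-agent-hostgroup    # placeholder logical name

# /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini (same node)
[DEFAULT]
host = l3-agent-hostgroup    # must match the L3 agent's value so the ML2 port binding succeeds
```

With matching values, the router's internal port should no longer be marked dead (VLAN 4095 plus a drop flow) by the openvswitch agent.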
Description of problem:

When you shut down a neutron-l3-agent (or it dies) and you start another neutron-l3-agent on a different node, you end up with no routing or metadata for the networks that were routed by the first agent.

How reproducible: Always

Steps to Reproduce:
1. Start the first network node with neutron-l3-agent. It will RPC-call sync_routers into neutron-server and get auto-assigned any unassigned virtual routers.
2. Stop the first network node.
3. Start the second network node (steps 2 and 3 can be exchanged).

Actual results:
`neutron agent-list` shows node1 as down and node2 as up. Even with the first node marked as not alive, the virtual routers are not rescheduled to a new neutron-l3-agent, because they remain assigned to node1.

Expected results:
For use in HA environments, some kind of automatic relocation of the virtual routers would help, for example driven by neutron.conf settings.

Additional info:
Cleaning up non-alive agents between steps 2 and 3 makes it work (a fuller evacuation sketch follows this description):

```
DOWN_AGENTS=$(neutron agent-list | grep "| xxx |" | cut -f2 -d' ')
for AGENT in $DOWN_AGENTS; do neutron agent-delete $AGENT; done
```

We are assuming router_auto_schedule = True in neutron.conf (the default).

Upstream this could clash with the blueprint https://blueprints.launchpad.net/neutron/+spec/l3-high-availability, which intends to provide HA from inside neutron.
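As noted in the Additional info, deleting the dead agents forces rescheduling; a gentler variant is to move only the routers, which is what the neutron-ha-tool.py script linked in the comments automates. Below is an untested sketch of that approach using only standard neutron CLI calls (router-list-on-l3-agent, l3-agent-router-remove, l3-agent-router-add); the parsing assumes the default table output where the alive column shows ':-)' for live agents and 'xxx' for dead ones, and it is meant to run periodically, e.g. from cron.

```bash
#!/bin/bash
# Untested sketch: evacuate virtual routers from L3 agents that neutron-server
# reports as dead onto the first live L3 agent (the idea behind neutron-ha-tool.py).

# IDs of L3 agents currently reported as dead.
DEAD_L3_AGENTS=$(neutron agent-list | grep 'L3 agent' | grep ' xxx ' \
                 | awk -F'|' '{gsub(/ /, "", $2); print $2}')

# The first live L3 agent becomes the evacuation target.
TARGET_AGENT=$(neutron agent-list | grep 'L3 agent' | grep -F ':-)' | head -n 1 \
               | awk -F'|' '{gsub(/ /, "", $2); print $2}')

[ -z "$TARGET_AGENT" ] && { echo "no live L3 agent found" >&2; exit 1; }

for AGENT in $DEAD_L3_AGENTS; do
    # Routers still scheduled on the dead agent (first column of the listing).
    for ROUTER in $(neutron router-list-on-l3-agent "$AGENT" \
                    | awk -F'|' '$2 ~ /[0-9a-f-]{36}/ {gsub(/ /, "", $2); print $2}'); do
        neutron l3-agent-router-remove "$AGENT" "$ROUTER"
        neutron l3-agent-router-add "$TARGET_AGENT" "$ROUTER"
    done
done
```

Unlike the agent-delete loop above, this keeps the dead agent's record in place and only moves its routers, so the agent is picked up again once its node comes back.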