Description of problem:

When you shut down a neutron-l3-agent (or it dies) and you start another neutron-l3-agent on a different node, you end up with no routing or metadata for the networks that were routed by the first agent.

How reproducible:

Always

Steps to Reproduce:
1. Start the first network node with neutron-l3-agent. The agent makes a sync_routers RPC call to neutron-server and gets auto-assigned any unassigned virtual routers.
2. Stop the first network node.
3. Start the second network node (steps 2 and 3 can be exchanged).

Actual results:

# neutron agent-list
shows node1 as down and node2 as up. Even with the first node marked as not alive, the virtual routers are not rescheduled to the new neutron-l3-agent, because they are still assigned to node1.

Expected results:

For use in HA environments, some kind of automatic relocation of the virtual routers would help, configurable via neutron.conf settings for example.

Additional info:

Cleaning up the non-alive agents between steps 2 and 3 makes it work (see the illustrative agent-list output at the end of this comment):

DOWN_AGENTS=$(neutron agent-list | grep "| xxx |" | cut -f2 -d' ')
for AGENT in $DOWN_AGENTS; do neutron agent-delete $AGENT; done

We are assuming that router_auto_schedule = True in neutron.conf (the default).

Upstream this could clash with the blueprint
https://blueprints.launchpad.net/neutron/+spec/l3-high-availability
which intends to provide HA from inside Neutron.
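For illustration, a dead L3 agent shows "xxx" in the alive column, which is what the grep above matches (made-up IDs and hostnames; exact columns vary with the client version):

# neutron agent-list
+--------------------------------------+------------+-------+-------+----------------+
| id                                   | agent_type | host  | alive | admin_state_up |
+--------------------------------------+------------+-------+-------+----------------+
| 4b9b58f9-...                         | L3 agent   | node1 | xxx   | True           |
| 8c1d2a6e-...                         | L3 agent   | node2 | :-)   | True           |
+--------------------------------------+------------+-------+-------+----------------+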
Javier, this made it work, although during testing we have hit situations like this (just a bug/race condition) while switching the ACTIVE node:
https://bugzilla.redhat.com/show_bug.cgi?id=1051615

I still have to check upstream that this setting is intended for what we're doing and that we won't hit any side effects, but from what I have tested it does effectively work.

Thank you very much,
Miguel Ángel
We have a workaround per comment #3, but the general scheduling problem is not solved upstream and should be addressed in Icehouse - https://bugzilla.redhat.com/show_bug.cgi?id=1042396
Upstream confirmation of the intended usage of the host= parameter:
http://lists.openstack.org/pipermail/openstack-dev/2014-January/026020.html
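For reference, the setting in question. A minimal sketch, assuming the same arbitrary logical name ("l3-agent-name", the example value also used in the verification steps below) on both network nodes:

# /etc/neutron/l3_agent.ini on both network nodes
[DEFAULT]
# Logical name under which routers are scheduled; must be identical on
# every node that may take over the routers.
host = l3-agent-name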
I've been testing this configuration with ML2 + OVS + VXLAN and I've found that adding the host parameter to the l3_agent configuration causes problems.

When using ML2, when a router is assigned to an L3 agent whose host value differs from the hostname, the internal port of the router (the one connected to the br-int OVS bridge) is assigned VLAN 4095 and a flow is created to drop all packets from this port. I've seen this is done by the port_dead method in the openvswitch agent.
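The symptom can be checked on the network node with something like the following (a sketch; qr-1234abcd-56 stands in for the router's actual internal port name):

# tag 4095 is the OVS agent's "dead VLAN"
ovs-vsctl list port qr-1234abcd-56 | grep tag
# port_dead also installs a drop flow on br-int for that port
ovs-ofctl dump-flows br-int | grep drop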
Another possible workaround is running something like this periodically from cron, making sure that it evacuates virtual routers from down L3 agents to live ones:
https://github.com/stackforge/cookbook-openstack-network/blob/master/files/default/neutron-ha-tool.py
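A minimal sketch of what such a job does, using only neutron CLI calls (the grep/cut idiom matches the agent-delete snippet in the description; no error handling, and it naively picks the first live L3 agent as the target - the linked tool is more robust):

LIVE=$(neutron agent-list | grep 'L3 agent' | grep ':-)' | head -1 | cut -f2 -d' ')
for AGENT in $(neutron agent-list | grep 'L3 agent' | grep '| xxx |' | cut -f2 -d' '); do
    for ROUTER in $(neutron router-list-on-l3-agent $AGENT -f csv -c id | tail -n +2 | tr -d '"'); do
        neutron l3-agent-router-remove $AGENT $ROUTER
        neutron l3-agent-router-add $LIVE $ROUTER
    done
done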
(In reply to Alfredo Moralejo from comment #6)
> I've been testing this configuration with ML2 + OVS + VXLAN and I've found
> that adding the host parameter to the l3_agent configuration causes
> problems.
>
> When using ML2, when a router is assigned to an L3 agent whose host value
> differs from the hostname, the internal port of the router (the one
> connected to the br-int OVS bridge) is assigned VLAN 4095 and a flow is
> created to drop all packets from this port. I've seen this is done by the
> port_dead method in the openvswitch agent.

The ml2 plugin uses the value of the binding:host_id port attribute in port binding. The binding:host_id of the l3-agent's ports is set from the host value in the l3-agent config. If this does not match the name the openvswitch-agent uses for the host, a binding cannot be created. See BZ 1061578.

A solution for this particular use case may be to override host with the same value in the openvswitch-agent and l3-agent config files.
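Concretely, that would mean something like the following on each network node (a sketch; which file the openvswitch-agent picks its [DEFAULT] host from depends on the packaging, e.g. neutron.conf or the plugin config):

# l3_agent.ini
[DEFAULT]
host = l3-agent-name

# config file read by neutron-openvswitch-agent on the same node
[DEFAULT]
host = l3-agent-name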
Not enough baremetal resources ATM. Miguel has volunteered to verify.
I can confirm that it works:

1) Set up two network nodes and a controller.
2) Set host=l3-agent-name (or a desired logical name) in l3_agent.ini on both network nodes.
3) Start the l3-agent on network node A.
4) Ping from a VM to the external network: OK

-failover-
5) Power off A (or /etc/init.d/neutron-l3-agent stop + neutron-netns-forced-cleanup from bz#1051036).
6) Start the l3-agent on network node B.
7) Ping from the same VM to the external network: OK

-failback-
8) Power on A.
9) Power off B (or stop the l3 agent + use the cleanup script).
10) Start the l3 agent on network node A.
11) Ping from the same VM to the external network: OK
Checked with 2013.2.2-1 on RHEL 6.5 with the 2014-02-17.1 build.

node A:
[root@rhos4-neutron-n1 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.5 (Santiago)
[root@rhos4-neutron-n1 ~]# rpm -qa | grep neutron
python-neutron-2013.2.2-1.el6ost.noarch
openstack-neutron-2013.2.2-1.el6ost.noarch
python-neutronclient-2.3.1-3.el6ost.noarch
openstack-neutron-openvswitch-2013.2.2-1.el6ost.noarch

node B:
[root@rhos4-neutron-n2 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.5 (Santiago)
[root@rhos4-neutron-n2 ~]# rpm -qa | grep neutron
python-neutron-2013.2.2-1.el6ost.noarch
openstack-neutron-2013.2.2-1.el6ost.noarch
python-neutronclient-2.3.1-3.el6ost.noarch
openstack-neutron-openvswitch-2013.2.2-1.el6ost.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-0213.html