Bug 1051047
| Summary: | neutron server doesn't reschedule routers when a neutron-l3-agent goes down | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Miguel Angel Ajo <majopela> |
| Component: | openstack-neutron | Assignee: | Miguel Angel Ajo <mangelajo> |
| Status: | CLOSED ERRATA | QA Contact: | yfried |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 4.0 | CC: | amoralej, chrisw, dnavale, fdinitto, javier.pena, lpeer, mangelajo, twilson, yeylon |
| Target Milestone: | z2 | Keywords: | OtherQA, ZStream |
| Target Release: | 4.0 | Flags: | majopela: needinfo- |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | openstack-neutron-2013.2.2-1.el6ost | Doc Type: | Known Issue |
| Doc Text: |
When a neutron-l3-agent is shut down (or dies) and another neutron-l3-agent is started on a different node, OpenStack Networking does not reschedule the virtual routers to the new agent; routing and metadata remain tied to the ID of the initial L3 agent. As a result, you cannot build an HA environment (Active/Active or Active/Passive) from several nodes running L3 agents with different IDs.
Workaround:
Set the same 'host=' value in the agent configuration file of both L3 agents so that they present the same logical ID to neutron-server (see the configuration sketch after this table).
Two hosts must never run neutron-l3-agent at the same time with the same 'host=' parameter, and when an L3 agent is brought down (service stop), run the 'neutron-netns-cleanup --forced' script to clean up any namespaces and running state left behind by the agent.
With this workaround, virtual routers can be taken over by a different neutron-l3-agent, as long as both agents share the same 'host=' logical ID. In 'neutron agent-list', the host field of the neutron-l3-agent shows the configured 'host=' value rather than the actual agent hostname.
| Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-03-04 20:13:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1061578, 1072381 | ||
| Bug Blocks: | 1080561 | ||
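The workaround described in the Doc Text above boils down to one setting. Here is a minimal sketch, assuming the placeholder logical name `l3-agent-hostgroup` and the default configuration path (adjust both for your deployment); the same value must be set on every network node that may run the L3 agent, and only one of those nodes may run it at any given time:

```ini
# /etc/neutron/l3_agent.ini -- identical on every node that can run neutron-l3-agent
[DEFAULT]
# Logical agent ID reported to neutron-server. Placeholder value: any stable
# name shared by the active/passive L3 agent nodes works, as long as only one
# agent using this name runs at a time.
host = l3-agent-hostgroup
```

After a failover, `neutron agent-list` shows this value in the host column instead of the node's real hostname.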
Javier, this made it work, although we have found situations like this during testing (it is just a bug/race condition) while switching the ACTIVE node: https://bugzilla.redhat.com/show_bug.cgi?id=1051615

I still have to confirm upstream that this setting is intended for what we are doing and that it has no side effects, but in my testing it does effectively work. Thank you very much, Miguel Ángel

We have a workaround per comment#3, but the general scheduling problem is not solved upstream and would be addressed in Icehouse - https://bugzilla.redhat.com/show_bug.cgi?id=1042396

Upstream confirmation of the intended usage of the host= parameter: http://lists.openstack.org/pipermail/openstack-dev/2014-January/026020.html

I've been testing this configuration with ML2 + OVS + vxlan and I've found that adding the host parameter to the l3_agent configuration causes problems. With ML2, when a router is assigned to an L3 agent whose host value differs from the hostname, the internal port of the router (the one connected to the br-int OVS bridge) is assigned VLAN 4095 and a flow is created to drop all packets from this port. I've seen this is done by the port_dead method in the openvswitch agent.

Another possible workaround is running something like this from a cron job, making sure it evacuates virtual routers from down L3 agents to live ones: https://github.com/stackforge/cookbook-openstack-network/blob/master/files/default/neutron-ha-tool.py

(In reply to Alfredo Moralejo from comment #6)
> I've been testing this configuration with ML2 + OVS + vxlan and I've found
> that adding the host parameter to the l3_agent configuration causes problems.
>
> With ML2, when a router is assigned to an L3 agent whose host value differs
> from the hostname, the internal port of the router (the one connected to the
> br-int OVS bridge) is assigned VLAN 4095 and a flow is created to drop all
> packets from this port. I've seen this is done by the port_dead method in
> the openvswitch agent.

The ml2 plugin uses the value of the binding:host_id port attribute in port binding. The binding:host_id of the l3-agent's ports is set from the host value in the l3-agent config. If this does not match the name the openvswitch-agent uses for the host, a binding cannot be created. See BZ 1061578. A solution for this particular use case may be to override host with the same value in the openvswitch-agent and l3-agent config files (see the configuration sketch after the verification output below).

Not enough baremetal resources at the moment. Miguel has volunteered to verify.

I can confirm that it works:

1) set up two network nodes and a controller
2) set host=l3-agent-name (or the desired logical name) in l3_agent.ini on both network nodes
3) start the L3 agent on network node A
4) ping from a VM to the external network: OK

-failover-

5) power off A (or /etc/init.d/neutron-l3-agent stop + neutron-netns-forced-cleanup from bz#1051036)
6) start the L3 agent on network node B
7) ping from the same VM to the external network: OK

-failback-

8) power on A
9) power off B (or stop the L3 agent + use the cleanup script)
10) start the L3 agent on network node A
11) ping from the same VM to the external network: OK

Checked with 2013.2.2-1 on RHEL6.5 with the 2014-02-17.1 build.
node A:

```
[root@rhos4-neutron-n1 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.5 (Santiago)
[root@rhos4-neutron-n1 ~]# rpm -qa | grep neutron
python-neutron-2013.2.2-1.el6ost.noarch
openstack-neutron-2013.2.2-1.el6ost.noarch
python-neutronclient-2.3.1-3.el6ost.noarch
openstack-neutron-openvswitch-2013.2.2-1.el6ost.noarch
```

node B:

```
[root@rhos4-neutron-n2 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.5 (Santiago)
[root@rhos4-neutron-n2 ~]# rpm -qa | grep neutron
python-neutron-2013.2.2-1.el6ost.noarch
openstack-neutron-2013.2.2-1.el6ost.noarch
python-neutronclient-2.3.1-3.el6ost.noarch
openstack-neutron-openvswitch-2013.2.2-1.el6ost.noarch
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-0213.html
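To illustrate the reply about ML2 port binding in the comments above, here is a hedged configuration sketch: the idea is that the binding:host_id written by the L3 agent must match the host the openvswitch agent reports, so both agents on a network node override 'host' with the same value. The logical name is a placeholder and the file paths are the usual RHEL OSP 4 openvswitch defaults; treat this as a sketch of the suggestion, not a verified recipe.

```ini
# /etc/neutron/l3_agent.ini (on each network node)
[DEFAULT]
host = l3-agent-hostgroup    # placeholder logical name

# /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini (same node)
[DEFAULT]
host = l3-agent-hostgroup    # must match the L3 agent's value so the ML2 port binding succeeds
```

With matching values, the router's internal port should no longer be marked dead (VLAN 4095 plus a drop flow) by the openvswitch agent.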
Description of problem:

When you shut down a neutron-l3-agent (or it dies) and you start another neutron-l3-agent on a different node, you end up with no routing or metadata for the networks that were routed by the first agent.

How reproducible: Always

Steps to Reproduce:
1. Start the first network node with neutron-l3-agent. It will RPC-call sync_routers into neutron-server and get auto-assigned any unassigned virtual routers.
2. Stop the first network node.
3. Start the second network node (steps 2 and 3 can be exchanged).

Actual results:
`neutron agent-list` shows node1 as down and node2 as up. Even with the first node marked as not alive, the virtual routers are not rescheduled to a new neutron-l3-agent, because they remain assigned to node1.

Expected results:
For use in HA environments, some kind of automatic relocation of the virtual routers would help, for example driven by neutron.conf settings.

Additional info:
Cleaning up non-alive agents between steps 2 and 3 makes it work (a fuller evacuation sketch follows this description):

```
DOWN_AGENTS=$(neutron agent-list | grep "| xxx |" | cut -f2 -d' ')
for AGENT in $DOWN_AGENTS; do neutron agent-delete $AGENT; done
```

We are assuming router_auto_schedule = True in neutron.conf (the default).

Upstream this could clash with the blueprint https://blueprints.launchpad.net/neutron/+spec/l3-high-availability, which intends to provide HA from inside neutron.
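As noted in the Additional info, deleting the dead agents forces rescheduling; a gentler variant is to move only the routers, which is what the neutron-ha-tool.py script linked in the comments automates. Below is an untested sketch of that approach using only standard neutron CLI calls (router-list-on-l3-agent, l3-agent-router-remove, l3-agent-router-add); the parsing assumes the default table output where the alive column shows ':-)' for live agents and 'xxx' for dead ones, and it is meant to run periodically, e.g. from cron.

```bash
#!/bin/bash
# Untested sketch: evacuate virtual routers from L3 agents that neutron-server
# reports as dead onto the first live L3 agent (the idea behind neutron-ha-tool.py).

# IDs of L3 agents currently reported as dead.
DEAD_L3_AGENTS=$(neutron agent-list | grep 'L3 agent' | grep ' xxx ' \
                 | awk -F'|' '{gsub(/ /, "", $2); print $2}')

# The first live L3 agent becomes the evacuation target.
TARGET_AGENT=$(neutron agent-list | grep 'L3 agent' | grep -F ':-)' | head -n 1 \
               | awk -F'|' '{gsub(/ /, "", $2); print $2}')

[ -z "$TARGET_AGENT" ] && { echo "no live L3 agent found" >&2; exit 1; }

for AGENT in $DEAD_L3_AGENTS; do
    # Routers still scheduled on the dead agent (first column of the listing).
    for ROUTER in $(neutron router-list-on-l3-agent "$AGENT" \
                    | awk -F'|' '$2 ~ /[0-9a-f-]{36}/ {gsub(/ /, "", $2); print $2}'); do
        neutron l3-agent-router-remove "$AGENT" "$ROUTER"
        neutron l3-agent-router-add "$TARGET_AGENT" "$ROUTER"
    done
done
```

Unlike the agent-delete loop above, this keeps the dead agent's record in place and only moves its routers, so the agent is picked up again once its node comes back.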