Bug 1169408 - Neutron router interface port creation fails with radvd >= 2.0 due to blocked router update processing
Summary: Neutron router interface port creation fails with radvd >= 2.0 due to blocked router update processing
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: openstack-neutron
Version: rawhide
Hardware: Unspecified
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Assignee: Ihar Hrachyshka
QA Contact: Nir Magnezi
URL:
Whiteboard:
Depends On:
Blocks: 1046786 1083891
 
Reported: 2014-12-01 15:17 UTC by Nir Magnezi
Modified: 2016-04-26 16:09 UTC
CC List: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-01-08 15:45:12 UTC
Type: Bug


Attachments (Terms of Use)
tests (17.26 KB, text/plain), 2014-12-01 15:17 UTC, Nir Magnezi
port_down (2.51 KB, text/plain), 2014-12-01 15:19 UTC, Nir Magnezi
port_active (5.60 KB, text/plain), 2014-12-01 15:19 UTC, Nir Magnezi
IPv6 plan log (3.73 KB, text/plain), 2014-12-01 15:39 UTC, Nir Magnezi


Links
Launchpad: 1398779
OpenStack gerrit: 138688

Description Nir Magnezi 2014-12-01 15:17:42 UTC
Created attachment 963325 [details]
tests

Description of problem:
=======================
At a certain point while testing neutron for IPv6, neutron ports created for router interfaces remain DOWN upon attachment.
This happened twice while running the same test plan, at a similar stage.
Since there is no short reproduction scenario, I'll attach files with the tests I executed.

In addition, I noticed that while this issue persists for the router I created, when I create an additional router (without removing the "problematic" one), router interface attachments (and hence port creations) on the new router work fine.

I will add information to this bug as we discover more.

Version-Release number of selected component (if applicable):
=============================================================
RHEL-OSP6-Beta: openstack-neutron-2014.2-11.el7ost.noarch

How reproducible:
=================
Reproduced 2 times so far.

Steps to Reproduce:
===================
* See attached file named: tests
* The steps below summarize the reproduction once the issue starts occurring in the OpenStack setup (a rough python-neutronclient sketch of these steps follows the list)
1. create a neutron router
2. create network, subnet (I used both IPv4 & IPv6)
3. attach the subnet to the router
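
For reference, a rough python-neutronclient sketch of the steps above. The credentials, names, and address ranges are placeholders for illustration, not values taken from this setup:

from neutronclient.v2_0 import client

# Placeholder credentials for illustration only.
neutron = client.Client(username='admin', password='secret',
                        tenant_name='admin',
                        auth_url='http://controller:5000/v2.0')

# 1. create a neutron router
router = neutron.create_router({'router': {'name': 'test-router'}})['router']

# 2. create a network with an IPv4 and an IPv6 subnet
net = neutron.create_network({'network': {'name': 'test-net'}})['network']
subnet_v4 = neutron.create_subnet(
    {'subnet': {'network_id': net['id'], 'ip_version': 4,
                'cidr': '192.0.2.0/24'}})['subnet']
subnet_v6 = neutron.create_subnet(
    {'subnet': {'network_id': net['id'], 'ip_version': 6,
                'cidr': '2001:db8::/64',
                'ipv6_ra_mode': 'slaac',
                'ipv6_address_mode': 'slaac'}})['subnet']

# 3. attach the subnets to the router; the resulting router ports are the
#    ones that remain DOWN once the bug is triggered
for subnet in (subnet_v4, subnet_v6):
    neutron.add_interface_router(router['id'], {'subnet_id': subnet['id']})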

Actual results:
===============
As described in: 'Description of problem'

For the server.log output when port creation fails, see the attached port_down file.
You'll notice that it never reaches the 'Attempting to bind port' step.

Expected results:
=================
The port should become ACTIVE once the subnet is attached to the router.
For the server.log output when port creation succeeds, see the attached port_active file.

Additional info:
================
Will probably be found in the upcoming comments.

Comment 1 Nir Magnezi 2014-12-01 15:19:07 UTC
Created attachment 963326 [details]
port_down

Comment 2 Nir Magnezi 2014-12-01 15:19:42 UTC
Created attachment 963327 [details]
port_active

Comment 5 Nir Magnezi 2014-12-01 15:39:55 UTC
Created attachment 963340 [details]
IPv6 plan log

Comment 6 Ihar Hrachyshka 2014-12-02 10:31:44 UTC
As was said before, the problem is with one of the routers. Other routers work fine.

Some log reading below:

When a subnet is attached to the router, the following can be found in server.log on the l3 agent side:

2014-11-30 11:29:52.352 2686 DEBUG neutron.agent.l3_agent [req-06a126a5-fac3-4f5c-915e-07cb51da821b None] Got routers updated notification :[u'c480186a-0a2f-4f1c-b0d1-89251760f9cd'] routers_updated /usr/lib/python2.7/site-packages/neutron/agent/l3_agent.py:1763

However, there are no corresponding "Starting router update for %s" messages in the log. Such messages did show up earlier in the logs, but then stopped appearing:

2014-11-27 15:15:54.005 2686 DEBUG neutron.agent.l3_agent [-] Starting router update for c480186a-0a2f-4f1c-b0d1-89251760f9cd _process_router_update /usr/lib/python2.7/site-packages/neutron/agent/l3_agent.py:1830
2014-11-27 15:15:57.433 2686 DEBUG neutron.agent.l3_agent [-] Starting router update for c480186a-0a2f-4f1c-b0d1-89251760f9cd _process_router_update /usr/lib/python2.7/site-packages/neutron/agent/l3_agent.py:1830
2014-11-27 15:16:03.493 2686 DEBUG neutron.agent.l3_agent [-] Starting router update for c480186a-0a2f-4f1c-b0d1-89251760f9cd _process_router_update /usr/lib/python2.7/site-packages/neutron/agent/l3_agent.py:1830
2014-11-27 15:16:33.067 2686 DEBUG neutron.agent.l3_agent [-] Starting router update for c480186a-0a2f-4f1c-b0d1-89251760f9cd _process_router_update /usr/lib/python2.7/site-packages/neutron/agent/l3_agent.py:1830
2014-11-27 15:16:47.848 2686 DEBUG neutron.agent.l3_agent [-] Starting router update for c480186a-0a2f-4f1c-b0d1-89251760f9cd _process_router_update /usr/lib/python2.7/site-packages/neutron/agent/l3_agent.py:1830

So the update notification is received from the controller and is probably put into the RouterUpdate queue, but it is then never processed.

Code-wise, updates go through a RouterProcessingQueue that hands them out via the each_update_to_next_router() generator, which is intended to serialize parallel update requests by giving one of the green threads serving _process_router_update() calls exclusive read access to a router's update queue. The code of the queue and the exclusive accessor looks quite hacky and may indeed contain a race condition that ends up blocking any new updates for some unlucky router.

It could also be that the _process_router_update() greenthread pool is full and locked, so no new updates are served at all; but in that case updates for other routers wouldn't be served either, and they are. Also, no l3 agent file locks seem to have been held for a long time in /var/lib/neutron/...
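
To illustrate the suspected failure mode, here is a simplified standalone sketch (plain Python 3 threads, not the actual RouterProcessingQueue code): the first worker to pick up updates for a router effectively becomes its exclusive consumer, so if that worker blocks, later updates for that router are queued but never processed, while other routers stay healthy.

import queue
import threading
import time

_queues = {}   # router_id -> per-router queue of pending updates
_owners = {}   # router_id -> worker thread currently draining that queue
_lock = threading.Lock()

def process_router_update(router_id, update, work):
    with _lock:
        q = _queues.setdefault(router_id, queue.Queue())
        q.put(update)
        if router_id in _owners:
            return                  # another worker already owns this router
        _owners[router_id] = threading.current_thread()
    try:
        while True:
            try:
                item = q.get_nowait()
            except queue.Empty:
                break
            work(router_id, item)   # if this never returns, the router is stuck
    finally:
        with _lock:
            del _owners[router_id]

def work(router_id, item):
    print('processing %s for %s' % (item, router_id))
    if item == 'attach ipv6 subnet':
        time.sleep(3600)            # stand-in for a call that never returns

# router-a hangs on the IPv6 attachment; its next update is queued but never
# processed, while router-b is served normally by another worker.
for rid, item in [('router-a', 'attach ipv6 subnet'),
                  ('router-a', 'attach ipv4 subnet'),
                  ('router-b', 'attach ipv4 subnet')]:
    threading.Thread(target=process_router_update,
                     args=(rid, item, work), daemon=True).start()
    time.sleep(0.1)
time.sleep(1)   # let the demo run briefly before the main thread exits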

Comment 7 Nir Magnezi 2014-12-02 14:10:30 UTC
Reproduction scenario narrowed down to this:
1. Attach an IPv4 subnet --> port is active
2. Attach an IPv6 RADVD-based subnet (stateless, stateful, or SLAAC; see the mode sketch below) --> port is active
3. From that point on, this router is broken and all further attachments stop working; port status remains DOWN.
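
For readability, the three RADVD-driven subnet flavours mentioned in step 2 correspond to the following neutron subnet attributes; in all of them the l3 agent spawns radvd for the router. This is a reference mapping, not output from this setup:

# ipv6_ra_mode / ipv6_address_mode combinations that make the l3 agent run radvd
RADVD_SUBNET_MODES = {
    'slaac':     {'ipv6_ra_mode': 'slaac',
                  'ipv6_address_mode': 'slaac'},
    'stateless': {'ipv6_ra_mode': 'dhcpv6-stateless',
                  'ipv6_address_mode': 'dhcpv6-stateless'},
    'stateful':  {'ipv6_ra_mode': 'dhcpv6-stateful',
                  'ipv6_address_mode': 'dhcpv6-stateful'},
}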

Comment 8 Ihar Hrachyshka 2014-12-02 20:57:36 UTC
So it's a radvd thing. Once I downgrade to radvd < 2.0, everything works fine again. Looking at the diff between 1.14 and 2.0, the daemonization code was changed significantly for the 2.0 release.
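
For what it's worth, here is a standalone sketch (POSIX, Python 3) of the kind of hang a changed daemonization path can cause, assuming the agent keeps reading the spawned process's stderr. This illustrates the suspected mechanism only; it is not neutron or radvd code. The daemonized grandchild inherits the stderr pipe and never closes it, so the parent's read never sees EOF even though the direct child has exited.

import subprocess
import sys

# The child forks a long-lived "daemon" that inherits stderr and keeps it
# open, then the direct child exits immediately -- roughly what a
# daemonizing service does.
child_code = r"""
import os, sys, time
if os.fork() == 0:
    time.sleep(3600)    # daemonized grandchild, stderr still open
else:
    sys.exit(0)         # direct child exits right away
"""

proc = subprocess.Popen([sys.executable, '-c', child_code],
                        stderr=subprocess.PIPE)
# communicate() reads stderr until EOF; EOF only arrives once every holder of
# the write end closes it, so this blocks for as long as the grandchild lives.
print(proc.communicate()[1])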

Comment 9 Nir Yechiel 2014-12-03 09:46:21 UTC
(In reply to Ihar Hrachyshka from comment #8)
> So it's a radvd thing. Once I downgrade to radvd < 2.0, everything works fine
> again. Looking at the diff between 1.14 and 2.0, the daemonization code was
> changed significantly for the 2.0 release.

Hi Ihar,

Nice catch :)

What do you think would be the right way to proceed here? Should we file a bug for radvd or downgrade the version?

Thanks,
Nir

Comment 10 Ihar Hrachyshka 2014-12-03 11:31:35 UTC
@Nir, we should fix neutron to work properly with all versions of radvd. It's as easy as passing an additional '-m syslog' argument to radvd so that it closes stderr in all versions of the daemon. I'm working on fixing it upstream.
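
To make the intent concrete, a minimal sketch of the kind of change this implies (not the actual upstream patch; the helper name and arguments are illustrative):

def build_radvd_cmd(config_path, pid_path):
    # config_path / pid_path stand for the per-router radvd.conf and pidfile
    # the l3 agent already manages; '-m syslog' makes radvd log to syslog and
    # close stderr, in both the 1.x and 2.x series.
    return ['radvd',
            '-C', config_path,
            '-p', pid_path,
            '-m', 'syslog']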

Comment 12 Ihar Hrachyshka 2014-12-12 15:20:10 UTC
Now that we have dropped radvd 2.x from RDO and RHOSP, we can move the bug to Rawhide, the only Red Hat project that includes the new radvd version.

