Bug 1300584

Summary: Backport: Tracker for IPv6 router is not working with VRRP
Product: Red Hat OpenStack
Reporter: Nir Magnezi <nmagnezi>
Component: openstack-neutron
Assignee: Nir Magnezi <nmagnezi>
Status: CLOSED WONTFIX
QA Contact: Toni Freger <tfreger>
Severity: high
Docs Contact:
Priority: high
Version: 6.0 (Juno)
CC: amuller, chrisw, dcadzow, ihrachys, jschluet, mburns, nyechiel, oblaut, srevivo, tfreger, vcojot
Target Milestone: async
Keywords: FeatureBackport, ZStream
Target Release: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-neutron-2014.2.3-34.el7ost
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1222775
Environment:
Last Closed: 2016-05-02 19:03:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1222775
Bug Blocks: 1300580

Comment 6 Toni Freger 2016-04-13 16:27:37 UTC
Tested on two nodes: a Controller, and a Networker + Compute.
Latest puddle from 2016-04-05.2: openstack-neutron-2014.2.3-35.el7ost.src.rpm.
Instance image: RHEL


During my test with VRRP routers I saw the issues below:
1) It takes too much time to get an IPv6 address (~1 minute).
2) After rebooting all nodes (controller and networker), the VM started without an IPv6 address.
An RA is sent from the router, but it takes too long to reach the VM side,
and even once the RA is received, the VM stays without an IPv6 address.

 

Additional Info:
- Instances attached to a network with a legacy router get an IPv6 address after boot and after a reboot of the network nodes.
- No issues with IPv4 for the HA router.
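
(For reference, address assignment and RA arrival on the VM side can be checked with something like the commands below; the interface name eth0 is an assumption, and ICMPv6 type 134 is a router advertisement:)

# ip -6 addr show dev eth0
# tcpdump -i eth0 -n -vv 'icmp6 and ip6[40] == 134'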

Comment 7 Nir Magnezi 2016-04-14 14:54:22 UTC
I used the very same setup to debug this issue; the root cause is as follows:

After the hosts reboot, the ports appear to be wired correctly (Neutron-wise).
When radvd sends IPv6 router advertisements (which does not always happen, more on that below), they reach the instances right away and the instances obtain an IPv6 address.

Upon reproduction (rebooting the servers), Nova instances indeed do not obtain IPv6 addresses.
Looking closely, I found the following (a reference check is sketched after this list):
1. The instance does not get router advertisements, hence no IPv6 addresses.
2. radvd takes a very long time (sometimes a lot more than a minute) to send the first router advertisement.
3. In some cases no router advertisements come from radvd at all (even though the radvd process is running).
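
(As a reference for the checks above, outgoing RAs can be observed on the router's internal port with something like the following; the namespace and port names are taken from this setup:)

# sudo ip netns exec qrouter-01efcb45-0589-4313-8b3a-49057524656c tcpdump -i qr-27493a15-58 -n -vv 'icmp6 and ip6[40] == 134'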

Looking at /var/log/messages (on both nodes) I saw the following error:

radvd[xxxx]: no linklocal address configured for qr-27493a15-58
radvd[xxxx]: sendmsg: Cannot assign requested address
radvd[xxxx]: resuming normal operation

and then, numerous repeats of:
radvd[xxxx]: resetting ipv6-allrouters membership on qr-27493a15-58
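
(The entries above can be located with something like the following on each node:)

# grep radvd /var/log/messages | tail -n 50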


Backup Node:
============
This error is indeed expected, since Neutron does not configure any IP address in the qrouter namespace on the backup node; hence there is no link-local address:

# sudo ip netns exec qrouter-01efcb45-0589-4313-8b3a-49057524656c ifconfig qr-27493a15-58
qr-27493a15-58: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether fa:16:3e:cf:e6:10  txqueuelen 0  (Ethernet)
        RX packets 543  bytes 59138 (57.7 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1  bytes 110 (110.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

-> There is another problem here: radvd is running on the backup node to begin with. I will explain this in a bit (a quick check is shown below).
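
(To confirm that radvd is indeed running inside the backup router's namespace, something like the following can be used, assuming the installed iproute supports 'ip netns pids':)

# sudo ip netns pids qrouter-01efcb45-0589-4313-8b3a-49057524656c | xargs -r -I{} ps -o pid= -o args= -p {}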


Master Node:
============
This error is not expected, since we do have a link-local address configured:

# ip netns exec qrouter-01efcb45-0589-4313-8b3a-49057524656c ifconfig qr-27493a15-58
qr-27493a15-58: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::f816:3eff:fecf:e610  prefixlen 64  scopeid 0x20<link>
        inet6 2001:db3::1  prefixlen 64  scopeid 0x0<global>
        ether fa:16:3e:cf:e6:10  txqueuelen 0  (Ethernet)
        RX packets 105  bytes 9574 (9.3 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 602  bytes 65932 (64.3 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

-> What happens here is that radvd is spawned too early, causing it to complain with 'no linklocal address configured'.
   The address is configured afterwards, but radvd takes a long time to recover, if it recovers at all (a diagnostic is sketched below).
   As long as radvd is in that state, it won't send any router advertisements.
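
(As a diagnostic only, not a fix: one way to test this theory is to signal radvd once the link-local address is present, since radvd reloads its configuration on SIGHUP. The PID lookup below reuses the same assumed 'ip netns pids' approach:)

# sudo ip netns pids qrouter-01efcb45-0589-4313-8b3a-49057524656c | xargs -r -I{} ps -o pid= -o args= -p {} | grep radvd
# sudo kill -HUP <radvd_pid>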

To conclude, we have two issues here:
====================================
1. radvd is spawned on the backup node, which it should not be, since it enters an error state and will likely fail to recover upon a master/backup transition.
2. radvd is spawned too early on the master node, which leads to an error state and a lack of router advertisements.


There is a fix[1] for this by Sridhar Gaddam, starting from Kilo.
If you read the commit message[2] you will see that it addresses both issues mentioned above.

However, there was a major overhaul of the L3 agent code, so we cannot easily cherry-pick Sridhar's fix.
We would have to think about whether and how this issue can be fixed in OSP6.

[1] https://review.openstack.org/#/c/179392
[2] https://review.openstack.org/#/c/179392/1//COMMIT_MSG

Comment 8 Nir Magnezi 2016-04-14 14:58:12 UTC
Hey Toni,
The issue you describe in comment #6 is a separate bug in its own right.
Did you manage to verify the keepalived configuration?
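
(For reference, the keepalived configuration that Neutron generates for the router can be inspected on each network node; the path below assumes the default state_path of /var/lib/neutron:)

# sudo cat /var/lib/neutron/ha_confs/01efcb45-0589-4313-8b3a-49057524656c/keepalived.conf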

Comment 9 Toni Freger 2016-04-18 07:10:33 UTC
Keepalived configuration is correct.

Comment 11 Nir Magnezi 2016-05-02 19:03:24 UTC
After further investigation, it seems like the fix cannot be implemented in OSP6.
The reason is that the fix[1] for the two issues described in comment 7 requires the neutron l3-agent to be aware of the router's state, i.e. it must be able to determine whether it is currently in the MASTER or BACKUP state.

In OSP6 this information is external to the Neutron codebase; it is held exclusively by keepalived.

Starting from OSP7 (Kilo), thanks to substantial modifications and additions[2] to the l3-agent code base, the agent became aware of its current state, so we can act on that information and on events such as failover.
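
(For illustration only: this is not the actual implementation from [1] or [2], just a minimal sketch of the general idea using keepalived's notify hooks. The script path, state file, and vrrp_instance name are made up:)

vrrp_instance VR_1 {
    ...
    notify_master "/var/lib/neutron/ha_confs/<router_id>/notify.sh master"
    notify_backup "/var/lib/neutron/ha_confs/<router_id>/notify.sh backup"
}

#!/bin/bash
# notify.sh (hypothetical): record the new VRRP state so the agent can react to it,
# e.g. start radvd only when this router becomes master and stop it when it goes backup.
echo "$1" > /var/lib/neutron/ha_confs/<router_id>/state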

Therefore, in order to fix this bug in OSP6, we would need to implement a whole new feature.

Due to the above, closing as WONTFIX.

[1] https://review.openstack.org/#/c/179392
[2] http://specs.openstack.org/openstack/neutron-specs/specs/kilo/report-ha-router-master.html