Bug 1367947 - Booting up VM with dual stack interface is affected by the order of Neutron subnet creation
Summary: Booting up VM with dual stack interface is affected by the order of Neutron subnet creation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 8.0 (Liberty)
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: async
Target Release: 8.0 (Liberty)
Assignee: Ihar Hrachyshka
QA Contact: Eran Kuris
URL:
Whiteboard: hot
Duplicates: 1351795
Depends On:
Blocks: 1194008 1394880
Reported: 2016-08-18 00:45 UTC by kahou
Modified: 2020-01-17 15:53 UTC (History)
CC List: 11 users

Fixed In Version: openstack-neutron-7.2.0-4.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1394880
Environment:
Last Closed: 2016-12-21 16:43:52 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1499406 0 None None None 2016-10-11 17:20:06 UTC
Launchpad 1549793 0 None None None 2016-10-11 18:58:02 UTC
Launchpad 1556991 0 None None None 2016-10-05 16:06:24 UTC
OpenStack gerrit 293237 0 None MERGED Update metadata proxy when subnet add/delete 2020-07-15 12:01:34 UTC
Red Hat Product Errata RHBA-2016:2988 0 normal SHIPPED_LIVE openstack-neutron bug fix advisory 2016-12-21 21:35:02 UTC

Description kahou 2016-08-18 00:45:09 UTC
Description of problem:

I am testing a dual stack VM (with a trusty cloud-init image). It appears that the order of Neutron subnet creation affects the VM's network device setup.

For example, I have the following Neutron network, created with the sequence of commands below:

1. neutron net-create net-64-2 --provider:network_type vlan --provider:physical_network vlan_net1 --provider:segmentation_id 2005

2. neutron subnet-create net-64-2 2005::/64 --name subnet_6 --enable_dhcp true --ipv6-address-mode slaac --ip_version 6

3. neutron subnet-create net-64-2 10.0.5.0/24 --name subnet_4 --enable_dhcp true


Note that I created the IPv6 subnet before the IPv4 subnet.

If I boot an Ubuntu VM, it gets stuck setting up the network device:

cloud-init-nonet[23.36]: waiting 120 seconds for network device
 * Starting configure network device[74G[ OK ]
 * Starting Bridge socket events into upstart[74G[ OK ]
 * Stopping cold plug devices[74G[ OK ]
 * Stopping log initial device creation[74G[ OK ]
 * Starting enable remaining boot-time encrypted block devices[74G[ OK ]
cloud-init-nonet[143.37]: gave up waiting for a network device.
Cloud-init v. 0.7.5 running 'init' at Wed, 17 Aug 2016 18:22:07 +0000. Up 143.63 seconds.
ci-info: +++++++++++++++++++++++Net device info+++++++++++++++++++++++
ci-info: +--------+------+-----------+-----------+-------------------+
ci-info: | Device |  Up  |  Address  |    Mask   |     Hw-Address    |
ci-info: +--------+------+-----------+-----------+-------------------+
ci-info: |   lo   | True | 127.0.0.1 | 255.0.0.0 |         .         |
ci-info: |  eth0  | True |     .     |     .     | fa:16:3e:ad:f0:8c |
ci-info: +--------+------+-----------+-----------+-------------------+
ci-info: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!Route info failed!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2016-08-17 18:22:07,769 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [0/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by <class 'socket.error'>: [Errno 101] Network is unreachable)]

When I recreate the network with the subnet creation order reversed (i.e. create the IPv4 subnet before the IPv6 subnet), the Ubuntu VM boots up with the correct dual-stack interfaces.

Also, judging by the VM console log itself, it looks like the failure is due to dnsmasq not responding.

Currently I am using dnsmasq 2.66.
 
Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create a Neutron network
2. Create an IPv6 subnet
3. Create an IPv4 subnet
4. Boot a VM on that network (see the command sketch below)
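
An end-to-end sketch of these steps, reusing the commands from the description above; the nova boot line is illustrative and its image, flavor and net-id values are placeholders:

# create the network, then the IPv6 subnet before the IPv4 subnet (the failing order)
neutron net-create net-64-2
neutron subnet-create net-64-2 2005::/64 --name subnet_6 --enable_dhcp true --ipv6-address-mode slaac --ip_version 6
neutron subnet-create net-64-2 10.0.5.0/24 --name subnet_4 --enable_dhcp true
# boot a VM attached to that network (image/flavor/net-id are placeholders)
nova boot --image <trusty-image> --flavor <flavor> --nic net-id=<net-64-2-uuid> vm-dualstack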

Actual results:


Expected results:


Additional info:

Comment 1 kahou 2016-08-18 14:50:44 UTC
It looks like the root cause is in dnsmasq.

Below are the steps to identify the issue:

1. Create the network, v6 subnet and then v4 subnet (it will introduce the problem)
2. Boot the VM and it will not get any leased IPv4 address
3. Kill the corresponding dnsmasq
4. Restart dhcp agent (which will respawn dnsmasq)
5. Boot the VM again and it will come up fine (a command-level sketch of steps 3-4 follows below)
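
A hedged command-level sketch of steps 3-4 on the network node (the systemd unit name is the usual RHEL-OSP one and the IDs are placeholders; they may differ in other installations):

# find the dnsmasq process serving the affected network and kill it
ps -ef | grep dnsmasq | grep <network-id>
kill <dnsmasq-pid>
# restart the DHCP agent so it respawns dnsmasq with freshly generated config files
systemctl restart neutron-dhcp-agent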

Comment 4 Charles Crouch 2016-08-18 15:01:34 UTC
Assaf, any initial thoughts?

Comment 5 Assaf Muller 2016-09-06 14:40:25 UTC
Ihar, can you take a look? It sounds like something you've handled in the past.

Comment 10 Ihar Hrachyshka 2016-09-07 16:04:02 UTC
From an initial look at it, this sounds like either a missed AMQP notification for a particular subnet creation; OR neutron-dhcp-agent misconfiguring dnsmasq; OR dnsmasq 2.66 not handling otherwise correct configuration files.

We would need the following logs to isolate the issue:
- neutron-server (debug = True) logs when creating subnets;
- neutron-dhcp-agent (debug = True) logs when creating subnets;
- config files for dnsmasq serving the network.

Screenshots are really not enough to proceed with the issue.

Comment 11 Ihar Hrachyshka 2016-09-07 16:05:44 UTC
BTW please collect dnsmasq config files before killing it and restarting the agent, and after it's back and correctly serves requests. I suspect once we have both and compare them, it will become clear why dnsmasq is not serving the request as expected. I suspect dnsmasq was not respawned when the second subnet was created.
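
A hedged sketch of how the debug logs and dnsmasq config files could be collected (openstack-config is from openstack-utils; the paths and unit names are the usual RHEL-OSP defaults, and <network-id> is a placeholder):

# turn on debug logging for neutron-server and the DHCP agent, then restart them
openstack-config --set /etc/neutron/neutron.conf DEFAULT debug True
openstack-config --set /etc/neutron/dhcp_agent.ini DEFAULT debug True
systemctl restart neutron-server neutron-dhcp-agent
# the dnsmasq config files the agent generates for a given network live here
ls /var/lib/neutron/dhcp/<network-id>/
cat /var/lib/neutron/dhcp/<network-id>/host /var/lib/neutron/dhcp/<network-id>/opts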

Comment 15 Ihar Hrachyshka 2016-09-12 14:04:04 UTC
Tried to reproduce locally with a Delorean/Liberty + CentOS installation. It seems to work fine. There is a slight difference in my network options though: I don't use provider networking, but as far as the DHCP agent is concerned, that should not be relevant.

I am using the following versions of relevant packages:
openstack-neutron-7.1.3-0.20160826005505.120a643.el7.centos.noarch
dnsmasq-2.66-14.el7_1.x86_64

What's the exact version of openstack-neutron RPM in the failing environment?

Comment 16 Ihar Hrachyshka 2016-09-12 14:05:29 UTC
Also of potential interest is whether they can reproduce the same issue with a tenant network (not provider):

neutron net-create net-64-2
neutron subnet-create net-64-2 2005::/64 --name subnet_6 --enable_dhcp true --ipv6-address-mode slaac --ip_version 6
neutron subnet-create net-64-2 10.0.5.0/24 --name subnet_4 --enable_dhcp true

Then start a Nova instance with the net-64-2 network, and check if you see the same issue.

Comment 19 Trinh Lee 2016-09-14 18:36:22 UTC
Hi Ihar,

I tried your suggestion by using the non-provider network, but got the same errors:

2016-09-14 18:30:45,436 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [65/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by <class 'socket.error'>: [Errno 101] Network is unreachable)]
2016-09-14 18:30:50,467 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [70/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by <class 'socket.error'>: [Errno 101] Network is unreachable)]

Comment 21 kahou 2016-09-20 21:54:53 UTC
Hi Ihar,

Below are the neutron package versions:

[admin@mcp1 ~]$ rpm -qa | grep neutron
openstack-neutron-linuxbridge-7.0.1-15_v7.0.6_fusion.noarch
openstack-neutron-lbaas-7.0.0-2.el7ost.noarch
python-neutron-7.0.1-15_v7.0.6_fusion.noarch
python-neutron-fwaas-7.0.0-1.el7ost.noarch
openstack-neutron-ml2-7.0.1-15_v7.0.6_fusion.noarch
openstack-neutron-7.0.1-15_v7.0.6_fusion.noarch
python-neutron-lbaas-7.0.0-2.el7ost.noarch
python-neutronclient-3.1.0-1.el7ost.noarch
openstack-neutron-common-7.0.1-15_v7.0.6_fusion.noarch
openstack-neutron-fwaas-7.0.0-1.el7ost.noarch

Comment 23 Aaron Thomas 2016-10-05 16:02:22 UTC
Now able to reproduce. What's observable from the reproduction:

When 'enable_isolated_metadata' and 'dhcp_broadcast_reply' are enabled in /etc/neutron/dhcp_agent.ini on an external provider network, IPv6 subnets created before IPv4 subnets result in routes not being set (tested with cirros and trusty images). I've gone through each of the /etc/neutron/dhcp_agent.ini options Cisco is using to uncover any behavior that was missed, after Kahou (who filed the BZ) observed that the ordering of the network interfaces on the DHCP server appeared to indicate whether the behavior was occurring.

Example of ordering of the network interfaces on the dhcp server for the namespace:

The first time the environment is stood up and everything is created, the ordering appears correct and everything functions as it should:
-----------------------------------------
# ip netns exec qdhcp-b0eccf39-f1b0-4c47-9bfb-f05cf12e83fc ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
119: tapd1754fca-22: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether fa:16:3e:0b:7a:28 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.130/24 brd 192.168.0.255 scope global tapd1754fca-22
       valid_lft forever preferred_lft forever
    inet 169.254.169.254/16 brd 169.254.255.255 scope global tapd1754fca-22
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe0b:7a28/64 scope link 
       valid_lft forever preferred_lft forever

Any attempt to create another network with IPv6 and IPv4 subnets results in the following ordering, which Kahou discovered initially:
-----------------------------------------
# ip netns exec qdhcp-b0eccf39-f1b0-4c47-9bfb-f05cf12e83fc ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
119: tapd1754fca-22: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether fa:16:3e:0b:7a:28 brd ff:ff:ff:ff:ff:ff
    inet 169.254.169.254/16 brd 169.254.255.255 scope global tapd1754fca-22
       valid_lft forever preferred_lft forever
    inet 192.168.0.130/24 brd 192.168.0.255 scope global tapd1754fca-22
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe0b:7a28/64 scope link 
       valid_lft forever preferred_lft forever


Everything is 100% automated in my personal lab: the networks, subnets, glance images, security rules and instances are all configured during provisioning. The first observation is that, on a newly provisioned environment, the first time I create the IPv6 network before the IPv4 one with the isolated metadata configuration, the network interfaces on the DHCP server are ordered properly. I am not able to reproduce that behavior other than by re-deploying a new environment and creating these resources for the first time. Every subsequent attempt results in the ordering we have observed, where the metadata inet address is placed at the top of the network interface list, and the only way to see this corrected is to completely reprovision the entire environment.

The second observation is that the interface ordering doesn't appear to matter on the DHCP server in a working configuration. When dhcp_broadcast_reply is set to true, instances are not able to contact the metadata server over DHCP-provided routes. I additionally attempted to reproduce without using host destination routes and next-hop configuration on my IPv4 network; however, with dhcp_broadcast_reply set to false, instances are always able to contact the metadata server over DHCP-provided routes, and these routes are set even when the ordering of the network interfaces appears incorrect. This may be a known limitation when configuring metadata routes over DHCP with this dhcp_agent.ini option, but I wanted to provide as much detail as possible.
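
For reference, when the agent pushes the metadata route over DHCP it ends up in the dnsmasq options file for the network; a hedged example of what to look for, roughly of this form (the path matches the namespace above, the gateway address is a placeholder, and the tag naming varies between releases):

# cat /var/lib/neutron/dhcp/b0eccf39-f1b0-4c47-9bfb-f05cf12e83fc/opts
tag:tag0,option:classless-static-route,169.254.169.254/32,192.168.0.130,0.0.0.0/0,<gateway-ip>
tag:tag0,249,169.254.169.254/32,192.168.0.130,0.0.0.0/0,<gateway-ip>
tag:tag0,option:router,<gateway-ip>

If the 169.254.169.254/32 entry is missing after the second subnet is added, that would be consistent with the suspicion in comment 11 that dnsmasq was not reconfigured when the second subnet was created.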

The below /etc/neutron/dhcp_agent.ini configuration works 100% of the time with 'dhcp_broadcast_reply = False'. It does not work 100% of the time when 'dhcp_broadcast_reply = True', reproducing the exact behavior we are seeing from Cisco. It seems that, in my reproduction, broadcasting DHCP replies does not work when isolated metadata is enabled. I've relayed this to Cisco for confirmation in their environment and wanted to present this information here on the BZ as well. Since VLANs are isolated broadcast domains, it's possible this option mirrors their issue; I'm using a flat provider network, and the problem may be related to how the DHCP broadcast reply traffic is or is not delivered in their environment. I can spin up the reproduction environment if needed, which is external, and provide ssh access (it takes about 20 minutes to provision; please let me know).


My reference /etc/neutron/dhcp_agent.ini:
[DEFAULT]
debug = False
resync_interval = 30
interface_driver =neutron.agent.linux.interface.OVSInterfaceDriver
dhcp_driver = neutron.agent.linux.dhcp.Dnsmasq
force_metadata = True
enable_isolated_metadata = True
enable_metadata_network = False
dhcp_domain = openstacklocal
dnsmasq_config_file =/etc/neutron/dnsmasq-neutron.conf
dhcp_broadcast_reply = False
dhcp_delete_namespaces = True
root_helper=sudo neutron-rootwrap /etc/neutron/rootwrap.conf
state_path=/var/lib/neutron
[AGENT]

Comment 25 Aaron Thomas 2016-10-11 17:19:35 UTC
It appears that the following upstream issue, concerning the functionality of the DHCP agent option "force_metadata" when using isolated subnets, is what is affecting Cisco in this BZ.

The upstream tracker indicates this was resolved in the openstack/neutron 8.0.0.0b1 development milestone. Looking through the git log and our changelogs, I didn't see the fix present in OSP 8 (Liberty), as the backports referenced from the upstream tracker are null-merges:

...
Merge tag '7.0.0' This is a null-merge of the 7.0.0 release tag back into the master branch so that the 7.0.0 tag will appear in the git commit history of the master branch. It contains no actual changes to the master branch, regardless of how our code review system's UI represents it. Please ask in #openstack-infra if you have any questions, and otherwise try to merge this as quickly as possible to avoid later conflicts on the master branch.
...

Upstream:

The option force_metadata = True is broken 
-----------------------------------------
https://bugs.launchpad.net/neutron/+bug/1499406

Initially found here: https://bugzilla.redhat.com/show_bug.cgi?id=1256816#c9
Patch https://review.openstack.org/#/c/211963 introduces a regression with force_metadata = True.

Using the option force_metadata = True can cause neutron to fail.


Upstream Merged commit:

https://review.openstack.org/#/c/230941/
-----------------------------------------

 Change 230941 - Merged
The option force_metadata=True breaks the dhcp agent

Patch I5f6ee9788717c3d4f1f2e2a4b9734fdd8dd92b40 has an issue with
force_metadata = True.

Using the option force_metadata=True while
enable_isolated_metadata=False (which is the default), will break the
dhcp agent because the variable subnet_to_interface_ip is being
referenced before assignment.

Co-Authored-By: Jakub Libosvar <jlibosva>
Change-Id: I4e1d918e3a24dd483ee134021f587ae4520bf431
Closes-Bug: #1499406
(cherry picked from commit 473c338ff8c5526157d297b7e90d5e4f5e94cbb9)
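
If this is the failure mode, the dhcp-agent log should contain the traceback the commit message refers to; a hedged way to check (the log path is the usual RHEL-OSP default):

# search the DHCP agent log for the UnboundLocalError described upstream
grep -B 5 "referenced before assignment" /var/log/neutron/dhcp-agent.log

A hit mentioning 'subnet_to_interface_ip' would strongly suggest the environment is hitting bug 1499406.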

- Aaron

Comment 26 Aaron Thomas 2016-10-11 18:57:31 UTC
I additionally reproduced the following upstream issue on Liberty using only the "force_metadata" option for the neutron DHCP agent (attaching the upstream issue to this BZ):

force_metadata = True : qdhcp namespace has no interface with ip 169.254.169.254
-----------------------------------------
https://bugs.launchpad.net/neutron/+bug/1549793

This issue was fixed in the openstack/neutron 9.0.0.0b2 development milestone.

Reproduction info:
-----------------------------------------
[root@dualnets ~(keystone_admin)]# rpm -qa|grep openstack-nova-common
openstack-nova-common-12.0.4-8.el7ost.noarch

[root@dualnets ~(keystone_admin)]#  cat /etc/neutron/dhcp_agent.ini | grep metadata | grep -v "#"
force_metadata = True
enable_isolated_metadata = False
enable_metadata_network = False

[root@dualnets ~(keystone_admin)]# ip netns exec qdhcp-ec3cf920-6ba0-457b-b447-074e0d943610 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
28: tapc09857cd-9d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether fa:16:3e:35:82:30 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.130/24 brd 192.168.0.255 scope global tapc09857cd-9d
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe35:8230/64 scope link 
       valid_lft forever preferred_lft forever

We should have an interface in the qdhcp namespace with the 169.254.169.254 metadata IP when "force_metadata = True" is set in /etc/neutron/dhcp_agent.ini.
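
A quick hedged check after toggling the option (the namespace name is a placeholder):

# prints the metadata address line when present; prints nothing when it is missing, as shown above
ip netns exec qdhcp-<network-id> ip addr show | grep 169.254.169.254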

Thanks,

Aaron

Comment 28 Eran Kuris 2016-11-21 08:51:25 UTC
Verified on an OSP 8 virt environment deployed by OSPD 8, based on RHEL 7.3.
$ rpm -qa |grep neutron
openstack-neutron-common-7.2.0-5.el7ost.noarch
openstack-neutron-7.2.0-5.el7ost.noarch
python-neutron-7.2.0-5.el7ost.noarch
python-neutronclient-3.1.0-2.el7ost.noarch
openstack-neutron-ml2-7.2.0-5.el7ost.noarch
openstack-neutron-openvswitch-7.2.0-5.el7ost.noarch


[root@vm1 ~]# ifconfig 
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1400
        inet 10.0.5.3  netmask 255.255.255.0  broadcast 10.0.5.255
        inet6 2005::f816:3eff:fe9c:bbb4  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::f816:3eff:fe9c:bbb4  prefixlen 64  scopeid 0x20<link>
        ether fa:16:3e:9c:bb:b4  txqueuelen 1000  (Ethernet)
        RX packets 1038  bytes 108965 (106.4 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 513  bytes 48311 (47.1 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Comment 30 errata-xmlrpc 2016-12-21 16:43:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2988.html

Comment 31 Jakub Libosvar 2017-01-18 14:10:49 UTC
*** Bug 1351795 has been marked as a duplicate of this bug. ***
