Description of problem:

I am testing a dual stack VM (with a Trusty cloud-init image). It seems the order of Neutron subnet creation affects the VM network device setup. For example, I created the following Neutron network in this sequence:

1. neutron net-create net-64-2 --provider:network_type vlan --provider:physical_network vlan_net1 --provider:segmentation_id 2005
2. neutron subnet-create net-64-2 2005::/64 --name subnet_6 --enable_dhcp true --ipv6-address-mode slaac --ip_version 6
3. neutron subnet-create net-64-2 10.0.5.0/24 --name subnet_4 --enable_dhcp true

Note that I created the IPv6 subnet before the IPv4 subnet. If I boot an Ubuntu VM, it gets stuck setting up the network device:

cloud-init-nonet[23.36]: waiting 120 seconds for network device
 * Starting configure network device                               [ OK ]
 * Starting Bridge socket events into upstart                      [ OK ]
 * Stopping cold plug devices                                      [ OK ]
 * Stopping log initial device creation                            [ OK ]
 * Starting enable remaining boot-time encrypted block devices     [ OK ]
cloud-init-nonet[143.37]: gave up waiting for a network device.
Cloud-init v. 0.7.5 running 'init' at Wed, 17 Aug 2016 18:22:07 +0000. Up 143.63 seconds.
ci-info: +++++++++++++++++++++++Net device info+++++++++++++++++++++++
ci-info: +--------+------+-----------+-----------+-------------------+
ci-info: | Device |  Up  |  Address  |    Mask   |     Hw-Address    |
ci-info: +--------+------+-----------+-----------+-------------------+
ci-info: |   lo   | True | 127.0.0.1 | 255.0.0.0 |         .         |
ci-info: |  eth0  | True |     .     |     .     | fa:16:3e:ad:f0:8c |
ci-info: +--------+------+-----------+-----------+-------------------+
ci-info: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!Route info failed!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2016-08-17 18:22:07,769 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [0/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by <class 'socket.error'>: [Errno 101] Network is unreachable)]

When I recreate the network and reverse the subnet creation order (i.e. create the IPv4 subnet before the IPv6 subnet), the Ubuntu VM boots up with the correct dual stack interfaces.

Also, judging from the VM console log itself, the failure appears to be caused by dnsmasq not responding. I am currently using dnsmasq 2.66.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Create a Neutron network
2. Create an IPv6 subnet
3. Create an IPv4 subnet
4. Boot a VM on that network

Actual results:

Expected results:

Additional info:
Looks like the root cause is in dnsmasq. Below are the steps to identify the issue:

1. Create the network, the v6 subnet and then the v4 subnet (this introduces the problem)
2. Boot the VM; it will not get a leased IPv4 address
3. Kill the corresponding dnsmasq
4. Restart the dhcp agent (which will respawn dnsmasq)
5. Boot the VM again and it will come up fine.
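For reference, a rough sketch of steps 3-4 on the network node (the service name and the ps/kill approach are the usual defaults for an RDO-style install; <network-uuid> and <dnsmasq-pid> are placeholders to fill in):
-----------------------------------------
# find the dnsmasq instance serving the affected network
ps -ef | grep '[d]nsmasq' | grep <network-uuid>
# kill it, then restart the dhcp agent, which respawns dnsmasq with a freshly written config
kill <dnsmasq-pid>
systemctl restart neutron-dhcp-agent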
Assaf, any initial thoughts?
Ihar, can you take a look? It sounds like something you've handled in the past.
From the initial look at it, it sounds like a missed AMQP notification on particular subnet creation; OR neutron-dhcp-agent misconfiguring dnsmasq; OR dnsmasq 2.66 not handling the correct configuration files. We would need the following logs to isolate the issue:

- neutron-server (debug = True) logs when creating subnets;
- neutron-dhcp-agent (debug = True) logs when creating subnets;
- config files for dnsmasq serving the network.

Screenshots are really not enough to proceed with the issue.
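As a rough sketch of how to capture that data (file paths and service names are the usual RDO/OSP defaults; adjust for the deployment):
-----------------------------------------
# enable debug logging, then recreate the network/subnets while logs are captured
sed -i 's/^#\?\s*debug\s*=.*/debug = True/' /etc/neutron/neutron.conf
sed -i 's/^#\?\s*debug\s*=.*/debug = True/' /etc/neutron/dhcp_agent.ini
systemctl restart neutron-server neutron-dhcp-agent
# logs of interest:
#   /var/log/neutron/server.log
#   /var/log/neutron/dhcp-agent.log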
BTW please collect dnsmasq config files before killing it and restarting the agent, and after it's back and correctly serves requests. I suspect once we have both and compare them, it will become clear why dnsmasq is not serving the request as expected. I suspect dnsmasq was not respawned when the second subnet was created.
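A rough way to grab those files before and after the restart (the path below is the dhcp agent's default state_path on RDO/OSP; substitute the real network UUID):
-----------------------------------------
cp -a /var/lib/neutron/dhcp/<network-uuid> /tmp/dnsmasq-before
# ... kill dnsmasq, restart neutron-dhcp-agent, boot the VM again ...
cp -a /var/lib/neutron/dhcp/<network-uuid> /tmp/dnsmasq-after
diff -u /tmp/dnsmasq-before/opts /tmp/dnsmasq-after/opts   # dhcp options handed out
diff -u /tmp/dnsmasq-before/host /tmp/dnsmasq-after/host   # per-port host entries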
Tried to reproduce locally with a Delorean/Liberty + CentOS installation. It seems to work fine. There is a slight difference in my network options though: I don't use provider networking, but as far as the DHCP agent is concerned, that should not be relevant.

I am using the following versions of the relevant packages:

openstack-neutron-7.1.3-0.20160826005505.120a643.el7.centos.noarch
dnsmasq-2.66-14.el7_1.x86_64

What's the exact version of the openstack-neutron RPM in the failing environment?
Also of potential interest is whether they can reproduce the same issue with a tenant network (not provider):

neutron net-create net-64-2
neutron subnet-create net-64-2 2005::/64 --name subnet_6 --enable_dhcp true --ipv6-address-mode slaac --ip_version 6
neutron subnet-create net-64-2 10.0.5.0/24 --name subnet_4 --enable_dhcp true

Then start a Nova instance with the net-64-2 network, and check if you see the same issue.
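For example, roughly (the flavor and image names here are placeholders, and <net-64-2-uuid> is the UUID returned by net-create):
-----------------------------------------
nova boot --flavor m1.small --image trusty-cloudimg --nic net-id=<net-64-2-uuid> vm-dualstack
nova console-log vm-dualstack | grep -e 'ci-info' -e 'url_helper'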
Hi Ihar,

I tried your suggestion of using a non-provider network, but got the same errors:

2016-09-14 18:30:45,436 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [65/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by <class 'socket.error'>: [Errno 101] Network is unreachable)]
2016-09-14 18:30:50,467 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [70/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by <class 'socket.error'>: [Errno 101] Network is unreachable)]
Hi Ihar,

Below are the neutron package versions:

[admin@mcp1 ~]$ rpm -qa | grep neutron
openstack-neutron-linuxbridge-7.0.1-15_v7.0.6_fusion.noarch
openstack-neutron-lbaas-7.0.0-2.el7ost.noarch
python-neutron-7.0.1-15_v7.0.6_fusion.noarch
python-neutron-fwaas-7.0.0-1.el7ost.noarch
openstack-neutron-ml2-7.0.1-15_v7.0.6_fusion.noarch
openstack-neutron-7.0.1-15_v7.0.6_fusion.noarch
python-neutron-lbaas-7.0.0-2.el7ost.noarch
python-neutronclient-3.1.0-1.el7ost.noarch
openstack-neutron-common-7.0.1-15_v7.0.6_fusion.noarch
openstack-neutron-fwaas-7.0.0-1.el7ost.noarch
Now able to reproduce. What's observable from the reproduction: when 'enable_isolated_metadata' and 'dhcp_broadcast_reply' are enabled in /etc/neutron/dhcp_agent.ini and an external provider network is used, IPv6 subnets that are created before IPv4 subnets result in routes not being set, testing with cirros and trusty images.

I've gone through each of the options for /etc/neutron/dhcp_agent.ini that Cisco is using to uncover any behavior that was missed, after Kahou (who filed the BZ) observed that the ordering of the network interface addresses on the dhcp server appeared to indicate whether the behavior was occurring.

Example of the ordering of the addresses on the dhcp server's interface in the namespace. The first time the environment is stood up and everything is created, the ordering appears correct and everything functions as it should:

-----------------------------------------
# ip netns exec qdhcp-b0eccf39-f1b0-4c47-9bfb-f05cf12e83fc ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
119: tapd1754fca-22: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether fa:16:3e:0b:7a:28 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.130/24 brd 192.168.0.255 scope global tapd1754fca-22
       valid_lft forever preferred_lft forever
    inet 169.254.169.254/16 brd 169.254.255.255 scope global tapd1754fca-22
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe0b:7a28/64 scope link
       valid_lft forever preferred_lft forever

Any attempt to create another network with IPv6 and IPv4 results in the following ordering, which Kahou discovered initially:

-----------------------------------------
# ip netns exec qdhcp-b0eccf39-f1b0-4c47-9bfb-f05cf12e83fc ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
119: tapd1754fca-22: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether fa:16:3e:0b:7a:28 brd ff:ff:ff:ff:ff:ff
    inet 169.254.169.254/16 brd 169.254.255.255 scope global tapd1754fca-22
       valid_lft forever preferred_lft forever
    inet 192.168.0.130/24 brd 192.168.0.255 scope global tapd1754fca-22
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe0b:7a28/64 scope link
       valid_lft forever preferred_lft forever

Everything is 100% automated in my personal lab and I have the networks, subnets, glance images, security rules and instances all configured during provisioning.

The first observation I've noticed: in my environment, the first time on a newly provisioned environment, when I create the IPv6 network before the IPv4 one using the isolated metadata configuration, everything is ordered properly on the dhcp server concerning the network interfaces. I am not able to reproduce this other than by re-deploying a new environment and creating these resources for the first time. Every subsequent attempt results in the ordering we have observed, where the metadata inet address is placed at the top of the interface's address list, and the only way to see this behavior corrected is to completely reprovision the entire environment.

The second observation I've noticed: the interface ordering doesn't appear to matter on the dhcp server in a working configuration.
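One more data point that may help correlate the two states: the options dnsmasq actually hands out can be inspected directly (the path is the dhcp agent's default state_path; the UUID is the network from the namespace above):
-----------------------------------------
cat /var/lib/neutron/dhcp/b0eccf39-f1b0-4c47-9bfb-f05cf12e83fc/opts
# in a working setup the IPv4 tag is expected to carry a classless-static-route
# entry pointing 169.254.169.254/32 at the dhcp port address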
When dhcp_broadcast_reply is set to true, instances are not able to contact the metadata server over dhcp routes. I additionally attempted to reproduce without using host destination routes and next-hop configuration on my IPv4 network; however, with dhcp_broadcast_reply set to false, instances are always able to contact the metadata server over dhcp routes, and these routes are set even when the ordering of the network interfaces appears incorrect. This may be a known limitation of configuring metadata routes over dhcp with this dhcp_agent.ini option, but I wanted to provide as much detail as possible.

The below /etc/neutron/dhcp_agent.ini configuration works 100% of the time with 'dhcp_broadcast_reply = False'. It does not work 100% of the time when 'dhcp_broadcast_reply = True', reproducing the exact behavior we are seeing from Cisco. It seems that in my reproduction, broadcasting dhcp replies does not work when isolated metadata is enabled. I've relayed this to Cisco for confirmation in their environment and wanted to present this information here on the BZ as well. Since vlans are isolated broadcast domains, it's possible this option mirrors their issue, as I'm using a flat provider network and the problem is related to how the dhcp broadcast reply traffic is or is not translated in their environment.

I can spin up the reproduction environment if needed (it is external and I can provide ssh access; it takes about 20 minutes to provision, please let me know).

My reference /etc/neutron/dhcp_agent.ini:

[DEFAULT]
debug = False
resync_interval = 30
interface_driver = neutron.agent.linux.interface.OVSInterfaceDriver
dhcp_driver = neutron.agent.linux.dhcp.Dnsmasq
force_metadata = True
enable_isolated_metadata = True
enable_metadata_network = False
dhcp_domain = openstacklocal
dnsmasq_config_file = /etc/neutron/dnsmasq-neutron.conf
dhcp_broadcast_reply = False
dhcp_delete_namespaces = True
root_helper = sudo neutron-rootwrap /etc/neutron/rootwrap.conf
state_path = /var/lib/neutron

[AGENT]
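For completeness, a rough way to flip the one option that changes the outcome in my reproduction and re-test (crudini is commonly available on RDO/OSP nodes; editing the file by hand works just as well):
-----------------------------------------
crudini --set /etc/neutron/dhcp_agent.ini DEFAULT dhcp_broadcast_reply True
systemctl restart neutron-dhcp-agent
# re-run the VM boot test, then set it back to False and compare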
It appears that the following upstream issue is affecting Cisco in this BZ, concerning the functionality of the dhcp agent option "force_metadata" when using isolated subnets. The upstream tracker says this was resolved in the openstack/neutron 8.0.0.0b1 development milestone, and looking through git log and our changelogs I didn't see it present in OSP-8 (Liberty), as the backports from the upstream tracker are null-merges:

...
Merge tag '7.0.0'

This is a null-merge of the 7.0.0 release tag back into the master branch so that the 7.0.0 tag will appear in the git commit history of the master branch. It contains no actual changes to the master branch, regardless of how our code review system's UI represents it. Please ask in #openstack-infra if you have any questions, and otherwise try to merge this as quickly as possible to avoid later conflicts on the master branch.
...

Upstream: The option force_metadata = True is broken
-----------------------------------------
https://bugs.launchpad.net/neutron/+bug/1499406

Initially found here: https://bugzilla.redhat.com/show_bug.cgi?id=1256816#c9

Patch https://review.openstack.org/#/c/211963 introduces a regression with force_metadata = True. Using the option force_metadata = True will cause neutron to fail.

Upstream merged commit: https://review.openstack.org/#/c/230941/
-----------------------------------------
Change 230941 - Merged

The option force_metadata=True breaks the dhcp agent

Patch I5f6ee9788717c3d4f1f2e2a4b9734fdd8dd92b40 has an issue with force_metadata = True. Using the option force_metadata=True while enable_isolated_metadata=False (which is the default) will break the dhcp agent, because the variable subnet_to_interface_ip is being referenced before assignment.

Co-Authored-By: Jakub Libosvar <jlibosva>
Change-Id: I4e1d918e3a24dd483ee134021f587ae4520bf431
Closes-Bug: #1499406
(cherry picked from commit 473c338ff8c5526157d297b7e90d5e4f5e94cbb9)

- Aaron
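A quick, rough way to check which code a given node is running (the site-packages path is the usual el7 layout; this only locates the relevant code, it does not by itself prove the fix is present):
-----------------------------------------
rpm -q openstack-neutron openstack-neutron-common
grep -n 'subnet_to_interface_ip' /usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py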
Additionally reproduced the following upstream issue on Liberty using only the "force_metadata" option for neutron dhcp agents (attaching the upstream issue to this BZ):

force_metadata = True : qdhcp namespace has no interface with ip 169.254.169.254
-----------------------------------------
https://bugs.launchpad.net/neutron/+bug/1549793
This issue was fixed in the openstack/neutron 9.0.0.0b2 development milestone.

Reproduction info:
-----------------------------------------
[root@dualnets ~(keystone_admin)]# rpm -qa | grep openstack-nova-common
openstack-nova-common-12.0.4-8.el7ost.noarch

[root@dualnets ~(keystone_admin)]# cat /etc/neutron/dhcp_agent.ini | grep metadata | grep -v "#"
force_metadata = True
enable_isolated_metadata = False
enable_metadata_network = False

[root@dualnets ~(keystone_admin)]# ip netns exec qdhcp-ec3cf920-6ba0-457b-b447-074e0d943610 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
28: tapc09857cd-9d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether fa:16:3e:35:82:30 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.130/24 brd 192.168.0.255 scope global tapc09857cd-9d
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe35:8230/64 scope link
       valid_lft forever preferred_lft forever

We should have an interface in the qdhcp namespace with the 169.254.169.254 ip for metadata when "force_metadata = True" is set in /etc/neutron/dhcp_agent.ini.

Thanks,
Aaron
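As a rough sketch, a couple of quick checks for the expected isolated-metadata plumbing (the namespace UUID is the one from the reproduction above; the proxy process name is the Liberty-era default):
-----------------------------------------
ip netns exec qdhcp-ec3cf920-6ba0-457b-b447-074e0d943610 ip -4 addr show | grep 169.254.169.254
ps -ef | grep '[n]eutron-ns-metadata-proxy' | grep ec3cf920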
Verified on an OSP8 virt env deployed by OSPD8, based on RHEL 7.3.

$ rpm -qa | grep neutron
openstack-neutron-common-7.2.0-5.el7ost.noarch
openstack-neutron-7.2.0-5.el7ost.noarch
python-neutron-7.2.0-5.el7ost.noarch
python-neutronclient-3.1.0-2.el7ost.noarch
openstack-neutron-ml2-7.2.0-5.el7ost.noarch
openstack-neutron-openvswitch-7.2.0-5.el7ost.noarch

[root@vm1 ~]# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1400
        inet 10.0.5.3  netmask 255.255.255.0  broadcast 10.0.5.255
        inet6 2005::f816:3eff:fe9c:bbb4  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::f816:3eff:fe9c:bbb4  prefixlen 64  scopeid 0x20<link>
        ether fa:16:3e:9c:bb:b4  txqueuelen 1000  (Ethernet)
        RX packets 1038  bytes 108965 (106.4 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 513  bytes 48311 (47.1 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-2988.html
*** Bug 1351795 has been marked as a duplicate of this bug. ***