Bug 1367947
| Summary: | Booting up VM with dual stack interface is affected by the order of Neutron subnet creation | |||
|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | kahou <kalei> | |
| Component: | openstack-neutron | Assignee: | Ihar Hrachyshka <ihrachys> | |
| Status: | CLOSED ERRATA | QA Contact: | Eran Kuris <ekuris> | |
| Severity: | high | Docs Contact: | ||
| Priority: | high | |||
| Version: | 8.0 (Liberty) | CC: | aathomas, amuller, charcrou, chrisw, ihrachys, jdonohue, kalei, nyechiel, skulkarn, srevivo, trinhlee | |
| Target Milestone: | async | Keywords: | Triaged, ZStream | |
| Target Release: | 8.0 (Liberty) | |||
| Hardware: | All | |||
| OS: | Linux | |||
| Whiteboard: | hot | |||
| Fixed In Version: | openstack-neutron-7.2.0-4.el7ost | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1394880 (view as bug list) | Environment: | ||
| Last Closed: | 2016-12-21 16:43:52 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1194008, 1394880 | |||
|
Description
kahou
2016-08-18 00:45:09 UTC
Looks like the root cause is due to dnsmasq. Below are the steps to identify the issue:
1. Create the network, the v6 subnet and then the v4 subnet (it will introduce the problem)
2. Boot the VM and it will not get any leased IPv4 address
3. Kill the corresponding dnsmasq
4. Restart the dhcp agent (which will respawn dnsmasq)
5. Boot the VM again and it will come up fine.

Assaf, any initial thoughts?

Ihar, can you take a look? It sounds like something you've handled in the past.

From the initial look at it, it sounds like a missed AMQP notification on a particular subnet creation; OR neutron-dhcp-agent misconfiguring dnsmasq; OR dnsmasq 2.66 not handling the correct configuration files. We would need the following logs to isolate the issue:
- neutron-server (debug = True) logs when creating subnets;
- neutron-dhcp-agent (debug = True) logs when creating subnets;
- config files for dnsmasq serving the network.
Screenshots are really not enough to proceed with the issue.

BTW please collect the dnsmasq config files before killing it and restarting the agent, and again after it's back and correctly serving requests. I suspect once we have both and compare them, it will become clear why dnsmasq is not serving the request as expected. I suspect dnsmasq was not respawned when the second subnet was created.

Tried to reproduce locally with a Delorean/Liberty + CentOS installation. It seems to work fine. There is a slight difference in my network options though: I don't use provider networking, but as far as the DHCP agent is concerned, that should not be relevant. I am using the following versions of the relevant packages:
openstack-neutron-7.1.3-0.20160826005505.120a643.el7.centos.noarch
dnsmasq-2.66-14.el7_1.x86_64

What's the exact version of the openstack-neutron RPM in the failing environment? Also of potential interest is whether they can reproduce the same issue with a tenant network (not provider):
neutron net-create net-64-2
neutron subnet-create net-64-2 2005::/64 --name subnet_6 --enable_dhcp true --ipv6-address-mode slaac --ip_version 6
neutron subnet-create net-64-2 10.0.5.0/24 --name subnet_4 --enable_dhcp true
Then start a Nova instance with the net-64-2 network, and check if you see the same issue.
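For reference, a minimal sketch of how the requested dnsmasq state could be collected on the DHCP agent host, assuming the default Neutron state_path of /var/lib/neutron; the network UUID below is only a placeholder and must be replaced with the real one:
-----------------------------------------
# placeholder network UUID
NET_ID=b0eccf39-f1b0-4c47-9bfb-f05cf12e83fc
# dnsmasq configuration files written by the DHCP agent for this network
ls -l /var/lib/neutron/dhcp/$NET_ID/
cat /var/lib/neutron/dhcp/$NET_ID/opts    # DHCP options handed out, per subnet tag
cat /var/lib/neutron/dhcp/$NET_ID/host    # static host/MAC/IP mappings
cat /var/lib/neutron/dhcp/$NET_ID/leases  # current leases
# command line of the dnsmasq instance serving this network
ps -ef | grep dnsmasq | grep "$NET_ID" | grep -v grep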
Hi Ihar, I tried your suggestion of using a non-provider network, but got the same errors:
2016-09-14 18:30:45,436 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [65/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by <class 'socket.error'>: [Errno 101] Network is unreachable)]
2016-09-14 18:30:50,467 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [70/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by <class 'socket.error'>: [Errno 101] Network is unreachable)]

Hi Ihar, below are the Neutron package versions:
[admin@mcp1 ~]$ rpm -qa | grep neutron
openstack-neutron-linuxbridge-7.0.1-15_v7.0.6_fusion.noarch
openstack-neutron-lbaas-7.0.0-2.el7ost.noarch
python-neutron-7.0.1-15_v7.0.6_fusion.noarch
python-neutron-fwaas-7.0.0-1.el7ost.noarch
openstack-neutron-ml2-7.0.1-15_v7.0.6_fusion.noarch
openstack-neutron-7.0.1-15_v7.0.6_fusion.noarch
python-neutron-lbaas-7.0.0-2.el7ost.noarch
python-neutronclient-3.1.0-1.el7ost.noarch
openstack-neutron-common-7.0.1-15_v7.0.6_fusion.noarch
openstack-neutron-fwaas-7.0.0-1.el7ost.noarch

Now able to reproduce. What's observable from the reproduction:
When 'enable_isolated_metadata' and 'dhcp_broadcast_reply' are enabled in /etc/neutron/dhcp_agent.ini on an external provider network, creating the IPv6 subnet before the IPv4 subnet results in routes not being set (tested with cirros and trusty images). I've gone through each of the /etc/neutron/dhcp_agent.ini options that Cisco is using to uncover any behavior that was missed, after Kahou (who filed the BZ) observed that the ordering of the network interfaces on the DHCP server appeared to indicate whether the behavior was occurring.
Example of the ordering of the network interfaces in the DHCP server's namespace:
The first time the environment is stood up and everything is created, the ordering appears correct and everything functions as it should:
-----------------------------------------
# ip netns exec qdhcp-b0eccf39-f1b0-4c47-9bfb-f05cf12e83fc ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
119: tapd1754fca-22: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
link/ether fa:16:3e:0b:7a:28 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.130/24 brd 192.168.0.255 scope global tapd1754fca-22
valid_lft forever preferred_lft forever
inet 169.254.169.254/16 brd 169.254.255.255 scope global tapd1754fca-22
valid_lft forever preferred_lft forever
inet6 fe80::f816:3eff:fe0b:7a28/64 scope link
valid_lft forever preferred_lft forever
Any attempt to create another network with IPv6 and IPv4 subnets results in the following ordering, which Kahou discovered initially:
-----------------------------------------
# ip netns exec qdhcp-b0eccf39-f1b0-4c47-9bfb-f05cf12e83fc ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
119: tapd1754fca-22: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
link/ether fa:16:3e:0b:7a:28 brd ff:ff:ff:ff:ff:ff
inet 169.254.169.254/16 brd 169.254.255.255 scope global tapd1754fca-22
valid_lft forever preferred_lft forever
inet 192.168.0.130/24 brd 192.168.0.255 scope global tapd1754fca-22
valid_lft forever preferred_lft forever
inet6 fe80::f816:3eff:fe0b:7a28/64 scope link
valid_lft forever preferred_lft forever
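One way to correlate the two states above (a sketch only, assuming the default Neutron state_path of /var/lib/neutron) is to compare the DHCP options dnsmasq hands out, in particular any classless static route carrying 169.254.169.254, against the addresses currently configured on the dnsmasq port:
-----------------------------------------
NET_ID=b0eccf39-f1b0-4c47-9bfb-f05cf12e83fc
# DHCP options generated by the agent; look for the metadata route entry
grep -n "169.254.169.254\|classless-static-route" /var/lib/neutron/dhcp/$NET_ID/opts
# IPv4 addresses on the dnsmasq port, in the order the kernel reports them
ip netns exec qdhcp-$NET_ID ip -4 addr show dev tapd1754fca-22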
Everything is 100% automated in my personal lab: the networks, subnets, glance images, security rules and instances are all configured during provisioning. The first observation is that on a newly provisioned environment, the first time I create the IPv6 network before the IPv4 one using the isolated metadata configuration, everything is ordered properly on the DHCP server as far as the network interfaces are concerned. I am not able to reproduce this behavior other than by re-deploying a new environment and creating these resources for the first time. Every subsequent attempt results in the ordering we have observed, where the metadata inet is placed at the top of the network interface list, and the only way to see this behavior corrected is to completely reprovision the entire environment.
The second observation is that the interface ordering doesn't appear to matter on the DHCP server in a working configuration. When dhcp_broadcast_reply is set to true, instances are not able to contact the metadata server over DHCP-provided routes. I additionally attempted to reproduce without using host destination routes and next-hop configuration with my IPv4 network; however, with dhcp_broadcast_reply set to false, instances are always able to contact the metadata server over DHCP-provided routes, and these routes are set even when the ordering of the network interfaces appears incorrect. This may be a known limitation when configuring metadata routes over DHCP with this dhcp_agent.ini option, but I wanted to provide as much detail as possible.
The /etc/neutron/dhcp_agent.ini configuration below works 100% of the time with 'dhcp_broadcast_reply = False'. It does not work 100% of the time when 'dhcp_broadcast_reply = True', reproducing the exact behavior we are seeing from Cisco. It seems that in my reproduction, broadcasting DHCP replies does not work when isolated metadata is enabled. I've relayed this to Cisco for confirmation in their environment and wanted to present this information here on the BZ as well. Since VLANs are isolated broadcast domains, it's possible this option mirrors their issue: I'm using a flat provider network, and the problem may be related to how the DHCP broadcast reply traffic is or is not carried in their environment. I can spin up the reproduction environment, which is external, and provide ssh access if needed (it takes about 20 minutes to provision; please let me know).
My reference /etc/neutron/dhcp_agent.ini:
[DEFAULT]
debug = False
resync_interval = 30
interface_driver =neutron.agent.linux.interface.OVSInterfaceDriver
dhcp_driver = neutron.agent.linux.dhcp.Dnsmasq
force_metadata = True
enable_isolated_metadata = True
enable_metadata_network = False
dhcp_domain = openstacklocal
dnsmasq_config_file =/etc/neutron/dnsmasq-neutron.conf
dhcp_broadcast_reply = False
dhcp_delete_namespaces = True
root_helper=sudo neutron-rootwrap /etc/neutron/rootwrap.conf
state_path=/var/lib/neutron
[AGENT]
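Note that changes to dhcp_agent.ini only take effect after the DHCP agent is restarted, which respawns dnsmasq. A quick sanity check, sketched here under the assumption of a systemd-based OSP node, is to restart the agent and inspect the dnsmasq command line for flags derived from the config (e.g. one would expect a --dhcp-broadcast flag only when dhcp_broadcast_reply = True):
-----------------------------------------
systemctl restart neutron-dhcp-agent
# wait for the agent to respawn dnsmasq, then inspect its command line
ps -ef | grep dnsmasq | grep -v grep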
It appears that the following upstream issue is affecting Cisco in this BZ, concerning the functionality of the DHCP agent option "force_metadata" when using isolated subnets. The upstream tracker relays that this was resolved in the openstack/neutron 8.0.0.0b1 development milestone, but looking through git log and our changelogs I didn't see it present in OSP-8 (Liberty), as the backports from the upstream tracker are null-merges:
...
Merge tag '7.0.0'
This is a null-merge of the 7.0.0 release tag back into the master branch so that the 7.0.0 tag will appear in the git commit history of the master branch. It contains no actual changes to the master branch, regardless of how our code review system's UI represents it. Please ask in #openstack-infra if you have any questions, and otherwise try to merge this as quickly as possible to avoid later conflicts on the master branch.
...

Upstream: The option force_metadata = True is broken
-----------------------------------------
https://bugs.launchpad.net/neutron/+bug/1499406
Initially found here: https://bugzilla.redhat.com/show_bug.cgi?id=1256816#c9
Patch https://review.openstack.org/#/c/211963 introduces a regression with force_metadata = True. Using the option force_metadata = True will cause neutron to fail.
Upstream merged commit: https://review.openstack.org/#/c/230941/
-----------------------------------------
Change 230941 - Merged
The option force_metadata=True breaks the dhcp agent
Patch I5f6ee9788717c3d4f1f2e2a4b9734fdd8dd92b40 has an issue with force_metadata = True. Using the option force_metadata=True while enable_isolated_metadata=False (which is the default) will break the dhcp agent because the variable subnet_to_interface_ip is referenced before assignment.
Co-Authored-By: Jakub Libosvar <jlibosva>
Change-Id: I4e1d918e3a24dd483ee134021f587ae4520bf431
Closes-Bug: #1499406
(cherry picked from commit 473c338ff8c5526157d297b7e90d5e4f5e94cbb9)
- Aaron

Additionally, I reproduced the following upstream issue on Liberty using only the "force_metadata" option for the neutron DHCP agent (attaching the upstream issue to this BZ):
force_metadata = True : qdhcp namespace has no interface with ip 169.254.169.254
-----------------------------------------
https://bugs.launchpad.net/neutron/+bug/1549793
This issue was fixed in the openstack/neutron 9.0.0.0b2 development milestone.
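A rough way to check whether such a backport is present in an installed build, sketched here under the assumption that the packagers mention the Launchpad bug, Gerrit change number or option name in the RPM changelog (which is not guaranteed), is:
-----------------------------------------
rpm -q --changelog openstack-neutron | grep -iE "1499406|230941|force_metadata"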
Reproduction info:
-----------------------------------------
[root@dualnets ~(keystone_admin)]# rpm -qa|grep openstack-nova-common
openstack-nova-common-12.0.4-8.el7ost.noarch
[root@dualnets ~(keystone_admin)]# cat /etc/neutron/dhcp_agent.ini | grep metadata | grep -v "#"
force_metadata = True
enable_isolated_metadata = False
enable_metadata_network = False
[root@dualnets ~(keystone_admin)]# ip netns exec qdhcp-ec3cf920-6ba0-457b-b447-074e0d943610 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
28: tapc09857cd-9d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
link/ether fa:16:3e:35:82:30 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.130/24 brd 192.168.0.255 scope global tapc09857cd-9d
valid_lft forever preferred_lft forever
inet6 fe80::f816:3eff:fe35:8230/64 scope link
valid_lft forever preferred_lft forever

We should have an interface in the qdhcp namespace with the 169.254.169.254 IP for metadata when "force_metadata = True" is set in /etc/neutron/dhcp_agent.ini.
Thanks, Aaron

Verified on an OSP8 virt environment deployed by OSPD8, based on RHEL 7.3:
$ rpm -qa |grep neutron
openstack-neutron-common-7.2.0-5.el7ost.noarch
openstack-neutron-7.2.0-5.el7ost.noarch
python-neutron-7.2.0-5.el7ost.noarch
python-neutronclient-3.1.0-2.el7ost.noarch
openstack-neutron-ml2-7.2.0-5.el7ost.noarch
openstack-neutron-openvswitch-7.2.0-5.el7ost.noarch
[root@vm1 ~]# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1400
inet 10.0.5.3 netmask 255.255.255.0 broadcast 10.0.5.255
inet6 2005::f816:3eff:fe9c:bbb4 prefixlen 64 scopeid 0x0<global>
inet6 fe80::f816:3eff:fe9c:bbb4 prefixlen 64 scopeid 0x20<link>
ether fa:16:3e:9c:bb:b4 txqueuelen 1000 (Ethernet)
RX packets 1038 bytes 108965 (106.4 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 513 bytes 48311 (47.1 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
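For completeness, the metadata path can also be exercised from inside the verified instance; a minimal check (the URL is the same metadata endpoint referenced in the earlier cloud-init logs):
-----------------------------------------
# from inside the instance: confirm a route towards the metadata address exists
ip route
# and that the metadata service answers with the instance id
curl -s http://169.254.169.254/2009-04-04/meta-data/instance-id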
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHBA-2016-2988.html

*** Bug 1351795 has been marked as a duplicate of this bug. ***