Description of problem:
Launching nova instances on a neutron subnet with DHCP enabled results in the instances coming up without an IP address assigned to their single interface. A subsequent ifup <interface> fixes the problem.

Version-Release number of selected component (if applicable):
RHEL OpenStack Havana, namely:
openstack-neutron-2013.2-5
openstack-nova-compute-2013.2-4

How reproducible:
Almost always

Steps to Reproduce:
1. Packstack installation of one physical controller and two physical nova hosts, with neutron networking
2. Upload the cirros 0.3.0 x86_64 image into glance
3. Launch an instance with the m1.tiny flavor and the cirros image
4. Log in to the new instance via console; ifconfig eth0 shows no IP assigned
5. ifup eth0 to get an IP, after which pinging the gateway works

Actual results:
ifconfig eth0 shows no IP address assigned to the interface

Expected results:
Upon booting, the nova instance should acquire an IP address for its internal interface from the DHCP agent

Additional info:
This issue happens when the new VM is the first VM on the virtual network. Neutron's dhcp-agent does not start a dnsmasq daemon for a network that has no DHCP clients. When the first VM appears, it can sometimes boot faster than the dnsmasq daemon is started, so nobody answers its DHCP requests. This looks like a race condition to me.
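The suspected race can be sketched outside of OpenStack. In this hypothetical stand-in, a file plays the role of a listening dnsmasq: the agent only spawns it after the first port appears, so the guest's first DHCP attempt fires into the void, while a later retry (the manual ifup eth0) finds the server up and succeeds.

```shell
#!/bin/sh
# Toy model of the race, NOT real neutron behavior: a stamp file stands in
# for a running dnsmasq instance on the tenant network.
STAMP=$(mktemp -u)

# dhcp-agent spawns dnsmasq only after the first port shows up (delayed here)
( sleep 1; touch "$STAMP" ) &

try_lease() {
    # stands in for the guest sending a DHCP DISCOVER and waiting briefly
    if [ -f "$STAMP" ]; then
        echo "lease acquired"
    else
        echo "no lease"
    fi
}

try_lease      # guest boots immediately and loses the race -> "no lease"
sleep 2
try_lease      # manual ifup eth0 retries later -> "lease acquired"
rm -f "$STAMP"
```

The fix direction implied by this model is to either start dnsmasq eagerly when the network gains its first port, or rely on the guest's DHCP client retrying long enough to outlast the agent's startup delay.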
Richard, if I remember correctly, there are issues with the cirros 0.3.0 image re: DHCP on RHEL. Please try the 0.3.1 image ( http://download.cirros-cloud.net/0.3.1/cirros-0.3.1-x86_64-disk.img ) and report back. If that doesn't fix the issue, please describe your setup (vlan, gre, etc.) and post your packstack answer file. Thanks.
Created attachment 831838 [details]
Packstack answer file

This is the original answer file used to create this OpenStack cloud.
I downloaded the cirros 0.3.1 image from the link provided and retried the process above, with essentially the same results: create a new private network, launch 20 m1.tiny instances, log in on the instance console; about 50% of the nodes had no IP assigned to their eth0 interface. I deleted the instances and tried again with only 10 instances, and still a large portion of the nodes got no assigned IPs. neutron-dhcp-agent reports DHCPOFFERs to all the instances. Attaching our packstack answer file, which describes a GRE setup with one controller and three nova hosts.
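Since the dhcp-agent claims it sent DHCPOFFERs, it may help to confirm on the controller that dnsmasq is actually up and that the offers reach the wire before blaming the guest image. A rough diagnostic checklist, assuming the default qdhcp-<network-id> namespace naming used by neutron-dhcp-agent (the <network-id> below is a placeholder for the tenant network's UUID):

```shell
# list the DHCP namespaces neutron-dhcp-agent has created
ip netns list | grep qdhcp

# check that a dnsmasq process exists for the tenant network (placeholder UUID)
ps aux | grep "dnsmasq.*<network-id>"

# watch DHCP traffic inside the namespace while a guest boots
ip netns exec qdhcp-<network-id> tcpdump -ln -i any port 67 or port 68
```

If the namespace or the dnsmasq process is missing while an instance is booting, that would point at the agent-side race rather than at the cirros image.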
I'm wondering if the problem you are reporting with the 0.3.1 image is related to another bug (https://bugzilla.redhat.com/show_bug.cgi?id=1034822), also concerning dhcp agent reliability. Can you replicate the reported problem when booting a single VM? What is the load on the controller node, and more importantly, what is the CPU usage of the neutron service, when you are attempting to boot multiple VMs?
The correct link to the other bug is https://bugzilla.redhat.com/show_bug.cgi?id=1023818
Using cirros 0.3.1 and a packstack-based install of 2013.2.1-4 with GRE and two compute hosts, every time I boot lots of instances, everything ends up getting a DHCP address. I just haven't been able to reproduce this. Is it possible that there is an interaction with VLANs in this setup as well, similar to what was just reported in https://bugzilla.redhat.com/show_bug.cgi?id=1064109? I don't see anything in the answer file provided that would point me to that, but maybe some manual configuration was done later? Or can you try to reproduce with 2013.2.1-4 and cirros 0.3.1? Thanks.
The original problem occurred at a client site which I do not have access to. The tenant network was attached manually to a VLAN-tagged bonded interface, although we were using GRE across it. At one point during that engagement the problem disappeared, and I was unable to reproduce it. The circumstances were not unlike bug 1064109; I cannot confirm that it was the same issue, but the client has not reported the problem since.
Going ahead and closing this as WORKSFORME. There is at least a decent chance that this is related to https://bugzilla.redhat.com/show_bug.cgi?id=1064109, so we can continue to track it there, since that bug has a lot of information. Thanks!