Created attachment 1016479 [details] setup-neutron

Description of problem:
Using latest Juno RDO packaging, nova rewrites the IP addresses to empty in the info cache.

[sdake@bigiron kolla]$ nova list
+--------------------------------------+-------------------------------------------------------+--------+------------+-------------+---------------------------------------+
| ID                                   | Name                                                  | Status | Task State | Power State | Networks                              |
+--------------------------------------+-------------------------------------------------------+--------+------------+-------------+---------------------------------------+
| 94e84142-b7f4-4777-a20b-b675c7f2a168 | st-25npk73mrk6-0-agkvww4nm3ao-steak_node-wmavxtw6bgae | ACTIVE | -          | Running     | demo-net=10.10.10.14, 192.168.100.170 |
| d341d247-4505-4f14-bfc5-69cd08755ffd | st-25npk73mrk6-1-ktv5fvdjyg5z-steak_node-zq73kaywdjhi | ACTIVE | -          | Running     |                                       |
| 07dc65cc-9523-40ec-816a-f0f0f9a39c2e | st-25npk73mrk6-2-6kjozrh2swfl-steak_node-6hy7vthojq4o | ACTIVE | -          | Running     | demo-net=10.10.10.10, 192.168.100.167 |
| a0783b2d-23b0-4872-a6ea-1a6c3a77aca4 | st-25npk73mrk6-3-7vpyvhq6cgjx-steak_node-wuxk4wus55ce | ACTIVE | -          | Running     | demo-net=10.10.10.3, 192.168.100.151  |
| 51887fa4-8701-42f6-a6b6-e0c8a96085fe | st-25npk73mrk6-4-2xw2lyyiikdx-steak_node-tianxosr5bcp | ACTIVE | -          | Running     |                                       |
| 06616508-2903-4a81-b377-16c2c6c2e274 | st-25npk73mrk6-5-hnes7dhtr2aj-steak_node-7oubcgeq7sdg | ACTIVE | -          | Running     |                                       |
| f6a870fe-e6ab-499d-83dd-1c2bc94e84b9 | st-25npk73mrk6-6-ukgua5s6hbca-steak_node-5xiich6yuf7y | ACTIVE | -          | Running     |                                       |
| 7f3653f9-90b3-4b91-9625-a852fe19b003 | st-25npk73mrk6-7-pk7am5rgoifw-steak_node-itckair4xt7d | ACTIVE | -          | Running     |                                       |
| 5337a083-6cd6-4362-976e-26a75088503d | st-25npk73mrk6-8-rd65frc7o4cv-steak_node-ejvipynq4cko | ACTIVE | -          | Running     | demo-net=10.10.10.7, 192.168.100.157  |
| eeed70f4-6ebc-4d7b-af22-48a5ae59b26d | st-25npk73mrk6-9-afewmrk7blr6-steak_node-43vejemufin6 | ACTIVE | -          | Running     | demo-net=10.10.10.20, 192.168.100.156 |
| 861b0b4b-4b5a-41ab-b824-3bbf2d181772 | st-5npk73mrk6-10-ttrt4rznqux5-steak_node-jmu7mkvdyo72 | ACTIVE | -          | Running     |                                       |
| dc7ea542-4d47-4344-a11a-10c5377858b7 | st-5npk73mrk6-11-hrbzs7uwrijx-steak_node-wt6kd6elxyda | ACTIVE | -          | Running     | demo-net=10.10.10.22, 192.168.100.162 |
| b816e9f1-28ff-47de-830e-356138452b6b | st-5npk73mrk6-12-5ptrj7dpd3cg-steak_node-gc66oowa2kxd | ACTIVE | -          | Running     | demo-net=10.10.10.4, 192.168.100.153  |
| 91e2ec9b-df43-462c-bc00-64c28253e2f1 | st-5npk73mrk6-13-myxwolw4txkt-steak_node-lr6a3lloj7oe | ACTIVE | -          | Running     | demo-net=10.10.10.18, 192.168.100.158 |
| bd4b8b61-899e-4b93-bd42-c3cc7db8f29b | st-5npk73mrk6-14-w77lpejbkgc6-steak_node-kp2c2dnzrugt | ACTIVE | -          | Running     |                                       |
| 93c2f9c7-4d08-49fa-92ed-e3a70e4ebd81 | st-5npk73mrk6-15-r6ijb5pyfq32-steak_node-xm5vkk5f5xsd | ACTIVE | -          | Running     | demo-net=10.10.10.21, 192.168.100.152 |
| 8b5c5531-3ed5-4409-a04a-a39d3b7f7ffe | st-5npk73mrk6-16-5jkskqbsghwh-steak_node-snbjnm33jlhc | ACTIVE | -          | Running     | demo-net=10.10.10.5, 192.168.100.154  |
| 9345ec7c-4179-4031-ac80-d39010bf5901 | st-5npk73mrk6-17-bkrhtamjxdpo-steak_node-ln4v7r36rmby | ACTIVE | -          | Running     |                                       |
| c6bea271-a174-4bd9-b103-cb276eef37fb | st-5npk73mrk6-18-dzv34hs2qafj-steak_node-u5w63s73hnms | ACTIVE | -          | Running     |                                       |
| 99a6d520-1c48-4543-8275-876b808c2424 | st-5npk73mrk6-19-zclmxdkoaymu-steak_node-ajyivkxjzxmg | ACTIVE | -          | Running     |                                       |
+--------------------------------------+-------------------------------------------------------+--------+------------+-------------+---------------------------------------+

Inspection with mysql of the nova database info cache shows the UUID of the VM is stored with an empty NetworkModel.
I added some print debug logging, but after 5 days I am only a bit closer to understanding the root cause of the problem (but boy, do I understand how IPs make their way through the nova system now, and it's not pretty). The VMs that don't have IPs associated with them were still able to obtain DHCP/metadata, so the system clearly knows which IPs belong to which MAC addresses.

Version-Release number of selected component (if applicable):
[sdake@bigiron kolla]$ sudo docker exec ab rpm -qa | grep nova
python-nova-2014.2.2-1.el7.noarch
python-novaclient-2.20.0-1.el7.centos.noarch
openstack-nova-common-2014.2.2-1.el7.noarch
openstack-nova-compute-2014.2.2-1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
Using this repository: http://github.com/stackforge/kolla
1. Read and follow the kolla developer-env.md documentation (make sure to follow the directions!)
2. Run kolla/pull -c
3. Run kolla/start
4. Run the attached setup-neutron script
5. Run the attached launch script (CC-BY, enjoy :)
6. Run nova list after 3-4 minutes

Actual results:
Only some VMs have IP addresses and floats listed in nova list.

Expected results:
All VMs should have IP addresses and floats listed in nova list.

Additional info:
I have attached my debug logs and the debug logging changes I used to root-cause the problem.
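To illustrate what the mysql inspection above looks like, here is a minimal sketch using an in-memory sqlite3 mock of nova's instance_info_caches table. The table and column names match the nova database; the row contents and JSON shape are simplified, hypothetical stand-ins for nova's real serialized NetworkModel:

```python
import json
import sqlite3

# Mock of nova's instance_info_caches table (simplified Juno-era schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instance_info_caches "
             "(instance_uuid TEXT, network_info TEXT)")

# Hypothetical rows: one healthy cache, one rewritten to empty.
healthy = [{"network": {"label": "demo-net"},
            "ips": [{"address": "10.10.10.10"}]}]
conn.execute("INSERT INTO instance_info_caches VALUES (?, ?)",
             ("07dc65cc-9523-40ec-816a-f0f0f9a39c2e", json.dumps(healthy)))
conn.execute("INSERT INTO instance_info_caches VALUES (?, ?)",
             ("99a6d520-1c48-4543-8275-876b808c2424", json.dumps([])))

# The inspection: an instance whose cached NetworkModel has been emptied
# stores network_info == "[]", which nova list renders as a blank column.
for uuid, blob in conn.execute(
        "SELECT instance_uuid, network_info FROM instance_info_caches"):
    print(uuid, "EMPTY" if not json.loads(blob) else "OK")
```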
Created attachment 1016480 [details] launch
Created attachment 1016492 [details] heat-rg.yml
Created attachment 1016493 [details] steak.yaml
Created attachment 1016494 [details] debug logging
Created attachment 1016496 [details] logs of running system
NB: I suspect this problem will reproduce independently of Kolla with Juno. It should be simple to test, although you will need enough RAM to launch 40 VMs.

Regards
-steve
RE Comment #6: I meant 20 VMs. This problem was happening with 40 VMs, but I wanted to amp it down a notch to 20 VMs, and the problem still appears. Change "count: 20" in heat-rg.yml to set the VM/port*2 count.

Regards
-steve
Created attachment 1016499 [details] logs of running system with compute
When nodes don't get IP addresses I get the following output:

2015-04-20 18:29:33.063 1 DEBUG nova.compute.utils [-] STEAKB2 get_nw_info_for_instance instance.info_cache InstanceInfoCache(created_at=2015-04-20T16:09:41Z,deleted=False,deleted_at=None,instance_uuid=07dc65cc-9523-40ec-816a-f0f0f9a39c2e,network_info=NetworkModel(7d6efb40-f1f1-48e3-9228-be4be1e4f26a),updated_at=2015-04-20T16:27:15Z) get_nw_info_for_instance /usr/lib/python2.7/site-packages/nova/compute/utils.py:373

When nodes get IP addresses I get the following output:

2015-04-20 18:31:35.133 1 DEBUG nova.compute.utils [-] STEAKB2 get_nw_info_for_instance instance.info_cache InstanceInfoCache(created_at=2015-04-20T16:09:42Z,deleted=False,deleted_at=None,instance_uuid=99a6d520-1c48-4543-8275-876b808c2424,network_info=NetworkModel(),updated_at=2015-04-20T16:10:11Z) get_nw_info_for_instance /usr/lib/python2.7/site-packages/nova/compute/utils.py:373

NB: The NetworkModel is defined in the "obtains IP addresses" case - defaults in the other case.

Nova list validates this assertion:

[sdake@bigiron kolla]$ nova list | grep 07dc65cc
| 07dc65cc-9523-40ec-816a-f0f0f9a39c2e | st-25npk73mrk6-2-6kjozrh2swfl-steak_node-6hy7vthojq4o | ACTIVE | - | Running | demo-net=10.10.10.10, 192.168.100.167 |
[sdake@bigiron kolla]$ nova list | grep 99a6
| 99a6d520-1c48-4543-8275-876b808c2424 | st-5npk73mrk6-19-zclmxdkoaymu-steak_node-ajyivkxjzxmg | ACTIVE | - | Running | |
Regarding comment #9: swap the labels on the first two outputs. The first log (07dc65cc, populated NetworkModel) is an instance that does get IP addresses, and the second (99a6d520, empty NetworkModel) is one that does not.
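To make the empty-versus-populated NetworkModel distinction concrete, here is a minimal sketch of how the two cache states from comment #9 end up rendered in the nova list Networks column. The function and data shapes are hypothetical simplifications, not nova's actual API:

```python
def render_networks(network_model):
    """Render a cached NetworkModel the way `nova list` shows it:
    "label=fixed_ip, float_ip" per VIF, and a blank column when the
    model is empty. Hypothetical simplification of nova's formatting."""
    parts = []
    for vif in network_model:  # an empty model yields a blank column
        parts.append("%s=%s" % (vif["network"], ", ".join(vif["ips"])))
    return "; ".join(parts)

# Populated model, as on the instance that keeps its addresses:
populated = [{"network": "demo-net",
              "ips": ["10.10.10.10", "192.168.100.167"]}]
print(render_networks(populated))  # demo-net=10.10.10.10, 192.168.100.167

# Empty model, as on the instance whose cache was rewritten:
print(render_networks([]))  # prints an empty line
```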
Changing nova-compute, nova-api, nova-scheduler and nova-conductor to the icehouse RDO versions makes this problem go away. This looks like a Juno-only problem.
This problem exists in the Kilo RC2 Delorean packages as of April 27th, so I suspect it also exists in master. I have not duplicated it on devstack; when I do, I'll report back.
(In reply to Steven Dake from comment #11)
> Changing nova-compute, nova-api, nova-scheduler and nova-conductor to
> icehouse RDO version does not exhibit this problem. This looks like a juno
> only problem.

What about Neutron, did it stay Juno? This looks like a networking issue to me, so I'm moving it to openstack-neutron for further investigation.
(In reply to Steven Dake from comment #9)
> When nodes don't get IP addresses I get the following output:
>
> 2015-04-20 18:29:33.063 1 DEBUG nova.compute.utils [-] STEAKB2
> get_nw_info_for_instance instance.info_cache
> InstanceInfoCache(created_at=2015-04-20T16:09:41Z,deleted=False,
> deleted_at=None,instance_uuid=07dc65cc-9523-40ec-816a-f0f0f9a39c2e,
> network_info=NetworkModel(7d6efb40-f1f1-48e3-9228-be4be1e4f26a),
> updated_at=2015-04-20T16:27:15Z) get_nw_info_for_instance
> /usr/lib/python2.7/site-packages/nova/compute/utils.py:373
>
> When nodes get IP addresses I get the following output:
>
> 2015-04-20 18:31:35.133 1 DEBUG nova.compute.utils [-] STEAKB2
> get_nw_info_for_instance instance.info_cache
> InstanceInfoCache(created_at=2015-04-20T16:09:42Z,deleted=False,
> deleted_at=None,instance_uuid=99a6d520-1c48-4543-8275-876b808c2424,
> network_info=NetworkModel(),updated_at=2015-04-20T16:10:11Z)
> get_nw_info_for_instance
> /usr/lib/python2.7/site-packages/nova/compute/utils.py:373
>
> NB: The NetworkModel is defined in the "obtains IP addresses" case -
> defaults in the other case.
>
> Nova list validates this assertion:
>
> [sdake@bigiron kolla]$ nova list | grep 07dc65cc
> | 07dc65cc-9523-40ec-816a-f0f0f9a39c2e |
> st-25npk73mrk6-2-6kjozrh2swfl-steak_node-6hy7vthojq4o | ACTIVE | -
> | Running | demo-net=10.10.10.10, 192.168.100.167 |
> [sdake@bigiron kolla]$ nova list | grep 99a6
> | 99a6d520-1c48-4543-8275-876b808c2424 |
> st-5npk73mrk6-19-zclmxdkoaymu-steak_node-ajyivkxjzxmg | ACTIVE | -
> | Running | |

I looked into the attached logs and tracked the instance 99a6d520. It successfully creates a port in neutron and also creates a correct NWInfoCache in nova. After a successful boot I bet you see the correct 10.10.10.17 IP in nova list.
But after the cache refresh is called, nova queries neutron with a specific tenant id:

http://192.168.1.100:9696/v2.0/ports.json?tenant_id=1a697505e6e74578813e2209caaf4d6c&device_id=99a6d520-1c48-4543-8275-876b808c2424 -X GET -H "X-Auth-Token: 5091daa5f5274d9d8a34ddd02999d058" -H "User-Agent: python-neutronclient"

and neutron replies with:

RESP:401 {'date': 'Mon, 20 Apr 2015 17:10:01 GMT', 'content-length': '23', 'content-type': 'text/plain', 'www-authenticate': "Keystone uri='http://192.168.1.100:5000/v2.0'", 'x-openstack-request-id': 'req-a7d741bd-631d-4170-81d3-07423d138e66'} Authentication required

The provided token 5091daa5f5274d9d8a34ddd02999d058 is valid till 17:09:

"token": {"issued_at": "2015-04-20T16:09:30.109666", "expires": "2015-04-20T17:09:30Z", "id": "5091daa5f5274d9d8a34ddd02999d058"}

But I can also see that the token was successfully validated earlier (it appears redacted in the log):

2015-04-20 17:05:57.061 1 DEBUG keystoneclient.session [-] REQ: curl -i -X GET http://192.168.1.100:35357/v3/auth/tokens -H "X-Subject-Token: TOKEN_REDACTED" -H "User-Agent: python-keystoneclient" -H "Accept: application/json" -H "X-Auth-Token: TOKEN_REDACTED" _http_log_request /usr/lib/python2.7/site-packages/keystoneclient/session.py:155
2015-04-20 17:05:57.120 1 DEBUG keystoneclient.session [-] RESP: [200] {'x-subject-token': '5091daa5f5274d9d8a34ddd02999d058', 'content-type': 'application/json', 'content-length': '4668', 'vary': 'X-Auth-Token', 'date': 'Mon, 20 Apr 2015 17:05:57 GMT'}

I am not sure how the tokens are cached in nova, and I'm also very confused by all the logs being in a single file. Steven, may I ask you to push the very same logs in a tar file, with one log per service? It will make grepping easier and thus better tracking of the issue.
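The timing above can be checked directly: the failing ports query at 17:10:01 used a token that had expired at 17:09:30, while the last successful validation was well before expiry. A small sketch of that arithmetic, using the timestamps quoted from the logs:

```python
from datetime import datetime

# Timestamps taken from the log excerpts above.
token_expires = datetime.strptime("2015-04-20T17:09:30Z",
                                  "%Y-%m-%dT%H:%M:%SZ")
ports_request = datetime.strptime("20 Apr 2015 17:10:01",
                                  "%d %b %Y %H:%M:%S")
last_validation = datetime.strptime("2015-04-20 17:05:57",
                                    "%Y-%m-%d %H:%M:%S")

# The 401 is expected: the ports.json call came 31 seconds after expiry...
print((ports_request - token_expires).total_seconds())    # 31.0
# ...while the last successful validation was 213 seconds before expiry.
print((token_expires - last_validation).total_seconds())  # 213.0
```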
This led me to look into neutronclient, and I see that if authentication fails due to an expired token, the client re-authenticates and sends the same request again, which is what happens in our case. So I guess what I wrote above is a dead end. I'll keep digging.
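For reference, the retry behavior described is roughly the following. This is a hand-written toy model, not neutronclient's actual code; all class and method names here are hypothetical. The point is that a single expired token should not lose port data, because the request is replayed with a fresh token:

```python
class Unauthorized(Exception):
    """Stands in for the 401 the server returns on an expired token."""


class FakeBackend:
    """Toy server: rejects stale tokens, serves port data otherwise."""
    def __init__(self):
        self.valid = "fresh-token"

    def issue_token(self):
        return self.valid

    def get(self, path, token):
        if token != self.valid:
            raise Unauthorized()
        return {"ports": [{"device_id": "99a6d520", "ip": "10.10.10.17"}]}


class SketchClient:
    """Toy model of the retry-on-401 behavior seen in neutronclient."""
    def __init__(self, backend):
        self.backend = backend
        self.token = "expired-token"

    def do_request(self, path):
        try:
            return self.backend.get(path, self.token)
        except Unauthorized:
            # Expired token: re-authenticate and replay the request once.
            self.token = self.backend.issue_token()
            return self.backend.get(path, self.token)


client = SketchClient(FakeBackend())
print(client.do_request("/v2.0/ports.json"))  # succeeds after one retry
```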
Alan,

This is definitely a nova problem. I ran Juno/Kilo of Neutron with Icehouse Nova, and it works perfectly. I ran the Juno/Kilo version of Nova with Juno/Kilo of Neutron, and that does not work. This implies that Nova had a regression sometime after Icehouse. Looking through the code and from my 5+ days of debugging, I can say for sure it is nova that is broken, since it touches IP address assignment in 40+ places.
(In reply to Alan Pevec from comment #13)
> (In reply to Steven Dake from comment #11)
> > Changing nova-compute, nova-api, nova-scheduler and nova-conductor to
> > icehouse RDO version does not exhibit this problem. This looks like a juno
> > only problem.
>
> What about Neutron, did it stay Juno?
> This looks like a networking issue to me, so I'm moving it to
> openstack-neutron for further investigation.

Alan,

It does look like a networking issue, but it is a Nova issue. The "NetworkModel" object in one case is set with data, and in another case is not set with data. See comment #9. NetworkModel comes exclusively from Nova.
Alright, so back to Nova. Eoghan, evidence so far is that neutron/neutronclient do the right thing but nwinfo somehow gets lost in the corridors of Nova?
Just curious, can this be reproduced consistently with Nova only, *without* Kolla or Heat being involved? E.g. something like:

for i in {1..20}; do
    nova boot --flavor 1 --key_name oskey1 \
        --image cirros-0.3.3-x86_64-disk vm-$i
done
Kashyap,

I'm pretty sure the fact that it works with Icehouse nova eliminates Kolla as the source of the problem. But you're right, it does need to be validated on devstack. I cannot alter my environment to set up devstack prior to summit because I have two demos using kolla.

As for your nova boot loop, part of the issue is using neutron floating IPs, so the floats need to be assigned as well. At one point I had a script which took a parameter and would create N nova instances with neutron floats. IIRC this reproduced there occasionally.

Regards
-steve
Steve, when you get a moment, please file an upstream bug (with a reproducer) and link it here.
Kashyap,

This fell off my radar after VC ODS, so apologies for not getting a bug filed back in Kilo. Filing a bug upstream at this point is likely to result in no positive action, because the problem is fixed in Liberty and master. The Nova team seemed pretty meh when I mentioned this in the nova IRC channel during debugging, indicating the code was a disaster, needed to be rewritten, and could not really be fixed without a major refactor. What the nova team eventually did with it, I'm not sure. The code at that point in time literally set the IP address in hundreds of places across 10+ RPC boundaries. Dan Smith mentioned this before I got into debugging (trust but verify and all that); he was definitely correct, it was a mess.

Not being a nova core reviewer, I don't keep up with the nova code changes, so I can't offer consulting on what patches need to be backported into RDO or RHEL OSP. I can confirm Liberty and master (Mitaka) work perfectly.

Regards
-steve
This bug is against a Version which has reached End of Life. If it's still present in supported release (http://releases.openstack.org), please update Version and reopen.