Bug 1213547 - launching 20 VMs at once via a heat resource group causes nova to not record some IPs correctly
Keywords:
Status: CLOSED EOL
Alias: None
Product: RDO
Classification: Community
Component: openstack-nova
Version: Juno
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Target Milestone: ---
Target Release: Kilo
Assignee: Eoghan Glynn
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-04-20 18:00 UTC by Steven Dake
Modified: 2016-05-19 15:39 UTC
CC List: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-19 15:39:45 UTC
Embargoed:


Attachments
setup-neutron (1.59 KB, application/x-shellscript), 2015-04-20 18:00 UTC, Steven Dake
launch (459 bytes, text/plain), 2015-04-20 18:01 UTC, Steven Dake
heat-rg.yml (686 bytes, text/plain), 2015-04-20 18:03 UTC, Steven Dake
steak.yaml (970 bytes, text/plain), 2015-04-20 18:03 UTC, Steven Dake
debug logging (5.61 KB, patch), 2015-04-20 18:04 UTC, Steven Dake
logs of running system (2.66 MB, application/x-gzip), 2015-04-20 18:12 UTC, Steven Dake
logs of running system with compute (2.66 MB, application/x-gzip), 2015-04-20 18:35 UTC, Steven Dake

Description Steven Dake 2015-04-20 18:00:55 UTC
Created attachment 1016479 [details]
setup-neutron

Description of problem:
Using the latest Juno RDO packaging, nova rewrites the IP addresses in the info cache to empty.

[sdake@bigiron kolla]$ nova list
+--------------------------------------+-------------------------------------------------------+--------+------------+-------------+---------------------------------------+
| ID                                   | Name                                                  | Status | Task State | Power State | Networks                              |
+--------------------------------------+-------------------------------------------------------+--------+------------+-------------+---------------------------------------+
| 94e84142-b7f4-4777-a20b-b675c7f2a168 | st-25npk73mrk6-0-agkvww4nm3ao-steak_node-wmavxtw6bgae | ACTIVE | -          | Running     | demo-net=10.10.10.14, 192.168.100.170 |
| d341d247-4505-4f14-bfc5-69cd08755ffd | st-25npk73mrk6-1-ktv5fvdjyg5z-steak_node-zq73kaywdjhi | ACTIVE | -          | Running     |                                       |
| 07dc65cc-9523-40ec-816a-f0f0f9a39c2e | st-25npk73mrk6-2-6kjozrh2swfl-steak_node-6hy7vthojq4o | ACTIVE | -          | Running     | demo-net=10.10.10.10, 192.168.100.167 |
| a0783b2d-23b0-4872-a6ea-1a6c3a77aca4 | st-25npk73mrk6-3-7vpyvhq6cgjx-steak_node-wuxk4wus55ce | ACTIVE | -          | Running     | demo-net=10.10.10.3, 192.168.100.151  |
| 51887fa4-8701-42f6-a6b6-e0c8a96085fe | st-25npk73mrk6-4-2xw2lyyiikdx-steak_node-tianxosr5bcp | ACTIVE | -          | Running     |                                       |
| 06616508-2903-4a81-b377-16c2c6c2e274 | st-25npk73mrk6-5-hnes7dhtr2aj-steak_node-7oubcgeq7sdg | ACTIVE | -          | Running     |                                       |
| f6a870fe-e6ab-499d-83dd-1c2bc94e84b9 | st-25npk73mrk6-6-ukgua5s6hbca-steak_node-5xiich6yuf7y | ACTIVE | -          | Running     |                                       |
| 7f3653f9-90b3-4b91-9625-a852fe19b003 | st-25npk73mrk6-7-pk7am5rgoifw-steak_node-itckair4xt7d | ACTIVE | -          | Running     |                                       |
| 5337a083-6cd6-4362-976e-26a75088503d | st-25npk73mrk6-8-rd65frc7o4cv-steak_node-ejvipynq4cko | ACTIVE | -          | Running     | demo-net=10.10.10.7, 192.168.100.157  |
| eeed70f4-6ebc-4d7b-af22-48a5ae59b26d | st-25npk73mrk6-9-afewmrk7blr6-steak_node-43vejemufin6 | ACTIVE | -          | Running     | demo-net=10.10.10.20, 192.168.100.156 |
| 861b0b4b-4b5a-41ab-b824-3bbf2d181772 | st-5npk73mrk6-10-ttrt4rznqux5-steak_node-jmu7mkvdyo72 | ACTIVE | -          | Running     |                                       |
| dc7ea542-4d47-4344-a11a-10c5377858b7 | st-5npk73mrk6-11-hrbzs7uwrijx-steak_node-wt6kd6elxyda | ACTIVE | -          | Running     | demo-net=10.10.10.22, 192.168.100.162 |
| b816e9f1-28ff-47de-830e-356138452b6b | st-5npk73mrk6-12-5ptrj7dpd3cg-steak_node-gc66oowa2kxd | ACTIVE | -          | Running     | demo-net=10.10.10.4, 192.168.100.153  |
| 91e2ec9b-df43-462c-bc00-64c28253e2f1 | st-5npk73mrk6-13-myxwolw4txkt-steak_node-lr6a3lloj7oe | ACTIVE | -          | Running     | demo-net=10.10.10.18, 192.168.100.158 |
| bd4b8b61-899e-4b93-bd42-c3cc7db8f29b | st-5npk73mrk6-14-w77lpejbkgc6-steak_node-kp2c2dnzrugt | ACTIVE | -          | Running     |                                       |
| 93c2f9c7-4d08-49fa-92ed-e3a70e4ebd81 | st-5npk73mrk6-15-r6ijb5pyfq32-steak_node-xm5vkk5f5xsd | ACTIVE | -          | Running     | demo-net=10.10.10.21, 192.168.100.152 |
| 8b5c5531-3ed5-4409-a04a-a39d3b7f7ffe | st-5npk73mrk6-16-5jkskqbsghwh-steak_node-snbjnm33jlhc | ACTIVE | -          | Running     | demo-net=10.10.10.5, 192.168.100.154  |
| 9345ec7c-4179-4031-ac80-d39010bf5901 | st-5npk73mrk6-17-bkrhtamjxdpo-steak_node-ln4v7r36rmby | ACTIVE | -          | Running     |                                       |
| c6bea271-a174-4bd9-b103-cb276eef37fb | st-5npk73mrk6-18-dzv34hs2qafj-steak_node-u5w63s73hnms | ACTIVE | -          | Running     |                                       |
| 99a6d520-1c48-4543-8275-876b808c2424 | st-5npk73mrk6-19-zclmxdkoaymu-steak_node-ajyivkxjzxmg | ACTIVE | -          | Running     |                                       |
+--------------------------------------+-------------------------------------------------------+--------+------------+-------------+---------------------------------------+


Inspecting the info cache in the nova database with mysql shows the VM's UUID stored with an empty NetworkModel.
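
For reference, the kind of check described above can be done with a query along these lines (a sketch only; it assumes the standard Juno nova schema, where the serialized NetworkModel lives in instance_info_caches.network_info, and that the mysql client can reach the nova database):

    # Hypothetical inspection query; affected instances show an empty
    # network_info ('[]') while healthy ones carry the full JSON blob.
    mysql -u root -p nova -e "
        SELECT instance_uuid, updated_at, network_info
        FROM instance_info_caches
        WHERE deleted = 0;"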

I added some print debug logging, but after 5 days I am only a bit closer to understanding the root cause of the problem (but boy, do I now understand how IPs make their way through the nova system, and it's not pretty).

The VMs that don't have IPs associated with them were still able to obtain DHCP leases and metadata, so the system clearly knows which IPs belong to which MAC addresses.

Version-Release number of selected component (if applicable):
[sdake@bigiron kolla]$ sudo docker exec ab rpm -qa | grep nova
python-nova-2014.2.2-1.el7.noarch
python-novaclient-2.20.0-1.el7.centos.noarch
openstack-nova-common-2014.2.2-1.el7.noarch
openstack-nova-compute-2014.2.2-1.el7.noarch

How reproducible:
100%

Steps to Reproduce:

Using this repository:
http://github.com/stackforge/kolla

1. Read and follow the kolla developer-env.md documentation (make sure to follow the directions!)
2. Run kolla/pull -c
3. Run kolla/start
4. Run the attached setup-neutron script
5. Run the attached launch script (CC-BY, enjoy :)
6. Run nova list after 3-4 minutes

Actual results:
Only some VMs have IP addresses and floating IPs listed in nova list.

Expected results:
All VMs should have IP addresses and floating IPs listed in nova list.

Additional info:
I have attached my debug logs and the debug-logging changes I added while trying to root-cause the problem.

Comment 1 Steven Dake 2015-04-20 18:01:38 UTC
Created attachment 1016480 [details]
launch

Comment 2 Steven Dake 2015-04-20 18:03:12 UTC
Created attachment 1016492 [details]
heat-rg.yml

Comment 3 Steven Dake 2015-04-20 18:03:36 UTC
Created attachment 1016493 [details]
steak.yaml

Comment 4 Steven Dake 2015-04-20 18:04:58 UTC
Created attachment 1016494 [details]
debug logging

Comment 5 Steven Dake 2015-04-20 18:12:13 UTC
Created attachment 1016496 [details]
logs of running system

Comment 6 Steven Dake 2015-04-20 18:17:19 UTC
NB: I suspect this problem will reproduce independently of Kolla with Juno.  It should be simple to test, although you will need enough RAM to launch 40 VMs.

Regards
-steve

Comment 7 Steven Dake 2015-04-20 18:19:39 UTC
RE Comment #6, I meant 20 VMs.  This problem was happening with 40 VMs, but I wanted to amp it down a notch to 20 VMs, and the problem still appears.  Change "count: 20" in heat-rg.yml to set the VM count (ports are created at twice that number).

Regards
-steve

Comment 8 Steven Dake 2015-04-20 18:35:35 UTC
Created attachment 1016499 [details]
logs of running system with compute

Comment 9 Steven Dake 2015-04-20 19:02:19 UTC
When nodes don't get IP addresses I get the following output:

2015-04-20 18:29:33.063 1 DEBUG nova.compute.utils [-] STEAKB2 get_nw_info_for_instance instance.info_cache InstanceInfoCache(created_at=2015-04-20T16:09:41Z,deleted=False,deleted_at=None,instance_uuid=07dc65cc-9523-40ec-816a-f0f0f9a39c2e,network_info=NetworkModel(7d6efb40-f1f1-48e3-9228-be4be1e4f26a),updated_at=2015-04-20T16:27:15Z) get_nw_info_for_instance /usr/lib/python2.7/site-packages/nova/compute/utils.py:373


When nodes get IP addresses I get the following output:

2015-04-20 18:31:35.133 1 DEBUG nova.compute.utils [-] STEAKB2 get_nw_info_for_instance instance.info_cache InstanceInfoCache(created_at=2015-04-20T16:09:42Z,deleted=False,deleted_at=None,instance_uuid=99a6d520-1c48-4543-8275-876b808c2424,network_info=NetworkModel(),updated_at=2015-04-20T16:10:11Z) get_nw_info_for_instance /usr/lib/python2.7/site-packages/nova/compute/utils.py:373


NB: The NetworkModel is defined in the "obtains IP addresses" case - defaults in the other case.

Nova list validates this assertion:

[sdake@bigiron kolla]$ nova list | grep 07dc65cc
| 07dc65cc-9523-40ec-816a-f0f0f9a39c2e | st-25npk73mrk6-2-6kjozrh2swfl-steak_node-6hy7vthojq4o | ACTIVE | -          | Running     | demo-net=10.10.10.10, 192.168.100.167 |
[sdake@bigiron kolla]$ nova list | grep 99a6
| 99a6d520-1c48-4543-8275-876b808c2424 | st-5npk73mrk6-19-zclmxdkoaymu-steak_node-ajyivkxjzxmg | ACTIVE | -          | Running     |                                       |
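
For anyone retracing this, the mismatch can also be seen without the extra debug logging by comparing what neutron knows about the instance's ports with what nova reports (a sketch; the UUID is copied from the output above, and the "--" form for passing device_id as an arbitrary filter is a best guess for the Juno-era neutron client):

    # Neutron should still list a port bound to the instance even though
    # nova's cache (and therefore nova list) shows no networks for it.
    neutron port-list -- --device_id 99a6d520-1c48-4543-8275-876b808c2424
    nova show 99a6d520-1c48-4543-8275-876b808c2424 | grep -i net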

Comment 10 Steven Dake 2015-04-20 19:11:31 UTC
Regarding comment #9: the two log outputs are swapped; the first output is actually the case where nodes do get IP addresses, and the second is the case where they don't.

Comment 11 Steven Dake 2015-04-20 23:29:41 UTC
Changing nova-compute, nova-api, nova-scheduler and nova-conductor to the Icehouse RDO versions does not exhibit this problem.  This looks like a Juno-only problem.

Comment 12 Steven Dake 2015-05-02 15:08:20 UTC
This problem exists in the Kilo RC2 Delorean packages as of April 27th, so I suspect it also exists in master.  I have not reproduced it on devstack.  When I do, I'll report back.

Comment 13 Alan Pevec (Fedora) 2015-05-03 22:22:05 UTC
(In reply to Steven Dake from comment #11)
> Changing nova-compute, nova-api, nova-scheduler and nova-conductor to
> icehouse RDO version does not exhibit this problem.  This looks like a juno
> only problem.

What about Neutron, did it stay Juno?
This looks like a networking issue to me, so I'm moving it to openstack-neutron for further investigation.

Comment 14 Jakub Libosvar 2015-05-04 12:08:02 UTC
(In reply to Steven Dake from comment #9)
> When nodes don't get IP addresses I get the following output:
> 
> 2015-04-20 18:29:33.063 1 DEBUG nova.compute.utils [-] STEAKB2
> get_nw_info_for_instance instance.info_cache
> InstanceInfoCache(created_at=2015-04-20T16:09:41Z,deleted=False,
> deleted_at=None,instance_uuid=07dc65cc-9523-40ec-816a-f0f0f9a39c2e,
> network_info=NetworkModel(7d6efb40-f1f1-48e3-9228-be4be1e4f26a),
> updated_at=2015-04-20T16:27:15Z) get_nw_info_for_instance
> /usr/lib/python2.7/site-packages/nova/compute/utils.py:373
> 
> 
> When nodes get IP addresses I get the following output:
> 
> 2015-04-20 18:31:35.133 1 DEBUG nova.compute.utils [-] STEAKB2
> get_nw_info_for_instance instance.info_cache
> InstanceInfoCache(created_at=2015-04-20T16:09:42Z,deleted=False,
> deleted_at=None,instance_uuid=99a6d520-1c48-4543-8275-876b808c2424,
> network_info=NetworkModel(),updated_at=2015-04-20T16:10:11Z)
> get_nw_info_for_instance
> /usr/lib/python2.7/site-packages/nova/compute/utils.py:373
> 
> 
> NB: The NetworkModel is defined in the "obtains IP addresses" case -
> defaults in the other case.
> 
> Nova list validates this assertion:
> 
> [sdake@bigiron kolla]$ nova list | grep 07dc65cc
> | 07dc65cc-9523-40ec-816a-f0f0f9a39c2e |
> st-25npk73mrk6-2-6kjozrh2swfl-steak_node-6hy7vthojq4o | ACTIVE | -         
> | Running     | demo-net=10.10.10.10, 192.168.100.167 |
> [sdake@bigiron kolla]$ nova list | grep 99a6
> | 99a6d520-1c48-4543-8275-876b808c2424 |
> st-5npk73mrk6-19-zclmxdkoaymu-steak_node-ajyivkxjzxmg | ACTIVE | -         
> | Running     |                                       |

I looked into the attached logs and tracked the instance 99a6d520. It successfully creates a port in neutron and also creates a correct NWInfoCache in nova. After a successful boot I bet you see the correct 10.10.10.17 IP in nova list. But after the cache refresh is called, nova queries neutron with a specific tenant id:
http://192.168.1.100:9696/v2.0/ports.json?tenant_id=1a697505e6e74578813e2209caaf4d6c&device_id=99a6d520-1c48-4543-8275-876b808c2424 -X GET -H "X-Auth-Token: 5091daa5f5274d9d8a34ddd02999d058" -H "User-Agent: python-neutronclient"

and neutron replies with RESP:401 {'date': 'Mon, 20 Apr 2015 17:10:01 GMT', 'content-length': '23', 'content-type': 'text/plain', 'www-authenticate': "Keystone uri='http://192.168.1.100:5000/v2.0'", 'x-openstack-request-id': 'req-a7d741bd-631d-4170-81d3-07423d138e66'} Authentication required

The provided token 5091daa5f5274d9d8a34ddd02999d058 is valid until 17:09:
"token": {"issued_at": "2015-04-20T16:09:30.109666", "expires": "2015-04-20T17:09:30Z", "id": "5091daa5f5274d9d8a34ddd02999d058"

But I can also see that the token was successfully validated:
2015-04-20 17:05:57.061 1 DEBUG keystoneclient.session [-] REQ: curl -i -X GET http://192.168.1.100:35357/v3/auth/tokens -H "X-Subject-Token: TOKEN_REDACTED" -H "User-Agent: python-keystoneclient" -H "Accept: application/json" -H "X-Auth-Token: TOKEN_REDACTED" _http_log_request /usr/lib/python2.7/site-packages/keystoneclient/session.py:155^M
2015-04-20 17:05:57.120 1 DEBUG keystoneclient.session [-] RESP: [200] {'x-subject-token': '5091daa5f5274d9d8a34ddd02999d058', 'content-type': 'application/json', 'content-length': '4668', 'vary': 'X-Auth-Token', 'date': 'Mon, 20 Apr 2015 17:05:57 GMT'}
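
For anyone retracing this, the failing call can be replayed by hand with a fresh token (a sketch; the endpoint, tenant_id and device_id are copied from the log lines above, keystone token-get is the Juno-era v2 CLI, and admin/demo credentials are assumed to be sourced in the environment):

    # Issue a fresh token and repeat the exact ports query that returned 401.
    TOKEN=$(keystone token-get | awk -F'|' '$2 ~ /^ *id *$/ {gsub(/ /, "", $3); print $3}')
    curl -i -H "X-Auth-Token: $TOKEN" -H "User-Agent: python-neutronclient" \
      "http://192.168.1.100:9696/v2.0/ports.json?tenant_id=1a697505e6e74578813e2209caaf4d6c&device_id=99a6d520-1c48-4543-8275-876b808c2424"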


I am not sure how the tokens are cached in nova, and I'm also quite confused by all the logs being in a single file. Steven, may I ask you to push the very same logs as a tar file with one log per service? It will make grepping easier and make the issue easier to track.

Comment 15 Jakub Libosvar 2015-05-04 12:59:12 UTC
This led me to look into neutronclient, and I see that if authentication fails due to an expired token, the client re-authenticates and sends the same request again, which is what happens in our case. So I guess what I wrote above is a dead end. I'll keep digging.

Comment 16 Steven Dake 2015-05-04 13:10:53 UTC
Alan,

This is definitely a nova problem.  I ran Juno/Kilo Neutron with Icehouse Nova, and it works perfectly.

I ran the Juno/Kilo version of Nova with Juno/Kilo Neutron, and that does not work.

This implies that Nova had a regression sometime after Icehouse.

Looking through the code and drawing on my 5+ days of debugging, I can say for sure it is nova that is broken, since it performs IP address assignment in approximately 40+ places.

Comment 17 Steven Dake 2015-05-04 13:23:44 UTC
(In reply to Alan Pevec from comment #13)
> (In reply to Steven Dake from comment #11)
> > Changing nova-compute, nova-api, nova-scheduler and nova-conductor to
> > icehouse RDO version does not exhibit this problem.  This looks like a juno
> > only problem.
> 
> What about Neutron, did it stay Juno?
> This looks like a networking issue to me, so I'm moving it to
> openstack-neutron for further investigation.

Alan,

It does look like a networking issue, but it is a Nova issue.  The "NetworkModel" object is populated with data in one case and empty in the other (see comment #9), and NetworkModel comes exclusively from Nova.

Comment 18 Alan Pevec (Fedora) 2015-05-04 20:25:50 UTC
Alright, so back to Nova. Eoghan, evidence so far is that neutron/neutronclient do the right thing but nwinfo somehow gets lost in the corridors of Nova?

Comment 19 Kashyap Chamarthy 2015-05-11 10:24:56 UTC
Just curious, can this be reproduced consistently with Nova only, 
*without* Kolla or Heat being involved?

E.g. something like:

    for i in {1..20}; do
        nova boot --flavor 1 --key_name oskey1 \
            --image cirros-0.3.3-x86_64-disk vm-$i;
    done

Comment 20 Steven Dake 2015-05-12 06:56:21 UTC
Kashyap,

I'm pretty sure the fact that it works with Icehouse nova eliminates Kolla as the source of the problem.  But you're right, it does need to be validated on devstack.  I cannot alter my environment to set up devstack prior to summit because I have two demos using kolla.

As for your nova boot loop, part of the issue is the use of neutron floating IPs, so the floats need to be assigned as well.

At one point I had a script which took a parameter and created N nova instances with neutron floating IPs attached.  IIRC the problem reproduced there occasionally.
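
A rough reconstruction of that kind of script is sketched below (assumptions: a demo-net network and a public floating-IP pool exist, the flags and table layouts match the Juno-era clients, and N is the first argument; treat it as illustrative only):

    #!/bin/bash
    # Boot N VMs on demo-net and attach a neutron floating IP to each one.
    N=${1:-20}
    NET_ID=$(neutron net-show demo-net -f value -c id)
    for i in $(seq 1 "$N"); do
        nova boot --flavor 1 --key_name oskey1 \
            --image cirros-0.3.3-x86_64-disk --nic net-id="$NET_ID" "vm-$i"
    done
    # Once the instances are ACTIVE, create and associate the floats.
    for i in $(seq 1 "$N"); do
        FIP=$(nova floating-ip-create public |
              awk -F'|' '$2 ~ /[0-9]+\./ {gsub(/ /, "", $2); print $2; exit}')
        nova floating-ip-associate "vm-$i" "$FIP"
    done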

Regards
-steve

Comment 21 Kashyap Chamarthy 2015-05-22 15:34:25 UTC
Steve, when you get a moment, please file an upstream bug (with a reproducer) and link it here.

Comment 22 Steven Dake 2016-02-15 22:57:51 UTC
Kashyap,

This fell off my radar after VC ODS, so apologies for not getting a bug filed back in Kilo.  Filing an upstream bug at this point is likely to result in no positive action, because the problem is fixed in Liberty and master.  The Nova team seemed pretty meh when I mentioned this in the nova IRC channel during debugging, indicating the code was a disaster, needed to be rewritten, and that there was no real way to fix it without a major refactor.  What the nova team eventually did with it, I'm not sure.

The code at that point in time literally set the IP address in hundreds of places across 10+ RPC boundaries.  Dan Smith mentioned this before I got into debugging; trust but verify and all that, but he was definitely correct: it was a mess.  Not being a nova core reviewer, I don't keep up with nova code changes, so I can't offer advice on which patches would need to be backported into RDO or RHEL OSP.

I can confirm that Liberty and master (Mitaka) work perfectly.

Regards
-steve

Comment 23 Chandan Kumar 2016-05-19 15:39:45 UTC
This bug is against a version which has reached End of Life.
If it is still present in a supported release (http://releases.openstack.org), please update the Version field and reopen.

