Created attachment 1129364 [details]
nova-compute.log file

Description of problem:
During a scale test, 3217 instances/VMs were launched successfully; every subsequent launch failed with the error:
"Build of instance 3899eea1-1367-4932-9207-0474fbaaafd5 aborted: Failed to allocate the network(s), not rescheduling"

The following exceptions were found in nova-compute.log:

2016-02-18 10:38:12.474 17443 WARNING nova.virt.libvirt.driver [req-03e59ac6-649c-4571-951b-ef496440a4e9 ea29bb495c1f43f89e8dc4f812f8f180 178100f38c934c29824db1a6ee40d7e2 - - -] Timeout waiting for vif plugging callback for instance 9e39ceb1-1177-4267-b7ed-299d116927fc
2016-02-18 10:38:15.919 17443 ERROR nova.compute.manager [req-03e59ac6-649c-4571-951b-ef496440a4e9 ea29bb495c1f43f89e8dc4f812f8f180 178100f38c934c29824db1a6ee40d7e2 - - -] [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc] Instance failed to spawn
2016-02-18 10:38:15.919 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc] Traceback (most recent call last):
2016-02-18 10:38:15.919 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2486, in _build_resources
2016-02-18 10:38:15.919 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc]     yield resources
2016-02-18 10:38:15.919 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2358, in _build_and_run_instance
2016-02-18 10:38:15.919 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc]     block_device_info=block_device_info)
2016-02-18 10:38:15.919 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2492, in spawn
2016-02-18 10:38:15.919 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc]     block_device_info=block_device_info)
2016-02-18 10:38:15.919 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4592, in _create_domain_and_network
2016-02-18 10:38:15.919 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc]     raise exception.VirtualInterfaceCreateException()
2016-02-18 10:38:15.919 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc] VirtualInterfaceCreateException: Virtual Interface creation failed
2016-02-18 10:38:15.919 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc]
2016-02-18 10:38:16.873 17443 ERROR nova.compute.manager [req-03e59ac6-649c-4571-951b-ef496440a4e9 ea29bb495c1f43f89e8dc4f812f8f180 178100f38c934c29824db1a6ee40d7e2 - - -] [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc] Failed to allocate network(s)
2016-02-18 10:38:16.873 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc] Traceback (most recent call last):
2016-02-18 10:38:16.873 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2358, in _build_and_run_instance
2016-02-18 10:38:16.873 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc]     block_device_info=block_device_info)
2016-02-18 10:38:16.873 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2492, in spawn
2016-02-18 10:38:16.873 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc]     block_device_info=block_device_info)
2016-02-18 10:38:16.873 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4592, in _create_domain_and_network
2016-02-18 10:38:16.873 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc]     raise exception.VirtualInterfaceCreateException()
2016-02-18 10:38:16.873 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc] VirtualInterfaceCreateException: Virtual Interface creation failed
2016-02-18 10:38:16.873 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc]
2016-02-18 10:38:16.876 17443 ERROR nova.compute.manager [req-03e59ac6-649c-4571-951b-ef496440a4e9 ea29bb495c1f43f89e8dc4f812f8f180 178100f38c934c29824db1a6ee40d7e2 - - -] [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc] Build of instance 9e39ceb1-1177-4267-b7ed-299d116927fc aborted: Failed to allocate the network(s), not rescheduling.
2016-02-18 10:38:16.876 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc] Traceback (most recent call last):
2016-02-18 10:38:16.876 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2261, in _do_build_and_run_instance
2016-02-18 10:38:16.876 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc]     filter_properties)
2016-02-18 10:38:16.876 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2393, in _build_and_run_instance
2016-02-18 10:38:16.876 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc]     reason=msg)
2016-02-18 10:38:16.876 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc] BuildAbortException: Build of instance 9e39ceb1-1177-4267-b7ed-299d116927fc aborted: Failed to allocate the network(s), not rescheduling.
2016-02-18 10:38:16.876 17443 TRACE nova.compute.manager [instance: 9e39ceb1-1177-4267-b7ed-299d116927fc]
2016-02-18 10:38:17.536 17443 ERROR nova.network.neutronv2.api [req-03e59ac6-649c-4571-951b-ef496440a4e9 ea29bb495c1f43f89e8dc4f812f8f180 178100f38c934c29824db1a6ee40d7e2 - - -] Unable to clear device ID for port 'None'
2016-02-18 10:38:17.536 17443 TRACE nova.network.neutronv2.api Traceback (most recent call last):
2016-02-18 10:38:17.536 17443 TRACE nova.network.neutronv2.api   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 366, in _unbind_ports
2016-02-18 10:38:17.536 17443 TRACE nova.network.neutronv2.api     port_client.update_port(port_id, port_req_body)
2016-02-18 10:38:17.536 17443 TRACE nova.network.neutronv2.api   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 102, in with_params
2016-02-18 10:38:17.536 17443 TRACE nova.network.neutronv2.api     ret = self.function(instance, *args, **kwargs)
2016-02-18 10:38:17.536 17443 TRACE nova.network.neutronv2.api   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 549, in update_port
2016-02-18 10:38:17.536 17443 TRACE nova.network.neutronv2.api     return self.put(self.port_path % (port), body=body)
2016-02-18 10:38:17.536 17443 TRACE nova.network.neutronv2.api   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 302, in put
2016-02-18 10:38:17.536 17443 TRACE nova.network.neutronv2.api     headers=headers, params=params)
2016-02-18 10:38:17.536 17443 TRACE nova.network.neutronv2.api   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 270, in retry_request
2016-02-18 10:38:17.536 17443 TRACE nova.network.neutronv2.api     headers=headers, params=params)
2016-02-18 10:38:17.536 17443 TRACE nova.network.neutronv2.api   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 211, in do_request
2016-02-18 10:38:17.536 17443 TRACE nova.network.neutronv2.api     self._handle_fault_response(status_code, replybody)
2016-02-18 10:38:17.536 17443 TRACE nova.network.neutronv2.api   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 185, in _handle_fault_response
2016-02-18 10:38:17.536 17443 TRACE nova.network.neutronv2.api     exception_handler_v20(status_code, des_error_body)
2016-02-18 10:38:17.536 17443 TRACE nova.network.neutronv2.api   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 83, in exception_handler_v20
2016-02-18 10:38:17.536 17443 TRACE nova.network.neutronv2.api     message=message)
2016-02-18 10:38:17.536 17443 TRACE nova.network.neutronv2.api NeutronClientException: 404 Not Found
2016-02-18 10:38:17.536 17443 TRACE nova.network.neutronv2.api
2016-02-18 10:38:17.536 17443 TRACE nova.network.neutronv2.api The resource could not be found.

OpenStack environment: 3 controller nodes and 6 compute nodes, HA, each with 24 CPUs and 64 GB RAM.

Instances/VMs were created using:
- image: cirros-0.3.2-sc (12.6 MB)
- flavor: m1.nano (1 VCPU, 64 MB RAM, 1 GB disk)

Version-Release number of selected component (if applicable):
rhos-release 7 -p 2015-12-16.1

# rpm -qa | grep nova
openstack-nova-console-2015.1.2-7.el7ost.noarch
openstack-nova-novncproxy-2015.1.2-7.el7ost.noarch
python-nova-2015.1.2-7.el7ost.noarch
openstack-nova-api-2015.1.2-7.el7ost.noarch
openstack-nova-common-2015.1.2-7.el7ost.noarch
python-novaclient-2.23.0-2.el7ost.noarch
openstack-nova-compute-2015.1.2-7.el7ost.noarch
openstack-nova-conductor-2015.1.2-7.el7ost.noarch
openstack-nova-scheduler-2015.1.2-7.el7ost.noarch
openstack-nova-cert-2015.1.2-7.el7ost.noarch

How reproducible:
Run a scale test that creates 50 tenants with 50 networks per tenant, then boots as many instances/VMs as possible (a rough sketch is under Additional info below).

Steps to Reproduce:
1.
2.
3.

Actual results:
Failed to create more than 3217 instances; exceptions were raised in nova-compute.log (see above).

Expected results:

Additional info:
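For reference, a rough sketch of the kind of load such a test generates (illustrative only; the actual test harness is not part of this report, and names/UUIDs are placeholders):

# Create 50 networks in the current tenant (repeat per tenant).
for n in $(seq 1 50); do
    neutron net-create "net-${n}"
    neutron subnet-create "net-${n}" "10.0.${n}.0/24"
done
# Boot instances in batches until builds start to fail.
nova boot --image cirros-0.3.2-sc --flavor m1.nano \
    --nic net-id=<network-uuid> --min-count 100 vm-batch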
See test results https://mojo.redhat.com/docs/DOC-1067371
Created attachment 1129365 [details] ospd node neutron-server log file
Created attachment 1129366 [details] ospd node neutron-openvswitch log file
The problem is that neutron has not called back to nova to deliver the vif-plugged event, which causes nova to time out. I don't see any errors in the neutron log indicating that something failed and prevented this call, but after nova fails the request, it also isn't able to talk to the neutron API to deallocate the resources. Suggest we point some neutron people at this to take a look.
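For context, the vif-plugged event nova is waiting for is delivered by neutron's nova notifier as a call to nova's os-server-external-events API. The payload nova never received looks roughly like this (the tag field carries the port UUID; shown here with the instance UUID from the log and a placeholder port):

POST /v2.1/{tenant_id}/os-server-external-events
{
    "events": [
        {
            "name": "network-vif-plugged",
            "server_uuid": "9e39ceb1-1177-4267-b7ed-299d116927fc",
            "tag": "<port-uuid>",
            "status": "completed"
        }
    ]
}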
Bug reproduced (OSP 7.3) on rhos-release 7 -p 2016-02-24.1
See test results on https://mojo.redhat.com/docs/DOC-1070135
Would it be possible to gain access to a setup where the issue reproduces?
I don't have an OSP 7 setup right now; the bug was reported 4-5 months ago. I do have an OSP 8 setup (-p 2016-05-31.1) and I'm going to rerun the same scale test soon. If you would like to take a look at the issue on OSP 8, let me know, because I applied a workaround to the setup: to work around "Failed to allocate the network(s), not rescheduling", the nova settings on the compute nodes must be updated to vif_plugging_is_fatal=False in /etc/nova/nova.conf instead of the default vif_plugging_is_fatal=True (see the snippet below). I'll update you when the test starts.
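For clarity, the workaround amounts to the following change in /etc/nova/nova.conf on each compute node, followed by a restart of openstack-nova-compute. Note that it only masks the symptom: nova proceeds with the boot instead of aborting when the callback times out.

[DEFAULT]
# Default is True; False makes a vif-plugging timeout non-fatal.
vif_plugging_is_fatal=False
# The wait itself is governed by this option (seconds, default 300,
# i.e. the 5-minute timeout seen in the logs).
#vif_plugging_timeout=300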
The issue reproduced on OSP 8 -p 2016-05-31.1. Tonight I executed the same scale test and created only 3661 instances; all the others ended up in Error status. I provided Assaf with setup details and access.
Created attachment 1168705 [details] openvswitch-agent.log file from overcloud-compute-2
Here, nova is able to create the port for the VM and bind it to the compute node (confirmed through the neutron server log). It then plugs the vif (confirmed through the ovsdb log message in the ovs agent log) and waits 5 minutes for the network-vif-plugged event from the neutron server. (When the timeout happens, it deletes the vif, again visible in the ovsdb log messages in the ovs agent log.)

When nova plugged the vif, the ovs agent received the information about the new port through ovsdb, but it was unable to process the port because it was busy in rpc_loop handling existing ports (>600 ports on the compute node). While processing these ports, the agent called the update_device_list RPC method, which failed with an RPC timeout error; I see the response for this RPC arriving after 2 minutes (the default RPC timeout is 1 minute). Because of the timeout, the agent tries to process those 600 ports all over again.

Since the ovs agent is busy processing these ports the whole time, it never wires up the new port added by nova, so nova does not receive the network-vif-plugged notification and the instance build fails with "Timeout waiting for vif plugging callback for instance".

One possible solution is to increase the RPC timeout, but there is an upstream patch https://review.openstack.org/#/c/314319/ that adds an exponential backoff mechanism for the timeout values on RPC calls (see the sketch below). Yuri will be testing with this patch on his test setup with OSP 8 -p 2016-05-31.1.
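For reference, the fixed timeout being hit here is rpc_response_timeout in neutron.conf (default 60 seconds). A minimal sketch of the exponential backoff idea from the upstream patch (function and exception names are illustrative, not neutron's actual API):

class RPCTimeout(Exception):
    pass

def call_with_backoff(rpc_call, base_timeout=60, max_timeout=600):
    """Retry an RPC call, doubling the timeout after each failure."""
    timeout = base_timeout
    while True:
        try:
            return rpc_call(timeout=timeout)
        except RPCTimeout:
            if timeout >= max_timeout:
                raise
            # Give a busy server progressively more time to answer
            # instead of re-queueing the same bulk of work every minute.
            timeout = min(timeout * 2, max_timeout)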
Created attachment 1170248 [details] neutron server log contr-0 after disable firewall driver
Created attachment 1170249 [details] neutron server log contr-1 after disable firewall driver
Created attachment 1170251 [details] neutron server log contr-2 after disable firewall driver
Created attachment 1170254 [details] openvswitch-agent.log compute 2 - part aa
Created attachment 1170255 [details] openvswitch-agent.log compute 2 - part ab
Created attachment 1170257 [details] openvswitch-agent.log compute 2 - part ac
Created attachment 1170258 [details] openvswitch-agent.log compute 2 - part ad
Need to try this change: https://review.openstack.org/#/c/337064/
@Yuri Any update on getting a setup to try the fix https://review.openstack.org/#/c/337064/? Thanks, Anil
Hi Anil, I still don't have hardware for reproducing the bug, but hopefully I'll get the servers next week. I'll update you with the status. Thanks, Yuri
These changes improve the performance of the ovs agent:
[1] https://review.openstack.org/#/c/337064/
[2] https://review.openstack.org/#/c/352440/
[3] https://review.openstack.org/#/c/351773/
Change [1] is already merged upstream and is also part of OSP 9. Changes [2] and [3] are awaiting review upstream. They avoid unnecessary duplicate ipset calls (sketched below), making the ovs agent run faster when many ports are added.
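The gist of those changes is to skip kernel ipset updates when the desired member set has not changed between iterations. A rough sketch of the idea (illustrative names, not the actual neutron code):

_applied = {}  # ipset name -> members last pushed to the kernel

def update_ipset(name, members, apply_fn):
    """Call apply_fn (e.g. a wrapper around the ipset binary) only on real changes."""
    members = frozenset(members)
    if _applied.get(name) == members:
        return False  # unchanged; skip the redundant system call
    apply_fn(name, members)
    _applied[name] = members
    return True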
Hi Toni, Not yet, I'll reproduce it at once I have a working setup. I've opened a ticket at eng-ops https://engineering.redhat.com/rt/Ticket/Display.html?id=427891 Waiting for the fix Yuri
Thanks. Can you please provide me with the access details so I can investigate further?
Hi Anil, I've finally got hardware for reproducing the bug. It is now reproduced on OSP 8:

[root@overcloud-compute-0 nova]# rpm -qa | grep nova
openstack-nova-novncproxy-12.0.5-9.el7ost.noarch
openstack-nova-compute-12.0.5-9.el7ost.noarch
openstack-nova-api-12.0.5-9.el7ost.noarch
python-novaclient-3.1.0-2.el7ost.noarch
python-nova-12.0.5-9.el7ost.noarch
openstack-nova-conductor-12.0.5-9.el7ost.noarch
openstack-nova-common-12.0.5-9.el7ost.noarch
openstack-nova-scheduler-12.0.5-9.el7ost.noarch
openstack-nova-console-12.0.5-9.el7ost.noarch
openstack-nova-cert-12.0.5-9.el7ost.noarch

with the same error:

2017-05-01 01:56:40.239 33560 ERROR nova.compute.manager [instance: ef090168-e140-45ba-894b-7bad74172217] BuildAbortException: Build of instance ef090168-e140-45ba-894b-7bad74172217 aborted: Failed to allocate the network(s), not rescheduling.

Now, how can we apply your fix and rerun the test? Yuri
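One way to trial an upstream review on a test node is to patch the installed tree directly and restart the agent. This is a hedged sketch, not a supported procedure; the change number here is [1] from above, and the patch level and paths may need adjusting for the installed release:

cd /usr/lib/python2.7/site-packages
curl -s 'https://review.openstack.org/changes/337064/revisions/current/patch' \
    | base64 -d | patch -p1
systemctl restart neutron-openvswitch-agent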
Forgot to add the neutron versions:

[stack@c02-h02-r620 ~]$ rpm -qa | grep neutron
openstack-neutron-7.2.0-5.el7ost.noarch
openstack-neutron-common-7.2.0-5.el7ost.noarch
openstack-neutron-ml2-7.2.0-5.el7ost.noarch
openstack-neutron-openvswitch-7.2.0-5.el7ost.noarch
python-neutron-7.2.0-5.el7ost.noarch
python-neutronclient-3.1.0-3.el7ost.noarch
Hi Anil, Unfortunately, the bug also reproduced on rhos-release 9 -p 2017-04-07.5:

[root@c02-h02-r620 ~]# rpm -qa | grep neutron
python-neutron-lib-0.0.2-1.el7ost.noarch
openstack-neutron-8.3.0-5.el7ost.noarch
python-neutronclient-4.1.1-2.el7ost.noarch
python-neutron-vpnaas-tests-8.0.0-1.el7ost.noarch
python-neutron-tests-8.3.0-5.el7ost.noarch
python-neutron-8.3.0-5.el7ost.noarch
python-neutron-fwaas-tests-8.0.0-3.el7ost.noarch
openstack-neutron-ml2-8.3.0-5.el7ost.noarch
python-neutron-lbaas-tests-8.1.0-2.el7ost.noarch
python-neutron-vpnaas-8.0.0-1.el7ost.noarch
openstack-neutron-openvswitch-8.3.0-5.el7ost.noarch
python-neutron-lbaas-8.1.0-2.el7ost.noarch
openstack-neutron-common-8.3.0-5.el7ost.noarch
python-neutron-fwaas-8.0.0-3.el7ost.noarch

See the attached txt file with the error. I haven't cleaned up the stack; it's waiting for you to take a look. Yuri
Created attachment 1275991 [details] error
(In reply to Yuri Obshansky from comment #33)
> Created attachment 1275991 [details]
> error

Can you provide the neutron and ovs agent logs related to that error? Also, can you ensure that /etc/neutron/neutron.conf is configured with:

notify_nova_on_port_status_changes=True
notify_nova_on_port_data_changes=True
nova_url=http://<IP>/v2.1

Thanks
Hi, I checked /etc/neutron/neutron.conf on a compute node:

...
# Send notification to nova when port status changes (boolean value)
#notify_nova_on_port_status_changes = true

# Send notification to nova when port data (fixed_ips/floatingip) changes so
# nova can update its cache. (boolean value)
#notify_nova_on_port_data_changes = true
...

There is no nova_url=http://<IP>/v2.1 setting in it. I attached parts of openvswitch-agent.log and ovs-vswitchd.log from the same time frame. Thanks
Created attachment 1276361 [details] openvswitch-agent.log and ovs-vswitchd.log
(In reply to Yuri Obshansky from comment #35)
> Hi,
> I checked /etc/neutron/neutron.conf on compute node
> ...
> # Send notification to nova when port status changes (boolean value)
> #notify_nova_on_port_status_changes = true
>
> # Send notification to nova when port data (fixed_ips/floatingip) changes so
> # nova can update its cache. (boolean value)
> #notify_nova_on_port_data_changes = true
> ...
>
> There is no nova_url=http://<IP>/v2.1 setting in it.

Ok, can you check it against what is indicated here [0] and let us know the result?

[0] https://access.redhat.com/solutions/1212643
Hi, I cannot see the solution: it's marked "Subscriber exclusive content"! Ping me on IRC (yobshans) on the neutron/rhos-dev/rhos-qe channels. Thanks
Hi, We double-checked the configuration on the controller node:

[root@overcloud-controller-0 ~]# egrep '(^nova_|^notify_nova)' /etc/neutron/neutron.conf
notify_nova_on_port_status_changes=True
notify_nova_on_port_data_changes=True
nova_url=http://172.16.0.10:8774/v2.1/%(tenant_id)s

It looks ok. I'll generate sosreports for the controller and compute nodes. Thanks
Created attachment 1276394 [details] sosreport controller part aa

The original file is large (about 44 MB), so it was split using:
split -b 15M sosreport-overcloud-controller-0.localdomain-20170504153651.tar.xz "sos-contr-"
It can be rejoined with:
cat sos-contr-* > sosreport-contr.tar.gz
Created attachment 1276395 [details] sosreport controller part ab
Created attachment 1276396 [details] sosreport controller part ac
This needs a fresh setup to understand the performance issues, then a couple of patches to fix them. Better to do scale testing on later branches than OSP7.