Description of problem:
Several instances changed to state Error when trying to delete them under a REST API load test. The message "Request to http://172.16.0.11:9696/v2.0/ports/186d784e-dbe4-46a8-9807-3708cfefb3c5.json timed out" is shown in Horizon (see attached screenshots for details).

In nova.log:

2017-06-07 18:47:10.768 19858 ERROR nova.compute.manager [req-fd8b77cb-9f65-4102-9634-cd9bf803e220 e79bc38b8bcd4deabfd5cbaa240f1939 9ec91896f9ac47f98546715fb631f8b9 - - -] [instance: 262b189d-8edd-4d8a-8161-a0af54c73010] Failed to deallocate network for instance. Error: Request to http://172.16.0.11:9696/v2.0/ports/186d784e-dbe4-46a8-9807-3708cfefb3c5.json timed out
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [req-fd8b77cb-9f65-4102-9634-cd9bf803e220 e79bc38b8bcd4deabfd5cbaa240f1939 9ec91896f9ac47f98546715fb631f8b9 - - -] [instance: 262b189d-8edd-4d8a-8161-a0af54c73010] Setting instance vm_state to ERROR
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010] Traceback (most recent call last):
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2428, in do_terminate_instance
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     self._delete_instance(context, instance, bdms, quotas)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/hooks.py", line 154, in inner
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     rv = f(*args, **kwargs)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2391, in _delete_instance
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     quotas.rollback()
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     self.force_reraise()
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     six.reraise(self.type_, self.value, self.tb)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2355, in _delete_instance
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     self._shutdown_instance(context, instance, bdms)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2260, in _shutdown_instance
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     self._try_deallocate_network(context, instance, requested_networks)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2182, in _try_deallocate_network
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     self._set_instance_obj_error_state(context, instance)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     self.force_reraise()
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     six.reraise(self.type_, self.value, self.tb)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2176, in _try_deallocate_network
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     self._deallocate_network(context, instance, requested_networks)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1664, in _deallocate_network
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     context, instance, requested_networks=requested_networks)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 1178, in deallocate_for_instance
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     self._delete_ports(neutron, instance, ports, raise_if_fail=True)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 1141, in _delete_ports
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     neutron.delete_port(port)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 99, in wrapper
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     ret = obj(*args, **kwargs)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 772, in delete_port
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     return self.delete(self.port_path % (port))
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 99, in wrapper
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     ret = obj(*args, **kwargs)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 357, in delete
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     headers=headers, params=params)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 99, in wrapper
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     ret = obj(*args, **kwargs)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 338, in retry_request
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     headers=headers, params=params)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 99, in wrapper
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     ret = obj(*args, **kwargs)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 289, in do_request
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     resp, replybody = self.httpclient.do_request(action, method, body=body)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/neutronclient/client.py", line 314, in do_request
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     return self.request(url, method, **kwargs)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/neutronclient/client.py", line 302, in request
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     resp = super(SessionClient, self).request(*args, **kwargs)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 142, in request
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     return self.session.request(url, method, **kwargs)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/positional/__init__.py", line 101, in inner
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     return wrapped(*args, **kwargs)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/keystoneauth1/session.py", line 616, in request
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     resp = send(**kwargs)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/keystoneauth1/session.py", line 681, in _send_request
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     raise exceptions.ConnectTimeout(msg)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010] ConnectTimeout: Request to http://172.16.0.11:9696/v2.0/ports/186d784e-dbe4-46a8-9807-3708cfefb3c5.json timed out

Version-Release number of selected component (if applicable):
rhos-release 11 -p 2017-05-30.1

How reproducible:
It can be reproduced by running the REST API performance test, which simulates load on OpenStack with 20 concurrent threads. Each thread creates an instance and performs the following actions:
01.NOVA.GET.Images
02.NOVA.GET.Flavors
03.NEUTRON.POST.Create.Network
04.NEUTRON.POST.Create.Subnet
05.NOVA.POST.Boot.Server
00.NOVA.GET.Server.Details
06.NOVA.POST.Pause.Server
07.NOVA.POST.Unpause.Server
08.NOVA.POST.Suspend.Server
09.NOVA.POST.Resume.Server
10.NOVA.POST.Soft.Reboot.Server
11.NOVA.POST.Hard.Reboot.Server
12.NOVA.POST.Stop.Server
13.NOVA.POST.Start.Server
14.NOVA.POST.Create.Image
15.NOVA.GET.Image.Id
16.NOVA.DELETE.Image
17.NOVA.DELETE.Server
18.NEUTRON.DELETE.Network

Steps to Reproduce:
1.
2.
3.

Actual results:
Several instances in state Error

Expected results:
No instances in state Error

Additional info:
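For reference, the concurrency shape of the load test described above can be sketched roughly as follows. This is only an illustration of the 20-thread pattern, not the actual test harness; `run_scenario` is a placeholder where the real Nova/Neutron REST calls would go:

```python
# Sketch of the load pattern: 20 concurrent threads, each running the
# create/act/delete scenario. The real harness issues REST calls to
# Nova/Neutron; here each action is just recorded as a placeholder.
from concurrent.futures import ThreadPoolExecutor

ACTIONS = [
    "NOVA.GET.Images", "NOVA.GET.Flavors",
    "NEUTRON.POST.Create.Network", "NEUTRON.POST.Create.Subnet",
    "NOVA.POST.Boot.Server", "NOVA.GET.Server.Details",
    "NOVA.POST.Pause.Server", "NOVA.POST.Unpause.Server",
    "NOVA.POST.Suspend.Server", "NOVA.POST.Resume.Server",
    "NOVA.POST.Soft.Reboot.Server", "NOVA.POST.Hard.Reboot.Server",
    "NOVA.POST.Stop.Server", "NOVA.POST.Start.Server",
    "NOVA.POST.Create.Image", "NOVA.GET.Image.Id",
    "NOVA.DELETE.Image", "NOVA.DELETE.Server",
    "NEUTRON.DELETE.Network",
]

def run_scenario(thread_id):
    # Placeholder: each entry would be one REST request against the cloud.
    return [(thread_id, action) for action in ACTIONS]

with ThreadPoolExecutor(max_workers=20) as pool:
    all_results = list(pool.map(run_scenario, range(20)))

print(len(all_results), "threads x", len(all_results[0]), "actions")
```

The point of reproducing it this way is that the instance-delete calls (17.NOVA.DELETE.Server) arrive at Nova while 19 other API operations per thread are still in flight, which is when the port-delete timeouts show up.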
Created attachment 1285907 [details] Horizon screenshot of instance details
Created attachment 1285908 [details] Horizon screenshot of instance action log
Created attachment 1285909 [details] part of nova log related to failed instance
Created attachment 1285910 [details] libvirt qemu instance log
Created attachment 1285911 [details] full nova log
The error "Request to http://172.16.0.11:9696/v2.0/ports/186d784e-dbe4-46a8-9807-3708cfefb3c5.json timed out" is a timeout while Nova is calling Neutron to delete the ports associated with the instance. We need a bit more information to determine what's going wrong here. Since we are timing out waiting for Neutron to respond to us, we will want to look at:

1. neutron logs
2. rabbitmq state (that is, is it overloaded with messages when this happens? is it related to Neutron's delay in responding to Nova?) This isn't as important as seeing the Neutron logs, so if it's not handy, that's okay. We mainly need the Neutron logs.
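For context, the timeout that fired here is the client-side HTTP timeout on the keystoneauth session Nova uses to talk to Neutron, which (if I recall correctly) is configurable in nova.conf's [neutron] section. A config sketch only; treat the exact option name and default as assumptions to verify against the deployed release's config reference, and note that raising it would merely mask Neutron's slowness rather than fix it:

```ini
# nova.conf (sketch -- verify the option name against the config reference)
[neutron]
# Seconds to wait for an HTTP response from Neutron before giving up.
timeout = 60
```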
Hi, I'm deploying rhos11 right now. I'll provide the requested information soon. Regards, Yuri
Created attachment 1288423 [details] compute-0-nova-compute.log
Created attachment 1288439 [details] controller-0-server.log
Created attachment 1288440 [details] controller-0-server.log-20170616.gz
Created attachment 1288441 [details] controller-1-server.log
Created attachment 1288442 [details] controller-1-server.log-20170616.gz
Created attachment 1288443 [details] controller-2-server.log
Created attachment 1288444 [details] controller-2-server.log-20170616.gz
Created attachment 1288445 [details] rabbit
Created attachment 1288446 [details] rabbit
I spent some time investigating this, and what I see so far is that Neutron appears to be taking a long time (a little over a minute) to reply to nova-compute's REST API request to delete the Neutron port associated with the instance being deleted.

The Neutron server.log-20170616.gz is light on details as it's not logging at level DEBUG. It's only logging GET/PUT/POST/DELETE along with the HTTP status code and timing -- that's it. To get to the bottom of this, I think we need to run Neutron logging at level DEBUG the next time we load test. In general, I think we're going to need a routine where we run all of the main OpenStack services (Nova, Neutron, Cinder, Glance) at DEBUG when we see major performance degradation, to find root causes.

So, Neutron is taking > 60 seconds to reply to Nova and we don't know anything about why. We would want to see what Neutron is doing during those 60 seconds and whether it's experiencing RPC timeouts. Maybe rabbitmq is overloaded with messages.

Here are some log excerpts from an example I traced:

Compute-0:

2017-06-15 18:38:15.701 19863 INFO nova.compute.manager [req-71054d3d-6674-4dfa-a15a-ff78fb7c9522 c609a31ba3fb4699ae5d2c14f777270b 503c0495b8154e4e8217135bb3e2b4a6 - - -] [instance: b010a39b-f1f7-46be-aa5e-e3684ceb4bd9] Terminating instance

Looks like the python global socket timeout is around 30 seconds?

2017-06-15 18:38:47.652 19863 ERROR nova.compute.manager [req-71054d3d-6674-4dfa-a15a-ff78fb7c9522 c609a31ba3fb4699ae5d2c14f777270b 503c0495b8154e4e8217135bb3e2b4a6 - - -] [instance: b010a39b-f1f7-46be-aa5e-e3684ceb4bd9] Failed to deallocate network for instance. Error: Request to http://172.16.0.17:9696/v2.0/ports/3ebaa036-9432-469c-b315-9caee0b3904a.json timed out

2017-06-15 18:39:26.333 19863 INFO nova.compute.manager [req-e29279ba-2af7-4f47-9c32-b68a0b47ec3d c609a31ba3fb4699ae5d2c14f777270b 503c0495b8154e4e8217135bb3e2b4a6 - - -] [instance: b010a39b-f1f7-46be-aa5e-e3684ceb4bd9] Took 0.82 seconds to deallocate network for instance.

Neutron server on controller-0:

It took a little over a minute after nova-compute started deleting the instance before the request to delete the port shows up in the Neutron log -- so 30 seconds after Nova had already timed out.

2017-06-15 18:39:21.020 28716 INFO neutron.wsgi [req-286b99cb-85ab-45eb-988c-a1500b88d6af c609a31ba3fb4699ae5d2c14f777270b 503c0495b8154e4e8217135bb3e2b4a6 - - -] 172.16.0.14 - - [15/Jun/2017 18:39:21] "DELETE /v2.0/ports/3ebaa036-9432-469c-b315-9caee0b3904a.json HTTP/1.1" 204 149 63.368901
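The timing above shows the shape of this failure: the client (nova-compute) gives up after ~30 seconds, while the server (Neutron) completes the DELETE about 30 seconds later and returns 204 to a caller that is no longer listening. A minimal self-contained toy (plain sockets, nothing Nova/Neutron-specific) illustrating that shape:

```python
# Toy illustration: a client-side timeout fires even though the server
# eventually completes the request -- analogous to nova-compute's
# ConnectTimeout while Neutron finished the port DELETE ~30s later.
import socket
import threading
import time

done = threading.Event()

def slow_server(sock):
    # "Neutron": slow to process the request, but it does finish the work.
    conn, _ = sock.accept()
    conn.recv(1024)
    time.sleep(0.5)                    # processing outlasts the client's patience
    conn.sendall(b"HTTP/1.1 204 No Content\r\n\r\n")
    conn.close()
    done.set()

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]
threading.Thread(target=slow_server, args=(server,), daemon=True).start()

# "Nova": times out well before the reply arrives.
client = socket.socket()
client.settimeout(0.1)
client.connect(("127.0.0.1", port))
client.sendall(b"DELETE /v2.0/ports/xyz HTTP/1.1\r\n\r\n")
timed_out = False
try:
    client.recv(1024)
except socket.timeout:
    timed_out = True                   # what nova-compute logged as ConnectTimeout

done.wait(2)
print("client timed out:", timed_out)
print("server finished anyway:", done.is_set())
```

This is also why the later "Took 0.82 seconds to deallocate network" line succeeds: by the time the delete is retried, Neutron has already done the slow work.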
There's also a gap in /var/log/messages where it dropped messages during the timestamp range of the example trace in comment 17:

Jun 15 18:36:13 overcloud-controller-0 rsyslogd-2177: imjournal: begin to drop messages due to rate-limiting
Jun 15 18:43:46 overcloud-controller-0 rsyslogd-2177: imjournal: 57976 messages lost due to rate-limiting
We might need the following in /etc/systemd/journald.conf:

[Journal]
RateLimitBurst=0

similar to https://review.openstack.org/#/c/462725, to avoid losing log messages (journald needs a restart, e.g. systemctl restart systemd-journald, for the change to take effect).
As discussed on the rh-openstack-dev@ thread [1], this is part of a larger problem with the database being overwhelmed during the load tests and we're going to need to re-run the tests and deep dive into investigating the root cause of the performance degradation in RHOSP 11. [1] http://post-office.corp.redhat.com/archives/rh-openstack-dev/2017-June/msg00392.html