Bug 1459687

Summary: Failed to delete Instance, it's state changed to Error under load test
Product: Red Hat OpenStack Reporter: Yuri Obshansky <yobshans>
Component: openstack-novaAssignee: Eoghan Glynn <eglynn>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Joe H. Rahme <jhakimra>
Severity: high Docs Contact:
Priority: unspecified    
Version: 11.0 (Ocata)CC: berrange, dasmith, eglynn, kchamart, mwitt, sbauza, sferdjao, sgordon, srevivo, vromanso, yobshans
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-06-23 15:18:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Horizon screenshot of instance details
none
Horizon screenshot of instance action log
none
part of nova log related to failed instance
none
libvirt qemu instance log
none
full nova log
none
compute-0-nova-compute.log
none
controller-0-server.log
none
controller-0-server.log-20170616.gz
none
controller-1-server.log
none
controller-1-server.log-20170616.gz
none
controller-2-server.log
none
controller-2-server.log-20170616.gz
none
rabbit@overcloud-controller-0.log
none
rabbit@overcloud-controller-0-sasl.log none

Description Yuri Obshansky 2017-06-07 20:09:49 UTC
Description of problem:
Several instances changed to state Error when try to delete them under REST API load test. We can see message "Request to http://172.16.0.11:9696/v2.0/ports/186d784e-dbe4-46a8-9807-3708cfefb3c5.json timed out" in Horizon.
(see attached screenshots for details)
In nova.log 
2017-06-07 18:47:10.768 19858 ERROR nova.compute.manager [req-fd8b77cb-9f65-4102-9634-cd9bf803e220 e79bc38b8bcd4deabfd5cbaa240f1939 9ec91896f9ac47f98546715fb631f8b9 - - -] [instance: 262b189d-8edd-4d8a-8161-a0af54c73010] Failed to deallocate network for instance. Error: Request to http://172.16.0.11:9696/v2.0/ports/186d784e-dbe4-46a8-9807-3708cfefb3c5.json timed out
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [req-fd8b77cb-9f65-4102-9634-cd9bf803e220 e79bc38b8bcd4deabfd5cbaa240f1939 9ec91896f9ac47f98546715fb631f8b9 - - -] [instance: 262b189d-8edd-4d8a-8161-a0af54c73010] Setting instance vm_state to ERROR
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010] Traceback (most recent call last):
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2428, in do_terminate_instance
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     self._delete_instance(context, instance, bdms, quotas)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/hooks.py", line 154, in inner
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     rv = f(*args, **kwargs)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2391, in _delete_instance
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     quotas.rollback()
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     self.force_reraise()
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     six.reraise(self.type_, self.value, self.tb)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2355, in _delete_instance
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     self._shutdown_instance(context, instance, bdms)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2260, in _shutdown_instance
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     self._try_deallocate_network(context, instance, requested_networks)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2182, in _try_deallocate_network
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     self._set_instance_obj_error_state(context, instance)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     self.force_reraise()
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     six.reraise(self.type_, self.value, self.tb)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2176, in _try_deallocate_network
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     self._deallocate_network(context, instance, requested_networks)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1664, in _deallocate_network
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     context, instance, requested_networks=requested_networks)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 1178, in deallocate_for_instance
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     self._delete_ports(neutron, instance, ports, raise_if_fail=True)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 1141, in _delete_ports
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     neutron.delete_port(port)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 99, in wrapper
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     ret = obj(*args, **kwargs)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 772, in delete_port
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     return self.delete(self.port_path % (port))
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 99, in wrapper
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     ret = obj(*args, **kwargs)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 357, in delete
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     headers=headers, params=params)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 99, in wrapper
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     ret = obj(*args, **kwargs)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 338, in retry_request
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     headers=headers, params=params)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 99, in wrapper
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     ret = obj(*args, **kwargs)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 289, in do_request
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     resp, replybody = self.httpclient.do_request(action, method, body=body)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/neutronclient/client.py", line 314, in do_request
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     return self.request(url, method, **kwargs)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/neutronclient/client.py", line 302, in request
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     resp = super(SessionClient, self).request(*args, **kwargs)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 142, in request
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     return self.session.request(url, method, **kwargs)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/positional/__init__.py", line 101, in inner
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     return wrapped(*args, **kwargs)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/keystoneauth1/session.py", line 616, in request
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     resp = send(**kwargs)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]   File "/usr/lib/python2.7/site-packages/keystoneauth1/session.py", line 681, in _send_request
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010]     raise exceptions.ConnectTimeout(msg)
2017-06-07 18:47:11.023 19858 ERROR nova.compute.manager [instance: 262b189d-8edd-4d8a-8161-a0af54c73010] ConnectTimeout: Request to http://172.16.0.11:9696/v2.0/ports/186d784e-dbe4-46a8-9807-3708cfefb3c5.json timed out

Version-Release number of selected component (if applicable):
rhos-release 11   -p 2017-05-30.1

How reproducible:
It could be reproduced by run the REST API performance test
which simulates load on Openstack with 20 concurrent threads.
Each thread create instance and perform actions:
01.NOVA.GET.Images
02.NOVA.GET.Flavors
03.NEUTRON.POST.Create.Network
04.NEUTRON.POST.Create.Subnet
05.NOVA.POST.Boot.Server
00.NOVA.GET.Server.Details
06.NOVA.POST.Pause.Server
07.NOVA.POST.Unpause.Server
08.NOVA.POST.Suspend.Server
09.NOVA.POST.Resume.Server
10.NOVA.POST.Soft.Reboot.Server
11.NOVA.POST.Hard.Reboot.Server
12.NOVA.POST.Stop.Server
13.NOVA.POST.Start.Server
14.NOVA.POST.Create.Image
15.NOVA.GET.Image.Id
16.NOVA.DELETE.Image
17.NOVA.DELETE.Server
18.NEUTRON.DELETE.Network

Steps to Reproduce:
1.
2.
3.

Actual results:
Several instances in state Error

Expected results:
No instances in state Error

Additional info:

Comment 1 Yuri Obshansky 2017-06-07 20:10:57 UTC
Created attachment 1285907 [details]
Horizon screenshot of instance details

Comment 2 Yuri Obshansky 2017-06-07 20:11:31 UTC
Created attachment 1285908 [details]
Horizon screenshot of instance action log

Comment 3 Yuri Obshansky 2017-06-07 20:12:09 UTC
Created attachment 1285909 [details]
part of nova log related to failed instance

Comment 4 Yuri Obshansky 2017-06-07 20:12:59 UTC
Created attachment 1285910 [details]
libvirt qemu instance log

Comment 5 Yuri Obshansky 2017-06-07 20:13:32 UTC
Created attachment 1285911 [details]
full nova log

Comment 6 melanie witt 2017-06-14 23:08:11 UTC
The error "Request to http://172.16.0.11:9696/v2.0/ports/186d784e-dbe4-46a8-9807-3708cfefb3c5.json timed out" is a timeout while Nova is calling to Neutron to delete the ports associated with the instance.

We need a bit more information to determine what's going wrong here. Since we are timing out waiting for Neutron to respond to us, we will want to look at:

1. neutron logs

2. rabbitmq state (that is, is it overloaded with messages when this happens? is it related to Neutron's delay in responding to Nova?) This isn't as important as seeing the Neutron logs, so if it's not handy, that's okay. We mainly need the Neutron logs.

Comment 7 Yuri Obshansky 2017-06-15 15:55:52 UTC
Hi, 
I'm deploying rhos11 right now. 
I'll provide requested information soon.
Regards,
Yuri

Comment 8 melanie witt 2017-06-16 17:44:09 UTC
Created attachment 1288423 [details]
compute-0-nova-compute.log

Comment 9 melanie witt 2017-06-16 18:04:03 UTC
Created attachment 1288439 [details]
controller-0-server.log

Comment 10 melanie witt 2017-06-16 18:04:46 UTC
Created attachment 1288440 [details]
controller-0-server.log-20170616.gz

Comment 11 melanie witt 2017-06-16 18:05:53 UTC
Created attachment 1288441 [details]
controller-1-server.log

Comment 12 melanie witt 2017-06-16 18:07:16 UTC
Created attachment 1288442 [details]
controller-1-server.log-20170616.gz

Comment 13 melanie witt 2017-06-16 18:09:10 UTC
Created attachment 1288443 [details]
controller-2-server.log

Comment 14 melanie witt 2017-06-16 18:09:57 UTC
Created attachment 1288444 [details]
controller-2-server.log-20170616.gz

Comment 15 melanie witt 2017-06-16 18:10:46 UTC
Created attachment 1288445 [details]
rabbit

Comment 16 melanie witt 2017-06-16 18:11:11 UTC
Created attachment 1288446 [details]
rabbit

Comment 17 melanie witt 2017-06-16 21:03:31 UTC
I spent some time investigating this and what I see so far is that it appears Neutron is taking a long time (a little over a minute) to reply to nova-compute's REST API request to delete the Neutron port associated with the instance being deleted.

The Neutron server.log-20170616.gz is light on details as it's not logging at level DEBUG. It's only logging GET/PUT/POST/DELETE along with HTTP status code and timing -- that's it. To get more to the bottom of this, I think we need to run Neutron logging at level DEBUG next time we load test.

In general, I think we're going to need a routine where we run all of the main openstack services (Nova, Neutron, Cinder, Glance) at DEBUG when we see major performance degradation, to find root causes.

So, Neutron is taking > 60 seconds to reply to Nova and we don't know anything about why. We would want to see what Neutron is doing during the 60 seconds and see whether it's experiencing RPC timeouts. Maybe rabbitmq is overloaded with messages.

Here are some log excerpts from an example I traced:

Compute-0:
 
2017-06-15 18:38:15.701 19863 INFO nova.compute.manager [req-71054d3d-6674-4dfa-a15a-ff78fb7c9522 c609a31ba3fb4699ae5d2c14f777270b 503c0495b8154e4e8217135bb3e2b4a6 - - -] [instance: b010a39b-f1f7-46be-aa5e-e3684ceb4bd9] Terminating instance
 
Looks like the python global socket timeout is around 30 seconds?
 
2017-06-15 18:38:47.652 19863 ERROR nova.compute.manager [req-71054d3d-6674-4dfa-a15a-ff78fb7c9522 c609a31ba3fb4699ae5d2c14f777270b 503c0495b8154e4e8217135bb3e2b4a6 - - -] [instance: b010a39b-f1f7-46be-aa5e-e3684ceb4bd9] Failed to deallocate network for instance. Error: Request to http://172.16.0.17:9696/v2.0/ports/3ebaa036-9432-469c-b315-9caee0b3904a.json timed out
 
2017-06-15 18:39:26.333 19863 INFO nova.compute.manager [req-e29279ba-2af7-4f47-9c32-b68a0b47ec3d c609a31ba3fb4699ae5d2c14f777270b 503c0495b8154e4e8217135bb3e2b4a6 - - -] [instance: b010a39b-f1f7-46be-aa5e-e3684ceb4bd9] Took 0.82 seconds to deallocate network for instance.
 
 
Neutron server on controller-0:
 
Took a little over a minute after nova-compute started to delete the instance before the log of the request to delete the port shows up at Neutron. So 30 seconds after Nova already timed out.
 
2017-06-15 18:39:21.020 28716 INFO neutron.wsgi [req-286b99cb-85ab-45eb-988c-a1500b88d6af c609a31ba3fb4699ae5d2c14f777270b 503c0495b8154e4e8217135bb3e2b4a6 - - -] 172.16.0.14 - - [15/Jun/2017 18:39:21] "DELETE /v2.0/ports/3ebaa036-9432-469c-b315-9caee0b3904a.json HTTP/1.1" 204 149 63.368901

Comment 18 melanie witt 2017-06-16 21:22:01 UTC
There's also a gap in /var/log/messages where it dropped messages during the timestamp range of the example trace in comment 17:

Jun 15 18:36:13 overcloud-controller-0 rsyslogd-2177: imjournal: begin to drop messages due to rate-limiting
Jun 15 18:43:46 overcloud-controller-0 rsyslogd-2177: imjournal: 57976 messages lost due to rate-limiting

Comment 19 melanie witt 2017-06-16 22:43:34 UTC
We might need in /etc/systemd/journald.conf:

[Journal]
RateLimitBurst=0

similar to https://review.openstack.org/#/c/462725 to avoid losing log messages.

Comment 20 melanie witt 2017-06-23 15:18:13 UTC
As discussed on the rh-openstack-dev@ thread [1], this is part of a larger problem with the database being overwhelmed during the load tests and we're going to need to re-run the tests and deep dive into investigating the root cause of the performance degradation in RHOSP 11.

[1] http://post-office.corp.redhat.com/archives/rh-openstack-dev/2017-June/msg00392.html