Bug 1459687
Summary: | Failed to delete instance; its state changed to Error under load test | | |
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Yuri Obshansky <yobshans> |
Component: | openstack-nova | Assignee: | Eoghan Glynn <eglynn> |
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Joe H. Rahme <jhakimra> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | | |
Version: | 11.0 (Ocata) | CC: | berrange, dasmith, eglynn, kchamart, mwitt, sbauza, sferdjao, sgordon, srevivo, vromanso, yobshans |
Target Milestone: | --- | | |
Target Release: | --- | | |
Hardware: | x86_64 | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2017-06-23 15:18:13 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Attachments: | | | |
Description
Yuri Obshansky
2017-06-07 20:09:49 UTC
Created attachment 1285907 [details]
Horizon screenshot of instance details
Created attachment 1285908 [details]
Horizon screenshot of instance action log
Created attachment 1285909 [details]
part of nova log related to failed instance
Created attachment 1285910 [details]
libvirt qemu instance log
Created attachment 1285911 [details]
full nova log
The error "Request to http://172.16.0.11:9696/v2.0/ports/186d784e-dbe4-46a8-9807-3708cfefb3c5.json timed out" is a timeout while Nova is calling Neutron to delete the ports associated with the instance. We need a bit more information to determine what's going wrong here. Since we are timing out waiting for Neutron to respond to us, we will want to look at:

1. The Neutron logs.
2. The rabbitmq state (that is, is it overloaded with messages when this happens? Is it related to Neutron's delay in responding to Nova?). This isn't as important as seeing the Neutron logs, so if it's not handy, that's okay. We mainly need the Neutron logs.

Hi, I'm deploying RHOS 11 right now. I'll provide the requested information soon. Regards, Yuri

Created attachment 1288423 [details]
compute-0-nova-compute.log
Created attachment 1288439 [details]
controller-0-server.log
Created attachment 1288440 [details]
controller-0-server.log-20170616.gz
Created attachment 1288441 [details]
controller-1-server.log
Created attachment 1288442 [details]
controller-1-server.log-20170616.gz
Created attachment 1288443 [details]
controller-2-server.log
Created attachment 1288444 [details]
controller-2-server.log-20170616.gz
Created attachment 1288445 [details]
rabbit
Created attachment 1288446 [details]
rabbit
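One way to capture the rabbitmq state requested above is to dump queue depths on a controller during the load test and look for queues with large backlogs. A minimal sketch follows; the queue names and counts are hypothetical sample data standing in for the output of `rabbitmqctl list_queues name messages consumers`:

```shell
# Hypothetical sample of `rabbitmqctl list_queues name messages consumers`
# output captured during the load test (name, message count, consumer count).
cat > /tmp/queues.txt <<'EOF'
q-agent-notifier.port.update 12034 1
reply_3f1c9a 3 1
notifications.info 45021 0
EOF

# List the deepest queues first; a large backlog with zero consumers
# (e.g. notifications.info here) suggests the message bus is overloaded.
sort -k2,2 -rn /tmp/queues.txt | head -3
```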
I spent some time investigating this, and what I see so far is that Neutron appears to be taking a long time (a little over a minute) to reply to nova-compute's REST API request to delete the Neutron port associated with the instance being deleted.

The Neutron server.log-20170616.gz is light on details because it's not logging at level DEBUG. It only logs GET/PUT/POST/DELETE along with the HTTP status code and timing -- that's it. To get to the bottom of this, I think we need to run Neutron logging at level DEBUG the next time we load test. In general, I think we're going to need a routine where we run all of the main OpenStack services (Nova, Neutron, Cinder, Glance) at DEBUG when we see major performance degradation, so we can find root causes.

So, Neutron is taking > 60 seconds to reply to Nova, and we don't know anything about why. We would want to see what Neutron is doing during those 60 seconds and whether it's experiencing RPC timeouts. Maybe rabbitmq is overloaded with messages.

Here are some log excerpts from an example I traced.

Compute-0:

2017-06-15 18:38:15.701 19863 INFO nova.compute.manager [req-71054d3d-6674-4dfa-a15a-ff78fb7c9522 c609a31ba3fb4699ae5d2c14f777270b 503c0495b8154e4e8217135bb3e2b4a6 - - -] [instance: b010a39b-f1f7-46be-aa5e-e3684ceb4bd9] Terminating instance

Looks like the Python global socket timeout is around 30 seconds?

2017-06-15 18:38:47.652 19863 ERROR nova.compute.manager [req-71054d3d-6674-4dfa-a15a-ff78fb7c9522 c609a31ba3fb4699ae5d2c14f777270b 503c0495b8154e4e8217135bb3e2b4a6 - - -] [instance: b010a39b-f1f7-46be-aa5e-e3684ceb4bd9] Failed to deallocate network for instance.
Error: Request to http://172.16.0.17:9696/v2.0/ports/3ebaa036-9432-469c-b315-9caee0b3904a.json timed out

2017-06-15 18:39:26.333 19863 INFO nova.compute.manager [req-e29279ba-2af7-4f47-9c32-b68a0b47ec3d c609a31ba3fb4699ae5d2c14f777270b 503c0495b8154e4e8217135bb3e2b4a6 - - -] [instance: b010a39b-f1f7-46be-aa5e-e3684ceb4bd9] Took 0.82 seconds to deallocate network for instance.

Neutron server on controller-0: it took a little over a minute after nova-compute started to delete the instance before the request to delete the port shows up in the Neutron log -- so 30 seconds after Nova had already timed out.

2017-06-15 18:39:21.020 28716 INFO neutron.wsgi [req-286b99cb-85ab-45eb-988c-a1500b88d6af c609a31ba3fb4699ae5d2c14f777270b 503c0495b8154e4e8217135bb3e2b4a6 - - -] 172.16.0.14 - - [15/Jun/2017 18:39:21] "DELETE /v2.0/ports/3ebaa036-9432-469c-b315-9caee0b3904a.json HTTP/1.1" 204 149 63.368901

There's also a gap in /var/log/messages where it dropped messages during the timestamp range of the example trace in comment 17:

Jun 15 18:36:13 overcloud-controller-0 rsyslogd-2177: imjournal: begin to drop messages due to rate-limiting
Jun 15 18:43:46 overcloud-controller-0 rsyslogd-2177: imjournal: 57976 messages lost due to rate-limiting

We might need the following in /etc/systemd/journald.conf, similar to https://review.openstack.org/#/c/462725, to avoid losing log messages:

[Journal]
RateLimitBurst=0

As discussed on the rh-openstack-dev@ thread [1], this is part of a larger problem with the database being overwhelmed during the load tests, and we're going to need to re-run the tests and deep dive into investigating the root cause of the performance degradation in RHOSP 11.

[1] http://post-office.corp.redhat.com/archives/rh-openstack-dev/2017-June/msg00392.html
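The cross-service trace above (following one request from nova-compute into neutron-server by request ID and port URL, ordered by timestamp) can be scripted. A minimal sketch, using hypothetical log paths and abbreviated sample lines (real request IDs and entries are much longer):

```shell
# Sketch: correlate one instance-delete request across nova-compute and
# neutron-server logs by grepping its markers and sorting by timestamp.
# The log files below are abbreviated, hypothetical samples.
cd "$(mktemp -d)"
cat > nova-compute.log <<'EOF'
2017-06-15 18:38:15.701 INFO nova.compute.manager [req-71054d3d] Terminating instance
2017-06-15 18:38:47.652 ERROR nova.compute.manager [req-71054d3d] Failed to deallocate network
EOF
cat > neutron-server.log <<'EOF'
2017-06-15 18:39:21.020 INFO neutron.wsgi "DELETE /v2.0/ports/3ebaa036.json HTTP/1.1" 204 149 63.368901
EOF

# Interleave both services' entries for the request, ordered by time;
# the gap between the Nova ERROR and the Neutron DELETE is the delay.
grep -h -e 'req-71054d3d' -e 'DELETE /v2.0/ports' \
    nova-compute.log neutron-server.log | sort
```

Because the ISO-style timestamps sort lexicographically, a plain `sort` is enough to interleave the two logs chronologically.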