Bug 1108638
| Summary: | nova-compute Timeout while waiting on RPC response errors result in hosts being marked as down. | | |
| --- | --- | --- | --- |
| Product: | Red Hat OpenStack | Reporter: | Lee Yarwood <lyarwood> |
| Component: | openstack-nova | Assignee: | Russell Bryant <rbryant> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Toure Dunnon <tdunnon> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.0 | CC: | agpxnet, avozza, hdemonte, jdexter, kgiusti, lyarwood, michele, ndipanov, sgordon, yeylon |
| Target Milestone: | --- | Keywords: | Unconfirmed, ZStream |
| Target Release: | 5.0 (RHEL 7) | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-nova-2013.2.3-9.el6ost | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-10-08 11:59:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Lee Yarwood
2014-06-12 11:13:44 UTC
(In reply to Lee Yarwood from comment #0)
> Steps to Reproduce:
> This appears to correlate with maintenance operations (mass instance
> creation/deletion) carried out in the environment, more details to follow.

Can you provide some more detail on this? Exactly how many instances are created/destroyed? How quickly, and over what time frame?

Based on this hint, I'd like to check on some common load issues. Specifically, we should check whether the nova-conductor service is getting overloaded. Can you check what the CPU load for nova-conductor looks like when this occurs?

If the conductor service isn't able to keep up during this time, it could cause timeouts like those seen in the logs. I recommend setting the [conductor] workers= option in nova.conf equal to the number of CPUs on the host. Even if this isn't the issue, it's still the configuration I would recommend.

The only problem with this is that you mentioned later that the workaround was to restart qpidd and that restarting the OpenStack services did not fix it. If this problem is reproducible, is that the solution every time?

(In reply to Russell Bryant from comment #9)
> Can you provide some more detail on this? Exactly how many instances are
> created/destroyed? How quickly, and over what time frame?

As few as 7 instances, according to the customer, over a span of a few minutes in the middle of the night. We are planning to monitor an attempt to reproduce this issue on Sunday and should be able to provide a much more detailed account of how the customer is reproducing it then.

> Based on this hint, I'd like to check on some common load issues.
> Specifically, we should check whether the nova-conductor service is
> getting overloaded. Can you check what the CPU load for nova-conductor
> looks like when this occurs?

ACK, I'll add this to the action plan for the next attempt to reproduce this on Sunday.

> The only problem with this is that you mentioned later that the workaround
> was to restart qpidd and that restarting the OpenStack services did not fix
> it. If this problem is reproducible, is that the solution every time?

Yes, AFAIK this has been the only way to recover; again, I'll confirm on Sunday. Leaving the needinfo in place as a reminder to update this bug on Sunday.

Hi Lee - just to cover all bases, can you get the syslog output from qpidd when you reproduce? Also, can you run the following commands against qpidd and capture the output?

    qpid-stat -q
    qpid-stat -u
    qpid-stat -e

It's best to run these commands periodically while you reproduce the problem. Lastly, can you run the following qpidd monitoring tools while you reproduce the problem, capturing their output?

    qpid-printevents
    qpid-queue-stats

Thanks!

Changes implemented as requested at 14:30, Sunday 15 June. I also wrote a bash script to log the output of qpid-stat and nova service-list every 5 minutes. We won't run the stress tests now, since the developer is not available.
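For reference, a concrete form of the [conductor] workers change recommended in comment 9 above. This is a sketch only: the source confirms the option name, but openstack-config (the ini editor from the openstack-utils package) and nproc are assumptions standing in for any way of editing nova.conf and counting host CPUs.

```bash
# Sketch: apply the recommendation of one conductor worker per host CPU.
# This results in the following in /etc/nova/nova.conf:
#
#   [conductor]
#   workers = <number of CPUs on the host>
#
openstack-config --set /etc/nova/nova.conf conductor workers "$(nproc)"

# Restart the conductor so the new worker count takes effect.
service openstack-nova-conductor restart
```

The actual logging script mentioned in the last comment is not attached to the bug; a rough equivalent, assuming the qpid-tools and python-novaclient packages are installed and admin credentials are already sourced for the nova CLI, might look like this (the log path is hypothetical):

```bash
#!/bin/bash
# Rough reconstruction of the monitoring loop described above (the
# actual script is not attached to this bug): every 5 minutes, log the
# qpidd queue, subscription, and exchange state plus the nova service list.
LOG=/var/log/rpc-timeout-monitor.log   # hypothetical path

while true; do
    {
        date --utc
        echo '--- qpid-stat -q ---';        qpid-stat -q
        echo '--- qpid-stat -u ---';        qpid-stat -u
        echo '--- qpid-stat -e ---';        qpid-stat -e
        echo '--- nova service-list ---';   nova service-list
    } >> "$LOG" 2>&1
    sleep 300
done
```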
Thanks all for your efforts; we have now left the customer (after a successful demo in front of the CIO of Very Big Telco(TM)). Next week I'll apply the python-qpid patch remotely and coordinate with the customer on what to do next.

Hi all. I have the same issue. How can I fix it?

OK, solved: restarting the conductor on the controller and then the nova-compute service on the compute node (in this order) fixed it. Not sure why this happens... :\

(In reply to avozza from comment #23)
> Thanks all for your efforts; we have now left the customer (after a
> successful demo in front of the CIO of Very Big Telco(TM)). Next week I'll
> apply the python-qpid patch remotely and coordinate with the customer on
> what to do next.

Hello! Anything else you need from us?

Hi Russell, not from my side; I consider the issue closed, in light of the customer moving on. But I'm CC'ing Henri just in case; he's taking care of that particular subject.

Hi Russell,

What became of the build mentioned here? Has it been shipped via official channels?

Thanks,
Steve

(In reply to Stephen Gordon from comment #28)
> What became of the build mentioned here? Has it been shipped via official
> channels?

The Nova patch mentioned here is in openstack-nova-2013.2.3-9.el6ost via bug 1085006.
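For anyone else landing here with the same symptom: the recovery sequence reported in the comments above (restart the conductor first, then nova-compute) translates to something like the following. The service names assume a RHEL 6 RHOS deployment and may differ on other distributions.

```bash
# Recovery order reported in the comments above: conductor first,
# then compute. Service names assume RHEL 6 / RHOS packaging.

# On the controller node:
service openstack-nova-conductor restart

# Then on each affected compute node:
service openstack-nova-compute restart
```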