Bug 1108638
| Summary: | nova-compute Timeout while waiting on RPC response errors result in hosts being marked as down. | | |
| --- | --- | --- | --- |
| Product: | Red Hat OpenStack | Reporter: | Lee Yarwood <lyarwood> |
| Component: | openstack-nova | Assignee: | Russell Bryant <rbryant> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Toure Dunnon <tdunnon> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.0 | CC: | agpxnet, avozza, hdemonte, jdexter, kgiusti, lyarwood, michele, ndipanov, sgordon, yeylon |
| Target Milestone: | --- | Keywords: | Unconfirmed, ZStream |
| Target Release: | 5.0 (RHEL 7) | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-nova-2013.2.3-9.el6ost | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-10-08 11:59:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Lee Yarwood
2014-06-12 11:13:44 UTC
(In reply to Lee Yarwood from comment #0)
> Steps to Reproduce:
> This appears to correlate with maintenance operations (mass instance
> creation/deletion) carried out in the environment, more details to follow.

Can you provide some more detail on this? Exactly how many instances are created/destroyed? How quickly, and over what time frame?

Based on this hint, I'd like to check on some common load issues. Specifically, we should check whether the nova-conductor service is getting overloaded. Can you check what the CPU load for nova-conductor looks like when this occurs?

If the conductor service isn't able to keep up during this time, it could cause timeouts like those seen in the logs. I recommend setting the [conductor] workers= option in nova.conf equal to the number of CPUs on the host. Even if this isn't the issue, it's still the configuration I would recommend.

The only problem with this is that you mentioned later that the workaround was to restart qpidd and that restarting the OpenStack services did not fix it. If this problem is reproducible, is that the solution every time?

(In reply to Russell Bryant from comment #9)
> Can you provide some more detail on this? Exactly how many instances are
> created/destroyed? How quickly, and over what time frame?

As few as 7 instances, according to the customer, over a span of a few minutes in the middle of the night. We are planning to monitor an attempt to reproduce this issue on Sunday and should be able to provide a much more detailed account of how the customer is reproducing it then.

> Based on this hint, I'd like to check on some common load issues.
> Specifically, we should check whether the nova-conductor service is
> getting overloaded. Can you check what the CPU load for nova-conductor
> looks like when this occurs?

ACK, I'll add this to the action plan for the next attempt to reproduce this on Sunday.

> The only problem with this is that you mentioned later that the workaround
> was to restart qpidd and that restarting the OpenStack services did not fix
> it. If this problem is reproducible, is that the solution every time?

Yes, AFAIK this has been the only way to recover; again, I'll confirm on Sunday. Leaving the needinfo in place as a reminder to update this bug on Sunday.

Hi Lee - just to cover all bases, can you get the syslog output from qpidd when you reproduce? Also, can you run the following commands against qpidd and capture the output?

    qpid-stat -q
    qpid-stat -u
    qpid-stat -e

It's best to run these commands periodically while you reproduce the problem. Lastly, can you run the following qpidd monitoring tools while you reproduce the problem, capturing their output?

    qpid-printevents
    qpid-queue-stats

Thanks!

Changes implemented as requested at 14:30, Sunday 15 June. I also wrote a bash script to log the output of qpid-stat and nova service-list every 5 minutes. We won't run the stress tests now, since the developer is not available.
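For reference, a concrete form of the [conductor] workers change recommended in comment 9 above. This is a sketch only: the source confirms the option name, but openstack-config (the ini editor from the openstack-utils package) and nproc are assumptions standing in for any way of editing nova.conf and counting host CPUs.

```bash
# Sketch: apply the recommendation of one conductor worker per host CPU.
# This results in the following in /etc/nova/nova.conf:
#
#   [conductor]
#   workers = <number of CPUs on the host>
#
openstack-config --set /etc/nova/nova.conf conductor workers "$(nproc)"

# Restart the conductor so the new worker count takes effect.
service openstack-nova-conductor restart
```

The actual logging script mentioned in the last comment is not attached to the bug; a rough equivalent, assuming the qpid-tools and python-novaclient packages are installed and admin credentials are already sourced for the nova CLI, might look like this (the log path is hypothetical):

```bash
#!/bin/bash
# Rough reconstruction of the monitoring loop described above (the
# actual script is not attached to this bug): every 5 minutes, log the
# qpidd queue, subscription, and exchange state plus the nova service list.
LOG=/var/log/rpc-timeout-monitor.log   # hypothetical path

while true; do
    {
        date --utc
        echo '--- qpid-stat -q ---';        qpid-stat -q
        echo '--- qpid-stat -u ---';        qpid-stat -u
        echo '--- qpid-stat -e ---';        qpid-stat -e
        echo '--- nova service-list ---';   nova service-list
    } >> "$LOG" 2>&1
    sleep 300
done
```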
Thanks all for your efforts; we have now left the customer (after a successful demo in front of the CIO of Very Big Telco(TM)). Next week I'll apply the python-qpid patch remotely and coordinate with the customer on what to do next.

Hi all. I have the same issue. How can I fix it?

OK, solved: restarting the conductor on the controller and then the nova-compute service on the compute node (in this order) fixed it. Not sure why this happens... :\

(In reply to avozza from comment #23)
> Thanks all for your efforts; we have now left the customer (after a
> successful demo in front of the CIO of Very Big Telco(TM)). Next week I'll
> apply the python-qpid patch remotely and coordinate with the customer on
> what to do next.

Hello! Anything else you need from us?

Hi Russell, not from my side; I consider the issue closed, in light of the customer moving on. But I'm CC'ing Henri just in case; he's taking care of that particular subject.

Hi Russell,

What became of the build mentioned here? Has it been shipped via official channels?

Thanks,
Steve

(In reply to Stephen Gordon from comment #28)
> What became of the build mentioned here? Has it been shipped via official
> channels?

The Nova patch mentioned here is in openstack-nova-2013.2.3-9.el6ost via bug 1085006.
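For anyone else landing here with the same symptom: the recovery sequence reported in the comments above (restart the conductor first, then nova-compute) translates to something like the following. The service names assume a RHEL 6 RHOS deployment and may differ on other distributions.

```bash
# Recovery order reported in the comments above: conductor first,
# then compute. Service names assume RHEL 6 / RHOS packaging.

# On the controller node:
service openstack-nova-conductor restart

# Then on each affected compute node:
service openstack-nova-compute restart
```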