Description of problem:
I had a failed deployment and wanted to troubleshoot it. It was impossible to get any info because 'heat resource-list -n 5 overcloud' hangs for 1 minute and then returns:

ERROR: <html><body><h1>504 Gateway Time-out</h1> The server didn't respond in time. </body></html>

If you don't pass '-n 5', the operation doesn't time out. In either case, however, you can see heat-engine's CPU usage jump to 100% when calling resource-list (even without -n 5).

How reproducible:
100%

Steps to Reproduce:
1. After a failed deployment, call: heat resource-list -n 5 overcloud | egrep -v COMPLETE

Actual results:
100% CPU usage, and the operation times out after ~1 minute.

Expected results:
This is not supposed to be an expensive operation and should be very fast.
Unfortunately, it is an expensive operation. What's your deployment like? How many nodes do you have in the overcloud? What's the undercloud sizing? Can you attach logs from the heat-engine? Thanks.
Yeah, this is expensive and not likely to get cheaper any time soon. It's worth looking at whether we need to adjust the timeout on the load balancer though, assuming that's what's timing out.
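To illustrate why the recursive listing is costly: with '-n 5', Heat has to issue a resource listing for every nested stack up to five levels deep, so the work grows with the size of the nested-stack tree rather than with the single top-level stack. A toy model of that call pattern (not Heat's actual code; the function name and tree shape are invented for illustration):

```python
def count_resource_list_calls(stack, depth):
    """Count the resource-list calls needed to walk a nested-stack
    tree to the given depth (toy model, not Heat's implementation)."""
    calls = 1  # one listing for this stack's own resources
    if depth > 0:
        for child in stack.get("nested", []):
            calls += count_resource_list_calls(child, depth - 1)
    return calls

# A toy overcloud-like tree: 5 nested stacks, each with 3 children.
tree = {"nested": [{"nested": [{}, {}, {}]} for _ in range(5)]}
print(count_resource_list_calls(tree, 5))  # 21 calls even for this small tree
```

Each of those calls is a separate database/RPC round trip, which is why the recursive form is so much slower than a flat listing.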
Created attachment 1219454 [details]
heat logs

I am attaching the logs. Apologies that these are the complete logs; they also include other tests I ran. I was trying to deploy 3 controllers and 2 computes on a bare metal setup. I tried to enable SSL on the overcloud, and I think the deployment failed because I didn't set the keys and certificates in the template. Can we release with such a basic operation not working?
We can see the 2 failures in the API log when nested_depth is specified. The calls took around 64s to finish, so just past a haproxy timeout, I presume. It is slow; I'm not sure we'll get around to fixing that now, though. The command "openstack stack failures list" is meant to do what you'd like to do, I think; maybe that mitigates this issue.
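For reference, a sketch of the suggested mitigation (assumes python-openstackclient with the heatclient plugin installed and cloud credentials sourced; exact output shape varies by version):

```shell
# Instead of the expensive recursive listing:
#   heat resource-list -n 5 overcloud | egrep -v COMPLETE
# the failures subcommand walks only the failed branches of the stack tree:
openstack stack failures list overcloud

# Show full error messages rather than truncated ones:
openstack stack failures list --long overcloud
```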
A patch has been proposed upstream to bump the HAProxy timeout to 2 minutes.
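For reference, the change amounts to something like the following in the HAProxy configuration on the undercloud (illustrative fragment; the exact section names and file layout depend on how the config is generated for the deployment):

```
# /etc/haproxy/haproxy.cfg (fragment, illustrative)
listen heat_api
  timeout client 2m
  timeout server 2m
```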
So it's *really* not helping that we reduced the number of engine workers to 2. That's going to make everything painfully slow. Fix here: https://review.openstack.org/#/c/399619/
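On the undercloud, the worker count is controlled by heat.conf; the fix raises it back up, along the lines of (illustrative fragment; see the linked review for the exact change):

```
# /etc/heat/heat.conf (fragment, illustrative)
[DEFAULT]
# Was reduced to 2, which serializes expensive recursive calls
# across very few workers; raise it back to 4.
num_engine_workers = 4
```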
Is there any desire to backport https://review.openstack.org/399619 to stable/newton, or should we just close this bug?
*** Bug 1396391 has been marked as a duplicate of this bug. ***
Apparently we are not going to increase the number of workers back to 4 on OSP 10, so I'm going to close this bug since all of the other patches have been released.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days