Description of problem:

Customer is seeing heat consume memory and CPU resources on the VM it runs on. In the logs we can see timeouts and tracebacks:

ERROR root [-] Unexpected exception occurred 57 time(s)... retrying.
TRACE root Traceback (most recent call last):
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/utils/excutils.py", line 92, in inner_func
TRACE root     return infunc(*args, **kwargs)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/_executors/impl_eventlet.py", line 87, in _executor_thread
TRACE root     incoming = self.listener.poll()
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 129, in poll
TRACE root     self.conn.consume(limit=1)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_qpid.py", line 711, in consume
TRACE root     six.next(it)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_qpid.py", line 641, in iterconsume
TRACE root     yield self.ensure(_error_callback, _consume)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_qpid.py", line 580, in ensure
TRACE root     return method()
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_qpid.py", line 632, in _consume
TRACE root     nxt_receiver = self.session.next_receiver(timeout=timeout)
TRACE root   File "<string>", line 6, in next_receiver
TRACE root   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 674, in next_receiver
TRACE root     if self._ecwait(lambda: self.incoming, timeout):
TRACE root   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
TRACE root     result = self._ewait(lambda: self.closed or predicate(), timeout)
TRACE root   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 580, in _ewait
TRACE root     result = self.connection._ewait(lambda: self.error or predicate(), timeout)
TRACE root   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 219, in _ewait
TRACE root     self.check_error()
TRACE root   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 212, in check_error
TRACE root     raise e
TRACE root error: [Errno 32] Broken pipe

And in the api log file:

DEBUG root [req-2f613501-c452-48d3-8d71-04895830141e ] Calling <heat.api.openstack.v1.stacks.StackController object at 0x4a85e10> : index __call__ /usr/lib/python2.7/site-packages/heat/common/wsgi.py:630
ERROR root [req-26797326-dac1-4a62-91d0-1c17094005eb ] Unexpected error occurred serving API: Timed out waiting for a reply to message ID 7e13bed0dc4b40c383433e1f6da3aee6.
TRACE root Traceback (most recent call last):
TRACE root   File "/usr/lib/python2.7/site-packages/heat/common/wsgi.py", line 633, in __call__
TRACE root     request, **action_args)
TRACE root   File "/usr/lib/python2.7/site-packages/heat/common/wsgi.py", line 707, in dispatch
TRACE root     return method(*args, **kwargs)
TRACE root   File "/usr/lib/python2.7/site-packages/heat/api/openstack/v1/util.py", line 37, in handle_stack_method
TRACE root     return handler(controller, req, **kwargs)
TRACE root   File "/usr/lib/python2.7/site-packages/heat/api/openstack/v1/stacks.py", line 250, in index
TRACE root     return self._index(req)
TRACE root   File "/usr/lib/python2.7/site-packages/heat/api/openstack/v1/stacks.py", line 219, in _index
TRACE root     **params)
TRACE root   File "/usr/lib/python2.7/site-packages/heat/rpc/client.py", line 97, in list_stacks
TRACE root     show_nested=show_nested))
TRACE root   File "/usr/lib/python2.7/site-packages/heat/rpc/client.py", line 50, in call
TRACE root     return client.call(ctxt, method, **kwargs)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/client.py", line 389, in call
TRACE root     return self.prepare().call(ctxt, method, **kwargs)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/client.py", line 152, in call
TRACE root     retry=self.retry)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/transport.py", line 90, in _send
TRACE root     timeout=timeout, retry=retry)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 416, in send
TRACE root     retry=retry)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 405, in _send
TRACE root     result = self._waiter.wait(msg_id, timeout)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 293, in wait
TRACE root     reply, ending = self._poll_connection(msg_id, timer)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 240, in _poll_connection
TRACE root     self._raise_timeout_exception(msg_id)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 209, in _raise_timeout_exception
TRACE root     _('Timed out waiting for a reply to message ID %s.') % msg_id)
TRACE root MessagingTimeout: Timed out waiting for a reply to message ID 7e13bed0dc4b40c383433e1f6da3aee6.

The customer is using the patch from BZ1260990 on this system.

Version-Release number of selected component (if applicable):
heat-cfntools-1.2.6-4.el7.noarch
openstack-heat-api-2014.2.1-3.el7ost.noarch
openstack-heat-api-cfn-2014.2.1-3.el7ost.noarch
openstack-heat-common-2014.2.1-3.el7ost.noarch
openstack-heat-engine-2014.2.1-3.el7ost.noarch
python-heatclient-0.2.12-1.el7ost.noarch
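For context on the API traceback: heat-api makes a synchronous RPC call to heat-engine and raises MessagingTimeout when no reply arrives within rpc_response_timeout (60 seconds by default). A minimal sketch of that call pattern with oslo.messaging (the topic, version and empty context below are illustrative, not lifted from heat's code):

# Sketch of the synchronous RPC call that produces
# "MessagingTimeout: Timed out waiting for a reply to message ID ...".
# Topic/version/context are illustrative, not heat's actual values.
from oslo.config import cfg
from oslo import messaging

transport = messaging.get_transport(cfg.CONF)   # honours rpc_backend
target = messaging.Target(topic='engine', version='1.1')
client = messaging.RPCClient(transport, target)

try:
    # call() blocks until the engine replies; if the reply never arrives
    # (e.g. the broker connection is broken), MessagingTimeout is raised
    # after rpc_response_timeout seconds.
    stacks = client.call({}, 'list_stacks')
except messaging.MessagingTimeout:
    raise  # this is the exception surfacing in the heat-api log above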
From the traceback it looks like the connection to the message broker died.
Possibly related to bug 1265418
Could you please compare the following configuration settings in /etc/heat/heat.conf against other OpenStack services? They should be consistent (or consistently using the default value):
- rpc_backend
- rabbit_*
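For what it's worth, here is a rough Python 2 helper that pulls those options out of each config file for comparison (my own sketch, not a supported tool; the file list is an assumption, so adjust it to the services deployed on the host):

# Diff rpc_backend / rabbit_* settings across service config files.
# Sketch only; adjust FILES for your deployment. Add 'qpid_' to the
# prefix check if the deployment is on the qpid driver.
import ConfigParser

FILES = ['/etc/heat/heat.conf',
         '/etc/nova/nova.conf',
         '/etc/neutron/neutron.conf']

for path in FILES:
    parser = ConfigParser.RawConfigParser()
    parser.read(path)  # silently skips files that do not exist
    print('== %s ==' % path)
    for opt, value in sorted(parser.items('DEFAULT')):
        if opt == 'rpc_backend' or opt.startswith('rabbit_'):
            print('  %s = %s' % (opt, value))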
It's highly unlikely that max_resources_per_stack is the problem here. In RHOS 7 the problem was that each nested stack ran in its own RPC request, and that each stack counted its resources by loading every stack in the tree, with the result that the total memory consumption was O(n^2) in the number of nested stacks in the tree. That bug only arose because RHOS 6 worked differently: there, all of the stacks in the tree are handled in a single RPC context, which simply counts the resources already in memory, so max_resources_per_stack doesn't change the amount loaded into memory at all.

It's far more likely that the patch to eliminate reference loops (so that objects are deleted immediately when they are no longer referenced, rather than waiting for the garbage collector) might make a difference. Most likely of all, though, is that something is leaking memory, with eventlet and qpid being the prime suspects.
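To make the complexity argument concrete, here is a toy illustration (no heat code involved) of the difference between the two counting strategies:

# Toy model: a tree of 100 nested stacks with 10 resources each.
tree = [['res'] * 10 for _ in range(100)]

# RHOS 7 behaviour (the bug): every stack independently reloads the whole
# tree to count resources, so n stacks each load n stacks: O(n^2) overall.
def count_per_stack(tree):
    counts = []
    for _stack in tree:
        loaded = [list(stack) for stack in tree]  # reload everything
        counts.append(sum(len(stack) for stack in loaded))
    return counts

# RHOS 6 behaviour: one RPC context holds the whole tree, so the resources
# are counted once over data that is already in memory: O(n) overall.
def count_once(tree):
    return sum(len(stack) for stack in tree)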
Could you please supply the output of the following?

rpm -qa | egrep "eventlet|greenlet|requests|qpid"
Exception RuntimeError: 'maximum recursion depth exceeded while calling a Python object' in <bound method GreenSocket.__del__ of <eventlet.greenio.GreenSocket object at 0x54bee10>> ignored

This ^ is a known bug in eventlet 0.15.2. It was identified upstream by our OpenStack team:
https://bugs.launchpad.net/oslo.messaging/+bug/1369999

The description states that it occurs after an RPC timeout with Qpid, and that the result is "The load average shoots up, lots of services hang up and chew through CPU spewing the same message into the logs." That sounds like exactly what you're seeing here.

Here is the upstream eventlet bug: https://github.com/eventlet/eventlet/issues/137
(Note there was also a follow-up fix required: https://github.com/eventlet/eventlet/issues/154)

This was fixed in eventlet 0.16.0: http://eventlet.net/doc/changelog.html#id7
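For the record, the "ignored" wording comes from CPython itself: an exception raised inside __del__ is never propagated, only reported on stderr. A minimal Python 2 demonstration of just that reporting mechanism (the actual recursion happens inside eventlet's GreenSocket teardown):

# Any exception raised in __del__ is swallowed by CPython and printed as
# "Exception ... ignored", exactly like the log line above.
class Sock(object):
    def __del__(self):
        raise RuntimeError('boom')

s = Sock()
del s   # stderr: Exception RuntimeError: RuntimeError('boom',) in
        #         <bound method Sock.__del__ of <__main__.Sock ...>> ignored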
To verify the bug (steps from the upstream bug):
- standard install with qpid running, with HOST_IP on a NIC that can be taken down (the reporter used a wifi NIC)
- install with qpid and neutron for networking:
  https://bugs.launchpad.net/oslo.messaging/+bug/1369999/comments/4
  https://bugs.launchpad.net/oslo.messaging/+bug/1369999/comments/5
- after things are up and running, bring down the NIC
- CPU usage will spike for a number of services, including heat (a quick way to watch for this is sketched below)
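While reproducing, something like the following makes the spike easy to spot (a rough sketch assuming psutil >= 2.0 is installed; plain top works just as well):

# Watch CPU usage of heat services while the NIC is down.
# Note: the first cpu_percent() call per process always reports 0.0;
# meaningful numbers start from the second pass of the loop.
import time
import psutil

while True:
    for proc in psutil.process_iter():
        try:
            name = proc.name()
            if 'heat' in name:
                print('%s (pid %d): %.1f%% CPU'
                      % (name, proc.pid, proc.cpu_percent(interval=None)))
        except psutil.NoSuchProcess:
            continue
    time.sleep(5)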
Backported to OSP 6, and picked up the patch for another bug that was originally pushed out on OSP 6 (rhbz 1259758).

New build: python-eventlet-0.17.4-2.el7ost
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-1903.html