Bug 1265653
Summary: | Heat consumes high CPU and Memory | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Pablo Iranzo Gómez <pablo.iranzo>
Component: | python-eventlet | Assignee: | Jon Schlueter <jschluet>
Status: | CLOSED ERRATA | QA Contact: | Amit Ugol <augol>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 6.0 (Juno) | CC: | adahms, dmaley, fahmed, jschluet, lhh, mburns, mschuppe, nlevinki, pablo.iranzo, rhel-osp-director-maint, sasha, sbaker, sclewis, shardy, yeylon, zbitter
Target Milestone: | async | Keywords: | Rebase, Triaged, ZStream
Target Release: | 6.0 (Juno) | |
Hardware: | Unspecified | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | python-eventlet-0.17.4-2.el7ost | Doc Type: | Rebase: Bug Fixes and Enhancements
Doc Text: | Rebase package to version: 0.17.4. This version resolves the following error: "maximum recursion depth exceeded in GreenSocket.__del__" | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2015-10-15 12:30:10 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Pablo Iranzo Gómez 2015-09-23 12:22:38 UTC
From the traceback it looks like RabbitMQ died. Possibly related to bug 1265418.

Could you please compare the following configuration settings in /etc/heat/heat.conf against the other OpenStack services? They should be consistent (or consistently using the default value):

- rpc_backend
- rabbit_*

It's highly unlikely that max_resources_per_stack is the problem here. In RHOS 7 the problem was that each nested stack ran in its own RPC request, and that each stack counted its resources by loading every stack in the tree, with the result that the total memory consumption was O(n^2) in the number of nested stacks in the tree. The reason that bug even happened is simply not the case in RHOS 6: all of the stacks in the tree are handled in a single RPC context, which just counts all of the resources in memory, so max_resources_per_stack doesn't change the amount that is loaded into memory at all.

It's far more likely that the patch to eliminate reference loops (so that objects are deleted immediately when they are no longer referenced, rather than waiting for the garbage collector) might make a difference. Most likely of all, though, is that something is leaking memory, with eventlet and qpid being the prime suspects. Could you please supply the output of the following command?

    rpm -qa | egrep "eventlet|greenlet|requests|qpid"

    Exception RuntimeError: 'maximum recursion depth exceeded while calling a Python object' in <bound method GreenSocket.__del__ of <eventlet.greenio.GreenSocket object at 0x54bee10>> ignored

This ^ is a known bug in eventlet 0.15.2. It was identified upstream by our OpenStack team: https://bugs.launchpad.net/oslo.messaging/+bug/1369999

That bug's description states that it occurs after an RPC timeout with QPID, and that the result is "The load average shoots up, lots of services hang up and chew through CPU spewing the same message into the logs." That sounds like exactly what you're seeing here.

Here is the upstream eventlet bug: https://github.com/eventlet/eventlet/issues/137 (note that a follow-up fix was also required: https://github.com/eventlet/eventlet/issues/154). This was fixed in eventlet 0.16.0: http://eventlet.net/doc/changelog.html#id7

To verify the bug, following the upstream report:

- Do a standard install with qpid, with HOST_IP on a NIC that can be taken down (the upstream reporter used a Wi-Fi NIC), and with neutron for networking:
  https://bugs.launchpad.net/oslo.messaging/+bug/1369999/comments/4
  https://bugs.launchpad.net/oslo.messaging/+bug/1369999/comments/5
- Once everything is up and running, bring down the NIC.
- CPU usage will spike for a number of services, including heat.

Backported to OSP 6, and also picked up the patch for another bug that was originally pushed out on OSP 6 - rhbz 1259758. New build: python-eventlet-0.17.4-2.el7ost

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1903.html
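As a convenience for the configuration comparison requested above, here is a minimal sketch (not part of the original report) that dumps the rpc_backend and rabbit_* settings from a few service configuration files so they can be compared side by side. The file list is an assumption, and it assumes these options live in the [DEFAULT] section as they normally do on Juno; adjust both for the actual deployment.

```python
# Illustrative sketch: list messaging-related options per service config
# so that inconsistencies stand out. The set of files is an assumption.
import configparser

FILES = [
    "/etc/heat/heat.conf",
    "/etc/nova/nova.conf",
    "/etc/neutron/neutron.conf",
]

def messaging_options(path):
    cfg = configparser.ConfigParser(strict=False, interpolation=None)
    cfg.read(path)
    # Juno-era oslo.messaging options normally sit in [DEFAULT]; extend the
    # lookup if they have been moved to another section.
    return {key: value
            for key, value in cfg.defaults().items()
            if key == "rpc_backend" or key.startswith("rabbit_")}

for path in FILES:
    print(path)
    for option, value in sorted(messaging_options(path).items()):
        print("    %s = %s" % (option, value))
```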
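The point above about the patch that eliminates reference loops rests on how CPython frees memory: an object with no remaining references is freed immediately by reference counting, while objects caught in a reference cycle linger until the cyclic garbage collector runs. The following is a minimal, self-contained illustration of that behaviour on CPython; it is not the Heat patch itself.

```python
# Illustrative sketch only: acyclic objects are freed as soon as the last
# reference goes away, while objects in a reference cycle survive until
# the cyclic garbage collector runs.
import gc
import weakref


class Node(object):
    pass


# Acyclic object: freed immediately by reference counting.
plain = Node()
probe_plain = weakref.ref(plain)
del plain
print("acyclic object freed:", probe_plain() is None)             # True

# Two objects referencing each other: freed only once gc runs.
a, b = Node(), Node()
a.other, b.other = b, a
probe_cycle = weakref.ref(a)
del a, b
print("cycle freed before gc.collect():", probe_cycle() is None)  # False
gc.collect()
print("cycle freed after gc.collect():", probe_cycle() is None)   # True
```

In a long-running service, memory held by such cycles only comes back when the collector happens to run, which is why breaking the loops keeps the footprint flatter.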
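When verifying, it may also help to confirm which eventlet release the services actually import. A rough sketch, under the assumption that the version string is a plain "major.minor.patch" value: the upstream fix landed in 0.16.0, and the downstream build named above is python-eventlet-0.17.4-2.el7ost.

```python
# Rough check that the interpreter picks up an eventlet release that already
# contains the GreenSocket.__del__ fix (upstream 0.16.0; shipped downstream
# as python-eventlet-0.17.4-2.el7ost).
import eventlet

major, minor = (int(part) for part in eventlet.__version__.split(".")[:2])
if (major, minor) >= (0, 16):
    print("eventlet %s: includes the upstream fix" % eventlet.__version__)
else:
    print("eventlet %s: affected; update to python-eventlet-0.17.4-2.el7ost"
          % eventlet.__version__)
```

The packaged version can also be read directly with `rpm -q python-eventlet`, matching the `rpm -qa` check requested earlier in the report.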