Description of problem:

Customer is seeing heat consume memory and CPU resources on the VM it runs on. In the logs we can see timeouts and tracebacks:

ERROR root [-] Unexpected exception occurred 57 time(s)... retrying.
TRACE root Traceback (most recent call last):
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/utils/excutils.py", line 92, in inner_func
TRACE root     return infunc(*args, **kwargs)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/_executors/impl_eventlet.py", line 87, in _executor_thread
TRACE root     incoming = self.listener.poll()
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 129, in poll
TRACE root     self.conn.consume(limit=1)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_qpid.py", line 711, in consume
TRACE root     six.next(it)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_qpid.py", line 641, in iterconsume
TRACE root     yield self.ensure(_error_callback, _consume)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_qpid.py", line 580, in ensure
TRACE root     return method()
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_qpid.py", line 632, in _consume
TRACE root     nxt_receiver = self.session.next_receiver(timeout=timeout)
TRACE root   File "<string>", line 6, in next_receiver
TRACE root   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 674, in next_receiver
TRACE root     if self._ecwait(lambda: self.incoming, timeout):
TRACE root   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
TRACE root     result = self._ewait(lambda: self.closed or predicate(), timeout)
TRACE root   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 580, in _ewait
TRACE root     result = self.connection._ewait(lambda: self.error or predicate(), timeout)
TRACE root   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 219, in _ewait
TRACE root     self.check_error()
TRACE root   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 212, in check_error
TRACE root     raise e
TRACE root error: [Errno 32] Broken pipe

And in the api log file:

DEBUG root [req-2f613501-c452-48d3-8d71-04895830141e ] Calling <heat.api.openstack.v1.stacks.StackController object at 0x4a85e10> : index __call__ /usr/lib/python2.7/site-packages/heat/common/wsgi.py:630
ERROR root [req-26797326-dac1-4a62-91d0-1c17094005eb ] Unexpected error occurred serving API: Timed out waiting for a reply to message ID 7e13bed0dc4b40c383433e1f6da3aee6.
TRACE root Traceback (most recent call last):
TRACE root   File "/usr/lib/python2.7/site-packages/heat/common/wsgi.py", line 633, in __call__
TRACE root     request, **action_args)
TRACE root   File "/usr/lib/python2.7/site-packages/heat/common/wsgi.py", line 707, in dispatch
TRACE root     return method(*args, **kwargs)
TRACE root   File "/usr/lib/python2.7/site-packages/heat/api/openstack/v1/util.py", line 37, in handle_stack_method
TRACE root     return handler(controller, req, **kwargs)
TRACE root   File "/usr/lib/python2.7/site-packages/heat/api/openstack/v1/stacks.py", line 250, in index
TRACE root     return self._index(req)
TRACE root   File "/usr/lib/python2.7/site-packages/heat/api/openstack/v1/stacks.py", line 219, in _index
TRACE root     **params)
TRACE root   File "/usr/lib/python2.7/site-packages/heat/rpc/client.py", line 97, in list_stacks
TRACE root     show_nested=show_nested))
TRACE root   File "/usr/lib/python2.7/site-packages/heat/rpc/client.py", line 50, in call
TRACE root     return client.call(ctxt, method, **kwargs)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/client.py", line 389, in call
TRACE root     return self.prepare().call(ctxt, method, **kwargs)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/client.py", line 152, in call
TRACE root     retry=self.retry)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/transport.py", line 90, in _send
TRACE root     timeout=timeout, retry=retry)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 416, in send
TRACE root     retry=retry)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 405, in _send
TRACE root     result = self._waiter.wait(msg_id, timeout)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 293, in wait
TRACE root     reply, ending = self._poll_connection(msg_id, timer)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 240, in _poll_connection
TRACE root     self._raise_timeout_exception(msg_id)
TRACE root   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 209, in _raise_timeout_exception
TRACE root     _('Timed out waiting for a reply to message ID %s.') % msg_id)
TRACE root MessagingTimeout: Timed out waiting for a reply to message ID 7e13bed0dc4b40c383433e1f6da3aee6.

The customer is using the patch from BZ1260990 on this system.

Version-Release number of selected component (if applicable):
heat-cfntools-1.2.6-4.el7.noarch
openstack-heat-api-2014.2.1-3.el7ost.noarch
openstack-heat-api-cfn-2014.2.1-3.el7ost.noarch
openstack-heat-common-2014.2.1-3.el7ost.noarch
openstack-heat-engine-2014.2.1-3.el7ost.noarch
python-heatclient-0.2.12-1.el7ost.noarch
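For context on the API traceback: heat-api makes a synchronous RPC call to heat-engine and raises MessagingTimeout when no reply arrives within rpc_response_timeout (60 seconds by default). A minimal sketch of that call pattern with oslo.messaging (the topic, version and empty context below are illustrative, not lifted from heat's code):

# Sketch of the synchronous RPC call that produces
# "MessagingTimeout: Timed out waiting for a reply to message ID ...".
# Topic/version/context are illustrative, not heat's actual values.
from oslo.config import cfg
from oslo import messaging

transport = messaging.get_transport(cfg.CONF)   # honours rpc_backend
target = messaging.Target(topic='engine', version='1.1')
client = messaging.RPCClient(transport, target)

try:
    # call() blocks until the engine replies; if the reply never arrives
    # (e.g. the broker connection is broken), MessagingTimeout is raised
    # after rpc_response_timeout seconds.
    stacks = client.call({}, 'list_stacks')
except messaging.MessagingTimeout:
    raise  # this is the exception surfacing in the heat-api log above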
From the traceback it looks like the connection to the message broker died.
Possibly related to bug 1265418
Could you please compare the following configuration settings in /etc/heat/heat.conf against other OpenStack services? They should be consistent (or consistently using the default value):
- rpc_backend
- rabbit_*
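For what it's worth, here is a rough Python 2 helper that pulls those options out of each config file for comparison (my own sketch, not a supported tool; the file list is an assumption, so adjust it to the services deployed on the host):

# Diff rpc_backend / rabbit_* settings across service config files.
# Sketch only; adjust FILES for your deployment. Add 'qpid_' to the
# prefix check if the deployment is on the qpid driver.
import ConfigParser

FILES = ['/etc/heat/heat.conf',
         '/etc/nova/nova.conf',
         '/etc/neutron/neutron.conf']

for path in FILES:
    parser = ConfigParser.RawConfigParser()
    parser.read(path)  # silently skips files that do not exist
    print('== %s ==' % path)
    for opt, value in sorted(parser.items('DEFAULT')):
        if opt == 'rpc_backend' or opt.startswith('rabbit_'):
            print('  %s = %s' % (opt, value))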
It's highly unlikely that max_resources_per_stack is the problem here. In RHOS 7 the problem was that each nested stack ran in its own RPC request, and that each stack counted its resources by loading every stack in the tree, with the result that the total memory consumption was O(n^2) in the number of nested stacks in the tree. That bug only arose because RHOS 6 worked differently: there, all of the stacks in the tree are handled in a single RPC context, which simply counts the resources already in memory, so max_resources_per_stack doesn't change the amount loaded into memory at all.

It's far more likely that the patch to eliminate reference loops (so that objects are deleted immediately when they are no longer referenced, rather than waiting for the garbage collector) might make a difference. Most likely of all, though, is that something is leaking memory, with eventlet and qpid being the prime suspects.
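To make the complexity argument concrete, here is a toy illustration (no heat code involved) of the difference between the two counting strategies:

# Toy model: a tree of 100 nested stacks with 10 resources each.
tree = [['res'] * 10 for _ in range(100)]

# RHOS 7 behaviour (the bug): every stack independently reloads the whole
# tree to count resources, so n stacks each load n stacks: O(n^2) overall.
def count_per_stack(tree):
    counts = []
    for _stack in tree:
        loaded = [list(stack) for stack in tree]  # reload everything
        counts.append(sum(len(stack) for stack in loaded))
    return counts

# RHOS 6 behaviour: one RPC context holds the whole tree, so the resources
# are counted once over data that is already in memory: O(n) overall.
def count_once(tree):
    return sum(len(stack) for stack in tree)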
Could you please supply the output of the following?

rpm -qa | egrep "eventlet|greenlet|requests|qpid"
Exception RuntimeError: 'maximum recursion depth exceeded while calling a Python object' in <bound method GreenSocket.__del__ of <eventlet.greenio.GreenSocket object at 0x54bee10>> ignored

This ^ is a known bug in eventlet 0.15.2. It was identified upstream by our OpenStack team:
https://bugs.launchpad.net/oslo.messaging/+bug/1369999

The description states that it occurs after an RPC timeout with Qpid, and that the result is "The load average shoots up, lots of services hang up and chew through CPU spewing the same message into the logs." That sounds like exactly what you're seeing here.

Here is the upstream eventlet bug: https://github.com/eventlet/eventlet/issues/137
(Note there was also a follow-up fix required: https://github.com/eventlet/eventlet/issues/154)

This was fixed in eventlet 0.16.0: http://eventlet.net/doc/changelog.html#id7
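For the record, the "ignored" wording comes from CPython itself: an exception raised inside __del__ is never propagated, only reported on stderr. A minimal Python 2 demonstration of just that reporting mechanism (the actual recursion happens inside eventlet's GreenSocket teardown):

# Any exception raised in __del__ is swallowed by CPython and printed as
# "Exception ... ignored", exactly like the log line above.
class Sock(object):
    def __del__(self):
        raise RuntimeError('boom')

s = Sock()
del s   # stderr: Exception RuntimeError: RuntimeError('boom',) in
        #         <bound method Sock.__del__ of <__main__.Sock ...>> ignored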
To verify the bug (steps from the upstream bug):
- standard install with qpid running, with HOST_IP on a NIC that can be taken down (the reporter used a wifi NIC)
- install with qpid and neutron for networking:
  https://bugs.launchpad.net/oslo.messaging/+bug/1369999/comments/4
  https://bugs.launchpad.net/oslo.messaging/+bug/1369999/comments/5
- after things are up and running, bring down the NIC
- CPU usage will spike for a number of services, including heat (a quick way to watch for this is sketched below)
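While reproducing, something like the following makes the spike easy to spot (a rough sketch assuming psutil >= 2.0 is installed; plain top works just as well):

# Watch CPU usage of heat services while the NIC is down.
# Note: the first cpu_percent() call per process always reports 0.0;
# meaningful numbers start from the second pass of the loop.
import time
import psutil

while True:
    for proc in psutil.process_iter():
        try:
            name = proc.name()
            if 'heat' in name:
                print('%s (pid %d): %.1f%% CPU'
                      % (name, proc.pid, proc.cpu_percent(interval=None)))
        except psutil.NoSuchProcess:
            continue
    time.sleep(5)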
Backported to OSP 6, and picked up the patch for another bug that was originally pushed out on OSP 6 (rhbz 1259758).

New build: python-eventlet-0.17.4-2.el7ost
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-1903.html