Bug 1373105

Summary: Heat gets stuck in DELETE_IN_PROGRESS for some input data
Product: Red Hat OpenStack
Component: openstack-heat
Version: 8.0 (Liberty)
Target Release: 8.0 (Liberty)
Target Milestone: async
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: Triaged, ZStream
Reporter: Aviv Guetta <aguetta>
Assignee: Zane Bitter <zbitter>
QA Contact: Amit Ugol <augol>
CC: agurenko, dmaley, lmiccini, mbayer, mburns, pablo.iranzo, rhel-osp-director-maint, sbaker, shardy, srevivo, zbitter
Fixed In Version: openstack-heat-5.0.1-9.el7ost
Type: Bug
Last Closed: 2016-12-21 16:43:37 UTC
Clones: 1384667
Bug Blocks: 1384667

Description Aviv Guetta 2016-09-05 08:34:02 UTC
Description of problem:
Upstream bug https://bugs.launchpad.net/heat/+bug/1499669
Heat gets stuck in DELETE_IN_PROGRESS when a specific creation/deletion sequence is applied to a stack.

Version-Release number of selected component (if applicable):
RHOSP 10 - Newton

How reproducible:
Only with the specific creation/deletion sequence described below.

Steps to Reproduce:
1. Create the stack templates:
rg.yaml:
heat_template_version: 2013-05-23
resources:
    rg:
        type: OS::Heat::ResourceGroup
        properties:
            count: 125
            resource_def:
                type: rand_str.yaml

rand_str.yaml:
heat_template_version: 2013-05-23
resources:

(yes, deliberately without any resources)

2. Launch the stack:
heat stack-create abc --template-file rg.yaml
Wait about 20-30 seconds, then run:
heat stack-delete abc
(i.e. before the stack has reached CREATE_COMPLETE)

Actual results:
2015-09-25 12:59:15.450 ERROR heat.engine.resource [-] DB error Not found

or

2015-09-25 13:04:52.109 ERROR heat.engine.resource [-] DB error This result object does not return rows. It has been closed automatically.
2015-09-25 13:04:52.110 ERROR sqlalchemy.pool.QueuePool [-] Exception during reset or similar
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool Traceback (most recent call last):
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 636, in _finalize_fairy
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool fairy._reset(pool)
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 776, in _reset
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool pool._dialect.do_rollback(self)
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/dialects/mysql/base.py", line 2519, in do_rollback
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool dbapi_connection.rollback()
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool File "/usr/local/lib/python2.7/dist-packages/pymysql/connections.py", line 711, in rollback
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool self._read_ok_packet()
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool File "/usr/local/lib/python2.7/dist-packages/pymysql/connections.py", line 687, in _read_ok_packet
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool raise err.OperationalError(2014, "Command Out of Sync")
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool OperationalError: (2014, 'Command Out of Sync')

And: 

2016-06-04 02:51:54.091 3205 INFO heat.engine.stack [req-d3d7848b-ea6e-4557-8759-37626e7f5f63 demo demo - default default] Stopped due to GreenletExit() in create
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 457, in fire_timers
    timer()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/timer.py", line 58, in __call__
    cb(*args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 217, in main
    self._resolve_links()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 232, in _resolve_links
    f(self, *ca, **ckw)
  File "/opt/stack/new/heat/heat/engine/service.py", line 187, in release
    stack.persist_state_and_release_lock(lock.engine_id)
  File "/opt/stack/new/heat/heat/engine/stack.py", line 913, in persist_state_and_release_lock
    stack = stack_object.Stack.get_by_id(self.context, self.id)
  File "/opt/stack/new/heat/heat/objects/stack.py", line 85, in get_by_id
    db_stack = db_api.stack_get(context, stack_id, **kwargs)
  File "/opt/stack/new/heat/heat/db/api.py", line 146, in stack_get
    eager_load=eager_load)
  File "/opt/stack/new/heat/heat/db/sqlalchemy/api.py", line 401, in stack_get
    result = query.get(stack_id)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 831, in get
    return self._get_impl(ident, loading.load_on_ident)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 864, in _get_impl
    return fallback_fn(self, key)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/loading.py", line 219, in load_on_ident
    return q.one()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2718, in one
    ret = list(self)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2761, in __iter__
    return self._execute_and_instances(context)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2774, in _execute_and_instances
    close_with_result=True)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2765, in _connection_from_session
    **kw)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 893, in connection
    execution_options=execution_options)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 900, in _connection_for_bind
    conn = engine.contextual_connect(**kw)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 2041, in contextual_connect
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 92, in __init__
    self.dispatch.engine_connect(self, self.__branch)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/event/attr.py", line 256, in __call__
    fn(*args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/engines.py", line 72, in _connect_ping_listener
    connection.scalar(select([1]))
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 844, in scalar
    return self.execute(object, *multiparams, **params).scalar()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/result.py", line 1066, in scalar
    row = self.first()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/result.py", line 1040, in first
    return self._non_result(None)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/result.py", line 922, in _non_result
    "This result object does not return rows. "
ResourceClosedError: This result object does not return rows. It has been closed automatically.
2016-06-04 02:51:54.129 3205 ERROR sqlalchemy.pool.QueuePool [-] Exception during reset or similar
2016-06-04 02:51:54.129 3205 ERROR sqlalchemy.pool.QueuePool Traceback (most recent call last):
2016-06-04 02:51:54.129 3205 ERROR sqlalchemy.pool.QueuePool File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 636, in _finalize_fairy
2016-06-04 02:51:54.129 3205 ERROR sqlalchemy.pool.QueuePool fairy._reset(pool)
2016-06-04 02:51:54.129 3205 ERROR sqlalchemy.pool.QueuePool File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 776, in _reset
2016-06-04 02:51:54.129 3205 ERROR sqlalchemy.pool.QueuePool pool._dialect.do_rollback(self)
2016-06-04 02:51:54.129 3205 ERROR sqlalchemy.pool.QueuePool File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/dialects/mysql/base.py", line 2542, in do_rollback
2016-06-04 02:51:54.129 3205 ERROR sqlalchemy.pool.QueuePool dbapi_connection.rollback()
2016-06-04 02:51:54.129 3205 ERROR sqlalchemy.pool.QueuePool File "/usr/local/lib/python2.7/dist-packages/pymysql/connections.py", line 772, in rollback
2016-06-04 02:51:54.129 3205 ERROR sqlalchemy.pool.QueuePool self._read_ok_packet()
2016-06-04 02:51:54.129 3205 ERROR sqlalchemy.pool.QueuePool File "/usr/local/lib/python2.7/dist-packages/pymysql/connections.py", line 748, in _read_ok_packet
2016-06-04 02:51:54.129 3205 ERROR sqlalchemy.pool.QueuePool raise err.OperationalError(2014, "Command Out of Sync")
2016-06-04 02:51:54.129 3205 ERROR sqlalchemy.pool.QueuePool OperationalError: (2014, 'Command Out of Sync')
2016-06-04 02:51:54.129 3205 ERROR sqlalchemy.pool.QueuePool
Exception AssertionError: AssertionError('do not call blocking functions from the mainloop',) in <function <lambda> at 0x7f2b1241b6e0> ignored

Expected results:
Heat deletes the stack. No exceptions.

Additional info:
Also reproduced on RHOSP 8.

Comment 4 Zane Bitter 2016-09-06 12:36:00 UTC
I believe this is caused by a database transaction not being closed when a thread is cancelled. (When you start a delete in the middle of a create/update, the create/update is stopped so that you don't have to wait for it to finish before your delete can start.)

This particular reproducer is kind of the perfect storm - create a lot of resources very quickly (so lots of transactions going on), then cancel it in the middle. This greatly increases the chances of encountering the bug.

Unfortunately we have yet to find the source of the problem upstream. Until we do, it's difficult to say whether the fix will be readily applicable to Liberty.

Comment 9 Zane Bitter 2016-09-20 20:17:54 UTC
I'm still investigating, but it's looking increasingly possible that this could be due to a bug in sqlalchemy. This thread describes a similar issue: http://lists.openstack.org/pipermail/openstack-dev/2016-September/103674.html

Comment 10 Zane Bitter 2016-09-21 12:34:43 UTC
After discussion with Mike Bayer yesterday[1] I believe we've tracked down the problem. When we cancel the greenthread, an exception is raised inside PyMySQL while it is communicating with the database server. PyMySQL mistakenly believes that only IOError exceptions can occur at that point, and therefore does not handle it. SQLAlchemy likewise has no special handling. This means that, from the server's perspective, the connection is in the middle of a write; from the SQLAlchemy/PyMySQL perspective, however, it is just another connection that can be used to roll back the failed transaction or be returned to the connection pool. The next time we attempt to use the connection, the DB operation fails.

Opinions... differ on where the best place to handle this is ;). In theory, a signal handler can raise an exception between any two Python opcodes, so libraries like PyMySQL and sqlalchemy ought to take this into account. In practice, a co-operative multithreading implementation like eventlet makes the problem much easier to encounter, by ensuring that the exception is effectively always raised when either sleeping or doing I/O, and in the latter case this will always trigger the failure.
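For illustration, here is a standalone toy sketch of the cancellation mechanism described above (this is not Heat code; it only assumes the eventlet and greenlet packages are installed). Killing a greenthread injects GreenletExit at whatever yield point the thread has reached - in Heat's case, inside PyMySQL while it is writing to the database socket:

import eventlet
from greenlet import GreenletExit

def worker():
    try:
        while True:
            eventlet.sleep(0.1)   # stands in for blocking I/O on the DB socket
    except GreenletExit:
        # PyMySQL/SQLAlchemy have no handler like this, so in Heat the
        # connection ends up back in the pool with a half-finished write.
        print("cancelled in the middle of an operation")
        raise

gt = eventlet.spawn(worker)
eventlet.sleep(0.3)
gt.kill()   # raises GreenletExit inside worker() at its current suspension point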

Steve has added a fix in Newton (https://review.openstack.org/#/c/373518/) that should almost completely prevent us from ever cancelling a greenthread. However, it depends on extensive changes that occurred during Newton development that are certainly not backportable to Liberty.

It remains to be seen what the shape of a solution at the DB connection level might look like. Until we've established that, it's difficult to say how (or even if) we can backport the fix to Liberty. That's what we'll try to figure out next.

[1] http://eavesdrop.openstack.org/irclogs/%23heat/%23heat.2016-09-20.log.html#t2016-09-20T20:25:07

Comment 11 Michael Bayer 2016-09-21 13:28:39 UTC
From the oslo_db / SQLAlchemy side, oslo_db can be made to handle these GreenletExit exceptions automatically, but it requires an adjustment to SQLAlchemy, and I'd prefer to keep this as a SQLA 1.1 thing as it's a new feature. The basic issue is that GreenletExit is not a subclass of Exception and SQLAlchemy's DB-exception-interception logic only goes as low as Exception, so this would need a configuration hook. It could also be monkeypatched into known SQLAlchemy 1.0.x versions, which is something we've done in oslo_db in the past.
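For reference, a quick interpreter check of the class-hierarchy point above (a minimal sketch, assuming the same greenlet package that eventlet uses):

from greenlet import GreenletExit

print(issubclass(GreenletExit, Exception))      # False
print(issubclass(GreenletExit, BaseException))  # True - so "except Exception"
                                                # style interception never sees it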

Comment 12 Zane Bitter 2016-09-21 15:25:28 UTC
It's worth noting that "stuck" is probably the wrong term here. The root stack fails to notice that the child stack hasn't started deleting - which is difficult to address without introducing race conditions. However, the parent stack will still time out as usual and move to the DELETE_FAILED state. The user can also initiate another delete at any time (before or after the timeout), and this second delete will generally succeed.

Comment 13 Zane Bitter 2016-09-22 01:19:09 UTC
OK, there are 3 things going on here:

1) Database errors due to exit exceptions not being handled by sqlalchemy/PyMySQL. Mike is fixing that in sqlalchemy here: https://bitbucket.org/zzzeek/sqlalchemy/issues/3803/dbapi-connections-go-invalid-on and there's a good chance the fix will be in sqlalchemy 1.1. It's unlikely anybody is going to want to go back and use that with Liberty OpenStack though. Fortunately this isn't the biggest problem here - losing writes to the database sucks, but Heat almost always handles it pretty gracefully.

2) When we fail to start deleting the ResourceGroup nested stack, the parent stack should immediately transition to DELETE_FAILED, rather than remain DELETE_IN_PROGRESS until it times out. I've proposed a fix at https://review.openstack.org/374442 - it's trivial and will be easy to backport.

3) The *real* problem. When we cancel the in-progress creation of the ResourceGroup nested stack, there is a super-short timeout (2s by default) during which all of the tasks have to be cancelled; if they're not cancelled by then, Heat gives up and won't start the delete. On a stack this size, that just isn't long enough. You can work around this by setting the config option "engine_life_check_timeout" to a larger value (e.g. 30s) - see the sketch below. I've proposed a patch to stop using that option for this particular timeout and to use 30s instead: https://review.openstack.org/374443 - it's also very simple and will be easy to backport.
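A minimal sketch of that workaround, assuming the option is read from the [DEFAULT] section of /etc/heat/heat.conf (restart openstack-heat-engine after editing):

# /etc/heat/heat.conf
[DEFAULT]
# Give cancelled tasks longer than the default 2 seconds to stop
# before Heat gives up and refuses to start the delete.
engine_life_check_timeout = 30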

Comment 21 errata-xmlrpc 2016-12-21 16:43:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2989.html