Bug 1307125

Summary: Heat engine crashes during osp-d update from 7.1 and from 7.0 to 7.3
Product: Red Hat OpenStack Reporter: Toure Dunnon <tdunnon>
Component: openstack-heatAssignee: Steve Baker <sbaker>
Status: CLOSED ERRATA QA Contact: Amit Ugol <augol>
Severity: high Docs Contact:
Priority: high    
Version: 7.0 (Kilo)CC: adahms, athomas, jcoufal, mburns, ohochman, rhel-osp-director-maint, sasha, sbaker, sclewis, shardy, yeylon, zbitter
Target Milestone: asyncKeywords: Reopened, ZStream
Target Release: 7.0 (Kilo)   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: openstack-heat-2015.1.2-10.el7ost Doc Type: Known Issue
Doc Text:
If the database is not available when heat-engine starts, heat-engine fails to start, preventing heat-engine from running under certain circumstances. This occurs after you update a machine on which both the database and heat-engine are running using yum alone such as when updating the undercloud. As a workaround, heat-engine must be explicitly started by running 'systemctl start openstack-heat-engine.service'. You can confirm whether heat-engine is running by running 'systemctl status openstack-heat-engine.service'.
Story Points: ---
Clone Of:
: 1308576 (view as bug list) Environment:
Last Closed: 2016-03-10 17:07:10 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1308576    
Attachments:
Description Flags
heat engine log backtrace
none
7.1 heat logs
none
7.0 heat logs none

Description Toure Dunnon 2016-02-12 18:45:00 UTC
Description of problem:
When performing an update from OSP-D 7.1 to 7.3 the heat engine crashes during
the phase were yum starts to update packages. This does not occur under 7.2. 

Version-Release number of selected component (if applicable):
openstack-heat-engine-2015.1.2-8.el7ost.noarch

How reproducible:
After 7.1 OSP-D deploys the overcloud and is populated with instances an
update is perform.

Comment 3 Toure Dunnon 2016-02-12 18:50:58 UTC
Created attachment 1123619 [details]
heat engine log backtrace

Comment 4 Mike Burns 2016-02-12 18:54:45 UTC
*** Bug 1307124 has been marked as a duplicate of this bug. ***

Comment 5 Alexander Chuzhoy 2016-02-12 19:01:53 UTC
Reproduced on 7.0 and 7.1:
7.0:
Feb 12 12:01:49 Installed: openstack-heat-engine-2015.1.0-4.el7ost.noarch
Feb 12 13:27:40 Updated: openstack-heat-engine-2015.1.2-9.el7ost.noarch

7.1:
Feb 12 11:45:20 Installed: openstack-heat-engine-2015.1.1-5.el7ost.noarch
Feb 12 13:16:10 Updated: openstack-heat-engine-2015.1.2-9.el7ost.noarch



During the yum update the openstack-heat-engine service fails.
The W/A is to start it.

Comment 6 Alexander Chuzhoy 2016-02-12 19:07:24 UTC
Created attachment 1123623 [details]
7.1 heat logs

Comment 7 Alexander Chuzhoy 2016-02-12 19:08:43 UTC
Created attachment 1123624 [details]
7.0 heat logs

Comment 8 Zane Bitter 2016-02-12 22:20:08 UTC
2016-02-12 13:18:46.487 13936 CRITICAL heat-engine [req-33db6d70-974c-4210-8311-b85ba49fb031 - -] DBConnectionError: (_mysql_exceptions.OperationalError) (2006, 'MySQL server has gone away') [SQL: u'SELECT watch
_rule.created_at AS watch_rule_created_at, watch_rule.updated_at AS watch_rule_updated_at, watch_rule.id AS watch_rule_id, watch_rule.name AS watch_rule_name, watch_rule.rule AS watch_rule_rule, watch_rule.state
 AS watch_rule_state, watch_rule.last_evaluated AS watch_rule_last_evaluated, watch_rule.stack_id AS watch_rule_stack_id \nFROM watch_rule \nWHERE watch_rule.stack_id = %s'] [parameters: ('fda3526e-7dbe-4f23-bb3
d-a1b8b3bd0fa8',)]
2016-02-12 13:18:46.487 13936 TRACE heat-engine Traceback (most recent call last):
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/bin/heat-engine", line 83, in <module>
2016-02-12 13:18:46.487 13936 TRACE heat-engine     srv.create_periodic_tasks()
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 105, in wrapper
2016-02-12 13:18:46.487 13936 TRACE heat-engine     return f(*args, **kwargs)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 332, in create_periodic_tasks
2016-02-12 13:18:46.487 13936 TRACE heat-engine     self.stack_watch.start_watch_task(s.id, admin_context)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib/python2.7/site-packages/heat/engine/service_stack_watch.py", line 57, in start_watch_task
2016-02-12 13:18:46.487 13936 TRACE heat-engine     if stack_has_a_watchrule(stack_id):
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib/python2.7/site-packages/heat/engine/service_stack_watch.py", line 52, in stack_has_a_watchrule
2016-02-12 13:18:46.487 13936 TRACE heat-engine     if stack_has_a_watchrule(child.id):
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib/python2.7/site-packages/heat/engine/service_stack_watch.py", line 52, in stack_has_a_watchrule
2016-02-12 13:18:46.487 13936 TRACE heat-engine     if stack_has_a_watchrule(child.id):
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib/python2.7/site-packages/heat/engine/service_stack_watch.py", line 52, in stack_has_a_watchrule
2016-02-12 13:18:46.487 13936 TRACE heat-engine     if stack_has_a_watchrule(child.id):
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib/python2.7/site-packages/heat/engine/service_stack_watch.py", line 36, in stack_has_a_watchrule
2016-02-12 13:18:46.487 13936 TRACE heat-engine     wrs = watch_rule_object.WatchRule.get_all_by_stack(cnxt, sid)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib/python2.7/site-packages/heat/objects/watch_rule.py", line 77, in get_all_by_stack
2016-02-12 13:18:46.487 13936 TRACE heat-engine     stack_id)]
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib/python2.7/site-packages/heat/db/api.py", line 248, in watch_rule_get_all_by_stack
2016-02-12 13:18:46.487 13936 TRACE heat-engine     return IMPL.watch_rule_get_all_by_stack(context, stack_id)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib/python2.7/site-packages/heat/db/sqlalchemy/api.py", line 718, in watch_rule_get_all_by_stack
2016-02-12 13:18:46.487 13936 TRACE heat-engine     context, models.WatchRule).filter_by(stack_id=stack_id).all()
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2399, in all
2016-02-12 13:18:46.487 13936 TRACE heat-engine     return list(self)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2516, in __iter__
2016-02-12 13:18:46.487 13936 TRACE heat-engine     return self._execute_and_instances(context)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2531, in _execute_and_instances
2016-02-12 13:18:46.487 13936 TRACE heat-engine     result = conn.execute(querycontext.statement, self._params)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 914, in execute
2016-02-12 13:18:46.487 13936 TRACE heat-engine     return meth(self, multiparams, params)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
2016-02-12 13:18:46.487 13936 TRACE heat-engine     return connection._execute_clauseelement(self, multiparams, params)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1010, in _execute_clauseelement
2016-02-12 13:18:46.487 13936 TRACE heat-engine     compiled_sql, distilled_params
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1146, in _execute_context
2016-02-12 13:18:46.487 13936 TRACE heat-engine     context)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1337, in _handle_dbapi_exception
2016-02-12 13:18:46.487 13936 TRACE heat-engine     util.raise_from_cause(newraise, exc_info)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
2016-02-12 13:18:46.487 13936 TRACE heat-engine     reraise(type(exception), exception, tb=exc_tb)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
2016-02-12 13:18:46.487 13936 TRACE heat-engine     context)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
2016-02-12 13:18:46.487 13936 TRACE heat-engine     cursor.execute(statement, parameters)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/MySQLdb/cursors.py", line 174, in execute
2016-02-12 13:18:46.487 13936 TRACE heat-engine     self.errorhandler(self, exc, value)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
2016-02-12 13:18:46.487 13936 TRACE heat-engine     raise errorclass, errorvalue
2016-02-12 13:18:46.487 13936 TRACE heat-engine DBConnectionError: (_mysql_exceptions.OperationalError) (2006, 'MySQL server has gone away') [SQL: u'SELECT watch_rule.created_at AS watch_rule_created_at, watch_rule.updated_at AS watch_rule_updated_at, watch_rule.id AS watch_rule_id, watch_rule.name AS watch_rule_name, watch_rule.rule AS watch_rule_rule, watch_rule.state AS watch_rule_state, watch_rule.last_evaluated AS watch_rule_last_evaluated, watch_rule.stack_id AS watch_rule_stack_id \nFROM watch_rule \nWHERE watch_rule.stack_id = %s'] [parameters: ('fda3526e-7dbe-4f23-bb3d-a1b8b3bd0fa8',)]
2016-02-12 13:18:46.487 13936 TRACE heat-engine 
2016-02-12 13:18:46.611 13995 INFO heat.openstack.common.service [-] Parent process has died unexpectedly, exiting
2016-02-12 13:18:46.611 13993 INFO heat.openstack.common.service [-] Parent process has died unexpectedly, exiting
2016-02-12 13:18:46.612 13995 DEBUG heat.engine.service [-] Attempting to stop engine service... _stop_rpc_server /usr/lib/python2.7/site-packages/heat/engine/service.py:373
2016-02-12 13:18:46.612 13993 DEBUG heat.engine.service [-] Attempting to stop engine service... _stop_rpc_server /usr/lib/python2.7/site-packages/heat/engine/service.py:373
2016-02-12 13:18:46.613 13992 INFO heat.openstack.common.service [-] Parent process has died unexpectedly, exiting

Could be just bad timing with the database going down just as Heat is starting up maybe?

Comment 9 Angus Thomas 2016-02-15 14:34:43 UTC
Doctext supplied for the 7.3 release.

Comment 12 Omri Hochman 2016-02-15 18:06:06 UTC
I would like to explain how does it looks from the user side - and why IMO it can cause a very bad user experience . 

[stack@instack ~]$ date
Mon Feb 15 12:31:57 EST 2016
[stack@instack ~]$ yes "" | openstack overcloud update stack overcloud -i --templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml  -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml
...
...hangs
...hangs
...hangs 
... 
...10 minutes later fails on heat timeout ..

That is the output :(non informative message ):
-------------------
"ERROR: openstack ERROR: Timed out waiting for a reply to message ID aeeb27a4937d4825be6de2e5a6c57dbe

looking at the heat log you don't see anything meaningful, even worse you see heat problem to connect to DB (you might think it's connectivity issue) : 

2016-02-12 13:18:46.487 13936 CRITICAL heat-engine [req-33db6d70-974c-4210-8311-b85ba49fb031 - -] DBConnectionError: (_mysql_exceptions.OperationalError) (2006, 'MySQL server has gone away') [SQL: u'SELECT watch


Only after digging enough into /var/log/messages - if you're lucky you'll notice the heat-engine service failed during the yum update (in messages it fails without a trace-back - so it hard to spot it)  
 

[root@instack ~]# openstack-status | grep heat
openstack-heat-api:                     active
openstack-heat-api-cfn:                 active
openstack-heat-api-cloudwatch:          active
openstack-heat-engine:                  failed
Warning keystonerc not sourced


then apply workaround..

Comment 13 Steve Baker 2016-02-15 22:29:59 UTC
I'll look into fixing this by moving create_periodic_tasks to its own thread where it can retry until the db is available. This will allow systemd.notify to be called even when the db is down.

Comment 14 Steve Baker 2016-02-16 00:12:09 UTC
Here is a proposed upstream fix, will backport now and test heat's behaviour without a database.

Comment 15 Steve Baker 2016-02-16 02:28:26 UTC
This downstream change allowed heat-engine to start with a stopped mariadb, and then recover once mariadb was started 90 minutes later.

Comment 16 Jaromir Coufal 2016-02-16 14:06:00 UTC
Workaround available, not a blocker for z4. Though, I asked Lon if we can add the backported patch to the latest build for z4 if all possible.
-- Jarda

Comment 22 errata-xmlrpc 2016-03-10 17:07:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0435.html