Bugzilla will be upgraded to version 5.0 on a still to be determined date in the near future. The original upgrade date has been delayed.
Bug 1307125 - Heat engine crashes during osp-d update from 7.1 and from 7.0 to 7.3
Heat engine crashes during osp-d update from 7.1 and from 7.0 to 7.3
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-heat (Show other bugs)
7.0 (Kilo)
x86_64 Linux
high Severity high
: async
: 7.0 (Kilo)
Assigned To: Steve Baker
Amit Ugol
: Reopened, ZStream
: 1307124 (view as bug list)
Depends On:
Blocks: 1308576
  Show dependency treegraph
 
Reported: 2016-02-12 13:45 EST by Toure Dunnon
Modified: 2016-04-26 19:37 EDT (History)
12 users (show)

See Also:
Fixed In Version: openstack-heat-2015.1.2-10.el7ost
Doc Type: Known Issue
Doc Text:
If the database is not available when heat-engine starts, heat-engine fails to start, preventing heat-engine from running under certain circumstances. This occurs after you update a machine on which both the database and heat-engine are running using yum alone such as when updating the undercloud. As a workaround, heat-engine must be explicitly started by running 'systemctl start openstack-heat-engine.service'. You can confirm whether heat-engine is running by running 'systemctl status openstack-heat-engine.service'.
Story Points: ---
Clone Of:
: 1308576 (view as bug list)
Environment:
Last Closed: 2016-03-10 12:07:10 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
heat engine log backtrace (7.74 KB, text/plain)
2016-02-12 13:50 EST, Toure Dunnon
no flags Details
7.1 heat logs (10.42 MB, application/x-gzip)
2016-02-12 14:07 EST, Alexander Chuzhoy
no flags Details
7.0 heat logs (7.03 MB, application/x-gzip)
2016-02-12 14:08 EST, Alexander Chuzhoy
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
OpenStack gerrit 280451 None None None 2016-02-15 19:12 EST
Red Hat Product Errata RHBA-2016:0435 normal SHIPPED_LIVE openstack-heat bug fix advisory 2016-03-10 17:05:44 EST

  None (edit)
Description Toure Dunnon 2016-02-12 13:45:00 EST
Description of problem:
When performing an update from OSP-D 7.1 to 7.3 the heat engine crashes during
the phase were yum starts to update packages. This does not occur under 7.2. 

Version-Release number of selected component (if applicable):
openstack-heat-engine-2015.1.2-8.el7ost.noarch

How reproducible:
After 7.1 OSP-D deploys the overcloud and is populated with instances an
update is perform.
Comment 3 Toure Dunnon 2016-02-12 13:50 EST
Created attachment 1123619 [details]
heat engine log backtrace
Comment 4 Mike Burns 2016-02-12 13:54:45 EST
*** Bug 1307124 has been marked as a duplicate of this bug. ***
Comment 5 Alexander Chuzhoy 2016-02-12 14:01:53 EST
Reproduced on 7.0 and 7.1:
7.0:
Feb 12 12:01:49 Installed: openstack-heat-engine-2015.1.0-4.el7ost.noarch
Feb 12 13:27:40 Updated: openstack-heat-engine-2015.1.2-9.el7ost.noarch

7.1:
Feb 12 11:45:20 Installed: openstack-heat-engine-2015.1.1-5.el7ost.noarch
Feb 12 13:16:10 Updated: openstack-heat-engine-2015.1.2-9.el7ost.noarch



During the yum update the openstack-heat-engine service fails.
The W/A is to start it.
Comment 6 Alexander Chuzhoy 2016-02-12 14:07 EST
Created attachment 1123623 [details]
7.1 heat logs
Comment 7 Alexander Chuzhoy 2016-02-12 14:08 EST
Created attachment 1123624 [details]
7.0 heat logs
Comment 8 Zane Bitter 2016-02-12 17:20:08 EST
2016-02-12 13:18:46.487 13936 CRITICAL heat-engine [req-33db6d70-974c-4210-8311-b85ba49fb031 - -] DBConnectionError: (_mysql_exceptions.OperationalError) (2006, 'MySQL server has gone away') [SQL: u'SELECT watch
_rule.created_at AS watch_rule_created_at, watch_rule.updated_at AS watch_rule_updated_at, watch_rule.id AS watch_rule_id, watch_rule.name AS watch_rule_name, watch_rule.rule AS watch_rule_rule, watch_rule.state
 AS watch_rule_state, watch_rule.last_evaluated AS watch_rule_last_evaluated, watch_rule.stack_id AS watch_rule_stack_id \nFROM watch_rule \nWHERE watch_rule.stack_id = %s'] [parameters: ('fda3526e-7dbe-4f23-bb3
d-a1b8b3bd0fa8',)]
2016-02-12 13:18:46.487 13936 TRACE heat-engine Traceback (most recent call last):
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/bin/heat-engine", line 83, in <module>
2016-02-12 13:18:46.487 13936 TRACE heat-engine     srv.create_periodic_tasks()
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 105, in wrapper
2016-02-12 13:18:46.487 13936 TRACE heat-engine     return f(*args, **kwargs)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 332, in create_periodic_tasks
2016-02-12 13:18:46.487 13936 TRACE heat-engine     self.stack_watch.start_watch_task(s.id, admin_context)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib/python2.7/site-packages/heat/engine/service_stack_watch.py", line 57, in start_watch_task
2016-02-12 13:18:46.487 13936 TRACE heat-engine     if stack_has_a_watchrule(stack_id):
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib/python2.7/site-packages/heat/engine/service_stack_watch.py", line 52, in stack_has_a_watchrule
2016-02-12 13:18:46.487 13936 TRACE heat-engine     if stack_has_a_watchrule(child.id):
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib/python2.7/site-packages/heat/engine/service_stack_watch.py", line 52, in stack_has_a_watchrule
2016-02-12 13:18:46.487 13936 TRACE heat-engine     if stack_has_a_watchrule(child.id):
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib/python2.7/site-packages/heat/engine/service_stack_watch.py", line 52, in stack_has_a_watchrule
2016-02-12 13:18:46.487 13936 TRACE heat-engine     if stack_has_a_watchrule(child.id):
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib/python2.7/site-packages/heat/engine/service_stack_watch.py", line 36, in stack_has_a_watchrule
2016-02-12 13:18:46.487 13936 TRACE heat-engine     wrs = watch_rule_object.WatchRule.get_all_by_stack(cnxt, sid)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib/python2.7/site-packages/heat/objects/watch_rule.py", line 77, in get_all_by_stack
2016-02-12 13:18:46.487 13936 TRACE heat-engine     stack_id)]
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib/python2.7/site-packages/heat/db/api.py", line 248, in watch_rule_get_all_by_stack
2016-02-12 13:18:46.487 13936 TRACE heat-engine     return IMPL.watch_rule_get_all_by_stack(context, stack_id)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib/python2.7/site-packages/heat/db/sqlalchemy/api.py", line 718, in watch_rule_get_all_by_stack
2016-02-12 13:18:46.487 13936 TRACE heat-engine     context, models.WatchRule).filter_by(stack_id=stack_id).all()
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2399, in all
2016-02-12 13:18:46.487 13936 TRACE heat-engine     return list(self)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2516, in __iter__
2016-02-12 13:18:46.487 13936 TRACE heat-engine     return self._execute_and_instances(context)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2531, in _execute_and_instances
2016-02-12 13:18:46.487 13936 TRACE heat-engine     result = conn.execute(querycontext.statement, self._params)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 914, in execute
2016-02-12 13:18:46.487 13936 TRACE heat-engine     return meth(self, multiparams, params)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
2016-02-12 13:18:46.487 13936 TRACE heat-engine     return connection._execute_clauseelement(self, multiparams, params)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1010, in _execute_clauseelement
2016-02-12 13:18:46.487 13936 TRACE heat-engine     compiled_sql, distilled_params
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1146, in _execute_context
2016-02-12 13:18:46.487 13936 TRACE heat-engine     context)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1337, in _handle_dbapi_exception
2016-02-12 13:18:46.487 13936 TRACE heat-engine     util.raise_from_cause(newraise, exc_info)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
2016-02-12 13:18:46.487 13936 TRACE heat-engine     reraise(type(exception), exception, tb=exc_tb)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
2016-02-12 13:18:46.487 13936 TRACE heat-engine     context)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
2016-02-12 13:18:46.487 13936 TRACE heat-engine     cursor.execute(statement, parameters)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/MySQLdb/cursors.py", line 174, in execute
2016-02-12 13:18:46.487 13936 TRACE heat-engine     self.errorhandler(self, exc, value)
2016-02-12 13:18:46.487 13936 TRACE heat-engine   File "/usr/lib64/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
2016-02-12 13:18:46.487 13936 TRACE heat-engine     raise errorclass, errorvalue
2016-02-12 13:18:46.487 13936 TRACE heat-engine DBConnectionError: (_mysql_exceptions.OperationalError) (2006, 'MySQL server has gone away') [SQL: u'SELECT watch_rule.created_at AS watch_rule_created_at, watch_rule.updated_at AS watch_rule_updated_at, watch_rule.id AS watch_rule_id, watch_rule.name AS watch_rule_name, watch_rule.rule AS watch_rule_rule, watch_rule.state AS watch_rule_state, watch_rule.last_evaluated AS watch_rule_last_evaluated, watch_rule.stack_id AS watch_rule_stack_id \nFROM watch_rule \nWHERE watch_rule.stack_id = %s'] [parameters: ('fda3526e-7dbe-4f23-bb3d-a1b8b3bd0fa8',)]
2016-02-12 13:18:46.487 13936 TRACE heat-engine 
2016-02-12 13:18:46.611 13995 INFO heat.openstack.common.service [-] Parent process has died unexpectedly, exiting
2016-02-12 13:18:46.611 13993 INFO heat.openstack.common.service [-] Parent process has died unexpectedly, exiting
2016-02-12 13:18:46.612 13995 DEBUG heat.engine.service [-] Attempting to stop engine service... _stop_rpc_server /usr/lib/python2.7/site-packages/heat/engine/service.py:373
2016-02-12 13:18:46.612 13993 DEBUG heat.engine.service [-] Attempting to stop engine service... _stop_rpc_server /usr/lib/python2.7/site-packages/heat/engine/service.py:373
2016-02-12 13:18:46.613 13992 INFO heat.openstack.common.service [-] Parent process has died unexpectedly, exiting

Could be just bad timing with the database going down just as Heat is starting up maybe?
Comment 9 Angus Thomas 2016-02-15 09:34:43 EST
Doctext supplied for the 7.3 release.
Comment 12 Omri Hochman 2016-02-15 13:06:06 EST
I would like to explain how does it looks from the user side - and why IMO it can cause a very bad user experience . 

[stack@instack ~]$ date
Mon Feb 15 12:31:57 EST 2016
[stack@instack ~]$ yes "" | openstack overcloud update stack overcloud -i --templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml  -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml
...
...hangs
...hangs
...hangs 
... 
...10 minutes later fails on heat timeout ..

That is the output :(non informative message ):
-------------------
"ERROR: openstack ERROR: Timed out waiting for a reply to message ID aeeb27a4937d4825be6de2e5a6c57dbe

looking at the heat log you don't see anything meaningful, even worse you see heat problem to connect to DB (you might think it's connectivity issue) : 

2016-02-12 13:18:46.487 13936 CRITICAL heat-engine [req-33db6d70-974c-4210-8311-b85ba49fb031 - -] DBConnectionError: (_mysql_exceptions.OperationalError) (2006, 'MySQL server has gone away') [SQL: u'SELECT watch


Only after digging enough into /var/log/messages - if you're lucky you'll notice the heat-engine service failed during the yum update (in messages it fails without a trace-back - so it hard to spot it)  
 

[root@instack ~]# openstack-status | grep heat
openstack-heat-api:                     active
openstack-heat-api-cfn:                 active
openstack-heat-api-cloudwatch:          active
openstack-heat-engine:                  failed
Warning keystonerc not sourced


then apply workaround..
Comment 13 Steve Baker 2016-02-15 17:29:59 EST
I'll look into fixing this by moving create_periodic_tasks to its own thread where it can retry until the db is available. This will allow systemd.notify to be called even when the db is down.
Comment 14 Steve Baker 2016-02-15 19:12:09 EST
Here is a proposed upstream fix, will backport now and test heat's behaviour without a database.
Comment 15 Steve Baker 2016-02-15 21:28:26 EST
This downstream change allowed heat-engine to start with a stopped mariadb, and then recover once mariadb was started 90 minutes later.
Comment 16 Jaromir Coufal 2016-02-16 09:06:00 EST
Workaround available, not a blocker for z4. Though, I asked Lon if we can add the backported patch to the latest build for z4 if all possible.
-- Jarda
Comment 22 errata-xmlrpc 2016-03-10 12:07:10 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0435.html

Note You need to log in before you can comment on or make changes to this bug.