Bug 1292101 - Unable to deploy overcloud operation times out waiting for messages
Status: CLOSED WORKSFORME
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware: All Linux
Priority: urgent Severity: high
Target Milestone: ga
Target Release: 8.0 (Liberty)
Assigned To: Hugh Brock
QA Contact: yeylon@redhat.com
Depends On:
Blocks:
Reported: 2015-12-16 08:28 EST by Anand Nande
Modified: 2016-04-18 02:59 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-05 07:27:39 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Comment 2 Peter Lemenkov 2015-12-21 09:31:43 EST
I'd really love to see the RabbitMQ logs as well. Next time, please also supply the contents of the /var/log/rabbitm directory. This would help with debugging.

Also, please check that the network is fine. Just telnet to rabbitmq-host on port 5672 and type "HELLO THERE" (or any other polite greeting) followed by Ctrl+D (to disconnect). If this message is logged, then RabbitMQ works fine, at least at the moment of checking.
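As a non-interactive alternative to the telnet check above, here is a minimal sketch that only tests whether a TCP connection to the broker port can be opened. It assumes bash with /dev/tcp support and the coreutils `timeout` command; the helper name `check_tcp_port` and the 5-second limit are illustrative choices, not part of any OpenStack or RabbitMQ tooling. Unlike the telnet greeting, it sends nothing, so it will not produce a RabbitMQ log entry.

```shell
# Hypothetical helper: returns 0 if a TCP connection to host:port
# can be opened within 5 seconds, non-zero otherwise.
# Assumes bash (for /dev/tcp) and coreutils timeout are available.
check_tcp_port() {
  local host="$1" port="$2"
  # The host and port are passed as positional arguments to the inner
  # shell so they are never interpolated into the command string.
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example usage (rabbitmq-host is a placeholder, as in the comment above):
#   check_tcp_port rabbitmq-host 5672 && echo "port 5672 reachable"
```

A zero exit status only shows that something accepts connections on that port; whether RabbitMQ is healthy still has to be judged from its logs.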

Also, please try capturing a tcpdump log of the traffic on port 5672, e.g. something like

tcpdump -s 0 -i any -n -w ~/amqp.log port 5672

If there is activity, then it looks like everything works fine at this level.

After that I'd try checking network settings (using sysctl).
Comment 4 Anand Nande 2015-12-22 03:52:59 EST
One more thing that we are observing on their undercloud/director system is that openstack-heat-engine.service is in a failed state:

[stack@osp7 ~]$ sudo service openstack-heat-engine status
Redirecting to /bin/systemctl status  openstack-heat-engine.service
● openstack-heat-engine.service - Openstack Heat Engine Service
   Loaded: loaded (/usr/lib/systemd/system/openstack-heat-engine.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2015-12-22 14:10:41 IST; 31s ago
  Process: 23540 ExecStart=/usr/bin/heat-engine (code=exited, status=1/FAILURE)
 Main PID: 23540 (code=exited, status=1/FAILURE)

Dec 22 14:10:41 osp7.sdiad.com heat-engine[23540]: util.raise_from_cause(newraise, exc_info)
Dec 22 14:10:41 osp7.sdiad.com heat-engine[23540]: File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
Dec 22 14:10:41 osp7.sdiad.com heat-engine[23540]: reraise(type(exception), exception, tb=exc_tb)
Dec 22 14:10:41 osp7.sdiad.com heat-engine[23540]: File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
Dec 22 14:10:41 osp7.sdiad.com heat-engine[23540]: context)
Dec 22 14:10:41 osp7.sdiad.com heat-engine[23540]: File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
Dec 22 14:10:41 osp7.sdiad.com heat-engine[23540]: cursor.execute(statement, parameters)
Dec 22 14:10:41 osp7.sdiad.com systemd[1]: openstack-heat-engine.service: main process exited, code=exited, status=1/FAILURE
Dec 22 14:10:41 osp7.sdiad.com systemd[1]: Unit openstack-heat-engine.service entered failed state.
Dec 22 14:10:41 osp7.sdiad.com systemd[1]: openstack-heat-engine.service failed.

whereas it is "active (running)" on my director system.
I am not sure whether this leads to any timeouts, though. Would it be a good idea to start it manually and then test 'heat stack-delete overcloud' to see if that succeeds?
Comment 5 Anand Nande 2015-12-22 03:57:07 EST
[stack@osp7 ~]$ sudo openstack-service status | grep -i fail
MainPID=0 Id=neutron-openvswitch-agent.service ActiveState=failed
MainPID=0 Id=openstack-heat-engine.service ActiveState=failed
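A small sketch for turning the output above into a plain list of failed unit names, for example to feed into a restart loop. It assumes the `MainPID=... Id=... ActiveState=...` line format shown in this comment; the function name `failed_units` is hypothetical.

```shell
# Hypothetical helper: reads `openstack-service status` style lines on
# stdin and prints the unit name of every entry in the failed state.
# Assumes the "MainPID=... Id=... ActiveState=..." format shown above.
failed_units() {
  awk '/ActiveState=failed/ {
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^Id=/) {
        sub(/^Id=/, "", $i)   # strip the "Id=" prefix, keep the unit name
        print $i
      }
    }
  }'
}

# Example usage:
#   sudo openstack-service status | failed_units
```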
Comment 6 Anand Nande 2015-12-23 07:29:02 EST
We tried the following:

- flush keystone tokens, restart mariadb, keystone and heat services:

sudo keystone-manage token_flush && sudo systemctl reset-failed && for i in {rabbitmq-server.service,mariadb.service,openstack-keystone.service,openstack-heat-api-cfn.service,openstack-heat-api-cloudwatch.service,openstack-heat-api.service,openstack-heat-engine.service};do sudo systemctl restart $i;done

No luck.

The openstack-heat-engine service is still in a failed state:

Loaded: loaded (/usr/lib/systemd/system/openstack-heat-engine.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2015-12-22 17:07:41 IST; 27s ago
  Process: 668 ExecStart=/usr/bin/heat-engine (code=exited, status=1/FAILURE)
...
Dec 22 17:07:41 osp7.sdiad.com systemd[1]: Unit openstack-heat-engine.service entered failed state.
Dec 22 17:07:41 osp7.sdiad.com systemd[1]: openstack-heat-engine.service failed.
Comment 8 Mike Orazi 2015-12-23 08:12:51 EST
The heat log seems to indicate the root cause of its failures is that a table heat expects to exist is not being found.  Is it possible that the heat database was manually edited?

From the problem description, it seems heat was initially responding as expected, and sometime after the initial timeout something changed that left heat unable to start.
Comment 9 Anand Nande 2015-12-23 08:30:14 EST
I could see the following in the output of "journalctl -u openstack-heat-engine"
---
Dec 23 17:26:38 osp7.sdiad.com heat-engine[11996]: ProgrammingError: (_mysql_exceptions.ProgrammingError) (1146, "Table 'heat.service' doesn't exist") [SQL: u'SELECT service.created_at AS service_created_at, service.updated_at AS service_updated_at, service.deleted_at AS service_deleted_at, service.id AS service_id, service.engine_id AS service_engine_id, service.host AS service_host, service.hostname AS service_hostname, service.`binary` AS service_binary, service.topic AS service_topic, service.report_interval AS service_report_interval \nFROM service \nWHERE service.host = %s AND service.`binary` = %s AND service.hostname = %s'] [parameters: ('osp7.sdiad.com', 'heat-engine', 'osp7.sdiad.com')]
---

We ran:
$ sudo heat-manage db_sync
$ sudo systemctl start openstack-heat-engine.service

This did the trick: it built (or rebuilt) the 'heat.service' table, and systemctl was able to start the heat-engine service.
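To spot this failure mode quickly, the journal output can be filtered for MySQL error 1146 messages. The following is a minimal sketch that assumes the error text has the form shown in the log excerpt above (`Table '<db>.<table>' doesn't exist`); the function name `missing_tables` is hypothetical.

```shell
# Hypothetical helper: reads log lines on stdin and prints each distinct
# table named in a "Table '<db>.<table>' doesn't exist" error message.
missing_tables() {
  sed -n "s/.*Table '\([^']*\)' doesn't exist.*/\1/p" | sort -u
}

# Example usage:
#   sudo journalctl -u openstack-heat-engine | missing_tables
```

If this prints anything, the heat schema is incomplete, and running `heat-manage db_sync` as described above is the step that resolved it in this case.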
Comment 10 Anand Nande 2015-12-24 05:45:00 EST
(In reply to Mike Orazi from comment #8)
> The heat log seems to indicate the root cause of its failures is that a
> table heat expects to exist is not being found.  Is it possible that the
> heat database was manually edited?

No, this was not manually edited. A previous "openstack deploy" was run before this (done over a remote session, for which we don't have the logs).
Could the deploy have deleted the heat.service table?

> 
> From the problem description it seems heat was initially responding as
> expecting and sometime after the initial time out something changed to cause
> heat to be unable to start.

Right. Is there anything in the code that we can point to that deleted the table, causing future deploys, stack-delete, stack-list (basically any heat-related operation) to fail?
Comment 13 Hugh Brock 2016-02-05 07:27:39 EST
I'm closing this WORKSFORME since we do not have a reproducer for it (I don't believe there is any code in director that randomly drops tables from Heat's database). Please re-open if it reproduces. Thanks.
