Bug 1357229

Summary: openstack-heat-engine resource in failed status after controller non-graceful reset
Product: Red Hat OpenStack Reporter: Marian Krcmarik <mkrcmari>
Component: rhosp-director Assignee: Angus Thomas <athomas>
Status: CLOSED CURRENTRELEASE QA Contact: Omri Hochman <ohochman>
Severity: high Docs Contact:
Priority: unspecified    
Version: 9.0 (Mitaka) CC: abeekhof, aschultz, dbecker, fdinitto, mburns, mcornea, mkrcmari, morazi, oblaut, rhel-osp-director-maint, rscarazz, sbaker, shardy, srevivo
Target Milestone: --- Keywords: AutomationBlocker, Regression, ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-07-28 18:17:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Marian Krcmarik 2016-07-16 22:13:15 UTC
Description of problem:
The openstack-heat-engine resource is in failed status and the service is stopped after a non-gracefully reset controller comes back online in an OpenStack HA deployment. The OpenStack services are managed by pacemaker on the controllers in an HA setup - pacemaker sets constraints and start ordering for the services and starts/stops them as needed. Pacemaker uses systemd to manage the heat-engine service, and it seems that the systemd unit settings of heat-engine somehow conflict with the pacemaker constraints and ordering settings. I can see the following line in /usr/lib/systemd/system/openstack-heat-engine.service:
After=syslog.target network.target qpidd.service mysqld.service openstack-keystone.service tgtd.service openstack-glance-api.service openstack-glance-registry.service openstack-nova-api.service openstack-nova-objectstore.service openstack-nova.compute.service openstack-nova-network.service openstack-nova-volume.service openstack-nova-scheduler.service openstack-nova-cert.service openstack-cinder-volume.service

If I reduce it to the standard:
After=syslog.target network.target

Then the problem disappears and the heat-engine resource is started and the service is running after the reset.
Most likely the unit's ordering somehow conflicts with the pacemaker settings. In any case, the After= list references some services which are not used anymore, service ordering should still be managed by pacemaker, and in the future every service should be able to run on its own.
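
For reference, the same workaround can probably be applied without touching the packaged unit file (which a later RPM update would overwrite) by copying the unit to /etc/systemd/system and reducing the After= line there; a local copy takes precedence over /usr/lib/systemd/system, and as far as I know ordering dependencies cannot be removed through a drop-in, only through a full override. Roughly (not verified on the affected build):

# cp /usr/lib/systemd/system/openstack-heat-engine.service /etc/systemd/system/openstack-heat-engine.service
# sed -i 's/^After=.*/After=syslog.target network.target/' /etc/systemd/system/openstack-heat-engine.service
# systemctl daemon-reload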

Version-Release number of selected component (if applicable):
openstack-heat-engine-6.0.0-7.el7ost.noarch
python-heatclient-1.2.0-1.el7ost.noarch
openstack-heat-api-cfn-6.0.0-7.el7ost.noarch
openstack-heat-api-cloudwatch-6.0.0-7.el7ost.noarch
openstack-heat-common-6.0.0-7.el7ost.noarch
openstack-heat-api-6.0.0-7.el7ost.noarch

How reproducible:
Often

Steps to Reproduce:
1. Non-graceful reset of controller in HA openstack deployment


Actual results:
Failed Actions:
* openstack-heat-engine_start_0 on overcloud-controller-0 'not running' (7): call=303, status=complete, exitreason='none',
    last-rc-change='Sat Jul 16 20:57:35 2016', queued=0ms, exec=2120ms

Expected results:
The openstack-heat-engine resource is started and the service is running after the controller comes back online.

Additional info:

Comment 2 Zane Bitter 2016-07-18 13:30:14 UTC
I was under the impression that Pacemaker used its own separate config and did not rely on the regular systemd unit file.

Comment 3 Andrew Beekhof 2016-07-18 23:30:36 UTC
(In reply to Zane Bitter from comment #2)
> I was under the impression that Pacemaker used its own separate config and
> did not rely on the regular systemd unit file.

No, we use systemd unit files for most openstack services.
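
For anyone checking this on their own deployment, something along these lines should show both sides of the ordering - the pacemaker resource definition and constraints, and the After= list that systemd applies when pacemaker starts the systemd-class resource (the resource name is taken from the failed actions in this report and may differ elsewhere):

# pcs resource show openstack-heat-engine
# pcs constraint order | grep -i heat-engine
# systemctl show -p After openstack-heat-engine.service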

Comment 4 Fabio Massimo Di Nitto 2016-07-20 07:11:35 UTC
(In reply to Marian Krcmarik from comment #0)
> Description of problem:
> The openstack-heat-engine resource is in failed status and the service is
> stopped after a non-gracefully reset controller comes back online in an
> OpenStack HA deployment. The OpenStack services are managed by pacemaker on
> the controllers in an HA setup - pacemaker sets constraints and start
> ordering for the services and starts/stops them as needed. Pacemaker uses
> systemd to manage the heat-engine service, and it seems that the systemd
> unit settings of heat-engine somehow conflict with the pacemaker constraints
> and ordering settings. I can see the following line in
> /usr/lib/systemd/system/openstack-heat-engine.service:
> After=syslog.target network.target qpidd.service mysqld.service
> openstack-keystone.service tgtd.service openstack-glance-api.service
> openstack-glance-registry.service openstack-nova-api.service
> openstack-nova-objectstore.service openstack-nova.compute.service
> openstack-nova-network.service openstack-nova-volume.service
> openstack-nova-scheduler.service openstack-nova-cert.service
> openstack-cinder-volume.service
> 
> If I reduce it to the standard:
> After=syslog.target network.target

I think the heat systemd unit needs to drop all of the above dependencies anyway; AFAICT they are, or could be, in conflict with composable roles and scale-out requirements.

Fabio

> 
> Then the problem disappears and the heat-engine resource is started and the
> service is running after the reset.
> Most likely the unit's ordering somehow conflicts with the pacemaker
> settings. In any case, the After= list references some services which are
> not used anymore, service ordering should still be managed by pacemaker, and
> in the future every service should be able to run on its own.
> 
> Version-Release number of selected component (if applicable):
> openstack-heat-engine-6.0.0-7.el7ost.noarch
> python-heatclient-1.2.0-1.el7ost.noarch
> openstack-heat-api-cfn-6.0.0-7.el7ost.noarch
> openstack-heat-api-cloudwatch-6.0.0-7.el7ost.noarch
> openstack-heat-common-6.0.0-7.el7ost.noarch
> openstack-heat-api-6.0.0-7.el7ost.noarch
> 
> How reproducible:
> Often
> 
> Steps to Reproduce:
> 1. Non-graceful reset of controller in HA openstack deployment
> 
> 
> Actual results:
> Failed Actions:
> * openstack-heat-engine_start_0 on overcloud-controller-0 'not running' (7):
> call=303, status=complete, exitreason='none',
>     last-rc-change='Sat Jul 16 20:57:35 2016', queued=0ms, exec=2120ms
> 
> Expected results:
> 
> 
> Additional info:

Comment 7 Raoul Scarazzini 2016-08-25 09:17:55 UTC
I'm adding my experience since I think it is strictly related to this problem. While testing the resource HA behavior in an RDO/Mitaka deployment (by stopping and then starting the master/slave and core resources galera, redis and rabbit) I hit the exact same problem.
These were the cluster's failed actions after the restart:

Failed Actions:
* openstack-nova-scheduler_start_0 on overcloud-controller-2 'not running' (7): call=824, status=complete, exitreason='none',
    last-rc-change='Wed Aug 24 01:21:43 2016', queued=0ms, exec=2213ms
* openstack-heat-engine_start_0 on overcloud-controller-2 'not running' (7): call=816, status=complete, exitreason='none',
    last-rc-change='Wed Aug 24 01:21:21 2016', queued=0ms, exec=2076ms
* openstack-nova-scheduler_start_0 on overcloud-controller-1 'not running' (7): call=835, status=complete, exitreason='none',
    last-rc-change='Wed Aug 24 01:21:43 2016', queued=0ms, exec=2213ms
* openstack-heat-engine_start_0 on overcloud-controller-1 'not running' (7): call=827, status=complete, exitreason='none',
    last-rc-change='Wed Aug 24 01:21:21 2016', queued=0ms, exec=2073ms
* openstack-nova-scheduler_start_0 on overcloud-controller-0 'not running' (7): call=840, status=complete, exitreason='none',
    last-rc-change='Wed Aug 24 01:21:43 2016', queued=0ms, exec=2216ms
* openstack-heat-engine_start_0 on overcloud-controller-0 'not running' (7): call=832, status=complete, exitreason='none',
    last-rc-change='Wed Aug 24 01:21:21 2016', queued=0ms, exec=2076ms

So not only openstack-heat-engine was impacted but also openstack-nova-scheduler, and note that this is systematic and reproducible every time.
The fact is that after applying Marian's workaround (i.e. limiting the After= list inside /usr/lib/systemd/system/openstack-heat-engine.service) the problem disappears, not only for openstack-heat-engine but also for openstack-nova-scheduler.
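
For completeness, after changing the unit file the cluster still has to pick it up and retry the failed starts; roughly, on each controller where the unit was edited (resource names as in the failed actions above):

# systemctl daemon-reload
# pcs resource cleanup openstack-heat-engine
# pcs resource cleanup openstack-nova-scheduler
# pcs status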

You can find here [1] all the sosreports from the overcloud and other logs taken while hitting this problem.

[1] http://file.rdu.redhat.com/~rscarazz/BZ1357229/