Bug 1357229 - openstack-heat-engine resource in failed status after controller non-graceful reset
Summary: openstack-heat-engine resource in failed status after controller non-graceful...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Angus Thomas
QA Contact: Omri Hochman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-07-16 22:13 UTC by Marian Krcmarik
Modified: 2017-07-28 18:17 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-07-28 18:17:32 UTC
Target Upstream Version:
Embargoed:



Description Marian Krcmarik 2016-07-16 22:13:15 UTC
Description of problem:
The openstack-heat-engine resource is in failed status and the service is stopped after a non-gracefully reset controller comes back online in an OpenStack HA deployment. The OpenStack services are managed by pacemaker on the controllers in an HA setup - pacemaker sets constraints and start ordering for the services and starts/stops them as needed or required. Pacemaker uses systemd to manage the heat-engine service, and it seems that the systemd unit settings of heat-engine somehow conflict with the pacemaker constraints and ordering settings. I can see the following line in /usr/lib/systemd/system/openstack-heat-engine.service:
After=syslog.target network.target qpidd.service mysqld.service openstack-keystone.service tgtd.service openstack-glance-api.service openstack-glance-registry.service openstack-nova-api.service openstack-nova-objectstore.service openstack-nova.compute.service openstack-nova-network.service openstack-nova-volume.service openstack-nova-scheduler.service openstack-nova-cert.service openstack-cinder-volume.service

If I reduce it to the standard:
After=syslog.target network.target

Then the problem disappears and the heat-engine resource is started and the service is running after the reset.
Most likely it somehow conflicts with the pacemaker settings; in any case, the list includes some services which are not used anymore, service ordering should still be managed by pacemaker, and in the future every service should be able to run on its own.
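
A minimal sketch of the same workaround applied as a systemd drop-in rather than by editing the packaged unit file in /usr/lib (a drop-in survives package updates); the override path and the daemon-reload step are standard systemd conventions, not something taken from this deployment:

  # /etc/systemd/system/openstack-heat-engine.service.d/override.conf
  [Unit]
  # An empty After= clears the ordering list inherited from the packaged unit,
  # then only the basic targets are added back.
  After=
  After=syslog.target network.target

  # Reload unit definitions so the next (pacemaker-initiated) start uses the override.
  systemctl daemon-reload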

Version-Release number of selected component (if applicable):
openstack-heat-engine-6.0.0-7.el7ost.noarch
python-heatclient-1.2.0-1.el7ost.noarch
openstack-heat-api-cfn-6.0.0-7.el7ost.noarch
openstack-heat-api-cloudwatch-6.0.0-7.el7ost.noarch
openstack-heat-common-6.0.0-7.el7ost.noarch
openstack-heat-api-6.0.0-7.el7ost.noarch

How reproducible:
Often

Steps to Reproduce:
1. Non-graceful reset of controller in HA openstack deployment


Actual results:
Failed Actions:
* openstack-heat-engine_start_0 on overcloud-controller-0 'not running' (7): call=303, status=complete, exitreason='none',
    last-rc-change='Sat Jul 16 20:57:35 2016', queued=0ms, exec=2120ms
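
A hedged aside: a failed start like this normally stays listed in pcs status until it is cleared, so after adjusting the unit file a standard cleanup command along these lines (illustrative, not taken from this report) lets pacemaker retry the start:

  pcs resource cleanup openstack-heat-engine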

Expected results:


Additional info:

Comment 2 Zane Bitter 2016-07-18 13:30:14 UTC
I was under the impression that Pacemaker used its own separate config and did not rely on the regular systemd unit file.

Comment 3 Andrew Beekhof 2016-07-18 23:30:36 UTC
(In reply to Zane Bitter from comment #2)
> I was under the impression that Pacemaker used its own separate config and
> did not rely on the regular systemd unit file.

No, we use systemd unit files for most openstack services.
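
For anyone confirming this on their own setup, a short sketch using standard pcs and systemctl commands (not output captured from this bug) to see that the resource is systemd-backed and which unit file pacemaker will start:

  pcs resource show openstack-heat-engine       # should report class=systemd type=openstack-heat-engine
  systemctl cat openstack-heat-engine.service   # prints the unit file (and any drop-ins) that gets started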

Comment 4 Fabio Massimo Di Nitto 2016-07-20 07:11:35 UTC
(In reply to Marian Krcmarik from comment #0)
> Description of problem:
> openstack-heat-engine resource is in failed status and the service is
> stopped after non-gracefully reset controller comes back online in Openstack
> HA deployment. The openstack services are managed by pacemaker on
> controllers in HA setup - pacemaker sets constraints and start ordering of
> service and start/stop them as needed or required. The pacemaker uses
> systemd for managing the heat-engine service and It seems that systemd unit
> settings of heat-engine somehow conflict with pacemaker constraints and
> ordering setting. I can see following line in
> /usr/lib/systemd/system/openstack-heat-engine.service:
> After=syslog.target network.target qpidd.service mysqld.service
> openstack-keystone.service tgtd.service openstack-glance-api.service
> openstack-glance-registry.service openstack-nova-api.service
> openstack-nova-objectstore.service openstack-nova.compute.service
> openstack-nova-network.service openstack-nova-volume.service
> openstack-nova-scheduler.service openstack-nova-cert.service
> openstack-cinder-volume.service
> 
> If I reduce it to standard:
> After=syslog.target network.target

I think the heat systemd unit needs to drop all of the above dependencies anyway; they are, or could be, in conflict with composable roles and scale-out requirements AFAICT.

Fabio

Comment 7 Raoul Scarazzini 2016-08-25 09:17:55 UTC
I'm adding my experience since I think it is strictly related to this problem. While testing the resource HA behavior in an RDO/Mitaka deployment (by stopping and then starting the master/slave and core resources galera, redis and rabbit) I hit the exact same problem.
These were the cluster's failed actions after the restart:

Failed Actions:
* openstack-nova-scheduler_start_0 on overcloud-controller-2 'not running' (7): call=824, status=complete, exitreason='none',
    last-rc-change='Wed Aug 24 01:21:43 2016', queued=0ms, exec=2213ms
* openstack-heat-engine_start_0 on overcloud-controller-2 'not running' (7): call=816, status=complete, exitreason='none',
    last-rc-change='Wed Aug 24 01:21:21 2016', queued=0ms, exec=2076ms
* openstack-nova-scheduler_start_0 on overcloud-controller-1 'not running' (7): call=835, status=complete, exitreason='none',
    last-rc-change='Wed Aug 24 01:21:43 2016', queued=0ms, exec=2213ms
* openstack-heat-engine_start_0 on overcloud-controller-1 'not running' (7): call=827, status=complete, exitreason='none',
    last-rc-change='Wed Aug 24 01:21:21 2016', queued=0ms, exec=2073ms
* openstack-nova-scheduler_start_0 on overcloud-controller-0 'not running' (7): call=840, status=complete, exitreason='none',
    last-rc-change='Wed Aug 24 01:21:43 2016', queued=0ms, exec=2216ms
* openstack-heat-engine_start_0 on overcloud-controller-0 'not running' (7): call=832, status=complete, exitreason='none',
    last-rc-change='Wed Aug 24 01:21:21 2016', queued=0ms, exec=2076ms

So not only openstack-heat-engine was impacted but also openstack-nova-scheduler, and note that this is systematic and reproducible every time.
The fact is that after applying Marian's workaround (i.e. limiting the After= list inside /usr/lib/systemd/system/openstack-heat-engine.service) the problem disappears, not only for openstack-heat-engine but also for openstack-nova-scheduler.
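
A hedged sketch of how the two ordering sources can be compared when debugging this (standard systemctl/pcs commands, not taken from the sosreports below):

  systemctl show openstack-heat-engine.service -p After   # effective systemd After= ordering
  pcs constraint order show                                # pacemaker start-ordering constraints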

You can find here [1] all the sosreports from the overcloud and other logs taken while hitting this problem.

[1] http://file.rdu.redhat.com/~rscarazz/BZ1357229/

