Description of problem:
The openstack-heat-engine resource is in failed status and the service is stopped after a non-gracefully reset controller comes back online in an OpenStack HA deployment. The OpenStack services are managed by Pacemaker on the controllers in an HA setup - Pacemaker sets constraints and start ordering of services and starts/stops them as needed. Pacemaker uses systemd for managing the heat-engine service, and it seems that the systemd unit settings of heat-engine somehow conflict with the Pacemaker constraints and ordering settings. I can see the following line in /usr/lib/systemd/system/openstack-heat-engine.service:

After=syslog.target network.target qpidd.service mysqld.service openstack-keystone.service tgtd.service openstack-glance-api.service openstack-glance-registry.service openstack-nova-api.service openstack-nova-objectstore.service openstack-nova.compute.service openstack-nova-network.service openstack-nova-volume.service openstack-nova-scheduler.service openstack-nova-cert.service openstack-cinder-volume.service

If I reduce it to the standard:

After=syslog.target network.target

then the problem disappears and the heat-engine resource is started and the service is running after the reset. Most likely the unit's ordering somehow conflicts with the Pacemaker settings; in any case the list references services which are no longer used, service ordering should still be managed by Pacemaker, and in the future every service should be able to run on its own.

Version-Release number of selected component (if applicable):
openstack-heat-engine-6.0.0-7.el7ost.noarch
python-heatclient-1.2.0-1.el7ost.noarch
openstack-heat-api-cfn-6.0.0-7.el7ost.noarch
openstack-heat-api-cloudwatch-6.0.0-7.el7ost.noarch
openstack-heat-common-6.0.0-7.el7ost.noarch
openstack-heat-api-6.0.0-7.el7ost.noarch

How reproducible:
Often

Steps to Reproduce:
1. Non-graceful reset of a controller in an HA OpenStack deployment

Actual results:
Failed Actions:
* openstack-heat-engine_start_0 on overcloud-controller-0 'not running' (7): call=303, status=complete, exitreason='none', last-rc-change='Sat Jul 16 20:57:35 2016', queued=0ms, exec=2120ms

Expected results:
The openstack-heat-engine resource starts successfully and the service is running after the reset controller rejoins the cluster.

Additional info:
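For reference, the workaround above can be applied without modifying the packaged unit file by using a standard systemd drop-in override. A minimal sketch, assuming the usual override mechanism (the drop-in file name is arbitrary); an empty After= assignment resets the list inherited from the packaged unit:

  # /etc/systemd/system/openstack-heat-engine.service.d/pacemaker-ordering.conf
  # Clear the packaged After= list and keep only the standard targets,
  # leaving service start ordering to Pacemaker constraints.
  [Unit]
  After=
  After=syslog.target network.target

Run "systemctl daemon-reload" afterwards so systemd picks up the override.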
I was under the impression that Pacemaker used its own separate config and did not rely on the regular systemd unit file.
(In reply to Zane Bitter from comment #2)
> I was under the impression that Pacemaker used its own separate config and
> did not rely on the regular systemd unit file.

No, we use systemd unit files for most OpenStack services.
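For context, Pacemaker drives such services through its systemd resource class, so the unit file's own ordering directives sit underneath the cluster's constraints. A minimal sketch of how a resource like this is typically defined (illustrative only; the exact resource and constraint definitions in the overcloud may differ):

  # Create a cloned Pacemaker resource backed by the systemd unit; Pacemaker
  # then calls systemd to start/stop it on each controller.
  pcs resource create openstack-heat-engine systemd:openstack-heat-engine --clone
  # Start ordering is expressed as cluster constraints rather than in the
  # unit file, e.g.:
  pcs constraint order start openstack-heat-api-clone then openstack-heat-engine-clone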
(In reply to Marian Krcmarik from comment #0)
> Pacemaker uses systemd for managing the heat-engine service, and it seems
> that the systemd unit settings of heat-engine somehow conflict with the
> Pacemaker constraints and ordering settings. I can see the following line
> in /usr/lib/systemd/system/openstack-heat-engine.service:
>
> After=syslog.target network.target qpidd.service mysqld.service
> openstack-keystone.service tgtd.service openstack-glance-api.service
> openstack-glance-registry.service openstack-nova-api.service
> openstack-nova-objectstore.service openstack-nova.compute.service
> openstack-nova-network.service openstack-nova-volume.service
> openstack-nova-scheduler.service openstack-nova-cert.service
> openstack-cinder-volume.service
>
> If I reduce it to the standard:
>
> After=syslog.target network.target

I think the heat systemd unit needs to drop all of the above dependencies, which are/could be in conflict with composable roles and scale-out requirements anyway, AFAICT.

Fabio
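A sketch of what the trimmed [Unit] section could look like if the stale dependencies are dropped as suggested (only the reduced After= line comes from this report; the rest of the section is illustrative):

  [Unit]
  # Ordering against other OpenStack services is intentionally omitted here
  # and left to Pacemaker constraints.
  Description=OpenStack Heat Engine
  After=syslog.target network.target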
I'm adding my experience since I think this is strictly related to this problem. While testing the resource HA behavior in an RDO/Mitaka deployment (by stopping and then starting the master/slave and core resources galera, redis and rabbit) I hit the exact same problem. These were the cluster's failed actions after the restart:

Failed Actions:
* openstack-nova-scheduler_start_0 on overcloud-controller-2 'not running' (7): call=824, status=complete, exitreason='none', last-rc-change='Wed Aug 24 01:21:43 2016', queued=0ms, exec=2213ms
* openstack-heat-engine_start_0 on overcloud-controller-2 'not running' (7): call=816, status=complete, exitreason='none', last-rc-change='Wed Aug 24 01:21:21 2016', queued=0ms, exec=2076ms
* openstack-nova-scheduler_start_0 on overcloud-controller-1 'not running' (7): call=835, status=complete, exitreason='none', last-rc-change='Wed Aug 24 01:21:43 2016', queued=0ms, exec=2213ms
* openstack-heat-engine_start_0 on overcloud-controller-1 'not running' (7): call=827, status=complete, exitreason='none', last-rc-change='Wed Aug 24 01:21:21 2016', queued=0ms, exec=2073ms
* openstack-nova-scheduler_start_0 on overcloud-controller-0 'not running' (7): call=840, status=complete, exitreason='none', last-rc-change='Wed Aug 24 01:21:43 2016', queued=0ms, exec=2216ms
* openstack-heat-engine_start_0 on overcloud-controller-0 'not running' (7): call=832, status=complete, exitreason='none', last-rc-change='Wed Aug 24 01:21:21 2016', queued=0ms, exec=2076ms

So not only openstack-heat-engine was impacted but also openstack-nova-scheduler, and note that this is systematic and reproducible every time. The fact is that after applying Marian's workaround (limiting the After= list inside /usr/lib/systemd/system/openstack-heat-engine.service) the problem disappears, not only for openstack-heat-engine but also for openstack-nova-scheduler.

You can find here [1] all the sosreports from the overcloud and other logs taken while hitting this problem.

[1] http://file.rdu.redhat.com/~rscarazz/BZ1357229/
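For anyone reproducing this, a sketch of the standard commands used to confirm and clear these failures after applying the workaround (generic pcs/systemctl invocations, not taken from the logs above):

  # Show cluster status, including the Failed Actions section quoted above
  pcs status
  # After trimming the unit file, reload systemd and clear the failed state
  # so Pacemaker retries the start
  systemctl daemon-reload
  pcs resource cleanup openstack-heat-engine
  pcs resource cleanup openstack-nova-scheduler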