Bug 1357229
| Summary: | openstack-heat-engine resource in failed status after controller non-graceful reset | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Marian Krcmarik <mkrcmari> |
| Component: | rhosp-director | Assignee: | Angus Thomas <athomas> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Omri Hochman <ohochman> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 9.0 (Mitaka) | CC: | abeekhof, aschultz, dbecker, fdinitto, mburns, mcornea, mkrcmari, morazi, oblaut, rhel-osp-director-maint, rscarazz, sbaker, shardy, srevivo |
| Target Milestone: | --- | Keywords: | AutomationBlocker, Regression, ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-07-28 18:17:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Marian Krcmarik
2016-07-16 22:13:15 UTC
I was under the impression that Pacemaker used its own separate config and did not rely on the regular systemd unit file.

(In reply to Zane Bitter from comment #2)
> I was under the impression that Pacemaker used its own separate config and
> did not rely on the regular systemd unit file.

No, we use systemd unit files for most openstack services.

(In reply to Marian Krcmarik from comment #0)
> Description of problem:
> openstack-heat-engine resource is in failed status and the service is
> stopped after a non-gracefully reset controller comes back online in an
> Openstack HA deployment. The openstack services are managed by pacemaker on
> controllers in an HA setup - pacemaker sets constraints and start ordering
> of services and starts/stops them as needed or required. Pacemaker uses
> systemd for managing the heat-engine service, and it seems that the systemd
> unit settings of heat-engine somehow conflict with the pacemaker constraints
> and ordering settings. I can see the following line in
> /usr/lib/systemd/system/openstack-heat-engine.service:
> After=syslog.target network.target qpidd.service mysqld.service
> openstack-keystone.service tgtd.service openstack-glance-api.service
> openstack-glance-registry.service openstack-nova-api.service
> openstack-nova-objectstore.service openstack-nova.compute.service
> openstack-nova-network.service openstack-nova-volume.service
> openstack-nova-scheduler.service openstack-nova-cert.service
> openstack-cinder-volume.service
>
> If I reduce it to the standard:
> After=syslog.target network.target

I think the heat systemd unit needs to drop all of the above dependencies, which are (or could be) in conflict with composable roles and scale-out requirements anyway, AFAICT.

Fabio

> Then the problem disappears and the heat-engine resource is started and the
> service is running after the reset.
> Most likely it somehow conflicts with the pacemaker settings; in any case it
> lists some services which are not used anymore, service ordering should
> still be managed by pacemaker, and in the future every service should be
> able to run on its own.
>
> Version-Release number of selected component (if applicable):
> openstack-heat-engine-6.0.0-7.el7ost.noarch
> python-heatclient-1.2.0-1.el7ost.noarch
> openstack-heat-api-cfn-6.0.0-7.el7ost.noarch
> openstack-heat-api-cloudwatch-6.0.0-7.el7ost.noarch
> openstack-heat-common-6.0.0-7.el7ost.noarch
> openstack-heat-api-6.0.0-7.el7ost.noarch
>
> How reproducible:
> Often
>
> Steps to Reproduce:
> 1. Non-graceful reset of a controller in an HA openstack deployment
>
> Actual results:
> Failed Actions:
> * openstack-heat-engine_start_0 on overcloud-controller-0 'not running' (7):
>   call=303, status=complete, exitreason='none',
>   last-rc-change='Sat Jul 16 20:57:35 2016', queued=0ms, exec=2120ms
>
> Expected results:
>
> Additional info:
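A minimal sketch of the workaround described above, assuming it is applied as a systemd drop-in rather than by editing the packaged unit file in place (the reporter edited /usr/lib/systemd/system/openstack-heat-engine.service directly; the drop-in path and file name below are assumptions, not steps taken from this report):

```shell
# Sketch only: trim the After= ordering of openstack-heat-engine via a
# drop-in, leaving start ordering of OpenStack services to pacemaker.
mkdir -p /etc/systemd/system/openstack-heat-engine.service.d

cat > /etc/systemd/system/openstack-heat-engine.service.d/ordering.conf <<'EOF'
[Unit]
# An empty After= resets the dependency list inherited from the packaged
# unit; only the basic targets are re-added.
After=
After=syslog.target network.target
EOF

# Make systemd pick up the drop-in.
systemctl daemon-reload
```

A drop-in has the side benefit that a later package update does not overwrite the change, unlike editing the unit file shipped under /usr/lib/systemd/system.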
I'm adding my experience since I think this is strictly related to this problem. While testing the resource HA behaviour in an RDO/Mitaka deployment (by stopping and then starting the master/slave and core resources galera, redis and rabbit) I hit the exact same problem.

These were the cluster's failed actions after the restart:

Failed Actions:
* openstack-nova-scheduler_start_0 on overcloud-controller-2 'not running' (7): call=824, status=complete, exitreason='none', last-rc-change='Wed Aug 24 01:21:43 2016', queued=0ms, exec=2213ms
* openstack-heat-engine_start_0 on overcloud-controller-2 'not running' (7): call=816, status=complete, exitreason='none', last-rc-change='Wed Aug 24 01:21:21 2016', queued=0ms, exec=2076ms
* openstack-nova-scheduler_start_0 on overcloud-controller-1 'not running' (7): call=835, status=complete, exitreason='none', last-rc-change='Wed Aug 24 01:21:43 2016', queued=0ms, exec=2213ms
* openstack-heat-engine_start_0 on overcloud-controller-1 'not running' (7): call=827, status=complete, exitreason='none', last-rc-change='Wed Aug 24 01:21:21 2016', queued=0ms, exec=2073ms
* openstack-nova-scheduler_start_0 on overcloud-controller-0 'not running' (7): call=840, status=complete, exitreason='none', last-rc-change='Wed Aug 24 01:21:43 2016', queued=0ms, exec=2216ms
* openstack-heat-engine_start_0 on overcloud-controller-0 'not running' (7): call=832, status=complete, exitreason='none', last-rc-change='Wed Aug 24 01:21:21 2016', queued=0ms, exec=2076ms

So not only openstack-heat-engine was impacted but also openstack-nova-scheduler, and note that this is systematic and reproducible every time. The fact is that after applying Marian's workaround (limiting the After list inside /usr/lib/systemd/system/openstack-heat-engine.service) the problem disappears, not only for openstack-heat-engine but also for openstack-nova-scheduler.

You can find here [1] all the sosreports from the overcloud and other logs taken while hitting this problem.

[1] http://file.rdu.redhat.com/~rscarazz/BZ1357229/
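For completeness, a hedged sketch of how the failed actions above can be inspected and cleared once the unit ordering is fixed. These are standard pcs commands, not steps quoted from this report, and the resource names are assumptions that may differ per deployment:

```shell
# Hypothetical verification/cleanup steps (not taken from this report).

# Inspect the cluster and the failed actions listed above.
pcs status

# Clear the recorded start failures so pacemaker retries the resources.
pcs resource cleanup openstack-heat-engine
pcs resource cleanup openstack-nova-scheduler

# Confirm the services are running again on all controllers.
pcs status resources | grep -E 'heat-engine|nova-scheduler'
```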