Bug 1253038 - After HA deployment on BM, heat-api service was down on 1 out of 3 controllers.
Status: NEW
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-keystonemiddleware
Sub Component: Director
Hardware: x86_64 Linux
Priority: high  Severity: urgent
Target Milestone: async
Target Release: 7.0 (Kilo)
Assigned To: John Dennis
QA Contact: Shai Revivo
Keywords: Triaged, ZStream
Depends On:
Blocks: 1452683
Reported: 2015-08-12 15:20 EDT by Omri Hochman
Modified: 2018-04-29 21:51 EDT
CC: 14 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1452683
Environment:
Last Closed:
Type: Bug
Regression: ---


External Trackers
Tracker ID Priority Status Summary Last Updated
Launchpad 1649735 None None None 2016-12-13 19:04 EST

Description Omri Hochman 2015-08-12 15:20:23 EDT
After HA deployment on BM, heat-api service was down on 1 out of 3 controllers.

Environment (on the instack):
------------
instack-undercloud-2.1.2-23.el7ost.noarch
instack-0.0.7-1.el7ost.noarch
openstack-puppet-modules-2015.1.8-8.el7ost.noarch
puppet-3.6.2-2.el7.noarch
openstack-tripleo-puppet-elements-0.0.1-4.el7ost.noarch
openstack-heat-api-2015.1.0-5.el7ost.noarch
openstack-heat-common-2015.1.0-5.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-46.el7ost.noarch

(on the controller:)
openstack-heat-common-2015.1.0-4.el7ost.noarch
openstack-heat-api-cfn-2015.1.0-4.el7ost.noarch
python-heatclient-0.6.0-1.el7ost.noarch
openstack-heat-api-2015.1.0-4.el7ost.noarch
openstack-heat-api-cloudwatch-2015.1.0-4.el7ost.noarch
openstack-heat-engine-2015.1.0-4.el7ost.noarch


Description:
-------------
After an HA deployment on a bare-metal environment, I SSHed to one of the controllers (1 out of 3) and noticed that 'pcs status' shows a problem starting the heat-api service.

The heat-api service probably failed to start during the deployment, but it happened on only 1 controller out of 3; on the other 2 controllers the heat-api service was up and running.

pcs status
-----------
Failed actions:
    openstack-heat-api-cloudwatch_monitor_60000 on overcloud-controller-0 'not running' (7): call=190, status=complete, exit-reason='none', last-rc-change='Mon Aug 10 13:34:21 2015', queued=0ms, exec=7ms

[root@overcloud-controller-1 ~]# service openstack-heat-api status
Redirecting to /bin/systemctl status  openstack-heat-api.service
openstack-heat-api.service - OpenStack Heat API Service
   Loaded: loaded (/usr/lib/systemd/system/openstack-heat-api.service; disabled)
   Active: failed (Result: exit-code) since Mon 2015-08-10 13:34:14 EDT; 2 days ago
 Main PID: 8160 (code=exited, status=1/FAILURE)
   CGroup: /system.slice/openstack-heat-api.service

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.



heat-api.log :
--------------
2015-08-10 13:34:14.833 8160 CRITICAL heat-api [-] OSError: [Errno 17] File exists: '/tmp/keystone-signing-heat'
2015-08-10 13:34:14.833 8160 TRACE heat-api Traceback (most recent call last):
2015-08-10 13:34:14.833 8160 TRACE heat-api   File "/usr/bin/heat-api", line 58, in <module>
2015-08-10 13:34:14.833 8160 TRACE heat-api     app = config.load_paste_app()
2015-08-10 13:34:14.833 8160 TRACE heat-api   File "/usr/lib/python2.7/site-packages/heat/common/config.py", line 397, in load_paste_app
2015-08-10 13:34:14.833 8160 TRACE heat-api     app = wsgi.paste_deploy_app(conf_file, app_name, cfg.CONF)
2015-08-10 13:34:14.833 8160 TRACE heat-api   File "/usr/lib/python2.7/site-packages/heat/common/wsgi.py", line 898, in paste_deploy_app
2015-08-10 13:34:14.833 8160 TRACE heat-api     return deploy.loadapp("config:%s" % paste_config_file, name=app_name)
2015-08-10 13:34:14.833 8160 TRACE heat-api   File "/usr/lib/python2.7/site-packages/paste/deploy/loadwsgi.py", line 247, in loadapp
2015-08-10 13:34:14.833 8160 TRACE heat-api     return loadobj(APP, uri, name=name, **kw)
2015-08-10 13:34:14.833 8160 TRACE heat-api   File "/usr/lib/python2.7/site-packages/paste/deploy/loadwsgi.py", line 272, in loadobj
2015-08-10 13:34:14.833 8160 TRACE heat-api     return context.create()
2015-08-10 13:34:14.833 8160 TRACE heat-api   File "/usr/lib/python2.7/site-packages/paste/deploy/loadwsgi.py", line 710, in create
2015-08-10 13:34:14.833 8160 TRACE heat-api     return self.object_type.invoke(self)
2015-08-10 13:34:14.833 8160 TRACE heat-api   File "/usr/lib/python2.7/site-packages/paste/deploy/loadwsgi.py", line 207, in invoke
2015-08-10 13:34:14.833 8160 TRACE heat-api     app = filter(app)
2015-08-10 13:34:14.833 8160 TRACE heat-api   File "/usr/lib/python2.7/site-packages/keystonemiddleware/auth_token/__init__.py", line 1122, in auth_filter
2015-08-10 13:34:14.833 8160 TRACE heat-api     return AuthProtocol(app, conf)
2015-08-10 13:34:14.833 8160 TRACE heat-api   File "/usr/lib/python2.7/site-packages/keystonemiddleware/auth_token/__init__.py", line 521, in __init__
2015-08-10 13:34:14.833 8160 TRACE heat-api     directory_name=self._conf_get('signing_dir'), log=self._LOG)
2015-08-10 13:34:14.833 8160 TRACE heat-api   File "/usr/lib/python2.7/site-packages/keystonemiddleware/auth_token/_signing_dir.py", line 38, in __init__
2015-08-10 13:34:14.833 8160 TRACE heat-api     self._verify_signing_dir()
2015-08-10 13:34:14.833 8160 TRACE heat-api   File "/usr/lib/python2.7/site-packages/keystonemiddleware/auth_token/_signing_dir.py", line 83, in _verify_signing_dir
2015-08-10 13:34:14.833 8160 TRACE heat-api     os.makedirs(self._directory_name, stat.S_IRWXU)
2015-08-10 13:34:14.833 8160 TRACE heat-api   File "/usr/lib64/python2.7/os.py", line 157, in makedirs
2015-08-10 13:34:14.833 8160 TRACE heat-api     mkdir(name, mode)
2015-08-10 13:34:14.833 8160 TRACE heat-api OSError: [Errno 17] File exists: '/tmp/keystone-signing-heat'
2015-08-10 13:34:14.833 8160 TRACE heat-api 
/var/log/heat/heat-api.log (END)
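
For context, the failing path in the traceback is a classic check-then-create: keystonemiddleware tests whether the signing directory exists and then creates it. A minimal sketch of that logic (simplified, with approximate names; not the verbatim keystonemiddleware source) shows how two heat-api processes starting at once can collide:

    import os
    import stat

    def verify_signing_dir(directory_name):
        # Simplified sketch of what auth_token._signing_dir._verify_signing_dir()
        # does: check whether the directory exists, and create it if not.
        if os.path.exists(directory_name):
            # (ownership and permission checks happen here in the real code)
            return
        # RACE WINDOW: a second heat-api process can create the directory
        # between the exists() check above and this makedirs() call; makedirs()
        # then raises OSError with errno EEXIST -- "[Errno 17] File exists" --
        # and the service exits, matching the traceback above.
        os.makedirs(directory_name, stat.S_IRWXU)

This would also explain why a later manual start succeeds (see comment 3): by then the directory already exists, so the exists() check passes.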


/var/log/messages :
--------------------
Aug 10 13:31:13 localhost os-collect-config: to 'heat'
Notice: /File[/etc/heat/heat.conf]/seluser: seluser changed 'unconfined_u' to 'system_u'
Notice: /Stage[main]/Nova::Cert/Nova::Generic_service[cert]/Service[nova-cert]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/Heat::Api_cfn/Service[heat-api-cfn]: Triggered 'refresh' from 40 events
Notice: /Stage[main]/Heat::Engine/Service[heat-engine]: Triggered 'refresh' from 40 events
Notice: /Stage[main]/Heat::Api/Service[heat-api]: Triggered 'refresh' from 40 events
Notice: /Stage[main]/Heat::Api_cloudwatch/Service[heat-api-cloudwatch]: Triggered 'refresh' from 40 events
Notice: Finished catalog run in 78.49 seconds
deploy_stderr:
Warning: Scope(Class[Keystone]): Execution of db_sync does not depend on $enabled anymore. Please use sync_db instead.
Warning: Scope(Class[Glance::Registry]): Execution of db_sync does not depend on $manage_service or $enabled anymore. Please use sync_db instead.
Warning: Scope(Class[Nova::Vncproxy::Common]): Could not look up qualified variable '::nova::compute::vncproxy_host'; class ::nova::compute has not been evaluated
Warning: Scope(Class[Nova::Vncproxy::Common]): Could not look up qualified variable '::nova::compute::vncproxy_protocol'; class ::nova::compute has not been evaluated
Warning: Scope(Class[Nova::Vncproxy::Common]): Could not look up qualified variable '::nova::compute::vncproxy_port'; class ::nova::compute has not been evaluated
Warning: Scope(Class[Nova::Vncproxy::Common]): Could not look up qualified variable '::nova::compute::vncproxy_path'; class ::nova::compute has not been evaluated
Warning: Scope(Class[Concat::Setup]): concat::setup is deprecated as a public API of the concat module and should no longer be directly included in the manifest.
Warning
Aug 10 13:31:14 localhost os-collect-config: Notice: /Stage[main]/Heat::Api_cfn/Service[heat-api-cfn]: Triggered 'refresh' from 40 events
Aug 10 13:31:14 localhost os-collect-config: Notice: /Stage[main]/Heat::Api/Service[heat-api]: Triggered 'refresh' from 40 events
Aug 10 13:31:14 localhost os-collect-config: Notice: /Stage[main]/Heat::Api_cloudwatch/Service[heat-api-cloudwatch]: Triggered 'refresh' from 40 events
Aug 10 13:34:13 localhost crmd[22153]: notice: process_lrm_event: Operation openstack-heat-api_monitor_0: not running (node=overcloud-controller-1, call=142, rc=7, cib-update=69, confirmed=true)
Aug 10 13:34:13 localhost crmd[22153]: notice: process_lrm_event: Operation openstack-heat-api-cloudwatch_monitor_0: not running (node=overcloud-controller-1, call=157, rc=7, cib-update=74, confirmed=true)
Aug 10 13:34:13 localhost systemd: Starting Cluster Controlled openstack-heat-api...
Aug 10 13:34:13 localhost systemd: Started Cluster Controlled openstack-heat-api.
Aug 10 13:34:13 localhost systemd: Starting Cluster Controlled openstack-heat-api-cloudwatch...
Aug 10 13:34:13 localhost systemd: Started Cluster Controlled openstack-heat-api-cloudwatch.
Aug 10 13:34:14 localhost systemd: openstack-heat-api.service: main process exited, code=exited, status=1/FAILURE
Aug 10 13:34:14 localhost systemd: Unit openstack-heat-api.service entered failed state.
Aug 10 13:34:15 localhost crmd[22153]: notice: process_lrm_event: Operation openstack-heat-api_start_0: not running (node=overcloud-controller-1, call=178, rc=7, cib-update=79, confirmed=true)
Aug 10 13:34:15 localhost crmd[22153]: notice: process_lrm_event: Operation openstack-heat-api-cloudwatch_start_0: ok (node=overcloud-controller-1, call=181, rc=0, cib-update=81, confirmed=true)
Aug 10 13:34:21 localhost crmd[22153]: notice: process_lrm_event: Operation openstack-heat-api-cloudwatch_monitor_60000: ok (node=overcloud-controller-1, call=188, rc=0, cib-update=88, confirmed=false)
Aug 10 13:34:31 localhost systemd: Stopped Cluster Controlled openstack-heat-api.
Aug 10 13:34:33 localhost crmd[22153]: notice: process_lrm_event: Operation openstack-heat-api_stop_0: ok (node=overcloud-controller-1, call=223, rc=0, cib-update=100, confirmed=true)
Aug 10 13:34:56 localhost crmd[22153]: notice: process_lrm_event: Operation openstack-heat-api-cfn_monitor_0: not running (node=overcloud-controller-1, call=272, rc=7, cib-update=124, confirmed=true)
Aug 10 13:34:57 localhost systemd: Starting Cluster Controlled openstack-heat-api-cfn...
Aug 10 13:34:57 localhost systemd: Started Cluster Controlled openstack-heat-api-cfn.
Aug 10 13:34:59 localhost crmd[22153]: notice: process_lrm_event: Operation openstack-heat-api-cfn_start_0: ok (node=overcloud-controller-1, call=278, rc=0, cib-update=127, confirmed=true)
Aug 10 13:35:07 localhost crmd[22153]: notice: process_lrm_event: Operation openstack-heat-api-cfn_monitor_60000: ok (node=overcloud-controller-1, call=281, rc=0, cib-update=131, confirmed=false)
Aug 10 13:35:21 localhost systemd: Stopping Cluster Controlled openstack-heat-api-cfn...
Aug 10 13:35:21 localhost systemd: Stopped Cluster Controlled openstack-heat-api-cfn.
Aug 10 13:35:23 localhost crmd[22153]: notice: process_lrm_event: Operation openstack-heat-api-cfn_stop_0: ok (node=overcloud-controller-1, call=291, rc=0, cib-update=136, confirmed=true)
Aug 10 13:35:44 localhost systemd: Stopping Cluster Controlled openstack-heat-api-cloudwatch...
Aug 10 13:35:44 localhost systemd: Stopped Cluster Controlled openstack-heat-api-cloudwatch.
Aug 10 13:35:46 localhost crmd[22153]: notice: process_lrm_event: Operation openstack-heat-api-cloudwatch_stop_0: ok (node=overcloud-controller-1, call=299, rc=0, cib-update=142, confirmed=true)
Aug 10 13:40:20 localhost os-collect-config:
Notice: /Stage[main]/Neutron::Agents::Ml2::Ovs/Service[neutron-ovs-agent-service]/ensure: ensure changed 'running' to 'stopped'
Notice: /Stage[main]/Heat::Keystone::Domain/Exec[heat_domain_create]/returns: executed successfully
Notice: /Stage[main]/Heat::Keystone::Domain/Heat_domain_id_setter[heat_domain_id]/ensure: created
Notice: /Stage[main]/Pacemaker::Corosync/Exec[enable-not-start-tripleo_cluster]/returns: executed successfully
Notice: /Stage[main]/Pacemaker::Corosync/Exec[Set password for hacluster user on tripleo_cluster]/returns: executed successfully
Notice: /Stage[main]/Pacemaker::Corosync/Exec[auth-successful-across-all-nodes]/returns: executed successfully
Notice: Pacemaker has reported quorum achieved
Notice: /Stage[main]/Pacemaker::Corosync/Notify[pacemaker settled]/message: defined 'message' as 'Pacemaker has reported quorum achieved'
Notice: /Stage[main]/Heat::Api_cfn/Service[heat-api-cfn]: Triggered 'refresh' from 2 events
Notice: /Stage[main]/Heat::Engine/Service[heat-engine]: Triggered 'refresh' from 2 events
Notice: /Stage[main]/Heat::Api/Service[heat-api]: Triggered 'refresh' from 2 events
Notice: /Stage[main]/Heat::Api_cloudwatch/Service[heat-api-cloudwatch]: Triggered 'refresh' from 2 events
Notice: Finished catalog run in 37.57 seconds
deploy_stderr:
Warning: Scope(Class[Keystone]): Execution of db_sync does not depend on $enabled anymore. Please use sync_db instead.
Warning: Scope(Class[Glance::Registry]): Execution of db_sync does not depend on $manage_service or $enabled anymore. Please use sync_db instead.
Warning: Scope(Class[Nova::Vncproxy::Common]): Could not look up qualified variable '::nova::compute::vncproxy_host'; class ::nova::compute has not been evaluated
Warning
Aug 10 13:40:20 localhost os-collect-config: Notice: /Stage[main]/Heat::Api_cfn/Service[heat-api-cfn]: Triggered 'refresh' from 2 events
Aug 10 13:40:20 localhost os-collect-config: Notice: /Stage[main]/Heat::Api/Service[heat-api]: Triggered 'refresh' from 2 events
Aug 10 13:40:20 localhost os-collect-config: Notice: /Stage[main]/Heat::Api_cloudwatch/Service[heat-api-cloudwatch]: Triggered 'refresh' from 2 events
Comment 3 Omri Hochman 2015-08-12 15:22:18 EDT
There was no problem starting the heat-api service manually on the controller:


[root@overcloud-controller-1 ~]# service openstack-heat-api status
Redirecting to /bin/systemctl status  openstack-heat-api.service
openstack-heat-api.service - OpenStack Heat API Service
   Loaded: loaded (/usr/lib/systemd/system/openstack-heat-api.service; disabled)
   Active: failed (Result: exit-code) since Mon 2015-08-10 13:34:14 EDT; 2 days ago
 Main PID: 8160 (code=exited, status=1/FAILURE)
   CGroup: /system.slice/openstack-heat-api.service

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
[root@overcloud-controller-1 ~]#
[root@overcloud-controller-1 ~]# service openstack-heat-api start
Redirecting to /bin/systemctl start  openstack-heat-api.service
[root@overcloud-controller-1 ~]# service openstack-heat-api status
Redirecting to /bin/systemctl status  openstack-heat-api.service
openstack-heat-api.service - OpenStack Heat API Service
   Loaded: loaded (/usr/lib/systemd/system/openstack-heat-api.service; disabled)
   Active: active (running) since Wed 2015-08-12 15:15:45 EDT; 2s ago
 Main PID: 15703 (heat-api)
   CGroup: /system.slice/openstack-heat-api.service
           └─15703 /usr/bin/python /usr/bin/heat-api
Comment 4 Omri Hochman 2015-08-14 16:27:14 EDT
Setting higher severity:

The issue reproduced; it seems to be a pretty consistent problem. When deploying HA (3 controllers), on one of the controllers the heat-api service will be down after the deployment.
Comment 6 jliberma@redhat.com 2015-08-30 01:18:54 EDT
Starting pacemaker-managed services outside of pacemaker is not a recommended workaround.

Does "pcs status" report openstack-heat-api as started on all three controllers?

On my HA bare-metal deployment, openstack-heat-api is started and running on all 3 controllers.

However, I am having another serious problem with heat where neutron-server stops responding when I deploy any stack that involves Neutron. The service does not fail but the heat API is non-responsive until I restart neutron-server-clone on the controllers.
Comment 8 Evgeny 2015-11-24 11:44:32 EST
Just a couple of cents from my side.
I see exactly the same behavior, but running just a single controller.
Everything is fine after I start the services again.
NOTE: heat-api fails right after the deploy.

command for deployment: overcloud deploy --templates --control-scale 1 --compute-scale 3 --ceph-storage-scale 0 --block-storage-scale 0 --swift-storage-scale 0 --neutron-network-type vlan --neutron-disable-tunneling --neutron-public-interface eth0 --hypervisor-neutron-public-interface eth0 -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /home/stack/templates/networking-cisco-environment.yaml --neutron-tunnel-type vlan --neutron-network-vlan-ranges datacentre:2000:2500

In addition, I have the Cisco UCSM and N9K plugins enabled.
Comment 10 Mike Burns 2016-04-07 16:47:27 EDT
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.
Comment 15 Steve Baker 2016-12-13 18:57:58 EST
Doing a Python import of keystonemiddleware.auth_token has the side effect of creating a directory for token signing verification.

The problem is that keystonemiddleware.auth_token._signing_dir._verify_signing_dir has a race between checking for and creating the directory. If multiple heat-api processes are spawned at the same time, an OSError is raised for creating a directory that already exists.

This method should tolerate OSError being raised by os.makedirs, then verify the directory regardless.
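
A minimal sketch of that kind of fix (illustrative only, not the exact upstream patch): treat EEXIST from os.makedirs as benign, then verify the directory either way:

    import errno
    import os
    import stat

    def verify_signing_dir(directory_name):
        try:
            os.makedirs(directory_name, stat.S_IRWXU)
        except OSError as e:
            # Another process won the race and created the directory first;
            # any errno other than EEXIST is still a genuine failure.
            if e.errno != errno.EEXIST:
                raise
        # Verify the directory regardless of which process created it.
        if os.stat(directory_name).st_uid != os.getuid():
            raise RuntimeError('signing directory %s is not owned by the '
                               'current user' % directory_name)

This removes the window between the existence check and the create, so concurrent heat-api workers can no longer kill each other on startup.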
Comment 16 Steve Baker 2016-12-13 19:04:59 EST
Raised an upstream bug and reassigned to python-keystonemiddleware. The fix will need to be backported to keystonemiddleware 1.5.x to reach OSP 7.
