Bug 1351784

Summary: osp-director-9: Upgrade from OSP8 to OSP9 causes services on the controllers to fail, which requires a manual resource cleanup in pacemaker and cluster restart.
Product: Red Hat OpenStack
Reporter: Omri Hochman <ohochman>
Component: rhosp-director
Assignee: Marios Andreou <mandreou>
Status: CLOSED DUPLICATE
QA Contact: Omri Hochman <ohochman>
Severity: high
Priority: medium
Version: 9.0 (Mitaka)
CC: dbecker, jason.dobies, jcoufal, jeckersb, jjoyce, jstransk, mandreou, mburns, mcornea, mkrcmari, morazi, ohochman, rhel-osp-director-maint, sasha, sclewis, tvignaud
Target Milestone: ga
Keywords: Triaged
Target Release: 9.0 (Mitaka)
Hardware: x86_64
OS: Linux
Doc Type: Known Issue
Last Closed: 2016-07-29 15:32:52 UTC
Type: Bug
Bug Depends On: 1343905, 1353031, 1359760

Description Omri Hochman 2016-06-30 20:27:35 UTC
osp-director-9: Upgrade from OSP8 to OSP9 causes services on the controllers to fail, which requires a manual resource cleanup in pacemaker and cluster restart. 

Environment:
-------------
python-heatclient-1.2.0-1.el7ost.noarch
openstack-heat-api-cloudwatch-6.0.0-6.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-12.el7ost.noarch
openstack-heat-engine-6.0.0-6.el7ost.noarch
openstack-tripleo-heat-templates-liberty-2.0.0-12.el7ost.noarch
openstack-tripleo-heat-templates-kilo-2.0.0-12.el7ost.noarch
openstack-heat-api-6.0.0-6.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch
openstack-heat-common-6.0.0-6.el7ost.noarch
openstack-heat-api-cfn-6.0.0-6.el7ost.noarch
openstack-heat-templates-0-0.8.20150605git.el7ost.noarch
instack-undercloud-4.0.0-5.el7ost.noarch
instack-0.0.8-3.el7ost.noarch
openstack-puppet-modules-8.1.2-1.el7ost.noarch
puppet-3.6.2-2.el7.noarch
openstack-tripleo-puppet-elements-2.0.0-2.el7ost.noarch
pcs-0.9.143-15.el7.x86_64

Description: 
------------
After one of the upgrade steps, PCS resources failed on the controllers, specifically heat-engine and rabbitmq. These services were down. In order to proceed with the upgrade, it was necessary to manually start the services and clean up the failed actions.


Workaround: 
-----------
(1) SSH to the controller and start the failed services. In this case:
    (a) sudo systemctl start rabbitmq-server.service
    (b) sudo systemctl start openstack-heat-engine
(2) sudo pcs resource cleanup


Continue with the next step of the upgrade process.
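
For convenience, the same workaround can be scripted from the undercloud (a minimal sketch based on the steps above; the heat-admin user and the controller hostnames are assumptions taken from this environment, and starting an already-running service is a no-op):

# Hypothetical helper: start the failed services on each controller,
# then clean up the failed actions once (cleanup acts cluster-wide).
for node in overcloud-controller-0 overcloud-controller-1 overcloud-controller-2; do
    ssh heat-admin@$node "sudo systemctl start rabbitmq-server.service"
    ssh heat-admin@$node "sudo systemctl start openstack-heat-engine"
done
ssh heat-admin@overcloud-controller-0 sudo pcs resource cleanup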


The upgrade command:
--------------------
openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates \
  -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
  -e /home/stack/network-environment.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml
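
After this step, the stack status can be checked from the undercloud before continuing (a minimal sketch; sourcing ~/stackrc for undercloud credentials is an assumption, not something shown in this report):

source ~/stackrc
# Expect the overcloud stack to report UPDATE_COMPLETE before the next step
heat stack-list | grep overcloud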



[root@overcloud-controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Last updated: Thu Jun 30 19:58:04 2016		Last change: Thu Jun 30 19:39:33 2016 by root via crm_resource on overcloud-controller-0
Stack: corosync
Current DC: overcloud-controller-1 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
3 nodes and 115 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Full list of resources:

 ip-10.19.184.210	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-0
 ip-192.168.200.10	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-1
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-192.168.0.6	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-2
 ip-10.19.104.11	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-0
 ip-10.19.105.10	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-1
 ip-10.19.104.10	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-2
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-1 ]
     Slaves: [ overcloud-controller-0 overcloud-controller-2 ]
 Master/Slave Set: galera-master [galera]
     Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: mongod-clone [mongod]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-api-clone [openstack-ceilometer-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-api-clone [openstack-nova-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-collector-clone [openstack-ceilometer-collector]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-glance-registry-clone [openstack-glance-registry]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-notification-clone [openstack-ceilometer-notification]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-cinder-api-clone [openstack-cinder-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-glance-api-clone [openstack-glance-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: delay-clone [delay]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-server-clone [neutron-server]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: httpd-clone [httpd]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 openstack-cinder-volume	(systemd:openstack-cinder-volume):	Started overcloud-controller-1
 Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-aodh-listener-clone [openstack-aodh-listener]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-aodh-notifier-clone [openstack-aodh-notifier]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-aodh-evaluator-clone [openstack-aodh-evaluator]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-core-clone [openstack-core]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Failed Actions:
* openstack-heat-engine_start_0 on overcloud-controller-0 'not running' (7): call=279, status=complete, exitreason='none',
    last-rc-change='Thu Jun 30 19:39:10 2016', queued=0ms, exec=2082ms
* openstack-heat-engine_start_0 on overcloud-controller-1 'not running' (7): call=281, status=complete, exitreason='none',
    last-rc-change='Thu Jun 30 19:39:10 2016', queued=0ms, exec=2189ms
* rabbitmq_start_0 on overcloud-controller-1 'unknown error' (1): call=231, status=complete, exitreason='none',
    last-rc-change='Thu Jun 30 19:37:53 2016', queued=0ms, exec=25056ms
* openstack-heat-engine_start_0 on overcloud-controller-2 'not running' (7): call=274, status=complete, exitreason='none',
    last-rc-change='Thu Jun 30 19:39:10 2016', queued=0ms, exec=2137ms
* rabbitmq_start_0 on overcloud-controller-2 'unknown error' (1): call=227, status=complete, exitreason='none',
    last-rc-change='Thu Jun 30 19:38:21 2016', queued=1ms, exec=18502ms


PCSD Status:
  overcloud-controller-0: Online
  overcloud-controller-1: Online
  overcloud-controller-2: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Comment 2 Omri Hochman 2016-06-30 20:29:23 UTC
Although there's a workaround, I'm setting the blocker flag for PM
due to the poor user experience in this case.

Comment 3 Omri Hochman 2016-06-30 20:43:54 UTC
To get the PCS status from the undercloud machine, run:
# make sure there are no stopped or unmanaged services
ssh heat-admin@overcloud-controller-0 sudo pcs status | grep -i stopped -B2
ssh heat-admin@overcloud-controller-0 sudo pcs status | grep -i unmanaged -B2
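
The two checks can also be combined into a single command (a minimal sketch, equivalent to the greps above):

# flag any stopped or unmanaged resources in one pass
ssh heat-admin@overcloud-controller-0 sudo pcs status | grep -B2 -iE 'stopped|unmanaged'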


Steps were according to: http://etherpad.corp.redhat.com/ospd9-upgrade

Comment 4 Alexander Chuzhoy 2016-06-30 21:14:36 UTC
Reproduced:
After the step with "-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml", I checked the pcs resources on a controller.
The following services were down:
rabbitmq-server.service
openstack-heat-engine

To fix:
'pcs resource cleanup'

Comment 5 Omri Hochman 2016-07-01 13:48:59 UTC
After a successful upgrade from OSP8 to OSP9, the heat-engine service was down on the controllers.


From heat-engine.log:
-----------------------
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service     dbapi_connection.rollback()
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 724, in rollback
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service     self._read_ok_packet()
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 698, in _read_ok_packet
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service     pkt = self._read_packet()
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 895, in _read_packet
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service     packet_header = self._read_bytes(4)
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 912, in _read_bytes
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service     data = self._rfile.read(num_bytes)
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/pymysql/_socketio.py", line 59, in readinto
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service     return self._sock.recv_into(b)
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 346, in recv_into
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service     timeout_exc=socket.timeout("timed out"))
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 201, in _trampoline
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service     mark_as_closed=self._mark_as_closed)
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/eventlet/hubs/__init__.py", line 144, in trampoline
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service     assert hub.greenlet is not current, 'do not call blocking functions from the mainloop'
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service DBAPIError: (exceptions.AssertionError) do not call blocking functions from the mainloop
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service 
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters [req-33c633c4-d079-41a6-bf84-af8a23fcfbac - -] DB exception wrapped.
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call last):
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters     context)
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters     result.read()
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1138, in read
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters     first_packet = self.connection._read_packet()
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 895, in _read_packet
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters     packet_header = self._read_bytes(4)
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 912, in _read_bytes
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters     data = self._rfile.read(num_bytes)
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/_socketio.py", line 59, in readinto
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters     return self._sock.recv_into(b)
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 346, in recv_into
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters     timeout_exc=socket.timeout("timed out"))
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 201, in _trampoline
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters     mark_as_closed=self._mark_as_closed)
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/eventlet/hubs/__init__.py", line 144, in trampoline
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters     assert hub.greenlet is not current, 'do not call blocking functions from the mainloop'
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters AssertionError: do not call blocking functions from the mainloop
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters 



PCS Status ( Post-Upgrade ):
----------------------------
[root@overcloud-controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Last updated: Fri Jul  1 13:30:17 2016		Last change: Thu Jun 30 21:37:13 2016 by root via crm_resource on overcloud-controller-0
Stack: corosync
Current DC: overcloud-controller-1 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
3 nodes and 127 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Full list of resources:

 ip-10.19.184.210	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-0
 ip-192.168.200.10	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-1
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-192.168.0.6	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-2
 ip-10.19.104.11	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-0
 ip-10.19.105.10	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-1
 ip-10.19.104.10	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-2
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-1 ]
     Slaves: [ overcloud-controller-0 overcloud-controller-2 ]
 Master/Slave Set: galera-master [galera]
     Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: mongod-clone [mongod]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-api-clone [openstack-ceilometer-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-api-clone [openstack-nova-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-collector-clone [openstack-ceilometer-collector]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-glance-registry-clone [openstack-glance-registry]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-notification-clone [openstack-ceilometer-notification]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-cinder-api-clone [openstack-cinder-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-glance-api-clone [openstack-glance-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: delay-clone [delay]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-server-clone [neutron-server]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: httpd-clone [httpd]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 openstack-cinder-volume	(systemd:openstack-cinder-volume):	Started overcloud-controller-0
 Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-aodh-listener-clone [openstack-aodh-listener]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-aodh-notifier-clone [openstack-aodh-notifier]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-aodh-evaluator-clone [openstack-aodh-evaluator]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-core-clone [openstack-core]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-gnocchi-metricd-clone [openstack-gnocchi-metricd]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-sahara-api-clone [openstack-sahara-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-sahara-engine-clone [openstack-sahara-engine]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-gnocchi-statsd-clone [openstack-gnocchi-statsd]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Failed Actions:
* openstack-heat-engine_start_0 on overcloud-controller-0 'not running' (7): call=1260, status=complete, exitreason='none',
    last-rc-change='Thu Jun 30 21:36:50 2016', queued=0ms, exec=2152ms
* openstack-gnocchi-metricd_start_0 on overcloud-controller-0 'not running' (7): call=1111, status=complete, exitreason='none',
    last-rc-change='Thu Jun 30 21:31:29 2016', queued=0ms, exec=2209ms
* openstack-gnocchi-statsd_start_0 on overcloud-controller-0 'not running' (7): call=1106, status=complete, exitreason='none',
    last-rc-change='Thu Jun 30 21:31:22 2016', queued=0ms, exec=2455ms
* openstack-heat-engine_start_0 on overcloud-controller-1 'not running' (7): call=1254, status=complete, exitreason='none',
    last-rc-change='Thu Jun 30 21:36:50 2016', queued=0ms, exec=2078ms
* openstack-gnocchi-metricd_start_0 on overcloud-controller-1 'not running' (7): call=1102, status=complete, exitreason='none',
    last-rc-change='Thu Jun 30 21:31:11 2016', queued=0ms, exec=2154ms
* openstack-heat-engine_start_0 on overcloud-controller-2 'not running' (7): call=1240, status=complete, exitreason='none',
    last-rc-change='Thu Jun 30 21:36:50 2016', queued=0ms, exec=2201ms
* openstack-gnocchi-metricd_start_0 on overcloud-controller-2 'not running' (7): call=1095, status=complete, exitreason='none',
    last-rc-change='Thu Jun 30 21:31:11 2016', queued=0ms, exec=2294ms


PCSD Status:
  overcloud-controller-0: Online
  overcloud-controller-1: Online
  overcloud-controller-2: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Comment 6 Jay Dobies 2016-07-05 13:12:32 UTC
Marios - Can you double check that this is covered in Jirka's manual workaround write up?

Comment 7 Marios Andreou 2016-07-05 15:46:07 UTC
o/ jdob... so no, it wasn't, and I've added a note at the controller upgrade step.

I initially thought this was the same as bug 1351204, but that is for the aodh migration (there is already a note on the etherpad regarding cluster cleanup for that, though).

WRT the deployment, @omri or @sasha, can you verify the heat stack went to UPDATE_COMPLETE? (i.e. the error here is confined to the stopped services, and the heat stack is OK, so the step really did execute completely; that is what I'm after)

Comment 8 Omri Hochman 2016-07-05 19:53:05 UTC
(In reply to marios from comment #7)

> WRT the deployment, @omri or @sasha, can you verify the heat stack went to
> UPDATE_COMPLETE? (i.e. the error here is confined to the stopped services,
> and the heat stack is OK, so the step really did execute completely; that
> is what I'm after)


Marios, that's right: we managed to get the stack to UPDATE_COMPLETE.
As you mentioned, we do have notes for it in the etherpad.
This bug is exactly to track this issue: the requirement to start services and clean up failed resources after each of the upgrade steps causes a poor user experience, which we would like to avoid if possible. If the resources are not cleaned up after each step, the upgrade process will end with UPDATE_FAILED.

Comment 10 Alexander Chuzhoy 2016-07-19 19:17:27 UTC
heat stack-list reports:
| 87838b75-5687-4dce-9a5a-d4750e9e7b48 | overcloud  | UPDATE_COMPLETE | 2016-06-28T18:27:58 | 2016-06-29T16:00:39 |

On the console I see:
2016-06-29 16:28:59 [ControllerDeployment]: SIGNAL_COMPLETE Unknown
2016-06-29 16:29:00 [0]: SIGNAL_COMPLETE Unknown
2016-06-29 16:29:01 [NetworkDeployment]: SIGNAL_COMPLETE Unknown
2016-06-29 16:29:01 [0]: SIGNAL_COMPLETE Unknown
Stack overcloud UPDATE_COMPLETE
Overcloud Endpoint: http://10.19.184.180:5000/v2.0
Overcloud Deployed



Yet, when I check the pcs resources:
[root@overcloud-controller-0 ~]# pcs status|grep -B2 -i stop
 Clone Set: rabbitmq-clone [rabbitmq]                       
     Started: [ overcloud-controller-0 ]                    
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
--
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]                                 
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]                       
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]                     
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]                             
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]               
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-server-clone [neutron-server]                                     
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]



Running 'pcs resource cleanup' doesn't help; rabbitmq still cannot be started on all controllers.

Comment 11 Marian Krcmarik 2016-07-21 18:50:06 UTC
I believe the heat-engine failure could be related to this bug (it's worth checking): https://bugzilla.redhat.com/show_bug.cgi?id=1357229

WRT rabbitmq, pacemaker does not use the systemd unit file to manage the rabbitmq service; it uses a pacemaker resource agent. The better way to recover a failed rabbitmq resource is therefore "pcs resource cleanup rabbitmq"; the standard systemd unit file does not support HA mode or recreation of the rabbitmq cluster.
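
A minimal sketch of that recovery path, run on any controller (the verification step is illustrative, not taken from this report):

# Recover the rabbitmq resource via its pacemaker resource agent
# instead of systemd, then confirm the clone set is back on all nodes.
sudo pcs resource cleanup rabbitmq
sudo pcs status | grep -A2 'rabbitmq-clone'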

Comment 12 Marios Andreou 2016-07-22 11:11:52 UTC
Hi Omri, Sasha, 

From the original description above and the comments here, the common theme is: for the controllers upgrade step (... -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml), heat says UPDATE_COMPLETE but you see services down. Of particular interest to me is:

 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]

which is rabbit not coming back on nodes 1/2 but fine on 0. That sounds to me like it could be fixed by https://bugzilla.redhat.com/show_bug.cgi?id=1343905 (whatever the fix ultimately is there). I think other services failing is likely a consequence of the pcs constraints (since rabbit is down...)

What do you gents think? Duplicate of 1343905 and track there? We can reopen if necessary for any specific service that isn't starting (if any) once rabbit is working.

thanks.

Comment 15 Jiri Stransky 2016-07-28 09:22:46 UTC
Should we mark this one as a duplicate or make it TestOnly?

The fixes I had to pull in to get a green upgrade run are already associated
with more specific bugzillas: RabbitMQ rejoin [1], python-cradox install
[2], and gnocchi pacemaker constraint [3].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1343905
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1359760
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1353031

Comment 16 Mike Orazi 2016-07-29 15:32:52 UTC
I'm going to close it as a dupe for now. If there is new information that suggests this is a distinct issue, feel free to provide additional reproducer information and reopen.

*** This bug has been marked as a duplicate of bug 1359760 ***