Bug 1353031

Summary: osp-director-9: After successful upgrade from OSP8 to OSP9 there are failed resources on the controllers for heat-engine/gnocchi/ceilometer.
Product: Red Hat OpenStack Reporter: Omri Hochman <ohochman>
Component: openstack-tripleo-heat-templates Assignee: Marios Andreou <mandreou>
Status: CLOSED ERRATA QA Contact: Omri Hochman <ohochman>
Severity: high Docs Contact:
Priority: high    
Version: 8.0 (Liberty) CC: dbecker, dyasny, fdinitto, jason.dobies, jcoufal, jjoyce, jstransk, mburns, mcornea, michele, mlammon, morazi, ohochman, pkilambi, rhallise, rhel-osp-director-maint, rscarazz, sasha, sclewis, tvignaud, yprokule
Target Milestone: ga Keywords: Triaged
Target Release: 9.0 (Mitaka)   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-2.0.0-23.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-08-11 11:35:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1351784    
Attachments:
Description                         Flags
messages from controller0          none
heat-engine.log from undercloud    none

Description Omri Hochman 2016-07-05 20:26:18 UTC
osp-director-9: After successful upgrade from OSP8 to OSP9 there are failed services on the controllers for heat-engine/gnocchi/ceilometer.  

Environment:
-------------
python-heatclient-1.2.0-1.el7ost.noarch
openstack-heat-api-cloudwatch-6.0.0-6.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-12.el7ost.noarch
openstack-heat-engine-6.0.0-6.el7ost.noarch
openstack-tripleo-heat-templates-liberty-2.0.0-12.el7ost.noarch
openstack-tripleo-heat-templates-kilo-2.0.0-12.el7ost.noarch
openstack-heat-api-6.0.0-6.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch
openstack-heat-common-6.0.0-6.el7ost.noarch
openstack-heat-api-cfn-6.0.0-6.el7ost.noarch
openstack-heat-templates-0-0.8.20150605git.el7ost.noarch
instack-undercloud-4.0.0-5.el7ost.noarch
instack-0.0.8-3.el7ost.noarch
openstack-puppet-modules-8.1.2-1.el7ost.noarch
puppet-3.6.2-2.el7.noarch
openstack-tripleo-puppet-elements-2.0.0-2.el7ost.noarch
pcs-0.9.143-15.el7.x86_64

Steps:
--------
(1) Attempt to upgrade OSP8 to OSP9 using the director.
(2) After UPDATE_COMPLETE, ssh to each controller and run (as root):
 pcs status
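
A quick way to check all three controllers in one pass - a minimal sketch, assuming the controllers are reachable from the undercloud as heat-admin@overcloud-controller-{0,1,2}:

  for node in overcloud-controller-0 overcloud-controller-1 overcloud-controller-2; do
    echo "=== $node ==="
    # pcs needs root; Stopped clone sets and the Failed Actions section are what to look for
    ssh heat-admin@$node "sudo pcs status | grep -E -A2 'Stopped|Failed Actions'"
  done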

Results:
---------
[root@overcloud-controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Last updated: Tue Jul  5 19:24:37 2016		Last change: Sat Jul  2 08:12:19 2016 by root via crm_resource on overcloud-controller-0
Stack: corosync
Current DC: overcloud-controller-0 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
3 nodes and 127 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Full list of resources:

 ip-10.19.184.210	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-0
 ip-192.168.200.10	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-1
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-192.168.0.6	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-2
 ip-10.19.105.10	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-0
 ip-10.19.104.11	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-1
 ip-10.19.104.10	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-2
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-1 ]
     Slaves: [ overcloud-controller-0 overcloud-controller-2 ]
 Master/Slave Set: galera-master [galera]
     Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: mongod-clone [mongod]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-api-clone [openstack-ceilometer-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-api-clone [openstack-nova-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-collector-clone [openstack-ceilometer-collector]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-glance-registry-clone [openstack-glance-registry]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-notification-clone [openstack-ceilometer-notification]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-cinder-api-clone [openstack-cinder-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-glance-api-clone [openstack-glance-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: delay-clone [delay]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-server-clone [neutron-server]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: httpd-clone [httpd]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 openstack-cinder-volume	(systemd:openstack-cinder-volume):	Started overcloud-controller-0
 Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-aodh-listener-clone [openstack-aodh-listener]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-aodh-notifier-clone [openstack-aodh-notifier]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-aodh-evaluator-clone [openstack-aodh-evaluator]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-core-clone [openstack-core]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-gnocchi-metricd-clone [openstack-gnocchi-metricd]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-sahara-api-clone [openstack-sahara-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-sahara-engine-clone [openstack-sahara-engine]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-gnocchi-statsd-clone [openstack-gnocchi-statsd]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Failed Actions:
* openstack-heat-engine_start_0 on overcloud-controller-0 'not running' (7): call=1161, status=complete, exitreason='none',
    last-rc-change='Sat Jul  2 08:11:55 2016', queued=0ms, exec=2105ms
* openstack-ceilometer-collector_monitor_0 on overcloud-controller-0 'OCF_PENDING' (196): call=751, status=complete, exitreason='none',
    last-rc-change='Sat Jul  2 07:16:30 2016', queued=0ms, exec=34ms
* openstack-gnocchi-metricd_start_0 on overcloud-controller-0 'not running' (7): call=1004, status=complete, exitreason='none',
    last-rc-change='Sat Jul  2 08:06:08 2016', queued=0ms, exec=2195ms
* openstack-heat-engine_start_0 on overcloud-controller-2 'not running' (7): call=1145, status=complete, exitreason='none',
    last-rc-change='Sat Jul  2 08:11:55 2016', queued=0ms, exec=2199ms
* openstack-ceilometer-collector_monitor_0 on overcloud-controller-2 'OCF_PENDING' (196): call=742, status=complete, exitreason='none',
    last-rc-change='Sat Jul  2 07:16:30 2016', queued=0ms, exec=29ms
* openstack-gnocchi-metricd_start_0 on overcloud-controller-2 'not running' (7): call=1001, status=complete, exitreason='none',
    last-rc-change='Sat Jul  2 08:06:25 2016', queued=0ms, exec=2103ms
* openstack-gnocchi-statsd_start_0 on overcloud-controller-2 'not running' (7): call=996, status=complete, exitreason='none',
    last-rc-change='Sat Jul  2 08:06:18 2016', queued=0ms, exec=2395ms
* openstack-heat-engine_start_0 on overcloud-controller-1 'not running' (7): call=1155, status=complete, exitreason='none',
    last-rc-change='Sat Jul  2 08:11:55 2016', queued=1ms, exec=2217ms
* openstack-ceilometer-collector_monitor_0 on overcloud-controller-1 'OCF_PENDING' (196): call=757, status=complete, exitreason='none',
    last-rc-change='Sat Jul  2 07:16:30 2016', queued=1ms, exec=40ms
* openstack-gnocchi-metricd_start_0 on overcloud-controller-1 'not running' (7): call=1003, status=complete, exitreason='none',
    last-rc-change='Sat Jul  2 08:06:08 2016', queued=1ms, exec=2257ms


(Log files are attached.)

Comment 1 Omri Hochman 2016-07-05 20:30:50 UTC
Created attachment 1176659 [details]
messages from controller0

attaching messages from controller0

Comment 2 Omri Hochman 2016-07-05 20:31:33 UTC
Created attachment 1176660 [details]
heat-engine.log from undercloud

heat-engine.log from undercloud

Comment 3 Jay Dobies 2016-07-07 13:05:44 UTC
Omri - Can you verify that this is repeatable and not related to a specific environment issue?

Comment 5 Omri Hochman 2016-07-12 01:18:12 UTC
(In reply to Jay Dobies from comment #3)
> Omri - Can you verify that this is repeatable and not related to a specific
> environment issue?

Yes, Jay - it reproduces on my setup with the latest poodle:
-----------------------------------------------------------------
instack-undercloud-4.0.0-6.el7ost.noarch
instack-0.0.8-3.el7ost.noarch
openstack-heat-api-cfn-6.0.0-7.el7ost.noarch
python-heat-tests-6.0.0-7.el7ost.noarch
openstack-tripleo-heat-templates-liberty-2.0.0-14.el7ost.noarch
openstack-heat-api-cloudwatch-6.0.0-7.el7ost.noarch
python-heatclient-1.2.0-1.el7ost.noarch
openstack-heat-api-6.0.0-7.el7ost.noarch
openstack-tripleo-heat-templates-kilo-2.0.0-14.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-14.el7ost.noarch
openstack-heat-engine-6.0.0-7.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch
openstack-heat-common-6.0.0-7.el7ost.noarch
openstack-heat-templates-0-0.8.20150605git.el7ost.noarch


The same services listed in the bug description are down post-upgrade.


Reply to comment #4:
(*) I did not use the patch:
https://review.openstack.org/#/c/334486/3/tripleoclient/v1/overcloud_deploy.py

Comment 6 Alexander Chuzhoy 2016-07-14 19:22:54 UTC
Running "pcs resource cleanup openstack-heat-engine" revives the openstack-heat-engine (checked on 2 setups with the issue reproduced).

Comment 7 Marios Andreou 2016-07-21 13:02:33 UTC
omri o/ (thanks jjoyce for the ping) we need to find out / be more specific about which step the services were stopped after, so that we can debug what went wrong; otherwise it isn't clear what to start debugging and where. As we discussed in yesterday's call, after each step we need to ensure that the cluster is fully running, even if that means, for now, a manual intervention. I'm not clear whether you applied that in this environment, which was set up a while ago afaics. If you *have*, then it implies the services are failing in exactly the last step, i.e. stopping/failing to restart during upgrade converge. But if you haven't been checking after each of the steps in this env, then the failed services could have happened during any of the earlier steps, including the migrations.

Speaking of which, I know (and you know) there are reports of services down after the keystone migration (https://bugzilla.redhat.com/show_bug.cgi?id=1348831, though that should be fixed now, at least in my local testing), as well as after the controller upgrade (rabbit issue at https://bugzilla.redhat.com/show_bug.cgi?id=1343905). Since you mentioned gnocchi, I also saw this fly by today - a gnocchi-related pcs constraints fixup at https://review.openstack.org/#/c/344823/9. My point in mentioning these is that, depending on the answer to 'which step did it fail on', it may have the same root cause as one of those other bugzillas.

Omri wdyt?

Comment 10 Raoul Scarazzini 2016-07-25 09:15:24 UTC
This issue may be related to bug https://bugzilla.redhat.com/show_bug.cgi?id=1348222, a missing redis dependency.

Comment 11 Marios Andreou 2016-07-26 16:00:27 UTC
While waiting for more info/testing from QE, especially after testing the rabbit fixup, explicitly noting that the 'gnocchi down' reported in the description above may be fixed by https://review.openstack.org/#/c/344823/ - adding it as a related review in the external tracker above.

Comment 13 Jiri Stransky 2016-07-27 10:22:10 UTC
I was able to finish the converge run with no failed/stopped services at all. Here are the extra things which aren't yet merged downstream but were present in my env:

* the patch we mention above https://review.openstack.org/#/c/344823/

* manual install of python-cradox to work around bug 1359760

* the original workaround for bug 1343905 (not the latest fix; I have yet to test with that one)
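
For reference, the manual python-cradox step mentioned above amounts to something like the following on each controller - a sketch only; whether the package is available in the enabled repos is an assumption, and bug 1359760 has the authoritative workaround:

  for node in overcloud-controller-0 overcloud-controller-1 overcloud-controller-2; do
    # python-cradox is the Ceph rados binding gnocchi can use (see bug 1359760)
    ssh heat-admin@$node "sudo yum install -y python-cradox"
  done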

Comment 17 mlammon 2016-08-03 16:13:48 UTC
I followed the latest upgrade guide and finished on 02 AUG 16
(http://etherpad.corp.redhat.com/ospd9-upgrade)

Initial deployment:
openstack overcloud deploy --templates --control-scale 3 --compute-scale 1   --neutron-network-type vxlan --neutron-tunnel-types vxlan  --ntp-server clock.redhat.com --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml --ceph-storage-scale 1

Based on my latest upgrade testing, as well as Yuri's results in comment 14
(https://bugzilla.redhat.com/show_bug.cgi?id=1353031#c14), I think we can safely mark this verified.

Comment 19 errata-xmlrpc 2016-08-11 11:35:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1599.html