Bug 1397918

Summary: Updating the overcloud results in stopped pcs resource.
Product: Red Hat OpenStack
Reporter: Gregory Charot <gcharot>
Component: documentation
Assignee: Dan Macpherson <dmacpher>
Status: CLOSED CURRENTRELEASE
QA Contact: RHOS Documentation Team <rhos-docs>
Severity: unspecified
Docs Contact:
Priority: low
Version: 9.0 (Mitaka)
CC: chjones, dbecker, dmacpher, fbaudin, fdinitto, gcharot, jcoufal, mandreou, mburns, michele, morazi, pkilambi, rhel-osp-director-maint, sasha, sathlang, srevivo
Target Milestone: ---
Keywords: Documentation
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-02-23 08:00:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
* output from Gregory via http://etherpad.corp.redhat.com/iUGe8rDAHj (flags: none)
* sosreport controller (flags: none)

Description Gregory Charot 2016-11-23 14:53:14 UTC
Description of problem:

After an update, a pacemaker resource fails and ends up in a stopped state.


Version-Release number of selected component (if applicable):
OSP 9.

How reproducible:

This does not happen every time, and the affected resource seems to change: I have hit it with openstack-gnocchi-statsd and with openstack-heat-engine (failed action openstack-heat-engine_start_0).

Steps to Reproduce:

1. Install basic overcloud
undercloud$ time openstack overcloud deploy \
    --templates /usr/share/openstack-tripleo-heat-templates/mitaka/ \
    --ntp-server 10.16.255.1 --control-scale 1 --compute-scale 2 \
    --neutron-tunnel-types vxlan --neutron-network-type vxlan \
    --control-flavor control --compute-flavor compute

2. Update overcloud 
openstack overcloud update stack overcloud -i \
--templates /usr/share/openstack-tripleo-heat-templates/mitaka/

3. SSH into the controller and run pcs status.

Actual results:

A resource is down:

* openstack-heat-engine_start_0 on overcloud-controller-0 'not running' (7): call=544, status=complete, exitreason='none',
    last-rc-change='Wed Nov 23 13:59:34 2016', queued=0ms, exec=2443ms

I also hit it with:
Failed Actions:
* openstack-gnocchi-statsd_start_0 on overcloud-controller-0 'not running' (7): call=246, status=complete, exitreason='none',
    last-rc-change='Tue Nov 22 23:54:00 2016', queued=0ms, exec=2101ms


Expected results:

No resource down.

Additional info:

* Haven't tried with HA controllers
* Using director 10 to deploy OSP 9
* This does not happen every time
* This affects the upgrade from 9 to 10, as the upgrade process will complain with:
 "deploy_stdout": "ERROR: upgrade cannot start with stopped resources on the cluster. Make sure that all the resources are up and running.\n",
* Detailed output for the gnocchi resource is available here: http://etherpad.corp.redhat.com/iUGe8rDAHj
* I have an internal RH system with the heat-engine resource down, available for inspection. Please contact me if needed.
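
The pre-upgrade condition quoted above (no stopped resources on the cluster) can be approximated before starting the upgrade. This is only a rough sketch, not the check the upgrade scripts actually run: the function name cluster_is_clean is hypothetical, and the two text patterns it looks for are assumptions about how pcs status reports problems.

```shell
#!/bin/sh
# Hypothetical helper (not part of pcs or director): reads `pcs status`
# text on stdin and returns non-zero if the cluster does not look clean.
cluster_is_clean() {
    # Assumptions: failed operations are listed under a "Failed Actions:"
    # heading, and stopped resources carry the word "Stopped" in the
    # resource listing.
    if grep -E -q 'Failed Actions:|Stopped'; then
        return 1
    fi
    return 0
}
```

On a controller this would be used as `pcs status | cluster_is_clean || echo "cluster not clean"`. A plain text match is deliberately crude and may report false positives.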

Comment 1 Gregory Charot 2016-11-23 15:04:36 UTC
Forgot to add that running pcs resource cleanup solves the problem.
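
As a sketch of that workaround, the failed entries in pcs status can be turned into per-resource cleanup commands. The helper names failed_resources and print_cleanup_commands are hypothetical, and the parsing assumes the "Failed Actions" line format shown in the description (resource name followed by _<operation>_<interval>). The commands are only printed, not executed.

```shell
#!/bin/sh
# Hypothetical helper: extract resource names from "Failed Actions" entries
# in `pcs status` text read on stdin.
failed_resources() {
    # Entries look like:
    #   * openstack-heat-engine_start_0 on overcloud-controller-0 'not running' ...
    # Strip the "_<operation>_<interval>" suffix to recover the resource name.
    sed -n 's/^\* \(.*\)_[a-z]*_[0-9]* on .*/\1/p' | sort -u
}

# Hypothetical helper: print (dry run) one cleanup command per failed resource.
print_cleanup_commands() {
    failed_resources | while read -r res; do
        echo "pcs resource cleanup $res"
    done
}
```

For example, `pcs status | print_cleanup_commands` lists the commands to review before running them by hand. Running pcs resource cleanup without a resource name, as done in this report, cleans up all resources instead.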

Comment 2 Marios Andreou 2016-11-24 11:37:22 UTC
Created attachment 1223796 [details]
output from Gregory via http://etherpad.corp.redhat.com/iUGe8rDAHj

Comment 3 Marios Andreou 2016-11-24 11:40:11 UTC
Thanks Gregory. Going by the description in comment #0, and given that the heat or gnocchi services are down, I think you may be hitting BZ 1353031. A fix landed there in https://review.openstack.org/#/c/344823/ and that BZ has a Fixed In Version of openstack-tripleo-heat-templates-2.0.0-23.el7ost. Can you check which version of openstack-tripleo-heat-templates you have in your env? As you have noted, the workaround is to run pcs resource cleanup, after which the cluster should return to a clean state. If it isn't BZ 1353031 then we should probably assign this to DFG:Telemetry. It would be nice to sanity check the error in the traces from the etherpad Gregory points to in comment #0 (I attached them here: https://bugzilla.redhat.com/attachment.cgi?id=1223796 ).

WRT the upgrade not starting because the cluster has stopped services: that is a feature, not a bug. Previously we would have gone ahead with the upgrade even though there may have been services down (like gnocchi or heat in this case). We landed the checks in newton for the 9-->10 upgrade.

Hope that helps for now?
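
The version check requested above can be sketched as a small comparison against the Fixed In Version from BZ 1353031 (2.0.0-23). The function name version_at_least is hypothetical, and GNU sort -V is used here only as an approximation of RPM version ordering, which has its own comparison rules.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if version-release $1 is at least $2.
# Assumption: GNU `sort -V` ordering is close enough to RPM ordering
# for simple numeric version-release strings like the ones in this bug.
version_at_least() {
    # The smaller of the two strings sorts first; $1 >= $2 exactly when
    # that smallest string is $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

The installed version-release can be obtained with `rpm -q --qf '%{VERSION}-%{RELEASE}\n' <package>` and passed as the first argument, with the fixed version as the second.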

Comment 4 Gregory Charot 2016-11-24 13:04:27 UTC
Here is the version I used

$ rpm -qa | grep  openstack-tripleo-heat-templates
openstack-tripleo-heat-templates-5.1.0-3.el7ost.noarch
openstack-tripleo-heat-templates-compat-2.0.0-34.4.el7ost.noarch

For OSP 9 I use the "compat" one, so version 2.0.0-34.4:
$ rpm -ql openstack-tripleo-heat-templates-compat | grep mitaka
/usr/share/openstack-tripleo-heat-templates/compat/environments/major-upgrade-keystone-liberty-mitaka.yaml
/usr/share/openstack-tripleo-heat-templates/compat/extraconfig/tasks/liberty_to_mitaka_aodh_upgrade.yaml
/usr/share/openstack-tripleo-heat-templates/compat/extraconfig/tasks/liberty_to_mitaka_aodh_upgrade_1.pp
/usr/share/openstack-tripleo-heat-templates/compat/extraconfig/tasks/liberty_to_mitaka_aodh_upgrade_2.pp
/usr/share/openstack-tripleo-heat-templates/compat/extraconfig/tasks/liberty_to_mitaka_keystone_upgrade.pp
/usr/share/openstack-tripleo-heat-templates/compat/extraconfig/tasks/major_upgrade_keystone_liberty_mitaka.yaml
/usr/share/openstack-tripleo-heat-templates/mitaka

I looked at the upstream patch. The issue is indeed very similar, but the patch already seems to be present in my env:

$ grep -n -A9 "keystone-then-gnocchi-metricd-constraint"  /usr/share/openstack-tripleo-heat-templates/mitaka/puppet/manifests/overcloud_controller_pacemaker.pp

1966:    pacemaker::constraint::base { 'keystone-then-gnocchi-metricd-constraint':
1967-      constraint_type => 'order',
1968-      first_resource  => 'openstack-core-clone',
1969-      second_resource => "${::gnocchi::params::metricd_service_name}-clone",
1970-      first_action    => 'start',
1971-      second_action   => 'start',
1972-      require         => [Pacemaker::Resource::Service[$::gnocchi::params::metricd_service_name],
1973-                          Pacemaker::Resource::Ocf['openstack-core']],
1974-    }
1975-    pacemaker::constraint::base { 'gnocchi-metricd-then-gnocchi-statsd-constraint':

Agreed on the feature that prevents the upgrade if services are down. This is a must-have, I was just pointing it out!

Please contact me offline on IRC (gcharot) if you need to have a look at the env; if not, please let me know so I can wipe it out.

Comment 5 Marios Andreou 2016-11-24 14:25:57 UTC
(In reply to Gregory Charot from comment #4)

ACK - thanks for checking, Gregory. Please keep the environment around for a little longer if possible. I'm going to reach out to the PIDONE and Telemetry teams (adding internal whiteboard too for now) for a triage here, and they may need to get access.

@fabio and @pradk, I would appreciate any thoughts based on the description and comments here. I still suspect it may be related to one of the many things we've landed recently.

Comment 6 Gregory Charot 2016-11-24 14:46:08 UTC
Sure, will keep it around!

FYI, the env has the heat-engine resource down, not the gnocchi one. While trying to figure out whether the bug was repeatable, I ended up with heat-engine going down "instead" of gnocchi-statsd.

Comment 7 Michele Baldessari 2016-11-24 15:09:51 UTC
My initial hunch is that this is a duplicate of:
https://bugzilla.redhat.com/show_bug.cgi?id=1377788

Can we collect sosreports from all three controllers and put them up somewhere?

Comment 8 Alexander Chuzhoy 2016-11-24 15:56:15 UTC
Looking at the doc:
https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/paged/upgrading-red-hat-openstack-platform/chapter-3-director-based-environments-performing-upgrades-to-major-versions

There's a note after each update step:
"
Login to a Controller node and run the pcs status command to check if all resources are active in the Controller cluster. 
"

It seems we are missing the instruction to clean up resources upon finding failures.
Is this a doc bug?

Comment 9 Gregory Charot 2016-11-24 16:13:39 UTC
Created attachment 1223928 [details]
sosreport controller

File too big to be attached, please find it at

Comment 10 Sofer Athlan-Guyot 2016-12-14 11:50:26 UTC
Hi,

As Alexander puts it, it would be nice to have the OSP9 doc updated to specify the "pcs resource cleanup" command.

As rhos-10/rhos-11 will use a massively different upgrade approach, I think this cannot be carried forward to them.

Comment 16 Dan Macpherson 2017-02-03 04:45:44 UTC
Instructions now include pcs resource cleanup:

https://access.redhat.com/documentation/en/red-hat-openstack-platform/10/single/upgrading-red-hat-openstack-platform/#sect-Major-Upgrading_the_Overcloud

Sofer, anything further to add for this note?

Comment 17 Dan Macpherson 2017-02-23 08:00:42 UTC
No response in over two weeks. If nothing further to add, I'll close this BZ.

If further changes are required, please feel free to reopen it.

Comment 18 Sofer Athlan-Guyot 2017-03-05 23:23:25 UTC
Hi Dan,

Sorry I didn't reply earlier. The text is fine, thanks a lot.