| Summary: | Updating the overcloud results in stopped pcs resource. | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Gregory Charot <gcharot> | ||||||
| Component: | documentation | Assignee: | Dan Macpherson <dmacpher> | ||||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | RHOS Documentation Team <rhos-docs> | ||||||
| Severity: | unspecified | Docs Contact: | |||||||
| Priority: | low | ||||||||
| Version: | 9.0 (Mitaka) | CC: | chjones, dbecker, dmacpher, fbaudin, fdinitto, gcharot, jcoufal, mandreou, mburns, michele, morazi, pkilambi, rhel-osp-director-maint, sasha, sathlang, srevivo | ||||||
| Target Milestone: | --- | Keywords: | Documentation | ||||||
| Target Release: | --- | ||||||||
| Hardware: | Unspecified | ||||||||
| OS: | Unspecified | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2017-02-23 08:00:42 UTC | Type: | Bug | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Attachments: |
|
||||||||
Forgot to add that running a pcs resource cleanup solves the problem.

Created attachment 1223796
output from Gregory via http://etherpad.corp.redhat.com/iUGe8rDAHj

Thanks Gregory. Going on the description in comment #0, and given that the heat or gnocchi services are down, I think you may be hitting BZ 1353031. A fix landed there in https://review.openstack.org/#/c/344823/ and that BZ has a fixed-in version of openstack-tripleo-heat-templates-2.0.0-23.el7ost. Can you check which version of openstack-tripleo-heat-templates you have in your env? As you have noted, the workaround is to run a pcs resource cleanup and the cluster should go to a clean state.

If it isn't BZ 1353031 then we should probably assign to DFG:Telemetry. It would be nice to sanity check the error in the traces Gregory points to in the comment #0 etherpad (I attached them here: https://bugzilla.redhat.com/attachment.cgi?id=1223796).

WRT the upgrade not starting because the cluster has stopped services, that is a feature, not a bug. Previously we would have gone ahead with the upgrade even though there may have been services down (like gnocchi or heat in this case). We landed the checks into newton for the 9-->10 upgrade. Hope that helps for now?

Here is the version I used
$ rpm -qa | grep openstack-tripleo-heat-templates
openstack-tripleo-heat-templates-5.1.0-3.el7ost.noarch
openstack-tripleo-heat-templates-compat-2.0.0-34.4.el7ost.noarch
For OSP9 I use the "compat" one, so version 2.0.0-34.4
$ rpm -ql openstack-tripleo-heat-templates-compat | grep mitaka
/usr/share/openstack-tripleo-heat-templates/compat/environments/major-upgrade-keystone-liberty-mitaka.yaml
/usr/share/openstack-tripleo-heat-templates/compat/extraconfig/tasks/liberty_to_mitaka_aodh_upgrade.yaml
/usr/share/openstack-tripleo-heat-templates/compat/extraconfig/tasks/liberty_to_mitaka_aodh_upgrade_1.pp
/usr/share/openstack-tripleo-heat-templates/compat/extraconfig/tasks/liberty_to_mitaka_aodh_upgrade_2.pp
/usr/share/openstack-tripleo-heat-templates/compat/extraconfig/tasks/liberty_to_mitaka_keystone_upgrade.pp
/usr/share/openstack-tripleo-heat-templates/compat/extraconfig/tasks/major_upgrade_keystone_liberty_mitaka.yaml
/usr/share/openstack-tripleo-heat-templates/mitaka
I looked at the upstream patch. This is indeed very similar; however, the fix seems to be merged in my env:
$ grep -n -A9 "keystone-then-gnocchi-metricd-constraint" /usr/share/openstack-tripleo-heat-templates/mitaka/puppet/manifests/overcloud_controller_pacemaker.pp
1966: pacemaker::constraint::base { 'keystone-then-gnocchi-metricd-constraint':
1967- constraint_type => 'order',
1968- first_resource => 'openstack-core-clone',
1969- second_resource => "${::gnocchi::params::metricd_service_name}-clone",
1970- first_action => 'start',
1971- second_action => 'start',
1972- require => [Pacemaker::Resource::Service[$::gnocchi::params::metricd_service_name],
1973- Pacemaker::Resource::Ocf['openstack-core']],
1974- }
1975- pacemaker::constraint::base { 'gnocchi-metricd-then-gnocchi-statsd-constraint':
Agreed on the feature that prevents the upgrade if services are down; this is a must-have, just pointing it out!
Please contact me offline on IRC (gcharot) if you need to have a look at the env; if not, please let me know so I can wipe it out.
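
For reference, the version question above can be checked mechanically. The following is a minimal sketch, not an official tool: the helper name `version_at_least` is hypothetical, and it assumes GNU `sort -V` ordering is close enough to RPM's own comparison for simple versions like the two quoted in this report (2.0.0-34.4 installed, 2.0.0-23 fixed-in for BZ 1353031).

```shell
#!/bin/sh
# Sketch: is the installed version at least the fixed-in version?
# Assumption: GNU `sort -V` approximates RPM ordering for these simple
# version strings; it is not RPM's real comparison algorithm.
version_at_least() {
    installed=$1
    fixed=$2
    # The lower of the two versions sorts first. If that is the fixed-in
    # version, the installed one is equal or newer.
    lowest=$(printf '%s\n%s\n' "$installed" "$fixed" | sort -V | head -n 1)
    [ "$lowest" = "$fixed" ]
}

# Versions taken from this report (hypothetical usage of the helper).
if version_at_least "2.0.0-34.4" "2.0.0-23"; then
    echo "fix should be present"
else
    echo "fix is missing"
fi
```

On the environment described here this reports that the fix should be present, which matches the grep result above.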
(In reply to Gregory Charot from comment #4)
ACK - thanks for checking Gregory. Please keep the environment around for a little longer if possible. I'm going to reach out to the PIDONE and Telemetry teams (adding internal whiteboard too for now) for a triage here, and they may need to get access. @fabio and @pradk, I'd appreciate any thoughts based on the description and comments here. I still suspect it may be related to one of the many things we've landed recently.

Sure, will keep it around! FYI the env has the heat engine resource down, not the gnocchi one. While trying to figure out if the bug was repeatable I ended up with heat engine going down "instead" of gnocchi-statsd.

My initial hunch is that this is a duplicate of: https://bugzilla.redhat.com/show_bug.cgi?id=1377788
Can we collect sosreports from all three controllers and put them up somewhere?

Looking at the doc: https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/paged/upgrading-red-hat-openstack-platform/chapter-3-director-based-environments-performing-upgrades-to-major-versions
There's a note after each update step: "Login to a Controller node and run the pcs status command to check if all resources are active in the Controller cluster."
Seems like we miss the instruction to clean up resources upon finding failures. Is this a doc bug?

Created attachment 1223928
sosreport controller
File too big to be attached, please find it at
Hi,
As Alexander puts it, it would be nice to have the OSP9 doc updated to specify the "pcs resource cleanup" command. As for rhos-10/rhos-11, they will use a massively different upgrade approach, so I think this cannot be carried forward.

Instructions now include pcs resource cleanup: https://access.redhat.com/documentation/en/red-hat-openstack-platform/10/single/upgrading-red-hat-openstack-platform/#sect-Major-Upgrading_the_Overcloud
Sofer, anything further to add for this note?

No response in over two weeks. If nothing further to add, I'll close this BZ. If further changes are required, please feel free to reopen it.

Hi Dan,
Sorry I didn't reply earlier. The text is fine, thanks a lot.
Description of problem:
After doing an update, a pacemaker resource fails and ends up in a stopped state.

Version-Release number of selected component (if applicable):
OSP 9.

How reproducible:
This does not happen every time, and the affected resource seems to change; had it with openstack-gnocchi-statsd and openstack-heat-engine_start_0.

Steps to Reproduce:
1. Install basic overcloud
undercloud$ time openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates/mitaka/ --ntp-server 10.16.255.1 --control-scale 1 --compute-scale 2 --neutron-tunnel-types vxlan --neutron-network-type vxlan --control-flavor control --compute-flavor compute
2. Update overcloud
openstack overcloud update stack overcloud -i \
  --templates /usr/share/openstack-tripleo-heat-templates/mitaka/
3. SSH into the ctrl and do a pcs status.

Actual results:
Resource is down:
* openstack-heat-engine_start_0 on overcloud-controller-0 'not running' (7): call=544, status=complete, exitreason='none', last-rc-change='Wed Nov 23 13:59:34 2016', queued=0ms, exec=2443ms

Also had it with:
Failed Actions:
* openstack-gnocchi-statsd_start_0 on overcloud-controller-0 'not running' (7): call=246, status=complete, exitreason='none', last-rc-change='Tue Nov 22 23:54:00 2016', queued=0ms, exec=2101ms

Expected results:
No resource down.

Additional info:
* Haven't tried with HA ctrl
* Using director 10 to deploy OSP 9
* This does not happen every time
* This affects upgrade from 9 to 10 as the upgrade process will complain with: "deploy_stdout": "ERROR: upgrade cannot start with stopped resources on the cluster. Make sure that all the resources are up and running.\n",
* Detailed output for the gnocchi resource available here: http://etherpad.corp.redhat.com/iUGe8rDAHj
* I have an internal RH system with the heat-engine resource down available for inspection - please contact me if needed.
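
Since the 9 to 10 upgrade refuses to start while resources are stopped, it may help to spot failed actions before attempting it. Below is a minimal sketch under stated assumptions: the function name `failed_actions` is hypothetical, and it assumes failures are printed as `* <resource>_<operation>_<interval> on <node> ...`, as in the output quoted under "Actual results". It only reports and suggests the cleanup discussed in the comments; it does not run it.

```shell
#!/bin/sh
# Sketch: list failed actions found in `pcs status` text read from stdin.
# Assumption: each failure line looks like
#   * openstack-heat-engine_start_0 on overcloud-controller-0 'not running' ...
failed_actions() {
    # Print "<action> <node>" for every line matching that shape.
    awk '$1 == "*" && $3 == "on" && $2 ~ /_[a-z]+_[0-9]+$/ { print $2, $4 }'
}

# Only query the cluster when pcs is installed (i.e. on a controller node).
if command -v pcs >/dev/null 2>&1; then
    failures=$(pcs status 2>/dev/null | failed_actions)
    if [ -n "$failures" ]; then
        echo "Failed actions found:"
        echo "$failures"
        echo "Consider running: pcs resource cleanup"
    fi
fi
```

After a cleanup, re-running `pcs status` should show whether the resource came back before the upgrade is retried.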