Bug 1353031
| Field | Value |
|---|---|
| Summary | osp-director-9: After successful upgrade from OSP8 to OSP9 there are failed resources on the controllers for heat-engine/gnocchi/ceilometer. |
| Product | Red Hat OpenStack |
| Component | openstack-tripleo-heat-templates |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Version | 8.0 (Liberty) |
| Target Milestone | ga |
| Target Release | 9.0 (Mitaka) |
| Hardware | x86_64 |
| OS | Linux |
| Fixed In Version | openstack-tripleo-heat-templates-2.0.0-23.el7ost |
| Reporter | Omri Hochman <ohochman> |
| Assignee | Marios Andreou <mandreou> |
| QA Contact | Omri Hochman <ohochman> |
| CC | dbecker, dyasny, fdinitto, jason.dobies, jcoufal, jjoyce, jstransk, mburns, mcornea, michele, mlammon, morazi, ohochman, pkilambi, rhallise, rhel-osp-director-maint, rscarazz, sasha, sclewis, tvignaud, yprokule |
| Keywords | Triaged |
| Last Closed | 2016-08-11 11:35:13 UTC |
| Type | Bug |
| Bug Blocks | 1351784 |

Description
Omri Hochman
2016-07-05 20:26:18 UTC
Created attachment 1176659
messages from controller0

attaching messages from controller0

Created attachment 1176660
heat-engine.log from undercloud

Omri - Can you verify that this is repeatable and not related to a specific environment issue?

(In reply to Jay Dobies from comment #3)
> Omri - Can you verify that this is repeatable and not related to a specific
> environment issue?

Yes Jay, it is reproduced on my setup with the latest poodle:
-----------------------------------------------------------------
instack-undercloud-4.0.0-6.el7ost.noarch
instack-0.0.8-3.el7ost.noarch
openstack-heat-api-cfn-6.0.0-7.el7ost.noarch
python-heat-tests-6.0.0-7.el7ost.noarch
openstack-tripleo-heat-templates-liberty-2.0.0-14.el7ost.noarch
openstack-heat-api-cloudwatch-6.0.0-7.el7ost.noarch
python-heatclient-1.2.0-1.el7ost.noarch
openstack-heat-api-6.0.0-7.el7ost.noarch
openstack-tripleo-heat-templates-kilo-2.0.0-14.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-14.el7ost.noarch
openstack-heat-engine-6.0.0-7.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch
openstack-heat-common-6.0.0-7.el7ost.noarch
openstack-heat-templates-0-0.8.20150605git.el7ost.noarch

The same services listed in the BZ body are down post upgrade.

Reply comment #4 (*: I did not use the patch: https://review.openstack.org/#/c/334486/3/tripleoclient/v1/overcloud_deploy.py)

Running "pcs resource cleanup openstack-heat-engine" revives openstack-heat-engine (checked on 2 setups with the issue reproduced).

omri o/ (thanks jjoyce for the ping) We need to be more specific about which step the services were stopped after, so that we can debug what went wrong; otherwise it isn't clear what to start debugging and where. As we discussed in yesterday's call, after each step we need to ensure that the cluster is fully running, even if that means, for now, a manual intervention. I'm not clear if you have applied that in this environment, which happened a while ago afaics. If you *have*, then it implies the services are failing exactly in the last step, i.e. stopping/failing to restart during upgrade converge?
But if you haven't been checking after each of the steps in this env, then the services could have failed during any of the earlier steps, including the migrations. Speaking of which, I know (and you know) there are reports of services down after the keystone migration (https://bugzilla.redhat.com/show_bug.cgi?id=1348831 though that should be fixed now, at least it is in my local testing), as well as after the controller upgrade (rabbit issue at https://bugzilla.redhat.com/show_bug.cgi?id=1343905). Since you mentioned gnocchi, I also saw this fly by today: gnocchi-related pcs constraints fixup at https://review.openstack.org/#/c/344823/9. My point in mentioning these is that, depending on the answer to 'which step did it fail on', it may have the same root cause as one of those other bugzillas. Omri wdyt?

This issue may be related to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1348222 (a missing redis dependency).

While waiting for more info/testing from QE, especially after testing the rabbit fixup, explicitly noting that the reported 'gnocchi down' from the description above may be fixed by https://review.openstack.org/#/c/344823/ - adding it as a related review in the external tracker above.

I was able to finish the converge run with no failed/stopped services at all.
Here are the extra things which aren't yet merged downstream but were present on my env:
* the patch we mention above https://review.openstack.org/#/c/344823/
* manual install of python-cradox to work around bug 1359760
* the original workaround for bug 1343905 (not the latest fix, I have yet to test with that one)

I followed the latest upgrade guide and finished on 02 AUG 16 (http://etherpad.corp.redhat.com/ospd9-upgrade)

Initial deployment:
openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server clock.redhat.com --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml --ceph-storage-scale 1

I think, based on my latest upgrade testing as well as what Yuri saw in comment 14 (https://bugzilla.redhat.com/show_bug.cgi?id=1353031#c14), we can safely mark this verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHEA-2016-1599.html
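
For anyone repeating the per-step check discussed in the comments above (confirm the cluster is fully running after each upgrade step, and run "pcs resource cleanup" on whatever has failed), here is a minimal sketch of how that could be scripted. It is not part of the fix for this bug. The helper name failed_resources is hypothetical, and the parsing assumes that "pcs status" prints a "Failed Actions:" section whose entries look like "* <resource>_<operation>_<interval> on <node> ..."; other pcs/pacemaker versions may format this differently. It only looks at failed actions, so resources that are merely stopped still need a manual look at "pcs status".

```shell
#!/bin/sh
# Hypothetical helper: read `pcs status` output on stdin and print the
# unique resource IDs that appear in the "Failed Actions:" section.
# ASSUMPTION: entries have the form
#   * <resource>_<operation>_<interval> on <node> ...
# and the section ends at the first blank line.
failed_resources() {
    awk '
        /^Failed Actions:/ { in_failed = 1; next }
        in_failed && NF == 0 { in_failed = 0 }
        in_failed && $1 == "*" {
            name = $2
            # Strip the trailing _<operation>_<interval> to get the resource ID.
            sub(/_(start|stop|monitor|promote|demote|notify|migrate_to|migrate_from)_[0-9]+$/, "", name)
            print name
        }
    ' | sort -u
}

# Example use on a controller after an upgrade step (not run here):
#   pcs status | failed_resources | while read -r res; do
#       pcs resource cleanup "$res"
#   done
```

Keeping the parsing in a function that reads stdin means it can be tried against a saved copy of "pcs status" output before it is pointed at a live cluster.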