Bug 1354627
| Summary: | Existing nodes get rebuilt during scale out after 8->9 upgrade | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Marius Cornea <mcornea> |
| Component: | openstack-tripleo-common | Assignee: | Brad P. Crochet <brad> |
| Status: | CLOSED ERRATA | QA Contact: | Marius Cornea <mcornea> |
| Severity: | urgent | Priority: | unspecified |
| Version: | 9.0 (Mitaka) | CC: | dbecker, gfidente, jason.dobies, jcoufal, jstransk, mburns, mcornea, morazi, ramishra, rhel-osp-director-maint, sasha, sclewis, slinaber, tvignaud, zbitter |
| Target Milestone: | ga | Keywords: | Reopened, Triaged |
| Target Release: | 9.0 (Mitaka) | Type: | Bug |
| Hardware: | Unspecified | OS: | Unspecified |
| Fixed In Version: | openstack-tripleo-common-2.0.0-8.el7ost | Doc Type: | If docs needed, set a value |
| : | 1409851 (view as bug list) | Last Closed: | 2016-08-11 11:36:00 UTC |
| Bug Blocks: | 1362612, 1409851 | | |
Description
Marius Cornea, 2016-07-11 19:08:27 UTC
I wasn't able to reproduce this issue with the latest build. I'm going to reopen it if I see it again.

Reopening it - I was able to reproduce it only when I update the images after the last upgrade step (major-upgrade-pacemaker-converge.yaml). Note that if I update the images right after upgrading the undercloud, this issue doesn't show up. Also note that the enable-tls.yaml environment was changed during the overcloud upgrade process in order to overcome BZ#1353079#c6. Nevertheless, the result is destructive, as all the nodes get recreated, so we should make sure a user doesn't end up in this situation.

Just checked that the issue doesn't reproduce on a clean deployment of 9 + scale out.

Ran into https://bugzilla.redhat.com/show_bug.cgi?id=1360421, which probably confirms this bug.

I was able to reproduce this without the Ceph nodes, so the ordering of the image update is looking like a good candidate. I will investigate why that is.

I have now reproduced this with a single controller (running pacemaker) and a single compute. Still investigating why the ordering of the image upload makes a difference.

@mcornea Do you run 'openstack baremetal configure boot' when you update the images right after upgrading the undercloud?

(In reply to Brad P. Crochet from comment #15)
> @mcornea Do you run the 'openstack baremetal configure boot' when you update
> the images right after the upgrading the undercloud?

Yes, I do. What I did was:

```shell
source ~/stackrc
openstack overcloud image upload --update-existing
openstack baremetal configure boot
```

I hit the nodes getting rebuilt in a different scenario:

1. update images
2. upgrade the overcloud
3. scale out with an additional compute node
4. remove the added node:

```shell
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud node delete --stack overcloud --templates \
  -e $THT/environments/network-isolation.yaml \
  -e $THT/environments/network-management.yaml \
  -e ~/templates/network-environment.yaml \
  -e $THT/environments/storage-environment.yaml \
  -e ~/templates/disk-layout.yaml \
  -e ~/templates/wipe-disk-env.yaml \
  fe07a5fb-52ff-4736-acda-64e4267301ff
```

resulting in:

```
[stack@undercloud ~]$ nova list
+--------------------------------------+-------------------------+---------+------------------+-------------+-----------------------+
| ID                                   | Name                    | Status  | Task State       | Power State | Networks              |
+--------------------------------------+-------------------------+---------+------------------+-------------+-----------------------+
| 063f9adc-626f-4735-96ca-471e232f90c7 | overcloud-cephstorage-0 | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.20 |
| 6b1ab8c3-835d-427f-bb4a-0c831313d098 | overcloud-compute-0     | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.21 |
| e4c7970b-d43d-46fe-b959-44e367e76b16 | overcloud-controller-0  | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.23 |
| a88fd352-532b-4c23-a845-e53ece208811 | overcloud-controller-1  | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.22 |
| 23810c05-cdf0-4063-86d6-6ed9797e189f | overcloud-controller-2  | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.24 |
+--------------------------------------+-------------------------+---------+------------------+-------------+-----------------------+
```

Progress... I tried doing the upgrade, but manually changing the deploy_kernel and deploy_ramdisk on only the unused nodes, leaving the already installed nodes alone. The old nodes were not rebuilt. It's probably a bad idea to have the old images updated like that anyway. So the fix may need to come in the 'configure boot' command, having it ignore already installed nodes.
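The fix proposed above -- having 'configure boot' leave already installed nodes alone -- can be sketched roughly as follows. This is a hypothetical illustration, not the actual tripleo-common patch: the node dicts stand in for what python-ironicclient would return, and `nodes_safe_to_update`/`new_driver_info_patch` are made-up helper names. The one piece taken from Ironic itself is that a deployed node has provision state `active`.

```python
# Hypothetical sketch of the proposed 'configure boot' fix: only nodes that
# are not already deployed get pointed at the new deploy kernel/ramdisk,
# so re-running the command can never trigger a rebuild of live nodes.

def nodes_safe_to_update(nodes):
    """Filter out nodes whose provision_state shows them as deployed."""
    return [n for n in nodes if n["provision_state"] != "active"]

def new_driver_info_patch(kernel_uuid, ramdisk_uuid):
    """Build the driver_info update that would be applied to each node."""
    return {
        "deploy_kernel": kernel_uuid,
        "deploy_ramdisk": ramdisk_uuid,
    }

# Stand-ins for Ironic node records (real code would call the Ironic API).
nodes = [
    {"uuid": "ctrl-0", "provision_state": "active"},      # deployed, leave alone
    {"uuid": "spare-1", "provision_state": "available"},  # free for scale-out
]

for node in nodes_safe_to_update(nodes):
    node["driver_info"] = new_driver_info_patch("kernel-uuid", "ramdisk-uuid")
```

With this filter in place, the already installed controller keeps its old deploy images (matching the manual workaround described above), while spare nodes pick up the new ones for future scale-out.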
Just a data point -- (assuming this is not an intermittent issue) I was able to prevent the rebuild from happening by editing the heat-engine code this way:

https://paste.fedoraproject.org/399907/1497781/raw/

Obviously this is not a solution, but maybe it could help us narrow down the search for the cause. I wonder why doing the above is necessary, even though we have a Heat plugin to ignore property changes on OS::Nova::Server, which previously seemed to prevent OS::Nova::Server replacement successfully:

https://github.com/openstack/tripleo-common/blob/stable/mitaka/undercloud_heat_plugins/server_update_allowed.py

Is the difference here rebuild vs. replace, perhaps? Previously we've seen issues where a 2nd instance of the server was deployed, while now we see them rebuilding instead.

Another data point -- I managed to reproduce the issue like this:

```shell
# ... finish upgrade ...
tar -xvf overcloud-full.tar
openstack overcloud image upload --update-existing
# ... and now do the scale up ...
```

The point being I didn't download the updated ironic agent image and I didn't run the `configure boot` command, but the issue still reproduced.

(In reply to Jiri Stransky from comment #21)
> Just a data point -- (assuming this is not an intermittent issue) i was able
> to prevent the rebuild from happening by editing heat-engine code this way:
>
> https://paste.fedoraproject.org/399907/1497781/raw/
>
> Obviously this is not a solution, but maybe it could help us narrow down the
> search for the cause.

Yes, the proximate cause is clearly that the image name is changing. So we need to figure out why.

> Is the difference here rebuild vs. replace perhaps? Previously we've seen
> issues where a 2nd instance of the server was deployed, while now we see
> them rebuilding instead.

Correct, that custom plugin is designed to prevent property changes from triggering a replacement, not to ignore all changes.

(In reply to Zane Bitter from comment #24)
> Yes, the proximate cause is clearly that the image name is changing. So we
> need to figure out why.

It does seem to happen only on scale up/down, rather than on a "simple" stack update.

This change in Mitaka, https://review.openstack.org/#/c/287834/11, added a translation rule that causes the 'image' property passed to OS::Nova::Server to be automatically translated to a UUID before the properties are assembled. The result is that TripleO's previous trick of uploading a new image under the same name (but with a new UUID) no longer works to prevent Heat from rebuilding the server when the image changes.

I assume the restrict update/replace feature can also be used to avoid the update/replacement:

http://docs.openstack.org/developer/heat/template_guide/environment.html#restrict-update-or-replace-of-a-given-resource

openstack-tripleo-common-2.0.0-8.el7ost.noarch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1599.html
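For reference, the restrict-update-or-replace feature linked in the discussion above is driven by a `resources` section in a Heat environment file. A minimal sketch might look like the following; the resource name pattern is an assumption for illustration, not the registry entry TripleO actually ships:

```yaml
# Hypothetical Heat environment fragment: forbid Heat from replacing or
# updating resources whose names match the pattern. With this in place, a
# stack update that would otherwise rebuild or replace the matched servers
# fails loudly instead of silently recreating the nodes.
resource_registry:
  resources:
    "*Controller*":
      restricted_actions: [replace, update]
```

Such a file would be passed to the deploy command with `-e`, like the other environment files shown earlier in this report.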