Bug 1354627

Summary:	Existing nodes get rebuilt during scale out after 8->9 upgrade
Product:	Red Hat OpenStack	Reporter:	Marius Cornea <mcornea>
Component:	openstack-tripleo-common	Assignee:	Brad P. Crochet <brad>
Status:	CLOSED ERRATA	QA Contact:	Marius Cornea <mcornea>
Severity:	urgent	Docs Contact:
Priority:	unspecified
Version:	9.0 (Mitaka)	CC:	dbecker, gfidente, jason.dobies, jcoufal, jstransk, mburns, mcornea, morazi, ramishra, rhel-osp-director-maint, sasha, sclewis, slinaber, tvignaud, zbitter
Target Milestone:	ga	Keywords:	Reopened, Triaged
Target Release:	9.0 (Mitaka)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	openstack-tripleo-common-2.0.0-8.el7ost	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1409851		Environment:
Last Closed:	2016-08-11 11:36:00 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1362612, 1409851

Description Marius Cornea 2016-07-11 19:08:27 UTC

Description of problem:

During the first scale-out attempt (adding an additional compute node) on an upgraded deployment, all the existing nodes get rebuilt:

[stack@undercloud ~]$ nova list


+--------------------------------------+-------------------------+---------+------------------+-------------+-----------------------+
| ID                                   | Name                    | Status  | Task State       | Power State | Networks              |
+--------------------------------------+-------------------------+---------+------------------+-------------+-----------------------+
| bef41d32-bfed-42c8-9839-bd07f8ad2d93 | overcloud-cephstorage-0 | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.21 |
| 5f7fe8b4-54ee-4790-8fd4-c9ab1fad5cf8 | overcloud-cephstorage-1 | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.20 |
| 3772244f-e332-458a-b1b9-6f466a6d7411 | overcloud-cephstorage-2 | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.24 |
| 8a5dbbfd-696e-4ed3-be94-4f77c0f871b5 | overcloud-controller-0  | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.25 |
| ebcdcdb5-de05-45c6-b430-b0119ae04a60 | overcloud-controller-1  | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.23 |
| 4a118c5b-de23-48e8-8dc1-f3e1aaea2db6 | overcloud-controller-2  | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.26 |
| 692099f5-1ddd-4e0a-9fe4-e7c47d1f4d36 | overcloud-novacompute-0 | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.22 |
| 90795dbb-a353-4203-a820-108e3015b499 | overcloud-novacompute-1 | BUILD   | spawning         | NOSTATE     | ctlplane=192.168.0.11 |
+--------------------------------------+-------------------------+---------+------------------+-------------+-----------------------+ 
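When watching for this condition during a scale out, the affected nodes can be picked out of the `nova list` table mechanically. The following is a minimal sketch, assuming the column layout shown above (ID, Name, Status, ...); `rebuilding_nodes` is a hypothetical helper name, not part of any OpenStack client.

```shell
# Print the names of servers whose Status column is REBUILD.
# Reads a `nova list` style table on stdin. Assumes the columns are
# ordered ID | Name | Status | ... as in the output above.
rebuilding_nodes() {
    awk -F'|' '
        NF > 3 {
            name = $3; status = $4
            # Strip the padding around each cell.
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", status)
            if (status == "REBUILD") print name
        }'
}
```

Usage: nova list | rebuilding_nodes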

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-2.0.0-12.el7ost.noarch

How reproducible:


Steps to Reproduce:
1. Do initial deployment
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com \
--libvirt-type qemu

2. Upgrade undercloud 
sudo yum update -y
openstack undercloud upgrade

3. Apply THT patches
pushd /usr/share/openstack-tripleo-heat-templates
curl -4 'https://review.openstack.org/gitweb?p=openstack/tripleo-heat-templates.git;a=patch;h=947ed53bc01ad25c90b403e9ad6cef4673a2e71f' | sudo patch -p1
curl -4 'https://review.openstack.org/gitweb?p=openstack/tripleo-heat-templates.git;a=patch;h=65efc468db16a28c623c428dd205f100809c73a1' | sudo patch -p1
curl -4 'https://review.openstack.org/gitweb?p=openstack/tripleo-heat-templates.git;a=patch;h=a6cd51c0c987dec1f391438e87511c5285b07124' | sudo patch -p1
popd

4. Add OSP8 repos on overcloud nodes

5. major-upgrade-aodh.yaml

source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-aodh.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com \
--libvirt-type qemu

6. major-upgrade-keystone-liberty-mitaka.yaml 

source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-keystone-liberty-mitaka.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com \
--libvirt-type qemu

7. Add OSP9 repos on overcloud nodes

8. major-upgrade-pacemaker-init.yaml
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-init.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com \
--libvirt-type qemu

9. Update os-collect-config and resource-agents on overcloud nodes

10. major-upgrade-pacemaker.yaml
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com \
--libvirt-type qemu

11. Start rabbitmq on controller-1 and controller-2
systemctl start rabbitmq-server.service
pcs resource cleanup

12. Upgrade the compute node
upgrade-non-controller.sh --upgrade overcloud-novacompute-0

13. Upgrade the Ceph storage nodes
upgrade-non-controller.sh --upgrade overcloud-cephstorage-0
upgrade-non-controller.sh --upgrade overcloud-cephstorage-1
upgrade-non-controller.sh --upgrade overcloud-cephstorage-2

14. major-upgrade-pacemaker-converge.yaml
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com \
--libvirt-type qemu

15. Update images
openstack overcloud image upload --update-existing
openstack baremetal configure boot

16. Scale out
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 2 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com \
--libvirt-type qemu

Actual results:
All the nodes get reprovisioned.

Expected results:
The existing nodes remain running.

Additional info:
I provided the overcloud-resource-registry-puppet.yaml environment file only in the last upgrade step (major-upgrade-pacemaker-converge.yaml); I missed it in the earlier commands. Could this be the cause of this behavior?
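One way to rule out accidental differences between the invocations in the steps above is to generate each command from a single shared argument list. This is only a sketch: `deploy_cmd` is a hypothetical helper, it prints the command instead of running it, and it deliberately leaves out overcloud-resource-registry-puppet.yaml because whether that file matters is the open question here.

```shell
# Hypothetical helper, not part of TripleO: print (dry run) the deploy
# command for one step so that every step shares the same argument list.
# $1 = compute scale, $2 = optional extra environment file.
# The parenthesised body runs in a subshell, so "set --" stays local.
deploy_cmd() (
    tht=${THT:-/usr/share/openstack-tripleo-heat-templates}
    compute_scale=$1
    extra_env=${2:-}
    set -- openstack overcloud deploy --templates \
        -e "$tht/environments/network-isolation.yaml" \
        -e "$tht/environments/network-management.yaml" \
        -e "$HOME/templates/network-environment.yaml" \
        -e "$tht/environments/storage-environment.yaml" \
        -e "$HOME/templates/disk-layout.yaml" \
        -e "$HOME/templates/wipe-disk-env.yaml" \
        -e "$HOME/templates/enable-tls.yaml" \
        -e "$HOME/templates/inject-trust-anchor.yaml" \
        --control-scale 3 --control-flavor controller \
        --compute-flavor compute \
        --ceph-storage-scale 3 --ceph-storage-flavor ceph \
        --ntp-server clock.redhat.com --libvirt-type qemu
    # Append the step-specific environment file, if one was given.
    if [ -n "$extra_env" ]; then
        set -- "$@" -e "$extra_env"
    fi
    set -- "$@" --compute-scale "$compute_scale"
    printf '%s\n' "$*"
)
```

For example, deploy_cmd 1 "$THT/environments/major-upgrade-pacemaker-converge.yaml" would print the converge step and deploy_cmd 2 the scale out, differing only in the extra environment file and the compute count.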

Comment 2 Marius Cornea 2016-07-12 12:34:58 UTC

I wasn't able to reproduce this issue with the latest build. I'm going to reopen it if I see it again.

Comment 3 Marius Cornea 2016-07-14 08:18:42 UTC

Reopening: I was able to reproduce this only when I updated the images after the last upgrade step (major-upgrade-pacemaker-converge.yaml).

Note that if I update the images right after upgrading the undercloud, this issue doesn't show up. Also note that the enable-tls.yaml environment was changed during the overcloud upgrade process in order to work around BZ#1353079#c6.

Nevertheless, the result is destructive because all the nodes get recreated, so we should make sure a user doesn't end up in this situation.

Comment 11 Alexander Chuzhoy 2016-07-25 20:16:13 UTC

Just checked that the issue doesn't reproduce on a clean deployment of OSP 9 followed by a scale out.

Comment 12 Alexander Chuzhoy 2016-07-26 17:11:24 UTC

Ran into https://bugzilla.redhat.com/show_bug.cgi?id=1360421, which probably confirms this bug.

Comment 13 Brad P. Crochet 2016-07-28 12:01:45 UTC

I was able to reproduce this without the Ceph nodes, so the ordering of the image update is looking like a good candidate. I will investigate why that is.

Comment 14 Brad P. Crochet 2016-07-29 11:40:56 UTC

I have now reproduced this with a single controller (running pacemaker) and a single compute. Still investigating why the ordering of the image upload makes a difference.

Comment 15 Brad P. Crochet 2016-07-29 11:42:30 UTC

@mcornea Do you run 'openstack baremetal configure boot' when you update the images right after upgrading the undercloud?

Comment 16 Marius Cornea 2016-07-29 12:25:33 UTC

(In reply to Brad P. Crochet from comment #15)
> @mcornea Do you run the 'openstack baremetal configure boot' when you update
> the images right after the upgrading the undercloud?

Yes, I do. What I did was:

source ~/stackrc; 
openstack overcloud image upload --update-existing
openstack baremetal configure boot

Comment 17 Marius Cornea 2016-07-29 13:33:10 UTC

I hit the nodes getting rebuilt in a different scenario:

1. update images
2. upgrade overcloud
3. scale out additional compute node

4. remove the added node:

source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud node delete --stack overcloud --templates \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
fe07a5fb-52ff-4736-acda-64e4267301ff

resulting in:
[stack@undercloud ~]$ nova list
+--------------------------------------+-------------------------+---------+------------------+-------------+-----------------------+
| ID                                   | Name                    | Status  | Task State       | Power State | Networks              |
+--------------------------------------+-------------------------+---------+------------------+-------------+-----------------------+
| 063f9adc-626f-4735-96ca-471e232f90c7 | overcloud-cephstorage-0 | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.20 |
| 6b1ab8c3-835d-427f-bb4a-0c831313d098 | overcloud-compute-0     | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.21 |
| e4c7970b-d43d-46fe-b959-44e367e76b16 | overcloud-controller-0  | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.23 |
| a88fd352-532b-4c23-a845-e53ece208811 | overcloud-controller-1  | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.22 |
| 23810c05-cdf0-4063-86d6-6ed9797e189f | overcloud-controller-2  | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.24 |
+--------------------------------------+-------------------------+---------+------------------+-------------+-----------------------+

Comment 18 Brad P. Crochet 2016-08-01 23:07:41 UTC

Progress... I tried doing the upgrade, but manually changing the deploy_kernel and deploy_ramdisk on only unused nodes, leaving the already installed nodes alone. The old nodes were not rebuilt. It's probably a bad idea to have the old images updated like that anyway. So, the fix may need to come in the 'configure boot' command, and have it ignore already installed nodes.
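The experiment described above (touching only nodes that have nothing deployed on them) could be approximated along these lines. This is a hedged sketch, not the eventual fix: `unused_node_updates` is a hypothetical helper, it assumes the `ironic node-list` table has the columns UUID | Name | Instance UUID | ..., and it only prints `ironic node-update` commands for review rather than running them.

```shell
# Print dry-run update commands for nodes that have no instance deployed.
# $1 = deploy kernel image ID, $2 = deploy ramdisk image ID.
# stdin: a table in the assumed `ironic node-list` layout
#        (| UUID | Name | Instance UUID | ... |).
unused_node_updates() {
    awk -F'|' -v kernel="$1" -v ramdisk="$2" '
        NF > 4 {
            uuid = $2; instance = $4
            gsub(/^[ \t]+|[ \t]+$/, "", uuid)
            gsub(/^[ \t]+|[ \t]+$/, "", instance)
            # Skip the header row and any node that already hosts a server.
            if (uuid != "UUID" && instance == "None")
                printf "ironic node-update %s replace driver_info/deploy_kernel=%s driver_info/deploy_ramdisk=%s\n", uuid, kernel, ramdisk
        }'
}
```

Usage: ironic node-list | unused_node_updates <kernel-image-id> <ramdisk-image-id>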

Comment 21 Jiri Stransky 2016-08-02 15:12:40 UTC

Just a data point -- (assuming this is not an intermittent issue) I was able to prevent the rebuild from happening by editing heat-engine code this way:

https://paste.fedoraproject.org/399907/1497781/raw/

Obviously this is not a solution, but maybe it could help us narrow down the search for the cause. I wonder why doing the above is necessary, even though we have a Heat plugin to ignore property changes on OS::Nova::Server, which seemed to previously prevent OS::Nova::Server replacement successfully:

https://github.com/openstack/tripleo-common/blob/stable/mitaka/undercloud_heat_plugins/server_update_allowed.py

Is the difference here rebuild vs. replace perhaps? Previously we've seen issues where a second instance of the server was deployed, while now we see the servers being rebuilt instead.



Another data point -- I managed to reproduce the issue like this:

# ... finish upgrade ...

tar -xvf overcloud-full.tar
openstack overcloud image upload --update-existing

# ... and now do the scale up ...

^^ The point being that I didn't download an updated Ironic agent image and I didn't run the `configure boot` command, but the issue still reproduced.

Comment 24 Zane Bitter 2016-08-02 15:22:26 UTC

(In reply to Jiri Stransky from comment #21)
> Just a data point -- (assuming this is not an intermittent issue) i was able
> to prevent the rebuild from happening by editing heat-engine code this way:
> 
> https://paste.fedoraproject.org/399907/1497781/raw/
> 
> Obviously this is not a solution, but maybe it could help us narrow down the
> search for the cause.

Yes, the proximate cause is clearly that the image name is changing. So we need to figure out why.

> I wonder why doing the above is necessary, even though
> we have a Heat plugin to ignore property changes on OS::Nova::Server, which
> seemed to previously prevent OS::Nova::Server replacement successfully:
> 
> https://github.com/openstack/tripleo-common/blob/stable/mitaka/
> undercloud_heat_plugins/server_update_allowed.py
> 
> Is the difference here rebuild vs. replace perhaps? Previously we've seen
> issues where 2nd instance of the server was deployed, while now we see them
> rebuilding instead.

Correct, that custom plugin is designed to prevent any changes to properties triggering a replacement, not to ignore all changes.

Comment 25 Brad P. Crochet 2016-08-02 15:29:29 UTC

(In reply to Zane Bitter from comment #24)
> (In reply to Jiri Stransky from comment #21)
> > Just a data point -- (assuming this is not an intermittent issue) i was able
> > to prevent the rebuild from happening by editing heat-engine code this way:
> > 
> > https://paste.fedoraproject.org/399907/1497781/raw/
> > 
> > Obviously this is not a solution, but maybe it could help us narrow down the
> > search for the cause.
> 
> Yes, the proximate cause is clearly that the image name is changing. So we
> need to figure out why.
> 

It does seem to happen only on scale up/down, rather than on a "simple" stack update.

Comment 26 Zane Bitter 2016-08-02 15:45:37 UTC

This change in Mitaka:

https://review.openstack.org/#/c/287834/11

added a translation rule that causes the 'image' property passed to OS::Nova::Server to be automatically translated to a UUID prior to the properties being assembled. The result is that TripleO's previous trick of uploading a new image and keeping the same name (but getting a new UUID) no longer works to prevent Heat from rebuilding the server when the image changes.
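The effect described above can be illustrated with a toy model. This is not Heat code: the function names, modes, and IDs below are made up for illustration, and the model only captures the idea that comparing names sees no change after a re-upload under the same name, while comparing translated UUIDs does.

```shell
# Toy model only (hypothetical names and IDs, not Heat code).
# resolve_image MODE NAME UUID prints the value that would be stored
# for the 'image' property under each behaviour.
resolve_image() {
    mode=$1
    name=$2
    uuid=$3
    case "$mode" in
        by-name) printf '%s\n' "$name" ;;  # no translation: keep the name
        by-uuid) printf '%s\n' "$uuid" ;;  # translation rule: name -> UUID
    esac
}

# image_changed MODE NAME OLD_UUID NEW_UUID succeeds (exit 0) when the
# stored value differs before and after the image is re-uploaded.
image_changed() {
    before=$(resolve_image "$1" "$2" "$3")
    after=$(resolve_image "$1" "$2" "$4")
    [ "$before" != "$after" ]
}
```

In this model, image_changed by-name overcloud-full old-id new-id fails (no change seen, so no rebuild), while image_changed by-uuid overcloud-full old-id new-id succeeds, which corresponds to the rebuild observed in this bug.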

Comment 27 Rabi Mishra 2016-08-02 16:44:51 UTC

I assume the update_restrict feature can also be used to avoid update/replacement.

http://docs.openstack.org/developer/heat/template_guide/environment.html#restrict-update-or-replace-of-a-given-resource

Comment 29 Marius Cornea 2016-08-05 15:26:52 UTC

openstack-tripleo-common-2.0.0-8.el7ost.noarch

Comment 31 errata-xmlrpc 2016-08-11 11:36:00 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1599.html