Bug 1360421
Summary: | rhel-osp-director: Attempted to scale +1 compute after upgrade 8.0->9.0, without "openstack baremetal configure boot" - the setup is in a bad state, can't fix. | | |
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Alexander Chuzhoy <sasha> |
Component: | rhosp-director | Assignee: | Brad P. Crochet <brad> |
Status: | CLOSED WONTFIX | QA Contact: | Omri Hochman <ohochman> |
Severity: | high | Docs Contact: | |
Priority: | medium | | |
Version: | 9.0 (Mitaka) | CC: | bnemec, dbecker, jason.dobies, jcoufal, mburns, morazi, rhel-osp-director-maint, tvignaud |
Target Milestone: | ga | Keywords: | Triaged |
Target Release: | 9.0 (Mitaka) | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2016-08-02 13:21:47 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
Alexander Chuzhoy
2016-07-26 17:03:26 UTC
On further review, this looks like we are putting the cloud in a bad state and then trying to scale out. Should this get similar treatment to https://bugzilla.redhat.com/show_bug.cgi?id=1356777, i.e. can we document "make sure the cloud is in a reasonable state before trying scale, update, or upgrade type operations"?

The concerning thing here is that forgetting to run configure boot can leave your cloud in an unrecoverable state (this is also an example of why validation errors should be fatal by default). It _looks_ to me like this may have triggered a rebuild of all the existing nodes, based on the fact that the previously deployed instances have all gone to error state too (unless the initial deploy failed, in which case we are back to "make sure your cloud is in a consistent state", but it's not clear to me whether that is the case here). So I'm not sure we can call this a doc text-only bug; it may very well be related to the node rebuild bug Brad is looking into, and may be fixed when that one is.

Closing this out for 9, as it represents an unlikely case. This can be addressed via a new bug for 10 to handle such CLI interactions.

Note that I went ahead and pushed a patch upstream to make this sort of error fatal, so we won't mistakenly try to deploy when the nodes are in a bad state: https://review.openstack.org/349609. Hopefully that will at least help with similar situations in the future.
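For reference, a rough sketch (not taken from this report) of the pre-scale sequence being discussed, using the OSP 9 / Mitaka-era CLI on the undercloud; the environment file path and the compute count are placeholders and must match whatever was used for the original deployment:

```shell
# With the undercloud's stackrc sourced, sanity-check the overcloud before
# any scale/update/upgrade operation: Ironic nodes should be active and the
# previously deployed overcloud instances should not be in error state.
ironic node-list
nova list

# Re-run the boot configuration step that was skipped in this report, so
# every registered node has the correct deploy kernel/ramdisk assigned.
openstack baremetal configure boot

# Only then re-run the deploy command to scale out. The environment file and
# --compute-scale value below are placeholders for this sketch.
openstack overcloud deploy --templates \
  -e /home/stack/my-environment.yaml \
  --compute-scale 2
```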