rhel-osp-director: Attempted to scale +1 compute after upgrade 8.0->9.0, without "openstack baremetal configure boot" - the setup is in a bad state, can't fix.

Environment:
instack-undercloud-4.0.0-8.el7ost.noarch
openstack-tripleo-heat-templates-kilo-2.0.0-18.el7ost.noarch
openstack-tripleo-heat-templates-liberty-2.0.0-18.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-18.el7ost.noarch
openstack-puppet-modules-8.1.5-1.el7ost.noarch

Steps to reproduce:

1. Deploy 8.0 with:

    openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server clock.redhat.com --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml --ceph-storage-scale 1

2. Upgrade the setup to 9.0.

3. Upgrade the overcloud images and don't run "openstack baremetal configure boot".

4. Attempt to scale +1 compute. This gives warnings:

    [stack@instack ~]$ openstack overcloud deploy --templates --control-scale 3 --compute-scale 2 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server clock.redhat.com --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml --ceph-storage-scale 1
    Node uuid=5c7d7721-ecaa-44ff-81d0-8cc2b8a49fb3 has an incorrectly configured driver_info/deploy_ramdisk. Expected "26fde597-058b-4992-8e7a-6f51fe4c275c" but got "c93d25d3-a08a-4247-ad9c-a4af3aa9f4b7".
    Node uuid=5c7d7721-ecaa-44ff-81d0-8cc2b8a49fb3 has an incorrectly configured driver_info/deploy_kernel. Expected "26fde597-058b-4992-8e7a-6f51fe4c275c" but got "7d7ad1c0-8b33-45b8-8906-2dd3050e1e8d".
    Node uuid=6c95b7f6-fd1e-4378-9adc-058167b72b51 has an incorrectly configured driver_info/deploy_ramdisk. Expected "26fde597-058b-4992-8e7a-6f51fe4c275c" but got "c93d25d3-a08a-4247-ad9c-a4af3aa9f4b7".
    Node uuid=6c95b7f6-fd1e-4378-9adc-058167b72b51 has an incorrectly configured driver_info/deploy_kernel. Expected "26fde597-058b-4992-8e7a-6f51fe4c275c" but got "7d7ad1c0-8b33-45b8-8906-2dd3050e1e8d".
    Node uuid=defa4c68-5020-4b2d-bf68-18557a5bd71e has an incorrectly configured driver_info/deploy_ramdisk. Expected "26fde597-058b-4992-8e7a-6f51fe4c275c" but got "c93d25d3-a08a-4247-ad9c-a4af3aa9f4b7".
    Node uuid=defa4c68-5020-4b2d-bf68-18557a5bd71e has an incorrectly configured driver_info/deploy_kernel. Expected "26fde597-058b-4992-8e7a-6f51fe4c275c" but got "7d7ad1c0-8b33-45b8-8906-2dd3050e1e8d".
    Node uuid=5f3e2d26-89e3-4e1a-904f-0573791d4eab has an incorrectly configured driver_info/deploy_ramdisk. Expected "26fde597-058b-4992-8e7a-6f51fe4c275c" but got "c93d25d3-a08a-4247-ad9c-a4af3aa9f4b7".
    Node uuid=5f3e2d26-89e3-4e1a-904f-0573791d4eab has an incorrectly configured driver_info/deploy_kernel. Expected "26fde597-058b-4992-8e7a-6f51fe4c275c" but got "7d7ad1c0-8b33-45b8-8906-2dd3050e1e8d".
    Node uuid=d68671c6-d4f0-4136-96db-2de12315607f has an incorrectly configured driver_info/deploy_ramdisk. Expected "26fde597-058b-4992-8e7a-6f51fe4c275c" but got "c93d25d3-a08a-4247-ad9c-a4af3aa9f4b7".
    Node uuid=d68671c6-d4f0-4136-96db-2de12315607f has an incorrectly configured driver_info/deploy_kernel. Expected "26fde597-058b-4992-8e7a-6f51fe4c275c" but got "7d7ad1c0-8b33-45b8-8906-2dd3050e1e8d".
    Node uuid=8026afe2-7291-4e50-a3ee-55ac2b14d139 has an incorrectly configured driver_info/deploy_ramdisk. Expected "26fde597-058b-4992-8e7a-6f51fe4c275c" but got "c93d25d3-a08a-4247-ad9c-a4af3aa9f4b7".
    Node uuid=8026afe2-7291-4e50-a3ee-55ac2b14d139 has an incorrectly configured driver_info/deploy_kernel. Expected "26fde597-058b-4992-8e7a-6f51fe4c275c" but got "7d7ad1c0-8b33-45b8-8906-2dd3050e1e8d".
    Node uuid=d6a43128-f78c-458b-b1a0-79a4a4a94fda has an incorrectly configured driver_info/deploy_ramdisk. Expected "26fde597-058b-4992-8e7a-6f51fe4c275c" but got "c93d25d3-a08a-4247-ad9c-a4af3aa9f4b7".
    Node uuid=d6a43128-f78c-458b-b1a0-79a4a4a94fda has an incorrectly configured driver_info/deploy_kernel. Expected "26fde597-058b-4992-8e7a-6f51fe4c275c" but got "7d7ad1c0-8b33-45b8-8906-2dd3050e1e8d".
    Node uuid=b49e6412-d3ae-45ef-8a9b-dd7470d784ae has an incorrectly configured driver_info/deploy_ramdisk. Expected "26fde597-058b-4992-8e7a-6f51fe4c275c" but got "c93d25d3-a08a-4247-ad9c-a4af3aa9f4b7".
    Node uuid=b49e6412-d3ae-45ef-8a9b-dd7470d784ae has an incorrectly configured driver_info/deploy_kernel. Expected "26fde597-058b-4992-8e7a-6f51fe4c275c" but got "7d7ad1c0-8b33-45b8-8906-2dd3050e1e8d".
    Configuration has 16 errors, fix them before proceeding. Ignoring these errors is likely to lead to a failed deploy.
    Deploying templates in the directory /usr/share/openstack-tripleo-heat-templates

5. Press Ctrl+C, then run:

    openstack baremetal configure boot

   Rerun the scale command:

    openstack overcloud deploy --templates --control-scale 3 --compute-scale 2 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server clock.redhat.com --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml --ceph-storage-scale 1
    Deploying templates in the directory /usr/share/openstack-tripleo-heat-templates
    ERROR: Stack overcloud already has an action (UPDATE) in progress.

   Eventually the scale attempt fails.

    [stack@instack ~]$ nova list
    +--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
    | ID                                   | Name                    | Status | Task State | Power State | Networks            |
    +--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
    | d6209516-cadf-40e1-939c-4d62c1383307 | overcloud-cephstorage-0 | ERROR  | -          | Running     | ctlplane=192.0.2.7  |
    | 11fcc012-99af-4789-88f3-0e05cd9e74bf | overcloud-compute-0     | ERROR  | -          | Running     | ctlplane=192.0.2.8  |
    | 96169856-2640-46b0-91ae-eb565467f3af | overcloud-compute-1     | BUILD  | spawning   | NOSTATE     | ctlplane=192.0.2.18 |
    | 5adeef96-0985-47ad-a353-36a89e598c09 | overcloud-controller-0  | ERROR  | -          | Running     | ctlplane=192.0.2.11 |
    | 2597a97b-47ff-421e-a37a-c3c1ee8d5057 | overcloud-controller-1  | ERROR  | -          | Running     | ctlplane=192.0.2.9  |
    | b867f183-31ab-4cc6-8afb-6fcbba6ae190 | overcloud-controller-2  | ERROR  | -          | Running     | ctlplane=192.0.2.10 |
    +--------------------------------------+-------------------------+--------+------------+-------------+---------------------+

6. Re-ran the scale command. It failed again, but now I have two overcloud-compute-1 instances:

    +--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
    | ID                                   | Name                    | Status | Task State | Power State | Networks            |
    +--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
    | d6209516-cadf-40e1-939c-4d62c1383307 | overcloud-cephstorage-0 | ERROR  | -          | Running     | ctlplane=192.0.2.7  |
    | 11fcc012-99af-4789-88f3-0e05cd9e74bf | overcloud-compute-0     | ERROR  | -          | Running     | ctlplane=192.0.2.8  |
    | c8a31539-bc47-4b62-924d-d037900337af | overcloud-compute-1     | ERROR  | -          | NOSTATE     |                     |
    | ee3f8271-d677-4b34-a6cd-06408572aa3d | overcloud-compute-1     | ERROR  | -          | NOSTATE     |                     |
    | 5adeef96-0985-47ad-a353-36a89e598c09 | overcloud-controller-0  | ERROR  | -          | Running     | ctlplane=192.0.2.11 |
    | 2597a97b-47ff-421e-a37a-c3c1ee8d5057 | overcloud-controller-1  | ERROR  | -          | Running     | ctlplane=192.0.2.9  |
    | b867f183-31ab-4cc6-8afb-6fcbba6ae190 | overcloud-controller-2  | ERROR  | -          | Running     | ctlplane=192.0.2.10 |
    +--------------------------------------+-------------------------+--------+------------+-------------+---------------------+

7. Re-ran the original deployment command. It failed:

    [stack@instack ~]$ nova list
    +--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
    | ID                                   | Name                    | Status | Task State | Power State | Networks            |
    +--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
    | d6209516-cadf-40e1-939c-4d62c1383307 | overcloud-cephstorage-0 | ERROR  | -          | Running     | ctlplane=192.0.2.7  |
    | 11fcc012-99af-4789-88f3-0e05cd9e74bf | overcloud-compute-0     | ERROR  | -          | Running     | ctlplane=192.0.2.8  |
    | 5adeef96-0985-47ad-a353-36a89e598c09 | overcloud-controller-0  | ERROR  | -          | Running     | ctlplane=192.0.2.11 |
    | 2597a97b-47ff-421e-a37a-c3c1ee8d5057 | overcloud-controller-1  | ERROR  | -          | Running     | ctlplane=192.0.2.9  |
    | b867f183-31ab-4cc6-8afb-6fcbba6ae190 | overcloud-controller-2  | ERROR  | -          | Running     | ctlplane=192.0.2.10 |
    +--------------------------------------+-------------------------+--------+------------+-------------+---------------------+

Expected result:
Don't start a deployment/scale with incorrectly configured details.

On further review, it looks like we are putting the cloud in a bad state and then trying to scale out. Should this get treatment similar to https://bugzilla.redhat.com/show_bug.cgi?id=1356777, i.e. can we document "make sure the cloud is in a reasonable state before trying scale, update, or upgrade type operations"?
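
One way to read "a reasonable state" is sketched below in Python. This is an assumption for illustration, not an existing director check: the helper name `safe_to_run_stack_operation` and its inputs are hypothetical. It treats a Heat stack status of `CREATE_COMPLETE` or `UPDATE_COMPLETE`, with no Nova instance in `ERROR`, as the precondition for a scale, update, or upgrade.

```python
# Hypothetical pre-flight check; the caller is assumed to have already
# fetched the stack status and the per-server statuses.
HEALTHY_STACK_STATUSES = {"CREATE_COMPLETE", "UPDATE_COMPLETE"}


def safe_to_run_stack_operation(stack_status, server_statuses):
    """Return (ok, problems) for a planned scale/update/upgrade.

    stack_status: Heat stack status string, e.g. "UPDATE_IN_PROGRESS".
    server_statuses: dict mapping server name to its Nova status.
    """
    problems = []
    if stack_status not in HEALTHY_STACK_STATUSES:
        problems.append("stack status is %s" % stack_status)
    for name, status in sorted(server_statuses.items()):
        if status == "ERROR":
            problems.append("server %s is in ERROR" % name)
    return (not problems, problems)
```

With the state shown in step 5 of the report (stack update still in progress, existing servers in ERROR), a check like this would refuse to proceed and list each reason.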
The concerning thing here is that forgetting to run configure boot can leave your cloud in an unrecoverable state (this is also an example of why validation errors should be fatal by default...). It _looks_ to me like this may have triggered a rebuild of all the existing nodes, based on the fact that the previously deployed instances have all gone to error state too (unless the initial deploy failed, in which case we are back to "make sure your cloud is in a consistent state", but it's not clear to me whether that's the case here). So I'm not sure we can call this a doc text-only bug, but it may very well be related to the node rebuild bug Brad is looking into and may be fixed when that one is.
Closing this out for 9 as it represents an unlikely case. This can be addressed via a new bug for 10 to handle such CLI interactions.
Note that I went ahead and pushed a patch upstream to make this sort of error fatal, so we won't mistakenly try to deploy when the nodes are in a bad state: https://review.openstack.org/349609 Hopefully that will at least help with similar situations in the future.
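
As a rough illustration of what "fatal" means here, the following is a minimal Python sketch and not the code from the upstream patch. The function names, the exception class, and the node dictionary layout are hypothetical. Only the `driver_info` keys `deploy_kernel` and `deploy_ramdisk` and the message wording are taken from the output in this report.

```python
class ConfigurationError(Exception):
    """Raised when pre-deploy validation finds at least one error."""


def check_boot_images(nodes, expected_kernel, expected_ramdisk):
    """Return one message per driver_info field that differs from the expected image ID."""
    expected = {
        "deploy_ramdisk": expected_ramdisk,
        "deploy_kernel": expected_kernel,
    }
    errors = []
    for node in nodes:
        driver_info = node.get("driver_info", {})
        for field, want in expected.items():
            got = driver_info.get(field)
            if got != want:
                errors.append(
                    'Node uuid=%s has an incorrectly configured driver_info/%s. '
                    'Expected "%s" but got "%s".' % (node["uuid"], field, want, got)
                )
    return errors


def validate_or_abort(nodes, expected_kernel, expected_ramdisk):
    """Stop before any stack update is started if validation found errors."""
    errors = check_boot_images(nodes, expected_kernel, expected_ramdisk)
    if errors:
        raise ConfigurationError(
            "Configuration has %d errors, fix them before proceeding.\n%s"
            % (len(errors), "\n".join(errors))
        )
```

The point of raising instead of printing is that the deploy never reaches the "Deploying templates" stage, so there is no in-progress stack update to interrupt with Ctrl+C.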