Bug 1571437 - [RFE] [Improvements] FFU: openstack overcloud ffwd-upgrade prepare fails with No valid host was found. No valid host found for resize (HTTP 400) and deletes overcloud nodes
Summary: [RFE] [Improvements] FFU: openstack overcloud ffwd-upgrade prepare fails with...
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Target Milestone: zstream
Target Release: 13.0 (Queens)
Assignee: Lukas Bezdicka
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On: 1576336
Blocks:
 
Reported: 2018-04-24 19:23 UTC by Marius Cornea
Modified: 2020-04-24 16:23 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1576336 (view as bug list)
Environment:
Last Closed: 2019-01-28 16:07:39 UTC
Target Upstream Version:



Description Marius Cornea 2018-04-24 19:23:47 UTC
Description of problem:
FFU: openstack overcloud ffwd-upgrade prepare fails with No valid host was found. No valid host found for resize (HTTP 400):

2018-04-24 19:00:56Z [overcloud-Controller-ypuzux5fed5s-0-queb23gl3ns3.Controller]: UPDATE_FAILED  BadRequest: resources.Controller: No valid host was found. No valid host found for resize (HTTP 400) (Request-ID: req-a1d89384-0c78-4c50-8d54-ffae9ce58355)
2018-04-24 19:00:56Z [overcloud-Controller-ypuzux5fed5s-0-queb23gl3ns3]: UPDATE_FAILED  Resource UPDATE failed: BadRequest: resources.Controller: No valid host was found. No valid host found for resize (HTTP 400) (Request-ID: req-a1d89384-0c78-4c50-8d54-ffae9ce58355)
2018-04-24 19:00:56Z [overcloud-Controller-ypuzux5fed5s.0]: UPDATE_FAILED  BadRequest: resources[0].resources.Controller: No valid host was found. No valid host found for resize (HTTP 400) (Request-ID: req-a1d89384-0c78-4c50-8d54-ffae9ce58355)
2018-04-24 19:00:56Z [overcloud-Controller-ypuzux5fed5s]: UPDATE_FAILED  Resource UPDATE failed: BadRequest: resources[0].resources.Controller: No valid host was found. No valid host found for resize (HTTP 400) (Request-ID: req-a1d89384-0c78-4c50-8d54-ffae9ce58355)
2018-04-24 19:00:57Z [overcloud-Compute-e2cgp6m6tc4q-0-hvprdpxiglhj.NovaCompute]: UPDATE_FAILED  BadRequest: resources.NovaCompute: No valid host was found. No valid host found for resize (HTTP 400) (Request-ID: req-3c0ec9db-d112-4cfe-ae6b-1b3a4c3c06c7)
2018-04-24 19:00:57Z [overcloud-Compute-e2cgp6m6tc4q-0-hvprdpxiglhj]: UPDATE_FAILED  Resource UPDATE failed: BadRequest: resources.NovaCompute: No valid host was found. No valid host found for resize (HTTP 400) (Request-ID: req-3c0ec9db-d112-4cfe-ae6b-1b3a4c3c06c7)
2018-04-24 19:00:57Z [Controller]: UPDATE_FAILED  BadRequest: resources.Controller.resources[0].resources.Controller: No valid host was found. No valid host found for resize (HTTP 400) (Request-ID: req-a1d89384-0c78-4c50-8d54-ffae9ce58355)
2018-04-24 19:00:57Z [overcloud]: UPDATE_FAILED  Resource UPDATE failed: BadRequest: resources.Controller.resources[0].resources.Controller: No valid host was found. No valid host found for resize (HTTP 400) (Request-ID: req-a1d89384-0c78-4c50-8d54-ffae9ce58355)
2018-04-24 19:00:57Z [overcloud-Compute-e2cgp6m6tc4q.0]: UPDATE_FAILED  BadRequest: resources[0].resources.NovaCompute: No valid host was found. No valid host found for resize (HTTP 400) (Request-ID: req-3c0ec9db-d112-4cfe-ae6b-1b3a4c3c06c7)
2018-04-24 19:00:57Z [overcloud-Compute-e2cgp6m6tc4q]: UPDATE_FAILED  Resource UPDATE failed: BadRequest: resources[0].resources.NovaCompute: No valid host was found. No valid host found for resize (HTTP 400) (Request-ID: req-3c0ec9db-d112-4cfe-ae6b-1b3a4c3c06c7)

 Stack overcloud UPDATE_FAILED 

overcloud.Controller.0.Controller:
  resource_type: OS::TripleO::ControllerServer
  physical_resource_id: 7baec254-6820-4986-a17e-482f4d1b54f8
  status: UPDATE_FAILED
  status_reason: |
    BadRequest: resources.Controller: No valid host was found. No valid host found for resize (HTTP 400) (Request-ID: req-a1d89384-0c78-4c50-8d54-ffae9ce58355)
overcloud.Compute.0.NovaCompute:
  resource_type: OS::TripleO::ComputeServer
  physical_resource_id: e14d156f-69ed-403f-8b31-3901d3f1eabf
  status: UPDATE_FAILED
  status_reason: |
    BadRequest: resources.NovaCompute: No valid host was found. No valid host found for resize (HTTP 400) (Request-ID: req-3c0ec9db-d112-4cfe-ae6b-1b3a4c3c06c7)
Heat Stack update failed.
Heat Stack update failed.


Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-8.0.2-0.20180416194362.29a5ad5.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10 with 3 controller + 2 computes + 3 ceph nodes:

openstack overcloud deploy \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack overcloud \
--libvirt-type kvm \
--ntp-server clock.redhat.com \
--control-scale 3 \
--control-flavor controller \
--compute-scale 2 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
-e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
-e /home/stack/virt/internal.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/debug.yaml

2. Upgrade undercloud to OSP11/12/13

3. Run fast forward prepare step:

openstack overcloud ffwd-upgrade prepare --templates --stack overcloud \
--container-registry-file docker-images.yaml \
--roles-file /usr/share/openstack-tripleo-heat-templates/roles_data.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
-e /home/stack/virt/internal.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/debug.yaml \
-e ffu_repos.yaml


Actual results:
Heat stack update fails trying to resize the hosts.

Expected results:
The initial flavors are preserved and heat doesn't try to resize the hosts to allow the ffu prepare stage to complete.

Additional info:

Comment 1 Marius Cornea 2018-04-24 20:04:51 UTC
Moreover, the command not only tried to resize the existing instances but also *deleted* overcloud nodes, leaving only 1 controller + 1 compute deployed. I suspect this is because overcloud ffwd-upgrade prepare requires all the options passed to the initial overcloud deploy command to be present (not only the environment files passed via -e but also --control-scale, --compute-scale, etc.).

Nevertheless, I think we need a mechanism to avoid this situation; we cannot afford to allow nodes to be deleted while running upgrade prepare commands.

Before openstack overcloud ffwd-upgrade prepare:

(undercloud) [stack@undercloud-0 ~]$ ironic node-list
The "ironic" CLI is deprecated and will be removed in the S* release. Please use the "openstack baremetal" CLI instead.
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name         | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| bd6f659b-08d1-40f2-b124-8cccd025fcd8 | ceph-0       | ced72b29-34d4-4d35-8870-b975244c650c | power on    | active             | False       |
| 764b734d-b5d3-4d76-bdcf-f23cfe27d112 | ceph-1       | 0ccba50e-b156-4908-9516-46e7914ac10a | power on    | active             | False       |
| d09f4447-4e1c-424e-a889-731df67c3afd | ceph-2       | f09a052b-10ac-4e51-b2d0-c378de7b43ea | power on    | active             | False       |
| b1793f03-3fb3-49b5-823c-ead630b9e3b5 | compute-0    | e14d156f-69ed-403f-8b31-3901d3f1eabf | power on    | active             | False       |
| de0d2899-28e8-4486-bd46-472841f31415 | compute-1    | 2509ce08-2557-4459-b14b-2ffb44683076 | power on    | active             | False       |
| b56659b5-a3de-41fb-8ed5-a90dfdfec8f0 | controller-0 | 38833614-14ee-4f2a-9a69-adf4cc1b4ca4 | power on    | active             | False       |
| 4302b8cd-906d-4542-8159-403d377dfa7e | controller-1 | 7baec254-6820-4986-a17e-482f4d1b54f8 | power on    | active             | False       |
| b206d65f-c9cb-4801-84d1-977406c77f14 | controller-2 | 394368a1-39ec-457e-925d-0c1ea78f24a6 | power on    | active             | False       |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
(undercloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks               |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| 0ccba50e-b156-4908-9516-46e7914ac10a | ceph-0       | ACTIVE | -          | Running     | ctlplane=192.168.24.23 |
| ced72b29-34d4-4d35-8870-b975244c650c | ceph-1       | ACTIVE | -          | Running     | ctlplane=192.168.24.10 |
| f09a052b-10ac-4e51-b2d0-c378de7b43ea | ceph-2       | ACTIVE | -          | Running     | ctlplane=192.168.24.19 |
| e14d156f-69ed-403f-8b31-3901d3f1eabf | compute-0    | ACTIVE | -          | Running     | ctlplane=192.168.24.6  |
| 2509ce08-2557-4459-b14b-2ffb44683076 | compute-1    | ACTIVE | -          | Running     | ctlplane=192.168.24.13 |
| 7baec254-6820-4986-a17e-482f4d1b54f8 | controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.18 |
| 394368a1-39ec-457e-925d-0c1ea78f24a6 | controller-1 | ACTIVE | -          | Running     | ctlplane=192.168.24.16 |
| 38833614-14ee-4f2a-9a69-adf4cc1b4ca4 | controller-2 | ACTIVE | -          | Running     | ctlplane=192.168.24.11 |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+


After openstack overcloud ffwd-upgrade prepare:

(undercloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks               |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| e14d156f-69ed-403f-8b31-3901d3f1eabf | compute-0    | ACTIVE | -          | Running     | ctlplane=192.168.24.6  |
| 7baec254-6820-4986-a17e-482f4d1b54f8 | controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.18 |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
(undercloud) [stack@undercloud-0 ~]$ ironic node-list
The "ironic" CLI is deprecated and will be removed in the S* release. Please use the "openstack baremetal" CLI instead.
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name         | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| bd6f659b-08d1-40f2-b124-8cccd025fcd8 | ceph-0       | None                                 | power off   | available          | False       |
| 764b734d-b5d3-4d76-bdcf-f23cfe27d112 | ceph-1       | None                                 | power off   | available          | False       |
| d09f4447-4e1c-424e-a889-731df67c3afd | ceph-2       | None                                 | power off   | available          | False       |
| b1793f03-3fb3-49b5-823c-ead630b9e3b5 | compute-0    | e14d156f-69ed-403f-8b31-3901d3f1eabf | power on    | active             | False       |
| de0d2899-28e8-4486-bd46-472841f31415 | compute-1    | None                                 | power off   | available          | False       |
| b56659b5-a3de-41fb-8ed5-a90dfdfec8f0 | controller-0 | None                                 | power off   | available          | False       |
| 4302b8cd-906d-4542-8159-403d377dfa7e | controller-1 | 7baec254-6820-4986-a17e-482f4d1b54f8 | power on    | active             | False       |
| b206d65f-c9cb-4801-84d1-977406c77f14 | controller-2 | None                                 | power off   | available          | False       |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+

Comment 2 Marios Andreou 2018-04-25 06:36:54 UTC
o/ Marius - not sure what's happening there yet. I just checked/confirmed that the ffwd-upgrade and upgrade prepare commands are pretty identical in how we implemented them [1][2] so if we *do* for some reason need to pass in all the --opts then it would also apply to the major upgrade cli but afaik we aren't hitting it. 

When was the last time you successfully ran the ffwd-upgrade prepare? Alternatively did you confirm if passing in all the --opts makes this go away?


[1] https://github.com/openstack/python-tripleoclient/blob/5c7c923a01d4f8b460fc9481b7c38454cde10f5f/tripleoclient/v1/overcloud_ffwd_upgrade.py#L25-L84
[2] https://github.com/openstack/python-tripleoclient/blob/5c7c923a01d4f8b460fc9481b7c38454cde10f5f/tripleoclient/v1/overcloud_upgrade.py#L27-L83

Comment 3 Marius Cornea 2018-04-25 13:50:27 UTC
(In reply to Marios Andreou from comment #2)
> o/ Marius - not sure what's happening there yet. I just checked/confirmed
> that the ffwd-upgrade and upgrade prepare commands are pretty identical in
> how we implemented them [1][2] so if we *do* for some reason need to pass in
> all the --opts then it would also apply to the major upgrade cli but afaik
> we aren't hitting it. 
> 
> When was the last time you successfully ran the ffwd-upgrade prepare?
> Alternatively did you confirm if passing in all the --opts makes this go
> away?
> 

This was my first time trying to use the OpenStack CLI to run the fast forward upgrade. Note that openstack overcloud ffwd-upgrade prepare passed when I supplied all the --opts, but the same thing happened later (the nodes were deleted) while running openstack overcloud ffwd-upgrade run, where passing --opts is not an option.

Comment 4 Carlos Camacho 2018-04-30 10:43:57 UTC
Hey Marius I had the same issue a few weeks ago, easy to workaround, just add:

--control-flavor controller \
--compute-flavor compute \
--ceph-storage-flavor ceph \

To the upgrade prepare command and it should work.

Comment 5 Jose Luis Franco 2018-04-30 14:46:23 UTC
@Jirka, this BZ has been assigned to you during the triage duty call. Please feel free to reassign.

Comment 6 Jiri Stransky 2018-05-04 16:08:58 UTC
Yes we'll need to keep providing all args to the prepare/converge commands, not just -e files. They're essentially `overcloud deploy` commands with additional env files prepended to the list.

Re deleting nodes when running `openstack overcloud ffwd-upgrade run` -- did you manage to reproduce that? I can't imagine how could that happen because that command just runs Ansible, it doesn't do any operation with the overcloud Heat stack.

If we don't reproduce this ^ then i think we should transform this BZ into a docs check that we document passing all `deploy` args to the `prepare` and `converge` commands.
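
To illustrate the point above (every option from the original deploy command has to be carried over, not just the -e files), here is a minimal Python sketch that lists the long options present in a deploy command but absent from a prepare command. The helper names `option_names` and `missing_options` are hypothetical and are not part of python-tripleoclient. The sketch only compares flag names; it does not know which options each subcommand actually accepts, so treat its output as a hint to review, not a verdict.

```python
def option_names(argv):
    """Return the set of long option flags in argv, e.g. '--control-scale'.

    Handles both '--flag value' and '--flag=value'. Short options such as
    '-e' are ignored on purpose: this only looks at the non -e arguments.
    """
    return {arg.split("=", 1)[0] for arg in argv if arg.startswith("--")}


def missing_options(deploy_argv, prepare_argv):
    """List long options used at deploy time that the prepare command lacks."""
    return sorted(option_names(deploy_argv) - option_names(prepare_argv))


if __name__ == "__main__":
    # Abbreviated versions of the two commands from the description.
    deploy = [
        "openstack", "overcloud", "deploy", "--templates",
        "--control-scale", "3", "--control-flavor", "controller",
        "--compute-scale", "2", "--compute-flavor", "compute",
        "-e", "/home/stack/virt/internal.yaml",
    ]
    prepare = [
        "openstack", "overcloud", "ffwd-upgrade", "prepare", "--templates",
        "-e", "/home/stack/virt/internal.yaml", "-e", "ffu_repos.yaml",
    ]
    for flag in missing_options(deploy, prepare):
        print("not carried over:", flag)
```

The arguments are passed as lists rather than one command string so that the backslash line continuations used in the commands above do not have to be parsed.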

Comment 7 Marius Cornea 2018-05-04 16:15:21 UTC
(In reply to Jiri Stransky from comment #6)
> Yes we'll need to keep providing all args to the prepare/converge commands,
> not just -e files. They're essentially `overcloud deploy` commands with
> additional env files prepended to the list.
> 
> Re deleting nodes when running `openstack overcloud ffwd-upgrade run` -- did
> you manage to reproduce that? I can't imagine how could that happen because
> that command just runs Ansible, it doesn't do any operation with the
> overcloud Heat stack.

I haven't tried that specifically lately. 

> If we don't reproduce this ^ then i think we should transform this BZ into a
> docs check that we document passing all `deploy` args to the `prepare` and
> `converge` commands.

My thought was that, given these options were deprecated in Ocata, we should recommend converting them into parameters and passing them inside an environment file so they can keep being used for future overcloud deploy/upgrade commands.

The documentation could describe something along the lines of https://review.openstack.org/#/c/564836/30/tasks/common/convert_cli_opts_params.yaml
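
A minimal sketch of that conversion, assuming the option-to-parameter mapping in `CLI_TO_PARAM` below. The mapping is my understanding of how the deprecated options translate to Heat parameters and should be verified against the tripleoclient and tripleo-heat-templates versions in use. In particular, the controller flavor parameter name may depend on whether roles_data.yaml keeps its deprecated_ entries. The function name `render_environment` is hypothetical.

```python
import json

# ASSUMPTION: deprecated CLI option -> Heat parameter name. Verify these
# against your release before relying on them.
CLI_TO_PARAM = {
    "--control-scale": "ControllerCount",
    "--compute-scale": "ComputeCount",
    "--ceph-storage-scale": "CephStorageCount",
    "--control-flavor": "OvercloudControlFlavor",
    "--compute-flavor": "OvercloudComputeFlavor",
    "--ceph-storage-flavor": "OvercloudCephStorageFlavor",
    "--ntp-server": "NtpServer",
}


def render_environment(cli_opts):
    """Render known CLI options as the text of a Heat environment file.

    cli_opts maps option flags to string values. Options that are not in
    CLI_TO_PARAM are silently skipped. Numeric values are written as
    integers; everything else is written as a double-quoted string.
    """
    lines = []
    for option, param in CLI_TO_PARAM.items():
        if option in cli_opts:
            value = cli_opts[option]
            rendered = value if value.isdigit() else json.dumps(value)
            lines.append(f"  {param}: {rendered}")
    if not lines:
        return "parameter_defaults: {}\n"
    return "parameter_defaults:\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Values taken from the deploy command in the description.
    print(render_environment({
        "--control-scale": "3", "--control-flavor": "controller",
        "--compute-scale": "2", "--compute-flavor": "compute",
        "--ceph-storage-scale": "3", "--ceph-storage-flavor": "ceph",
        "--ntp-server": "clock.redhat.com",
    }), end="")
```

The resulting text could be saved to a file and passed with -e to every later deploy, prepare and converge command, so the counts and flavors no longer depend on remembering the CLI options.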

Comment 8 Marios Andreou 2018-05-07 12:57:24 UTC
can you please triage this (from triage call round robin)

Comment 11 Jiri Stransky 2018-05-09 08:51:25 UTC
Yup i wanted to leave it untriaged until we get more info whether this is reproducible. But AFAICT this wasn't reproduced, so i'll drop the blocker flag and mark triaged, as i don't think this needs a code fix.

If we hit this again, we can put back the blocker flag and bump up the priority.

I reported a docs BZ here to document carrying over all params (not just -e files) and deal with the deprecation of some of them:

https://bugzilla.redhat.com/show_bug.cgi?id=1576336

Comment 12 Lukas Bezdicka 2018-05-30 15:03:37 UTC
Can we create a safety check that compares the number of supplied nodes with the number of existing nodes and fails if they differ?
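
A rough sketch of what such a check could look like, in Python. It is not an existing validation. The function names are hypothetical, and the role is inferred from the server name by dropping the trailing "-<index>", which only works for hostnames shaped like the ones in comment 1. How the requested counts would be obtained from the stack parameters is left out.

```python
from collections import Counter


def count_by_role(server_names):
    """Count servers per role, where the role is the name minus '-<index>'.

    ASSUMPTION: names look like 'controller-0' or 'ceph-2', as in comment 1.
    """
    return Counter(name.rsplit("-", 1)[0] for name in server_names)


def count_mismatches(existing_names, requested_counts):
    """Return one message per role whose deployed and requested counts differ."""
    existing = count_by_role(existing_names)
    problems = []
    for role in sorted(set(existing) | set(requested_counts)):
        have = existing.get(role, 0)
        want = requested_counts.get(role, 0)
        if have != want:
            problems.append(f"{role}: {have} deployed, {want} requested")
    return problems


if __name__ == "__main__":
    # Server names from the 'nova list' output before the prepare step.
    deployed = [
        "ceph-0", "ceph-1", "ceph-2", "compute-0", "compute-1",
        "controller-0", "controller-1", "controller-2",
    ]
    # The counts the stack ended up with when the scale options were omitted.
    requested = {"controller": 1, "compute": 1}
    problems = count_mismatches(deployed, requested)
    for line in problems:
        print(line)
    if problems:
        raise SystemExit("refusing to continue: node counts would change")
```

A real check would also need an explicit override for intentional scale-up or scale-down, since a count difference is not always a mistake.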

Comment 14 Carlos Camacho 2018-09-07 14:07:48 UTC
We are moving this to the validations DFG as this might be a really nice validation to have.

Comment 15 Rods 2018-12-27 12:24:18 UTC
(In reply to Carlos Camacho from comment #4)
> Hey Marius I had the same issue a few weeks ago, easy to workaround, just
> add:
> 
> --control-flavor controller \
> --compute-flavor compute \
> --ceph-storage-flavor ceph \
> 
> To the upgrade prepare command and it should work.

I tried this, does not seem to work. Is this issue affected by specific node IDs?

Comment 17 Lukas Bezdicka 2019-01-28 16:07:39 UTC
Most cases that hit a similar issue have a bad roles_data.yaml: either they use a deprecated_* option where they should not, or they don't use one where it is needed. In any case, we have documentation covering it, and it's not something we can easily address with validations :(


If you hit this issue:

Verify that you have specified the number of compute/control nodes in your templates, or that you pass --<role>-scale in the prepare command. Make sure your roles_data.yaml uses the proper options, with or without the deprecated params as your templates require. The ideal solution is not to use deprecated_ in the templates at all and to update the templates for the new parameter names.
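
As a starting point for reviewing roles_data.yaml, the following Python sketch lists the deprecated_ keys found under each role. It is a plain text scan, not a YAML parser, and it assumes each role starts with "- name: <Role>" in the first column and that the deprecated_ keys are indented below it. The function name `deprecated_keys_by_role` is hypothetical. It reports what is present; deciding whether each key should stay or go still depends on the parameter names your environment files use.

```python
import re

# ASSUMPTION: roles begin with '- name: X' at column 0 and their keys are
# indented beneath, as in the stock roles_data.yaml layout.
ROLE_RE = re.compile(r"^- name:\s*(\S+)")
DEPRECATED_RE = re.compile(r"^\s+(deprecated_\w+):")


def deprecated_keys_by_role(roles_data_text):
    """Map each role name to the deprecated_* keys listed under it."""
    found = {}
    role = None
    for line in roles_data_text.splitlines():
        match = ROLE_RE.match(line)
        if match:
            role = match.group(1)
            continue
        match = DEPRECATED_RE.match(line)
        if match and role is not None:
            found.setdefault(role, []).append(match.group(1))
    return found


if __name__ == "__main__":
    path = "/usr/share/openstack-tripleo-heat-templates/roles_data.yaml"
    with open(path) as handle:
        for role, keys in deprecated_keys_by_role(handle.read()).items():
            print(role, "->", ", ".join(keys))
```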

