Bug 1511874 - OSP11 -> OSP12 upgrade: unable to scale out compute nodes post upgrade
Summary: OSP11 -> OSP12 upgrade: unable to scale out compute nodes post upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: instack-undercloud
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 12.0 (Pike)
Assignee: Dmitry Tantsur
QA Contact: Marius Cornea
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-11-10 10:45 UTC by Marius Cornea
Modified: 2018-02-05 19:15 UTC
CC List: 10 users

Fixed In Version: instack-undercloud-7.4.3-4.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-13 22:20:31 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1731885 0 None None None 2017-11-13 11:15:53 UTC
OpenStack gerrit 519312 0 None MERGED Fix fetching ironic nodes for updating resource classes on upgrade 2020-03-12 11:21:32 UTC
Red Hat Product Errata RHEA-2017:3462 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 12.0 Enhancement Advisory 2018-02-16 01:43:25 UTC

Description Marius Cornea 2017-11-10 10:45:19 UTC
Description of problem:
OSP11 -> OSP12 upgrade: unable to scale out compute nodes post upgrade. Re-running the deploy command with an additional compute node fails with:

2017-11-10 10:30:35Z [overcloud]: UPDATE_FAILED  resources.Compute: ResourceInError: resources[2].resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. , Code: 500"
 

Version-Release number of selected component (if applicable):
2017-11-09.2 build

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP11 with 3 controllers, 2 computes, 3 ceph nodes
2. Upgrade to OSP12
3. Remove one compute node from the deployment:
openstack overcloud node delete --stack overcloud efd8563d-7619-40f9-ac4f-67cf7b6798a1
4. Wait for stack to get UPDATE_COMPLETE
5. Re-run the openstack overcloud deploy command with ComputeCount: 2 to reprovision the deleted compute node (see the example after this list)
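
For example, the scale-out re-run could look like the following. This is only a sketch: the environment file name (compute-count.yaml) is made up here, and the deploy command must include the same environment files that were used for the original deployment.

  # ~/compute-count.yaml (file name is arbitrary)
  parameter_defaults:
    ComputeCount: 2

  # re-run the deploy with the original set of -e environment files,
  # plus the count override above:
  openstack overcloud deploy --templates \
    -e ~/compute-count.yaml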

Actual results:
The deploy command fails with:

2017-11-10 10:30:35Z [overcloud]: UPDATE_FAILED  resources.Compute: ResourceInError: resources[2].resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. , Code: 500"

 Stack overcloud UPDATE_FAILED 

overcloud.Compute.2.NovaCompute:
  resource_type: OS::TripleO::ComputeServer
  physical_resource_id: 492f864f-76bf-4acf-9f89-8148b4ed427b
  status: CREATE_FAILED
  status_reason: |
    ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. , Code: 500"
Heat Stack update failed.
Heat Stack update failed.

Expected results:
The deploy command completes successfully and the additional compute node is provisioned.

Additional info:
Attaching the sosreport taken on the undercloud.

Comment 2 Marius Cornea 2017-11-10 11:00:38 UTC
(undercloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks               |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| eadafa81-0ce3-48ef-9101-ae80e3509e71 | ceph-0       | ACTIVE | -          | Running     | ctlplane=192.168.24.11 |
| 8fad8238-c463-4807-992b-19a0bdfe840f | ceph-1       | ACTIVE | -          | Running     | ctlplane=192.168.24.12 |
| 88826ab3-fd49-4866-9f18-daa3be19bcd1 | ceph-2       | ACTIVE | -          | Running     | ctlplane=192.168.24.10 |
| 2e145e34-c57e-4a75-a59b-1c19bd58f289 | compute-1    | ACTIVE | -          | Running     | ctlplane=192.168.24.9  |
| 492f864f-76bf-4acf-9f89-8148b4ed427b | compute-2    | ERROR  | -          | NOSTATE     |                        |
| 61a4692f-8acc-418b-a3da-3e5294b58d37 | controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.19 |
| b230be0b-1699-4078-995d-a6a1ca6e1cb3 | controller-1 | ACTIVE | -          | Running     | ctlplane=192.168.24.13 |
| ecfef989-f2b9-4f42-8f73-bbd3c2c3ce47 | controller-2 | ACTIVE | -          | Running     | ctlplane=192.168.24.7  |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
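
As a side note (a suggested diagnostic, not part of the original report), the scheduling fault for the errored instance can also be read directly from the instance:

  openstack server show 492f864f-76bf-4acf-9f89-8148b4ed427b -c fault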

Checking the nova logs for the failed instance UUID, we can see the following in /var/log/nova/nova-scheduler.log:

2017-11-10 05:30:02.529 1348 DEBUG nova.scheduler.manager [req-6cb9920f-7705-43c9-ad06-42be84e6bf9c a1f3cd9117df43c8ad2a236b6f70e801 d6b72ece1f95470b817ea14f96205691 - default default] Starting to schedule for instances: [u'492f864f-76bf-4acf-9f89-8148b4ed427b'] select_destinations /usr/lib/python2.7/site-packages/nova/scheduler/manager.py:113
2017-11-10 05:30:02.550 1348 DEBUG nova.scheduler.manager [req-6cb9920f-7705-43c9-ad06-42be84e6bf9c a1f3cd9117df43c8ad2a236b6f70e801 d6b72ece1f95470b817ea14f96205691 - default default] Got no allocation candidates from the Placement API. This may be a temporary occurrence as compute nodes start up and begin reporting inventory to the Placement service. select_destinations /usr/lib/python2.7/site-packages/nova/scheduler/manager.py:133
2017-11-10 05:30:33.083 1348 DEBUG oslo_concurrency.lockutils [req-d6621942-d42d-4826-bbbd-f3197a374167 - - - - -] Lock "host_instance" acquired by "nova.scheduler.host_manager.sync_instance_info" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:270

In /var/log/nova/nova-conductor.log:

2017-11-10 05:29:02.934 3033 ERROR nova.conductor.manager [req-dd94e11d-a69b-4d29-8ab3-667325074865 a1f3cd9117df43c8ad2a236b6f70e801 d6b72ece1f95470b817ea14f96205691 - default default] Failed to schedule instances: NoValidHost_Remote: No valid host was found.
Traceback (most recent call last):

  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 232, in inner
    return func(*args, **kwargs)

  File "/usr/lib/python2.7/site-packages/nova/scheduler/manager.py", line 137, in select_destinations
    raise exception.NoValidHost(reason="")

NoValidHost: No valid host was found.
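
"Got no allocation candidates from the Placement API" here usually means that no Ironic node exposes the custom resource class that the baremetal flavor requests. A quick way to check that (a suggested diagnostic, not part of the original report):

  openstack baremetal node list --fields uuid name provision_state resource_class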

Comment 3 Ollie Walsh 2017-11-10 13:07:45 UTC
The resource class was only set on one of the Ironic nodes during the upgrade.

/home/stack/undercloud_upgrade.log:

2017-11-09 06:42:05,991 INFO: [2017-11-09 06:42:05,991] (os-refresh-config) [INFO] Completed phase post-configure
2017-11-09 06:42:06,000 INFO: os-refresh-config completed successfully
2017-11-09 06:42:07,623 INFO: Node f99ca41a-9daf-4927-8458-e937de3c93e3 resource class was set to baremetal
2017-11-09 06:42:07,662 INFO: Not creating flavor "baremetal" because it already exists.
2017-11-09 06:42:07,758 INFO: Flavor baremetal updated to use custom resource class baremetal
2017-11-09 06:42:07,876 INFO: Created flavor "control" with profile "control"
2017-11-09 06:42:07,876 INFO: Not creating flavor "compute" because it already exists.
2017-11-09 06:42:07,950 INFO: Flavor compute updated to use custom resource class baremetal
2017-11-09 06:42:08,046 INFO: Created flavor "ceph-storage" with profile "ceph-storage"
2017-11-09 06:42:08,137 INFO: Created flavor "block-storage" with profile "block-storage"
2017-11-09 06:42:08,228 INFO: Created flavor "swift-storage" with profile "swift-storage"
2017-11-09 06:42:08,236 INFO: Configuring Mistral workbooks
2017-11-09 06:42:34,598 INFO: Mistral workbooks configured successfully
2017-11-09 06:42:35,099 INFO: Migrating environment for plan overcloud to Swift.
2017-11-09 06:42:35,212 INFO: Not creating default plan "overcloud" because it already exists.
2017-11-09 06:42:35,212 INFO: Configuring an hourly cron trigger for tripleo-ui logging
2017-11-09 06:42:37,703 INFO: Added _member_ role to admin user
2017-11-09 06:42:37,986 INFO: Starting and waiting for validation groups ['post-upgrade']

The limit should be 0 here: https://review.openstack.org/#/c/490851/9/instack_undercloud/undercloud.py@1414

Comment 4 Ollie Walsh 2017-11-10 13:09:50 UTC
with limit==-1 [<Node {u'uuid': u'f99ca41a-9daf-4927-8458-e937de3c93e3', u'links': [{u'href': u'http://192.168.24.1:6385/v1/nodes/f99ca41a-9daf-4927-8458-e937de3c93e3', u'rel': u'self'}, {u'href': u'http://192.168.24.1:6385/nodes/f99ca41a-9daf-4927-8458-e937de3c93e3', u'rel': u'bookmark'}], u'resource_class': u'baremetal'}>]

with limit==0 [<Node {u'uuid': u'f99ca41a-9daf-4927-8458-e937de3c93e3', u'links': [{u'href': u'http://192.168.24.1:6385/v1/nodes/f99ca41a-9daf-4927-8458-e937de3c93e3', u'rel': u'self'}, {u'href': u'http://192.168.24.1:6385/nodes/f99ca41a-9daf-4927-8458-e937de3c93e3', u'rel': u'bookmark'}], u'resource_class': u'baremetal'}>, <Node {u'uuid': u'4ebf6ff1-3f3a-447f-b5c2-ec9c04ced8ce', u'links': [{u'href': u'http://192.168.24.1:6385/v1/nodes/4ebf6ff1-3f3a-447f-b5c2-ec9c04ced8ce', u'rel': u'self'}, {u'href': u'http://192.168.24.1:6385/nodes/4ebf6ff1-3f3a-447f-b5c2-ec9c04ced8ce', u'rel': u'bookmark'}], u'resource_class': None}>, <Node {u'uuid': u'a6c3c3fb-0ff2-46dc-a02b-6d6ffe9d74b2', u'links': [{u'href': u'http://192.168.24.1:6385/v1/nodes/a6c3c3fb-0ff2-46dc-a02b-6d6ffe9d74b2', u'rel': u'self'}, {u'href': u'http://192.168.24.1:6385/nodes/a6c3c3fb-0ff2-46dc-a02b-6d6ffe9d74b2', u'rel': u'bookmark'}], u'resource_class': None}>, <Node {u'uuid': u'f5dd8219-6b8f-4a39-8a96-6330689d54e2', u'links': [{u'href': u'http://192.168.24.1:6385/v1/nodes/f5dd8219-6b8f-4a39-8a96-6330689d54e2', u'rel': u'self'}, {u'href': u'http://192.168.24.1:6385/nodes/f5dd8219-6b8f-4a39-8a96-6330689d54e2', u'rel': u'bookmark'}], u'resource_class': None}>, <Node {u'uuid': u'046cb1f3-5d50-4be8-80c2-1d4ccc58487a', u'links': [{u'href': u'http://192.168.24.1:6385/v1/nodes/046cb1f3-5d50-4be8-80c2-1d4ccc58487a', u'rel': u'self'}, {u'href': u'http://192.168.24.1:6385/nodes/046cb1f3-5d50-4be8-80c2-1d4ccc58487a', u'rel': u'bookmark'}], u'resource_class': None}>, <Node {u'uuid': u'782bdc4f-af01-47c4-ac02-d73276d7ab77', u'links': [{u'href': u'http://192.168.24.1:6385/v1/nodes/782bdc4f-af01-47c4-ac02-d73276d7ab77', u'rel': u'self'}, {u'href': u'http://192.168.24.1:6385/nodes/782bdc4f-af01-47c4-ac02-d73276d7ab77', u'rel': u'bookmark'}], u'resource_class': None}>, <Node {u'uuid': u'c7c26891-88d1-498f-a84e-c15886ec3198', u'links': [{u'href': u'http://192.168.24.1:6385/v1/nodes/c7c26891-88d1-498f-a84e-c15886ec3198', u'rel': u'self'}, {u'href': u'http://192.168.24.1:6385/nodes/c7c26891-88d1-498f-a84e-c15886ec3198', u'rel': u'bookmark'}], u'resource_class': None}>, <Node {u'uuid': u'81f8dd71-e0c6-4be7-b20f-47871c61a2a9', u'links': [{u'href': u'http://192.168.24.1:6385/v1/nodes/81f8dd71-e0c6-4be7-b20f-47871c61a2a9', u'rel': u'self'}, {u'href': u'http://192.168.24.1:6385/nodes/81f8dd71-e0c6-4be7-b20f-47871c61a2a9', u'rel': u'bookmark'}], u'resource_class': None}>]
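
For context, a minimal sketch of what the upgrade-time resource class migration needs to do. This assumes the python-ironicclient API; the function name below is illustrative and not the actual instack-undercloud code (the real fix is the gerrit change 519312 linked above):

  # Illustrative sketch only: list *all* Ironic nodes (limit=0 disables the
  # client-side page limit, which is the crux of this bug) and set the
  # "baremetal" resource class on any node that does not have one yet.
  from ironicclient import client as ir_client

  def ensure_resource_classes(session):
      ironic = ir_client.get_client(1, session=session)
      nodes = ironic.node.list(limit=0, fields=['uuid', 'resource_class'])
      for node in nodes:
          if not node.resource_class:
              ironic.node.update(
                  node.uuid,
                  [{'op': 'add', 'path': '/resource_class',
                    'value': 'baremetal'}])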

Comment 5 Dmitry Tantsur 2017-11-13 11:10:52 UTC
Thanks for triaging, I can take care of it.

Comment 6 Dmitry Tantsur 2017-11-13 11:33:06 UTC
Correction: stable/pike patch is https://review.openstack.org/519312

Comment 7 Bob Fournier 2017-11-22 14:47:13 UTC
Merged downstream - https://code.engineering.redhat.com/gerrit/#/c/123953/

Comment 11 errata-xmlrpc 2017-12-13 22:20:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462

