Description of problem:

After scaling in the overcloud from 2 nodes to 1, I tried scaling it back out (with the same node that I removed during the scale-in) and the deployment failed with a "not enough hosts available" error:

{'code': 500, 'created': '2021-05-10T11:58:54Z', 'message': 'No valid host was found. There are not enough hosts available.', 'details': ...}

The details traceback:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/nova/conductor/manager.py", line 1379, in schedule_and_build_instances
    instance_uuids, return_alternates=True)
  File "/usr/lib/python3.6/site-packages/nova/conductor/manager.py", line 839, in _schedule_instances
    return_alternates=return_alternates)
  File "/usr/lib/python3.6/site-packages/nova/scheduler/client/query.py", line 42, in select_destinations
    instance_uuids, return_objects, return_alternates)
  File "/usr/lib/python3.6/site-packages/nova/scheduler/rpcapi.py", line 160, in select_destinations
    return cctxt.call(ctxt, 'select_destinations', **msg_args)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call
    transport_options=self.transport_options)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send
    transport_options=transport_options)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 674, in send
    transport_options=transport_options)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 664, in _send
    raise result
nova.exception_Remote.NoValidHost_Remote: No valid host was found. There are not enough hosts available.

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 235, in inner
    return func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/nova/scheduler/manager.py", line 214, in select_destinations
    allocation_request_version, return_alternates)
  File "/usr/lib/python3.6/site-packages/nova/scheduler/filter_scheduler.py", line 96, in select_destinations
    allocation_request_version, return_alternates)
  File "/usr/lib/python3.6/site-packages/nova/scheduler/filter_scheduler.py", line 265, in _schedule
    claimed_instance_uuids)
  File "/usr/lib/python3.6/site-packages/nova/scheduler/filter_scheduler.py", line 302, in _ensure_sufficient_hosts
    raise exception.NoValidHost(reason=reason)
nova.exception.NoValidHost: No valid host was found. There are not enough hosts available.

Version-Release number of selected component (if applicable): 16.2

How reproducible: 100%

Steps to Reproduce:
1. Deploy an overcloud with 2 nodes.
2. Remove an overcloud node with the following steps (from https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/director_installation_and_usage/scaling-overcloud-nodes):
   a. source /home/stack/overcloudrc
   b. openstack compute service set <compute> --disable
   c. source /home/stack/stackrc
   d. openstack overcloud node delete --stack overcloud -y <compute>
   e. source /home/stack/overcloudrc
   f. openstack compute service delete <compute>
   g. for AGENT in $(openstack network agent list --host <compute> -c ID -f value) ; do openstack network agent delete $AGENT ; done
   h. openstack resource provider delete <compute uuid>
3. Try to scale out by running the overcloud deploy command again (per the steps in https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/director_installation_and_usage/scaling-overcloud-nodes); the deployment fails with the error above.

Actual results:
Deployment fails with "No valid host was found. There are not enough hosts available."

Expected results:
The overcloud scales out successfully.

Additional info:
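For automation, the scale-in sequence in step 2 can be generated programmatically. A minimal sketch; the node name and resource-provider UUID are caller-supplied placeholders, and the commands themselves are the documented steps a-h above, not new ones:

```python
def scale_in_commands(compute: str, rp_uuid: str) -> list:
    """Build the documented scale-in command sequence (steps a-h above).

    `compute` is the compute service/node name and `rp_uuid` is the
    resource provider UUID; both are placeholders the caller supplies.
    """
    return [
        "source /home/stack/overcloudrc",
        f"openstack compute service set {compute} --disable",
        "source /home/stack/stackrc",
        f"openstack overcloud node delete --stack overcloud -y {compute}",
        "source /home/stack/overcloudrc",
        f"openstack compute service delete {compute}",
        f"for AGENT in $(openstack network agent list --host {compute} -c ID -f value) ; "
        "do openstack network agent delete $AGENT ; done",
        f"openstack resource provider delete {rp_uuid}",
    ]

# Example with placeholder values:
for cmd in scale_in_commands("compute-1", "<compute uuid>"):
    print(cmd)
```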
This error indicates you don't have available hosts to deploy. If you are reusing the hosts, you need to make sure they have been cleaned and are available in Ironic again.
Docs are needed (I'm unable to set the flag)
It sounds like the undercloud ironic config option automated_clean might be set to False in this case, or not all nodes are in an available state.

However, the documentation should make it clear there are prerequisites before attempting scale-up. I think section 16.2 needs some prerequisite bullet points, just like section 8.5 [1], to ensure there are nodes available before attempting scale-up.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/director_installation_and_usage/index#scaling-up-bare-metal-nodes
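The "nodes available" prerequisite could also be checked in automation before re-running the deploy. A sketch, assuming the JSON output of `openstack baremetal node list -f json` (the field names mirror the CLI's table headers):

```python
import json

def schedulable_nodes(node_list_json: str) -> list:
    """Return names of nodes that could be picked for scale-up:
    provisioning state 'available' and not in maintenance."""
    nodes = json.loads(node_list_json)
    return [n["Name"] for n in nodes
            if n["Provisioning State"] == "available"
            and not n["Maintenance"]]

# Hypothetical sample shaped like `openstack baremetal node list -f json`,
# mirroring the states seen in this BZ:
sample = json.dumps([
    {"Name": "compute-0", "Provisioning State": "active", "Maintenance": False},
    {"Name": "compute-1", "Provisioning State": "available", "Maintenance": False},
    {"Name": "controller-0", "Provisioning State": "active", "Maintenance": False},
])

assert schedulable_nodes(sample) == ["compute-1"]
```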
It seems like automated_clean was set to false. Assuming I would like to clean up manually, what steps should I follow?
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/bare_metal_provisioning/configuring-the-bare-metal-provisioning-service-after-deployment#cleaning-nodes-manually_bare-metal-post-deployment
Hi, I tried with both manual and automatic cleanup but it still fails with the same error.

Steps:
1. Deploy an overcloud with 2 compute nodes.
2. Follow the scale-down steps from https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/director_installation_and_usage/scaling-overcloud-nodes
3. openstack baremetal node clean c10ff1a6-11fb-4165-a86f-9153349e6c7f --clean-steps '[{"interface": "deploy", "step": "erase_devices"}]' (or set the automated_clean flag to true and restart the ironic containers)
4. openstack baremetal node list (to make sure the node status is available)
5. Re-deploy the overcloud.
Ella,

Can you supply us with an sosreport from the undercloud where you've encountered this?

An additional question: how much time was there between ensuring that the node status changed to available, and the attempt to re-deploy the overcloud?

There is a reconciliation loop inside of nova that only picks up changes every 2-3 minutes, if nova is in use for the deployment. If the deployment is nova-less, realistically you shouldn't be encountering this issue.
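Given that reconciliation delay, automation that redeploys immediately after the node delete can race the scheduler. A generic polling helper would wait the loop out; this is a hypothetical sketch, not part of any TripleO tooling:

```python
import time

def wait_for(predicate, timeout: float = 600, interval: float = 30) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    With the 2-3 minute nova reconciliation loop in mind, a timeout of
    several minutes with a ~30 s poll interval is a reasonable default.
    Returns True on success, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# Example: wait until a (hypothetical) check reports the freed node as
# available before re-running `openstack overcloud deploy`. Here the
# check is simulated by an iterator that succeeds on the third poll.
attempts = iter([False, False, True])
assert wait_for(lambda: next(attempts), timeout=5, interval=0)
```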
Hi Julia,

> Can you supply us with a sosreport from the undercloud where you've encountered this?

Unfortunately, there is an issue with Gerrit, so I'll send a sosreport as soon as I'm able to redeploy.

> An additional question, how much time was there between ensuring that the node status changed to available, and the attempt to re-deploy the overcloud?

The scale-in and scale-out are part of a larger automation, so the deployment starts as soon as the step that removes the overcloud node finishes.
Hi Ella, I am a member of the rhos-docs team. I have emailed one of the "Director Installation and Usage" guide writers and asked him to investigate this BZ. Thanks for your feedback! Best, --Greg
Adding the following sos report as per comment #7: http://rhos-release.virt.bos.redhat.com/log/bz1958940/
(In reply to Steve Baker from comment #3)
> It sounds like the undercloud ironic config automated_clean might be set to
> False in this case, or not all nodes are in an available state.

It could explain the issue here: the node is in the available state:

| UUID                                 | Name      | Instance UUID | Power State | Provisioning State | Maintenance |
| 3afa3f02-8583-470f-a0a9-5c705b26b7ad | compute-1 | None          | power off   | available          | False       |

but in undercloud.conf we have clean_nodes = True, so the data is lost.

> However the documentation should make it clear there are prerequisites
> before attempting scale-up.
> I think section 16.2 needs some prerequisite bullet points just like in
> section 8.5[1] to ensure there are nodes available before attempting
> scale-up.
> [1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/director_installation_and_usage/index#scaling-up-bare-metal-nodes

Based on comment #3 it could explain the problem:

(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node show 39022793-d37d-4b31-8c37-b492b8a2a009 -c "instance_info"
instance_info = {'image_source': '4d6b3cc2-3bb2-4ff8-b091-e4f7a32ae9c1', 'root_gb': '97', 'swap_mb': '0', 'display_name': 'computehciovsdpdksriov-0', 'vcpus': '39', 'nova_host_id': 'undercloud-0.localdomain', 'memory_mb': '125000', 'local_gb': '1861', 'capabilities': '{"boot_option": "local", "profile": "compute"}', 'configdrive': '******'}

I will try the following:
* remove the node again from the overcloud
* introspect
* tag
* scale

In case of success, I will have to update the undercloud through ironic and retry the procedure.
(In reply to Yariv from comment #14)
> I will try the following:
> * remove the node again from overcloud,
> * Introspect
> * tag
> * scale
>
> In case of success:
> will have to update the undercloud through the ironic, and retry the
> procedure

Did not have any success.
> Did not have any success,

So, to understand this correctly: you removed the node, re-added it, re-inspected, tagged, and then attempted scale-out? Any change in the nova errors? Were there any prior errors, or was this just the last error? Can we get the baremetal node list?

Are you just encountering issues with nova scheduling your scale-out? Does `openstack baremetal node list` show sufficient baremetal machines available? What does `nova hypervisor-list` indicate?

Without logs it is impossible to confirm, but this seems like something going sideways with the data in placement: if the node is in the available state and properly configured to match the requested compute-node flavor, then it all comes down to scheduling at that point.
From the nova-scheduler log, the TripleOCapabilitiesFilter filters out all nodes, which causes the scheduling to fail:

2021-06-13 16:06:15.085 14 INFO nova.filters [req-84fed395-5a7a-4281-8704-a6cf57adc0ed 5cba6568454c40a2a9987876d900193b 3bb4842ccdf24eb08ac7edffc0e76d1d - default default] Filter TripleOCapabilitiesFilter returned 0 hosts

Do you use scheduler hints / custom hostname mapping? Since the new compute will have a different index, have you updated the capabilities for the new index?

In general I tested scale down/up with the latest compose and it worked for me:

(undercloud) [stack@undercloud-0 ~]$ openstack stack list
| ID                                   | Stack Name | Project                          | Stack Status    | Creation Time        | Updated Time |
| 4e8f6021-6a1b-4df9-9f77-d12a5a927e88 | overcloud  | f79c666f280c436f9530ff5eda9fcdba | CREATE_COMPLETE | 2021-06-30T05:35:46Z | None         |

(undercloud) [stack@undercloud-0 ~]$ source /home/stack/overcloudrc
(overcloud) [stack@undercloud-0 ~]$ openstack compute service list
| ID                                   | Binary         | Host                      | Zone     | Status  | State | Updated At                 |
| 3d1601b1-1d75-48a5-adb2-f079ee6aa3c4 | nova-conductor | controller-0.redhat.local | internal | enabled | up    | 2021-06-30T06:15:41.000000 |
| 7c344811-7bce-4664-9a7a-8491793e05d5 | nova-scheduler | controller-0.redhat.local | internal | enabled | up    | 2021-06-30T06:15:34.000000 |
| 0d88bf40-161c-457e-a826-300605e376e3 | nova-compute   | compute-0.redhat.local    | nova     | enabled | up    | 2021-06-30T06:15:34.000000 |
| fa5d6935-d229-4b65-b08a-55bf2a0a7f0c | nova-compute   | compute-1.redhat.local    | nova     | enabled | up    | 2021-06-30T06:15:34.000000 |

(overcloud) [stack@undercloud-0 ~]$ openstack compute service set compute-1.redhat.local nova-compute --disable
(overcloud) [stack@undercloud-0 ~]$ openstack compute service list
| ID                                   | Binary         | Host                      | Zone     | Status   | State | Updated At                 |
| 3d1601b1-1d75-48a5-adb2-f079ee6aa3c4 | nova-conductor | controller-0.redhat.local | internal | enabled  | up    | 2021-06-30T06:16:21.000000 |
| 7c344811-7bce-4664-9a7a-8491793e05d5 | nova-scheduler | controller-0.redhat.local | internal | enabled  | up    | 2021-06-30T06:16:24.000000 |
| 0d88bf40-161c-457e-a826-300605e376e3 | nova-compute   | compute-0.redhat.local    | nova     | enabled  | up    | 2021-06-30T06:16:24.000000 |
| fa5d6935-d229-4b65-b08a-55bf2a0a7f0c | nova-compute   | compute-1.redhat.local    | nova     | disabled | up    | 2021-06-30T06:16:24.000000 |

(undercloud) [stack@undercloud-0 ~]$ source /home/stack/stackrc
(undercloud) [stack@undercloud-0 ~]$ openstack server list
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
| 636f8bcb-404e-4c94-bef9-4f7171e862c7 | controller-0 | ACTIVE | ctlplane=192.168.24.51 | overcloud-full | controller |
| 14cceadd-c550-4a0b-b565-f1a0ad2912a2 | compute-0    | ACTIVE | ctlplane=192.168.24.55 | overcloud-full | compute    |
| aac841b8-e6e7-4994-ae46-7c8b15398667 | compute-1    | ACTIVE | ctlplane=192.168.24.12 | overcloud-full | compute    |

(undercloud) [stack@undercloud-0 ~]$ openstack overcloud node delete --stack overcloud -y compute-1
Deleting the following nodes from stack overcloud:
- compute-1
Waiting for messages on queue 'tripleo' with no timeout.

[stack@undercloud-0 ~]$ source overcloudrc
(overcloud) [stack@undercloud-0 ~]$ openstack compute service list
| ID                                   | Binary         | Host                      | Zone     | Status  | State | Updated At                 |
| 3d1601b1-1d75-48a5-adb2-f079ee6aa3c4 | nova-conductor | controller-0.redhat.local | internal | enabled | up    | 2021-06-30T06:22:11.000000 |
| 7c344811-7bce-4664-9a7a-8491793e05d5 | nova-scheduler | controller-0.redhat.local | internal | enabled | up    | 2021-06-30T06:22:04.000000 |
| 0d88bf40-161c-457e-a826-300605e376e3 | nova-compute   | compute-0.redhat.local    | nova     | enabled | up    | 2021-06-30T06:22:04.000000 |

(overcloud) [stack@undercloud-0 ~]$ openstack resource provider list
| uuid                                 | name                   | generation |
| ebfcda80-afa1-47c2-9a85-c5446d61a91e | compute-0.redhat.local | 2          |

(overcloud) [stack@undercloud-0 ~]$ openstack network agent list
| ID                                   | Agent Type                   | Host                      | Availability Zone | Alive | State | Binary                        |
| 48e58d5d-fb97-4833-a36d-901790de08ed | OVN Controller agent         | compute-0.redhat.local    |                   | :-)   | UP    | ovn-controller                |
| 784ad1a2-4445-4ba5-a705-0cf57b199268 | OVN Metadata agent           | compute-0.redhat.local    |                   | :-)   | UP    | networking-ovn-metadata-agent |
| f5c14eb2-d679-4857-9b24-438203bce587 | OVN Controller agent         | compute-1.redhat.local    |                   | :-)   | UP    | ovn-controller                |
| 532fce0a-5921-49a2-8227-b92b087001cb | OVN Metadata agent           | compute-1.redhat.local    |                   | :-)   | UP    | networking-ovn-metadata-agent |
| 1fb383a5-41d3-4fb2-b46f-7ed85ec9bb5d | OVN Controller Gateway agent | controller-0.redhat.local |                   | :-)   | UP    | ovn-controller                |

- no need to manually remove the nova-compute service
- no need to manually remove the resource provider
- OVN agents can not be deleted, which is a known issue and there are BZs for it

(undercloud) [stack@undercloud-0 ~]$ source /home/stack/stackrc
(undercloud) [stack@undercloud-0 ~]$ openstack server list
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
| 636f8bcb-404e-4c94-bef9-4f7171e862c7 | controller-0 | ACTIVE | ctlplane=192.168.24.51 | overcloud-full | controller |
| 14cceadd-c550-4a0b-b565-f1a0ad2912a2 | compute-0    | ACTIVE | ctlplane=192.168.24.55 | overcloud-full | compute    |

(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node list
| UUID                                 | Name         | Instance UUID                        | Power State | Provisioning State | Maintenance |
| f8ede99a-fb92-4b70-8a9b-aba739cbc585 | compute-0    | 14cceadd-c550-4a0b-b565-f1a0ad2912a2 | power on    | active             | False       |
| 49d22e9e-0756-435d-bc53-273f8ef25692 | compute-1    | None                                 | power off   | available          | False       |
| e7d006a3-3925-488c-8c44-8661c4ed271d | controller-0 | 636f8bcb-404e-4c94-bef9-4f7171e862c7 | power on    | active             | False       |

-> scale out by running the same initial deployment command, requesting 2 computes

(undercloud) [stack@undercloud-0 ~]$ openstack server list
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
| 69dd94a2-1cd8-4676-8ba1-a1b2c9cf78cb | compute-2    | BUILD  |                        | overcloud-full | compute    |
| 636f8bcb-404e-4c94-bef9-4f7171e862c7 | controller-0 | ACTIVE | ctlplane=192.168.24.51 | overcloud-full | controller |
| 14cceadd-c550-4a0b-b565-f1a0ad2912a2 | compute-0    | ACTIVE | ctlplane=192.168.24.55 | overcloud-full | compute    |

(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node list
| UUID                                 | Name         | Instance UUID                        | Power State | Provisioning State | Maintenance |
| f8ede99a-fb92-4b70-8a9b-aba739cbc585 | compute-0    | 14cceadd-c550-4a0b-b565-f1a0ad2912a2 | power on    | active             | False       |
| 49d22e9e-0756-435d-bc53-273f8ef25692 | compute-1    | 69dd94a2-1cd8-4676-8ba1-a1b2c9cf78cb | power off   | deploying          | False       |
| e7d006a3-3925-488c-8c44-8661c4ed271d | controller-0 | 636f8bcb-404e-4c94-bef9-4f7171e862c7 | power on    | active             | False       |

...

(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node list
| UUID                                 | Name         | Instance UUID                        | Power State | Provisioning State | Maintenance |
| f8ede99a-fb92-4b70-8a9b-aba739cbc585 | compute-0    | 14cceadd-c550-4a0b-b565-f1a0ad2912a2 | power on    | active             | False       |
| 49d22e9e-0756-435d-bc53-273f8ef25692 | compute-1    | 69dd94a2-1cd8-4676-8ba1-a1b2c9cf78cb | power on    | active             | False       |
| e7d006a3-3925-488c-8c44-8661c4ed271d | controller-0 | 636f8bcb-404e-4c94-bef9-4f7171e862c7 | power on    | active             | False       |

(undercloud) [stack@undercloud-0 ~]$ openstack server list
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
| 69dd94a2-1cd8-4676-8ba1-a1b2c9cf78cb | compute-2    | ACTIVE | ctlplane=192.168.24.12 | overcloud-full | compute    |
| 636f8bcb-404e-4c94-bef9-4f7171e862c7 | controller-0 | ACTIVE | ctlplane=192.168.24.51 | overcloud-full | controller |
| 14cceadd-c550-4a0b-b565-f1a0ad2912a2 | compute-0    | ACTIVE | ctlplane=192.168.24.55 | overcloud-full | compute    |

(overcloud) [stack@undercloud-0 ~]$ openstack compute service list
| ID                                   | Binary         | Host                      | Zone     | Status  | State | Updated At                 |
| 3d1601b1-1d75-48a5-adb2-f079ee6aa3c4 | nova-conductor | controller-0.redhat.local | internal | enabled | up    | 2021-06-30T06:57:51.000000 |
| 7c344811-7bce-4664-9a7a-8491793e05d5 | nova-scheduler | controller-0.redhat.local | internal | enabled | up    | 2021-06-30T06:57:55.000000 |
| 0d88bf40-161c-457e-a826-300605e376e3 | nova-compute   | compute-0.redhat.local    | nova     | enabled | up    | 2021-06-30T06:57:54.000000 |
| 5ff861e8-9d6c-4f66-87e8-f64bb33117a5 | nova-compute   | compute-2.redhat.local    | nova     | enabled | up    | 2021-06-30T06:57:52.000000 |

(overcloud) [stack@undercloud-0 ~]$ openstack resource provider list
| uuid                                 | name                   | generation |
| ebfcda80-afa1-47c2-9a85-c5446d61a91e | compute-0.redhat.local | 2          |
| 3bc4773c-11b0-4e86-98d3-862fd5f13453 | compute-2.redhat.local | 2          |
Hi, we do use a custom hostname mapping:
```
ControllerHostnameFormat: 'controller-%index%'
ControllerSchedulerHints:
  'capabilities:node': 'controller-%index%'
ComputeOvsDpdkSriovHostnameFormat: 'computeovsdpdksriov-%index%'
ComputeOvsDpdkSriovSchedulerHints:
  'capabilities:node': 'computeovsdpdksriov-%index%'
```
To ease the debugging I will attach a tar.gz of my templates.

BR,
Ella Shulman
(In reply to Ella Shulman from comment #22)
> Hi, We do use a custom hostname mapping:
> ComputeOvsDpdkSriovHostnameFormat: 'computeovsdpdksriov-%index%'
> ComputeOvsDpdkSriovSchedulerHints:
>   'capabilities:node': 'computeovsdpdksriov-%index%'

With ComputeOvsDpdkSriovSchedulerHints set, you need to update the capabilities of the baremetal node to match the next index you are going to create. If you have 2 computes which currently have the capabilities node:computeovsdpdksriov-0/1 and you remove one and scale up again, you need a node with the capabilities for the new index 2, e.g.:

openstack baremetal node set compute-1 --property capabilities='node:computeovsdpdksriov-2,profile:compute,boot_option:local'
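The index mismatch can be illustrated with a toy model of how a capabilities-based filter compares the 'capabilities:node' scheduler hint against a node's capabilities string. This is a simplified sketch, not the actual TripleOCapabilitiesFilter implementation:

```python
def parse_capabilities(caps: str) -> dict:
    """Parse an Ironic capabilities string such as
    'node:computeovsdpdksriov-2,profile:compute,boot_option:local'
    into a dict of key/value pairs."""
    return dict(item.split(":", 1) for item in caps.split(",") if item)

def passes_node_hint(node_caps: str, hint: str) -> bool:
    """Simplified model of the 'capabilities:node' check: the node's
    'node' capability must equal the requested scheduler hint."""
    return parse_capabilities(node_caps).get("node") == hint

# The freed node still carries its old index capability...
old = "node:computeovsdpdksriov-1,profile:compute,boot_option:local"
# ...but the scale-out requests the next index, so the filter drops it
# ("Filter TripleOCapabilitiesFilter returned 0 hosts"):
assert not passes_node_hint(old, "computeovsdpdksriov-2")

# After updating the node's capabilities to the new index, it matches:
new = "node:computeovsdpdksriov-2,profile:compute,boot_option:local"
assert passes_node_hint(new, "computeovsdpdksriov-2")
```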
Hi, I tried running what you suggested. It seems like placement is OK when increasing the index, but for some reason the deployment got stuck in the middle. I need to check whether it's a single-run issue or a persistent one.
Since the steps in comment #24 helped for this issue, I'll close this BZ.