Bug 2104675

Summary: OSP17 Compute replacement fails with error Refusing to proceed can't find hostname
Product: Red Hat OpenStack
Reporter: David Rosenfeld <drosenfe>
Component: openstack-tripleo-common
Assignee: Adriano Petrich <apetrich>
Status: CLOSED DUPLICATE
QA Contact: David Rosenfeld <drosenfe>
Severity: high
Docs Contact:
Priority: unspecified
Version: 17.0 (Wallaby)
CC: bshephar, jslagle, mburns, ramishra, slinaber
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-07-07 03:19:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description David Rosenfeld 2022-07-06 20:29:23 UTC
Description of problem: A deployment with two compute nodes, compute-0 and compute-1, is performed. compute-1 is scaled down, and then a scale up with a node named compute-2 is attempted. The stack update that should add compute-2 fails with this error message:

2022-07-06 19:56:09.708256 | 525400e2-e4ac-47fe-b253-00000000000d |      FATAL | Find existing instances | localhost | error={"changed": false, "msg": "Requested hostname compute-0 was not found, but the deployed node 815d1771-60a4-418d-8a6d-6462ddf20886 has a matching name. Refusing to proceed to avoid confusing results. Please either rename the node or use a different hostname"}

This is seen in openstack baremetal node list after deployment:

| 815d1771-60a4-418d-8a6d-6462ddf20886 | compute-0    | ad26596a-54d2-4079-a4b1-8a05ba9da28c | power on    | active             | False       |
| 627170d0-0807-469d-9223-f8762169ffa2 | compute-1    | f9ae797c-a51f-4b9e-8ea4-4db45abbbbe5 | power on    | active             | False       |

This is seen in openstack baremetal node list after the scale down:

| 815d1771-60a4-418d-8a6d-6462ddf20886 | compute-0    | ad26596a-54d2-4079-a4b1-8a05ba9da28c | power on    | active             | False       |
| 627170d0-0807-469d-9223-f8762169ffa2 | compute-1    | None                                 | power off   | available          | False       |

This is seen in openstack baremetal node list after the scale up is attempted and fails:

| 815d1771-60a4-418d-8a6d-6462ddf20886 | compute-0    | ad26596a-54d2-4079-a4b1-8a05ba9da28c | power on    | active             | False       |
| 627170d0-0807-469d-9223-f8762169ffa2 | compute-1    | 5988a9de-f6d9-4522-9c51-d6aa6bf670af | power on    | active             | False       |
| 75432c61-aa8b-4763-be97-b05d0bc06560 | compute-2    | None                                 | power off   | available          | False       |


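The node named compute-2 was available and should have been used for the scale up, but it was not. Instead, the node named compute-1 was provisioned again (it now has a new instance UUID), and compute-2 remained in the available state.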
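To make the error message easier to reason about, here is a minimal Python sketch of the kind of safeguard the message describes. It is a simplified model and not the actual tripleo or metalsmith code: the `Deployed` class and the `find_existing` function are hypothetical names introduced for illustration, and the lookup order (hostname first, then node name) is an assumption. The sample data is the state after the scale down, taking the hostname of node compute-0 from the metalsmith list output in Comment 3.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Deployed:
    """One deployed instance: the Ironic node plus the hostname assigned to it."""
    node_uuid: str
    node_name: str
    hostname: str


def find_existing(requested_hostname: str, deployed: List[Deployed]) -> Optional[Deployed]:
    """Hypothetical model of the "Find existing instances" safeguard.

    Returns the deployed instance that already owns the hostname, or None if
    the hostname still has to be provisioned. Raises ValueError when the
    hostname only matches the *name* of a deployed node, which is ambiguous.
    """
    for inst in deployed:
        if inst.hostname == requested_hostname:
            return inst
    for inst in deployed:
        if inst.node_name == requested_hostname:
            raise ValueError(
                f"Requested hostname {requested_hostname} was not found, but the "
                f"deployed node {inst.node_uuid} has a matching name."
            )
    return None


# Assumed state after the scale down: only Ironic node compute-0 is still
# deployed, and it carries the overcloud hostname compute-1.
deployed = [
    Deployed("815d1771-60a4-418d-8a6d-6462ddf20886", "compute-0", "compute-1"),
]

for hostname in ("compute-1", "compute-2", "compute-0"):
    try:
        match = find_existing(hostname, deployed)
        print(hostname, "->", "already deployed" if match else "needs provisioning")
    except ValueError as exc:
        print(hostname, "-> refused:", exc)
```

Under these assumptions, only a request for hostname compute-0 is refused, because that string matches the name of a deployed node but not any deployed hostname. This is consistent with the failure above, although the real check may differ in detail.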


Version-Release number of selected component (if applicable): RHOS-17.0-RHEL-9-20220701.n.1


How reproducible: Every time


Steps to Reproduce:
1. Execute this job in Jenkins: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/df/view/rfe/job/DFG-df-rfe-17.0-virsh-3cont_3db_3msg_2net_2comp_3ceph-blacklist-2networker-compute-replacement/   Note: the scale down infrared update is in progress and hasn't been committed yet.

Actual results: Compute replacement fails with error above


Expected results: Compute replacement is successful.


Additional info:

Comment 2 Brendan Shephard 2022-07-06 23:01:22 UTC
Hey, so in the baremetal_deployment.yaml file, it looks like we are trying to remove compute-0:
- name: Compute
  count: 2
  hostname_format: compute-%index%
  defaults:
    profile: compute
    network_config:
      template: /home/stack/composable_roles/network/nic-configs/compute.j2
    networks:
    - network: ctlplane
      vif: true
    - network: internal_api
    - network: tenant
    - network: storage
  instances:
  - hostname: compute-0
    name: compute-1
    provisioned: false

So the node with hostname compute-0 in the overcloud matches the node named compute-1 in Ironic? This looks like it might be a mistake, since compute-0 is still deployed and you tried to scale down compute-1. Did we mean to have it like this instead?

- name: Compute
  count: 2
  hostname_format: compute-%index%
  defaults:
    profile: compute
    network_config:
      template: /home/stack/composable_roles/network/nic-configs/compute.j2
    networks:
    - network: ctlplane
      vif: true
    - network: internal_api
    - network: tenant
    - network: storage
  instances:
  - hostname: compute-1
    name: compute-1
    provisioned: false

Comment 3 David Rosenfeld 2022-07-06 23:48:35 UTC
This is the output of metalsmith list:

(undercloud) [stack@undercloud-0 ~]$ metalsmith list
+--------------------------------------+--------------+--------------------------------------+---------------+--------+------------------------+
| UUID                                 | Node Name    | Allocation UUID                      | Hostname      | State  | IP Addresses           |
+--------------------------------------+--------------+--------------------------------------+---------------+--------+------------------------+
| 96a3bfb0-1a5e-4f81-b671-343409700708 | ceph-0       | 1f4d52a7-fdb8-4044-9c8a-8d429f6dedb0 | cephstorage-2 | ACTIVE | ctlplane=192.168.24.22 |
| 70085593-04c2-43a4-8ef1-75a4503141b6 | ceph-1       | 5fcc6079-340c-439e-8c74-ba9ae1a00c1c | cephstorage-1 | ACTIVE | ctlplane=192.168.24.41 |
| 5f2c50f8-0f1e-4f1f-a345-a2f9da5d7192 | ceph-2       | 3b53df04-b57d-46bb-a73b-ee64e0ddcedd | cephstorage-0 | ACTIVE | ctlplane=192.168.24.35 |
| 815d1771-60a4-418d-8a6d-6462ddf20886 | compute-0    | ad26596a-54d2-4079-a4b1-8a05ba9da28c | compute-1     | ACTIVE | ctlplane=192.168.24.51 |
| 627170d0-0807-469d-9223-f8762169ffa2 | compute-1    | f9ae797c-a51f-4b9e-8ea4-4db45abbbbe5 | compute-0     | ACTIVE | ctlplane=192.168.24.27 |
| 16dede8b-016a-4e57-b600-1206fa958f96 | controller-0 | 9ea719ca-12b3-4c90-8491-672ae92c7d3c | controller-1  | ACTIVE | ctlplane=192.168.24.34 |
| eea417c5-d1b8-4bdd-9e83-834d015445c4 | controller-1 | 96a845c6-5e0c-45e6-b094-cc933e2fe4e3 | controller-0  | ACTIVE | ctlplane=192.168.24.38 |
| c496a066-ea50-492b-ac86-6138ed26d263 | controller-2 | 7279e9ca-f8b3-4aa3-9acd-a6e35070832b | controller-2  | ACTIVE | ctlplane=192.168.24.8  |
| fcddf6bd-c743-46df-8db5-9f31631bcdb0 | database-0   | 42926208-8547-454b-8daa-803b4b66b2a5 | database-0    | ACTIVE | ctlplane=192.168.24.9  |
| 90ce7155-cbb5-4b5b-a93e-649af77f2e64 | database-1   | 1c908502-c12a-457d-9fc3-479d62ab62de | database-1    | ACTIVE | ctlplane=192.168.24.31 |
| ba119d83-43cd-4efe-a703-cb30e7d8aa45 | database-2   | 32b39c2b-85f2-4676-86a5-44fd1883a6a7 | database-2    | ACTIVE | ctlplane=192.168.24.45 |
| 6155b66b-e0ec-48cf-b08d-1eed283ee397 | messaging-0  | 99eff70d-e669-4119-8aa6-2ed0012734bb | messaging-2   | ACTIVE | ctlplane=192.168.24.12 |
| 261584fc-42ac-4907-89d5-578c4e9ce6b7 | messaging-1  | c725e257-14ce-46df-8384-933679eff4c1 | messaging-1   | ACTIVE | ctlplane=192.168.24.24 |
| 85016e70-bbcc-4ad8-9536-fd524c154dcc | messaging-2  | f82d3856-9b0e-4362-b1d9-ac36a9b2d059 | messaging-0   | ACTIVE | ctlplane=192.168.24.53 |
| da472ada-943c-45f6-b082-e3b1422362d7 | networker-0  | 30f14784-37e5-4240-9aa1-bc08e8a2c5ea | networker-1   | ACTIVE | ctlplane=192.168.24.14 |
| ed17e123-abc7-459c-bc18-427e9ad90664 | networker-1  | 2cdfd562-a5cb-4e21-b28b-3584d7b54021 | networker-0   | ACTIVE | ctlplane=192.168.24.10 |
+--------------------------------------+--------------+--------------------------------------+---------------+--------+------------------------+

When deployed, there is no guarantee that the hostname and the node name match (in the output above, for example, node compute-0 has hostname compute-1). Also, the documentation:

https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/provisioning/baremetal_provision.html#deploying-the-overcloud

says the instances entry contains: 

The name of the baremetal node to remove from the overcloud

The hostname which is assigned to that node

The baremetal_deployment.yaml file in the job matches the documentation and the deployment.
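The check described in this comment can be sketched in Python. The snippet below is an illustration only: `parse_table` is a hypothetical helper, the sample text is a three-row excerpt of the metalsmith list output above (UUID, allocation and IP columns omitted for brevity), and `entry` mirrors the instances entry from the job's baremetal_deployment.yaml quoted in Comment 2.

```python
def parse_table(text):
    """Parse a '|'-delimited CLI table into a list of dicts keyed by header."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +-----+ border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


# Excerpt of the metalsmith list output (other columns omitted).
METALSMITH_LIST = """
+--------------+--------------+--------+
| Node Name    | Hostname     | State  |
+--------------+--------------+--------+
| compute-0    | compute-1    | ACTIVE |
| compute-1    | compute-0    | ACTIVE |
| controller-2 | controller-2 | ACTIVE |
+--------------+--------------+--------+
"""

rows = parse_table(METALSMITH_LIST)

# Nodes whose Ironic name differs from the hostname assigned at deployment.
mismatched = [(r["Node Name"], r["Hostname"]) for r in rows
              if r["Node Name"] != r["Hostname"]]
print(mismatched)

# The unprovisioned entry from baremetal_deployment.yaml: 'name' is the
# Ironic node name, 'hostname' is the hostname assigned to that node.
entry = {"hostname": "compute-0", "name": "compute-1", "provisioned": False}
hostname_by_node = {r["Node Name"]: r["Hostname"] for r in rows}
entry_is_consistent = hostname_by_node.get(entry["name"]) == entry["hostname"]
print(entry_is_consistent)
```

With this excerpt, the two compute nodes are reported as mismatched, and the entry is consistent with the deployment: node compute-1 did carry hostname compute-0 before the scale down.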

Comment 5 Rabi Mishra 2022-07-07 03:19:59 UTC

*** This bug has been marked as a duplicate of bug 2092444 ***