Bug 1937696 - [Assisted-4.7]node/hostnames vs bmh names inconsistency, skipped cluster index in name
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.9.0
Assignee: Mat Kowalski
QA Contact: Yuri Obshansky
URL:
Whiteboard: AI-Team-Platform KNI-EDGE-4.8
Depends On:
Blocks:
 
Reported: 2021-03-11 11:14 UTC by Elena German
Modified: 2021-10-18 17:29 UTC
CC: 6 users

Fixed In Version: OCP-Metal-v1.0.23.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:29:21 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift assisted-service pull 1992 0 None open Bug 1937696: Use hostname for BMH resource name 2021-06-24 06:40:59 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:29:59 UTC

Description Elena German 2021-03-11 11:14:00 UTC
Description of problem:
When the hostname of a host includes an additional parameter such as the cluster number (master-0-0, master-0-1, ...), this parameter is skipped in the BMH names (openshift-master-0, openshift-master-1, ...).

Version-Release number of selected component (if applicable):
UI: 1.5.11
API: stable (4.7)


How reproducible:
Always in virtual environments; real bare-metal setups have unique names with no cluster counter.


Steps to Reproduce:
1. Create a virtual setup for OCP deployment whose VM names include a parameter for the cluster count number:
[root@titan47 ~]# virsh list --all
 Id   Name                State
------------------------------------
 1    minikube            running
 2    master-0-0          running
 3    master-0-1          running
 4    master-0-2          running
 5    worker-0-0          running
 6    worker-0-1          running

2. Deploy an OCP cluster using the Assisted Installer (a single cluster is enough to reproduce the problem)

Actual results:
[root@titan47 ~]# oc get nodes
NAME         STATUS   ROLES    AGE   VERSION
master-0-0   Ready    master   34h   v1.20.0+ba45583
master-0-1   Ready    master   34h   v1.20.0+ba45583
master-0-2   Ready    master   34h   v1.20.0+ba45583
worker-0-0   Ready    worker   34h   v1.20.0+ba45583
worker-0-1   Ready    worker   34h   v1.20.0+ba45583
[root@titan47 ~]# oc get bmh -A
NAMESPACE               NAME                 STATUS       PROVISIONING STATUS   CONSUMER                                 BMC   HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0   discovered   unmanaged             titan47-cluster-0-xfvlj-master-0                                  true     
openshift-machine-api   openshift-master-1   discovered   unmanaged             titan47-cluster-0-xfvlj-master-1                                  true     
openshift-machine-api   openshift-master-2   discovered   unmanaged             titan47-cluster-0-xfvlj-master-2                                  true     
openshift-machine-api   openshift-worker-0   discovered   unmanaged             titan47-cluster-0-xfvlj-worker-0-c46xw                            true     
openshift-machine-api   openshift-worker-1   discovered   unmanaged             titan47-cluster-0-xfvlj-worker-0-j4h9m                            true     
[root@titan47 ~]# 


Expected results:
NAMESPACE               NAME                 STATUS       PROVISIONING STATUS   CONSUMER                                 BMC   HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0-0   discovered   unmanaged             titan47-cluster-0-xfvlj-master-0                                  true     
openshift-machine-api   openshift-master-0-1   discovered   unmanaged             titan47-cluster-0-xfvlj-master-1                                  true     
openshift-machine-api   openshift-master-0-2   discovered   unmanaged             titan47-cluster-0-xfvlj-master-2                                  true     
openshift-machine-api   openshift-worker-0-0   discovered   unmanaged             titan47-cluster-0-xfvlj-worker-0-c46xw                            true     
openshift-machine-api   openshift-worker-0-1   discovered   unmanaged             titan47-cluster-0-xfvlj-worker-0-j4h9m                            true     


Additional info:

Comment 1 Flavio Percoco 2021-04-15 12:56:33 UTC
I haven't looked into this issue in great detail but at first sight, it looks like we may be trimming some data off of the name.

Comment 2 Mat Kowalski 2021-05-12 11:18:30 UTC
Could you please provide me with some more info on how you create the virtual setup, as well as a VBMC dump? When trying to reproduce with dev-scripts I cannot get the same results as yours (i.e., I get a different naming scheme for the libvirt VMs and their respective VBMC interfaces).

I have created a virtual setup with the following initial configuration

```
export CLUSTER_NAME="bz1937696"
export MASTER_HOSTNAME_FORMAT=maaaster-0-%d
export WORKER_HOSTNAME_FORMAT=wooorker-0-%d
```

and got results as below

```
# virsh list
 Id   Name                        State
-------------------------------------------
 1    minikube                    running
 2    bz1937696-w8sm6-bootstrap   running
 6    bz1937696_master_1          running
 7    bz1937696_master_2          running
 8    bz1937696_master_0          running

# vbmc list
+--------------------+---------+---------------+------+
| Domain name        | Status  | Address       | Port |
+--------------------+---------+---------------+------+
| bz1937696_master_0 | running | 192.168.222.1 | 6230 |
| bz1937696_master_1 | running | 192.168.222.1 | 6231 |
| bz1937696_master_2 | running | 192.168.222.1 | 6232 |
| bz1937696_worker_0 | running | 192.168.222.1 | 6233 |
| bz1937696_worker_1 | running | 192.168.222.1 | 6234 |
+--------------------+---------+---------------+------+

# virsh net-dhcp-leases bz1937696bm
 Expiry Time           MAC address         Protocol   IP address          Hostname       Client ID or DUID
---------------------------------------------------------------------------------------------------------------
 2021-05-12 08:06:02   00:0a:c1:3b:28:14   ipv4       192.168.222.20/24   maaaster-0-0   01:00:0a:c1:3b:28:14
 2021-05-12 08:06:01   00:0a:c1:3b:28:18   ipv4       192.168.222.21/24   maaaster-0-1   01:00:0a:c1:3b:28:18
 2021-05-12 08:06:01   00:0a:c1:3b:28:1c   ipv4       192.168.222.22/24   maaaster-0-2   01:00:0a:c1:3b:28:1c
 2021-05-12 07:53:33   52:54:00:1e:cd:ad   ipv4       192.168.222.45/24   -              01:52:54:00:1e:cd:ad

# oc get nodes
NAME           STATUS     ROLES    AGE     VERSION
maaaster-0-0   NotReady   master   2m16s   v1.21.0-rc.0+c4bd6f9
maaaster-0-1   NotReady   master   2m17s   v1.21.0-rc.0+c4bd6f9
maaaster-0-2   NotReady   master   2m17s   v1.21.0-rc.0+c4bd6f9

# oc get bmh -A
NAMESPACE               NAME                 STATE   CONSUMER                   ONLINE   ERROR
openshift-machine-api   bz1937696-master-0           bz1937696-w8sm6-master-0   true
openshift-machine-api   bz1937696-master-1           bz1937696-w8sm6-master-1   true
openshift-machine-api   bz1937696-master-2           bz1937696-w8sm6-master-2   true
openshift-machine-api   bz1937696-worker-0                                      true
```

I have checked in the source code where the name of the BMH resource comes from: it is based on the VBMC Domain Name (with "_" replaced by "-"). In your setup it looks to me like the VBMC Domain Name does not contain the cluster number, but to confirm this I'd need to see the `vbmc list` output.
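
The mapping described here (underscores in the VBMC/BMC domain name replaced with hyphens) can be sketched as below; the function name is hypothetical, for illustration only, and is not the actual baremetal-operator identifier:

```go
package main

import (
	"fmt"
	"strings"
)

// bmhNameFromDomain sketches the transformation described above:
// the BMH resource name is derived from the VBMC/BMC domain name,
// with "_" replaced by "-". (Hypothetical helper, not the real code.)
func bmhNameFromDomain(domain string) string {
	return strings.ReplaceAll(domain, "_", "-")
}

func main() {
	// "bz1937696_master_0" becomes "bz1937696-master-0"
	fmt.Println(bmhNameFromDomain("bz1937696_master_0"))
}
```

Note that this transformation preserves every component of the domain name, so it cannot by itself explain a dropped cluster index.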

Comment 3 Elena German 2021-05-18 08:10:52 UTC
Actually we provision our nodes with Redfish, so:

[root@ocp-edge32 ~]# vbmc list
-bash: vbmc: command not found
[root@ocp-edge32 ~]# virsh list
 Id   Name         State
----------------------------
 1    master-0-0   running
 2    master-0-1   running
 3    master-0-2   running
 4    worker-0-0   running
 5    worker-0-1   running
 6    worker-0-2   running

Comment 4 Mat Kowalski 2021-05-19 11:11:47 UTC
Any chance I could get the manifests used to deploy this virtual environment (it's mainly about those libvirt VMs and their redfish interfaces) and/or access to the environment when the problem appears?

Comment 5 Elena German 2021-06-01 17:14:47 UTC
We do not have a manifest file for Assisted Installer, since it all happens in the background; the user is not exposed to the manifests Assisted generates for the installation.
I attached the install-config.yaml file; maybe it will be helpful.

Comment 7 Flavio Percoco 2021-06-11 07:57:16 UTC
Elena, I'm a bit confused about the setup where you are testing/seeing this issue.

In your last comment, you mentioned you don't have access to the manifests (XML files) for libvirt/redfish but, in the comment previous to your last comment, you mentioned that you are provisioning the nodes with redfish. Furthermore, throughout the BZ you have been giving us (thanks for that) CLI output from your `virsh` commands.

Are these commands you are running? Or is the customer running them? 

Unfortunately, we are not able to reproduce this issue and, unless we can replicate it in an environment similar/equal to yours, we won't be able to help much.

Could you please describe (step by step) how this environment is being set up and how this can be reproduced?

Comment 9 Marius Cornea 2021-06-11 09:11:03 UTC
Hey Flavio o/

This is particular to VM based QE setups used with assisted installer deployments. Currently the VMs are created with an internal QE tool as libvirt VMs. In this case redfish is not involved since the VMs boot from ISO. 

We can get a reproducer environment for debugging, but I guess the main question is how the names for BMH objects are created by the assisted installer, since the user doesn't create them as in the IPI case, and whether we have any option to adjust/influence how they're generated.

Comment 11 Mat Kowalski 2021-06-11 09:26:14 UTC
>the main question is how are the names for bmh objects created by assisted installer since the user doesn't create them

Please check the comment above, https://bugzilla.redhat.com/show_bug.cgi?id=1937696#c2. The name of the BMH resource is based on the VBMC/BMC Domain Name (with "_" replaced by "-"). For this reason, if your QE environment is using VBMC, we need to check the output of `vbmc list` to confirm whether the names there are correct.

Note that the name returned by `virsh list` is a name of the libvirt domain and is not relevant for naming the BMH resource.

Comment 13 Marius Cornea 2021-06-11 09:29:11 UTC
(In reply to Mat Kowalski from comment #11)
> >the main question is how are the names for bmh objects created by assisted installer since the user doesn't create them
> 
> Please check the comment above,
> https://bugzilla.redhat.com/show_bug.cgi?id=1937696#c2. The name of the BMH
> resource is based on the VBMC/BMC Domain Name (with "_" replaced by "-").
> For this reason, if your QE environment is using VBMC, we need to check the
> output of `vbmc list` to confirm whether the names there are correct.
> 
> Note that the name returned by `virsh list` is the name of the libvirt domain
> and is not relevant for naming the BMH resource.

In this case (assisted-installer) there's no BMC involved, as the VMs boot from ISO.

Comment 14 Mat Kowalski 2021-06-11 10:16:41 UTC
I see... In that case it would be extremely helpful for us to get access to the QE environment where the issue is happening. From what I see in the baremetal-operator codebase, there are only two flows setting the BMH resource name, and neither of them performs the stripping that happened here. Because of this, without more debugging it is difficult for us to pinpoint where exactly the issue comes from.

Comment 16 Mat Kowalski 2021-06-11 12:38:17 UTC
In the install-config.yaml I have found the following

```
apiVersion: v1
[...]
metadata:
  name: ocp-edge-cluster-0
[...]
platform:
  baremetal:
    provisioningNetwork: Disabled
    apiVIP: 192.168.123.147
    ingressVIP: 192.168.123.132
    hosts:
    - name: openshift-master-0
      role: master
[...]
    - name: openshift-worker-0
      role: worker
[...]
```

The install-config.yaml in this part is being generated by the following parts of the dev-scripts

- https://github.com/openshift-metal3/dev-scripts/blob/master/ocp_install_env.sh#L234
- https://github.com/openshift-metal3/dev-scripts/blob/master/utils.sh#L153

It looks like the `utils.sh` file is responsible for filling the node names in install-config.yaml in their current form, and this is how the name of the BareMetalHost resource got populated. For now I consider this an issue in dev-scripts rather than in assisted-service or baremetal-operator. I'll keep investigating how easily we could fix that to preserve the names here.

+++ Additional debug info

```
[root@sealusa2 ~]# oc -n openshift-machine-api get machines
NAME                                      PHASE     TYPE   REGION   ZONE   AGE
ocp-edge-cluster-0-b5kjk-master-0         Running                          90m
ocp-edge-cluster-0-b5kjk-master-1         Running                          90m
ocp-edge-cluster-0-b5kjk-master-2         Running                          90m
ocp-edge-cluster-0-b5kjk-worker-0-mp4wm   Running                          85m
ocp-edge-cluster-0-b5kjk-worker-0-s28pp   Running                          85m

[root@sealusa2 ~]# oc -n openshift-machine-api get baremetalhosts
NAME                 STATUS       PROVISIONING STATUS   CONSUMER                                  BMC   HARDWARE PROFILE   ONLINE   ERROR
openshift-master-0   discovered   unmanaged             ocp-edge-cluster-0-b5kjk-master-0                                  true     
openshift-master-1   discovered   unmanaged             ocp-edge-cluster-0-b5kjk-master-1                                  true     
openshift-master-2   discovered   unmanaged             ocp-edge-cluster-0-b5kjk-master-2                                  true     
openshift-worker-0   discovered   unmanaged             ocp-edge-cluster-0-b5kjk-worker-0-mp4wm                            true     
openshift-worker-1   discovered   unmanaged             ocp-edge-cluster-0-b5kjk-worker-0-s28pp                            true     

[root@sealusa2 ~]# oc -n openshift-machine-api describe machines ocp-edge-cluster-0-b5kjk-master-0
Name:         ocp-edge-cluster-0-b5kjk-master-0
Namespace:    openshift-machine-api
Labels:       machine.openshift.io/cluster-api-cluster=ocp-edge-cluster-0-b5kjk
              machine.openshift.io/cluster-api-machine-role=master
              machine.openshift.io/cluster-api-machine-type=master
Annotations:  metal3.io/BareMetalHost: openshift-machine-api/openshift-master-0
[...]
```

Comment 17 Mat Kowalski 2021-06-11 12:53:27 UTC
For reference, when creating BareMetalHost resources the openshift/installer uses names from `.platform.baremetal.hosts.name` to fill the `name` field of the BMH - https://github.com/openshift/installer/blob/master/pkg/asset/machines/baremetal/hosts.go#L37

Comment 18 Marius Cornea 2021-06-11 13:00:58 UTC
Please note that dev-scripts is not used on this setup, install-config.yaml was downloaded from assisted-installer

Comment 19 Mat Kowalski 2021-06-11 14:55:33 UTC
I see, in this case it's the following code generating the names in install-config.yaml

- https://github.com/openshift/assisted-service/blob/master/internal/installcfg/installcfg.go#L296
- https://github.com/openshift/assisted-service/blob/master/internal/installcfg/installcfg.go#L146

```
[...]
	prefix := "openshift-master-"
	index := masterIdx
	if host.Role == models.HostRoleWorker {
		prefix = "openshift-worker-"
		index = workerIdx
	}
[...]
```

I'll continue on Monday by checking whether we could easily adapt the naming scheme. I can see that the model has `host.RequestedHostname` available, so the path I will investigate is whether the logic could easily be changed into something more like

```
hosts[yamlHostIdx].Name = host.RequestedHostname
```
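
To make the contrast between the two schemes concrete, here is a minimal sketch (the `Host` struct and helper names are illustrative stand-ins, not the actual assisted-service types):

```go
package main

import "fmt"

// Host loosely mirrors the fields referenced in the snippets above;
// it is an illustrative sketch, not the real assisted-service model.
type Host struct {
	Role              string
	RequestedHostname string
}

// bmhNameIndexed reproduces the old index-based scheme quoted above:
// a fixed prefix plus a per-role counter, which drops any cluster
// index embedded in the hostname.
func bmhNameIndexed(h Host, masterIdx, workerIdx int) string {
	if h.Role == "worker" {
		return fmt.Sprintf("openshift-worker-%d", workerIdx)
	}
	return fmt.Sprintf("openshift-master-%d", masterIdx)
}

// bmhNameFromHostname is the proposed alternative: reuse the host's
// requested hostname directly, so "master-0-0" stays "master-0-0".
func bmhNameFromHostname(h Host) string {
	return h.RequestedHostname
}

func main() {
	h := Host{Role: "master", RequestedHostname: "master-0-0"}
	fmt.Println(bmhNameIndexed(h, 0, 0))  // old scheme: cluster index lost
	fmt.Println(bmhNameFromHostname(h))   // proposed: hostname preserved
}
```

With the proposed scheme, the BMH name matches the node name, which is exactly the inconsistency this bug reports.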

Comment 22 Udi Kalifon 2021-07-05 07:36:27 UTC
As I understand it, the BMHs are created and assigned to nodes in no particular order. The assisted installer doesn't recognize the pattern and the indexes that we use in the node names (worker-0-0, worker-0-1, etc.) and just creates BMH names with other indexes, which ends up confusing us a lot. In customer cases, the node names will not necessarily be indexed, and then it will be clear to the user that the indexes in the BMH names are not related to anything.

Still, I see that this bug is on POST. Do we have a new logic for generating BMH names based on the node names? What should we expect to see?

Comment 23 Mat Kowalski 2021-07-05 07:54:10 UTC
We have a PR in review [1] that changes the logic of generating BMH names to use `GetHostnameForMsg()`. The new install-config.yaml will look as below [2], so that an aggregation of BMHs from across multiple clusters will give distinguishable names. This is an improvement for operators, who usually manage multiple clusters; having BMHs named in a way that makes them distinct certainly makes life easier.

Having said this, using an index (or even multiple indexes) is in principle not the best idea, as internally the ordering of Node, BareMetalHost, and Machine resources cannot be guaranteed to be stable. So the more indexes are used in various names, the bigger the chance of further confusion. I don't think we can create a silver bullet covering all possible combinations, but the PR [1] should make operations slightly easier.

[1] https://github.com/openshift/assisted-service/pull/1992

[2]

```
[...]
platform:
  baremetal:
    provisioningNetwork: Disabled
    apiVIP: 192.168.127.91
    ingressVIP: 192.168.127.23
    hosts:
    - name: test-infra-cluster-d63379ba-master-0
      role: master
      bootMACAddress: 02:00:00:46:6e:e7
      bootMode: legacy
    - name: test-infra-cluster-d63379ba-master-1
      role: master
      bootMACAddress: 02:00:00:20:c1:bd
      bootMode: legacy
[...]
```

Comment 25 Udi Kalifon 2021-07-26 11:07:02 UTC
Verified. The BMH names are now the same as the node names, which greatly reduces the confusion as to which object relates to which.

Comment 29 errata-xmlrpc 2021-10-18 17:29:21 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

