Bug 1937696
Summary: | [Assisted-4.7]node/hostnames vs bmh names inconsistency, skipped cluster index in name | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Elena German <elgerman> |
Component: | assisted-installer | Assignee: | Mat Kowalski <mko> |
assisted-installer sub component: | stand-alone | QA Contact: | Yuri Obshansky <yobshans> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | low | ||
Priority: | low | CC: | aos-bugs, fpercoco, frolland, mcornea, mko, ukalifon |
Version: | 4.7 | Keywords: | Triaged |
Target Milestone: | --- | ||
Target Release: | 4.9.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | AI-Team-Platform KNI-EDGE-4.8 | ||
Fixed In Version: | OCP-Metal-v1.0.23.1 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2021-10-18 17:29:21 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Elena German
2021-03-11 11:14:00 UTC
I haven't looked into this issue in great detail but at first sight, it looks like we may be trimming some data off of the name. Could you please provide me with some more info on how you create the virtual setup as well as VBMC dump? When trying to reproduce with dev-scripts I cannot get the same results as yours (i.e. I get a different naming schema for the libvirt VMs and their respective VBMC interfaces).

I have created a virtual setup with the following initial configuration

```
export CLUSTER_NAME="bz1937696"
export MASTER_HOSTNAME_FORMAT=maaaster-0-%d
export WORKER_HOSTNAME_FORMAT=wooorker-0-%d
```

and got results as below

```
# virsh list
 Id   Name                        State
-------------------------------------------
 1    minikube                    running
 2    bz1937696-w8sm6-bootstrap   running
 6    bz1937696_master_1          running
 7    bz1937696_master_2          running
 8    bz1937696_master_0          running

# vbmc list
+--------------------+---------+---------------+------+
| Domain name        | Status  | Address       | Port |
+--------------------+---------+---------------+------+
| bz1937696_master_0 | running | 192.168.222.1 | 6230 |
| bz1937696_master_1 | running | 192.168.222.1 | 6231 |
| bz1937696_master_2 | running | 192.168.222.1 | 6232 |
| bz1937696_worker_0 | running | 192.168.222.1 | 6233 |
| bz1937696_worker_1 | running | 192.168.222.1 | 6234 |
+--------------------+---------+---------------+------+

# virsh net-dhcp-leases bz1937696bm
 Expiry Time           MAC address         Protocol   IP address          Hostname       Client ID or DUID
---------------------------------------------------------------------------------------------------------------
 2021-05-12 08:06:02   00:0a:c1:3b:28:14   ipv4       192.168.222.20/24   maaaster-0-0   01:00:0a:c1:3b:28:14
 2021-05-12 08:06:01   00:0a:c1:3b:28:18   ipv4       192.168.222.21/24   maaaster-0-1   01:00:0a:c1:3b:28:18
 2021-05-12 08:06:01   00:0a:c1:3b:28:1c   ipv4       192.168.222.22/24   maaaster-0-2   01:00:0a:c1:3b:28:1c
 2021-05-12 07:53:33   52:54:00:1e:cd:ad   ipv4       192.168.222.45/24   -              01:52:54:00:1e:cd:ad

# oc get nodes
NAME           STATUS     ROLES    AGE     VERSION
maaaster-0-0   NotReady   master   2m16s   v1.21.0-rc.0+c4bd6f9
maaaster-0-1   NotReady   master   2m17s   v1.21.0-rc.0+c4bd6f9
maaaster-0-2   NotReady   master   2m17s   v1.21.0-rc.0+c4bd6f9

# oc get bmh -A
NAMESPACE               NAME                 STATE   CONSUMER                   ONLINE   ERROR
openshift-machine-api   bz1937696-master-0           bz1937696-w8sm6-master-0   true
openshift-machine-api   bz1937696-master-1           bz1937696-w8sm6-master-1   true
openshift-machine-api   bz1937696-master-2           bz1937696-w8sm6-master-2   true
openshift-machine-api   bz1937696-worker-0                                      true
```

I have checked in the source code where the name of the bmh resource is coming from and it's based on the VBMC Domain Name (with replacement of "_" into "-"). In your setup it looks to me like the VBMC Domain Name does not contain the cluster number, but in order to confirm this I'd need to see the `vbmc list` output.

Actually we provision our nodes with a redfish, so:

[root@ocp-edge32 ~]# vbmc list
-bash: vbmc: command not found
[root@ocp-edge32 ~]# virsh list
 Id   Name         State
----------------------------
 1    master-0-0   running
 2    master-0-1   running
 3    master-0-2   running
 4    worker-0-0   running
 5    worker-0-1   running
 6    worker-0-2   running

Any chance I could get the manifests used to deploy this virtual environment (it's mainly about those libvirt VMs and their redfish interfaces) and/or access to the environment when the problem appears?

We do not have a manifest file for Assisted Installer, since it all happens in the background. The user is not exposed to the manifests Assisted generates for the installation. I attach the install-config.yaml file, maybe it will be helpful.

Elena, I'm a bit confused about the setup where you are testing/seeing this issue. In your last comment, you mentioned you don't have access to the manifests (XML files) for libvirt/redfish but, in the comment previous to your last comment, you mentioned that you are provisioning the nodes with redfish. Furthermore, throughout the BZ you have been giving us (thanks for that) CLI output from your `virsh` commands.
Are these commands you are running? Or is the customer running them? Unfortunately, we are not able to reproduce this issue and, unless we can replicate it in an environment similar/equal to yours, we won't be able to help much. Could you please describe (step by step) how this environment is being set up and how this can be reproduced?

Hey Flavio o/ This is particular to VM based QE setups used with assisted installer deployments. Currently the VMs are created with an internal QE tool as libvirt VMs. In this case redfish is not involved since the VMs boot from ISO. We can get a reproducer environment for debugging but I guess the main question is how the names for bmh objects are created by assisted installer, since the user doesn't create them as in the IPI case, and whether we have any option to adjust/influence how they're generated.

>the main question is how are the names for bmh objects created by assisted installer since the user doesn't create them

Please check the comment above, https://bugzilla.redhat.com/show_bug.cgi?id=1937696#c2. The name of the BMH resource is based on the VBMC/BMC Domain Name (with replacement of "_" into "-"). For this reason, if in your QE environment you are using VBMC, we need to check the output of `vbmc list` to confirm whether those names there are correct.

Note that the name returned by `virsh list` is a name of the libvirt domain and is not relevant for naming the BMH resource.

(In reply to Mat Kowalski from comment #11)
> >the main question is how are the names for bmh objects created by assisted installer since the user doesn't create them
>
> Please check the comment above,
> https://bugzilla.redhat.com/show_bug.cgi?id=1937696#c2. The name of the BMH
> resource is coming from and it's based on the VBMC/BMC Domain Name (with
> replacement of "_" into "-"). For this reason, if your QE environment you
> are using VBMC, we need to check the output of `vbmc list` to confirm
> whether those names there are correct.
>
> Note that the name returned by `virsh list` is a name of the libvirt domain
> and is not relevant for naming the BMH resource.

In this case (assisted-installer) there's no BMC involved as the VMs boot from ISO.

I see... In that case it would be extremely helpful for us to get access to the QE environment where the issue is happening. From what I see in the baremetal-operator codebase, there are only 2 flows setting the BMH resource name and none of them performs the stripping that happened here. Because of this, without more debugging it's difficult for us to point to where exactly the issue is coming from.

In the install-config.yaml I have found the following

```
apiVersion: v1
[...]
metadata:
  name: ocp-edge-cluster-0
[...]
platform:
  baremetal:
    provisioningNetwork: Disabled
    apiVIP: 192.168.123.147
    ingressVIP: 192.168.123.132
    hosts:
    - name: openshift-master-0
      role: master
      [...]
    - name: openshift-worker-0
      role: worker
      [...]
```

The install-config.yaml in this part is being generated by the following parts of the dev-scripts

- https://github.com/openshift-metal3/dev-scripts/blob/master/ocp_install_env.sh#L234
- https://github.com/openshift-metal3/dev-scripts/blob/master/utils.sh#L153

It looks like it's the `utils.sh` file that was responsible for filling the node names in install-config.yaml in their current form and this is how the name of the BareMetalHost resource got populated. For now I consider it an issue in dev-scripts and not in assisted-service or baremetal-operator. I'll keep investigating how easily we could fix that to preserve some names here.
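The naming rule mentioned in the comments above (BMH name taken from the VBMC domain name, with "_" replaced by "-") can be illustrated with a minimal Go sketch. The helper name `bmhNameFromDomain` is hypothetical and not taken from dev-scripts, assisted-service or baremetal-operator; the sample domain names are the ones from the `vbmc list` output earlier in this bug.

```go
package main

import (
	"fmt"
	"strings"
)

// bmhNameFromDomain is a hypothetical helper, written only for illustration.
// It mirrors the rule described in the thread: the BMH resource name is the
// VBMC domain name with every "_" replaced by "-".
func bmhNameFromDomain(domain string) string {
	return strings.ReplaceAll(domain, "_", "-")
}

func main() {
	// Domain names copied from the `vbmc list` output in this bug.
	for _, d := range []string{"bz1937696_master_0", "bz1937696_worker_0"} {
		fmt.Printf("%s -> %s\n", d, bmhNameFromDomain(d))
	}
}
```

Under this rule a BMH name can only contain what the domain name already contains, which is why the `vbmc list` output was requested to check for a missing cluster index.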
+++ Additional debug info

```
[root@sealusa2 ~]# oc -n openshift-machine-api get machines
NAME                                      PHASE     TYPE   REGION   ZONE   AGE
ocp-edge-cluster-0-b5kjk-master-0         Running                          90m
ocp-edge-cluster-0-b5kjk-master-1         Running                          90m
ocp-edge-cluster-0-b5kjk-master-2         Running                          90m
ocp-edge-cluster-0-b5kjk-worker-0-mp4wm   Running                          85m
ocp-edge-cluster-0-b5kjk-worker-0-s28pp   Running                          85m

[root@sealusa2 ~]# oc -n openshift-machine-api get baremetalhosts
NAME                 STATUS       PROVISIONING STATUS   CONSUMER                                  BMC   HARDWARE PROFILE   ONLINE   ERROR
openshift-master-0   discovered   unmanaged             ocp-edge-cluster-0-b5kjk-master-0                                  true
openshift-master-1   discovered   unmanaged             ocp-edge-cluster-0-b5kjk-master-1                                  true
openshift-master-2   discovered   unmanaged             ocp-edge-cluster-0-b5kjk-master-2                                  true
openshift-worker-0   discovered   unmanaged             ocp-edge-cluster-0-b5kjk-worker-0-mp4wm                            true
openshift-worker-1   discovered   unmanaged             ocp-edge-cluster-0-b5kjk-worker-0-s28pp                            true

[root@sealusa2 ~]# oc -n openshift-machine-api describe machines ocp-edge-cluster-0-b5kjk-master-0
Name:         ocp-edge-cluster-0-b5kjk-master-0
Namespace:    openshift-machine-api
Labels:       machine.openshift.io/cluster-api-cluster=ocp-edge-cluster-0-b5kjk
              machine.openshift.io/cluster-api-machine-role=master
              machine.openshift.io/cluster-api-machine-type=master
Annotations:  metal3.io/BareMetalHost: openshift-machine-api/openshift-master-0
[...]
```

For the reference, when creating BareMetalHost resources the openshift/installer uses names from `.platform.baremetal.hosts.name` to fill the `name` field of the BMH

- https://github.com/openshift/installer/blob/master/pkg/asset/machines/baremetal/hosts.go#L37

Please note that dev-scripts is not used on this setup, install-config.yaml was downloaded from assisted-installer

I see, in this case it's the following code generating the names in install-config.yaml

- https://github.com/openshift/assisted-service/blob/master/internal/installcfg/installcfg.go#L296
- https://github.com/openshift/assisted-service/blob/master/internal/installcfg/installcfg.go#L146

```
[...]
prefix := "openshift-master-"
index := masterIdx
if host.Role == models.HostRoleWorker {
	prefix = "openshift-worker-"
	index = workerIdx
}
[...]
```

I'll continue on Monday with checking if we could easily adapt the naming scheme. I can see that in the model we have `host.RequestedHostname` available so the path I will investigate is whether the logic could be easily changed into something more like

```
hosts[yamlHostIdx].Name = host.RequestedHostname
```

As I understand it, the BMHs are created and assigned to nodes without any particular order. The assisted installer doesn't recognize the pattern and the indexes that we use for the node names (worker-0-0, worker-0-1 etc...) and just creates BMH names with other indexes and it ends up confusing us a lot. In customer cases, the names of the nodes will not necessarily be indexed and then it will be clear to the user that the indexes in the BMH names are not related to anything.

Still, I see that this bug is on POST. Do we have a new logic for generating BMH names based on the node names? What should we expect to see?

We have the PR in review [1] that changes the logic of generating BMH names to use `GetHostnameForMsg()`.
The new install-config.yaml will look as below so that an aggregation of BMHs from across multiple clusters will give distinguishable names. This is an improvement for the operators, as they are usually managing multiple clusters, so having BMHs named in a way that makes them distinct certainly makes their lives easier.

Having said this, the idea of using an index (or even multiple indexes) is in principle not the best one, as internally the sorting of Nodes, BareMetalHost and Machine resources cannot be guaranteed to be stable. So yeah, the more indexes are used in various names, the bigger the chance of further confusion. I don't think we are able to create a silver bullet solving all the possible combinations, but the PR [1] should make the operations slightly easier.

[1] https://github.com/openshift/assisted-service/pull/1992

[2]

```
[...]
platform:
  baremetal:
    provisioningNetwork: Disabled
    apiVIP: 192.168.127.91
    ingressVIP: 192.168.127.23
    hosts:
    - name: test-infra-cluster-d63379ba-master-0
      role: master
      bootMACAddress: 02:00:00:46:6e:e7
      bootMode: legacy
    - name: test-infra-cluster-d63379ba-master-1
      role: master
      bootMACAddress: 02:00:00:20:c1:bd
      bootMode: legacy
[...]
```

Verified. The BMH names are now the same as the node names, which greatly reduces the confusion as to which object relates to which.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759