Description of problem:

Bare Metal IPI installation fails: the worker nodes are never provisioned, and the RHCOS image is not deployed.

Version-Release number of selected component (if applicable):
Tested on arm payloads >= 4.11.0-0.nightly-arm64-2022-05-31-155531

How reproducible:
Always (see attached install-config)

Steps to Reproduce:
1. Install a BM IPI cluster on ARM with a Managed Provisioning Network

Actual results:
No workers join the cluster, nor is RHCOS ever written to their disks.

oc get nodes
NAME                                                          STATUS   ROLES    AGE     VERSION
master-00.adistefa-61947.qeclusters.arm.eng.rdu2.redhat.com   Ready    master   7h55m   v1.23.3+ad897c4
master-01.adistefa-61947.qeclusters.arm.eng.rdu2.redhat.com   Ready    master   7h56m   v1.23.3+ad897c4
master-02.adistefa-61947.qeclusters.arm.eng.rdu2.redhat.com   Ready    master   7h55m   v1.23.3+ad897c4

[root@worker-00 core]# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0     7:0    0  47.1G  0 loop /run/ephemeral
loop1     7:1    0 849.5M  0 loop /sysroot
nvme0n1 259:0    0 894.3G  0 disk

[root@worker-00 core]# fdisk -l /dev/nvme0n1
Disk /dev/nvme0n1: 894.3 GiB, 960197124096 bytes, 1875385008 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Expected results:
Worker nodes are provisioned and ready.

Additional info:
The issue seems related to the rootDeviceHints.

install-config.yaml
---
apiVersion: v1
controlPlane:
  architecture: arm64
  hyperthreading: Enabled
  name: master
  platform:
    baremetal: {}
  replicas: 3
compute:
- architecture: arm64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 2
metadata:
  name: adistefa-61947
platform:
  baremetal:
    apiVIP: 192.168.90.224
    ingressVIP: 192.168.90.3
    provisioningNetworkCIDR: 172.22.224.0/24
    provisioningBridge: br-istefa-61947
    hosts:
    - name: master-00
      role: master
      bmc:
        address: ipmi://BMC_HOST
        disableCertificateVerification: true
        username: ** REDACTED **
        password: ** REDACTED **
      bootMACAddress: a0:36:9f:30:04:c6
      rootDeviceHints:
        deviceName: "/dev/nvme0n1"
      networkConfig:
        interfaces:
        - name: enP2p2s0f0
          type: ethernet
          state: up
          ipv4:
            enabled: true
            dhcp: true
          ipv6:
            enabled: true
            dhcp: true
        - name: enP2p2s0f1
          type: ethernet
          state: up
          ipv4:
            enabled: false
            dhcp: false
          ipv6:
            enabled: false
            dhcp: false
        - name: enP4p3s0u2u3c2
          type: ethernet
          state: up
          ipv4:
            enabled: false
            dhcp: false
          ipv6:
            enabled: false
            dhcp: false
    ******* x3 ******
    - name: worker-00
      role: worker
      bmc:
        address: ipmi://BMC_HOST
        disableCertificateVerification: true
        username: ** REDACTED **
        password: ** REDACTED **
      bootMACAddress: 00:1b:21:e4:3a:b1
      rootDeviceHints:
        deviceName: "/dev/nvme0n1"
      networkConfig:
        interfaces:
        - name: enP2p2s0f0
          type: ethernet
          state: up
          ipv4:
            enabled: true
            dhcp: true
          ipv6:
            enabled: true
            dhcp: true
        - name: enP2p2s0f1
          type: ethernet
          state: up
          ipv4:
            enabled: false
            dhcp: false
          ipv6:
            enabled: false
            dhcp: false
        - name: enP4p3s0u2u3c2
          type: ethernet
          state: up
          ipv4:
            enabled: false
            dhcp: false
          ipv6:
            enabled: false
            dhcp: false
    ******* x2 ******
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16
  machineNetwork:
  - cidr: 192.168.90.0/24
  networkType: OpenShiftSDN
publish: External
baseDomain: qeclusters.arm.eng.rdu2.redhat.com
pullSecret: ** REDACTED **
sshKey: ....

ironic agent container log: http://pastebin.test.redhat.com/1055443
Additional info: it is not related to the rootDeviceHints. The inspection finishes successfully from the worker's IPA, and the result is pushed back to the ironic-inspector container on the master. However, after the BareMetalHost resource transitions to the provisioning state, the metal3-baremetal-operator looks for the networkData key in the worker-$N-network-config-secret, which is not available, because the network data is stored under the nmstate key, pushed as a manifest by the installer.

{"level":"error","ts":1654276012.1460414,"logger":"controller.baremetalhost","msg":"Reconciler error","reconciler group":"metal3.io","reconciler kind":"BareMetalHost","name":"worker-01","namespace":"openshift-machine-api","error":"action \"provisioning\" failed: failed to provision: could not retrieve network data: Secret worker-01-network-config-secret does not contain key networkData","errorVerbose":"Secret worker-01-network-config-secret does not contain key networkData\ncould not retrieve network data\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).getConfigDrive\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:1370\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).Provision\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:1490\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).actionProvisioning\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:1072\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).handleProvisioning\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:482\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:199\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:250\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_arm64.s:1259\nfailed to 
provision\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).actionProvisioning\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:1081\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).handleProvisioning\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:482\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:199\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:250\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_arm64.s:1259\naction \"provisioning\" failed\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:254\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_arm64.s:1259","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}

Changing the key name in the secret from `nmstate` to `networkData`, after it is generated by the installer, leads the baremetal operator and ironic to push rhcos.
The workers are able to join the cluster this way:

oc get nodes
NAME                                                            STATUS   ROLES    AGE     VERSION
master-00.adistefa-631318b.qeclusters.arm.eng.rdu2.redhat.com   Ready    master   27m     v1.24.0+899fd9f
master-01.adistefa-631318b.qeclusters.arm.eng.rdu2.redhat.com   Ready    master   26m     v1.24.0+899fd9f
master-02.adistefa-631318b.qeclusters.arm.eng.rdu2.redhat.com   Ready    master   26m     v1.24.0+899fd9f
worker-00.adistefa-631318b.qeclusters.arm.eng.rdu2.redhat.com   Ready    worker   2m53s   v1.24.0+899fd9f
worker-01.adistefa-631318b.qeclusters.arm.eng.rdu2.redhat.com   Ready    worker   93s     v1.24.0+899fd9f
> Changing the key name in the secret from `nmstate` to `networkData`, after it is generated by the installer, leads the baremetal operator and ironic to push rhcos.

Adding the key networkData with an empty value is enough. `nmstate` is still needed to guarantee a custom network configuration for the hosts.

The current workaround I'm applying at installation time is:

cd $PATH_TO_INSTALL_DIR/openshift
for f in 99_openshift-cluster-api_host-network-config-secrets-*
do
  yq eval ".data.networkData = \"\"" -i $f
done

Moving the bug to the baremetal-operator component.
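To make the mechanics concrete, here is a minimal runnable Go sketch of why the empty key is enough. The plain map stands in for the secret's data section (the real lookups happen in the BMO and in the image-customization-controller); the key names are the real ones discussed in this bug:

package main

import "fmt"

func main() {
	// worker-$N-network-config-secret after the yq loop above:
	secretData := map[string][]byte{
		"nmstate":     []byte("interfaces: [...]"), // consumed by the image-customization-controller
		"networkData": []byte(""),                  // empty key added by the workaround
	}
	// The BMO only checks that the "networkData" key exists; an empty value is fine.
	if v, ok := secretData["networkData"]; ok {
		fmt.Printf("BMO finds networkData=%q and can proceed to provisioning\n", v)
	}
}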
After chatting with Alessandro, I'm proposing to set the Severity to "High" and leaving the Blocker? flag to the assignee team to set.
So the secret is created here, with the key "nmstate":
https://github.com/openshift/installer/blob/8b3d14d/pkg/asset/machines/baremetal/hosts.go#L50

The BMO reads it here, with the name "networkData":
https://github.com/openshift/baremetal-operator/blob/1adcaab/controllers/metal3.io/host_config_data.go#L85

Fixing the installer to match the BMO shouldn't be difficult, but I'd like to first figure out why/how this works in 4.10.
Hello Derek, matching the name alone (s/nmstate/networkData) does not seem sufficient. The nmstate key is still needed to configure a custom network, while the networkData value can also be empty (at least for the use cases I tried). If you change the "nmstate" key to "networkData", any custom network configuration is ignored.

This key is never referenced by the baremetal-operator: if I'm not wrong, it is handled by the image-customization-controller:
https://github.com/openshift/image-customization-controller/blob/main/pkg/imageprovider/rhcos.go#L49
(In reply to Derek Higgins from comment #6) > The BMO reads it here, with the name "networkData" > https://github.com/openshift/baremetal-operator/blob/1adcaab/controllers/ > metal3.io/host_config_data.go#L85 No it doesn't. The machine-image-customization-controller reads it (with the key "nmstate") here: https://github.com/openshift/image-customization-controller/blob/main/pkg/imageprovider/rhcos.go#L49
To me, the issue seems to be that the upstream metal3 operator:

1. considers the secret and its networkData field optional (host_config_data.go#L74);
2. does not consider that the secret (i.e., a secret with the same name) could have other fields to be consumed by another component;
3. has getSecretData(.) return an error that is propagated to the caller functions when the secret is found but the $key field is not (host_config_data.go#L82 and host_config_data.go#L42).

https://github.com/openshift/baremetal-operator/blob/1adcaab/controllers/metal3.io/host_config_data.go

For this reason, I set the component to the BMO: adding the empty key in the manifests at install time would also mean asking users to add this field when expanding the cluster (see the release_pending doc from https://bugzilla.redhat.com/show_bug.cgi?id=2082671).
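A minimal sketch of the tolerant behaviour point 3 argues for, assuming the fix lands in the BMO; the function shape and names are simplified from host_config_data.go, and this is not the actual patch:

package main

import "fmt"

// getSecretData mirrors the shape of the BMO helper discussed above, but treats a
// missing key as "no data" instead of an error, which is the change proposed here.
func getSecretData(data map[string][]byte, key string) (string, error) {
	value, ok := data[key]
	if !ok {
		// Proposed: a missing key is not fatal; fall back to empty network data.
		return "", nil
	}
	return string(value), nil
}

func main() {
	secret := map[string][]byte{"nmstate": []byte("interfaces: [...]")}
	networkData, err := getSecretData(secret, "networkData")
	fmt.Printf("networkData=%q err=%v\n", networkData, err) // networkData="" err=<nil>
}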
Yes, I think that's correct. BMO should stop treating this as an error. We should *not* change the Secret format used by the installer or what we describe in the docs. What's confusing is how this didn't fail in exactly the same way in 4.10, given that the code doesn't appear to have changed.
(In reply to Zane Bitter from comment #10)
> Yes, I think that's correct. BMO should stop treating this as an error. We
> should *not* change the Secret format used by the installer or what we
> describe in the docs.
> 
> What's confusing is how this didn't fail in exactly the same way in 4.10,
> given that the code doesn't appear to have changed.

I think I see the difference: the BMH for 4.11 contains "preprovisioningNetworkDataName"; it's not present in 4.10.

$ cat ./4.11/release/namespaces/openshift-machine-api/metal3.io/baremetalhosts/ostest-worker-1.yaml | yq .spec.preprovisioningNetworkDataName
"ostest-worker-1-network-config-secret"

$ cat ./4.10/release/namespaces/openshift-machine-api/metal3.io/baremetalhosts/ostest-worker-1.yaml | yq .spec.preprovisioningNetworkDataName
null

So in 4.10 we don't get past this check in NetworkData():
https://github.com/openshift/baremetal-operator/blob/1adcaab/controllers/metal3.io/host_config_data.go#L74

if networkData == nil {
	hcd.log.Info("NetworkData is not set, returning empty data")
	return "", nil
}

In 4.11 networkData is set, and we instead attempt to get "networkData" from the secret and fail because it doesn't exist:

return hcd.getSecretData(
	networkData.Name,
	namespace,
	"networkData",
)
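Condensed into a runnable sketch (simplified names, not the real controller code), the two code paths behave like this:

package main

import "fmt"

// networkData condenses the flow described above: with no preprovisioningNetworkDataName
// reference (4.10), NetworkData() returns early; with the reference set (4.11), it looks
// up the "networkData" key and fails because the installer only wrote "nmstate".
func networkData(secretName string, secretData map[string][]byte) (string, error) {
	if secretName == "" { // 4.10: reference absent
		return "", nil // "NetworkData is not set, returning empty data"
	}
	v, ok := secretData["networkData"] // 4.11: reference present
	if !ok {
		return "", fmt.Errorf("Secret %s does not contain key networkData", secretName)
	}
	return string(v), nil
}

func main() {
	secret := map[string][]byte{"nmstate": []byte("interfaces: [...]")}
	_, err410 := networkData("", secret)
	fmt.Println("4.10:", err410) // 4.10: <nil>
	_, err411 := networkData("ostest-worker-1-network-config-secret", secret)
	fmt.Println("4.11:", err411) // 4.11: Secret ... does not contain key networkData
}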
(In reply to Derek Higgins from comment #11)
> I think I see the difference, the bmh for 4.11 contains
> "preprovisioningNetworkDataName" its not present in 4.10
> 
> So In 4.10 we don't get passed this check in NetworkData()
> 
> In 4.11 networkData is set and we instead attempts to get "networkData" from
> the secret and fail as it doesn't exist

Hi Derek, I'm not 100% sure we found the actual root cause yet. 4.10 works well, but if I try to generate manifests using the 4.10 openshift-baremetal-install, the preProvisioningNetworkDataName field is generated too (again, when some network configuration is provided for a given host). However, as you said, it is unavailable later when the cluster installs.

The secret is stored, and it is owned by both the preProvisioningImage and the BareMetalHost, but the ref in the BareMetalHost is likely deleted between the inspection and provisioning states. Who is deleting it? Why?

As this is tested on another infra, I cannot yet be sure that the nmstate configuration is actually honored. I want to point this out to make sure the patch we're going to apply doesn't hide some other issue (in 4.11 or also in previous versions).
Here is an excerpt of its initial processing in the BMO:

{"level":"info","ts":1655120569.8676925,"logger":"controllers.BareMetalHost","msg":"registering and validating access to management controller","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting","credentials":{"credentials":{"name":"worker-01-bmc-secret","namespace":"openshift-machine-api"},"credentialsVersion":"13941"}},
{"level":"info","ts":1655120569.8677754,"logger":"controllers.BareMetalHost.secret_manager","msg":"setting secret owner reference","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting","secret":"worker-01-network-config-secret","secretNamespace":"openshift-machine-api","ownerKind":"BareMetalHost","owner":"openshift-machine-api/worker-01","ownerUID":"f0462caa-4e41-48e8-8f12-e30abbf70df9"},
{"level":"info","ts":1655120569.8940384,"logger":"controllers.BareMetalHost","msg":"network data in pre-provisioning image is out of date","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting","latestVersion":"15949","currentVersion":"15935"},
{"level":"info","ts":1655120569.923056,"logger":"provisioner.ironic","msg":"current provision state","host":"openshift-machine-api~worker-01","lastError":"","current":"manageable","target":""},
{"level":"info","ts":1655120569.9230728,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting","requeue":true,"after":300},
{"level":"info","ts":1655120569.9231105,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/worker-01"},
{"level":"info","ts":1655120569.9553645,"logger":"controllers.BareMetalHost","msg":"registering and validating access to management controller","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting","credentials":{"credentials":{"name":"worker-01-bmc-secret","namespace":"openshift-machine-api"},"credentialsVersion":"13941"}},
{"level":"info","ts":1655120569.9554284,"logger":"controllers.BareMetalHost","msg":"network data in pre-provisioning image is out of date","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting","latestVersion":"15949","currentVersion":"15935"},
{"level":"info","ts":1655120569.9849916,"logger":"provisioner.ironic","msg":"current provision state","host":"openshift-machine-api~worker-01","lastError":"","current":"manageable","target":""},
{"level":"info","ts":1655120569.9850068,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting","requeue":true,"after":300},
{"level":"info","ts":1655120570.0206833,"logger":"controllers.HostFirmwareSettings","msg":"created new firmwareSchema resource","hostfirmwaresettings":"openshift-machine-api/worker-01"},
{"level":"info","ts":1655120570.0581303,"logger":"controllers.HostFirmwareSettings","msg":"start","hostfirmwaresettings":"openshift-machine-api/worker-01"},
{"level":"info","ts":1655120570.082164,"logger":"controllers.HostFirmwareSettings","msg":"retrieving firmware settings and saving to resource","hostfirmwaresettings":"openshift-machine-api/worker-01","node":"0d0b88ca-619b-453f-9bb9-a62c18ca8df1"},
{"level":"info","ts":1655120570.2091157,"logger":"provisioner.ironic","msg":"retrieved BIOS settings for node","host":"openshift-machine-api~worker-01","node":"0d0b88ca-619b-453f-9bb9-a62c18ca8df1","size":329},
{"level":"info","ts":1655120570.209288,"logger":"controllers.HostFirmwareSettings","msg":"getting firmwareSchema","hostfirmwaresettings":"openshift-machine-api/worker-01"}, {"level":"info","ts":1655120570.210087,"logger":"controllers.HostFirmwareSettings","msg":"found existing firmwareSchema resource","hostfirmwaresettings":"openshift-machine-api/worker-01"}, {"level":"info","ts":1655120570.2285674,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/worker-01"}, {"level":"info","ts":1655120570.2627969,"logger":"controllers.BareMetalHost","msg":"registering and validating access to management controller","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting","credentials":{"credentials":{"name":"worker-01-bmc-secret","namespace":"openshift-machine-api"},"credentialsVersion":"13941"}}, {"level":"info","ts":1655120570.2628784,"logger":"controllers.BareMetalHost","msg":"pending PreprovisioningImage not ready","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting"}, {"level":"info","ts":1655120570.2924314,"logger":"provisioner.ironic","msg":"current provision state","host":"openshift-machine-api~worker-01","lastError":"","current":"manageable","target":""}, {"level":"info","ts":1655120570.2924478,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting","requeue":true,"after":300}, {"level":"info","ts":1655120570.292479,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/worker-01"}, {"level":"info","ts":1655120570.3265345,"logger":"controllers.BareMetalHost","msg":"registering and validating access to management controller","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting","credentials":{"credentials":{"name":"worker-01-bmc-secret","namespace":"openshift-machine-api"},"credentialsVersion":"13941"}}, {"level":"info","ts":1655120570.3266103,"logger":"controllers.BareMetalHost","msg":"pending PreprovisioningImage not ready","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting"}, {"level":"info","ts":1655120570.35558,"logger":"provisioner.ironic","msg":"current provision state","host":"openshift-machine-api~worker-01","lastError":"","current":"manageable","target":""}, {"level":"info","ts":1655120570.3555937,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting","requeue":true,"after":300}, {"level":"info","ts":1655120570.5948565,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/worker-01"},
adistefa: Ya, I can see in 4.10 that the BMH is created with "preprovisioningNetworkDataName" and then it is later removed. This allows 4.10 to pass, but I'm not sure what's removing it.

zbitter: any idea what removes it? Regardless, should we proceed with the BMO not treating this as an error (as in https://github.com/openshift/baremetal-operator/pull/227)?
I have no idea what could be removing it?!
Oh, I bet I know what is removing it. It will be a controller that has vendored an old version of the BMH API. Almost certainly the culprit in this case will be CAPBM. Most likely the vendoring was updated some time in 4.11 and that caused the bug to show up. So we should make this not be an error and backport that change to 4.10 before we update the vendoring in 4.10.
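For illustration, here is a self-contained Go sketch of the failure mode described above, under the assumption that the stale client round-trips the object through its old struct; specNew/specOld are hypothetical stand-ins for the new and old vendored BMH specs, not the real types:

package main

import (
	"encoding/json"
	"fmt"
)

// specNew stands in for the current BMH spec, which carries the new field.
type specNew struct {
	Image                          string `json:"image,omitempty"`
	PreprovisioningNetworkDataName string `json:"preprovisioningNetworkDataName,omitempty"`
}

// specOld stands in for an old vendored copy of the BMH API that predates the field.
type specOld struct {
	Image string `json:"image,omitempty"`
}

func main() {
	in := specNew{
		Image:                          "rhcos",
		PreprovisioningNetworkDataName: "ostest-worker-1-network-config-secret",
	}
	raw, _ := json.Marshal(in)

	// A controller built against the old struct decodes the object...
	var stale specOld
	_ = json.Unmarshal(raw, &stale)

	// ...and writes it back, silently dropping the field it has never heard of.
	out, _ := json.Marshal(stale)
	fmt.Println(string(out)) // {"image":"rhcos"}: preprovisioningNetworkDataName is gone
}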
*** Bug 2098430 has been marked as a duplicate of this bug. ***
Verified fix on 4.11.0-0.nightly-2022-06-21-040754
Even if the configuration tested by Adina (static IPs configured for each of the nodes in the install-config) is different from the initial one, and given that no arm64 payloads are available yet for reasons beyond this bug, I'm moving this to verified.

The log of the verification is at https://auto-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/job/Private_Folders/job/yporagpa/job/networkConfig_Demo/302/console
*** Bug 2098424 has been marked as a duplicate of this bug. ***
(In reply to aleskandro from comment #4)
> > Changing the key name in the secret from `nmstate` to `networkData`, after it is generated by the installer, leads the baremetal operator and ironic to push rhcos.
> 
> Adding the key networkData with an empty value is enough. `nmstate` is still
> needed to guarantee a custom network configuration for the hosts.
> 
> The current workaround I'm applying at installation time is like:
> 
> cd $PATH_TO_INSTALL_DIR/openshift
> for f in 99_openshift-cluster-api_host-network-config-secrets-*
> do
> yq eval ".data.networkData = \"\"" -i $f
> done
> 
> Moving the bug to the baremetal-operator component.

Hi, how do I apply this workaround? I don't see 99_openshift-cluster-api_host-network-config-secrets-* in my $PATH_TO_INSTALL_DIR/openshift directory. All I have are the manifests below; if it helps, my IPI install is not using a provisioning network.

99_baremetal-provisioning-config.yaml
99_kubeadmin-password-secret.yaml
99_openshift-cluster-api_host-bmc-secrets-0.yaml
99_openshift-cluster-api_host-bmc-secrets-1.yaml
99_openshift-cluster-api_host-bmc-secrets-2.yaml
99_openshift-cluster-api_host-bmc-secrets-3.yaml
99_openshift-cluster-api_host-bmc-secrets-4.yaml
99_openshift-cluster-api_host-bmc-secrets-5.yaml
99_openshift-cluster-api_host-bmc-secrets-6.yaml
99_openshift-cluster-api_hosts-0.yaml
99_openshift-cluster-api_hosts-1.yaml
99_openshift-cluster-api_hosts-2.yaml
99_openshift-cluster-api_hosts-3.yaml
99_openshift-cluster-api_hosts-4.yaml
99_openshift-cluster-api_hosts-5.yaml
99_openshift-cluster-api_hosts-6.yaml
99_openshift-cluster-api_master-machines-0.yaml
99_openshift-cluster-api_master-machines-1.yaml
99_openshift-cluster-api_master-machines-2.yaml
99_openshift-cluster-api_master-user-data-secret.yaml
99_openshift-cluster-api_worker-machineset-0.yaml
99_openshift-cluster-api_worker-user-data-secret.yaml
99_openshift-machineconfig_99-master-ssh.yaml
99_openshift-machineconfig_99-worker-ssh.yaml
openshift-install-manifests.yaml
Hello,

(In reply to denise.ochoa-mendoza from comment #25)
> Hi how do I apply this workaround, I dont see
> 99_openshift-cluster-api_host-network-config-secrets-* in my
> $PATH_TO_INSTALL_DIR/openshift directory all I have are the following
> manifests below, if it helps my IPI install is not using a provisioning
> network.

This bug blocked the installation of 4.11 clusters (not yet GA) with a custom network configuration defined in the install-config at the key platform.baremetal.hosts[].networkConfig (or in the related secret, to expand already existing clusters). Since this bug is verified, the workaround is no longer needed if you install with the latest nightlies.

Note that the installer will not create the network-config secrets if you are installing a 4.11 cluster and not applying a custom network configuration as above.

- Which version are you installing?
- Do you have a custom network configuration defined in the install-config?
Created attachment 1896134 [details]
IPI Installation, no provisioning network, no workers deployed

Hi,

here is the must-gather.
(In reply to denise.ochoa-mendoza from comment #30)
> Created attachment 1896134 [details]
> IPI Installation, no provisioning network, no workers deployed
> 
> Hi,
> 
> here is the must-gather.

Hello Denise, as anticipated, this bug is not related to your case.

Looking at your install-config and the status of the cluster from the must-gather you uploaded, the metal3-ironic-api container in the metal3 deployment's pod has not started yet. It's waiting for the clusterProvisioningIP to be assigned to one of the available nodes, and no node has 192.168.22.158 as its IP. If you feel this is a bug and not a misconfiguration of your network or install-config, open a new bug so that the proper team can support you on this.

omg logs metal3-85bc5848d-jq8dr -c metal3-ironic-api
2022-07-08T16:33:39.342637161Z Waiting for 192.168.22.158 to be configured on an interface ....

One thing worth noting, however, is that the metal3 pod looks good, with all of its containers ready, even though it isn't. This degrades the user experience, makes the debugging of such issues harder, and should be taken into account by the metal3 dev team IMO.
(In reply to aleskandro from comment #31) > one thing worth noting, however, is that the metal3 pod looks good with all > of its containers ready, even if it isn't. This degrades the user > experience, makes the debugging of such issues harder, and should be taken > into account by the metal3 dev team IMO. This appears to be because the configuration sets all of the provisioning network data (MAC addresses, IP, CIDR) but also sets "provisioningNetwork: Disabled". If the provisioning network were enabled then an init container would set the provisioning VIP and not permit the Pod to start up until it was done. I don't recall if there is any specific behaviour tied to providing a provisioning IP when the provisioning network is disabled, or if we should just ignored the rest of the data when we see "provisioningNetwork: Disabled". If not the latter then there probably should be an init container that checks that the IP is available before permitting the pod to start up.
(In reply to Zane Bitter from comment #32)
> This appears to be because the configuration sets all of the provisioning
> network data (MAC addresses, IP, CIDR) but also sets "provisioningNetwork:
> Disabled". If the provisioning network were enabled then an init container
> would set the provisioning VIP and not permit the Pod to start up until it
> was done.

What I meant is that, regardless of the specific case, I would set proper startup/readiness/liveness probes so that a user is aware when some containers in the metal3 pods are not starting or not ready. This also applies to other containers there, like the dnsmasq one (see https://bugzilla.redhat.com/show_bug.cgi?id=2081734).
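The same check could back a startup probe; here is a hedged sketch using k8s.io/api types (assuming v0.23+, where the handler field is named ProbeHandler; the exec command is a hypothetical placeholder, not something shipped in the metal3 images):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// buildStartupProbe shows one way a wait-for-IP check could be wired as a startup
// probe on the metal3-ironic-api container, so the container is not reported started
// (and the pod not Ready) until the provisioning IP is configured.
func buildStartupProbe(ip string) *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			Exec: &corev1.ExecAction{
				// Hypothetical check command; the real images may need a different tool.
				Command: []string{"sh", "-c", fmt.Sprintf("ip -o addr show | grep -q '%s/'", ip)},
			},
		},
		PeriodSeconds:    5,
		FailureThreshold: 60, // tolerate up to ~5 minutes before the kubelet gives up
	}
}

func main() {
	p := buildStartupProbe("192.168.22.158")
	fmt.Println(p.Exec.Command)
}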
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069