Bug 2092650 - [BM IPI with Provisioning Network] Worker nodes are not provisioned: ironic-agent is stuck before writing into disks
Summary: [BM IPI with Provisioning Network] Worker nodes are not provisioned: ironic-agent is stuck before writing into disks
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.11
Hardware: aarch64
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Derek Higgins
QA Contact: aleskandro
URL:
Whiteboard:
Duplicates: 2098424 2098430
Depends On:
Blocks: 2097695 2098658
 
Reported: 2022-06-01 23:10 UTC by aleskandro
Modified: 2022-08-10 11:16 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2097695
Environment:
Last Closed: 2022-08-10 11:15:49 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Github openshift baremetal-operator pull 227 (Merged): Bug 2092650: Stop treating missing network as fatal error (last updated 2022-06-21 08:42:41 UTC)
Red Hat Product Errata RHSA-2022:5069 (last updated 2022-08-10 11:16:05 UTC)

Description aleskandro 2022-06-01 23:10:01 UTC
Description of problem:

Bare Metal IPI installation fails: the worker nodes are never provisioned, and the RHCOS image is not deployed.

Version-Release number of selected component (if applicable):
Tested on arm64 payloads >= 4.11.0-0.nightly-arm64-2022-05-31-155531

How reproducible: always (see attached install-config)


Steps to Reproduce:
1. Install a BM IPI cluster on ARM with Managed Provisioning Network

Actual results:

No workers join the cluster, and RHCOS is never written to their disks.

oc get nodes
NAME                                                          STATUS   ROLES    AGE     VERSION
master-00.adistefa-61947.qeclusters.arm.eng.rdu2.redhat.com   Ready    master   7h55m   v1.23.3+ad897c4
master-01.adistefa-61947.qeclusters.arm.eng.rdu2.redhat.com   Ready    master   7h56m   v1.23.3+ad897c4
master-02.adistefa-61947.qeclusters.arm.eng.rdu2.redhat.com   Ready    master   7h55m   v1.23.3+ad897c4
openshift-qe-bastion adistefa-61947 # 

[root@worker-00 core]# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0     7:0    0  47.1G  0 loop /run/ephemeral
loop1     7:1    0 849.5M  0 loop /sysroot
nvme0n1 259:0    0 894.3G  0 disk 
[root@worker-00 core]# fdisk -l /dev/nvme0n1 
Disk /dev/nvme0n1: 894.3 GiB, 960197124096 bytes, 1875385008 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
[root@worker-00 core]# 


Expected results:

Worker nodes are provisioned and ready.

Additional info:

The issue seems related to the rootDeviceHints.

install-config.yaml

---
apiVersion: v1
controlPlane:
  architecture: arm64
  hyperthreading: Enabled
  name: master
  platform:
    baremetal: {}
  replicas: 3
compute:
- architecture: arm64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 2
metadata:
  name: adistefa-61947
platform:
  baremetal:
    apiVIP: 192.168.90.224
    ingressVIP: 192.168.90.3
    provisioningNetworkCIDR: 172.22.224.0/24
    provisioningBridge: br-istefa-61947
    hosts:
    - name: master-00
      role: master
      bmc:
        address: ipmi://BMC_HOST
        disableCertificateVerification: true
        username: ** REDACTED **
        password: ** REDACTED **
      bootMACAddress: a0:36:9f:30:04:c6
      rootDeviceHints:
        deviceName: "/dev/nvme0n1"
      networkConfig:
        interfaces:
        - name: enP2p2s0f0
          type: ethernet
          state: up
          ipv4:
            enabled: true
            dhcp: true
          ipv6:
            enabled: true
            dhcp: true
        - name: enP2p2s0f1
          type: ethernet
          state: up
          ipv4:
            enabled: false
            dhcp: false
          ipv6:
            enabled: false
            dhcp: false
        - name: enP4p3s0u2u3c2
          type: ethernet
          state: up
          ipv4:
            enabled: false
            dhcp: false
          ipv6:
            enabled: false
            dhcp: false
    ******* x3 ******
    - name: worker-00
      role: worker
      bmc:
        address: ipmi://BMC_HOST
        disableCertificateVerification: true
        username: ** REDACTED **
        password: ** REDACTED **
      bootMACAddress: 00:1b:21:e4:3a:b1
      rootDeviceHints:
        deviceName: "/dev/nvme0n1"
      networkConfig:
        interfaces:
        - name: enP2p2s0f0
          type: ethernet
          state: up
          ipv4:
            enabled: true
            dhcp: true
          ipv6:
            enabled: true
            dhcp: true
        - name: enP2p2s0f1
          type: ethernet
          state: up
          ipv4:
            enabled: false
            dhcp: false
          ipv6:
            enabled: false
            dhcp: false
        - name: enP4p3s0u2u3c2
          type: ethernet
          state: up
          ipv4:
            enabled: false
            dhcp: false
          ipv6:
            enabled: false
            dhcp: false
    ******* x2 ******
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16
  machineNetwork:
  - cidr: 192.168.90.0/24
  networkType: OpenShiftSDN
publish: External
baseDomain: qeclusters.arm.eng.rdu2.redhat.com
pullSecret: ** REDACTED **
sshKey: ....


ironic agent container log: http://pastebin.test.redhat.com/1055443

Comment 2 aleskandro 2022-06-03 18:41:17 UTC
Additional info:

It is not related to the rootDeviceHints. The inspection finishes successfully from the worker's IPA, and the result is pushed back to the ironic-inspector container on the master.

However, after the BareMetalHost resource transitions to the provisioning state, the metal3-baremetal-operator looks for the networkData key in the worker-$N-network-config-secret. That key is not available, because the network data is stored under the nmstate key in the manifest pushed by the installer.

{"level":"error","ts":1654276012.1460414,"logger":"controller.baremetalhost","msg":"Reconciler error","reconciler group":"metal3.io","reconciler kind":"BareMetalHost","name":"worker-01","namespace":"openshift-machine-api","error":"action \"provisioning\" failed: failed to provision: could not retrieve network data: Secret worker-01-network-config-secret does not contain key networkData","errorVerbose":"Secret worker-01-network-config-secret does not contain key networkData\ncould not retrieve network data\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).getConfigDrive\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:1370\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).Provision\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:1490\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).actionProvisioning\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:1072\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).handleProvisioning\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:482\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:199\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:250\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_arm64.s:1259\nfailed to 
provision\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).actionProvisioning\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:1081\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).handleProvisioning\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:482\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:199\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:250\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_arm64.s:1259\naction \"provisioning\" failed\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:254\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_arm64.s:1259","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}



Changing the key name in the secret from `nmstate` to `networkData`, after it is generated by the installer, leads the baremetal operator and ironic to push rhcos. 

The workers are able to join the cluster this way:

oc get nodes
NAME                                                            STATUS   ROLES    AGE     VERSION
master-00.adistefa-631318b.qeclusters.arm.eng.rdu2.redhat.com   Ready    master   27m     v1.24.0+899fd9f
master-01.adistefa-631318b.qeclusters.arm.eng.rdu2.redhat.com   Ready    master   26m     v1.24.0+899fd9f
master-02.adistefa-631318b.qeclusters.arm.eng.rdu2.redhat.com   Ready    master   26m     v1.24.0+899fd9f
worker-00.adistefa-631318b.qeclusters.arm.eng.rdu2.redhat.com   Ready    worker   2m53s   v1.24.0+899fd9f
worker-01.adistefa-631318b.qeclusters.arm.eng.rdu2.redhat.com   Ready    worker   93s     v1.24.0+899fd9f

Comment 4 aleskandro 2022-06-04 11:21:02 UTC
> Changing the key name in the secret from `nmstate` to `networkData`, after it is generated by the installer, leads the baremetal operator and ironic to push rhcos. 

Adding the key networkData with an empty value is enough. `nmstate` is still needed to guarantee a custom network configuration for the hosts.

The workaround I'm currently applying at installation time is the following:

        cd $PATH_TO_INSTALL_DIR/openshift
        for f in 99_openshift-cluster-api_host-network-config-secrets-*
        do
          yq eval ".data.networkData = \"\"" -i $f
        done

Moving the bug to the baremetal-operator component.

Comment 5 Dan Li 2022-06-06 14:52:50 UTC
After chatting with Alessandro, I'm proposing to set the Severity to "High" and leaving the Blocker? flag to the assignee team to set.

Comment 6 Derek Higgins 2022-06-08 09:46:57 UTC
So the secret is created here, with the key "nmstate"
https://github.com/openshift/installer/blob/8b3d14d/pkg/asset/machines/baremetal/hosts.go#L50

The BMO reads it here, with the name "networkData"
https://github.com/openshift/baremetal-operator/blob/1adcaab/controllers/metal3.io/host_config_data.go#L85

Fixing the installer to match the BMO shouldn't be difficult, but I'd first like to figure out why/how this works in 4.10.
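
For readers following the thread, here is a minimal self-contained sketch of the mismatch (illustration only; the helper and variable names below are made up, this is not the installer or BMO code): the secret produced by the installer carries the host network configuration under the nmstate key, while the operator's lookup asks for networkData and treats its absence as fatal.

package main

import "fmt"

// getKey mimics a lookup that treats a missing key in the secret's data map
// as a fatal error, which is the behaviour behind the error quoted in comment 2.
func getKey(data map[string][]byte, key string) ([]byte, error) {
	value, ok := data[key]
	if !ok {
		return nil, fmt.Errorf("secret does not contain key %s", key)
	}
	return value, nil
}

func main() {
	// The installer stores the per-host network configuration under "nmstate"...
	secretData := map[string][]byte{
		"nmstate": []byte("interfaces: [...]"),
	}
	// ...but the provisioner asks for "networkData", so the lookup fails.
	if _, err := getKey(secretData, "networkData"); err != nil {
		fmt.Println("provisioning fails:", err)
	}
}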

Comment 7 aleskandro 2022-06-08 12:04:46 UTC
Hello Derek, just matching the name (s/nmstate/networkData) does not seem to be sufficient.

The nmstate key is still needed to configure a custom network, while the networkData value can also be empty (at least for the use cases I tried).

If you change the "nmstate" key to "networkData", any custom network configuration is ignored.

This key is never referenced by the baremetal-operator: if I'm not wrong it is handled by the image-customization-controller: https://github.com/openshift/image-customization-controller/blob/main/pkg/imageprovider/rhcos.go#L49

Comment 8 Zane Bitter 2022-06-08 18:32:04 UTC
(In reply to Derek Higgins from comment #6)
> The BMO reads it here, with the name "networkData"
> https://github.com/openshift/baremetal-operator/blob/1adcaab/controllers/
> metal3.io/host_config_data.go#L85

No it doesn't.

The machine-image-customization-controller reads it (with the key "nmstate") here:
https://github.com/openshift/image-customization-controller/blob/main/pkg/imageprovider/rhcos.go#L49

Comment 9 aleskandro 2022-06-08 19:32:16 UTC
To me, the issue seems to be that the upstream metal3 operator:

1. considers the secret and its networkData field optional: host_config_data.go#L74
2. does not consider that the secret (i.e., a secret with the same name) could have other fields consumed by another component
3. has getSecretData(.) return an error, propagated up to the callers, when the secret is found but the requested key is not: host_config_data.go#L82 and host_config_data.go#L42

https://github.com/openshift/baremetal-operator/blob/1adcaab/controllers/metal3.io/host_config_data.go

For this reason, I set the component to the BMO: adding the empty key in the manifests at install time would also mean asking the users to add this field when expanding the cluster (see the release_pending doc from: https://bugzilla.redhat.com/show_bug.cgi?id=2082671).

Comment 10 Zane Bitter 2022-06-08 19:38:14 UTC
Yes, I think that's correct. BMO should stop treating this as an error. We should *not* change the Secret format used by the installer or what we describe in the docs.

What's confusing is how this didn't fail in exactly the same way in 4.10, given that the code doesn't appear to have changed.

Comment 11 Derek Higgins 2022-06-09 13:14:46 UTC
(In reply to Zane Bitter from comment #10)
> Yes, I think that's correct. BMO should stop treating this as an error. We
> should *not* change the Secret format used by the installer or what we
> describe in the docs.
> 
> What's confusing is how this didn't fail in exactly the same way in 4.10,
> given that the code doesn't appear to have changed.

I think I see the difference: the BMH for 4.11 contains "preprovisioningNetworkDataName", which is not present in 4.10.
$ cat ./4.11/release/namespaces/openshift-machine-api/metal3.io/baremetalhosts/ostest-worker-1.yaml | yq .spec.preprovisioningNetworkDataName
"ostest-worker-1-network-config-secret"

$ cat ./4.10/release/namespaces/openshift-machine-api/metal3.io/baremetalhosts/ostest-worker-1.yaml | yq .spec.preprovisioningNetworkDataName
null


So in 4.10 we don't get past this check in NetworkData():
https://github.com/openshift/baremetal-operator/blob/1adcaab/controllers/metal3.io/host_config_data.go#L74

	if networkData == nil {
		hcd.log.Info("NetworkData is not set, returning empty data")
		return "", nil
	}

In 4.11, networkData is set, so we instead attempt to get "networkData" from the secret and fail because it doesn't exist:

	return hcd.getSecretData(
		networkData.Name,
		namespace,
		"networkData",
	)
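
The linked PR ("Bug 2092650: Stop treating missing network as fatal error", openshift/baremetal-operator pull 227) goes in the direction Zane suggested in comment 10: tolerate the missing key. As a rough sketch only (the actual patch may be structured differently), the idea is to treat a secret that exists but has no networkData key the same as having no network data at all:

// Sketch only, not the literal change from PR 227: a secret that exists but
// lacks the "networkData" key is treated as "no network data" rather than as
// a fatal provisioning error, so hosts whose secret only carries "nmstate"
// (consumed by the image-customization-controller) can still be provisioned.
func networkDataOrEmpty(secretData map[string][]byte) string {
	if value, ok := secretData["networkData"]; ok {
		return string(value)
	}
	return ""
}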

Comment 12 aleskandro 2022-06-13 13:35:47 UTC
(In reply to Derek Higgins from comment #11)
> (In reply to Zane Bitter from comment #10)
> > Yes, I think that's correct. BMO should stop treating this as an error. We
> > should *not* change the Secret format used by the installer or what we
> > describe in the docs.
> > 
> > What's confusing is how this didn't fail in exactly the same way in 4.10,
> > given that the code doesn't appear to have changed.
> 
> I think I see the difference, the bmh for 4.11 contains
> "preprovisioningNetworkDataName" its not present in 4.10
> $ cat
> ./4.11/release/namespaces/openshift-machine-api/metal3.io/baremetalhosts/
> ostest-worker-1.yaml | yq .spec.preprovisioningNetworkDataName
> "ostest-worker-1-network-config-secret"
> 
> $ cat
> ./4.10/release/namespaces/openshift-machine-api/metal3.io/baremetalhosts/
> ostest-worker-1.yaml | yq .spec.preprovisioningNetworkDataName
> null
> 
> 
> So In 4.10 we don't get passed this check in NetworkData()
> https://github.com/openshift/baremetal-operator/blob/1adcaab/controllers/
> metal3.io/host_config_data.go#L74
> 
> 	if networkData == nil {
> 		hcd.log.Info("NetworkData is not set, returning empty data")
> 		return "", nil
> 	}
> 
> In 4.11 networkData is set and we instead attempts to get "networkData" from
> the secret and fail as it doesn't exist
> 
> 	return hcd.getSecretData(
> 		networkData.Name,
> 		namespace,
> 		"networkData",
> 	)

Hi Derek, I'm not 100% sure we found the actual root cause yet. 4.10 works well, but if I generate manifests using the 4.10 openshift-baremetal-install, the preprovisioningNetworkDataName field is generated too (again, when some network configuration is provided for a given host).

However, as you said, this field is gone later on, while the cluster installs. The secret is stored, and it is owned by both the PreprovisioningImage and the BareMetalHost, but the reference in the BareMetalHost appears to be deleted between the inspection and provisioning states. Who is deleting it? Why?

As this was tested on a different infrastructure, I cannot be sure yet that the nmstate configuration is actually honored.

I want to point this out to make sure the patch we're going to apply doesn't hide some other issue (in 4.11 or also in previous versions).

Here is an excerpt of its initial processing in the BMO:

{"level":"info","ts":1655120569.8676925,"logger":"controllers.BareMetalHost","msg":"registering and validating access to management controller","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting","credentials":{"credentials":{"name":"worker-01-bmc-secret","namespace":"openshift-machine-api"},"credentialsVersion":"13941"}},
{"level":"info","ts":1655120569.8677754,"logger":"controllers.BareMetalHost.secret_manager","msg":"setting secret owner reference","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting","secret":"worker-01-network-config-secret","secretNamespace":"openshift-machine-api","ownerKind":"BareMetalHost","owner":"openshift-machine-api/worker-01","ownerUID":"f0462caa-4e41-48e8-8f12-e30abbf70df9"},
{"level":"info","ts":1655120569.8940384,"logger":"controllers.BareMetalHost","msg":"network data in pre-provisioning image is out of date","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting","latestVersion":"15949","currentVersion":"15935"},
{"level":"info","ts":1655120569.923056,"logger":"provisioner.ironic","msg":"current provision state","host":"openshift-machine-api~worker-01","lastError":"","current":"manageable","target":""},
{"level":"info","ts":1655120569.9230728,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting","requeue":true,"after":300},
{"level":"info","ts":1655120569.9231105,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/worker-01"},
{"level":"info","ts":1655120569.9553645,"logger":"controllers.BareMetalHost","msg":"registering and validating access to management controller","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting","credentials":{"credentials":{"name":"worker-01-bmc-secret","namespace":"openshift-machine-api"},"credentialsVersion":"13941"}},
{"level":"info","ts":1655120569.9554284,"logger":"controllers.BareMetalHost","msg":"network data in pre-provisioning image is out of date","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting","latestVersion":"15949","currentVersion":"15935"},
{"level":"info","ts":1655120569.9849916,"logger":"provisioner.ironic","msg":"current provision state","host":"openshift-machine-api~worker-01","lastError":"","current":"manageable","target":""},
{"level":"info","ts":1655120569.9850068,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting","requeue":true,"after":300},
{"level":"info","ts":1655120570.0206833,"logger":"controllers.HostFirmwareSettings","msg":"created new firmwareSchema resource","hostfirmwaresettings":"openshift-machine-api/worker-01"},
{"level":"info","ts":1655120570.0581303,"logger":"controllers.HostFirmwareSettings","msg":"start","hostfirmwaresettings":"openshift-machine-api/worker-01"},
{"level":"info","ts":1655120570.082164,"logger":"controllers.HostFirmwareSettings","msg":"retrieving firmware settings and saving to resource","hostfirmwaresettings":"openshift-machine-api/worker-01","node":"0d0b88ca-619b-453f-9bb9-a62c18ca8df1"},
{"level":"info","ts":1655120570.2091157,"logger":"provisioner.ironic","msg":"retrieved BIOS settings for node","host":"openshift-machine-api~worker-01","node":"0d0b88ca-619b-453f-9bb9-a62c18ca8df1","size":329},
{"level":"info","ts":1655120570.209288,"logger":"controllers.HostFirmwareSettings","msg":"getting firmwareSchema","hostfirmwaresettings":"openshift-machine-api/worker-01"},
{"level":"info","ts":1655120570.210087,"logger":"controllers.HostFirmwareSettings","msg":"found existing firmwareSchema resource","hostfirmwaresettings":"openshift-machine-api/worker-01"},
{"level":"info","ts":1655120570.2285674,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/worker-01"},
{"level":"info","ts":1655120570.2627969,"logger":"controllers.BareMetalHost","msg":"registering and validating access to management controller","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting","credentials":{"credentials":{"name":"worker-01-bmc-secret","namespace":"openshift-machine-api"},"credentialsVersion":"13941"}},
{"level":"info","ts":1655120570.2628784,"logger":"controllers.BareMetalHost","msg":"pending PreprovisioningImage not ready","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting"},
{"level":"info","ts":1655120570.2924314,"logger":"provisioner.ironic","msg":"current provision state","host":"openshift-machine-api~worker-01","lastError":"","current":"manageable","target":""},
{"level":"info","ts":1655120570.2924478,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting","requeue":true,"after":300},
{"level":"info","ts":1655120570.292479,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/worker-01"},
{"level":"info","ts":1655120570.3265345,"logger":"controllers.BareMetalHost","msg":"registering and validating access to management controller","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting","credentials":{"credentials":{"name":"worker-01-bmc-secret","namespace":"openshift-machine-api"},"credentialsVersion":"13941"}},
{"level":"info","ts":1655120570.3266103,"logger":"controllers.BareMetalHost","msg":"pending PreprovisioningImage not ready","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting"},
{"level":"info","ts":1655120570.35558,"logger":"provisioner.ironic","msg":"current provision state","host":"openshift-machine-api~worker-01","lastError":"","current":"manageable","target":""},
{"level":"info","ts":1655120570.3555937,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/worker-01","provisioningState":"inspecting","requeue":true,"after":300},
{"level":"info","ts":1655120570.5948565,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/worker-01"},

Comment 13 Derek Higgins 2022-06-14 16:28:53 UTC
adistefa: Yes, I can see in 4.10 that the BMH is created with "preprovisioningNetworkDataName" and that it is later removed.
This allows 4.10 to pass, but I'm not sure what's removing it. zbitter, any idea what removes it?

Regardless, should we proceed with the BMO not treating this as an error (as in https://github.com/openshift/baremetal-operator/pull/227)?

Comment 14 Zane Bitter 2022-06-15 14:20:03 UTC
I have no idea what could be removing it?!

Comment 15 Zane Bitter 2022-06-15 18:16:01 UTC
Oh, I bet I know what is removing it. It will be a controller that has vendored an old version of the BMH API. Almost certainly the culprit in this case will be CAPBM.

Most likely the vendoring was updated some time in 4.11 and that caused the bug to show up. So we should make this not be an error and backport that change to 4.10 before we update the vendoring in 4.10.
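
For anyone unfamiliar with the failure mode being described here, a small standalone illustration (hypothetical struct and field names; this is not CAPBM or BMO code) of how a controller built against an older copy of an API type can silently drop a newer field when it round-trips the object:

package main

import (
	"encoding/json"
	"fmt"
)

// OldSpec stands in for a BareMetalHost spec vendored before the new field existed.
type OldSpec struct {
	BootMACAddress string `json:"bootMACAddress,omitempty"`
}

func main() {
	// The object on the API server carries the newer field...
	stored := []byte(`{"bootMACAddress":"00:1b:21:e4:3a:b1",` +
		`"preprovisioningNetworkDataName":"worker-01-network-config-secret"}`)

	// ...a controller using the old type decodes it, dropping the unknown field...
	var spec OldSpec
	if err := json.Unmarshal(stored, &spec); err != nil {
		panic(err)
	}

	// ...and when it writes the whole object back, the field is gone.
	updated, _ := json.Marshal(spec)
	fmt.Println(string(updated)) // {"bootMACAddress":"00:1b:21:e4:3a:b1"}
}

This only shows the mechanism; whether CAPBM actually drops the field this way is exactly what is being speculated about above.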

Comment 17 Riccardo Pittau 2022-06-21 12:29:25 UTC
*** Bug 2098430 has been marked as a duplicate of this bug. ***

Comment 19 Adina Wolff 2022-06-21 14:25:53 UTC
Verified fix on 4.11.0-0.nightly-2022-06-21-040754

Comment 20 aleskandro 2022-06-21 15:41:19 UTC
Even though the configuration tested by Adina (static IPs configured for each node in the install-config) is different from the initial one, and given that no arm64 payloads are available yet for reasons beyond this bug, I'm moving this to verified.

The log of the verification is at https://auto-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/job/Private_Folders/job/yporagpa/job/networkConfig_Demo/302/console

Comment 21 Yoav Porag 2022-06-22 06:35:54 UTC
*** Bug 2098424 has been marked as a duplicate of this bug. ***

Comment 25 denise.ochoa-mendoza 2022-07-08 00:24:33 UTC
(In reply to aleskandro from comment #4)
> > Changing the key name in the secret from `nmstate` to `networkData`, after it is generated by the installer, leads the baremetal operator and ironic to push rhcos. 
> 
> Adding the key networkData with an empty value is enough. `nmstate` is still
> needed to guarantee a custom network configuration for the hosts.
> 
> The current workaround I'm applying at installation time is like:
> 
>         cd $PATH_TO_INSTALL_DIR/openshift
>         for f in 99_openshift-cluster-api_host-network-config-secrets-*
>         do
>           yq eval ".data.networkData = \"\"" -i $f
>         done
> 
> Moving the bug to the baremetal-operator component.

Hi, how do I apply this workaround? I don't see 99_openshift-cluster-api_host-network-config-secrets-* in my $PATH_TO_INSTALL_DIR/openshift directory; all I have are the manifests listed below. If it helps, my IPI install is not using a provisioning network.

99_baremetal-provisioning-config.yaml             99_openshift-cluster-api_hosts-0.yaml            99_openshift-cluster-api_master-machines-2.yaml
99_kubeadmin-password-secret.yaml                 99_openshift-cluster-api_hosts-1.yaml            99_openshift-cluster-api_master-user-data-secret.yaml
99_openshift-cluster-api_host-bmc-secrets-0.yaml  99_openshift-cluster-api_hosts-2.yaml            99_openshift-cluster-api_worker-machineset-0.yaml
99_openshift-cluster-api_host-bmc-secrets-1.yaml  99_openshift-cluster-api_hosts-3.yaml            99_openshift-cluster-api_worker-user-data-secret.yaml
99_openshift-cluster-api_host-bmc-secrets-2.yaml  99_openshift-cluster-api_hosts-4.yaml            99_openshift-machineconfig_99-master-ssh.yaml
99_openshift-cluster-api_host-bmc-secrets-3.yaml  99_openshift-cluster-api_hosts-5.yaml            99_openshift-machineconfig_99-worker-ssh.yaml
99_openshift-cluster-api_host-bmc-secrets-4.yaml  99_openshift-cluster-api_hosts-6.yaml            openshift-install-manifests.yaml
99_openshift-cluster-api_host-bmc-secrets-5.yaml  99_openshift-cluster-api_master-machines-0.yaml
99_openshift-cluster-api_host-bmc-secrets-6.yaml  99_openshift-cluster-api_master-machines-1.yaml

Comment 26 aleskandro 2022-07-08 11:21:12 UTC
Hello, 

(In reply to denise.ochoa-mendoza from comment #25)
> 
> Hi how do I apply this workaround, I dont see 
> 99_openshift-cluster-api_host-network-config-secrets-* in my
> $PATH_TO_INSTALL_DIR/openshift directory all I have are the following
> manifests below, if it helps my IPI install is not using a provisioning
> network.  
> 

This bug blocked the installation of 4.11 clusters (not yet GA) that define a custom network configuration in the install-config under platform.baremetal.hosts[].networkConfig (or in the related secret when expanding an existing cluster).
Since this bug is verified, the workaround is no longer needed if you install with the latest nightlies.

Note that the installer will not create the network-config secrets if you are installing a 4.11 cluster and not applying a custom network configuration as above.

- Which version are you installing?
- Do you have a custom network configuration defined in the install-config?

Comment 30 denise.ochoa-mendoza 2022-07-11 18:22:05 UTC
Created attachment 1896134 [details]
IPI Installation, no provisioning network, no workers deployed

Hi,

here is the must-gather.

Comment 31 aleskandro 2022-07-22 08:49:06 UTC

(In reply to denise.ochoa-mendoza from comment #30)
> Created attachment 1896134 [details]
> IPI Installation, no provisioning network, no workers deployed
> 
> Hi,
> 
> here is the must-gather.


Hello Denise, as mentioned, this bug is not related to your case.

Looking at your install-config and the status of the cluster in the must-gather you uploaded, the metal3-ironic-api container in the metal3 deployment's pod has not started yet.

It's waiting for the clusterProvisioningIP to be assigned to one of the available nodes, and no node has 192.168.22.158 as its IP.

If you feel this is a bug and not a misconfiguration of your network or install-config, open a new bug so that the proper team can support you on this.

omg logs metal3-85bc5848d-jq8dr -c metal3-ironic-api

2022-07-08T16:33:39.342637161Z Waiting for 192.168.22.158 to be configured on an interface
....

One thing worth noting, however, is that the metal3 pod looks healthy, with all of its containers reported ready, even though it isn't. This degrades the user experience, makes debugging such issues harder, and should be taken into account by the metal3 dev team IMO.

Comment 32 Zane Bitter 2022-07-22 15:02:55 UTC
(In reply to aleskandro from comment #31)
> one thing worth noting, however, is that the metal3 pod looks good with all
> of its containers ready, even if it isn't. This degrades the user
> experience, makes the debugging of such issues harder, and should be taken
> into account by the metal3 dev team IMO.

This appears to be because the configuration sets all of the provisioning network data (MAC addresses, IP, CIDR) but also sets "provisioningNetwork: Disabled". If the provisioning network were enabled then an init container would set the provisioning VIP and not permit the Pod to start up until it was done.

I don't recall if there is any specific behaviour tied to providing a provisioning IP when the provisioning network is disabled, or if we should just ignore the rest of the data when we see "provisioningNetwork: Disabled". If not the latter, then there should probably be an init container that checks that the IP is available before permitting the pod to start up.
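
As a rough illustration of the kind of gate being suggested (a sketch only, not the metal3 implementation; the PROVISIONING_IP environment variable name is made up), an init container could simply block until the expected address shows up on a local interface, much like the "Waiting for ... to be configured on an interface" message quoted in comment 31:

package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// hasIP reports whether the given IP address is configured on any local interface.
func hasIP(want string) bool {
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		return false
	}
	for _, addr := range addrs {
		if ipnet, ok := addr.(*net.IPNet); ok && ipnet.IP.String() == want {
			return true
		}
	}
	return false
}

func main() {
	// e.g. PROVISIONING_IP=192.168.22.158 (hypothetical variable name)
	ip := os.Getenv("PROVISIONING_IP")
	for !hasIP(ip) {
		fmt.Printf("Waiting for %s to be configured on an interface\n", ip)
		time.Sleep(5 * time.Second)
	}
}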

Comment 34 aleskandro 2022-07-22 19:26:49 UTC
(In reply to Zane Bitter from comment #32)
> (In reply to aleskandro from comment #31)
> > one thing worth noting, however, is that the metal3 pod looks good with all
> > of its containers ready, even if it isn't. This degrades the user
> > experience, makes the debugging of such issues harder, and should be taken
> > into account by the metal3 dev team IMO.
> 
> This appears to be because the configuration sets all of the provisioning
> network data (MAC addresses, IP, CIDR) but also sets "provisioningNetwork:
> Disabled". If the provisioning network were enabled then an init container
> would set the provisioning VIP and not permit the Pod to start up until it
> was done.
> 

What I meant is that, regardless of the specific case, I would set proper startup/readiness/liveness probes so that a user is aware when some containers in the metal3 pod are not starting or not ready. This also applies to other containers there, like the dnsmasq one (see https://bugzilla.redhat.com/show_bug.cgi?id=2081734).

Comment 36 errata-xmlrpc 2022-08-10 11:15:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

