Description of problem:

The baremetal operator boots the metal3 inspection image instead of the specified image when the `inspect.metal3.io: disabled` annotation is present. This only happens when the image information is added after the resource is created; see Steps to Reproduce for further detail. When the resource is created with the image information already set, the correct image is booted.

Version-Release number of selected component (if applicable):

```
➜  ~ oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.ci-2021-04-13-104134   True        False         8h      Cluster version is 4.8.0-0.ci-2021-04-13-104134
```

How reproducible: every time

Steps to Reproduce:

1. Create the baremetalhost resource:

```
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: yoloswag-4
  namespace: yoloswag-4
  annotations:
    inspect.metal3.io: disabled
  labels:
    installenvs.agent-install.openshift.io: "yoloswag-4"
spec:
  bootMode: "legacy"
  bmc:
    address: redfish-virtualmedia+http://10.1.61.12:8000/redfish/v1/Systems/a9941474-fc78-4f1f-893b-89cac52ca581
    disableCertificateVerification: true
    credentialsName: yoloswag-4-bmc-secret
  bootMACAddress: 52:54:00:76:8f:15
  automatedCleaningMode: disabled
  online: true
```

2. Modify the baremetalhost by adding:

```
  image:
    format: live-iso
    url: http://assisted-service-assisted-installer.apps.vlan613.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/eeb45748-d014-43e7-a6e0-7939eb6da72a/downloads/image
```

(In my use case this is done by the BaremetalAgentController in the assisted installer service, but doing it manually yields the same result.)

Actual results: the metal3 inspection image is booted

Expected results: the specified image is booted

Additional info:
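For reference, step 2 can also be applied without a full edit by sending a merge patch. A minimal sketch (in Python, just to show the patch body; the `oc patch` invocation in the comment is an assumption about how one would apply it, e.g. `oc patch bmh yoloswag-4 -n yoloswag-4 --type merge -p "$PATCH"`):

```python
# Sketch: building the merge-patch body for step 2. A merge patch only
# touches the fields it contains, unlike a full object update.
import json

patch = {
    "spec": {
        "image": {
            "format": "live-iso",
            "url": "http://assisted-service-assisted-installer.apps.vlan613.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/eeb45748-d014-43e7-a6e0-7939eb6da72a/downloads/image",
        }
    }
}
print(json.dumps(patch))
```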
I tried this morning and I couldn't replicate this issue. Here's what I did. My environment is based on `dev-scripts` (recreated it yesterday). Here are some details about the environment:

```
[dev@edge-10 assisted-service]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-09-222447   True        False         17h     Cluster version is 4.8.0-0.nightly-2021-04-09-222447
```

The BMH I used:

```
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: ostest-worker-0
  namespace: assisted-installer
  annotations:
    inspect.metal3.io: disabled
  labels:
    installenvs.agent-install.openshift.io: "bmac-test"
spec:
  online: true
  bootMACAddress: 00:ec:ee:f8:5a:ba
  automatedCleaningMode: disabled
  bmc:
    address: redfish-virtualmedia+http://192.168.111.1:8000/redfish/v1/Systems/ca6345cf-96a3-4ec6-8344-bf0a5f910eec
    credentialsName: ostest-worker-0-bmc-secret
```

The ClusterDeployment and InstallEnv (soon to be InfraEnv) resources:

```
apiVersion: hive.openshift.io/v1
kind: ClusterImageSet
metadata:
  name: "4.7"
spec:
  releaseImage: quay.io/openshift-release-dev/ocp-release:4.7.2-x86_64
---
apiVersion: hive.openshift.io/v1
kind: ClusterDeployment
metadata:
  name: bmac-test
  namespace: assisted-installer
spec:
  baseDomain: bmac.hungry-bytes.io
  clusterName: bmac-test
  platform:
    agentBareMetal:
      apiVIP: ""
      ingressVIP: ""
      agentSelector:
        matchLabels:
          bla: aaa
  provisioning:
    clusterImageSetRef:
      name: 4.8
    installStrategy:
      agent:
        sshPublicKey: 'ssh-rsa '
        networking:
          clusterNetwork:
            - cidr: 10.128.0.0/14
              hostPrefix: 23
          serviceNetwork:
            - 172.30.0.0/16
        provisionRequirements:
          controlPlaneAgents: 1
          workerAgents: 0
        agentSelector:
          matchLabels:
            bla: aaa
  pullSecretRef:
    name: my-pull-secret
---
apiVersion: agent-install.openshift.io/v1beta1
kind: InstallEnv
metadata:
  name: bmac-test
  namespace: assisted-installer
spec:
  clusterRef:
    name: bmac-test
    namespace: assisted-installer
  sshAuthorizedKeys:
    - 'ssh-rsa '
  agentLabelSelector:
    matchLabels:
      bla: aaa
  pullSecretRef:
    name: my-pull-secret
```

After a bit, here's the agent:

```
[dev@edge-10 assisted-service]$ kg agent
NAME                                   CLUSTER     APPROVED
ca6345cf-96a3-4ec6-8344-bf0a5f910eec   bmac-test   true
[dev@edge-10 assisted-service]$
```

and the BMH has the HardwareDetails populated:

```
[dev@edge-10 dev-scripts]$ kd bmh
...
Spec:
  Automated Cleaning Mode:  disabled
  Bmc:
    Address:           redfish-virtualmedia+http://192.168.111.1:8000/redfish/v1/Systems/ca6345cf-96a3-4ec6-8344-bf0a5f910eec
    Credentials Name:  ostest-worker-0-bmc-secret
  Boot MAC Address:    00:ec:ee:f8:5a:ba
  Image:
    Format:  live-iso
    URL:     http://assisted-service-assisted-installer.apps.ostest.test.metalkube.org/api/assisted-install/v1/clusters/90941774-2f05-4a21-858b-177d3f389b69/downloads/image
  Online:    true
Status:
  Error Count:    0
  Error Message:
  Good Credentials:
    Credentials:
      Name:               ostest-worker-0-bmc-secret
      Namespace:          assisted-installer
    Credentials Version:  548059
  Hardware:
    Cpu:
      Arch:             x86_64
      Clock Megahertz:  3792
      Count:            4
      Flags:            fpu ...
```

Why are you using bootMode: legacy? The default is UEFI.
> This only happens when the image information is added after the resource is created

Is the host PXE booting the Ironic image because `online: true` and no image is specified? If so, that's expected behavior, depending on the host configuration: either set `online` to false, add the target image URL, or configure the host not to PXE boot.

Otherwise, please capture a must-gather so we can see the relevant logs to debug. Thanks!
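The situation described above can be sketched as a simple check (the predicate name and the exact fallback semantics are illustrative assumptions, not Metal3 API):

```python
# Hypothetical sketch: with online: true and no image in the spec, the
# powered-on host has nothing to boot from virtualmedia and may fall back
# to PXE, picking up the Ironic (inspection) image.
def may_pxe_boot_ironic(spec):
    return bool(spec.get("online", False)) and "image" not in spec

bmh_spec = {"online": True, "automatedCleaningMode": "disabled"}
print(may_pxe_boot_ironic(bmh_spec))  # True: no image yet

bmh_spec["image"] = {"format": "live-iso", "url": "http://example.invalid/image"}
print(may_pxe_boot_ironic(bmh_spec))  # False once the image is set
```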
Here I can see the same issue:

```
[root@lab-installer AI]# oc get clusterdeployment sno-assisted -o json | jq '.status.conditions[] | select(.reason | contains("AgentPlatformState"))'
{
  "lastProbeTime": "2021-04-14T12:56:16Z",
  "lastTransitionTime": "2021-04-14T12:56:16Z",
  "message": "error",
  "reason": "AgentPlatformState",
  "status": "Unknown",
  "type": "Unreachable"
}
{
  "lastProbeTime": "2021-04-14T12:56:16Z",
  "lastTransitionTime": "2021-04-14T12:56:16Z",
  "message": "object 41c1309b-ed2e-4aa5-9da0-1f4af0b28457/bootstrap.ign was not found",
  "reason": "AgentPlatformStateInfo",
  "status": "Unknown",
  "type": "Unreachable"
}
```

In my case it is not directly related to Metal3, because I mounted the ISO on the VM and booted it up myself.
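For anyone without jq handy, the same condition filter can be sketched in Python (sample data abbreviated from the output above):

```python
# Sketch: equivalent of the jq select(.reason | contains("AgentPlatformState"))
# filter, applied to the ClusterDeployment status conditions.
conditions = [
    {"reason": "AgentPlatformState", "message": "error",
     "status": "Unknown", "type": "Unreachable"},
    {"reason": "AgentPlatformStateInfo",
     "message": "object 41c1309b-ed2e-4aa5-9da0-1f4af0b28457/bootstrap.ign was not found",
     "status": "Unknown", "type": "Unreachable"},
    {"reason": "SomethingElse", "message": "ok",
     "status": "True", "type": "Ready"},
]

matching = [c for c in conditions if "AgentPlatformState" in c["reason"]]
for c in matching:
    print(c["reason"], "->", c["message"])
```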
I've tried to reach the file using the API but it's not there, so I've gone to the LocalStorage to see which objects it contains, and it looks like there is no 4.8 file there:

```
drwxrwsr-x. 7 root       1000660000 4.0K Apr 14 12:56 .
drwxr-x---. 4 root       root         56 Apr 14 09:08 ..
drwxr-sr-x. 2 1000660000 1000660000 4.0K Apr 14 11:55 781ab3bf-1c70-479b-b282-038baa1fe494
drwxr-sr-x. 2 1000660000 1000660000 4.0K Apr 14 10:29 8aea4dcc-7df6-4663-988a-fcc5cbd92c3f
drwxrws---. 2 1000660000 1000660000 4.0K Apr 14 09:41 cache
-rw-------. 1 1000660000 1000660000  94M Apr 14 12:52 discovery-image-41c1309b-ed2e-4aa5-9da0-1f4af0b28457.iso
-rw-------. 1 1000660000 1000660000  94M Apr 14 11:55 discovery-image-781ab3bf-1c70-479b-b282-038baa1fe494.iso
-rw-------. 1 1000660000 1000660000  94M Apr 14 11:03 discovery-image-8aea4dcc-7df6-4663-988a-fcc5cbd92c3f.iso
drwxr-sr-x. 3 1000660000 1000660000 4.0K Apr 14 12:55 installercache
drwxrws---. 2 root       1000660000  16K Apr 14 08:56 lost+found
-rw-rw----. 1 1000660000 1000660000   13 Apr 14 08:59 minimal_templates_version.json
-rw-rw----. 1 1000660000 1000660000  76M Apr 14 08:59 rhcos-46.82.202012051820-0.initrd.img
-rw-rw----. 1 1000660000 1000660000 876M Apr 14 08:59 rhcos-46.82.202012051820-0.iso
-rw-rw----. 1 1000660000 1000660000  94M Apr 14 08:59 rhcos-46.82.202012051820-0-minimal.iso
-rw-rw----. 1 1000660000 1000660000 783M Apr 14 08:59 rhcos-46.82.202012051820-0.rootfs.img
-rw-rw----. 1 1000660000 1000660000 8.6M Apr 14 08:59 rhcos-46.82.202012051820-0.vmlinuz
-rw-rw----. 1 1000660000 1000660000  83M Apr 14 08:58 rhcos-47.83.202102090044-0.initrd.img
-rw-rw----. 1 1000660000 1000660000 910M Apr 14 08:58 rhcos-47.83.202102090044-0.iso
-rw-rw----. 1 1000660000 1000660000 101M Apr 14 08:58 rhcos-47.83.202102090044-0-minimal.iso
-rw-rw----. 1 1000660000 1000660000 810M Apr 14 08:58 rhcos-47.83.202102090044-0.rootfs.img
-rw-rw----. 1 1000660000 1000660000 9.1M Apr 14 08:58 rhcos-47.83.202102090044-0.vmlinuz
```

But the Assisted Service pod contains this env var, which includes the 4.8 version:

```
OPENSHIFT_VERSIONS={"4.6":{"display_name":"4.6.16","release_image":"quay.io/openshift-release-dev/ocp-release:4.6.16-x86_64","rhcos_image":"https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.6/4.6.8/rhcos-4.6.8-x86_64-live.x86_64.iso","rhcos_version":"46.82.202012051820-0","support_level":"production"},"4.7":{"display_name":"4.7.5","release_image":"quay.io/openshift-release-dev/ocp-release:4.7.5-x86_64","rhcos_image":"https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.7/4.7.0/rhcos-4.7.0-x86_64-live.x86_64.iso","rhcos_version":"47.83.202102090044-0","support_level":"production"},"4.8":{"display_name":"4.8","release_image":"quay.io/openshift-release-dev/ocp-release-nightly:4.8.0-0.nightly-2021-03-16-221720","rhcos_image":"https://releases-art-rhcos.svc.ci.openshift.org/art/storage/releases/rhcos-4.8/48.83.202103122318-0/x86_64/rhcos-48.83.202103122318-0-live.x86_64.iso","rhcos_version":"46.82.202012051820-0","support_level":"production"}}
```

Any hint?
Summary from conversation with Steve Hardy: this bug only happens when the BMH is edited to add the image section.

- Creating without an image, with inspection disabled: does not boot the ironic agent.
- Creating with an image, with inspection disabled: does not boot the ironic agent.
- It's only when you update the BMH to add the image afterwards that the ironic agent is booted.
Looking at the BMO logs we see the wrong cleaning mode is used:

```
$ grep "setting automated" current.log | grep yoloswag-4
2021-04-14T16:08:58.360724319Z {"level":"info","ts":1618416538.3606825,"logger":"provisioner.ironic","msg":"setting automated cleaning mode to","host":"yoloswag-4~yoloswag-4","ID":"3cac3178-46c5-4b1f-9ba3-be8e89c9140a","mode":"metadata"}
```

I've not yet managed to reproduce, but provided this wasn't modified by updating the CR, something must either be modifying it after creation, or a bug in BMO means we're using the wrong value.
The cause is almost certainly that $SOMETHING is updating the BMH using a client that contains an outdated vendored version of the BareMetalHost type, and is writing it with Update() (i.e. replacing the object rather than just patching it), with the result that the cleaning mode field is set back to the default.
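A minimal sketch of that failure mode, with plain dicts standing in for the typed client; the field names come from the BMH spec in this report, while the exact defaulting behavior is an assumption about how the API server re-fills the dropped field:

```python
# Sketch: why Update() from a client with an outdated vendored type clobbers
# automatedCleaningMode, while a merge patch preserves it.
import copy

# Server-side BareMetalHost, including the newer automatedCleaningMode field.
server_bmh = {"spec": {"online": True, "automatedCleaningMode": "disabled"}}

# A stale client only round-trips the fields its vendored type knows about.
KNOWN_FIELDS = {"online"}

def stale_read(obj):
    return {"spec": {k: v for k, v in obj["spec"].items() if k in KNOWN_FIELDS}}

def update(client_obj):
    # Update() replaces the whole spec; the dropped field is re-defaulted.
    new = copy.deepcopy(client_obj)
    new["spec"].setdefault("automatedCleaningMode", "metadata")  # assumed default
    return new

def merge_patch(server, delta):
    # A merge patch only touches the fields present in the delta.
    new = copy.deepcopy(server)
    new["spec"].update(delta)
    return new

after_update = update(stale_read(server_bmh))
after_patch = merge_patch(server_bmh, {"image": {"format": "live-iso"}})

print(after_update["spec"]["automatedCleaningMode"])  # metadata (clobbered)
print(after_patch["spec"]["automatedCleaningMode"])   # disabled (preserved)
```

This is why patching (rather than a full Update()) from any controller with an old vendored type avoids resetting fields it doesn't know about.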
the AI part of this will be resolved via https://github.com/openshift/assisted-service/pull/1512
I created https://github.com/openshift/cluster-api-provider-baremetal/pull/149 to resolve the CAPBM vendor update. In my environment I think that's what overwrote the spec, but I suspect in the reporter's case it was the assisted-service controller.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438