Bug 1949316

Summary: BaremetalHost resource automatedCleaningMode ignored due to outdated vendoring
Product: OpenShift Container Platform
Reporter: Hao Liu <haoli>
Component: Bare Metal Hardware Provisioning
Sub component: cluster-api-provider
Assignee: Steven Hardy <shardy>
QA Contact: Lubov <lshilin>
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
CC: ccrum, fpercoco, jparrill, rbartal, rhhi-next-mgmt-qe, shardy, trwest, zbitter
Version: 4.8
Keywords: Triaged
Target Milestone: ---
Flags: lshilin: needinfo-
Target Release: 4.8.0
Hardware: x86_64
OS: Linux
Last Closed: 2021-07-27 23:00:23 UTC
Type: Bug

Description Hao Liu 2021-04-14 01:20:59 UTC
Description of problem:
The baremetal operator boots the metal3 inspection image instead of the specified image when the `inspect.metal3.io: disabled` annotation is present.

This only happens when the image information is added after the resource is created; see Steps to Reproduce for details.

When the resource is created with the image information, the correct image is booted.

Version-Release number of selected component (if applicable):
➜  ~ oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.ci-2021-04-13-104134   True        False         8h      Cluster version is 4.8.0-0.ci-2021-04-13-104134


How reproducible:
every time

Steps to Reproduce:
1. Create a BareMetalHost resource:
```
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: yoloswag-4
  namespace: yoloswag-4
  annotations:
    inspect.metal3.io: disabled
  labels:
    installenvs.agent-install.openshift.io: "yoloswag-4"
spec:
  bootMode: "legacy"
  bmc:
    address: redfish-virtualmedia+http://10.1.61.12:8000/redfish/v1/Systems/a9941474-fc78-4f1f-893b-89cac52ca581
    disableCertificateVerification: true
    credentialsName: yoloswag-4-bmc-secret
  bootMACAddress: 52:54:00:76:8f:15
  automatedCleaningMode: disabled
  online: true
```
2. Modify the BareMetalHost by adding:
```
  image:
    format: live-iso
    url: http://assisted-service-assisted-installer.apps.vlan613.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/eeb45748-d014-43e7-a6e0-7939eb6da72a/downloads/image
```
(In my use case this is done by the BaremetalAgentController in the assisted installer service, but doing it manually yields the same result.)
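
For a manual reproduction, step 2 can be expressed as a JSON merge patch that touches only `spec.image`. The following is a minimal sketch, not the way the BaremetalAgentController does it; `image_merge_patch` is a hypothetical helper and the URL is a placeholder.

```python
import json


def image_merge_patch(url: str, fmt: str = "live-iso") -> str:
    """Return a JSON merge patch that only sets spec.image (hypothetical helper)."""
    return json.dumps({"spec": {"image": {"format": fmt, "url": url}}}, sort_keys=True)


# Placeholder URL, not the one from this report.
patch = image_merge_patch("http://example.invalid/downloads/image")
print(patch)
```

The resulting string could then be applied with something like `oc patch baremetalhost yoloswag-4 -n yoloswag-4 --type merge -p "$PATCH"`, which leaves every other field of the spec as it was.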

Actual results:
The metal3 inspection image is booted.

Expected results:
The specified image is booted.

Additional info:

Comment 1 Flavio Percoco 2021-04-14 08:27:11 UTC
I tried this morning and I couldn't replicate this issue. Here's what I did:

My environment is based on `dev-scripts` (I recreated it yesterday). Here are some details about the environment:

```
[dev@edge-10 assisted-service]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-09-222447   True        False         17h     Cluster version is 4.8.0-0.nightly-2021-04-09-222447
```

The BMH I used:

```
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: ostest-worker-0
  namespace: assisted-installer
  annotations:
    inspect.metal3.io: disabled
  labels:
    installenvs.agent-install.openshift.io: "bmac-test"
spec:
  online: true
  bootMACAddress: 00:ec:ee:f8:5a:ba
  automatedCleaningMode: disabled
  bmc:
    address: redfish-virtualmedia+http://192.168.111.1:8000/redfish/v1/Systems/ca6345cf-96a3-4ec6-8344-bf0a5f910eec
    credentialsName: ostest-worker-0-bmc-secret
```

The ClusterDeployment and InstallEnv (soon to be InfraEnv) resources:

```
apiVersion: hive.openshift.io/v1
kind: ClusterImageSet
metadata:
  name: "4.7"
spec:
  releaseImage: quay.io/openshift-release-dev/ocp-release:4.7.2-x86_64
---
apiVersion: hive.openshift.io/v1
kind: ClusterDeployment
metadata:
    name: bmac-test
    namespace: assisted-installer
spec:
  baseDomain: bmac.hungry-bytes.io
  clusterName: bmac-test
  platform:
    agentBareMetal:
      apiVIP: ""
      ingressVIP: ""
      agentSelector:
        matchLabels:
          bla: aaa
  provisioning:
    clusterImageSetRef:
      name: 4.8
    installStrategy:
      agent:
        sshPublicKey: 'ssh-rsa '
        networking:
            clusterNetwork:
              - cidr: 10.128.0.0/14
                hostPrefix: 23
            serviceNetwork:
              - 172.30.0.0/16
        provisionRequirements:
            controlPlaneAgents: 1
            workerAgents: 0
        agentSelector:
          matchLabels:
            bla: aaa
  pullSecretRef:
    name: my-pull-secret
---
apiVersion: agent-install.openshift.io/v1beta1
kind: InstallEnv
metadata:
  name: bmac-test
  namespace: assisted-installer
spec:
  clusterRef:
        name: bmac-test
        namespace: assisted-installer
  sshAuthorizedKeys:
        - 'ssh-rsa '
  agentLabelSelector:
        matchLabels:
          bla: aaa
  pullSecretRef:
      name: my-pull-secret
```

After a bit, here's the agent:

```
[dev@edge-10 assisted-service]$ kg agent
NAME                                   CLUSTER     APPROVED
ca6345cf-96a3-4ec6-8344-bf0a5f910eec   bmac-test   true
[dev@edge-10 assisted-service]$
```

and the BMH has the HardwareDetails populated:

```
[dev@edge-10 dev-scripts]$ kd bmh
...
Spec:
  Automated Cleaning Mode:  disabled
  Bmc:
    Address:           redfish-virtualmedia+http://192.168.111.1:8000/redfish/v1/Systems/ca6345cf-96a3-4ec6-8344-bf0a5f910eec
    Credentials Name:  ostest-worker-0-bmc-secret
  Boot MAC Address:    00:ec:ee:f8:5a:ba
  Image:
    Format:  live-iso
    URL:     http://assisted-service-assisted-installer.apps.ostest.test.metalkube.org/api/assisted-install/v1/clusters/90941774-2f05-4a21-858b-177d3f389b69/downloads/image
  Online:    true
Status:
  Error Count:    0
  Error Message:
  Good Credentials:
    Credentials:
      Name:               ostest-worker-0-bmc-secret
      Namespace:          assisted-installer
    Credentials Version:  548059
  Hardware:
    Cpu:
      Arch:             x86_64
      Clock Megahertz:  3792
      Count:            4
      Flags:
        fpu
...
```

Why are you using bootMode: legacy? The default is UEFI

Comment 2 Steven Hardy 2021-04-14 09:21:52 UTC
> This only happens when the image information is added after the resource is created

Is the host PXE booting the Ironic image, because `online: true` and no image is specified?

If so that's expected, depending on the host configuration - either set online to false, add the target image URL, or configure the host not to PXE boot.

Otherwise, please capture a must-gather, so we can see the relevant logs to debug, thanks!

Comment 3 Juan Manuel Parrilla Madrid 2021-04-14 13:30:32 UTC
Here I can see the same issue:

[root@lab-installer AI]# oc get clusterdeployment sno-assisted -o json | jq '.status.conditions[] | select(.reason | contains("AgentPlatformState"))'
{
  "lastProbeTime": "2021-04-14T12:56:16Z",
  "lastTransitionTime": "2021-04-14T12:56:16Z",
  "message": "error",
  "reason": "AgentPlatformState",
  "status": "Unknown",
  "type": "Unreachable"
}
{
  "lastProbeTime": "2021-04-14T12:56:16Z",
  "lastTransitionTime": "2021-04-14T12:56:16Z",
  "message": "object 41c1309b-ed2e-4aa5-9da0-1f4af0b28457/bootstrap.ign was not found",
  "reason": "AgentPlatformStateInfo",
  "status": "Unknown",
  "type": "Unreachable"
}

In my case it is not directly related to Metal3, because I mounted the ISO on the VM and booted it up.

Comment 4 Juan Manuel Parrilla Madrid 2021-04-14 13:43:13 UTC
I tried to reach the file through the API, but it's not there, so I looked at the LocalStorage to see which objects it contains, and it looks like there is no 4.8 file there:

```
drwxrwsr-x. 7 root       1000660000 4.0K Apr 14 12:56 .
drwxr-x---. 4 root       root         56 Apr 14 09:08 ..
drwxr-sr-x. 2 1000660000 1000660000 4.0K Apr 14 11:55 781ab3bf-1c70-479b-b282-038baa1fe494
drwxr-sr-x. 2 1000660000 1000660000 4.0K Apr 14 10:29 8aea4dcc-7df6-4663-988a-fcc5cbd92c3f
drwxrws---. 2 1000660000 1000660000 4.0K Apr 14 09:41 cache
-rw-------. 1 1000660000 1000660000  94M Apr 14 12:52 discovery-image-41c1309b-ed2e-4aa5-9da0-1f4af0b28457.iso
-rw-------. 1 1000660000 1000660000  94M Apr 14 11:55 discovery-image-781ab3bf-1c70-479b-b282-038baa1fe494.iso
-rw-------. 1 1000660000 1000660000  94M Apr 14 11:03 discovery-image-8aea4dcc-7df6-4663-988a-fcc5cbd92c3f.iso
drwxr-sr-x. 3 1000660000 1000660000 4.0K Apr 14 12:55 installercache
drwxrws---. 2 root       1000660000  16K Apr 14 08:56 lost+found
-rw-rw----. 1 1000660000 1000660000   13 Apr 14 08:59 minimal_templates_version.json
-rw-rw----. 1 1000660000 1000660000  76M Apr 14 08:59 rhcos-46.82.202012051820-0.initrd.img
-rw-rw----. 1 1000660000 1000660000 876M Apr 14 08:59 rhcos-46.82.202012051820-0.iso
-rw-rw----. 1 1000660000 1000660000  94M Apr 14 08:59 rhcos-46.82.202012051820-0-minimal.iso
-rw-rw----. 1 1000660000 1000660000 783M Apr 14 08:59 rhcos-46.82.202012051820-0.rootfs.img
-rw-rw----. 1 1000660000 1000660000 8.6M Apr 14 08:59 rhcos-46.82.202012051820-0.vmlinuz
-rw-rw----. 1 1000660000 1000660000  83M Apr 14 08:58 rhcos-47.83.202102090044-0.initrd.img
-rw-rw----. 1 1000660000 1000660000 910M Apr 14 08:58 rhcos-47.83.202102090044-0.iso
-rw-rw----. 1 1000660000 1000660000 101M Apr 14 08:58 rhcos-47.83.202102090044-0-minimal.iso
-rw-rw----. 1 1000660000 1000660000 810M Apr 14 08:58 rhcos-47.83.202102090044-0.rootfs.img
-rw-rw----. 1 1000660000 1000660000 9.1M Apr 14 08:58 rhcos-47.83.202102090044-0.vmlinuz
```

But the Assisted Service pod contains this env var:

OPENSHIFT_VERSIONS={"4.6":{"display_name":"4.6.16","release_image":"quay.io/openshift-release-dev/ocp-release:4.6.16-x86_64","rhcos_image":"https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.6/4.6.8/rhcos-4.6.8-x86_64-live.x86_64.iso","rhcos_version":"46.82.202012051820-0","support_level":"production"},"4.7":{"display_name":"4.7.5","release_image":"quay.io/openshift-release-dev/ocp-release:4.7.5-x86_64","rhcos_image":"https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.7/4.7.0/rhcos-4.7.0-x86_64-live.x86_64.iso","rhcos_version":"47.83.202102090044-0","support_level":"production"},"4.8":{"display_name":"4.8","release_image":"quay.io/openshift-release-dev/ocp-release-nightly:4.8.0-0.nightly-2021-03-16-221720","rhcos_image":"https://releases-art-rhcos.svc.ci.openshift.org/art/storage/releases/rhcos-4.8/48.83.202103122318-0/x86_64/rhcos-48.83.202103122318-0-live.x86_64.iso","rhcos_version":"46.82.202012051820-0","support_level":"production"}}

which includes the 4.8 version.

Any hints?
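
One thing that may be worth checking in that variable: the files in the listing above are named after `rhcos_version` values, and the 4.8 entry carries the same `rhcos_version` as 4.6 even though its `rhcos_image` URL points at a 48.83 build. A rough sketch of that check follows, using a trimmed copy of the variable (only `rhcos_version` is kept); `shared_rhcos_versions` is a hypothetical helper, and whether this is related to the missing `bootstrap.ign` is not established here.

```python
import json
from collections import defaultdict

# Trimmed copy of OPENSHIFT_VERSIONS from the pod: only rhcos_version is kept.
OPENSHIFT_VERSIONS = json.loads("""
{"4.6": {"rhcos_version": "46.82.202012051820-0"},
 "4.7": {"rhcos_version": "47.83.202102090044-0"},
 "4.8": {"rhcos_version": "46.82.202012051820-0"}}
""")


def shared_rhcos_versions(versions: dict) -> dict:
    """Map each rhcos_version used by more than one OpenShift version to those versions."""
    by_rhcos = defaultdict(list)
    for ocp_version, info in versions.items():
        by_rhcos[info["rhcos_version"]].append(ocp_version)
    return {rhcos: sorted(ocp) for rhcos, ocp in by_rhcos.items() if len(ocp) > 1}


shared = shared_rhcos_versions(OPENSHIFT_VERSIONS)
print(shared)
```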

Comment 5 Hao Liu 2021-04-14 16:40:09 UTC
Summary from a conversation with Steve Hardy:

This bug only happens when the BMH is edited to add the image section:
- creating it without an image and with inspection disabled does not boot the ironic agent
- creating it with an image and with inspection disabled does not boot the ironic agent

It only happens when you update the BMH to add the image afterwards.

Comment 7 Steven Hardy 2021-04-15 16:35:22 UTC
Looking at the BMO logs we see the wrong cleaning mode is used:

$ grep "setting automated" current.log  | grep yoloswag-4
2021-04-14T16:08:58.360724319Z {"level":"info","ts":1618416538.3606825,"logger":"provisioner.ironic","msg":"setting automated cleaning mode to","host":"yoloswag-4~yoloswag-4","ID":"3cac3178-46c5-4b1f-9ba3-be8e89c9140a","mode":"metadata"}

I've not yet managed to reproduce, but provided this wasn't modified by updating the CR, something must either be modifying it after creation, or a bug in BMO means we're using the wrong value.
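
For anyone checking their own logs, here is a small sketch of the same check done programmatically: split the container timestamp from the JSON payload and compare the logged `mode` with the value in the BMH spec. `parse_bmo_log_line` is a hypothetical helper; the log line is the one quoted above, and the expected value `disabled` comes from the spec in the description.

```python
import json


def parse_bmo_log_line(line: str) -> dict:
    """Split the leading container timestamp from the JSON payload and decode it."""
    timestamp, _, payload = line.partition(" ")
    record = json.loads(payload)
    record["container_ts"] = timestamp
    return record


line = ('2021-04-14T16:08:58.360724319Z {"level":"info","ts":1618416538.3606825,'
        '"logger":"provisioner.ironic","msg":"setting automated cleaning mode to",'
        '"host":"yoloswag-4~yoloswag-4","ID":"3cac3178-46c5-4b1f-9ba3-be8e89c9140a",'
        '"mode":"metadata"}')

record = parse_bmo_log_line(line)
expected_mode = "disabled"  # automatedCleaningMode from the BMH in the description
mismatch = record["mode"] != expected_mode
print(record["host"], record["mode"], "mismatch" if mismatch else "ok")
```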

Comment 8 Zane Bitter 2021-04-15 16:48:49 UTC
The cause is almost certainly that $SOMETHING built with an outdated vendored version of the BareMetalHost type is updating the BMH, and is writing it using Update() (i.e. not just patching), with the result that the cleaning mode field is set back to the default.
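
To illustrate the mechanism, here is a minimal sketch in Python (for brevity; it is not the actual controller code). All names in it are hypothetical. It assumes the stale type simply lacks `automatedCleaningMode`, and that `metadata` is the default applied when the field is omitted, which matches the `"mode":"metadata"` seen in the BMO log in comment 7.

```python
import copy
from dataclasses import asdict, dataclass, fields
from typing import Optional

ASSUMED_DEFAULT_CLEANING_MODE = "metadata"  # assumption, see the lead-in


@dataclass
class OldBareMetalHostSpec:
    """Hypothetical stale vendored type that predates automatedCleaningMode."""
    online: bool = False
    bootMACAddress: str = ""
    image: Optional[dict] = None


def decode_old(stored_spec: dict) -> OldBareMetalHostSpec:
    # Decoding into the stale type silently drops fields it does not know about.
    known = {f.name for f in fields(OldBareMetalHostSpec)}
    return OldBareMetalHostSpec(**{k: v for k, v in stored_spec.items() if k in known})


def apply_defaults(spec: dict) -> dict:
    # Stand-in for server-side defaulting of an omitted field.
    out = dict(spec)
    out.setdefault("automatedCleaningMode", ASSUMED_DEFAULT_CLEANING_MODE)
    return out


def full_update(stored_spec: dict, image: dict) -> dict:
    """Read-modify-write through the stale type: the whole spec is replaced."""
    host = decode_old(stored_spec)
    host.image = image
    rewritten = {k: v for k, v in asdict(host).items() if v is not None}
    return apply_defaults(rewritten)


def merge_patch(stored_spec: dict, image: dict) -> dict:
    """Send only the changed field: everything else in the spec is untouched."""
    patched = copy.deepcopy(stored_spec)
    patched["image"] = image
    return apply_defaults(patched)


stored = {
    "online": True,
    "bootMACAddress": "52:54:00:76:8f:15",
    "automatedCleaningMode": "disabled",
}
image = {"format": "live-iso", "url": "http://example.invalid/image"}  # placeholder URL

updated = full_update(stored, image)
patched = merge_patch(stored, image)
print(updated["automatedCleaningMode"])  # metadata
print(patched["automatedCleaningMode"])  # disabled
```

The full update loses the field because the stale type never held it, while the patch never touches it.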

Comment 9 Steven Hardy 2021-04-21 15:13:26 UTC
The AI part of this will be resolved via https://github.com/openshift/assisted-service/pull/1512

Comment 10 Steven Hardy 2021-04-21 15:22:26 UTC
I created https://github.com/openshift/cluster-api-provider-baremetal/pull/149 to resolve the CAPBM vendor update. In my environment I think that's what overwrote the spec, but I suspect in the reporter's case it was the assisted-service controller.

Comment 15 errata-xmlrpc 2021-07-27 23:00:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438