Bug 2104117 - Spoke BMH stuck “available” after changing a BIOS attribute via the converged workflow
Summary: Spoke BMH stuck “available” after changing a BIOS attribute via the converged...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Advanced Cluster Management for Kubernetes
Classification: Red Hat
Component: Infrastructure Operator
Version: rhacm-2.6
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: rhacm-2.6
Assignee: Dmitry Tantsur
QA Contact: Chad Crum
Docs Contact: Derek
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-07-05 14:54 UTC by tali@redhat.com
Modified: 2022-09-06 22:34 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-06 22:33:25 UTC
Target Upstream Version:
Embargoed:
cbynum: rhacm-2.6+
cbynum: rhacm-2.6.z+


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift assisted-service pull 4090 | 0 | None | Draft | Bug 2104117: BMH stuck “available” after changing a BIOS attribute via the converged workflow | 2022-07-07 16:13:47 UTC
Github | stolostron backlog issues 24093 | 0 | None | None | None | 2022-07-07 17:58:47 UTC
Red Hat Issue Tracker | MGMTBUGSM-461 | 0 | None | None | None | 2022-07-07 13:32:27 UTC
Red Hat Product Errata | RHSA-2022:6370 | 0 | None | None | None | 2022-09-06 22:34:29 UTC

Description tali@redhat.com 2022-07-05 14:54:30 UTC
Description of problem:

Deploy a spoke cluster with a HostFirmwareSettings CR via the converged workflow. The BMH is stuck in the “available” state after a BIOS attribute is successfully changed during deployment.

cat HostFirmwareSettings.yaml
apiVersion: metal3.io/v1alpha1
kind: HostFirmwareSettings
metadata:
  name: "cnfde11.ptp.lab.eng.bos.redhat.com"
  namespace: "cnfde11"
spec:
  settings:
    PowerButtonFunction: "4 Seconds Override"

oc get bmh -n cnfde11
NAME                                 STATE       CONSUMER   ONLINE   ERROR   AGE
cnfde11.ptp.lab.eng.bos.redhat.com   available              true             70m

Ironic log shows applying the configuration change:
2022-07-05 13:47:57.713 1 DEBUG ironic.drivers.modules.redfish.bios [req-e29ff593-3e6f-4dd3-a34f-06568e01d2da - - - - -] Apply BIOS configuration for node 93422fec-4ada-4009-becb-a8799ce3f7c3: [{'name': 'PowerButtonFunction', 'value': '4 Seconds Override'}] apply_configuration /usr/lib/python3.6/site-packages/ironic/drivers/modules/redfish/bios.py:230
2022-07-05 13:47:57.715 1 DEBUG sushy.connector [req-e29ff593-3e6f-4dd3-a34f-06568e01d2da - - - - -] HTTP request: PATCH https://10.16.231.98/redfish/v1/Systems/1/Bios/SD; headers: {'Content-Type': 'application/json', 'OData-Version': '4.0'}; body: {'Attributes': {'PowerButtonFunction': '4 Seconds Override'}}; blocking: False; timeout: 60; session arguments: {}; _op /usr/lib/python3.6/site-packages/sushy/connector.py:111
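
For readers comparing the CR with the log lines above, here is a minimal sketch of how the two logged payloads relate to spec.settings. It is illustrative only: the helper names are hypothetical and are not functions from Ironic, sushy, or the baremetal-operator. It only reshapes the settings map into the list form logged by ironic.drivers.modules.redfish.bios and the 'Attributes' body logged by sushy.connector.

```python
import json


def ironic_bios_settings(hfs: dict) -> list:
    """Hypothetical helper: turn spec.settings into the
    [{'name': ..., 'value': ...}] list seen in the Ironic log line."""
    settings = hfs.get("spec", {}).get("settings") or {}
    return [{"name": name, "value": value} for name, value in settings.items()]


def redfish_patch_body(hfs: dict) -> dict:
    """Hypothetical helper: build the {'Attributes': {...}} body seen in
    the sushy PATCH request to .../Bios/SD."""
    return {
        "Attributes": {s["name"]: s["value"] for s in ironic_bios_settings(hfs)}
    }


# The HostFirmwareSettings CR from this report, as a plain dict.
hfs = {
    "apiVersion": "metal3.io/v1alpha1",
    "kind": "HostFirmwareSettings",
    "metadata": {
        "name": "cnfde11.ptp.lab.eng.bos.redhat.com",
        "namespace": "cnfde11",
    },
    "spec": {"settings": {"PowerButtonFunction": "4 Seconds Override"}},
}

print(json.dumps(ironic_bios_settings(hfs)))
print(json.dumps(redfish_patch_body(hfs)))
```

Both outputs match the payloads in the log excerpt, which is consistent with the BIOS change itself having been submitted correctly.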

Version-Release number of selected component (if applicable):
- Latest upstream assisted-service-operator
- OCP 4.11 on hub (4.11.0-fc.3)
- 4.10 spoke


How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP 4.11 hub with upstream assisted-service-operator
2. Deploy a spoke using manually created CRs, including a HostFirmwareSettings CR


Actual results:
BMH stuck "available"


Expected results:
The SuperMicro server is deployed as expected


Additional info:

Comment 1 tali@redhat.com 2022-07-05 14:58:14 UTC
The must-gather is available: https://drive.google.com/file/d/1c_7Eg5-6Vf6YSzPyjJjYSRlmhPgAYrkm/view?usp=sharing

Comment 3 Dmitry Tantsur 2022-07-05 15:58:43 UTC
> operationalStatus: detached

This is suspicious: reconciliation won't happen for detached nodes.

Comment 5 Eran Cohen 2022-07-05 17:26:11 UTC
According to the agent CR this host was successfully installed:

[root@cnfdt08-installer ~]# oc get agent -A
NAMESPACE   NAME                                   CLUSTER   APPROVED   ROLE     STAGE
cnfde11     aba3d84c-44c5-f521-e1b8-f24d29c26080   cnfde11   true       master   Done


It is unclear why the BMH is still “available” rather than “provisioned” and detached.

From the host motd we can see that the host is no longer running the discovery ISO and has completed the installation successfully.

Since the Ironic agent is the one that starts the assisted-agent and also the one that reboots the node, I don't think this is related to the assisted part of the converged flow.

Comment 7 tali@redhat.com 2022-07-06 16:34:49 UTC
The BMH was detached while it was still in the “preparing” state.

oc describe bmh -n cnfde11 cnfde11.ptp.lab.eng.bos.redhat.com
Name:         cnfde11.ptp.lab.eng.bos.redhat.com
Namespace:    cnfde11
Labels:       infraenvs.agent-install.openshift.io=cnfde11
Annotations:  argocd.argoproj.io/sync-wave: 1
              baremetalhost.metal3.io/detached: assisted-service-controller
              bmac.agent-install.openshift.io/hostname: cnfde11.ptp.lab.eng.bos.redhat.com
              bmac.agent-install.openshift.io/role: master
              ran.openshift.io/ztp-gitops-generated: {}
API Version:  metal3.io/v1alpha1
Kind:         BareMetalHost
Metadata:
  Creation Timestamp:  2022-07-06T15:54:16Z
  Finalizers:
    baremetalhost.metal3.io
  Generation:  2
  Managed Fields:
    API Version:  metal3.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:baremetalhost.metal3.io/detached:
      f:spec:
        f:customDeploy:
          .:
          f:method:
    Manager:      assisted-service
    Operation:    Update
    Time:         2022-07-06T15:54:16Z
    API Version:  metal3.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"baremetalhost.metal3.io":
    Manager:      baremetal-operator
    Operation:    Update
    Time:         2022-07-06T15:54:16Z
    API Version:  metal3.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:argocd.argoproj.io/sync-wave:
          f:bmac.agent-install.openshift.io/hostname:
          f:bmac.agent-install.openshift.io/role:
          f:kubectl.kubernetes.io/last-applied-configuration:
          f:ran.openshift.io/ztp-gitops-generated:
        f:labels:
          .:
          f:infraenvs.agent-install.openshift.io:
      f:spec:
        .:
        f:automatedCleaningMode:
        f:bmc:
          .:
          f:address:
          f:credentialsName:
          f:disableCertificateVerification:
        f:bootMACAddress:
        f:bootMode:
        f:online:
        f:rootDeviceHints:
          .:
          f:deviceName:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2022-07-06T15:54:16Z
    API Version:  metal3.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:errorCount:
        f:errorMessage:
        f:goodCredentials:
          .:
          f:credentials:
            .:
            f:name:
            f:namespace:
          f:credentialsVersion:
        f:hardware:
          .:
          f:cpu:
            .:
            f:arch:
            f:clockMegahertz:
            f:count:
            f:flags:
            f:model:
          f:firmware:
            .:
            f:bios:
              .:
              f:date:
              f:vendor:
              f:version:
          f:hostname:
          f:nics:
          f:ramMebibytes:
          f:storage:
          f:systemVendor:
            .:
            f:manufacturer:
            f:productName:
            f:serialNumber:
        f:hardwareProfile:
        f:lastUpdated:
        f:operationHistory:
          .:
          f:deprovision:
            .:
            f:end:
            f:start:
          f:inspect:
            .:
            f:end:
            f:start:
          f:provision:
            .:
            f:end:
            f:start:
          f:register:
            .:
            f:end:
            f:start:
        f:operationalStatus:
        f:poweredOn:
        f:provisioning:
          .:
          f:ID:
          f:bootMode:
          f:image:
            .:
            f:url:
          f:raid:
            .:
            f:hardwareRAIDVolumes:
            f:softwareRAIDVolumes:
          f:rootDeviceHints:
            .:
            f:deviceName:
          f:state:
        f:triedCredentials:
          .:
          f:credentials:
            .:
            f:name:
            f:namespace:
          f:credentialsVersion:
    Manager:         baremetal-operator
    Operation:       Update
    Subresource:     status
    Time:            2022-07-06T16:07:48Z
  Resource Version:  5800011
  UID:               bc5e2d8d-1ce3-4936-a064-806052192629
Spec:
  Automated Cleaning Mode:  disabled
  Bmc:
    Address:                           redfish-virtualmedia+https://10.16.231.98/redfish/v1/Systems/1
    Credentials Name:                  bmh-secret
    Disable Certificate Verification:  true
  Boot MAC Address:                    3c:ec:ef:1e:d3:5e
  Boot Mode:                           UEFI
  Custom Deploy:
    Method:  start_assisted_install
  Online:    true
  Root Device Hints:
    Device Name:  /dev/sdb
Status:
  Error Count:    0
  Error Message:  
  Good Credentials:
    Credentials:
      Name:               bmh-secret
      Namespace:          cnfde11
    Credentials Version:  5792566
  Hardware:
    Cpu:
      Arch:             x86_64
      Clock Megahertz:  3900
      Count:            48
      Flags:
        3dnowprefetch
        abm
        acpi
        adx
        aes
        aperfmperf
        apic
        arat
        arch_capabilities
        arch_perfmon
        art
        avx
        avx2
        avx512_vnni
        avx512bw
        avx512cd
        avx512dq
        avx512f
        avx512vl
        bmi1
        bmi2
        bts
        cat_l3
        cdp_l3
        clflush
        clflushopt
        clwb
        cmov
        constant_tsc
        cpuid
        cpuid_fault
        cqm
        cqm_llc
        cqm_mbm_local
        cqm_mbm_total
        cqm_occup_llc
        cx16
        cx8
        dca
        de
        ds_cpl
        dtes64
        dtherm
        dts
        epb
        ept
        ept_ad
        erms
        est
        f16c
        flexpriority
        flush_l1d
        fma
        fpu
        fsgsbase
        fxsr
        hle
        ht
        ibpb
        ibrs
        ibrs_enhanced
        ida
        intel_ppin
        intel_pt
        invpcid
        invpcid_single
        lahf_lm
        lm
        mba
        mca
        mce
        md_clear
        mmx
        monitor
        movbe
        mpx
        msr
        mtrr
        nonstop_tsc
        nopl
        nx
        ospke
        pae
        pat
        pbe
        pcid
        pclmulqdq
        pdcm
        pdpe1gb
        pebs
        pge
        pku
        pln
        pni
        popcnt
        pse
        pse36
        pts
        rdrand
        rdseed
        rdt_a
        rdtscp
        rep_good
        sdbg
        sep
        smap
        smep
        smx
        ss
        ssbd
        sse
        sse2
        sse4_1
        sse4_2
        ssse3
        stibp
        syscall
        tm
        tm2
        tpr_shadow
        tsc
        tsc_adjust
        tsc_deadline_timer
        vme
        vmx
        vnmi
        vpid
        x2apic
        xgetbv1
        xsave
        xsavec
        xsaveopt
        xsaves
        xtopology
        xtpr
      Model:  Intel(R) Xeon(R) Gold 6212U CPU @ 2.40GHz
    Firmware:
      Bios:
        Date:     05/18/2021
        Vendor:   American Megatrends Inc.
        Version:  3.5
    Hostname:     api.cnfde11.ptp.lab.eng.bos.redhat.com
    Nics:
      Mac:          ac:1f:6b:e1:1d:d2
      Model:        0x8086 0x158b
      Name:         ens2f0
      Ip:           10.16.231.52
      Mac:          3c:ec:ef:1e:d3:5e
      Model:        0x8086 0x37d2
      Name:         eno1
      Mac:          ac:1f:6b:e1:1d:d3
      Model:        0x8086 0x158b
      Name:         ens2f1
      Mac:          3c:ec:ef:1e:d3:5f
      Model:        0x8086 0x37d2
      Name:         eno2
    Ram Mebibytes:  98304
    Storage:
      Model:       INTEL SSDPELKX010T8
      Name:        /dev/nvme0n1
      Size Bytes:  1000204886016
      Type:        NVME
    System Vendor:
      Manufacturer:   Supermicro
      Product Name:   Super Server (To be filled by O.E.M.)
      Serial Number:  SHUBIWC00001
  Hardware Profile:   unknown
  Last Updated:       2022-07-06T16:07:48Z
  Operation History:
    Deprovision:
      End:    <nil>
      Start:  <nil>
    Inspect:
      End:    2022-07-06T16:07:48Z
      Start:  2022-07-06T15:54:38Z
    Provision:
      End:    <nil>
      Start:  <nil>
    Register:
      End:             2022-07-06T15:54:38Z
      Start:           2022-07-06T15:54:16Z
  Operational Status:  OK
  Powered On:          false
  Provisioning:
    ID:         ffc5213a-8271-4942-982e-7831d543efdd
    Boot Mode:  UEFI
    Image:
      URL:  
    Raid:
      Hardware RAID Volumes:  <nil>
      Software RAID Volumes:
    Root Device Hints:
      Device Name:  /dev/sdb
    State:          preparing
  Tried Credentials:
    Credentials:
      Name:               bmh-secret
      Namespace:          cnfde11
    Credentials Version:  5792566
Events:
  Type    Reason              Age   From                         Message
  ----    ------              ----  ----                         -------
  Normal  Registered          39m   metal3-baremetal-controller  Registered new host
  Normal  BMCAccessValidated  38m   metal3-baremetal-controller  Verified access to BMC
  Normal  InspectionStarted   38m   metal3-baremetal-controller  Hardware inspection started
  Normal  InspectionComplete  25m   metal3-baremetal-controller  Hardware inspection completed
  Normal  ProfileSet          25m   metal3-baremetal-controller  Hardware profile set: unknown
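
The condition visible in the output above (the baremetalhost.metal3.io/detached annotation is set while Provisioning State is still “preparing”) can be checked mechanically. The following is a minimal sketch, not code from BMAC or the baremetal-operator: the function name is hypothetical, and treating “provisioned” as the only state in which the annotation is expected is an assumption made for this flow.

```python
DETACHED_ANNOTATION = "baremetalhost.metal3.io/detached"

# Assumption for this sketch: in the converged flow the host should only be
# detached once it has reached the "provisioned" state.
EXPECTED_STATE_WHEN_DETACHED = "provisioned"


def is_prematurely_detached(bmh: dict) -> bool:
    """Hypothetical check: True if the BMH carries the detached annotation
    while its provisioning state is not yet "provisioned"."""
    annotations = bmh.get("metadata", {}).get("annotations") or {}
    provisioning = bmh.get("status", {}).get("provisioning") or {}
    state = provisioning.get("state", "")
    return DETACHED_ANNOTATION in annotations and state != EXPECTED_STATE_WHEN_DETACHED


# Values taken from the describe output in this comment.
bmh = {
    "metadata": {
        "name": "cnfde11.ptp.lab.eng.bos.redhat.com",
        "namespace": "cnfde11",
        "annotations": {DETACHED_ANNOTATION: "assisted-service-controller"},
    },
    "status": {"provisioning": {"state": "preparing"}},
}

print(is_prematurely_detached(bmh))
```

The same function could be applied to each entry under items in the output of oc get bmh -n cnfde11 -o json, parsed with json.load.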

Comment 8 Eran Cohen 2022-07-11 12:52:02 UTC
After fixing the premature detach annotation in BMAC, there is still an issue.
@tali:
I looked at the Ironic logs from the previous tests; the clean_step timed out there as well. The only difference is that the BMH is stuck in the “provisioning” state (instead of the “available” state without the AI changes). I think we need something to drive Ironic cleaning to completion and the BMH transition to “provisioned”.

Comment 9 tali@redhat.com 2022-07-11 13:37:33 UTC
The Ironic clean_step times out after the host is rebooted and installation completes. With the premature detachment fixed, the BMH is stuck in the “provisioning” state. I will raise a new BZ to track the new issue.

Comment 10 tali@redhat.com 2022-07-12 13:49:07 UTC
The BMH is no longer stuck in the “available” state after fixing the premature detach annotation in BMAC. It is now stuck in the “provisioning” state (tracked by bug 2106378).

Comment 11 Chad Crum 2022-07-19 17:12:41 UTC
Will move to verified per previous note.

Comment 14 errata-xmlrpc 2022-09-06 22:33:25 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Advanced Cluster Management 2.6.0 security updates and bug fixes), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6370