Bug 2104117
| Summary: | Spoke BMH stuck “available” after changing a BIOS attribute via the converged workflow | ||
|---|---|---|---|
| Product: | Red Hat Advanced Cluster Management for Kubernetes | Reporter: | tali <tali> |
| Component: | Infrastructure Operator | Assignee: | Dmitry Tantsur <dtantsur> |
| Status: | CLOSED ERRATA | QA Contact: | Chad Crum <ccrum> |
| Severity: | high | Docs Contact: | Derek <dcadzow> |
| Priority: | high | ||
| Version: | rhacm-2.6 | CC: | bzvonar, cbynum, ccrum, ercohen, trwest, yfirst |
| Target Milestone: | --- | Flags: | cbynum: rhacm-2.6+, cbynum: rhacm-2.6.z+ |
| Target Release: | rhacm-2.6 | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-09-06 22:33:25 UTC | Type: | Bug |
The must-gather is available: https://drive.google.com/file/d/1c_7Eg5-6Vf6YSzPyjJjYSRlmhPgAYrkm/view?usp=sharing

> operationalStatus: detached
This is suspicious: reconciliation won't happen for detached nodes.
According to the agent CR this host was successfully installed:
[root@cnfdt08-installer ~]# oc get agent -A
NAMESPACE   NAME                                   CLUSTER   APPROVED   ROLE     STAGE
cnfde11     aba3d84c-44c5-f521-e1b8-f24d29c26080   cnfde11   true       master   Done

It is unclear why the BMH is still Available (not provisioned) and detached.

From the host motd we can see that the host is no longer running the discovery ISO and completed the installation successfully.

Since the Ironic agent is the one starting the assisted-agent and also the one rebooting the node, I don't think this is related to the assisted part of the converged flow.

The BMH was detached while it was still in preparing state.
oc describe bmh -n cnfde11 cnfde11.ptp.lab.eng.bos.redhat.com
Name: cnfde11.ptp.lab.eng.bos.redhat.com
Namespace: cnfde11
Labels: infraenvs.agent-install.openshift.io=cnfde11
Annotations: argocd.argoproj.io/sync-wave: 1
baremetalhost.metal3.io/detached: assisted-service-controller
bmac.agent-install.openshift.io/hostname: cnfde11.ptp.lab.eng.bos.redhat.com
bmac.agent-install.openshift.io/role: master
ran.openshift.io/ztp-gitops-generated: {}
API Version: metal3.io/v1alpha1
Kind: BareMetalHost
Metadata:
Creation Timestamp: 2022-07-06T15:54:16Z
Finalizers:
baremetalhost.metal3.io
Generation: 2
Managed Fields:
API Version: metal3.io/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
f:baremetalhost.metal3.io/detached:
f:spec:
f:customDeploy:
.:
f:method:
Manager: assisted-service
Operation: Update
Time: 2022-07-06T15:54:16Z
API Version: metal3.io/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:finalizers:
.:
v:"baremetalhost.metal3.io":
Manager: baremetal-operator
Operation: Update
Time: 2022-07-06T15:54:16Z
API Version: metal3.io/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:argocd.argoproj.io/sync-wave:
f:bmac.agent-install.openshift.io/hostname:
f:bmac.agent-install.openshift.io/role:
f:kubectl.kubernetes.io/last-applied-configuration:
f:ran.openshift.io/ztp-gitops-generated:
f:labels:
.:
f:infraenvs.agent-install.openshift.io:
f:spec:
.:
f:automatedCleaningMode:
f:bmc:
.:
f:address:
f:credentialsName:
f:disableCertificateVerification:
f:bootMACAddress:
f:bootMode:
f:online:
f:rootDeviceHints:
.:
f:deviceName:
Manager: kubectl-client-side-apply
Operation: Update
Time: 2022-07-06T15:54:16Z
API Version: metal3.io/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:status:
.:
f:errorCount:
f:errorMessage:
f:goodCredentials:
.:
f:credentials:
.:
f:name:
f:namespace:
f:credentialsVersion:
f:hardware:
.:
f:cpu:
.:
f:arch:
f:clockMegahertz:
f:count:
f:flags:
f:model:
f:firmware:
.:
f:bios:
.:
f:date:
f:vendor:
f:version:
f:hostname:
f:nics:
f:ramMebibytes:
f:storage:
f:systemVendor:
.:
f:manufacturer:
f:productName:
f:serialNumber:
f:hardwareProfile:
f:lastUpdated:
f:operationHistory:
.:
f:deprovision:
.:
f:end:
f:start:
f:inspect:
.:
f:end:
f:start:
f:provision:
.:
f:end:
f:start:
f:register:
.:
f:end:
f:start:
f:operationalStatus:
f:poweredOn:
f:provisioning:
.:
f:ID:
f:bootMode:
f:image:
.:
f:url:
f:raid:
.:
f:hardwareRAIDVolumes:
f:softwareRAIDVolumes:
f:rootDeviceHints:
.:
f:deviceName:
f:state:
f:triedCredentials:
.:
f:credentials:
.:
f:name:
f:namespace:
f:credentialsVersion:
Manager: baremetal-operator
Operation: Update
Subresource: status
Time: 2022-07-06T16:07:48Z
Resource Version: 5800011
UID: bc5e2d8d-1ce3-4936-a064-806052192629
Spec:
Automated Cleaning Mode: disabled
Bmc:
Address: redfish-virtualmedia+https://10.16.231.98/redfish/v1/Systems/1
Credentials Name: bmh-secret
Disable Certificate Verification: true
Boot MAC Address: 3c:ec:ef:1e:d3:5e
Boot Mode: UEFI
Custom Deploy:
Method: start_assisted_install
Online: true
Root Device Hints:
Device Name: /dev/sdb
Status:
Error Count: 0
Error Message:
Good Credentials:
Credentials:
Name: bmh-secret
Namespace: cnfde11
Credentials Version: 5792566
Hardware:
Cpu:
Arch: x86_64
Clock Megahertz: 3900
Count: 48
Flags:
3dnowprefetch
abm
acpi
adx
aes
aperfmperf
apic
arat
arch_capabilities
arch_perfmon
art
avx
avx2
avx512_vnni
avx512bw
avx512cd
avx512dq
avx512f
avx512vl
bmi1
bmi2
bts
cat_l3
cdp_l3
clflush
clflushopt
clwb
cmov
constant_tsc
cpuid
cpuid_fault
cqm
cqm_llc
cqm_mbm_local
cqm_mbm_total
cqm_occup_llc
cx16
cx8
dca
de
ds_cpl
dtes64
dtherm
dts
epb
ept
ept_ad
erms
est
f16c
flexpriority
flush_l1d
fma
fpu
fsgsbase
fxsr
hle
ht
ibpb
ibrs
ibrs_enhanced
ida
intel_ppin
intel_pt
invpcid
invpcid_single
lahf_lm
lm
mba
mca
mce
md_clear
mmx
monitor
movbe
mpx
msr
mtrr
nonstop_tsc
nopl
nx
ospke
pae
pat
pbe
pcid
pclmulqdq
pdcm
pdpe1gb
pebs
pge
pku
pln
pni
popcnt
pse
pse36
pts
rdrand
rdseed
rdt_a
rdtscp
rep_good
sdbg
sep
smap
smep
smx
ss
ssbd
sse
sse2
sse4_1
sse4_2
ssse3
stibp
syscall
tm
tm2
tpr_shadow
tsc
tsc_adjust
tsc_deadline_timer
vme
vmx
vnmi
vpid
x2apic
xgetbv1
xsave
xsavec
xsaveopt
xsaves
xtopology
xtpr
Model: Intel(R) Xeon(R) Gold 6212U CPU @ 2.40GHz
Firmware:
Bios:
Date: 05/18/2021
Vendor: American Megatrends Inc.
Version: 3.5
Hostname: api.cnfde11.ptp.lab.eng.bos.redhat.com
Nics:
Mac: ac:1f:6b:e1:1d:d2
Model: 0x8086 0x158b
Name: ens2f0
Ip: 10.16.231.52
Mac: 3c:ec:ef:1e:d3:5e
Model: 0x8086 0x37d2
Name: eno1
Mac: ac:1f:6b:e1:1d:d3
Model: 0x8086 0x158b
Name: ens2f1
Mac: 3c:ec:ef:1e:d3:5f
Model: 0x8086 0x37d2
Name: eno2
Ram Mebibytes: 98304
Storage:
Model: INTEL SSDPELKX010T8
Name: /dev/nvme0n1
Size Bytes: 1000204886016
Type: NVME
System Vendor:
Manufacturer: Supermicro
Product Name: Super Server (To be filled by O.E.M.)
Serial Number: SHUBIWC00001
Hardware Profile: unknown
Last Updated: 2022-07-06T16:07:48Z
Operation History:
Deprovision:
End: <nil>
Start: <nil>
Inspect:
End: 2022-07-06T16:07:48Z
Start: 2022-07-06T15:54:38Z
Provision:
End: <nil>
Start: <nil>
Register:
End: 2022-07-06T15:54:38Z
Start: 2022-07-06T15:54:16Z
Operational Status: OK
Powered On: false
Provisioning:
ID: ffc5213a-8271-4942-982e-7831d543efdd
Boot Mode: UEFI
Image:
URL:
Raid:
Hardware RAID Volumes: <nil>
Software RAID Volumes:
Root Device Hints:
Device Name: /dev/sdb
State: preparing
Tried Credentials:
Credentials:
Name: bmh-secret
Namespace: cnfde11
Credentials Version: 5792566
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Registered 39m metal3-baremetal-controller Registered new host
Normal BMCAccessValidated 38m metal3-baremetal-controller Verified access to BMC
Normal InspectionStarted 38m metal3-baremetal-controller Hardware inspection started
Normal InspectionComplete 25m metal3-baremetal-controller Hardware inspection completed
Normal ProfileSet 25m metal3-baremetal-controller Hardware profile set: unknown
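The condition visible in the describe output (the `baremetalhost.metal3.io/detached` annotation present while the provisioning state is still `preparing`) can be checked with a small script. This is a minimal sketch, not part of BMAC or the baremetal-operator: the function name `is_prematurely_detached` is hypothetical, and it assumes that any state other than `provisioned` counts as premature, which ignores other states (such as externally provisioned hosts) where detaching may be legitimate. The input dict is assumed to have the shape of the JSON returned by `oc get bmh -n cnfde11 <name> -o json`.

```python
# Annotation name as it appears in the describe output of the BMH.
DETACHED_ANNOTATION = "baremetalhost.metal3.io/detached"


def is_prematurely_detached(bmh: dict) -> bool:
    """Return True if the BMH is detached before reaching 'provisioned'.

    Assumption (not from the bug report): every provisioning state other
    than 'provisioned' is treated as too early for the detached annotation.
    """
    annotations = bmh.get("metadata", {}).get("annotations") or {}
    state = bmh.get("status", {}).get("provisioning", {}).get("state", "")
    return DETACHED_ANNOTATION in annotations and state != "provisioned"


# Values copied from the describe output of cnfde11.ptp.lab.eng.bos.redhat.com.
bmh = {
    "metadata": {
        "annotations": {DETACHED_ANNOTATION: "assisted-service-controller"}
    },
    "status": {"provisioning": {"state": "preparing"}},
}

print(is_prematurely_detached(bmh))
```

For the host in this report the check flags the BMH, which matches the observation that it was detached while still in the preparing state.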
After fixing the premature detach annotation in BMAC there is still an issue.

@tali: I looked at the Ironic logs from the previous tests; the clean_step timed out there as well. The only difference is that the BMH is now stuck in the "provisioning" state (instead of the "available" state without the AI changes). I think we need something to drive Ironic cleaning to completion and the BMH transition to “provisioned”.

The Ironic clean_step times out after the host is rebooted and the installation is completed. The BMH is stuck in the "provisioning" state after the premature detachment is fixed. I will raise a new BZ to track the new issue.

The BMH is no longer stuck in the “available” state after fixing the premature detach annotation in BMAC. It is now stuck in the “provisioning” state (tracked by 2106378). Will move to verified per previous note.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Advanced Cluster Management 2.6.0 security updates and bug fixes), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:6370
Description of problem:
Deploy a spoke cluster with a HostFirmwareSettings CR via the converged workflow. The BMH is stuck in the “available” state after successfully changing a BIOS attribute during deployment.

cat HostFirmwareSettings.yaml
apiVersion: metal3.io/v1alpha1
kind: HostFirmwareSettings
metadata:
  name: "cnfde11.ptp.lab.eng.bos.redhat.com"
  namespace: "cnfde11"
spec:
  settings:
    PowerButtonFunction: "4 Seconds Override"

oc get bmh -n cnfde11
NAME                                 STATE       CONSUMER   ONLINE   ERROR   AGE
cnfde11.ptp.lab.eng.bos.redhat.com   available              true             70m

The Ironic log shows the configuration change being applied:
2022-07-05 13:47:57.713 1 DEBUG ironic.drivers.modules.redfish.bios [req-e29ff593-3e6f-4dd3-a34f-06568e01d2da - - - - -] Apply BIOS configuration for node 93422fec-4ada-4009-becb-a8799ce3f7c3: [{'name': 'PowerButtonFunction', 'value': '4 Seconds Override'}] apply_configuration /usr/lib/python3.6/site-packages/ironic/drivers/modules/redfish/bios.py:230
2022-07-05 13:47:57.715 1 DEBUG sushy.connector [req-e29ff593-3e6f-4dd3-a34f-06568e01d2da - - - - -] HTTP request: PATCH https://10.16.231.98/redfish/v1/Systems/1/Bios/SD; headers: {'Content-Type': 'application/json', 'OData-Version': '4.0'}; body: {'Attributes': {'PowerButtonFunction': '4 Seconds Override'}}; blocking: False; timeout: 60; session arguments: {}; _op /usr/lib/python3.6/site-packages/sushy/connector.py:111

Version-Release number of selected component (if applicable):
- Latest upstream assisted-service-operator
- OCP 4.11 on hub (4.11.0-fc.3)
- 4.10 spoke

How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP 4.11 hub with upstream assisted-service-operator
2. Try to deploy spoke using manually created CRs including a HostFirmwareSettings CR

Actual results:
BMH stuck "available"

Expected results:
The SuperMicro server is deployed as expected

Additional info:
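To confirm from a larger Ironic log that the requested BIOS attribute was actually submitted, the "Apply BIOS configuration" line quoted in the description can be parsed mechanically. The following is a minimal sketch: the helper name `parse_bios_apply` and the regular expression are illustrative assumptions based only on the single log line in this report, not a documented Ironic log format, and the pattern would break if a setting value contained a `]` character.

```python
import ast
import re

# Pattern derived from the one log line in this report (an assumption,
# not a stable Ironic log contract).
APPLY_RE = re.compile(
    r"Apply BIOS configuration for node (?P<node>[0-9a-f-]+): "
    r"(?P<settings>\[.*?\])"
)


def parse_bios_apply(line: str):
    """Return (node_uuid, {attribute: value}) or None if the line does not match."""
    match = APPLY_RE.search(line)
    if match is None:
        return None
    # The settings are logged as a Python literal list of dicts.
    settings = ast.literal_eval(match.group("settings"))
    return match.group("node"), {s["name"]: s["value"] for s in settings}


# Log line copied from the description (ANSI colour residue removed).
line = (
    "2022-07-05 13:47:57.713 1 DEBUG ironic.drivers.modules.redfish.bios "
    "[req-e29ff593-3e6f-4dd3-a34f-06568e01d2da - - - - -] "
    "Apply BIOS configuration for node 93422fec-4ada-4009-becb-a8799ce3f7c3: "
    "[{'name': 'PowerButtonFunction', 'value': '4 Seconds Override'}] "
    "apply_configuration "
    "/usr/lib/python3.6/site-packages/ironic/drivers/modules/redfish/bios.py:230"
)

node, applied = parse_bios_apply(line)
print(node, applied)
```

Comparing `applied` with `spec.settings` of the HostFirmwareSettings CR shows that the BIOS change itself was requested as expected, which points the investigation at the BMH state handling rather than at the firmware step.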