Bug 2082601

Summary: Ironic-conductor resetting a machine that was removed from hub cluster
Product: OpenShift Container Platform
Reporter: Alex Krzos <akrzos>
Component: Bare Metal Hardware Provisioning
Assignee: Tomas Sedovic <tsedovic>
Bare Metal Hardware Provisioning sub component: ironic
QA Contact: Amit Ugol <augol>
Status: CLOSED INSUFFICIENT_DATA
Severity: unspecified    
Priority: unspecified
CC: derekh, imiller, rpittau
Version: 4.10   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2022-06-02 08:02:49 UTC
Type: Bug

Description Alex Krzos 2022-05-06 14:16:39 UTC
Description of problem:
While provisioning many OCP SNO clusters using Zero Touch Provisioning (ZTP) with ACM, I attempted to have failed clusters re-provisioned by removing them from the hub cluster. While they were removed, I powered down the machines; however, it seems that Ironic conductor or a metal3 component continued to power the machines back on even though all references to them had been removed.

Version-Release number of selected component (if applicable):
Hub cluster 4.10.8
SNO clusters 4.9.26
ACM - 2.5.0-DOWNSTREAM-2022-05-04-04-34-55

How reproducible:
Unclear; I have attempted to re-provision failed clusters in only one test so far.

Steps to Reproduce:
1.
2.
3.

Actual results:
After the SNO definition was removed from ZTP, GitOps resynced to the hub cluster, removing the namespace, bmh, infraenv, agentclusterinstall, nmstateconfig, and the finalizer so that the namespace could finish terminating. After the namespace terminated, I noticed that the VMs that had been referenced by the bmh objects were turned on. I powered them down, but moments later they were powered on again. Watching the sushy-emulator logs, we saw that something from the hub cluster was resetting the power on those machines:


[root@f35-h17-000-r640 ~]# virsh destroy sno00045
Domain sno00045 destroyed

[root@f35-h17-000-r640 ~]# journalctl -f | grep post -i
May 05 20:03:58 f35-h17-000-r640.rdu2.scalelab.redhat.com sushy-emulator[279756]: fc00:1000::5 - - [05/May/2022 20:03:58] "POST /redfish/v1/Systems/fd1f3a27-b58d-582d-b234-a3f786af806e/Actions/ComputerSystem.Reset HTTP/1.1" 204 -

In the above snippet, you can see that the VM was powered off (the destroy operation) and that a reset was subsequently issued to the Redfish API to power the VM back on.
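
To watch for further resets without relying on a loose grep, the journal lines could be filtered with a small script. This is only a sketch: it assumes the sushy-emulator access-log format shown in the snippet above, and the names RESET_RE and find_reset_requests are hypothetical, not part of any tool mentioned in this report.

```python
import re

# Assumption: journal lines follow the sushy-emulator access-log format
# quoted in this report (client address, bracketed date, request, status).
RESET_RE = re.compile(
    r'sushy-emulator\[\d+\]: (?P<client>\S+) - - \[(?P<when>[^\]]+)\] '
    r'"POST /redfish/v1/Systems/(?P<system>[0-9a-f-]+)'
    r'/Actions/ComputerSystem\.Reset HTTP/[\d.]+" (?P<status>\d{3})'
)


def find_reset_requests(lines):
    """Return one dict per Redfish reset POST found in the given lines."""
    hits = []
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            hits.append(match.groupdict())
    return hits


# The journal line quoted in this report, split only for readability.
sample = [
    'May 05 20:03:58 f35-h17-000-r640.rdu2.scalelab.redhat.com '
    'sushy-emulator[279756]: fc00:1000::5 - - [05/May/2022 20:03:58] '
    '"POST /redfish/v1/Systems/fd1f3a27-b58d-582d-b234-a3f786af806e'
    '/Actions/ComputerSystem.Reset HTTP/1.1" 204 -',
]

for hit in find_reset_requests(sample):
    print(hit["when"], hit["client"], hit["system"], hit["status"])
```

The client address in each hit (fc00:1000::5 here) is the piece that points back at the hub cluster as the source of the reset.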


Expected results:
Once an SNO definition is removed via ZTP, Ironic should not interact with the bare metal host for that SNO machine at all.


Additional info:

Forcing the metal3 pod with ironic-conductor running to be recreated resolved the issue.

oc delete po -n openshift-machine-api -l baremetal.openshift.io/cluster-baremetal-operator=metal3-state --all

Logs that seemed to show the sno/vm being powered back on:

2022-05-05 20:03:57.587 1 DEBUG sushy.resources.base [req-c9ea00cb-8829-480d-8c59-c8adb60bb61b - - - - -] Received representation of System /redfish/v1/Systems/fd1f3a27-b58d-582d-b234-a3f786af806e: {'_actions': {'reset': {'allowed_values': ['On', 'ForceOff', 'GracefulShutdown', 'GracefulRestart', 'ForceRestart', 'Nmi', 'ForceOn'], 'operation_apply_time_support': None, 'target_uri': '/redfish/v1/Systems/fd1f3a27-b58d-582d-b234-a3f786af806e/Actions/ComputerSystem.Reset'}}, '_oem_vendors': None, 'asset_tag': None, 'bios_version': None, 'boot': {'allowed_values': ['Pxe', 'Cd', 'Hdd'], 'enabled': <BootSourceOverrideEnabled.CONTINUOUS: 'Continuous'>, 'mode': <BootSourceOverrideMode.UEFI: 'UEFI'>, 'target': <BootSource.HDD: 'Hdd'>}, 'description': None, 'hostname': None, 'identity': 'fd1f3a27-b58d-582d-b234-a3f786af806e', 'indicator_led': <IndicatorLED.LIT: 'Lit'>, 'links': {'oem_vendors': None}, 'maintenance_window': None, 'manufacturer': 'Sushy Emulator', 'memory_summary': {'health': None, 'size_gib': 18}, 'name': 'sno00045', 'part_number': None, 'power_state': <PowerState.OFF: 'Off'>, 'serial_number': None, 'sku': None, 'status': {'health': <Health.OK: 'OK'>, 'health_rollup': None, 'state': <State.ENABLED: 'Enabled'>}, 'system_type': None, 'uuid': 'fd1f3a27-b58d-582d-b234-a3f786af806e'} refresh /usr/lib/python3.6/site-packages/sushy/resources/base.py:656
2022-05-05 20:03:57.699 1 DEBUG sushy.resources.base [req-c9ea00cb-8829-480d-8c59-c8adb60bb61b - - - - -] Received representation of System /redfish/v1/Systems/fd1f3a27-b58d-582d-b234-a3f786af806e: {'_actions': {'reset': {'allowed_values': ['On', 'ForceOff', 'GracefulShutdown', 'GracefulRestart', 'ForceRestart', 'Nmi', 'ForceOn'], 'operation_apply_time_support': None, 'target_uri': '/redfish/v1/Systems/fd1f3a27-b58d-582d-b234-a3f786af806e/Actions/ComputerSystem.Reset'}}, '_oem_vendors': None, 'asset_tag': None, 'bios_version': None, 'boot': {'allowed_values': ['Pxe', 'Cd', 'Hdd'], 'enabled': <BootSourceOverrideEnabled.CONTINUOUS: 'Continuous'>, 'mode': <BootSourceOverrideMode.UEFI: 'UEFI'>, 'target': <BootSource.HDD: 'Hdd'>}, 'description': None, 'hostname': None, 'identity': 'fd1f3a27-b58d-582d-b234-a3f786af806e', 'indicator_led': <IndicatorLED.LIT: 'Lit'>, 'links': {'oem_vendors': None}, 'maintenance_window': None, 'manufacturer': 'Sushy Emulator', 'memory_summary': {'health': None, 'size_gib': 18}, 'name': 'sno00045', 'part_number': None, 'power_state': <PowerState.OFF: 'Off'>, 'serial_number': None, 'sku': None, 'status': {'health': <Health.OK: 'OK'>, 'health_rollup': None, 'state': <State.ENABLED: 'Enabled'>}, 'system_type': None, 'uuid': 'fd1f3a27-b58d-582d-b234-a3f786af806e'} refresh /usr/lib/python3.6/site-packages/sushy/resources/base.py:656
2022-05-05 20:03:57.812 1 DEBUG sushy.resources.base [req-c9ea00cb-8829-480d-8c59-c8adb60bb61b - - - - -] Received representation of System /redfish/v1/Systems/fd1f3a27-b58d-582d-b234-a3f786af806e: {'_actions': {'reset': {'allowed_values': ['On', 'ForceOff', 'GracefulShutdown', 'GracefulRestart', 'ForceRestart', 'Nmi', 'ForceOn'], 'operation_apply_time_support': None, 'target_uri': '/redfish/v1/Systems/fd1f3a27-b58d-582d-b234-a3f786af806e/Actions/ComputerSystem.Reset'}}, '_oem_vendors': None, 'asset_tag': None, 'bios_version': None, 'boot': {'allowed_values': ['Pxe', 'Cd', 'Hdd'], 'enabled': <BootSourceOverrideEnabled.CONTINUOUS: 'Continuous'>, 'mode': <BootSourceOverrideMode.UEFI: 'UEFI'>, 'target': <BootSource.HDD: 'Hdd'>}, 'description': None, 'hostname': None, 'identity': 'fd1f3a27-b58d-582d-b234-a3f786af806e', 'indicator_led': <IndicatorLED.LIT: 'Lit'>, 'links': {'oem_vendors': None}, 'maintenance_window': None, 'manufacturer': 'Sushy Emulator', 'memory_summary': {'health': None, 'size_gib': 18}, 'name': 'sno00045', 'part_number': None, 'power_state': <PowerState.OFF: 'Off'>, 'serial_number': None, 'sku': None, 'status': {'health': <Health.OK: 'OK'>, 'health_rollup': None, 'state': <State.ENABLED: 'Enabled'>}, 'system_type': None, 'uuid': 'fd1f3a27-b58d-582d-b234-a3f786af806e'} refresh /usr/lib/python3.6/site-packages/sushy/resources/base.py:656
2022-05-05 20:03:57.935 1 DEBUG sushy.resources.base [req-c9ea00cb-8829-480d-8c59-c8adb60bb61b - - - - -] Received representation of System /redfish/v1/Systems/fd1f3a27-b58d-582d-b234-a3f786af806e: {'_actions': {'reset': {'allowed_values': ['On', 'ForceOff', 'GracefulShutdown', 'GracefulRestart', 'ForceRestart', 'Nmi', 'ForceOn'], 'operation_apply_time_support': None, 'target_uri': '/redfish/v1/Systems/fd1f3a27-b58d-582d-b234-a3f786af806e/Actions/ComputerSystem.Reset'}}, '_oem_vendors': None, 'asset_tag': None, 'bios_version': None, 'boot': {'allowed_values': ['Pxe', 'Cd', 'Hdd'], 'enabled': <BootSourceOverrideEnabled.CONTINUOUS: 'Continuous'>, 'mode': <BootSourceOverrideMode.UEFI: 'UEFI'>, 'target': <BootSource.HDD: 'Hdd'>}, 'description': None, 'hostname': None, 'identity': 'fd1f3a27-b58d-582d-b234-a3f786af806e', 'indicator_led': <IndicatorLED.LIT: 'Lit'>, 'links': {'oem_vendors': None}, 'maintenance_window': None, 'manufacturer': 'Sushy Emulator', 'memory_summary': {'health': None, 'size_gib': 18}, 'name': 'sno00045', 'part_number': None, 'power_state': <PowerState.OFF: 'Off'>, 'serial_number': None, 'sku': None, 'status': {'health': <Health.OK: 'OK'>, 'health_rollup': None, 'state': <State.ENABLED: 'Enabled'>}, 'system_type': None, 'uuid': 'fd1f3a27-b58d-582d-b234-a3f786af806e'} refresh /usr/lib/python3.6/site-packages/sushy/resources/base.py:656
2022-05-05 20:03:58.060 1 DEBUG sushy.resources.base [req-c9ea00cb-8829-480d-8c59-c8adb60bb61b - - - - -] Received representation of System /redfish/v1/Systems/fd1f3a27-b58d-582d-b234-a3f786af806e: {'_actions': {'reset': {'allowed_values': ['On', 'ForceOff', 'GracefulShutdown', 'GracefulRestart', 'ForceRestart', 'Nmi', 'ForceOn'], 'operation_apply_time_support': None, 'target_uri': '/redfish/v1/Systems/fd1f3a27-b58d-582d-b234-a3f786af806e/Actions/ComputerSystem.Reset'}}, '_oem_vendors': None, 'asset_tag': None, 'bios_version': None, 'boot': {'allowed_values': ['Pxe', 'Cd', 'Hdd'], 'enabled': <BootSourceOverrideEnabled.CONTINUOUS: 'Continuous'>, 'mode': <BootSourceOverrideMode.UEFI: 'UEFI'>, 'target': <BootSource.HDD: 'Hdd'>}, 'description': None, 'hostname': None, 'identity': 'fd1f3a27-b58d-582d-b234-a3f786af806e', 'indicator_led': <IndicatorLED.LIT: 'Lit'>, 'links': {'oem_vendors': None}, 'maintenance_window': None, 'manufacturer': 'Sushy Emulator', 'memory_summary': {'health': None, 'size_gib': 18}, 'name': 'sno00045', 'part_number': None, 'power_state': <PowerState.OFF: 'Off'>, 'serial_number': None, 'sku': None, 'status': {'health': <Health.OK: 'OK'>, 'health_rollup': None, 'state': <State.ENABLED: 'Enabled'>}, 'system_type': None, 'uuid': 'fd1f3a27-b58d-582d-b234-a3f786af806e'} refresh /usr/lib/python3.6/site-packages/sushy/resources/base.py:656
2022-05-05 20:04:00.041 1 DEBUG sushy.resources.base [-] Received representation of System /redfish/v1/Systems/fd1f3a27-b58d-582d-b234-a3f786af806e: {'_actions': {'reset': {'allowed_values': ['On', 'ForceOff', 'GracefulShutdown', 'GracefulRestart', 'ForceRestart', 'Nmi', 'ForceOn'], 'operation_apply_time_support': None, 'target_uri': '/redfish/v1/Systems/fd1f3a27-b58d-582d-b234-a3f786af806e/Actions/ComputerSystem.Reset'}}, '_oem_vendors': None, 'asset_tag': None, 'bios_version': None, 'boot': {'allowed_values': ['Pxe', 'Cd', 'Hdd'], 'enabled': <BootSourceOverrideEnabled.CONTINUOUS: 'Continuous'>, 'mode': <BootSourceOverrideMode.UEFI: 'UEFI'>, 'target': <BootSource.HDD: 'Hdd'>}, 'description': None, 'hostname': None, 'identity': 'fd1f3a27-b58d-582d-b234-a3f786af806e', 'indicator_led': <IndicatorLED.LIT: 'Lit'>, 'links': {'oem_vendors': None}, 'maintenance_window': None, 'manufacturer': 'Sushy Emulator', 'memory_summary': {'health': None, 'size_gib': 18}, 'name': 'sno00045', 'part_number': None, 'power_state': <PowerState.ON: 'On'>, 'serial_number': None, 'sku': None, 'status': {'health': <Health.OK: 'OK'>, 'health_rollup': None, 'state': <State.ENABLED: 'Enabled'>}, 'system_type': None, 'uuid': 'fd1f3a27-b58d-582d-b234-a3f786af806e'} refresh /usr/lib/python3.6/site-packages/sushy/resources/base.py:656
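
The Off to On change in the lines above is easy to miss because each entry is so long. As a rough sketch, the power_state field could be pulled out of each entry and only the transitions reported. This assumes the ironic-conductor DEBUG format shown above; STATE_RE and power_transitions are hypothetical names, and the sample entries are abbreviated copies of the last two log lines, with most of the representation dict trimmed.

```python
import re

# Assumption: entries look like the sushy.resources.base DEBUG lines in
# this report, with the timestamp first and power_state inside the dict.
STATE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*"
    r"Received representation of System (?P<system>\S+): .*"
    r"'power_state': <PowerState\.\w+: '(?P<state>[^']+)'>"
)


def power_transitions(lines):
    """Yield (timestamp, system, old_state, new_state) on each state change."""
    last_seen = {}
    for line in lines:
        match = STATE_RE.search(line)
        if not match:
            continue
        system, state = match["system"], match["state"]
        if system in last_seen and last_seen[system] != state:
            yield match["ts"], system, last_seen[system], state
        last_seen[system] = state


# Abbreviated copies of the 20:03:58.060 and 20:04:00.041 entries.
sample = [
    "2022-05-05 20:03:58.060 1 DEBUG sushy.resources.base "
    "[req-c9ea00cb-8829-480d-8c59-c8adb60bb61b - - - - -] Received "
    "representation of System "
    "/redfish/v1/Systems/fd1f3a27-b58d-582d-b234-a3f786af806e: "
    "{'name': 'sno00045', 'power_state': <PowerState.OFF: 'Off'>}",
    "2022-05-05 20:04:00.041 1 DEBUG sushy.resources.base [-] Received "
    "representation of System "
    "/redfish/v1/Systems/fd1f3a27-b58d-582d-b234-a3f786af806e: "
    "{'name': 'sno00045', 'power_state': <PowerState.ON: 'On'>}",
]

for ts, system, old, new in power_transitions(sample):
    print(f"{ts} {system}: {old} -> {new}")
```

On the full log, the only transition for sno00045 falls between 20:03:58.060 and 20:04:00.041, which brackets the 20:03:58 reset POST recorded by sushy-emulator.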

Comment 1 Derek Higgins 2022-05-10 16:26:02 UTC
Hi Alex, could you provide a must-gather with the logs? We'd mainly be interested in seeing the entire ironic and baremetal-operator logs.

Comment 2 Riccardo Pittau 2022-06-02 08:02:49 UTC
Closing this, as we don't have enough info to move forward with troubleshooting or to reproduce the issue.
Please re-open the BZ or open a new one if you encounter the same issue again.