Description of problem:

Replacing a master node on a baremetal IPI deployment gets stuck when deleting the machine of the unhealthy member.
https://access.redhat.com/documentation/en-us/openshift_container_platform/4.5/html-single/backup_and_restore/index#restore-replace-stopped-etcd-member_replacing-unhealthy-etcd-member

Version-Release number of selected component (if applicable):
4.6.0-rc.4

How reproducible:
100%

Steps to Reproduce:

1. Deploy a baremetal IPI environment with 3 master + 2 worker nodes:

NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   2d8h   v1.19.0+d59ce34
master-0-1   Ready    master   2d8h   v1.19.0+d59ce34
master-0-2   Ready    master   2d8h   v1.19.0+d59ce34
worker-0-0   Ready    worker   2d7h   v1.19.0+d59ce34
worker-0-1   Ready    worker   2d7h   v1.19.0+d59ce34

NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-rc.4   True        False         2d7h    Cluster version is 4.6.0-rc.4

2. Simulate an outage of one of the master nodes by powering off master-0-0 and waiting for it to go NotReady:

oc get nodes
NAME         STATUS     ROLES    AGE    VERSION
master-0-0   NotReady   master   2d8h   v1.19.0+d59ce34
master-0-1   Ready      master   2d8h   v1.19.0+d59ce34
master-0-2   Ready      master   2d8h   v1.19.0+d59ce34
worker-0-0   Ready      worker   2d7h   v1.19.0+d59ce34
worker-0-1   Ready      worker   2d7h   v1.19.0+d59ce34

3. Add a new BareMetalHost to replace master-0-0:

openshift-master-0-3-bmc-secret.yaml:

apiVersion: v1
data:
  password: cGFzc3dvcmQ=
  username: YWRtaW4=
kind: Secret
metadata:
  name: openshift-master-0-3-bmc-secret
  namespace: openshift-machine-api

openshift-master-0-3-bmh.yaml:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: openshift-master-0-3
  namespace: openshift-machine-api
spec:
  online: true
  bmc:
    address: redfish://192.168.123.1:8000/redfish/v1/Systems/347b1dc7-ca79-4ca7-a67a-9b613e3d2acd
    credentialsName: openshift-master-0-3-bmc-secret
    disableCertificateVerification: true
  bootMACAddress: 52:54:00:24:0b:ce

oc -n openshift-machine-api get bmh openshift-master-0-3
NAME                   STATUS   PROVISIONING STATUS   CONSUMER   BMC                                                                                     HARDWARE PROFILE   ONLINE   ERROR
openshift-master-0-3   OK       ready                            redfish://192.168.123.1:8000/redfish/v1/Systems/347b1dc7-ca79-4ca7-a67a-9b613e3d2acd   unknown            true

4. Follow the procedure "Replacing an unhealthy etcd member whose machine is not running or whose node is not ready" from https://access.redhat.com/documentation/en-us/openshift_container_platform/4.5/html-single/backup_and_restore/index#restore-replace-stopped-etcd-member_replacing-unhealthy-etcd-member

5. Check which member is unhealthy:

oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="EtcdMembersAvailable")]}{.message}{"\n"}'
2 of 3 members are available, master-0-0 is unhealthy

6. Determine whether the machine is not running:

oc get machines -A -ojsonpath='{range .items[*]}{@.status.nodeRef.name}{"\t"}{@.status.providerStatus.instanceState}{"\n"}' | grep -v running
master-0-0
master-0-1
master-0-2
worker-0-0
worker-0-1

7. Determine whether the node is not ready:

oc get nodes -o jsonpath='{range .items[*]}{"\n"}{.metadata.name}{"\t"}{range .spec.taints[*]}{.key}{" "}' | grep unreachable
master-0-0	node-role.kubernetes.io/master node.kubernetes.io/unreachable node.kubernetes.io/unreachable

8. Connect to a running etcd container:

oc -n openshift-etcd rsh etcd-master-0-1
9. View the member list:

sh-4.4# etcdctl member list -w table
+------------------+---------+------------+------------------------------+------------------------------+------------+
|        ID        | STATUS  |    NAME    |          PEER ADDRS          |         CLIENT ADDRS         | IS LEARNER |
+------------------+---------+------------+------------------------------+------------------------------+------------+
| 1d82590304f84714 | started | master-0-0 | https://192.168.123.127:2380 | https://192.168.123.127:2379 |      false |
| 4d97af960967d221 | started | master-0-2 | https://192.168.123.115:2380 | https://192.168.123.115:2379 |      false |
| b9f1bfa54f56d9ba | started | master-0-1 | https://192.168.123.129:2380 | https://192.168.123.129:2379 |      false |
+------------------+---------+------------+------------------------------+------------------------------+------------+

10. Remove the unhealthy etcd member:

sh-4.4# etcdctl member remove 1d82590304f84714
Member 1d82590304f84714 removed from cluster 4acece37e8b7a4db

11. View the member list again:

sh-4.4# etcdctl member list -w table
+------------------+---------+------------+------------------------------+------------------------------+------------+
|        ID        | STATUS  |    NAME    |          PEER ADDRS          |         CLIENT ADDRS         | IS LEARNER |
+------------------+---------+------------+------------------------------+------------------------------+------------+
| 4d97af960967d221 | started | master-0-2 | https://192.168.123.115:2380 | https://192.168.123.115:2379 |      false |
| b9f1bfa54f56d9ba | started | master-0-1 | https://192.168.123.129:2380 | https://192.168.123.129:2379 |      false |
+------------------+---------+------------+------------------------------+------------------------------+------------+

12. Remove the old secrets:

oc delete secret -n openshift-etcd etcd-peer-master-0-0
oc delete secret -n openshift-etcd etcd-serving-master-0-0
oc delete secret -n openshift-etcd etcd-serving-metrics-master-0-0

13. Obtain the machine for the unhealthy member:

oc get machines -n openshift-machine-api -o wide
NAME                                      PHASE     TYPE   REGION   ZONE   AGE    NODE         PROVIDERID                                                     STATE
ocp-edge-cluster-0-gbw9p-master-0         Running                          2d8h   master-0-0   baremetalhost:///openshift-machine-api/openshift-master-0-0
ocp-edge-cluster-0-gbw9p-master-1         Running                          2d8h   master-0-1   baremetalhost:///openshift-machine-api/openshift-master-0-1
ocp-edge-cluster-0-gbw9p-master-2         Running                          2d8h   master-0-2   baremetalhost:///openshift-machine-api/openshift-master-0-2
ocp-edge-cluster-0-gbw9p-worker-0-6x9r5   Running                          2d8h   worker-0-0   baremetalhost:///openshift-machine-api/openshift-worker-0-0
ocp-edge-cluster-0-gbw9p-worker-0-s8tjg   Running                          2d8h   worker-0-1   baremetalhost:///openshift-machine-api/openshift-worker-0-1
14. Delete the machine of the unhealthy member:

oc delete machine -n openshift-machine-api ocp-edge-cluster-0-gbw9p-master-0

Actual results:
The oc delete command hangs after printing:
machine.machine.openshift.io "ocp-edge-cluster-0-gbw9p-master-0" deleted

Expected results:
The machine gets deleted.

Additional info:
The machine is stuck in the Deleting phase:

oc get machines/ocp-edge-cluster-0-gbw9p-master-0 -n openshift-machine-api -o wide
NAME                                PHASE      TYPE   REGION   ZONE   AGE    NODE         PROVIDERID                                                     STATE
ocp-edge-cluster-0-gbw9p-master-0   Deleting                          2d8h   master-0-0   baremetalhost:///openshift-machine-api/openshift-master-0-0

The machine-controller logs show that the drain failed for machine "ocp-edge-cluster-0-gbw9p-master-0":

I1021 18:32:51.812051       1 controller.go:543] evicting pod openshift-etcd/etcd-quorum-guard-644f5747b8-hkzvr
I1021 18:32:51.812189       1 controller.go:543] evicting pod openshift-etcd/installer-4-master-0-0
E1021 18:32:51.819149       1 controller.go:543] error when evicting pod "etcd-quorum-guard-644f5747b8-hkzvr" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I1021 18:32:56.819943       1 controller.go:543] evicting pod openshift-etcd/etcd-quorum-guard-644f5747b8-hkzvr
E1021 18:32:56.835685       1 controller.go:543] error when evicting pod "etcd-quorum-guard-644f5747b8-hkzvr" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I1021 18:33:01.836776       1 controller.go:543] evicting pod openshift-etcd/etcd-quorum-guard-644f5747b8-hkzvr
E1021 18:33:01.843563       1 controller.go:543] error when evicting pod "etcd-quorum-guard-644f5747b8-hkzvr" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I1021 18:33:06.844114       1 controller.go:543] evicting pod openshift-etcd/etcd-quorum-guard-644f5747b8-hkzvr
E1021 18:33:06.851959       1 controller.go:543] error when evicting pod "etcd-quorum-guard-644f5747b8-hkzvr" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I1021 18:33:11.852553       1 controller.go:543] evicting pod openshift-etcd/etcd-quorum-guard-644f5747b8-hkzvr
W1021 18:33:11.852747       1 controller.go:380] drain failed for machine "ocp-edge-cluster-0-gbw9p-master-0": [error when waiting for pod "installer-4-master-0-0" terminating: global timeout reached: 20s, error when evicting pod "etcd-quorum-guard-644f5747b8-hkzvr": global timeout reached: 20s]
E1021 18:33:11.852836       1 controller.go:218] ocp-edge-cluster-0-gbw9p-master-0: failed to drain node for machine: requeue in: 20s
I1021 18:33:11.852869       1 controller.go:406] Actuator returned requeue-after error: requeue in: 20s
I1021 18:33:11.853089       1 controller.go:170] ocp-edge-cluster-0-gbw9p-master-0: reconciling Machine
I1021 18:33:11.853156       1 controller.go:210] ocp-edge-cluster-0-gbw9p-master-0: reconciling machine triggers delete

Attaching must-gather.
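For reference, a quick way to confirm on a similar cluster that the drain is blocked by the etcd-quorum-guard PodDisruptionBudget and that the machine is parked in Deleting. This is a minimal sketch using the resource names above; the deployment and container names are the usual ones in openshift-machine-api and may differ on other releases:

# With one master down, the etcd-quorum-guard PDB is expected to show
# ALLOWED DISRUPTIONS of 0, which is what blocks the eviction above.
oc get pdb -n openshift-etcd

# Confirm the machine is stuck in the Deleting phase.
oc get machine -n openshift-machine-api ocp-edge-cluster-0-gbw9p-master-0 \
  -o jsonpath='{.status.phase}{"\n"}'

# Follow the machine-controller as it retries the drain.
oc logs -n openshift-machine-api deploy/machine-api-controllers -c machine-controller -f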
Even after annotating the machine with:

oc -n openshift-machine-api annotate machine ocp-edge-cluster-0-gbw9p-master-0 machine.openshift.io/exclude-node-draining=

deleting the machine still fails with:

2020/10/21 18:57:03 Deleting machine ocp-edge-cluster-0-gbw9p-master-0 .
2020/10/21 18:57:03 deleting machine ocp-edge-cluster-0-gbw9p-master-0 using host openshift-master-0-0
2020/10/21 18:57:03 waiting for host openshift-master-0-0 to be deprovisioned
E1021 18:57:03.945743       1 controller.go:231] ocp-edge-cluster-0-gbw9p-master-0: failed to delete machine: requeue in: 30s
I1021 18:57:03.945759       1 controller.go:406] Actuator returned requeue-after error: requeue in: 30s
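To double-check what the actuator is acting on at this point, the machine object itself can be inspected. A sketch only; the exact annotations present depend on the release, so verify against your cluster:

# Confirm the exclude-node-draining annotation is actually set on the machine,
# and see which BareMetalHost the baremetal actuator has it linked to.
oc get machine -n openshift-machine-api ocp-edge-cluster-0-gbw9p-master-0 \
  -o jsonpath='{.metadata.annotations}{"\n"}'

# Check the machine's status and recent events for hints about what it is
# still waiting on.
oc describe machine -n openshift-machine-api ocp-edge-cluster-0-gbw9p-master-0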
In the baremetal-operator log we can see:

{"level":"info","ts":1603306563.9395769,"logger":"baremetalhost","msg":"Reconciling BareMetalHost","Request.Namespace":"openshift-machine-api","Request.Name":"openshift-master-0-0"}
{"level":"info","ts":1603306563.9396687,"logger":"baremetalhost_ironic","msg":"verifying ironic provisioner dependencies","host":"openshift-master-0-0"}
{"level":"info","ts":1603306564.0244021,"logger":"baremetalhost_ironic","msg":"found existing node by ID","host":"openshift-master-0-0"}
{"level":"error","ts":1603306564.0245044,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"metal3-baremetalhost-controller","request":"openshift-machine-api/openshift-master-0-0","error":"action \"externally provisioned\" failed: Invalid state for adopt: enroll","errorVerbose":"Invalid state for adopt: enroll\naction \"externally provisioned\" failed\ngithub.com/metal3-io/baremetal-operator/pkg/controller/baremetalhost.(*ReconcileBareMetalHost).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/controller/baremetalhost/baremetalhost_controller.go:307\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1374","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:258\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
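Alongside the operator log, the host's own status fields show the same stuck registration. A sketch for inspecting them; the field paths follow the metal3 BareMetalHost CRD and may need adjusting for other versions:

# Provisioning state, operational status, power state, and any recorded error
# on the BareMetalHost backing the machine that refuses to delete.
oc get bmh -n openshift-machine-api openshift-master-0-0 \
  -o jsonpath='{.status.provisioning.state}{"\t"}{.status.operationalStatus}{"\t"}{.status.poweredOn}{"\n"}{.status.errorMessage}{"\n"}'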
After manually deleting the openshift-master-0-0 BMH, the machine was finally deleted:

oc -n openshift-machine-api delete bmh openshift-master-0-0
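For completeness, a quick check (a sketch, using the names from this report) that the stale objects are really gone before enrolling the replacement host:

# The machine, host, and node for the failed master should no longer exist.
oc get machine -n openshift-machine-api ocp-edge-cluster-0-gbw9p-master-0
oc get bmh -n openshift-machine-api openshift-master-0-0
oc get node master-0-0

# Only the two healthy members should remain in etcd.
oc -n openshift-etcd rsh etcd-master-0-1
sh-4.4# etcdctl member list -w table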
Removing TestBlocker since I was able to move forward after annotating the machine with exclude-node-draining and removing the BMH.
The logs show all of the Hosts being registered into an empty Ironic right at the beginning. I'm inferring that the reason for this is that the metal3 pod was running on master-0 before it was powered down, and was rescheduled on another node afterwards.

The root cause of the problem is that Ironic is not able to connect to the BMC of the master-0 machine right from the beginning:

2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager [req-35cd7821-0765-466e-b767-5ce56b8bc603 ironic-user - - - -] Failed to get power state for node 1c121c00-74dc-4b72-b8c5-56a14bd06bae. Error: HTTP GET https://192.168.123.1:8000/redfish/v1/Systems/7691867f-577d-4134-be10-87054201595f returned code 500. Base.1.0.GeneralError: Domain not found: no domain with matching uuid '7691867f-577d-4134-be10-87054201595f' (master-0-0): sushy.exceptions.ServerSideError: HTTP GET https://192.168.123.1:8000/redfish/v1/Systems/7691867f-577d-4134-be10-87054201595f returned code 500. Base.1.0.GeneralError: Domain not found: no domain with matching uuid '7691867f-577d-4134-be10-87054201595f' (master-0-0)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager Traceback (most recent call last):
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/ironic/conductor/manager.py", line 1240, in _do_node_verify
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     power_state = task.driver.power.get_power_state(task)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/redfish/power.py", line 112, in get_power_state
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     system = redfish_utils.get_system(task.node)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/redfish/utils.py", line 289, in get_system
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     return _get_system()
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/retrying.py", line 68, in wrapped_f
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     return Retrying(*dargs, **dkw).call(f, *args, **kw)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/retrying.py", line 223, in call
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     return attempt.get(self._wrap_exception)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/retrying.py", line 261, in get
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     six.reraise(self.value[0], self.value[1], self.value[2])
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/six.py", line 675, in reraise
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     raise value
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/retrying.py", line 217, in call
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/redfish/utils.py", line 266, in _get_system
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     return conn.get_system(system_id)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/sushy/main.py", line 251, in get_system
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     registries=self.lazy_registries)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/sushy/resources/system/system.py", line 151, in __init__
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     connector, identity, redfish_version, registries)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/sushy/resources/base.py", line 437, in __init__
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     self.refresh()
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/sushy/resources/base.py", line 472, in refresh
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     self._json = self._reader.get_json()
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/sushy/resources/base.py", line 339, in get_json
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     data = self._conn.get(path=self._path)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/sushy/connector.py", line 178, in get
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     **extra_session_req_kwargs)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/sushy/connector.py", line 118, in _op
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     exceptions.raise_for_response(method, url, response)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/sushy/exceptions.py", line 161, in raise_for_response
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     raise ServerSideError(method, url, response)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager sushy.exceptions.ServerSideError: HTTP GET https://192.168.123.1:8000/redfish/v1/Systems/7691867f-577d-4134-be10-87054201595f returned code 500. Base.1.0.GeneralError: Domain not found: no domain with matching uuid '7691867f-577d-4134-be10-87054201595f' (master-0-0)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager
2020-10-21T18:13:58.484195339Z 2020-10-21 18:13:58.483 1 DEBUG ironic.common.states [req-35cd7821-0765-466e-b767-5ce56b8bc603 ironic-user - - - -] Exiting old state 'verifying' in response to event 'fail' on_exit /usr/lib/python3.6/site-packages/ironic/common/states.py:295
2020-10-21T18:13:58.484498112Z 2020-10-21 18:13:58.484 1 DEBUG ironic.common.states [req-35cd7821-0765-466e-b767-5ce56b8bc603 ironic-user - - - -] Entering new state 'enroll' in response to event 'fail' on_enter /usr/lib/python3.6/site-packages/ironic/common/states.py:301

How did you power off the Host? I can't help but imagine that this is related.

In any event, the baremetal-operator does not handle this well - it decides that registration is complete (since the Node exists) even though it has failed, and proceeds to try to Adopt the node - a prerequisite for which is that registration has succeeded. It never gets past this and continues trying and failing to adopt forever.

Once we get past Node draining, the Machine will not allow itself to be deleted until it has powered the Host down. In this case Ironic cannot read the power state, and in fact the BMO is not able to even run the code to manage it anyway, so it will never complete.
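If anyone wants to see concretely what is holding these objects in Terminating, the finalizers are worth a look. A sketch only; finalizer names vary by release, so treat the output as informational:

# Finalizers left on the stuck Machine and on its BareMetalHost; as long as
# these remain, the API server keeps the objects around after deletion.
oc get machine -n openshift-machine-api ocp-edge-cluster-0-gbw9p-master-0 \
  -o jsonpath='{.metadata.finalizers}{"\n"}'
oc get bmh -n openshift-machine-api openshift-master-0-0 \
  -o jsonpath='{.metadata.finalizers}{"\n"}'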
(In reply to Zane Bitter from comment #6)
> The logs show all of the Hosts being registered into an empty Ironic right
> at the beginning. I'm inferring that the reason for this is that the metal3
> pod was running on master-0 before it was powered down, and was rescheduled
> on another node afterwards.
> [...]
> How did you power off the Host? I can't help but imagine that this is
> related.

I used virsh destroy and virsh undefine on the hypervisor, which explains the 'Domain not found: no domain with matching uuid' message, as there isn't any VM with that UUID existing anymore.
(In reply to Marius Cornea from comment #7)
> I used virsh destroy and virsh undefine on the hypervisor, which explains
> the 'Domain not found: no domain with matching uuid' message, as there isn't
> any VM with that UUID existing anymore.

Yep, that explains everything then :) I guess you used that method so that Ironic couldn't turn it back on again. But it's going to systematically cause this problem, since we'll never be able to read the power state.

Current behaviour of the Machine is that it will not allow itself to be deleted before it has verified that the Host power is off. So if a physical server goes away, then the only way to get rid of the Machine is to delete the Host corresponding to that physical server. I'd argue that this part is correct behaviour.
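As an aside, if the goal is only to simulate a power outage (rather than a server that has vanished entirely), powering the VM off without undefining it should leave the Redfish endpoint able to answer power-state queries. A sketch on the hypervisor, assuming the libvirt domain is named master-0-0:

# Hard power-off the VM but keep the libvirt domain defined, so the
# Redfish emulator can still report its power state as off.
virsh destroy master-0-0

# The domain should still be listed, just shut off.
virsh list --all | grep master-0-0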
The fixable part is fixed upstream and incorporated downstream by https://github.com/openshift/baremetal-operator/pull/113
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633