Bug 1890256 - Replacing a master node on a baremetal IPI deployment gets stuck when deleting the machine of the unhealthy member
Summary: Replacing a master node on a baremetal IPI deployment gets stuck when deleting the machine of the unhealthy member
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.6
Hardware: Unspecified
OS: Unspecified
high
medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Zane Bitter
QA Contact: Lubov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-21 18:37 UTC by Marius Cornea
Modified: 2023-12-05 12:26 UTC
7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:27:22 UTC
Target Upstream Version:
Embargoed:
lshilin: needinfo-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github metal3-io baremetal-operator issues 698 0 None closed Registration of provisioned Host is never retried 2021-02-02 09:48:54 UTC
Github metal3-io baremetal-operator pull 716 0 None closed Handle re-registration in all states 2021-02-02 09:48:54 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:27:46 UTC

Description Marius Cornea 2020-10-21 18:37:56 UTC
Description of problem:

Replacing a master node on a baremetal IPI deployment gets stuck when deleting the machine of the unhealthy member.

https://access.redhat.com/documentation/en-us/openshift_container_platform/4.5/html-single/backup_and_restore/index#restore-replace-stopped-etcd-member_replacing-unhealthy-etcd-member

Version-Release number of selected component (if applicable):
4.6.0-rc.4

How reproducible:
100%

Steps to Reproduce:

1. Deploy a baremetal IPI environment with 3 x masters + 2 x worker nodes

NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   2d8h   v1.19.0+d59ce34
master-0-1   Ready    master   2d8h   v1.19.0+d59ce34
master-0-2   Ready    master   2d8h   v1.19.0+d59ce34
worker-0-0   Ready    worker   2d7h   v1.19.0+d59ce34
worker-0-1   Ready    worker   2d7h   v1.19.0+d59ce34

NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-rc.4   True        False         2d7h    Cluster version is 4.6.0-rc.4

2. Simulate an outage of one of the master nodes by powering off master-0-0 so that it goes into NotReady state:

oc get nodes
NAME         STATUS     ROLES    AGE    VERSION
master-0-0   NotReady   master   2d8h   v1.19.0+d59ce34
master-0-1   Ready      master   2d8h   v1.19.0+d59ce34
master-0-2   Ready      master   2d8h   v1.19.0+d59ce34
worker-0-0   Ready      worker   2d7h   v1.19.0+d59ce34
worker-0-1   Ready      worker   2d7h   v1.19.0+d59ce34


3. Add a new BMH to replace master-0-0:

openshift-master-0-3-bmc-secret.yaml:

apiVersion: v1
data:
  password: cGFzc3dvcmQ=
  username: YWRtaW4=
kind: Secret
metadata:
  name: openshift-master-0-3-bmc-secret
  namespace: openshift-machine-api
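
(For reference, the data values above are base64-encoded; in this example they decode to admin / password. A minimal sketch of how such values can be generated:)

echo -n 'admin' | base64      # -> YWRtaW4=
echo -n 'password' | base64   # -> cGFzc3dvcmQ=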


openshift-master-0-3-bmh.yaml:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: openshift-master-0-3
  namespace: openshift-machine-api
spec:
  online: true
  bmc:
    address: redfish://192.168.123.1:8000/redfish/v1/Systems/347b1dc7-ca79-4ca7-a67a-9b613e3d2acd
    credentialsName: openshift-master-0-3-bmc-secret
    disableCertificateVerification: true
  bootMACAddress: 52:54:00:24:0b:ce
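
(The report does not show the apply step; presumably the two manifests were created with something along these lines:)

oc apply -f openshift-master-0-3-bmc-secret.yaml
oc apply -f openshift-master-0-3-bmh.yaml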

oc -n openshift-machine-api get bmh openshift-master-0-3
NAME                   STATUS   PROVISIONING STATUS   CONSUMER   BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-master-0-3   OK       ready                            redfish://192.168.123.1:8000/redfish/v1/Systems/347b1dc7-ca79-4ca7-a67a-9b613e3d2acd   unknown            true     


4. Follow the procedure for Replacing an unhealthy etcd member whose machine is not running or whose node is not ready from https://access.redhat.com/documentation/en-us/openshift_container_platform/4.5/html-single/backup_and_restore/index#restore-replace-stopped-etcd-member_replacing-unhealthy-etcd-member

5. Check which member is unhealthy

oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="EtcdMembersAvailable")]}{.message}{"\n"}'

2 of 3 members are available, master-0-0 is unhealthy

6. Determine if the machine is not running:

oc get machines -A -ojsonpath='{range .items[*]}{@.status.nodeRef.name}{"\t"}{@.status.providerStatus.instanceState}{"\n"}' | grep -v running

master-0-0	
master-0-1	
master-0-2	
worker-0-0	
worker-0-1	

7. Determine if the node is not ready.

oc get nodes -o jsonpath='{range .items[*]}{"\n"}{.metadata.name}{"\t"}{range .spec.taints[*]}{.key}{" "}' | grep unreachable

master-0-0	node-role.kubernetes.io/master node.kubernetes.io/unreachable node.kubernetes.io/unreachable  

8. Connect to a running etcd container
oc -n openshift-etcd rsh etcd-master-0-1

9. View the member list:

sh-4.4#  etcdctl member list -w table
+------------------+---------+------------+------------------------------+------------------------------+------------+
|        ID        | STATUS  |    NAME    |          PEER ADDRS          |         CLIENT ADDRS         | IS LEARNER |
+------------------+---------+------------+------------------------------+------------------------------+------------+
| 1d82590304f84714 | started | master-0-0 | https://192.168.123.127:2380 | https://192.168.123.127:2379 |      false |
| 4d97af960967d221 | started | master-0-2 | https://192.168.123.115:2380 | https://192.168.123.115:2379 |      false |
| b9f1bfa54f56d9ba | started | master-0-1 | https://192.168.123.129:2380 | https://192.168.123.129:2379 |      false |
+------------------+---------+------------+------------------------------+------------------------------+------------+

10. Remove the unhealthy etcd member:

sh-4.4# etcdctl member remove 1d82590304f84714
Member 1d82590304f84714 removed from cluster 4acece37e8b7a4db

11. View the member list again:

sh-4.4# etcdctl member list -w table
+------------------+---------+------------+------------------------------+------------------------------+------------+
|        ID        | STATUS  |    NAME    |          PEER ADDRS          |         CLIENT ADDRS         | IS LEARNER |
+------------------+---------+------------+------------------------------+------------------------------+------------+
| 4d97af960967d221 | started | master-0-2 | https://192.168.123.115:2380 | https://192.168.123.115:2379 |      false |
| b9f1bfa54f56d9ba | started | master-0-1 | https://192.168.123.129:2380 | https://192.168.123.129:2379 |      false |
+------------------+---------+------------+------------------------------+------------------------------+------------+

12. Remove old secrets:

oc delete secret -n openshift-etcd etcd-peer-master-0-0
oc delete secret -n openshift-etcd etcd-serving-master-0-0
oc delete secret -n openshift-etcd etcd-serving-metrics-master-0-0
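
(Equivalently, as a sketch, the three deletions can be collapsed into a loop:)

for s in etcd-peer etcd-serving etcd-serving-metrics; do
  oc -n openshift-etcd delete secret ${s}-master-0-0
done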

13. Obtain the machine for the unhealthy member

oc get machines -n openshift-machine-api -o wide
NAME                                      PHASE     TYPE   REGION   ZONE   AGE    NODE         PROVIDERID                                                    STATE
ocp-edge-cluster-0-gbw9p-master-0         Running                          2d8h   master-0-0   baremetalhost:///openshift-machine-api/openshift-master-0-0   
ocp-edge-cluster-0-gbw9p-master-1         Running                          2d8h   master-0-1   baremetalhost:///openshift-machine-api/openshift-master-0-1   
ocp-edge-cluster-0-gbw9p-master-2         Running                          2d8h   master-0-2   baremetalhost:///openshift-machine-api/openshift-master-0-2   
ocp-edge-cluster-0-gbw9p-worker-0-6x9r5   Running                          2d8h   worker-0-0   baremetalhost:///openshift-machine-api/openshift-worker-0-0   
ocp-edge-cluster-0-gbw9p-worker-0-s8tjg   Running                          2d8h   worker-0-1   baremetalhost:///openshift-machine-api/openshift-worker-0-1   

14. Delete the machine of the unhealthy member:

oc delete machine -n openshift-machine-api ocp-edge-cluster-0-gbw9p-master-0
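
(Note: by default oc delete waits until the object is actually gone, which is why the prompt hangs below; passing --wait=false would return immediately, although the Machine would still be stuck in the Deleting phase:)

oc delete machine -n openshift-machine-api ocp-edge-cluster-0-gbw9p-master-0 --wait=false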

Actual results:

The oc delete prompt hangs after printing:

machine.machine.openshift.io "ocp-edge-cluster-0-gbw9p-master-0" deleted


Expected results:
The Machine is deleted and the delete command returns.

Additional info:

The Machine appears to be stuck in the Deleting phase:

oc get machines/ocp-edge-cluster-0-gbw9p-master-0 -n openshift-machine-api -o wide
NAME                                PHASE      TYPE   REGION   ZONE   AGE    NODE         PROVIDERID                                                    STATE
ocp-edge-cluster-0-gbw9p-master-0   Deleting                          2d8h   master-0-0   baremetalhost:///openshift-machine-api/openshift-master-0-0   
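
(A minimal way to confirm that the deletion is pending on a finalizer rather than on the API call itself, assuming the same machine name:)

oc -n openshift-machine-api get machine ocp-edge-cluster-0-gbw9p-master-0 \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'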


Looking at the machine-controller logs I can see that drain failed for machine "ocp-edge-cluster-0-gbw9p-master-0":

I1021 18:32:51.812051       1 controller.go:543] evicting pod openshift-etcd/etcd-quorum-guard-644f5747b8-hkzvr
I1021 18:32:51.812189       1 controller.go:543] evicting pod openshift-etcd/installer-4-master-0-0
E1021 18:32:51.819149       1 controller.go:543] error when evicting pod "etcd-quorum-guard-644f5747b8-hkzvr" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I1021 18:32:56.819943       1 controller.go:543] evicting pod openshift-etcd/etcd-quorum-guard-644f5747b8-hkzvr
E1021 18:32:56.835685       1 controller.go:543] error when evicting pod "etcd-quorum-guard-644f5747b8-hkzvr" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I1021 18:33:01.836776       1 controller.go:543] evicting pod openshift-etcd/etcd-quorum-guard-644f5747b8-hkzvr
E1021 18:33:01.843563       1 controller.go:543] error when evicting pod "etcd-quorum-guard-644f5747b8-hkzvr" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I1021 18:33:06.844114       1 controller.go:543] evicting pod openshift-etcd/etcd-quorum-guard-644f5747b8-hkzvr
E1021 18:33:06.851959       1 controller.go:543] error when evicting pod "etcd-quorum-guard-644f5747b8-hkzvr" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I1021 18:33:11.852553       1 controller.go:543] evicting pod openshift-etcd/etcd-quorum-guard-644f5747b8-hkzvr
W1021 18:33:11.852747       1 controller.go:380] drain failed for machine "ocp-edge-cluster-0-gbw9p-master-0": [error when waiting for pod "installer-4-master-0-0" terminating: global timeout reached: 20s, error when evicting pod "etcd-quorum-guard-644f5747b8-hkzvr": global timeout reached: 20s]
E1021 18:33:11.852836       1 controller.go:218] ocp-edge-cluster-0-gbw9p-master-0: failed to drain node for machine: requeue in: 20s
I1021 18:33:11.852869       1 controller.go:406] Actuator returned requeue-after error: requeue in: 20s
I1021 18:33:11.853089       1 controller.go:170] ocp-edge-cluster-0-gbw9p-master-0: reconciling Machine
I1021 18:33:11.853156       1 controller.go:210] ocp-edge-cluster-0-gbw9p-master-0: reconciling machine triggers delete
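
(The repeated "Cannot evict pod as it would violate the pod's disruption budget" errors come from the PodDisruptionBudget protecting etcd-quorum-guard, which cannot be satisfied while only 2 of 3 masters are up. It can be inspected with something like the following; the exact PDB name may vary by release:)

oc -n openshift-etcd get pdb
# then describe the budget that selects the etcd-quorum-guard pods, e.g.:
oc -n openshift-etcd describe pdb etcd-quorum-guard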


Attaching must-gather

Comment 2 Marius Cornea 2020-10-21 18:58:12 UTC
Even after annotating the machine with:

oc -n openshift-machine-api annotate machine ocp-edge-cluster-0-gbw9p-master-0 machine.openshift.io/exclude-node-draining=

deleting the machine still fails with:

2020/10/21 18:57:03 Deleting machine ocp-edge-cluster-0-gbw9p-master-0 .
2020/10/21 18:57:03 deleting machine ocp-edge-cluster-0-gbw9p-master-0 using host openshift-master-0-0
2020/10/21 18:57:03 waiting for host openshift-master-0-0 to be deprovisioned
E1021 18:57:03.945743       1 controller.go:231] ocp-edge-cluster-0-gbw9p-master-0: failed to delete machine: requeue in: 30s
I1021 18:57:03.945759       1 controller.go:406] Actuator returned requeue-after error: requeue in: 30s

Comment 3 Marius Cornea 2020-10-21 19:01:55 UTC
In the baremetal-operator log we can see:

{"level":"info","ts":1603306563.9395769,"logger":"baremetalhost","msg":"Reconciling BareMetalHost","Request.Namespace":"openshift-machine-api","Request.Name":"openshift-master-0-0"}
{"level":"info","ts":1603306563.9396687,"logger":"baremetalhost_ironic","msg":"verifying ironic provisioner dependencies","host":"openshift-master-0-0"}
{"level":"info","ts":1603306564.0244021,"logger":"baremetalhost_ironic","msg":"found existing node by ID","host":"openshift-master-0-0"}
{"level":"error","ts":1603306564.0245044,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"metal3-baremetalhost-controller","request":"openshift-machine-api/openshift-master-0-0","error":"action \"externally provisioned\" failed: Invalid state for adopt: enroll","errorVerbose":"Invalid state for adopt: enroll\naction \"externally provisioned\" failed\ngithub.com/metal3-io/baremetal-operator/pkg/controller/baremetalhost.(*ReconcileBareMetalHost).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/controller/baremetalhost/baremetalhost_controller.go:307\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1374","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:258\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

Comment 4 Marius Cornea 2020-10-21 19:07:53 UTC
After manually deleting the openshift-master-0-0 BMH the machine was finally deleted:

oc -n openshift-machine-api delete bmh openshift-master-0-0

Comment 5 Marius Cornea 2020-10-22 07:05:50 UTC
Removing TestBlocker since I was able to move forward after annotating the machine with exclude-node-draining and removing the BMH.

Comment 6 Zane Bitter 2020-10-27 19:38:25 UTC
The logs show all of the Hosts being registered into an empty Ironic right at the beginning. I'm inferring that the reason for this is that the metal3 pod was running on master-0 before it was powered down, and was rescheduled on another node afterwards.
The root cause of the problem is that Ironic is not able to connect to the BMC of the master-0 machine right from the beginning:

2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager [req-35cd7821-0765-466e-b767-5ce56b8bc603 ironic-user - - - -] Failed to get power state for node 1c121c00-74dc-4b72-b8c5-56a14bd06bae. Error: HTTP GET https://192.168.123.1:8000/redfish/v1/Systems/7691867f-577d-4134-be10-87054201595f returned code 500. Base.1.0.GeneralError: Domain not found: no domain with matching uuid '7691867f-577d-4134-be10-87054201595f' (master-0-0): sushy.exceptions.ServerSideError: HTTP GET https://192.168.123.1:8000/redfish/v1/Systems/7691867f-577d-4134-be10-87054201595f returned code 500. Base.1.0.GeneralError: Domain not found: no domain with matching uuid '7691867f-577d-4134-be10-87054201595f' (master-0-0)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager Traceback (most recent call last):
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/ironic/conductor/manager.py", line 1240, in _do_node_verify
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     power_state = task.driver.power.get_power_state(task)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/redfish/power.py", line 112, in get_power_state
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     system = redfish_utils.get_system(task.node)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/redfish/utils.py", line 289, in get_system
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     return _get_system()
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/retrying.py", line 68, in wrapped_f
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     return Retrying(*dargs, **dkw).call(f, *args, **kw)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/retrying.py", line 223, in call
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     return attempt.get(self._wrap_exception)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/retrying.py", line 261, in get
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     six.reraise(self.value[0], self.value[1], self.value[2])
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/six.py", line 675, in reraise
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     raise value
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/retrying.py", line 217, in call
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/redfish/utils.py", line 266, in _get_system
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     return conn.get_system(system_id)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/sushy/main.py", line 251, in get_system
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     registries=self.lazy_registries)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/sushy/resources/system/system.py", line 151, in __init__
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     connector, identity, redfish_version, registries)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/sushy/resources/base.py", line 437, in __init__
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     self.refresh()
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/sushy/resources/base.py", line 472, in refresh
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     self._json = self._reader.get_json()
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/sushy/resources/base.py", line 339, in get_json
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     data = self._conn.get(path=self._path)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/sushy/connector.py", line 178, in get
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     **extra_session_req_kwargs)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/sushy/connector.py", line 118, in _op
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     exceptions.raise_for_response(method, url, response)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/sushy/exceptions.py", line 161, in raise_for_response
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager     raise ServerSideError(method, url, response)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager sushy.exceptions.ServerSideError: HTTP GET https://192.168.123.1:8000/redfish/v1/Systems/7691867f-577d-4134-be10-87054201595f returned code 500. Base.1.0.GeneralError: Domain not found: no domain with matching uuid '7691867f-577d-4134-be10-87054201595f' (master-0-0)
2020-10-21T18:13:58.483705606Z 2020-10-21 18:13:58.471 1 ERROR ironic.conductor.manager 
2020-10-21T18:13:58.484195339Z 2020-10-21 18:13:58.483 1 DEBUG ironic.common.states [req-35cd7821-0765-466e-b767-5ce56b8bc603 ironic-user - - - -] Exiting old state 'verifying' in response to event 'fail' on_exit /usr/lib/python3.6/site-packages/ironic/common/states.py:295
2020-10-21T18:13:58.484498112Z 2020-10-21 18:13:58.484 1 DEBUG ironic.common.states [req-35cd7821-0765-466e-b767-5ce56b8bc603 ironic-user - - - -] Entering new state 'enroll' in response to event 'fail' on_enter /usr/lib/python3.6/site-packages/ironic/common/states.py:301


How did you power off the Host? I can't help but imagine that this is related.

In any event, the baremetal-operator does not handle this well - it decides that registration is complete (since the Ironic node exists) even though it has actually failed, and proceeds to try to adopt the node - a prerequisite for which is successful registration. It never gets past this point and keeps trying and failing to adopt forever.

Once we get past node draining, the Machine will not allow itself to be deleted until it has powered the Host down. In this case Ironic cannot read the power state, and in fact the BMO never even gets to run the power-management code anyway, so the deletion will never complete.
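
(In this situation, the power state the provisioner has last been able to establish is reflected on the BareMetalHost itself and can be read with, e.g.:)

oc -n openshift-machine-api get bmh openshift-master-0-0 -o jsonpath='{.status.poweredOn}{"\n"}'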

Comment 7 Marius Cornea 2020-10-27 19:55:58 UTC
(In reply to Zane Bitter from comment #6)
> How did you power off the Host? I can't help but imagine that this is
> related.
> 

I used virsh destroy and virsh undefine on the hypervisor, which explains the 'Domain not found: no domain with matching uuid' message, since no VM with that UUID exists anymore.
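
(For reference, the sequence would have been roughly the following, run on the virtualization host; the domain name is an assumption:)

virsh destroy master-0-0     # hard power-off of the VM
virsh undefine master-0-0    # remove the domain definition, so the Redfish endpoint no longer knows the UUID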

Comment 8 Zane Bitter 2020-10-27 20:08:43 UTC
(In reply to Marius Cornea from comment #7)
> I used virsh destroy and virsh undefine on the hypervisor which explains the
> 'Domain not found: no domain with matching uuid ' message as there isn't any
> VM with that UUID existing anymore.

Yep, that explains everything then :)

I guess you used that method so that Ironic couldn't turn it back on again. But it's going to systematically cause this problem, since we'll never be able to read the power state.

Current behaviour of the Machine is that it will not allow itself to be deleted before it has verified that the Host power is off. So if a physical server goes away then the only way to get rid of the Machine is to delete the Host corresponding to that physical server. I'd argue that this part is correct behaviour.

Comment 9 Zane Bitter 2021-01-28 15:48:34 UTC
The fixable part is fixed upstream and incorporated downstream by https://github.com/openshift/baremetal-operator/pull/113

Comment 15 errata-xmlrpc 2021-02-24 15:27:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

