Bug 1846486 - [OCP 4.5][Machine Health Check] Remediation Strategy: 'external-baremetal' Node Deleted and never returns when remediation is repeated on same node 4 times
Summary: [OCP 4.5][Machine Health Check] Remediation Strategy: 'external-baremetal' Node Deleted and never returns when remediation is repeated on same node 4 times
Keywords:
Status: CLOSED DUPLICATE of bug 1857224
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node Maintenance Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Nir
QA Contact: gharden
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-11 17:05 UTC by gharden
Modified: 2021-02-06 07:06 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-15 13:48:50 UTC
Target Upstream Version:
Embargoed:


Attachments
automation_script_log_attempt_1_success_log (51.32 KB, text/plain)
2020-06-11 17:05 UTC, gharden

Description gharden 2020-06-11 17:05:07 UTC
Created attachment 1696876
automation_script_log_attempt_1_success_log

Description of problem:

Remediation strategy annotation: 'machine.openshift.io/remediation-strategy': 'external-baremetal' 

Unhealthy Condition: 'status': 'Unknown'

If the unhealthy condition is triggered on the same node four times, the node is deleted on the fourth remediation and never returns.


Version-Release number of selected component (if applicable):
oc must gather: https://drive.google.com/open?id=1mBBlbvsQPXesb7kr6CZIPJy9LH19f2-g


How reproducible:
100%


Steps to Reproduce:
1. cat > mHc_Ready.yaml << EOF
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: ready-example
  namespace: openshift-machine-api
  annotations:
    machine.openshift.io/remediation-strategy: external-baremetal
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 60s
EOF

oc create -f mHc_Ready.yaml

2. virsh suspend worker-0-0

3. oc get nodes

4. Repeat steps 1-3 four times. On the fourth run the node goes from NotReady to not found at all (it no longer appears in oc get nodes) and never returns. A minimal automation sketch of this loop follows these steps.
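
For reference, a minimal shell sketch of the loop described above. The worker name (worker-0-0), the wait times, and the final check are assumptions based on this reproduction and may need adjusting for other environments:

#!/bin/bash
# Hypothetical reproduction loop: suspend the worker, give the MachineHealthCheck
# time to notice the Unknown condition and power-cycle the host, then check
# whether the node rejoins. Assumes the MHC from step 1 already exists and that
# virsh and oc are available on the PATH.
NODE=worker-0-0
for run in 1 2 3 4; do
  echo "=== Remediation attempt ${run} ==="
  virsh suspend "${NODE}"
  # The MHC unhealthyCondition timeout is 60s; wait a bit longer before checking.
  sleep 120
  oc get nodes
  # Wait for the node to (hopefully) rejoin before the next attempt.
  sleep 600
  oc get nodes | grep "${NODE}" || echo "node ${NODE} not found after attempt ${run}"
done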


Actual results:
Please see logs and oc must gather for more details.
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491
worker-0-1   Ready    worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# virsh suspend worker-0-0
[root@sealusa8 ~]# oc get nodes
NAME         STATUS      ROLES    AGE    VERSION
master-0-0   Ready       master   143m   v1.18.3+a637491
master-0-1   Ready       master   137m   v1.18.3+a637491
master-0-2   Ready       master   143m   v1.18.3+a637491
worker-0-0   NotReady    worker   102m   v1.18.3+a637491
worker-0-1   Ready       worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491


Expected results:
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491
worker-0-1   Ready    worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# virsh suspend worker-0-0
[root@sealusa8 ~]# oc get nodes
NAME         STATUS      ROLES    AGE    VERSION
master-0-0   Ready       master   143m   v1.18.3+a637491
master-0-1   Ready       master   137m   v1.18.3+a637491
master-0-2   Ready       master   143m   v1.18.3+a637491
worker-0-0   NotReady    worker   102m   v1.18.3+a637491
worker-0-1   Ready       worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491
[root@sealusa8 ~]# oc get nodes
NAME         STATUS      ROLES    AGE    VERSION
master-0-0   Ready       master   143m   v1.18.3+a637491
master-0-1   Ready       master   137m   v1.18.3+a637491
master-0-2   Ready       master   143m   v1.18.3+a637491
worker-0-0   NotReady    worker   102m   v1.18.3+a637491
worker-0-1   Ready       worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491
worker-0-1   Ready    worker   101m   v1.18.3+a637491

Additional info:
Must Gather: https://drive.google.com/open?id=1mBBlbvsQPXesb7kr6CZIPJy9LH19f2-g

Script Log Run 1: https://drive.google.com/open?id=15obrS0ox5FPxMQn0Uwu7jpixAQ9NZVTH

Script Log Run 2: https://drive.google.com/open?id=1fCPNzNkJxcLmNdvXjjkU8u6_dXEJwIBc

Script Log Run 3: https://drive.google.com/open?id=1pBeBt169YEypI8eXxVZkwiRMK-x9CQbI

Script Log Run 4: https://drive.google.com/open?id=1Xxd6woFJTA3iu6pKxBUCIclIUPnq3-1j

Comment 2 Nir 2020-06-29 07:58:50 UTC
Here are my findings:
BMO successfully power-cycled the host and the machine came up, but the node could not register itself.
I SSHed into the node and saw that nodeip-configuration.service has a problem:

Jun 25 14:08:07 worker-2.ostest.test.metalkube.org podman[1424]: Error: error creating container storage: layer not known
Jun 25 14:08:07 worker-2.ostest.test.metalkube.org systemd[1]: nodeip-configuration.service: Main process exited, code=exited, status=125/n/a
Jun 25 14:08:07 worker-2.ostest.test.metalkube.org systemd[1]: nodeip-configuration.service: Failed with result 'exit-code'.
Jun 25 14:08:07 worker-2.ostest.test.metalkube.org systemd[1]: Failed to start Writes IP address configuration so that kubelet and crio services select a valid node IP.
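
For anyone reproducing this, a quick sketch of how the failing unit can be inspected on the node. The hostname is taken from the log above and SSH access as the core user is an assumption:

# Hypothetical inspection steps; hostname and user are assumptions.
ssh core@worker-2.ostest.test.metalkube.org
systemctl status nodeip-configuration.service
journalctl -u nodeip-configuration.service --no-pager | tail -n 50
podman version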

Comment 3 Nir 2020-06-29 07:59:21 UTC
podman version 1.6.4

Comment 4 Nir 2020-07-15 13:48:50 UTC

*** This bug has been marked as a duplicate of bug 1857224 ***

