Bug 1846486

Summary: [OCP 4.5][Machine Health Check] Remediation Strategy: 'external-baremetal' Node Deleted and never returns when remediation is repeated on same node 4 times
Product: OpenShift Container Platform
Reporter: gharden
Component: Node Maintenance Operator
Assignee: Nir <nyehia>
Status: CLOSED DUPLICATE
QA Contact: gharden
Severity: medium
Priority: medium
Docs Contact:
Version: 4.5
CC: abeekhof, aos-bugs, gharden, mlammon, msluiter, nyehia
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-07-15 13:48:50 UTC
Type: Bug
Attachments:
automation_script_log_attempt_1_success_log (flags: none)

Description gharden 2020-06-11 17:05:07 UTC
Created attachment 1696876 [details]
automation_script_log_attempt_1_success_log

Description of problem:

Remediation strategy annotation: 'machine.openshift.io/remediation-strategy': 'external-baremetal' 

Unhealthy Condition: 'status': 'Unknown'

If unhealthy condition is created on the same node 4 times, on the 4th run the node gets deleted and never returns.
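
For reference, a minimal way to confirm the annotation and to watch the node's Ready condition is sketched below (plain oc commands; the object and node names match the reproduction steps further down and are assumptions in that sense):

# Hedged sketch; "ready-example" and "worker-0-0" are the names used in the steps below.
oc -n openshift-machine-api get machinehealthcheck ready-example -o yaml | grep -A2 annotations
oc get node worker-0-0 -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'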


Version-Release number of selected component (if applicable):
oc must-gather: https://drive.google.com/open?id=1mBBlbvsQPXesb7kr6CZIPJy9LH19f2-g


How reproducible:
100%


Steps to Reproduce:
1. cat > mHc_Ready.yaml << EOF
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: ready-example
  namespace: openshift-machine-api
  annotations:
    machine.openshift.io/remediation-strategy: external-baremetal
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 60s
EOF

oc create -f mHc_Ready.yaml

2. virsh suspend worker-0-0

3. oc get nodes

4. Repeat steps 1-3 four times. On the fourth run the node goes from NotReady to NodeNotFound (it no longer appears in oc get nodes) and never comes back. (A loop that automates these repetitions is sketched after this list.)
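
A minimal shell sketch of the repetition loop follows. The sleep durations and the deletion of the MachineHealthCheck between rounds are assumptions (the original steps only say to repeat steps 1-3); the object and domain names come from the steps above.

# Hedged sketch of repeating steps 1-3 four times.
for i in 1 2 3 4; do
  oc create -f mHc_Ready.yaml || true          # AlreadyExists on later rounds is harmless
  virsh suspend worker-0-0
  sleep 120                                    # assumption: time for Ready to go Unknown and MHC to remediate
  oc get nodes
  oc -n openshift-machine-api delete machinehealthcheck ready-example --ignore-not-found  # assumption
  sleep 300                                    # assumption: time for the power-cycled node to rejoin
done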


Actual results:
Please see the logs and the oc must-gather for more details.
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491
worker-0-1   Ready    worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# virsh suspend worker-0-0
[root@sealusa8 ~]# oc get nodes
NAME         STATUS      ROLES    AGE    VERSION
master-0-0   Ready       master   143m   v1.18.3+a637491
master-0-1   Ready       master   137m   v1.18.3+a637491
master-0-2   Ready       master   143m   v1.18.3+a637491
worker-0-0   NotReady    worker   102m   v1.18.3+a637491
worker-0-1   Ready       worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491
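
When the node disappears from oc get nodes, the following (hedged) checks show whether the backing Machine and BareMetalHost objects still exist and what state they are in; the exact output is not captured here, and the resource names are as on a typical baremetal IPI cluster:

# Hedged sketch of follow-up checks after the node goes missing.
oc -n openshift-machine-api get machines
oc -n openshift-machine-api get baremetalhosts
oc -n openshift-machine-api get machinehealthcheck ready-example -o yaml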


Expected results:
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491
worker-0-1   Ready    worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# virsh suspend worker-0-0
[root@sealusa8 ~]# oc get nodes
NAME         STATUS      ROLES    AGE    VERSION
master-0-0   Ready       master   143m   v1.18.3+a637491
master-0-1   Ready       master   137m   v1.18.3+a637491
master-0-2   Ready       master   143m   v1.18.3+a637491
worker-0-0   NotReady    worker   102m   v1.18.3+a637491
worker-0-1   Ready       worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491
[root@sealusa8 ~]# oc get nodes
NAME         STATUS      ROLES    AGE    VERSION
master-0-0   Ready       master   143m   v1.18.3+a637491
master-0-1   Ready       master   137m   v1.18.3+a637491
master-0-2   Ready       master   143m   v1.18.3+a637491
worker-0-0   NotReady    worker   102m   v1.18.3+a637491
worker-0-1   Ready       worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491
worker-0-1   Ready    worker   101m   v1.18.3+a637491

Additional info:
Must Gather: https://drive.google.com/open?id=1mBBlbvsQPXesb7kr6CZIPJy9LH19f2-g

Script Log Run 1: https://drive.google.com/open?id=15obrS0ox5FPxMQn0Uwu7jpixAQ9NZVTH

Script Log Run 2: https://drive.google.com/open?id=1fCPNzNkJxcLmNdvXjjkU8u6_dXEJwIBc

Script Log Run 3: https://drive.google.com/open?id=1pBeBt169YEypI8eXxVZkwiRMK-x9CQbI

Script Log Run 4: https://drive.google.com/open?id=1Xxd6woFJTA3iu6pKxBUCIclIUPnq3-1j

Comment 2 Nir 2020-06-29 07:58:50 UTC
Here are my findings:
BMO successfully power-cycled the host and the machine came back up, but the node could not register itself.
I SSHed into the node and saw that nodeip-configuration.service has a problem:

Jun 25 14:08:07 worker-2.ostest.test.metalkube.org podman[1424]: Error: error creating container storage: layer not known
Jun 25 14:08:07 worker-2.ostest.test.metalkube.org systemd[1]: nodeip-configuration.service: Main process exited, code=exited, status=125/n/a
Jun 25 14:08:07 worker-2.ostest.test.metalkube.org systemd[1]: nodeip-configuration.service: Failed with result 'exit-code'.
Jun 25 14:08:07 worker-2.ostest.test.metalkube.org systemd[1]: Failed to start Writes IP address configuration so that kubelet and crio services select a valid node IP.
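
For anyone retracing this, a sketch of how the failing unit can be inspected from the node (the exact commands are assumptions; the comment only states the node was reached over SSH, and the host name is taken from the journal lines above):

# Hedged sketch; "core" is the default RHCOS user, host name as in the log above.
ssh core@worker-2.ostest.test.metalkube.org
sudo systemctl status nodeip-configuration.service
sudo journalctl -u nodeip-configuration.service --no-pager | tail -n 50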

Comment 3 Nir 2020-06-29 07:59:21 UTC
podman version 1.6.4

Comment 4 Nir 2020-07-15 13:48:50 UTC

*** This bug has been marked as a duplicate of bug 1857224 ***