Bug 1846486

Summary: [OCP 4.5][Machine Health Check] Remediation Strategy: 'external-baremetal' Node Deleted and never returns when remediation is repeated on same node 4 times
Product: OpenShift Container Platform
Reporter: gharden
Component: Node Maintenance Operator
Assignee: Nir <nyehia>
Status: CLOSED DUPLICATE
QA Contact: gharden
Severity: medium
Priority: medium
Docs Contact:
Version: 4.5
CC: abeekhof, aos-bugs, gharden, mlammon, msluiter, nyehia
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-07-15 13:48:50 UTC
Type: Bug
Attachments:
automation_script_log_attempt_1_success_log (flags: none)

Description gharden 2020-06-11 17:05:07 UTC
Created attachment 1696876 [details]
automation_script_log_attempt_1_success_log

Description of problem:

Remediation strategy annotation: 'machine.openshift.io/remediation-strategy': 'external-baremetal' 

Unhealthy Condition: 'status': 'Unknown'

If unhealthy condition is created on the same node 4 times, on the 4th run the node gets deleted and never returns.
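
For reference, a minimal way to confirm the annotation and to watch the node's Ready condition is sketched below (plain oc commands; the object and node names match the reproduction steps further down and are assumptions in that sense):

# Hedged sketch; "ready-example" and "worker-0-0" are the names used in the steps below.
oc -n openshift-machine-api get machinehealthcheck ready-example -o yaml | grep -A2 annotations
oc get node worker-0-0 -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'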


Version-Release number of selected component (if applicable):
oc must-gather: https://drive.google.com/open?id=1mBBlbvsQPXesb7kr6CZIPJy9LH19f2-g


How reproducible:
100%


Steps to Reproduce:
1. cat > mHc_Ready.yaml << EOF
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: ready-example
  namespace: openshift-machine-api
  annotations:
    machine.openshift.io/remediation-strategy: external-baremetal
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 60s
EOF

oc create -f mHc_Ready.yaml

2. virsh suspend worker-0-0

3. oc get nodes

4. Repeat steps 1-3 four times. On the fourth run the node goes from NotReady to NodeNotFound (it no longer appears in oc get nodes) and never comes back. (A loop that automates these repetitions is sketched after this list.)
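
A minimal shell sketch of the repetition loop follows. The sleep durations and the deletion of the MachineHealthCheck between rounds are assumptions (the original steps only say to repeat steps 1-3); the object and domain names come from the steps above.

# Hedged sketch of repeating steps 1-3 four times.
for i in 1 2 3 4; do
  oc create -f mHc_Ready.yaml || true          # AlreadyExists on later rounds is harmless
  virsh suspend worker-0-0
  sleep 120                                    # assumption: time for Ready to go Unknown and MHC to remediate
  oc get nodes
  oc -n openshift-machine-api delete machinehealthcheck ready-example --ignore-not-found  # assumption
  sleep 300                                    # assumption: time for the power-cycled node to rejoin
done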


Actual results:
Please see the logs and the oc must-gather for more details.
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491
worker-0-1   Ready    worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# virsh suspend worker-0-0
[root@sealusa8 ~]# oc get nodes
NAME         STATUS      ROLES    AGE    VERSION
master-0-0   Ready       master   143m   v1.18.3+a637491
master-0-1   Ready       master   137m   v1.18.3+a637491
master-0-2   Ready       master   143m   v1.18.3+a637491
worker-0-0   NotReady    worker   102m   v1.18.3+a637491
worker-0-1   Ready       worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491
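
When the node disappears from oc get nodes, the following (hedged) checks show whether the backing Machine and BareMetalHost objects still exist and what state they are in; the exact output is not captured here, and the resource names are as on a typical baremetal IPI cluster:

# Hedged sketch of follow-up checks after the node goes missing.
oc -n openshift-machine-api get machines
oc -n openshift-machine-api get baremetalhosts
oc -n openshift-machine-api get machinehealthcheck ready-example -o yaml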


Expected results:
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491
worker-0-1   Ready    worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# virsh suspend worker-0-0
[root@sealusa8 ~]# oc get nodes
NAME         STATUS      ROLES    AGE    VERSION
master-0-0   Ready       master   143m   v1.18.3+a637491
master-0-1   Ready       master   137m   v1.18.3+a637491
master-0-2   Ready       master   143m   v1.18.3+a637491
worker-0-0   NotReady    worker   102m   v1.18.3+a637491
worker-0-1   Ready       worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491
[root@sealusa8 ~]# oc get nodes
NAME         STATUS      ROLES    AGE    VERSION
master-0-0   Ready       master   143m   v1.18.3+a637491
master-0-1   Ready       master   137m   v1.18.3+a637491
master-0-2   Ready       master   143m   v1.18.3+a637491
worker-0-0   NotReady    worker   102m   v1.18.3+a637491
worker-0-1   Ready       worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491
worker-0-1   Ready    worker   101m   v1.18.3+a637491

Additional info:
Must Gather: https://drive.google.com/open?id=1mBBlbvsQPXesb7kr6CZIPJy9LH19f2-g

Script Log Run 1: https://drive.google.com/open?id=15obrS0ox5FPxMQn0Uwu7jpixAQ9NZVTH

Script Log Run 2: https://drive.google.com/open?id=1fCPNzNkJxcLmNdvXjjkU8u6_dXEJwIBc

Script Log Run 3: https://drive.google.com/open?id=1pBeBt169YEypI8eXxVZkwiRMK-x9CQbI

Script Log Run 4: https://drive.google.com/open?id=1Xxd6woFJTA3iu6pKxBUCIclIUPnq3-1j

Comment 2 Nir 2020-06-29 07:58:50 UTC
Here are my findings:
BMO successfully power-cycled the host and the machine came back up, but the node could not register itself.
I SSHed into the node and saw that nodeip-configuration.service has a problem:

Jun 25 14:08:07 worker-2.ostest.test.metalkube.org podman[1424]: Error: error creating container storage: layer not known
Jun 25 14:08:07 worker-2.ostest.test.metalkube.org systemd[1]: nodeip-configuration.service: Main process exited, code=exited, status=125/n/a
Jun 25 14:08:07 worker-2.ostest.test.metalkube.org systemd[1]: nodeip-configuration.service: Failed with result 'exit-code'.
Jun 25 14:08:07 worker-2.ostest.test.metalkube.org systemd[1]: Failed to start Writes IP address configuration so that kubelet and crio services select a valid node IP.
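
For anyone retracing this, a sketch of how the failing unit can be inspected from the node (the exact commands are assumptions; the comment only states the node was reached over SSH, and the host name is taken from the journal lines above):

# Hedged sketch; "core" is the default RHCOS user, host name as in the log above.
ssh core@worker-2.ostest.test.metalkube.org
sudo systemctl status nodeip-configuration.service
sudo journalctl -u nodeip-configuration.service --no-pager | tail -n 50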

Comment 3 Nir 2020-06-29 07:59:21 UTC
podman version 1.6.4

Comment 4 Nir 2020-07-15 13:48:50 UTC

*** This bug has been marked as a duplicate of bug 1857224 ***