Bug 1846486 - [OCP 4.5][Machine Health Check] Remediation Strategy: 'external-baremetal' Node Deleted and never returns when remediation is repeated on same node 4 times
Summary: [OCP 4.5][Machine Health Check] Remediation Strategy: 'external-baremetal' Node Deleted and never returns when remediation is repeated on same node 4 times
Keywords:
Status: CLOSED DUPLICATE of bug 1857224
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node Maintenance Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Nir
QA Contact: gharden
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-11 17:05 UTC by gharden
Modified: 2021-02-06 07:06 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-15 13:48:50 UTC
Target Upstream Version:
Embargoed:


Attachments
automation_script_log_attempt_1_success_log (51.32 KB, text/plain)
2020-06-11 17:05 UTC, gharden

Description gharden 2020-06-11 17:05:07 UTC
Created attachment 1696876
automation_script_log_attempt_1_success_log

Description of problem:

Remediation strategy annotation: 'machine.openshift.io/remediation-strategy': 'external-baremetal' 

Unhealthy Condition: 'status': 'Unknown'

If the unhealthy condition is triggered on the same node four times, the node is deleted on the fourth remediation and never returns.


Version-Release number of selected component (if applicable):
oc must gather: https://drive.google.com/open?id=1mBBlbvsQPXesb7kr6CZIPJy9LH19f2-g


How reproducible:
100%


Steps to Reproduce:
1. cat > mHc_Ready.yaml << EOF
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: ready-example
  namespace: openshift-machine-api
  annotations:
    machine.openshift.io/remediation-strategy: external-baremetal
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 60s
EOF

oc create -f mHc_Ready.yaml

2. virsh suspend worker-0-0

3. oc get nodes

4. Repeat steps 1-3 four times. On the fourth run the node goes from NotReady to not found at all (it no longer appears in oc get nodes) and never returns. A minimal automation sketch of this loop follows these steps.
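
For reference, a minimal shell sketch of the loop described above. The worker name (worker-0-0), the wait times, and the final check are assumptions based on this reproduction and may need adjusting for other environments:

#!/bin/bash
# Hypothetical reproduction loop: suspend the worker, give the MachineHealthCheck
# time to notice the Unknown condition and power-cycle the host, then check
# whether the node rejoins. Assumes the MHC from step 1 already exists and that
# virsh and oc are available on the PATH.
NODE=worker-0-0
for run in 1 2 3 4; do
  echo "=== Remediation attempt ${run} ==="
  virsh suspend "${NODE}"
  # The MHC unhealthyCondition timeout is 60s; wait a bit longer before checking.
  sleep 120
  oc get nodes
  # Wait for the node to (hopefully) rejoin before the next attempt.
  sleep 600
  oc get nodes | grep "${NODE}" || echo "node ${NODE} not found after attempt ${run}"
done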


Actual results:
Please see logs and oc must gather for more details.
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491
worker-0-1   Ready    worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# virsh suspend worker-0-0
[root@sealusa8 ~]# oc get nodes
NAME         STATUS      ROLES    AGE    VERSION
master-0-0   Ready       master   143m   v1.18.3+a637491
master-0-1   Ready       master   137m   v1.18.3+a637491
master-0-2   Ready       master   143m   v1.18.3+a637491
worker-0-0   NotReady    worker   102m   v1.18.3+a637491
worker-0-1   Ready       worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491


Expected results:
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491
worker-0-1   Ready    worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# virsh suspend worker-0-0
[root@sealusa8 ~]# oc get nodes
NAME         STATUS      ROLES    AGE    VERSION
master-0-0   Ready       master   143m   v1.18.3+a637491
master-0-1   Ready       master   137m   v1.18.3+a637491
master-0-2   Ready       master   143m   v1.18.3+a637491
worker-0-0   NotReady    worker   102m   v1.18.3+a637491
worker-0-1   Ready       worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491
[root@sealusa8 ~]# oc get nodes
NAME         STATUS      ROLES    AGE    VERSION
master-0-0   Ready       master   143m   v1.18.3+a637491
master-0-1   Ready       master   137m   v1.18.3+a637491
master-0-2   Ready       master   143m   v1.18.3+a637491
worker-0-0   NotReady    worker   102m   v1.18.3+a637491
worker-0-1   Ready       worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491
worker-0-1   Ready    worker   101m   v1.18.3+a637491

Additional info:
Must Gather: https://drive.google.com/open?id=1mBBlbvsQPXesb7kr6CZIPJy9LH19f2-g

Script Log Run 1: https://drive.google.com/open?id=15obrS0ox5FPxMQn0Uwu7jpixAQ9NZVTH

Script Log Run 2: https://drive.google.com/open?id=1fCPNzNkJxcLmNdvXjjkU8u6_dXEJwIBc

Script Log Run 3: https://drive.google.com/open?id=1pBeBt169YEypI8eXxVZkwiRMK-x9CQbI

Script Log Run 4: https://drive.google.com/open?id=1Xxd6woFJTA3iu6pKxBUCIclIUPnq3-1j

Comment 2 Nir 2020-06-29 07:58:50 UTC
Here are my findings:
BMO successfully power-cycled the host and the machine came up, but the node could not register itself.
I SSHed into the node and saw that nodeip-configuration.service has a problem:

Jun 25 14:08:07 worker-2.ostest.test.metalkube.org podman[1424]: Error: error creating container storage: layer not known
Jun 25 14:08:07 worker-2.ostest.test.metalkube.org systemd[1]: nodeip-configuration.service: Main process exited, code=exited, status=125/n/a
Jun 25 14:08:07 worker-2.ostest.test.metalkube.org systemd[1]: nodeip-configuration.service: Failed with result 'exit-code'.
Jun 25 14:08:07 worker-2.ostest.test.metalkube.org systemd[1]: Failed to start Writes IP address configuration so that kubelet and crio services select a valid node IP.
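
For anyone reproducing this, a quick sketch of how the failing unit can be inspected on the node. The hostname is taken from the log above and SSH access as the core user is an assumption:

# Hypothetical inspection steps; hostname and user are assumptions.
ssh core@worker-2.ostest.test.metalkube.org
systemctl status nodeip-configuration.service
journalctl -u nodeip-configuration.service --no-pager | tail -n 50
podman version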

Comment 3 Nir 2020-06-29 07:59:21 UTC
podman version 1.6.4

Comment 4 Nir 2020-07-15 13:48:50 UTC

*** This bug has been marked as a duplicate of bug 1857224 ***

