Created attachment 1696876 [details]
automation_script_log_attempt_1_success_log

Description of problem:

Remediation strategy annotation: 'machine.openshift.io/remediation-strategy': 'external-baremetal'
Unhealthy condition: 'status': 'Unknown'

If the unhealthy condition is triggered on the same node 4 times, on the 4th run the node gets deleted and never returns.

Version-Release number of selected component (if applicable):
oc must-gather: https://drive.google.com/open?id=1mBBlbvsQPXesb7kr6CZIPJy9LH19f2-g

How reproducible:
100%

Steps to Reproduce:
1. Create the MachineHealthCheck:

cat > mHc_Ready.yaml << EOF
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: ready-example
  namespace: openshift-machine-api
  annotations:
    machine.openshift.io/remediation-strategy: external-baremetal
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 60s
EOF

oc create -f mHc_Ready.yaml

2. virsh suspend worker-0-0
3. oc get nodes
4. Repeat steps 1 - 3 four times; on the fourth run the node goes from NotReady to NodeNotFound (it no longer shows up in oc get nodes) and never returns under oc get nodes.

Actual results:
Please see the logs and oc must-gather for more details.
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491
worker-0-1   Ready    worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# virsh suspend worker-0-0
[root@sealusa8 ~]# oc get nodes
NAME         STATUS     ROLES    AGE    VERSION
master-0-0   Ready      master   143m   v1.18.3+a637491
master-0-1   Ready      master   137m   v1.18.3+a637491
master-0-2   Ready      master   143m   v1.18.3+a637491
worker-0-0   NotReady   worker   102m   v1.18.3+a637491
worker-0-1   Ready      worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491

Expected results:

[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491
worker-0-1   Ready    worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# virsh suspend worker-0-0
[root@sealusa8 ~]# oc get nodes
NAME         STATUS     ROLES    AGE    VERSION
master-0-0   Ready      master   143m   v1.18.3+a637491
master-0-1   Ready      master   137m   v1.18.3+a637491
master-0-2   Ready      master   143m   v1.18.3+a637491
worker-0-0   NotReady   worker   102m   v1.18.3+a637491
worker-0-1   Ready      worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491
[root@sealusa8 ~]# oc get nodes
NAME         STATUS     ROLES    AGE    VERSION
master-0-0   Ready      master   143m   v1.18.3+a637491
master-0-1   Ready      master   137m   v1.18.3+a637491
master-0-2   Ready      master   143m   v1.18.3+a637491
worker-0-0   NotReady   worker   102m   v1.18.3+a637491
worker-0-1   Ready      worker   101m   v1.18.3+a637491
[root@sealusa8 ~]# oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   143m   v1.18.3+a637491
master-0-1   Ready    master   137m   v1.18.3+a637491
master-0-2   Ready    master   143m   v1.18.3+a637491
worker-0-0   Ready    worker   102m   v1.18.3+a637491
worker-0-1   Ready    worker   101m   v1.18.3+a637491

Additional info:

Must Gather: https://drive.google.com/open?id=1mBBlbvsQPXesb7kr6CZIPJy9LH19f2-g
Script Log Run 1: https://drive.google.com/open?id=15obrS0ox5FPxMQn0Uwu7jpixAQ9NZVTH
Script Log Run 2: https://drive.google.com/open?id=1fCPNzNkJxcLmNdvXjjkU8u6_dXEJwIBc
Script Log Run 3: https://drive.google.com/open?id=1pBeBt169YEypI8eXxVZkwiRMK-x9CQbI
Script Log Run 4: https://drive.google.com/open?id=1Xxd6woFJTA3iu6pKxBUCIclIUPnq3-1j
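For completeness, the difference between the expected and actual results above comes down to a node vanishing from the `oc get nodes` listing. A minimal sketch of how an automation script could detect that (this helper is hypothetical and not part of the reproducer; it only assumes the standard `oc get nodes` column layout with NAME first):

```python
# Hypothetical helper: diff two `oc get nodes` outputs to spot nodes
# that disappeared after remediation. Not part of the original script.
def node_names(oc_get_nodes_output):
    """Extract the NAME column from `oc get nodes` output (skips the header)."""
    lines = oc_get_nodes_output.strip().splitlines()
    return {line.split()[0] for line in lines[1:] if line.split()}

def missing_nodes(before, after):
    """Return node names present before remediation but gone afterwards."""
    return node_names(before) - node_names(after)

# Sample outputs mirroring the listings above (trimmed to the workers):
before = """NAME         STATUS   ROLES    AGE    VERSION
worker-0-0   Ready    worker   102m   v1.18.3+a637491
worker-0-1   Ready    worker   101m   v1.18.3+a637491"""

after = """NAME         STATUS   ROLES    AGE    VERSION
worker-0-0   Ready    worker   102m   v1.18.3+a637491"""

print(missing_nodes(before, after))  # {'worker-0-1'}
```

In the expected results every node that goes NotReady eventually reappears, so this set would be empty on a healthy run.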
Here are my findings: BMO successfully power-cycled the host and the machine came up, but the node couldn't register itself. I sshed into the node and saw that nodeip-configuration.service has a problem:

Jun 25 14:08:07 worker-2.ostest.test.metalkube.org podman[1424]: Error: error creating container storage: layer not known
Jun 25 14:08:07 worker-2.ostest.test.metalkube.org systemd[1]: nodeip-configuration.service: Main process exited, code=exited, status=125/n/a
Jun 25 14:08:07 worker-2.ostest.test.metalkube.org systemd[1]: nodeip-configuration.service: Failed with result 'exit-code'.
Jun 25 14:08:07 worker-2.ostest.test.metalkube.org systemd[1]: Failed to start Writes IP address configuration so that kubelet and crio services select a valid node IP.
podman version 1.6.4
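The journal lines above carry a distinctive failure signature. A minimal triage sketch (hypothetical, not part of the investigation; it only assumes journal text like the excerpt quoted above) that flags those lines when scanning output collected from the node:

```python
# Hypothetical triage helper: scan journal text for the
# nodeip-configuration.service failure signature reported above.
import re

FAILURE_PATTERNS = [
    # podman container-storage corruption seen on the affected node
    re.compile(r"error creating container storage: layer not known"),
    # systemd reporting the unit as failed
    re.compile(r"nodeip-configuration\.service: Failed with result 'exit-code'"),
]

def find_failures(journal_text):
    """Return journal lines matching any known failure pattern."""
    return [line for line in journal_text.splitlines()
            if any(p.search(line) for p in FAILURE_PATTERNS)]

journal = """\
Jun 25 14:08:07 worker-2 podman[1424]: Error: error creating container storage: layer not known
Jun 25 14:08:07 worker-2 systemd[1]: nodeip-configuration.service: Failed with result 'exit-code'.
"""
for line in find_failures(journal):
    print(line)
```

A healthy node's journal would match neither pattern, so an empty result distinguishes this failure mode from an unrelated registration problem.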
*** This bug has been marked as a duplicate of bug 1857224 ***