Created attachment 1668422 [details]
machinehealthcheck

Description of problem:
Remediation does not delete the old node from the cluster.

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Create a VM with runningStrategy=Always
2. Create a MachineHealthCheck
3. Stop kubelet.service on the node where the VM from step 1 is hosted

Actual results:
The node is marked NotReady,SchedulingDisabled and a new node is installed and added to the cluster, but the old node is not deleted and removed from the cluster.

Expected results:
The old node is removed from the cluster and the VM is restarted on a new available node.

Additional info:
Logs:

oc logs --tail=40 machine-api-controllers-6d98ffcd76-tb884 -c machine-healthcheck-controller
I0308 11:26:20.146660 1 machinehealthcheck_controller.go:204] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: meet unhealthy criteria, triggers remediation
I0308 11:26:20.146688 1 machinehealthcheck_controller.go:428] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: start remediation logic
I0308 11:26:20.146699 1 machinehealthcheck_controller.go:452] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: deleting
I0308 11:26:20.162675 1 machinehealthcheck_controller.go:223] Reconciling openshift-machine-api/example: no more targets meet unhealthy criteria
I0308 11:26:37.202413 1 machinehealthcheck_controller.go:149] Reconciling openshift-machine-api/example
I0308 11:26:37.203044 1 machinehealthcheck_controller.go:162] Reconciling openshift-machine-api/example: finding targets
I0308 11:26:37.203593 1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: health checking
I0308 11:26:37.203624 1 machinehealthcheck_controller.go:601] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: unhealthy: condition Ready in state Unknown longer than 300s
I0308 11:26:37.203641 1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-l9l8h/vasiliy-vk-q4p9z-worker-l9l8h: health checking
I0308 11:26:37.203651 1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-d2wgj/vasiliy-vk-q4p9z-worker-d2wgj: health checking
I0308 11:26:37.203661 1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-4qnpx/vasiliy-vk-q4p9z-worker-4qnpx: health checking
I0308 11:26:37.213090 1 machinehealthcheck_controller.go:195] Reconciling openshift-machine-api/example: monitoring MHC: total targets: 4, maxUnhealthy: 2, unhealthy: 1. Remediations are allowed
I0308 11:26:37.215246 1 machinehealthcheck_controller.go:204] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: meet unhealthy criteria, triggers remediation
I0308 11:26:37.215624 1 machinehealthcheck_controller.go:428] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: start remediation logic
I0308 11:26:37.215669 1 machinehealthcheck_controller.go:452] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: deleting
I0308 11:26:37.231896 1 machinehealthcheck_controller.go:223] Reconciling openshift-machine-api/example: no more targets meet unhealthy criteria
I0308 11:26:43.448124 1 machinehealthcheck_controller.go:149] Reconciling openshift-machine-api/example
I0308 11:26:43.448191 1 machinehealthcheck_controller.go:162] Reconciling openshift-machine-api/example: finding targets
I0308 11:26:43.448667 1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-4qnpx/vasiliy-vk-q4p9z-worker-4qnpx: health checking
I0308 11:26:43.448687 1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: health checking
I0308 11:26:43.448698 1 machinehealthcheck_controller.go:601] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: unhealthy: condition Ready in state Unknown longer than 300s
I0308 11:26:43.448712 1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-l9l8h/vasiliy-vk-q4p9z-worker-l9l8h: health checking
I0308 11:26:43.448721 1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-d2wgj/vasiliy-vk-q4p9z-worker-d2wgj: health checking
I0308 11:26:43.459843 1 machinehealthcheck_controller.go:195] Reconciling openshift-machine-api/example: monitoring MHC: total targets: 4, maxUnhealthy: 2, unhealthy: 1. Remediations are allowed
I0308 11:26:43.459870 1 machinehealthcheck_controller.go:204] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: meet unhealthy criteria, triggers remediation
I0308 11:26:43.459905 1 machinehealthcheck_controller.go:428] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: start remediation logic
I0308 11:26:43.459913 1 machinehealthcheck_controller.go:452] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: deleting
I0308 11:26:43.470653 1 machinehealthcheck_controller.go:223] Reconciling openshift-machine-api/example: no more targets meet unhealthy criteria
I0308 11:26:43.470721 1 machinehealthcheck_controller.go:149] Reconciling openshift-machine-api/example
I0308 11:26:43.470748 1 machinehealthcheck_controller.go:162] Reconciling openshift-machine-api/example: finding targets
I0308 11:26:43.471064 1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: health checking
I0308 11:26:43.471082 1 machinehealthcheck_controller.go:601] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: unhealthy: condition Ready in state Unknown longer than 300s
I0308 11:26:43.471092 1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-l9l8h/vasiliy-vk-q4p9z-worker-l9l8h: health checking
I0308 11:26:43.471098 1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-d2wgj/vasiliy-vk-q4p9z-worker-d2wgj: health checking
I0308 11:26:43.471114 1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-4qnpx/vasiliy-vk-q4p9z-worker-4qnpx: health checking
I0308 11:26:43.476666 1 machinehealthcheck_controller.go:195] Reconciling openshift-machine-api/example: monitoring MHC: total targets: 4, maxUnhealthy: 2, unhealthy: 1. Remediations are allowed
I0308 11:26:43.476703 1 machinehealthcheck_controller.go:204] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: meet unhealthy criteria, triggers remediation
I0308 11:26:43.476710 1 machinehealthcheck_controller.go:428] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: start remediation logic
I0308 11:26:43.476720 1 machinehealthcheck_controller.go:452] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: deleting
I0308 11:26:43.485265 1 machinehealthcheck_controller.go:223] Reconciling openshift-machine-api/example: no more targets meet unhealthy criteria
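For reference without opening attachment 1668422: a MachineHealthCheck consistent with the values visible in the log above (name "example" in openshift-machine-api, maxUnhealthy: 2, Ready condition Unknown for 300s) would look roughly like the sketch below. The selector labels are hypothetical; the attachment remains the authoritative spec.

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      # hypothetical label selecting the worker machines
      machine.openshift.io/cluster-api-machine-role: worker
  maxUnhealthy: 2
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 300s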
Created attachment 1668423 [details]
vm
Hey Alberto, I don't think this bug belongs to me. This is a problem with the default remediation strategy, nothing to do with reboots. -- Andrew
The unsuccessful deletion is possibly caused by the node drain; this looks similar to bug 1810400. The machine-controller log can show whether the drain completed.
Can you please check the machine controller logs? If the kubelet was stopped, the node is unreachable; if there are also stateful pods, draining will only proceed after 5 minutes, since that is the value of skip-wait-for-delete-timeout. That holds unless the baremetal provider does not yet have the latest machine controller. See e.g. the AWS changes:
https://github.com/openshift/machine-api-operator/pull/464
https://github.com/openshift/cluster-api-provider-aws/pull/296
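For example, something like this should show whether the drain is progressing (the deployment/machine-api-controllers reference is an assumption based on the pod name in the logs above):

oc logs deployment/machine-api-controllers -c machine-controller -n openshift-machine-api | grep -i drain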
When I looked at this environment, it had been trying to drain for several days (both the node and the machine had the machinev1.MachineFinalizer). Perhaps that nightly didn't have skipWaitForDeleteTimeoutSeconds yet?
If the OCP version is newer than 4.4.0-0.nightly-2020-02-18-042756, it should be included. As I understand it, each cluster-api provider also needs to revendor the new node drain feature (see comment 6). If the OCP deployment is using libvirt, it probably does not have the new feature yet.

@vsibirsk, the must-gather does not contain any logs under the machine-api and openshift-cluster-version namespaces. Can you please provide the following?

- the cloud provider used for the OCP deployment
- oc get clusterversion -o yaml
- oc logs machine-api-controllers-xxx -c machine-controller -n openshift-machine-api (for the failing machine)

If you'd like to skip the node drain, set the annotation `machine.openshift.io/exclude-node-draining: ""` on the machine.
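For reference, the annotation can be applied like this; <machine-name> is a placeholder for the machine being remediated:

oc -n openshift-machine-api annotate machine <machine-name> machine.openshift.io/exclude-node-draining=""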
(In reply to Jianwei Hou from comment #8)
> If the OCP version is newer than 4.4.0-0.nightly-2020-02-18-042756, it
> should be included.

According to my notes, their testing was with version 4.4.0-0.nightly-2020-03-02-011520.

> As I understand it, each cluster-api provider also needs to revendor the
> new node drain feature (see comment 6). If the OCP deployment is using
> libvirt, it probably does not have the new feature yet.
>
> @vsibirsk, the must-gather does not contain any logs under the machine-api
> and openshift-cluster-version namespaces. Can you please provide the
> following?
>
> - the cloud provider used for the OCP deployment
>
> - oc get clusterversion -o yaml

[cloud-user@ocp-psi-executor ~]$ oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2020-03-03T08:58:26Z"
    generation: 1
    name: version
    resourceVersion: "8543723"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: afb87553-c74d-44f4-bfb4-db642d78ec2d
  spec:
    channel: stable-4.4
    clusterID: 57e0ac37-3b02-4760-8581-50fef02ce0b4
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2020-03-03T09:27:48Z"
      message: Done applying 4.4.0-0.nightly-2020-03-02-011520
      status: "True"
      type: Available
    - lastTransitionTime: "2020-03-12T01:21:03Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2020-03-03T09:27:48Z"
      message: Cluster version is 4.4.0-0.nightly-2020-03-02-011520
      status: "False"
      type: Progressing
    - lastTransitionTime: "2020-03-03T08:58:40Z"
      message: 'Unable to retrieve available updates: currently installed version
        4.4.0-0.nightly-2020-03-02-011520 not found in the "stable-4.4" channel'
      reason: VersionNotFound
      status: "False"
      type: RetrievedUpdates
    desired:
      force: false
      image: registry.svc.ci.openshift.org/ocp/release@sha256:8a04012778c628211d705bb632276c16534c864cf4084c3c0d9d5b42087dade7
      version: 4.4.0-0.nightly-2020-03-02-011520
    history:
    - completionTime: "2020-03-03T09:27:48Z"
      image: registry.svc.ci.openshift.org/ocp/release@sha256:8a04012778c628211d705bb632276c16534c864cf4084c3c0d9d5b42087dade7
      startedTime: "2020-03-03T08:58:40Z"
      state: Completed
      verified: false
      version: 4.4.0-0.nightly-2020-03-02-011520
    observedGeneration: 1
    versionHash: b4c1puXoHrY=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

> oc logs machine-api-controllers-xxx -c machine-controller -n
> openshift-machine-api (for the failing machine)

Attached.

> If you'd like to skip the node drain, set the annotation
> `machine.openshift.io/exclude-node-draining: ""` on the machine.

I don't believe this is relevant here, as it is the MHC that requested the failed machine be deleted - not the admin.
Created attachment 1669570 [details]
machine-controller logs
Created attachment 1669571 [details]
machine-healthcheck-controller logs
This is affecting CNV and I see it is still on NEW. What is the plan?
Hi, can you clarify in which environment you are seeing "Node is not deleted from cluster during remediation"? baremetal/aws/azure/gcp/openstack?
(In reply to Alberto from comment #13)
> Hi, can you clarify in which environment you are seeing "Node is not
> deleted from cluster during remediation"? baremetal/aws/azure/gcp/openstack?

+1, I think we need more information about the environment, e.g. the platform and how the cluster was deployed, to triage/assign this. Adding needinfo to the reporter to hopefully get this information.
@Alberto, @Steven: the test was done with OpenStack (PSI).
During a machine deletion, the machine controller drains the node and will block permanently on a pod disruption budget. If the node is unreachable, the drain will still succeed in evicting the pods, and it will wait for the pods to be deleted until 5 minutes after they are given a deletion timestamp. Then it moves forward and deletes the node. If this behaviour is not happening, the OpenStack provider is likely not using the latest version of the machine controller.
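For what it's worth, the behaviour described above is roughly what a manual drain with the newer flag does; this is a sketch, assuming the controller passes an equivalent of --skip-wait-for-delete-timeout, and <node-name> is a placeholder:

oc adm drain <node-name> --ignore-daemonsets --delete-local-data --force --skip-wait-for-delete-timeout=300

With that flag, drain stops waiting for pods whose deletion timestamp is more than 300 seconds old, which is what lets the controller move on and delete an unreachable node.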
This is possibly a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1810400, which is currently being worked on.
+1. Let's revisit when the BZ linked in the comment above is complete. This very likely has the same root cause.
https://bugzilla.redhat.com/show_bug.cgi?id=1810400 is a verified solution for 4.5. I will be backporting it to 4.4 shortly. This should allow nodes to be deleted properly. I am going to close this bug as a duplicate.

*** This bug has been marked as a duplicate of bug 1810400 ***