Bug 1811400

Summary: Node is not deleted from cluster during remediation
Product: OpenShift Container Platform
Reporter: vsibirsk
Component: Cloud Compute
Cloud Compute sub component: OpenStack Provider
Assignee: egarcia
QA Contact: David Sanz <dsanzmor>
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
CC: abeekhof, adduarte, agarcial, augol, danken, egarcia, fdeutsch, ipinto, m.andre, mfedosin, ncredi, oblaut, pprinett, shardy, stbenjam, zhsun
Version: 4.4
Keywords: AutomationBlocker
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2020-05-07 15:49:26 UTC
Attachments (flags: none):
  machinehealthcheck
  vm
  machine-controller logs
  machine-healthcheck-controller logs

Description vsibirsk 2020-03-08 11:32:47 UTC
Created attachment 1668422 [details]
machinehealthcheck

Description of problem:
Remediation doesn't delete old node from cluster

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Create a VM with runningStrategy=Always
2. Create a MachineHealthCheck (a sketch is shown below)
3. Stop kubelet.service on the node where the VM from step 1 is hosted
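
A minimal MachineHealthCheck sketch for step 2 (the manifest actually used is in the machinehealthcheck attachment; the selector label here is an assumption, while the name, namespace, maxUnhealthy and timeout values match the controller logs below):

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker   # assumed label
  maxUnhealthy: 2
  unhealthyConditions:
  - type: Ready
    status: "Unknown"
    timeout: "300s"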

Actual results:
The node is marked as NotReady,SchedulingDisabled, a new node is installed and added to the cluster, but the old node is not deleted or removed from the cluster

Expected results:
The old node is removed from the cluster and the VM is restarted on a new available node

Additional info:
Logs:
oc logs --tail=40 machine-api-controllers-6d98ffcd76-tb884 -c machine-healthcheck-controller
I0308 11:26:20.146660       1 machinehealthcheck_controller.go:204] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: meet unhealthy criteria, triggers remediation
I0308 11:26:20.146688       1 machinehealthcheck_controller.go:428]  openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: start remediation logic
I0308 11:26:20.146699       1 machinehealthcheck_controller.go:452] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: deleting
I0308 11:26:20.162675       1 machinehealthcheck_controller.go:223] Reconciling openshift-machine-api/example: no more targets meet unhealthy criteria
I0308 11:26:37.202413       1 machinehealthcheck_controller.go:149] Reconciling openshift-machine-api/example
I0308 11:26:37.203044       1 machinehealthcheck_controller.go:162] Reconciling openshift-machine-api/example: finding targets
I0308 11:26:37.203593       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: health checking
I0308 11:26:37.203624       1 machinehealthcheck_controller.go:601] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: unhealthy: condition Ready in state Unknown longer than 300s
I0308 11:26:37.203641       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-l9l8h/vasiliy-vk-q4p9z-worker-l9l8h: health checking
I0308 11:26:37.203651       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-d2wgj/vasiliy-vk-q4p9z-worker-d2wgj: health checking
I0308 11:26:37.203661       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-4qnpx/vasiliy-vk-q4p9z-worker-4qnpx: health checking
I0308 11:26:37.213090       1 machinehealthcheck_controller.go:195] Reconciling openshift-machine-api/example: monitoring MHC: total targets: 4,  maxUnhealthy: 2, unhealthy: 1. Remediations are allowed
I0308 11:26:37.215246       1 machinehealthcheck_controller.go:204] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: meet unhealthy criteria, triggers remediation
I0308 11:26:37.215624       1 machinehealthcheck_controller.go:428]  openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: start remediation logic
I0308 11:26:37.215669       1 machinehealthcheck_controller.go:452] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: deleting
I0308 11:26:37.231896       1 machinehealthcheck_controller.go:223] Reconciling openshift-machine-api/example: no more targets meet unhealthy criteria
I0308 11:26:43.448124       1 machinehealthcheck_controller.go:149] Reconciling openshift-machine-api/example
I0308 11:26:43.448191       1 machinehealthcheck_controller.go:162] Reconciling openshift-machine-api/example: finding targets
I0308 11:26:43.448667       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-4qnpx/vasiliy-vk-q4p9z-worker-4qnpx: health checking
I0308 11:26:43.448687       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: health checking
I0308 11:26:43.448698       1 machinehealthcheck_controller.go:601] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: unhealthy: condition Ready in state Unknown longer than 300s
I0308 11:26:43.448712       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-l9l8h/vasiliy-vk-q4p9z-worker-l9l8h: health checking
I0308 11:26:43.448721       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-d2wgj/vasiliy-vk-q4p9z-worker-d2wgj: health checking
I0308 11:26:43.459843       1 machinehealthcheck_controller.go:195] Reconciling openshift-machine-api/example: monitoring MHC: total targets: 4,  maxUnhealthy: 2, unhealthy: 1. Remediations are allowed
I0308 11:26:43.459870       1 machinehealthcheck_controller.go:204] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: meet unhealthy criteria, triggers remediation
I0308 11:26:43.459905       1 machinehealthcheck_controller.go:428]  openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: start remediation logic
I0308 11:26:43.459913       1 machinehealthcheck_controller.go:452] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: deleting
I0308 11:26:43.470653       1 machinehealthcheck_controller.go:223] Reconciling openshift-machine-api/example: no more targets meet unhealthy criteria
I0308 11:26:43.470721       1 machinehealthcheck_controller.go:149] Reconciling openshift-machine-api/example
I0308 11:26:43.470748       1 machinehealthcheck_controller.go:162] Reconciling openshift-machine-api/example: finding targets
I0308 11:26:43.471064       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: health checking
I0308 11:26:43.471082       1 machinehealthcheck_controller.go:601] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: unhealthy: condition Ready in state Unknown longer than 300s
I0308 11:26:43.471092       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-l9l8h/vasiliy-vk-q4p9z-worker-l9l8h: health checking
I0308 11:26:43.471098       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-d2wgj/vasiliy-vk-q4p9z-worker-d2wgj: health checking
I0308 11:26:43.471114       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-4qnpx/vasiliy-vk-q4p9z-worker-4qnpx: health checking
I0308 11:26:43.476666       1 machinehealthcheck_controller.go:195] Reconciling openshift-machine-api/example: monitoring MHC: total targets: 4,  maxUnhealthy: 2, unhealthy: 1. Remediations are allowed
I0308 11:26:43.476703       1 machinehealthcheck_controller.go:204] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: meet unhealthy criteria, triggers remediation
I0308 11:26:43.476710       1 machinehealthcheck_controller.go:428]  openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: start remediation logic
I0308 11:26:43.476720       1 machinehealthcheck_controller.go:452] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: deleting
I0308 11:26:43.485265       1 machinehealthcheck_controller.go:223] Reconciling openshift-machine-api/example: no more targets meet unhealthy criteria

Comment 1 vsibirsk 2020-03-08 11:33:34 UTC
Created attachment 1668423 [details]
vm

Comment 4 Andrew Beekhof 2020-03-10 03:21:01 UTC
Hey Alberto,

I don't think this bug belongs to me.
This is a problem with the default remediation strategy, nothing to do with reboots.

-- Andrew

Comment 5 Jianwei Hou 2020-03-10 06:11:53 UTC
The unsuccessful deletion is possibly caused by the node drain; this looks similar to bug 1810400. The machine-controller log can show whether the drain completed.
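
For example, something like this should show the drain progress (pod name taken from the description above; it may differ in your cluster):

oc logs machine-api-controllers-6d98ffcd76-tb884 -c machine-controller -n openshift-machine-api | grep -i drain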

Comment 6 Alberto 2020-03-10 08:18:48 UTC
Can you please check machine controller logs?
If the kubelet was stopped, so the node is unreachable, and there are stateful pods, draining will only proceed after 5 minutes, as that is the value of skip-wait-for-delete-timeout, unless the baremetal provider does not have the latest machine controller.
See e.g. AWS:
https://github.com/openshift/machine-api-operator/pull/464
https://github.com/openshift/cluster-api-provider-aws/pull/296
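
If the drain is indeed waiting on pods stuck in Terminating on the unreachable node, something like this should show them (node name taken from the MHC logs above):

oc get pods --all-namespaces -o wide --field-selector spec.nodeName=vasiliy-vk-q4p9z-worker-cfsw7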

Comment 7 Andrew Beekhof 2020-03-11 05:32:23 UTC
When I looked at this environment, it had been trying to drain for several days (both the node and the machine had the machinev1.MachineFinalizer).
Perhaps that nightly didn't have skipWaitForDeleteTimeoutSeconds yet?
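
For the record, a quick way to confirm the finalizers are still present (machine and node names taken from the logs above):

oc get machine vasiliy-vk-q4p9z-worker-cfsw7 -n openshift-machine-api -o jsonpath='{.metadata.finalizers}'
oc get node vasiliy-vk-q4p9z-worker-cfsw7 -o jsonpath='{.metadata.finalizers}'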

Comment 8 Jianwei Hou 2020-03-11 08:21:40 UTC
If the OCP version is newer than 4.4.0-0.nightly-2020-02-18-042756, it should be included. As I understand it, the cloud-api-provider also needs to revendor the new node drain feature (see comment 6). If the OCP deployment is using libvirt, then it probably does not have the new feature.

@vsibirsk, the must-gather does not contain any logs under the machine-api namespace or openshift-cluster-version. Can you please provide the following?

Which cloud provider the OCP deployment uses.

oc get clusterversion -o yaml

oc logs machine-api-controllers-xxx -c machine-controller -n openshift-machine-api (for the machine with the error)

If you'd like to skip the node drain, use the annotation `machine.openshift.io/exclude-node-draining: ""` on the machine.
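
For example (machine name taken from the logs above; this only skips the drain for that one machine):

oc annotate machine vasiliy-vk-q4p9z-worker-cfsw7 machine.openshift.io/exclude-node-draining="" -n openshift-machine-api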

Comment 9 Andrew Beekhof 2020-03-12 07:45:20 UTC
(In reply to Jianwei Hou from comment #8)
> If the OCP version is newer than 4.4.0-0.nightly-2020-02-18-042756, it
> should be included. 

According to my notes, their testing was with version: 4.4.0-0.nightly-2020-03-02-011520

> As I understand, it also needs the cloud-api-provider to
> revendor the new node drain feature(as comment 6). If the OCP deployment is
> using libvirt then it probably does not have the new feature.
> 
> @vsibirsk, the must-gather does not contain any log under
> machine-api-namespace and openshift-cluster-version. Can you please provide
> the followings?
> 
> What the cloud provider was for the OCP deployment.
> 
> oc get clusterversion -o yaml


[cloud-user@ocp-psi-executor ~]$ oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2020-03-03T08:58:26Z"
    generation: 1
    name: version
    resourceVersion: "8543723"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: afb87553-c74d-44f4-bfb4-db642d78ec2d
  spec:
    channel: stable-4.4
    clusterID: 57e0ac37-3b02-4760-8581-50fef02ce0b4
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2020-03-03T09:27:48Z"
      message: Done applying 4.4.0-0.nightly-2020-03-02-011520
      status: "True"
      type: Available
    - lastTransitionTime: "2020-03-12T01:21:03Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2020-03-03T09:27:48Z"
      message: Cluster version is 4.4.0-0.nightly-2020-03-02-011520
      status: "False"
      type: Progressing
    - lastTransitionTime: "2020-03-03T08:58:40Z"
      message: 'Unable to retrieve available updates: currently installed version
        4.4.0-0.nightly-2020-03-02-011520 not found in the "stable-4.4" channel'
      reason: VersionNotFound
      status: "False"
      type: RetrievedUpdates
    desired:
      force: false
      image: registry.svc.ci.openshift.org/ocp/release@sha256:8a04012778c628211d705bb632276c16534c864cf4084c3c0d9d5b42087dade7
      version: 4.4.0-0.nightly-2020-03-02-011520
    history:
    - completionTime: "2020-03-03T09:27:48Z"
      image: registry.svc.ci.openshift.org/ocp/release@sha256:8a04012778c628211d705bb632276c16534c864cf4084c3c0d9d5b42087dade7
      startedTime: "2020-03-03T08:58:40Z"
      state: Completed
      verified: false
      version: 4.4.0-0.nightly-2020-03-02-011520
    observedGeneration: 1
    versionHash: b4c1puXoHrY=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""


> 
> oc logs machine-api-controllers-xxx -c machine-controller -n
> openshift-machine-api(with the error machine)

attached

> 
> If you'd like to skip the node drain, use annotation
> `machine.openshift.io/exclude-node-draining: ""` on the machine.

I don't believe this is relevant here, as it is the MHC that requested the failed machine be deleted - not the admin.

Comment 10 Andrew Beekhof 2020-03-12 07:46:33 UTC
Created attachment 1669570 [details]
machine-controller logs

Comment 11 Andrew Beekhof 2020-03-12 07:47:07 UTC
Created attachment 1669571 [details]
machine-healthcheck-controller logs

Comment 12 Nelly Credi 2020-04-26 13:50:33 UTC
This is affecting CNV and I see it is still on NEW. 
What is the plan?

Comment 13 Alberto 2020-04-27 14:58:49 UTC
Hi, can you clarify in which environment you are seeing "Node is not deleted from cluster during remediation"? baremetal/aws/azure/gcp/openstack?

Comment 14 Steven Hardy 2020-04-27 15:46:26 UTC
(In reply to Alberto from comment #13)
> Hi, can you clarify in which environment you are seeing "Node is not deleted
> from cluster during remediation"? baremetal/aws/azure/gcp/openstack?

+1 I think we need more information about the environment, e.g. the platform and how the cluster was deployed, in order to triage/assign this.

Adding needinfo to the reporter to hopefully get this information.

Comment 15 Israel Pinto 2020-04-27 17:04:32 UTC
@Alberto, @Steven
The test was done with OpenStack (PSI).

Comment 16 Alberto 2020-04-28 07:20:36 UTC
During a machine deletion, the machine controller drains the node and can block permanently on a pod disruption budget.
If the node is unreachable, the drain will still succeed in evicting the pods; it waits for the pods to be deleted for 5 minutes after they are given a deletion timestamp, then moves forward and deletes the node.

If this behaviour is not happening, it is likely that the OpenStack provider is not using the latest version of the machine controller.
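
One way to check which machine controller image the cluster is running (a sketch; the deployment name is inferred from the controller pod name earlier in this bug):

oc -n openshift-machine-api get deployment machine-api-controllers -o jsonpath='{.spec.template.spec.containers[?(@.name=="machine-controller")].image}'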

Comment 18 Martin André 2020-04-30 10:37:39 UTC
This is possibly a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1810400 which is currently being worked on.

Comment 19 egarcia 2020-04-30 14:04:49 UTC
+1 Let's revisit when the BZ linked in the comment above is complete. This very likely has the same root cause.

Comment 22 egarcia 2020-05-07 15:49:26 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1810400 is a verified solution for 4.5. I will be backporting it to 4.4 shortly. This should allow you to properly delete nodes. I am going to close this bug as a duplicate.

*** This bug has been marked as a duplicate of bug 1810400 ***