Bug 1811400 - Node is not deleted from cluster during remediation
Summary: Node is not deleted from cluster during remediation
Keywords:
Status: CLOSED DUPLICATE of bug 1810400
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: egarcia
QA Contact: David Sanz
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-08 11:32 UTC by vsibirsk
Modified: 2020-05-20 19:54 UTC
CC List: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-07 15:49:26 UTC
Target Upstream Version:
Embargoed:


Attachments
machinehealthcheck (599 bytes, text/plain) - 2020-03-08 11:32 UTC, vsibirsk
vm (868 bytes, text/plain) - 2020-03-08 11:33 UTC, vsibirsk
machine-controller logs (166.20 KB, application/x-bzip) - 2020-03-12 07:46 UTC, Andrew Beekhof
machine-healthcheck-controller logs (611.84 KB, application/x-bzip) - 2020-03-12 07:47 UTC, Andrew Beekhof


Links
Github openshift/cluster-api-provider-openstack pull 93 (closed): [release-4.4] Bug 1811400: Migrate to MAO (last updated 2020-09-10 13:13:02 UTC)

Description vsibirsk 2020-03-08 11:32:47 UTC
Created attachment 1668422: machinehealthcheck

Description of problem:
Remediation does not delete the old node from the cluster

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Create a VM with runningStrategy=Always
2. Create a MachineHealthCheck (a minimal sketch follows below)
3. Stop kubelet.service on the node where the VM from step 1 is hosted
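
For reference, the attached MachineHealthCheck should look roughly like this minimal sketch; the selector labels are assumed, while the name, maxUnhealthy and timeout values match the controller logs below:

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker  # assumed label selector
  maxUnhealthy: 2            # matches "maxUnhealthy: 2" in the logs below
  unhealthyConditions:
  - type: Ready
    status: Unknown          # matches "condition Ready in state Unknown"
    timeout: 300s            # matches "longer than 300s"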

Actual results:
The node is marked NotReady,SchedulingDisabled; a new node is installed and added to the cluster, but the old one is neither deleted nor removed from the cluster

Expected results:
The old node is removed from the cluster and the VM is restarted on a new available node

Additional info:
Logs:
oc logs --tail=40 machine-api-controllers-6d98ffcd76-tb884 -c machine-healthcheck-controller
I0308 11:26:20.146660       1 machinehealthcheck_controller.go:204] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: meet unhealthy criteria, triggers remediation
I0308 11:26:20.146688       1 machinehealthcheck_controller.go:428]  openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: start remediation logic
I0308 11:26:20.146699       1 machinehealthcheck_controller.go:452] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: deleting
I0308 11:26:20.162675       1 machinehealthcheck_controller.go:223] Reconciling openshift-machine-api/example: no more targets meet unhealthy criteria
I0308 11:26:37.202413       1 machinehealthcheck_controller.go:149] Reconciling openshift-machine-api/example
I0308 11:26:37.203044       1 machinehealthcheck_controller.go:162] Reconciling openshift-machine-api/example: finding targets
I0308 11:26:37.203593       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: health checking
I0308 11:26:37.203624       1 machinehealthcheck_controller.go:601] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: unhealthy: condition Ready in state Unknown longer than 300s
I0308 11:26:37.203641       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-l9l8h/vasiliy-vk-q4p9z-worker-l9l8h: health checking
I0308 11:26:37.203651       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-d2wgj/vasiliy-vk-q4p9z-worker-d2wgj: health checking
I0308 11:26:37.203661       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-4qnpx/vasiliy-vk-q4p9z-worker-4qnpx: health checking
I0308 11:26:37.213090       1 machinehealthcheck_controller.go:195] Reconciling openshift-machine-api/example: monitoring MHC: total targets: 4,  maxUnhealthy: 2, unhealthy: 1. Remediations are allowed
I0308 11:26:37.215246       1 machinehealthcheck_controller.go:204] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: meet unhealthy criteria, triggers remediation
I0308 11:26:37.215624       1 machinehealthcheck_controller.go:428]  openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: start remediation logic
I0308 11:26:37.215669       1 machinehealthcheck_controller.go:452] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: deleting
I0308 11:26:37.231896       1 machinehealthcheck_controller.go:223] Reconciling openshift-machine-api/example: no more targets meet unhealthy criteria
I0308 11:26:43.448124       1 machinehealthcheck_controller.go:149] Reconciling openshift-machine-api/example
I0308 11:26:43.448191       1 machinehealthcheck_controller.go:162] Reconciling openshift-machine-api/example: finding targets
I0308 11:26:43.448667       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-4qnpx/vasiliy-vk-q4p9z-worker-4qnpx: health checking
I0308 11:26:43.448687       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: health checking
I0308 11:26:43.448698       1 machinehealthcheck_controller.go:601] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: unhealthy: condition Ready in state Unknown longer than 300s
I0308 11:26:43.448712       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-l9l8h/vasiliy-vk-q4p9z-worker-l9l8h: health checking
I0308 11:26:43.448721       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-d2wgj/vasiliy-vk-q4p9z-worker-d2wgj: health checking
I0308 11:26:43.459843       1 machinehealthcheck_controller.go:195] Reconciling openshift-machine-api/example: monitoring MHC: total targets: 4,  maxUnhealthy: 2, unhealthy: 1. Remediations are allowed
I0308 11:26:43.459870       1 machinehealthcheck_controller.go:204] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: meet unhealthy criteria, triggers remediation
I0308 11:26:43.459905       1 machinehealthcheck_controller.go:428]  openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: start remediation logic
I0308 11:26:43.459913       1 machinehealthcheck_controller.go:452] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: deleting
I0308 11:26:43.470653       1 machinehealthcheck_controller.go:223] Reconciling openshift-machine-api/example: no more targets meet unhealthy criteria
I0308 11:26:43.470721       1 machinehealthcheck_controller.go:149] Reconciling openshift-machine-api/example
I0308 11:26:43.470748       1 machinehealthcheck_controller.go:162] Reconciling openshift-machine-api/example: finding targets
I0308 11:26:43.471064       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: health checking
I0308 11:26:43.471082       1 machinehealthcheck_controller.go:601] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: unhealthy: condition Ready in state Unknown longer than 300s
I0308 11:26:43.471092       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-l9l8h/vasiliy-vk-q4p9z-worker-l9l8h: health checking
I0308 11:26:43.471098       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-d2wgj/vasiliy-vk-q4p9z-worker-d2wgj: health checking
I0308 11:26:43.471114       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-4qnpx/vasiliy-vk-q4p9z-worker-4qnpx: health checking
I0308 11:26:43.476666       1 machinehealthcheck_controller.go:195] Reconciling openshift-machine-api/example: monitoring MHC: total targets: 4,  maxUnhealthy: 2, unhealthy: 1. Remediations are allowed
I0308 11:26:43.476703       1 machinehealthcheck_controller.go:204] Reconciling openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: meet unhealthy criteria, triggers remediation
I0308 11:26:43.476710       1 machinehealthcheck_controller.go:428]  openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: start remediation logic
I0308 11:26:43.476720       1 machinehealthcheck_controller.go:452] openshift-machine-api/example/vasiliy-vk-q4p9z-worker-cfsw7/vasiliy-vk-q4p9z-worker-cfsw7: deleting
I0308 11:26:43.485265       1 machinehealthcheck_controller.go:223] Reconciling openshift-machine-api/example: no more targets meet unhealthy criteria

Comment 1 vsibirsk 2020-03-08 11:33:34 UTC
Created attachment 1668423: vm

Comment 4 Andrew Beekhof 2020-03-10 03:21:01 UTC
Hey Alberto,

I don't think this bug belongs to me.
This is a problem with the default remediation strategy, nothing to do with reboots.

-- Andrew

Comment 5 Jianwei Hou 2020-03-10 06:11:53 UTC
The unsuccessful deletion is possibly caused by the node drain; this looks similar to bug 1810400. The machine-controller log would show whether the drain had completed.

Comment 6 Alberto 2020-03-10 08:18:48 UTC
Can you please check the machine-controller logs?
If the kubelet was stopped, then the node is unreachable; when there are stateful pods, draining will only proceed after 5 minutes, which is the value of skip-wait-for-delete-timeout. That is, unless the baremetal provider does not yet have the latest machine controller.
See e.g. the AWS changes:
https://github.com/openshift/machine-api-operator/pull/464
https://github.com/openshift/cluster-api-provider-aws/pull/296
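
The behaviour above is roughly what a manual drain with the corresponding flag would do; a sketch, where the node name is a placeholder and 300 seconds is the 5-minute value mentioned above:

oc adm drain <node-name> --ignore-daemonsets --delete-local-data --force --skip-wait-for-delete-timeout=300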

Comment 7 Andrew Beekhof 2020-03-11 05:32:23 UTC
When I looked at this environment, it had been trying to drain for several days (both the node and the machine still had the machinev1.MachineFinalizer).
Perhaps that nightly didn't have skipWaitForDeleteTimeoutSeconds yet?
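
A sketch of how to confirm that the finalizer is still set, using the failing machine from the logs (finalizers live in metadata.finalizers on both objects):

oc get machine vasiliy-vk-q4p9z-worker-cfsw7 -n openshift-machine-api -o jsonpath='{.metadata.finalizers}'
oc get node vasiliy-vk-q4p9z-worker-cfsw7 -o jsonpath='{.metadata.finalizers}'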

Comment 8 Jianwei Hou 2020-03-11 08:21:40 UTC
If the OCP version is newer than 4.4.0-0.nightly-2020-02-18-042756, it should be included. As I understand it, the cloud-api-provider also needs to revendor the new node drain feature (see comment 6). If the OCP deployment is using libvirt, then it probably does not have the new feature.

@vsibirsk, the must-gather does not contain any logs under machine-api-namespace and openshift-cluster-version. Can you please provide the following?

Which cloud provider was used for the OCP deployment.

oc get clusterversion -o yaml

oc logs machine-api-controllers-xxx -c machine-controller -n openshift-machine-api (for the machine with the error)

If you'd like to skip the node drain, set the annotation `machine.openshift.io/exclude-node-draining: ""` on the machine.
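
For example, a sketch against the failing machine from the logs above:

oc annotate machine vasiliy-vk-q4p9z-worker-cfsw7 machine.openshift.io/exclude-node-draining="" -n openshift-machine-api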

Comment 9 Andrew Beekhof 2020-03-12 07:45:20 UTC
(In reply to Jianwei Hou from comment #8)
> If the OCP version is newer than 4.4.0-0.nightly-2020-02-18-042756, it
> should be included. 

According to my notes, their testing was with version: 4.4.0-0.nightly-2020-03-02-011520

> As I understand, it also needs the cloud-api-provider to
> revendor the new node drain feature(as comment 6). If the OCP deployment is
> using libvirt then it probably does not have the new feature.
> 
> @vsibirsk, the must-gather does not contain any log under
> machine-api-namespace and openshift-cluster-version. Can you please provide
> the followings?
> 
> What the cloud provider was for the OCP deployment.
> 
> oc get clusterversion -o yaml


[cloud-user@ocp-psi-executor ~]$ oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2020-03-03T08:58:26Z"
    generation: 1
    name: version
    resourceVersion: "8543723"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: afb87553-c74d-44f4-bfb4-db642d78ec2d
  spec:
    channel: stable-4.4
    clusterID: 57e0ac37-3b02-4760-8581-50fef02ce0b4
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2020-03-03T09:27:48Z"
      message: Done applying 4.4.0-0.nightly-2020-03-02-011520
      status: "True"
      type: Available
    - lastTransitionTime: "2020-03-12T01:21:03Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2020-03-03T09:27:48Z"
      message: Cluster version is 4.4.0-0.nightly-2020-03-02-011520
      status: "False"
      type: Progressing
    - lastTransitionTime: "2020-03-03T08:58:40Z"
      message: 'Unable to retrieve available updates: currently installed version
        4.4.0-0.nightly-2020-03-02-011520 not found in the "stable-4.4" channel'
      reason: VersionNotFound
      status: "False"
      type: RetrievedUpdates
    desired:
      force: false
      image: registry.svc.ci.openshift.org/ocp/release@sha256:8a04012778c628211d705bb632276c16534c864cf4084c3c0d9d5b42087dade7
      version: 4.4.0-0.nightly-2020-03-02-011520
    history:
    - completionTime: "2020-03-03T09:27:48Z"
      image: registry.svc.ci.openshift.org/ocp/release@sha256:8a04012778c628211d705bb632276c16534c864cf4084c3c0d9d5b42087dade7
      startedTime: "2020-03-03T08:58:40Z"
      state: Completed
      verified: false
      version: 4.4.0-0.nightly-2020-03-02-011520
    observedGeneration: 1
    versionHash: b4c1puXoHrY=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""


> 
> oc logs machine-api-controllers-xxx -c machine-controller -n
> openshift-machine-api(with the error machine)

attached

> 
> If you'd like to skip the node drain, use annotation
> `machine.openshift.io/exclude-node-draining: ""` on the machine.

I don't believe this is relevant here, as it is the MHC that requested the failed machine be deleted, not the admin.

Comment 10 Andrew Beekhof 2020-03-12 07:46:33 UTC
Created attachment 1669570: machine-controller logs

Comment 11 Andrew Beekhof 2020-03-12 07:47:07 UTC
Created attachment 1669571: machine-healthcheck-controller logs

Comment 12 Nelly Credi 2020-04-26 13:50:33 UTC
This is affecting CNV and I see it is still on NEW. 
What is the plan?

Comment 13 Alberto 2020-04-27 14:58:49 UTC
Hi, can you clarify in which environment you are seeing "Node is not deleted from cluster during remediation"? baremetal/aws/azure/gcp/openstack?

Comment 14 Steven Hardy 2020-04-27 15:46:26 UTC
(In reply to Alberto from comment #13)
> Hi, can you clarify in which environment you are seeing "Node is not deleted
> from cluster during remediation"? baremetal/aws/azure/gcp/openstack?

+1, I think we need more information about the environment, e.g. the platform and how the cluster was deployed, to triage/assign this.

Adding needinfo to the reporter to hopefully get this information.

Comment 15 Israel Pinto 2020-04-27 17:04:32 UTC
@Alberto, @Steven
The test was done with OpenStack (PSI).

Comment 16 Alberto 2020-04-28 07:20:36 UTC
During a machine deletion, the machine controller will drain the node and can block permanently on a pod disruption budget.
If the node is unreachable, the drain will succeed in evicting the pods and will wait 5 minutes for the pods to be deleted after they are given a deletion timestamp. Then it will move forward and delete the node.

If this behaviour is not happening, it is likely that the openstack provider is not using the latest version of the machine controller.
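
One way to check which machine-controller build the cluster is actually running (assuming the stock machine-api-controllers deployment name) would be something like:

oc get deployment machine-api-controllers -n openshift-machine-api \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="machine-controller")].image}'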

Comment 18 Martin André 2020-04-30 10:37:39 UTC
This is possibly a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1810400 which is currently being worked on.

Comment 19 egarcia 2020-04-30 14:04:49 UTC
+1. Let's revisit when the BZ linked in the comment above is complete. This very likely has the same root cause.

Comment 22 egarcia 2020-05-07 15:49:26 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1810400 is a verified solution for 4.5. I will be backporting it to 4.4 shortly. This should allow you to properly delete nodes. I am going to close this bug as a duplicate.

*** This bug has been marked as a duplicate of bug 1810400 ***

