Bug 1905709 - Machine API deletion does not properly handle stopped instances on AWS or GCP
Summary: Machine API deletion does not properly handle stopped instances on AWS or GCP
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Michael Gugino
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-08 21:33 UTC by Clayton Coleman
Modified: 2021-07-27 22:35 UTC (History)
1 user

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: The machine-api can now successfully drain and delete nodes that are unreachable. Reason: PodDisruptionBudgets would disallow draining pods on an unreachable node due to missing upstream eviction API features. Result: Machine replacement operations via machine-api are simplified.
Clone Of:
Environment:
Last Closed: 2021-07-27 22:35:01 UTC
Target Upstream Version:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-api-operator pull 780 0 None closed Bug 1905709: Drain: ignore gracePeriod on unreachable kublets 2021-02-08 13:50:39 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:35:27 UTC

Description Clayton Coleman 2020-12-08 21:33:25 UTC
A powered off GCP machine (reported as terminated by GCP) should be deleted close to instantly by the machine api when the machine object is deleted.  Instead, it takes 20+ minutes on 4.7.  This causes the disruption quorum test to timeout on GCP (may be a problem on other clouds as well):

https://bugzilla.redhat.com/show_bug.cgi?id=1886160#c14

Scenario:

1. SSH to a node in GCP cluster and run "sudo -i systemctl poweroff --force --force"
2. Delete machine via oc
3. Wait

Expected:

Deletion of a terminated instance (according to GCP) should be instant - no drain is necessary as long as the machine is shut down

Actual:

Machine sticks around for 20m or more

Comment 1 Clayton Coleman 2020-12-08 22:08:54 UTC
Actually, after testing I'm still waiting an hour later.

So if terminated machines are not getting deleted from machines ever, that would be bad

Comment 2 Clayton Coleman 2020-12-08 22:22:07 UTC
https://github.com/openshift/origin/pull/25750 on GCP triggers the failure clearly (indicates the node is never removed and the terminated machine stays deleting).  It also tightens the windows for the test - it is unacceptable for deletion to stretch out that long.

Comment 3 Joel Speed 2020-12-09 13:13:28 UTC
Machines shouldn't be sticking around for that long. 

I'm struggling to find the failure on GCP, but I can see that the AWS disruptive test has failed [1]

Sadly that test run doesn't have the pod logs captured within the artifacts so I can't see what the machine controller was doing during the period of the test.

The only thing I can gather from the logs below is that we expect the Machine to be gone within 7 minutes.

Dec  9 01:41:31.775: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
Dec  9 01:41:31.818: INFO: Condition Ready of node ip-10-0-205-106.us-east-2.compute.internal is false, but Node is tainted by NodeController with [{node.kubernetes.io/unreachable  NoSchedule 2020-12-09 01:36:03 +0000 UTC} {node.cloudprovider.kubernetes.io/shutdown  NoSchedule <nil>} {node.kubernetes.io/unreachable  NoExecute 2020-12-09 01:36:09 +0000 UTC} {node.kubernetes.io/unschedulable  NoSchedule 2020-12-09 01:41:03 +0000 UTC}]. Failure

Currently in the machine controller [2], we have logic so that if the Node has gone (i.e. the instance has been removed from the cloud provider) then we should skip drain, but no logic that determines whether the VM is actually running or not.

In the case that the node is unreachable, such as when it is shut down, we still attempt draining but [3] have logic that will force the terminating pods to be removed after a fixed timeout of 5 minutes.

Therefore, the maximum amount of time the machine should stick around in this situation is 5 minutes after the machine controller initially starts to reconcile the deletion.

I believe this mechanism was meant to be a safe way of handling the case as described above and other partitions that may occur without specific logic being required for determining if the VM is stopped or not.

The question now is why this hasn't worked in this case. I can't determine anything from the logs of the AWS disruptive test and cannot see the failure on GCP, so will need to spin up a cluster to try and reproduce the issue. I believe this issue should be agnostic of the provider we are running on since the logic does not get involved with the cloud provider at all during deletion.

> So if terminated machines are not getting deleted from machines ever, that would be bad

Termination can mean different states across different providers. Eg there is a semantic difference between terminated and stopped on AWS and we handle those differently in our code. Could you clarify which you mean here? Is the instance deleted on the cloud provider side or just shutdown and able to be restarted? I suspect the latter but would prefer to clarify.

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25750/pull-ci-openshift-origin-master-e2e-aws-disruptive/1336471962565742592
[2] https://github.com/openshift/machine-api-operator/blob/063975e22e50288864918a2419fc298b151b330f/pkg/controller/machine/controller.go#L212-L222
[3] https://github.com/openshift/machine-api-operator/blob/063975e22e50288864918a2419fc298b151b330f/pkg/controller/machine/controller.go#L367-L371

Comment 4 Clayton Coleman 2020-12-09 14:27:34 UTC
If I stop a machine in GCP via kernel shutdown, GCP should update the cloud API within 10-15s to mark the instance as stopped, and I would expect the machine controller to detect that within 1m.  A deleting terminated machine in the API can be deleted from GCP immediately.

> Currently in the machine controller [2], we have logic so that if the Node has gone (ie instance has been removed from the cloud provider) then we should skip drain, but not logic that determines if the VM is actually running or not.

This would be a bug (the running status should be used).  All our supported control planes should correctly delete stopped or terminated instances rapidly (as fast as the controller detects it). It's good we have the unknown state and the logic there is correct - if we don't know what is happening, we shouldn't do anything.  But the stopped and terminating states on all three major clouds should be supported correctly and tested.

I believe we had a previous bug on this but I can't find it.

So:

1. If a cloud provider (that we know about) has an instance in the stopped state
2. And the machine object is deleted
3. The controller should delete the instance of that stopped machine
4. If deletion of the instance succeeds the machine object can now be deleted.

Comment 5 Michael Gugino 2020-12-09 14:56:01 UTC
If you force stop an instance (eg, kubelet goes unreachable) and it doesn't have blocking PDBs, the total time to machine deletion is the time it takes for taints to be applied and delete all the pods + 5 minutes.  So, in most cases, in a real world scenario, by the time you realize the node is dead and you delete it via the machine-api, it's going to be near instant removal.

If the instance is hanging around longer than that, it's because you have blocking PDBs.

Don't delete things outside of the machine-api and you won't have any problems.

Comment 6 Michael Gugino 2020-12-09 15:06:06 UTC
See this bug for an example of a scenario where delete two control plane hosts was previously possible: https://bugzilla.redhat.com/show_bug.cgi?id=1840358

We are protecting against this type of scenario by always enforcing drain.

With this change (I don't think it's downstream yet) we can successfully run drain against disrupted control plane hosts: https://github.com/kubernetes/kubernetes/pull/94381

Comment 7 Michael Gugino 2020-12-09 17:13:02 UTC
Some combination of behaviors is causing the deletion timestamp of certain pods to be set into the future, which I think is a bug.  This seems to correlate directly with a pod's gracePeriod.

https://github.com/kubernetes/apiserver/blob/master/pkg/registry/rest/delete.go#L118-L120

That is the line of code that updates the deletion timestamp, otherwise it is set to time.Now() on the first deletion attempt.

This may be a side effect of using '-1' as the GracePeriod in the drain library, which is supposed to honor the pod's specifications: https://github.com/openshift/cluster-api-provider-gcp/blob/master/vendor/github.com/openshift/machine-api-operator/pkg/controller/machine/controller.go#L351

This combination of things might have been exposed by this change, which allows evictions to succeed (instead of failing PDBs as it would have previously): https://github.com/kubernetes/kubernetes/pull/91342/files#diff-6105105bc4d40e3d45b9cc38165516be0205f0329929421fd7df65df7abaeb2cR250

Need to determine if setting the DeletionTimestamp to some point in the future is ever desirable; I think that's a bug.

Comment 8 Clayton Coleman 2020-12-09 17:54:07 UTC
Looks like AWS also fails to delete immediately in "stop then delete machine" (taken 8m after delete) - the instance should have been deleted (it's stopped) and then the machine should have completed deletion immediately after the machine showed up in terminated state

apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/instance-state: stopped
  creationTimestamp: "2020-12-09T14:35:33Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2020-12-09T17:45:55Z"
  finalizers:
  - machine.machine.openshift.io
  generateName: ci-ln-m0tdlc2-d5d6b-x2mwd-worker-us-east-2a-
  generation: 3
  labels:
    machine.openshift.io/cluster-api-cluster: ci-ln-m0tdlc2-d5d6b-x2mwd
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/cluster-api-machineset: ci-ln-m0tdlc2-d5d6b-x2mwd-worker-us-east-2a
    machine.openshift.io/instance-type: m4.xlarge
    machine.openshift.io/region: us-east-2
    machine.openshift.io/zone: us-east-2a
  managedFields:
  - apiVersion: machine.openshift.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName: {}
        f:labels:
          .: {}
          f:machine.openshift.io/cluster-api-cluster: {}
          f:machine.openshift.io/cluster-api-machine-role: {}
          f:machine.openshift.io/cluster-api-machine-type: {}
          f:machine.openshift.io/cluster-api-machineset: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"c43ef5f8-fb66-4aee-9656-a36ff908d06a"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        .: {}
        f:metadata: {}
        f:providerSpec:
          .: {}
          f:value:
            .: {}
            f:ami: {}
            f:apiVersion: {}
            f:blockDevices: {}
            f:credentialsSecret: {}
            f:deviceIndex: {}
            f:iamInstanceProfile: {}
            f:instanceType: {}
            f:kind: {}
            f:metadata: {}
            f:placement: {}
            f:securityGroups: {}
            f:subnet: {}
            f:tags: {}
            f:userDataSecret: {}
    manager: machineset-controller
    operation: Update
    time: "2020-12-09T14:35:33Z"
  - apiVersion: machine.openshift.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:nodeRef:
          .: {}
          f:kind: {}
          f:name: {}
          f:uid: {}
    manager: nodelink-controller
    operation: Update
    time: "2020-12-09T17:38:31Z"
  - apiVersion: machine.openshift.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:machine.openshift.io/instance-state: {}
        f:finalizers:
          .: {}
          v:"machine.machine.openshift.io": {}
        f:labels:
          f:machine.openshift.io/instance-type: {}
          f:machine.openshift.io/region: {}
          f:machine.openshift.io/zone: {}
      f:spec:
        f:providerID: {}
      f:status:
        .: {}
        f:addresses: {}
        f:lastUpdated: {}
        f:phase: {}
        f:providerStatus:
          .: {}
          f:conditions: {}
          f:instanceId: {}
          f:instanceState: {}
    manager: machine-controller-manager
    operation: Update
    time: "2020-12-09T17:45:55Z"
  name: ci-ln-m0tdlc2-d5d6b-x2mwd-worker-us-east-2a-hsn72
  namespace: openshift-machine-api
  ownerReferences:
  - apiVersion: machine.openshift.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: MachineSet
    name: ci-ln-m0tdlc2-d5d6b-x2mwd-worker-us-east-2a
    uid: c43ef5f8-fb66-4aee-9656-a36ff908d06a
  resourceVersion: "82916"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/ci-ln-m0tdlc2-d5d6b-x2mwd-worker-us-east-2a-hsn72
  uid: 102a2b47-73b4-49f2-9a3c-ad067a44fc76
spec:
  metadata: {}
  providerID: aws:///us-east-2a/i-078ddd964bfb21da9
  providerSpec:
    value:
      ami:
        id: ami-0c0b5ea0a4ebe87d5
      apiVersion: awsproviderconfig.openshift.io/v1beta1
      blockDevices:
      - ebs:
          encrypted: true
          iops: 0
          kmsKey:
            arn: ""
          volumeSize: 120
          volumeType: gp2
      credentialsSecret:
        name: aws-cloud-credentials
      deviceIndex: 0
      iamInstanceProfile:
        id: ci-ln-m0tdlc2-d5d6b-x2mwd-worker-profile
      instanceType: m4.xlarge
      kind: AWSMachineProviderConfig
      metadata:
        creationTimestamp: null
      placement:
        availabilityZone: us-east-2a
        region: us-east-2
      securityGroups:
      - filters:
        - name: tag:Name
          values:
          - ci-ln-m0tdlc2-d5d6b-x2mwd-worker-sg
      subnet:
        filters:
        - name: tag:Name
          values:
          - ci-ln-m0tdlc2-d5d6b-x2mwd-private-us-east-2a
      tags:
      - name: kubernetes.io/cluster/ci-ln-m0tdlc2-d5d6b-x2mwd
        value: owned
      - name: expirationDate
        value: 2020-12-09T18:23+00:00
      userDataSecret:
        name: worker-user-data
status:
  addresses:
  - address: 10.0.169.233
    type: InternalIP
  - address: ip-10-0-169-233.us-east-2.compute.internal
    type: InternalDNS
  - address: ip-10-0-169-233.us-east-2.compute.internal
    type: Hostname
  lastUpdated: "2020-12-09T17:45:55Z"
  nodeRef:
    kind: Node
    name: ip-10-0-169-233.us-east-2.compute.internal
    uid: 5ede4ccb-2bfe-476f-8beb-074f71c3a026
  phase: Deleting
  providerStatus:
    conditions:
    - lastProbeTime: "2020-12-09T14:36:12Z"
      lastTransitionTime: "2020-12-09T14:36:12Z"
      message: Machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreation
    instanceId: i-078ddd964bfb21da9
    instanceState: stopped

Comment 9 Clayton Coleman 2020-12-09 17:56:08 UTC
In the case where the cloud API says the machine is stopped, the machine api upon receiving a machine delete should bypass drain and immediately attempt to delete the instance.  If delete succeeds, then machine api should immediately delete the node and then the machine.  This is a shortcut we allow and there are no safety concerns remaining if delete succeeds (if delete for a cloud says the instance is deleted, the machine is never coming back and therefore no wait or drain is appropriate).

Comment 10 Clayton Coleman 2020-12-09 17:57:19 UTC
> Some combination of behaviors is causing the deletion timestamp of certain pods to be set into the future, which I think is a bug.  This seems to directly correlated with a pod's gracePeriod.

oc explain pod.metadata.deletionTimestamp
KIND:     Pod
VERSION:  v1

FIELD:    deletionTimestamp <string>

DESCRIPTION:
     DeletionTimestamp is RFC 3339 date and time at which this resource will be
     deleted. This field is set by the server when a graceful deletion is
     requested by the user, and is not directly settable by a client. The
     resource is expected to be deleted (no longer visible from resource lists,
     and not reachable by name) after the time in this field, once the
     finalizers list is empty. As long as the finalizers list contains items,
     deletion is blocked. Once the deletionTimestamp is set, this value may not
     be unset or be set further into the future, although it may be shortened or
     the resource may be deleted prior to this time. For example, a user may
     request that a pod is deleted in 30 seconds. The Kubelet will react by
     sending a graceful termination signal to the containers in the pod. After
     that 30 seconds, the Kubelet will send a hard termination signal (SIGKILL)
     to the container and after cleanup, remove the pod from the API. In the
     presence of network partitions, this object may still exist after this
     timestamp, until an administrator or automated process can determine the
     resource is fully terminated. If not set, graceful deletion of the object
     has not been requested.

     Populated by the system when a graceful deletion is requested. Read-only.
     More info:
     https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata

Comment 11 Clayton Coleman 2020-12-09 17:59:01 UTC
> If you force stop an instance (eg, kubelet goes unreachable) and it doesn't have blocking PDBs, the total time to machine deletion is the time it takes for taints to be applied and delete all the pods + 5 minutes.  So, in most cases, in a real world scenario, by the time you realize the node is dead and you delete it via the machine-api, it's going to be near instant removal.

The bug is that it's taking 5 + infinite minutes on GCP and AWS. It should take 5 minutes + about 15-60s, because a stopped machine cannot be drained, it should be deleted.

Comment 12 Clayton Coleman 2020-12-09 18:00:56 UTC
When you fix this please add an e2e test so we don't regress in the future.  Stopped machines should not block deletion and are a key part of the machine health check and recovery process for masters.

Comment 13 Michael Gugino 2020-12-09 19:27:56 UTC
Well, whoever decided to make a 'timestamp' field not be a timestamp of when an action occurred chose a really terrible name for that field.  I wrongly assumed the timestamp recorded the time when the pod was marked deleted.  It seems that gracePeriodSeconds should have been added to the timestamp at the kubelet to determine the SIGKILL deadline, but we probably can't change that now.

The problems discussed in this bug regarding the inability to drain are not specific to instances that are stopped or destroyed, this would impact any unreachable kubelet.

Looks like we can potentially set 'gracePeriodTimeout' to 1s when we call drain in the case of unreachable nodes and achieve the desired 5m behavior (with the caveat that if the node were to come back, a bunch of stuff would be immediate SIGKILL'ed, which depending on timing, would happen to most pods today anyway).

When the kubelet goes unreachable, after some time, taints are applied.

Anything that doesn't tolerate taints gets a deletionTimestamp with its nominal gracePeriod.
Anything that tolerates does not.

We set the ignore after 5m option, as well as 1s grace period:

This resets deletionTimestamp to when the pod was originally marked deleted + 1 second (T0)
At T0+5m those pods will be ignored and drain will complete.

For pods that were not marked deleted (those with tolerations), eviction proceeds as normal, except
now we're specifying a 1s gracePeriod.  Five minutes after we attempt the first drain (rather than
five minutes after taints were applied), this pod will be ignored and drain will succeed
(provided PDBs are not violated; PDBs will block eviction in this case, and gracePeriod will
have no effect as the pod won't be marked deleted).


Proposed fix:  Set gracePeriodTimeout to 1s when kubelet unreachable in machine-api drain invocation.  This may result in a small increase to draining times for anything that tolerates unreachable/noexecute taint, at most adding 5 minutes from initial machine delete command.  In almost all cases, the results of this change will be imperceptible and not result in increases in drain time.
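The proposed behavior might be sketched as a tiny helper (illustrative only; `-1` stands for the drain library's "honor the pod's own spec" setting referenced in comment 7):

```go
package main

import "fmt"

// effectiveGracePeriod sketches the proposed fix: keep honoring each
// pod's own grace period (-1 in the drain options) on reachable nodes,
// but clamp to 1s when the kubelet is unreachable, so deletionTimestamp
// lands almost immediately and the 5m ignore window starts counting
// from the first drain attempt instead of hours in the future.
func effectiveGracePeriod(kubeletUnreachable bool) int64 {
	if kubeletUnreachable {
		return 1
	}
	return -1
}

func main() {
	fmt.Println(effectiveGracePeriod(true), effectiveGracePeriod(false))
}
```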

As far as skipping drain if the machine is stopped (definitely not doing this) or terminated, I prefer not to.  If we always ensure we drain, we can't make any tragic mistakes, as we'll always respect PDBs.  5M is not a long time to wait for an instance/node to be removed.

Comment 14 Clayton Coleman 2020-12-09 20:05:51 UTC
> The problems discussed in this bug regarding the inability to drain are not specific to instances that are stopped or destroyed, this would impact any unreachable kubelet

A stopped Kubelet and an unreachable Kubelet are completely different things, and we need different behavior.

A stopped Kubelet *IS NOT* an unreachable Kubelet - it is a kubelet where we know that the reason for unreachable is that the cloud has stopped it.  Therefore we can rely on an authoritative API (the cloud API) to delete the machine, which guarantees the workloads are stopped.  If the cloud API doesn't explicitly tell us it is stopped, we cannot assume anything about the instance.

> As far as skipping drain if the machine is stopped (definitely not doing this) or terminated, I prefer not to do this.  If we always ensure we drain, we can't make any tragic mistakes as we'll always respect PDBs.  5M is not a long time to wait for an instance/node to be removed.

Drain does nothing, because if a machine is stopped, the pod is already unready (stopped) and doesn't count towards PDB.  The correct behavior upon receiving a node delete for a stopped node is to delete the instance, then delete the node object (which triggers release of all Kube objects), and then delete the machine.  That is the designed behavior of Kubernetes - a stopped machine MUST be able to clear immediately, because the stop info is fully determinate (we know the workloads aren't running).

The argument that we need to drain presumes that a stopped machine has running workloads, which is not possible.  You cannot drain it, it is already "out".

Comment 15 Michael Gugino 2020-12-09 20:54:28 UTC
>  A stopped Kubelet and an unreachable Kubelet are completely different things, and we need different behavior.

From the perspective of the NLM, a stopped instance is an unreachable kubelet.  We don't need different behavior because we handle the situation identically for each case.

> Drain does nothing, because if a machine is stopped, the pod is already unready (stopped) and doesn't count towards PDB.

This is incorrect.  The pod will count as a disrupted pod against the PDB.  This will prevent other (additional) pods from being disrupted.  For the stopped node, in 1.20 (not merged downstream yet), if the unreachable node being drained has a pod covered by that PDB, and that PDB has enough *other* healthy pods, drain will succeed.  Otherwise, it will fail.  This is by design to prevent the scenario here: https://bugzilla.redhat.com/show_bug.cgi?id=1840358

This situation was actually hit in real life right here: https://bugzilla.redhat.com/show_bug.cgi?id=1840358#c2

This might not be important if 1 host is stopped, but it will probably be important if 2+ hosts are stopped.  As long as we always drain, nobody can point the finger at us for doing the wrong thing and deleting their data when we shouldn't.

The net result is, even if drain doesn't actually result in any pods actually being stopped, what it does do is serve as an emergency backstop to ensure we don't delete data for any application that is covered by a PDB unless there are enough healthy replicas of that application.
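That backstop can be modeled roughly as follows (a simplified sketch of the eviction check described above, not the real PDB accounting in the eviction API):

```go
package main

import "fmt"

// canEvict models the 1.20 behavior described above: an unhealthy pod
// (for example one on a stopped node) may be evicted as long as the PDB
// still has enough *other* healthy pods without it; evicting a healthy
// pod must leave at least desiredHealthy pods behind.
func canEvict(podHealthy bool, currentHealthy, desiredHealthy int) bool {
	if !podHealthy {
		// Evicting an already-unhealthy pod does not reduce the
		// healthy count, so it only needs the PDB to be satisfied now.
		return currentHealthy >= desiredHealthy
	}
	return currentHealthy-1 >= desiredHealthy
}

func main() {
	// Pod on the stopped node is unhealthy; 3 healthy replicas elsewhere, PDB wants 3: allowed.
	fmt.Println(canEvict(false, 3, 3))
	// Healthy pod, PDB wants 3 of 3: blocked.
	fmt.Println(canEvict(true, 3, 3))
}
```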

Comment 16 Clayton Coleman 2020-12-09 22:27:33 UTC
> From the perspective of the NLM, a stopped instance is an unreachable kubelet. We don't need different behavior because we handle the situation identically for each case.

I'm not sure whether I care about NLM, I'm just describing the designed behaviour of OpenShift (which NLM is responsible for implementing what is within its role, not doing arbitrary things).

> The net result is, even if drain doesn't actually result in any pods actually being stopped, what it does do is serve as an emergency backstop to ensure we don't delete data for any application that is covered by a PDB unless there enough healthy replicas of that application.

Ok, I buy the argument in that scenario that you have to call PDB drain. However, since the machine is stopped, we know the processes are not running, and delete is pending, the drain should happen almost instantaneously (in the case in question no PDB would be violated).  Are you saying you know what the bug is such that the drain doesn't execute quickly?  On a stopped instance you have no guarantee the node is coming back, so you can't just wait arbitrarily long to drain on that machine - you have to drain from another machine.  Why is drain "hung" in this case?

Comment 17 Clayton Coleman 2020-12-09 22:32:53 UTC
Actually, thinking about this again I'm not sure the PDB argument flies.  A stopped instance is requested to be deleted. Waiting "X amount of time" before giving up draining and just deleting is no more correct than just continuing - either we decide as a system that the machine delete is an expression of intent (like a stop + node object delete would be in Kube), or we wait forever until the PDBs clear.  Giving up after a fixed amount of time will not be an option.

If we choose to wait forever, we would have to a) signal that via the machine API (indicating the user needs to take action) and b) alert on it, because the stopped instance will not magically clear.  The current behavior is definitely insufficient.  The system may not wait forever without clear feedback to a user (probably after the first round of evictions fail to complete).

Comment 18 Michael Gugino 2020-12-10 01:13:14 UTC
(In reply to Clayton Coleman from comment #16)

> Ok, I buy the argument in that scenario that you have to call PDB drain.
> However, since the machine is stopped, we know the processes are not
> running, and delete is pending, the drain should happen almost instanteously
> (in the case in question no PDB would be violated).  Are you saying you know
> what the bug is such that the drain doesn't execute quickly?  On a stopped
> instance you have no guarantee the node is coming back, so you can't just
> wait arbitrarily long to drain on that machine - you have to drain from
> another machine.  Why is drain "hung" on this case?

Drain is hung in this particular bug's case because we didn't properly account for how DeletionTimestamp actually works.  We hit this bug under the following conditions: 1) Kubelet goes unreachable for one reason or another 2) Machine is deleted via machine-api 3) Pod being drained has a high gracePeriod.  In our case, I think it was the router pod, and it had 3600 second timeout.

We aren't waiting arbitrarily long, if the kubelet is unreachable, we (today) wait until deletionTimestamp + 5M for each pod.  We don't set gracePeriodTimeout on drain today, we honor each pod's setting.  After that period, we ignore the pod.  We do this because if a pod has a deletion timestamp, it means the pod was already deleted or evicted, so PDBs are already disrupted, and we don't need to stick around and wait any longer.  We originally chose 5M because some people didn't want it to be instant, possibly for node flapping, but ultimately there's no strong need to do this.


(In reply to Clayton Coleman from comment #17)
> Actually, thinking about this again I'm not sure the PDB argument flies.  A
> stopped instance is requested to be deleted. Waiting "X amount of time"
> before giving up draining and just deleting is no more correct than just
> continuing - either we decide as a system that the machine delete is an
> expression of intent (like a stop + node object delete would be in Kube), or
> we wait forever until the PDBs clear.  Giving up after unit time will not be
> an option.

We don't just give up, PDBs will indeed block forever and that's an intentional design.  We quit waiting for pods with deletion timestamps to disappear from the node after 5 minutes if the kubelet is unreachable (because they'll never disappear because the kubelet doesn't ever confirm they're gone).  In the case of an unreachable kubelet, any pod without a 'noexecute' toleration will be deleted by the scheduler automatically.  Unfortunately, this is a 'delete' and not an 'evict' so PDBs are not respected.  This is why we specified etcd-quorum-guard to tolerate noexecute indefinitely; this prevents the scheduler from deleting it and allows us to actually evict it.

> If we choose to wait forever, we would have to a) signal that via the
> machine API (indicating the user needs to take action) and b) alert on it,
> because the stopped instance will not magically clear.  The current behavior
> is definitely insufficient.  The system may not wait forever without clear
> feedback to a user (probably after the first round of evictions fail to
> complete).

We already have some alerts for nodes that are not in the 'running' phase (eg, in the deleting phase), but it takes 60M for those alerts to fire.  We consider worker node alerts secondary conditions; if something critical needs to alarm sooner, that component should have its own monitoring/alarms (eg, if you only have 3 nodes, 1 goes down, router should complain, etc).

We should enable custom monitoring and alerts for our components.  For instance, our CI users want certain jobs to run with PDBs having MaxDisruptions=0.  This is to ensure the jobs run to completion (I think they do this for builds).  So, it's totally expected that eviction will be blocked for 30M+ and there's no reason to flap alerts.  In any case, we only wait 20s per reconcile for a given drain attempt to complete.  There's a portion called 'wait for delete' in the drain library that just polls the API waiting for the pods to be removed from the node in question.  After 20s, we give up so we're not blocked for minutes at a time and we can reconcile whatever is next.  If someone sets a very high timeout, a component that doesn't bother to stop itself and just traps the signal (or whatever) is going to block drain indefinitely.

Comment 19 Clayton Coleman 2020-12-10 16:43:10 UTC
> We aren't waiting arbitrarily long, if the kubelet is unreachable, we (today) wait until deletionTimestamp + 5M for each pod.  We don't set gracePeriodTimeout on drain today, we honor each pod's setting.  After that period, we ignore the pod.  We do this because if a pod has a deletion timestamp, it means the pod was already deleted or evicted, so PDBs are already disrupted, and we don't need to stick around and wait any longer.  We originally chose 5M because some people didn't want it to be instant, possibly for node flapping, but ultimately there's no strong need to do this.

I want to make sure we're on the same page (maybe?) - for a stopped node grace period does not apply (it is 0).  If you know that the node is stopped (which you do from the cloud provider) the Kubelet's role in stopping the pod does not apply (the Kubelet cannot satisfy your request because it is also stopped).  Grace period is defined for pods in terms of "the time for a running process to acknowledge cleanup".

I agree PDB constraints should be satisfied, so the expectation I would have based on our discussion is:

1. A stopped and unreachable machine instance that is drained behaves differently from an unreachable machine instance
2. A stopped and unreachable machine drain assumes grace period 0 no matter what the pod specifies, and can only be blocked by a PDB
3. If blocked by a PDB, wait a medium period (~ a few minutes?) then go back to 1
4. Once a drain has successfully completed (ignoring everything except PDB restrictions) and all pods are marked deleted, a stopped machine should be deleted via the cloud provider API
5. Once the machine is acknowledged as deleted from cloud provider API, the node object should be deleted, signalling to Kube that the pods are released (and the pods will be deleted by pod GC)
6. The machine object is deleted

The only waits in this period (because grace period doesn't apply) would be on a retry of PDB clearing - but if PDB clears, then we should move through 4-6 as fast as possible.
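The six steps above could be sketched as a simple decision function. This is a hypothetical illustration of the proposed ordering, not the machine-api's actual types or field names:

```go
package main

import "fmt"

// state is a hypothetical view of a machine being deleted; the real
// machine-api controller derives this from the Machine, Node, and
// cloud provider APIs.
type state struct {
	instanceStopped bool // cloud provider reports the VM stopped
	pdbBlocked      bool // a PodDisruptionBudget is blocking eviction
	podsDeleted     bool // all pods on the node carry deletion timestamps
	instanceDeleted bool // cloud API acknowledged instance deletion
	nodeDeleted     bool // node object removed
}

// nextStep encodes the proposed ordering: a stopped instance drains with
// an effective grace period of 0, PDB blocks cause a retry, and only then
// do we delete the instance, the node object, and finally the machine.
func nextStep(s state) string {
	switch {
	case !s.instanceStopped:
		return "normal drain (grace periods honored)"
	case s.pdbBlocked:
		return "requeue and retry drain shortly"
	case !s.podsDeleted:
		return "drain with effective grace period 0"
	case !s.instanceDeleted:
		return "delete instance via cloud provider API"
	case !s.nodeDeleted:
		return "delete node object (pods released to pod GC)"
	default:
		return "delete machine object"
	}
}

func main() {
	// A stopped instance whose pods are all marked deleted moves straight
	// to the cloud provider delete call.
	fmt.Println(nextStep(state{instanceStopped: true, podsDeleted: true}))
}
```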

Comment 20 Joel Speed 2020-12-11 11:57:29 UTC
> 1. A stopped and unreachable machine instance that is drained behaves differently from an unreachable machine instance
> 2. A stopped and unreachable machine drain assumes grace period 0 no matter what the pod specifies, and can only be blocked by a PDB
> 3. If blocked by a PDB, wait a medium period (~ a few minutes?) then go back to 1
> 4. Once a drain has successfully completed (ignoring everything except PDB restrictions) and all pods are marked deleted, a stopped machine should be deleted via the cloud provider API
> 5. Once the machine is acknowledged as deleted from cloud provider API, the node object should be deleted, signalling to Kube that the pods are released (and the pods will be deleted by pod GC)
> 6. The machine object is deleted

This matches my understanding of the issue. Based off of this and the other parts of this thread, I think there are a couple of things we need to do to resolve the issues here (I see this as two bugs):

1. We need to fix the drain implementation so that the `skip-wait-for-delete-seconds` option takes into account the grace period of the terminating pods. At present this is having absolutely no effect because it doesn't take the grace period into account.
  a. The current logic finds the time elapsed since the deletion timestamp (time.Now() - deletionTimestamp); it is my understanding that the deletion timestamp is the point at which the GC removes the object, so this should always be negative (it is set to the Delete request time + grace period).
  b. It compares the (negative) elapsed time to 5 minutes, if the elapsed time is greater than 5 minutes, it ignores the pod in the drain (this is currently always false, so pods are never ignored)
  c. It should calculate (time.Now() - (deletionTimestamp - deletionGracePeriodSeconds)) to find the time elapsed since the initial delete request was made

2. We should not wait for termination grace periods when we know that a VM is shut down
  a. Currently we rely on the same logic as it being unreachable, and expect the broken `skip-wait-for-delete-seconds` option to cover us
  b. If we know that a Node is stopped, we can force the grace period to 0 seconds as this will not affect anything (everything is already known to be stopped).
  c. With a grace period of zero, but continuing to use the drain library, any pod not blocked by a PDB will be deleted immediately. Any pod blocked by a PDB will be retried indefinitely as is done now (this prevents us regressing on the etcd bug Mike mentioned before).

These changes should improve the UX and keep us safe from a data loss perspective.

My only concern is that the second fix will be quite a major effort to implement. We don't currently have an easy `Stopped()` method on the actuator interface and as such will need to determine a suitable way to either add this new method or bring provider specific logic into the Machine Controller. Given that we are already past feature freeze, and the size/impact/effort required to resolve this second bug, I'd like to suggest we target this to 4.8 and evaluate backporting to 4.7 once we have seen how much code we have to touch. TL;DR; I don't think we should make the second part of this a blocker for 4.7.

If we can fix the drain library for 4.7, the advantage of the second fix becomes just that we don't wait 5 minutes on a stopped node, so I think we should prioritise this and make sure that we wait a maximum of 5 minutes for any unreachable node. @mgugino@redhat.com Do you think it would be possible to tackle the first part of this and get a fix asap?

Comment 21 Clayton Coleman 2020-12-11 14:12:53 UTC
Note that in the scenario I'm describing the "effective grace period" (because the machine is stopped) is 0.  The real grace period will continue to be non-zero.  However, if at the time of the drain you know that the machine is stopped you can override the grace period.  It wasn't clear to me whether your statement about (2) implied that you can't start (1) (or whether that's a different part of the logic)?

But the more important part:

You should never do math with deletionTimestamp ever.  NO component of Kubernetes is allowed to compare time.Now() to any recorded timestamp.  Instead, you are required to have a local clock and "start the clock" when you sync the object.  If you look at the namespace controller you'll see this behavior, and we reference it in many of the original design documents.  The reason is that you can't rely on different parts of the system having synchronized clocks (we mandate it, but we design the system to tolerate skew and to not fail in an unsafe manner).  So you cannot have ANY logic in drain that compares time.Now to a value stored in an api object.  Instead, when you see an object that has a deletionTimestamp set, you must "start the clock" (`begin := time.Now()`) and then leverage the grace period seconds on the pod (the one in metadata) relative to that.  So your logic 1a-c cannot be used.

Comment 22 Joel Speed 2020-12-11 14:49:29 UTC
>  However, if at the time of the drain you know that the machine is stopped you can override the grace period. 

Yeah I think we are on the same page with this one. If we know the machine is stopped, we can override the effective grace period to 0 (delete straight away). I don't think any part of (2) is blocking us from starting (1).

The idea around (1) is to fix a bug present in the Kubectl library we are using [1].
The idea around (2) would be to add new code to the machine controller around [2] that somehow (TBD) determines that the VM is stopped, and sets the drain.GracePeriodSeconds = 0.
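The idea in (2) could look roughly like the following, assuming a hypothetical settings struct rather than the real drain library Helper:

```go
package main

import "fmt"

// drainSettings is a hypothetical stand-in for the knobs passed to the
// drain helper; the real code configures the kubectl drain library.
type drainSettings struct {
	GracePeriodSeconds int // -1 means "honor each pod's own setting"
}

// settingsFor picks drain settings: a node known (from the cloud provider)
// to be stopped gets an effective grace period of 0, since no process is
// left running that could make use of the shutdown window.
func settingsFor(nodeStopped bool) drainSettings {
	if nodeStopped {
		return drainSettings{GracePeriodSeconds: 0}
	}
	return drainSettings{GracePeriodSeconds: -1}
}

func main() {
	fmt.Println(settingsFor(true).GracePeriodSeconds)  // 0: delete straight away
	fmt.Println(settingsFor(false).GracePeriodSeconds) // -1: honor pod settings
}
```

The hard part, as noted below, is not this override itself but determining "the VM is stopped" without a `Stopped()` method on the actuator interface.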

> You should never do math with deletionTimetamp ever.  NO component of Kubernetes is allowed to compare time.Now() to any recorded timestamp. ...  So your logic 1a-c cannot be used.

We are using a library from kubectl which is doing the math on the timestamp [1]. My suggestion was that we fix this logic rather than add anything new.
It sounds like we should be raising an issue upstream with the reasoning you've included to try and rip out this logic and come up with a better solution for the feature that doesn't rely on this math.

I understand what you're saying here and it makes a lot of sense. I'm gonna need to spend a bit of time to work out if there's a way around this within our controllers. We currently fail the drain after 20 seconds and requeue to prevent the controller blocking on long drains; knowing when each pod was successfully evicted and waiting for that grace period could be quite tricky if we continue to use the drain library as it is today, since we don't get output from when pods are evicted.

If we had our own drain implementation that sent the evictions, we could track this better, not sure where we would store the information though if we are still doing our 20 second requeue.

[1]: https://github.com/kubernetes/kubectl/blob/3881818950ff64d26a0f786e599637952ff6654d/pkg/drain/filters.go#L248-L250
[2]: https://github.com/openshift/machine-api-operator/blob/b57d98768d45805fc5b0136bf768631b97eecbe0/pkg/controller/machine/controller.go#L346-L371

Comment 23 Michael Gugino 2020-12-11 15:04:53 UTC
There are some things getting confused in this thread.

1) we only look at the deletionTimestamp of pods if the kubelet is unreachable.  We're not fundamentally concerned with race conditions or accurate time keeping, we're going to assume the pods running on the control plane host (e.g., machine-controller) have the same time as the api-server.

2) I'm proposing we adjust the gracePeriodTimeout to 1s for pods on unreachable kubelets.  Waiting doesn't really make sense, it exposed the bug we have currently.  The only impact this change will have is in the off chance that a kubelet comes back online after we initiate the process, pods will receive a SIGKILL rather than a SIGTERM, but this will only affect pods that are not covered by PDBs which cannot be satisfied.  If the kubelet comes back, all those pods are going to be terminated, most of them would have been SIGKILL'ed in today's implementation as it will likely be 15M+ before any action is actually taken against machine-api to remediate the broken node.

3) I'm not inclined to treat 'stopped' instances any differently than 'unreachable' instances.  What I think is being forgotten here is that in the real world, the kubelet doesn't go unreachable and then immediately we drain and terminate the instance.  There will be some time before MHC (if being used) kicks in and decides to terminate the instance, or it will probably take some time for the user to notice the kubelet is unreachable and trigger the machine-api themselves.

The ignoreDeletionTimestamp logic is built into the drain library (kubectl) now, it's not just for controllers, it's for users to invoke as they wish.  The option works exactly as intended, minus we didn't understand how deletionTimestamp was implemented in the API, but it's still useful logic to have for users.  I'm not planning on changing this option upstream as it shipped a couple releases ago and is useful by itself.

A user bash script using this logic might look as follows:

1) Invoke drain as normal
2) If (1) fails, invoke drain with --ignore-deletion-timestamp=5m

There's not really a place to record these 'start the clock now' points for CLI runs without some sort of temporary file and large changes to the drain library.

Comment 24 Michael Gugino 2020-12-11 15:22:31 UTC
If we absolutely insist that we can't compare the deletionTimeStamp from pods (which I disagree with) and *must have* a timer local to the controller, we can achieve this with the custom drain filters logic: https://github.com/kubernetes/kubectl/commit/a28f944c902117183385ca1b95c6b21e01b19c28#diff-27e1a7da1fc0826f3b709c7a67c8d3d5dbc4cecf58267d612f27cf968d619ef7

This allows you to pass an arbitrary function to the filters (the things that exclude pods during drain) to record information about pods and then after whatever timing has passed according to our controller's logic, to exclude those pods.  This would be trivial, but also unnecessary, as setting ignoreDeletionTimestamp to a low value and having a low gracePeriodTimeout effectively makes this a binary operation, unless you want to start the 5M clock from the time we (the machine-api) first attempts to drain a pod, which is going to add more time to most drain operations rather than subtract time from most drain operations.

Comment 25 Joel Speed 2020-12-17 17:17:14 UTC
Mike is going to come up with a fix based on his grace period idea in the next week or so

Comment 27 sunzhaohua 2021-01-20 11:11:11 UTC
Verified
clusterversion: 4.7.0-0.nightly-2021-01-19-060737

1. SSH to a node in GCP cluster and run "sudo -i systemctl poweroff --force --force"; the chosen node has no pods covered by a PDB
$ oc debug node/zhsungcp119-99rvl-worker-c-k8g8b.c.openshift-qe.internal
Creating debug namespace/openshift-debug-node-c4zwp ...
Starting pod/zhsungcp119-99rvl-worker-c-k8g8bcopenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.32.6
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# sudo -i systemctl poweroff --force --force
Powering off.

Removing debug pod ...
Removing debug namespace/openshift-debug-node-c4zwp ...
2. Delete machine via oc
$ oc delete machine zhsungcp119-99rvl-worker-c-k8g8b
machine.machine.openshift.io "zhsungcp119-99rvl-worker-c-k8g8b" deleted
3. Machine is deleted quickly
I0120 11:03:42.095375       1 controller.go:168] zhsungcp119-99rvl-worker-c-k8g8b: reconciling Machine
I0120 11:03:42.095392       1 controller.go:208] zhsungcp119-99rvl-worker-c-k8g8b: reconciling machine triggers delete
I0120 11:03:42.099401       1 controller.go:337] Could not find node from noderef, it may have already been deleted: zhsungcp119-99rvl-worker-c-k8g8b.c.openshift-qe.internal
I0120 11:03:42.099428       1 actuator.go:125] zhsungcp119-99rvl-worker-c-k8g8b: Deleting machine
I0120 11:03:42.270621       1 reconciler.go:342] zhsungcp119-99rvl-worker-c-k8g8b: Machine does not exist
I0120 11:03:42.270658       1 reconciler.go:359] zhsungcp119-99rvl-worker-c-k8g8b: Machine not found during delete, skipping
I0120 11:03:42.270693       1 actuator.go:84] zhsungcp119-99rvl-worker-c-k8g8b: Checking if machine exists
I0120 11:03:42.270943       1 recorder.go:98] controller-runtime/manager/events "msg"="Normal"  "message"="Deleted machine zhsungcp119-99rvl-worker-c-k8g8b" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"zhsungcp119-99rvl-worker-c-k8g8b","uid":"97e369d9-d097-43ce-9a1b-c2bf399bd44b","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"483956"} "reason"="Delete"
I0120 11:03:42.512038       1 reconciler.go:342] zhsungcp119-99rvl-worker-c-k8g8b: Machine does not exist
I0120 11:03:42.512086       1 controller.go:246] zhsungcp119-99rvl-worker-c-k8g8b: deleting node "zhsungcp119-99rvl-worker-c-k8g8b.c.openshift-qe.internal" for machine
I0120 11:03:42.512122       1 controller.go:392] Node "zhsungcp119-99rvl-worker-c-k8g8b.c.openshift-qe.internal" not found
I0120 11:03:42.544077       1 controller.go:260] zhsungcp119-99rvl-worker-c-k8g8b: machine deletion successful

Comment 28 Clayton Coleman 2021-02-01 03:15:27 UTC
Doesn't look correct to me in GCP in master:

Deleted, still waiting about 10m later.  Did this regress?

E0201 03:05:00.692500       1 controller.go:216] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: failed to drain node for machine: requeue in: 20s
I0201 03:05:00.713642       1 controller.go:168] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: reconciling Machine
I0201 03:05:00.713683       1 controller.go:208] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: reconciling machine triggers delete
I0201 03:05:00.720491       1 controller.go:365] "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7": Node "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7" is unreachable, draining will wait 300 seconds after pod is signalled for deletion and skip after it
W0201 03:05:21.177003       1 controller.go:378] drain failed for machine "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7": [error when waiting for pod "image-registry-5b48fbdc88-x6qxb" terminating: global timeout reached: 20s, error when waiting for pod "ssh-bastion-5fcf8d7d9b-4r7hj" terminating: global timeout reached: 20s, error when waiting for pod "migrator-689747c59-krk4r" terminating: global timeout reached: 20s, error when waiting for pod "certified-operators-lqwjk" terminating: global timeout reached: 20s, error when waiting for pod "alertmanager-main-2" terminating: global timeout reached: 20s, error when waiting for pod "network-check-source-78cb79769b-5wfv2" terminating: global timeout reached: 20s, error when waiting for pod "router-default-5cbbbdfd4f-bczl6" terminating: global timeout reached: 20s, error when waiting for pod "kube-state-metrics-7bf99fb87-z2rz5" terminating: global timeout reached: 20s, error when waiting for pod "openshift-state-metrics-5cd6797d47-kgwbs" terminating: global timeout reached: 20s, error when waiting for pod "redhat-marketplace-6t282" terminating: global timeout reached: 20s, error when waiting for pod "service-test-gn7lf" terminating: global timeout reached: 20s]
E0201 03:05:21.177078       1 controller.go:216] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: failed to drain node for machine: requeue in: 20s
I0201 03:05:22.908829       1 controller.go:168] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: reconciling Machine
I0201 03:05:22.908853       1 controller.go:208] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: reconciling machine triggers delete
I0201 03:05:22.919456       1 controller.go:365] "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7": Node "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7" is unreachable, draining will wait 300 seconds after pod is signalled for deletion and skip after it
W0201 03:05:43.374721       1 controller.go:378] drain failed for machine "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7": [error when waiting for pod "network-check-source-78cb79769b-5wfv2" terminating: global timeout reached: 20s, error when waiting for pod "migrator-689747c59-krk4r" terminating: global timeout reached: 20s, error when waiting for pod "kube-state-metrics-7bf99fb87-z2rz5" terminating: global timeout reached: 20s, error when waiting for pod "ssh-bastion-5fcf8d7d9b-4r7hj" terminating: global timeout reached: 20s, error when waiting for pod "image-registry-5b48fbdc88-x6qxb" terminating: global timeout reached: 20s, error when waiting for pod "router-default-5cbbbdfd4f-bczl6" terminating: global timeout reached: 20s, error when waiting for pod "redhat-marketplace-6t282" terminating: global timeout reached: 20s, error when waiting for pod "openshift-state-metrics-5cd6797d47-kgwbs" terminating: global timeout reached: 20s, error when waiting for pod "service-test-gn7lf" terminating: global timeout reached: 20s, error when waiting for pod "alertmanager-main-2" terminating: global timeout reached: 20s, error when waiting for pod "certified-operators-lqwjk" terminating: global timeout reached: 20s]
E0201 03:05:43.374795       1 controller.go:216] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: failed to drain node for machine: requeue in: 20s
I0201 03:05:43.987244       1 controller.go:168] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: reconciling Machine
I0201 03:05:43.987267       1 controller.go:208] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: reconciling machine triggers delete
I0201 03:05:44.000131       1 controller.go:365] "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7": Node "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7" is unreachable, draining will wait 300 seconds after pod is signalled for deletion and skip after it
W0201 03:06:04.451689       1 controller.go:378] drain failed for machine "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7": [error when waiting for pod "service-test-gn7lf" terminating: global timeout reached: 20s, error when waiting for pod "network-check-source-78cb79769b-5wfv2" terminating: global timeout reached: 20s, error when waiting for pod "router-default-5cbbbdfd4f-bczl6" terminating: global timeout reached: 20s, error when waiting for pod "redhat-marketplace-6t282" terminating: global timeout reached: 20s, error when waiting for pod "certified-operators-lqwjk" terminating: global timeout reached: 20s, error when waiting for pod "migrator-689747c59-krk4r" terminating: global timeout reached: 20s, error when waiting for pod "ssh-bastion-5fcf8d7d9b-4r7hj" terminating: global timeout reached: 20s, error when waiting for pod "image-registry-5b48fbdc88-x6qxb" terminating: global timeout reached: 20s, error when waiting for pod "openshift-state-metrics-5cd6797d47-kgwbs" terminating: global timeout reached: 20s, error when waiting for pod "alertmanager-main-2" terminating: global timeout reached: 20s, error when waiting for pod "kube-state-metrics-7bf99fb87-z2rz5" terminating: global timeout reached: 20s]
E0201 03:06:04.451756       1 controller.go:216] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: failed to drain node for machine: requeue in: 20s
I0201 03:06:04.937062       1 controller.go:168] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: reconciling Machine
I0201 03:06:04.937083       1 controller.go:208] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: reconciling machine triggers delete
I0201 03:06:04.942816       1 controller.go:365] "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7": Node "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7" is unreachable, draining will wait 300 seconds after pod is signalled for deletion and skip after it
W0201 03:06:25.399391       1 controller.go:378] drain failed for machine "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7": [error when waiting for pod "certified-operators-lqwjk" terminating: global timeout reached: 20s, error when waiting for pod "ssh-bastion-5fcf8d7d9b-4r7hj" terminating: global timeout reached: 20s, error when waiting for pod "redhat-marketplace-6t282" terminating: global timeout reached: 20s, error when waiting for pod "service-test-gn7lf" terminating: global timeout reached: 20s, error when waiting for pod "image-registry-5b48fbdc88-x6qxb" terminating: global timeout reached: 20s, error when waiting for pod "router-default-5cbbbdfd4f-bczl6" terminating: global timeout reached: 20s, error when waiting for pod "kube-state-metrics-7bf99fb87-z2rz5" terminating: global timeout reached: 20s, error when waiting for pod "migrator-689747c59-krk4r" terminating: global timeout reached: 20s, error when waiting for pod "openshift-state-metrics-5cd6797d47-kgwbs" terminating: global timeout reached: 20s, error when waiting for pod "alertmanager-main-2" terminating: global timeout reached: 20s, error when waiting for pod "network-check-source-78cb79769b-5wfv2" terminating: global timeout reached: 20s]
E0201 03:06:25.399521       1 controller.go:216] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: failed to drain node for machine: requeue in: 20s
I0201 03:06:26.001342       1 controller.go:168] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: reconciling Machine
I0201 03:06:26.002618       1 controller.go:208] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: reconciling machine triggers delete
I0201 03:06:26.011126       1 controller.go:365] "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7": Node "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7" is unreachable, draining will wait 300 seconds after pod is signalled for deletion and skip after it
W0201 03:06:46.466009       1 controller.go:378] drain failed for machine "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7": [error when waiting for pod "redhat-marketplace-6t282" terminating: global timeout reached: 20s, error when waiting for pod "alertmanager-main-2" terminating: global timeout reached: 20s, error when waiting for pod "certified-operators-lqwjk" terminating: global timeout reached: 20s, error when waiting for pod "ssh-bastion-5fcf8d7d9b-4r7hj" terminating: global timeout reached: 20s, error when waiting for pod "router-default-5cbbbdfd4f-bczl6" terminating: global timeout reached: 20s, error when waiting for pod "openshift-state-metrics-5cd6797d47-kgwbs" terminating: global timeout reached: 20s, error when waiting for pod "image-registry-5b48fbdc88-x6qxb" terminating: global timeout reached: 20s, error when waiting for pod "migrator-689747c59-krk4r" terminating: global timeout reached: 20s, error when waiting for pod "kube-state-metrics-7bf99fb87-z2rz5" terminating: global timeout reached: 20s, error when waiting for pod "service-test-gn7lf" terminating: global timeout reached: 20s, error when waiting for pod "network-check-source-78cb79769b-5wfv2" terminating: global timeout reached: 20s]
E0201 03:06:46.466190       1 controller.go:216] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: failed to drain node for machine: requeue in: 20s
I0201 03:06:46.466390       1 controller.go:168] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: reconciling Machine
I0201 03:06:46.466532       1 controller.go:208] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: reconciling machine triggers delete
I0201 03:06:46.472571       1 controller.go:365] "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7": Node "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7" is unreachable, draining will wait 300 seconds after pod is signalled for deletion and skip after it
W0201 03:07:06.930664       1 controller.go:378] drain failed for machine "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7": [error when waiting for pod "network-check-source-78cb79769b-5wfv2" terminating: global timeout reached: 20s, error when waiting for pod "ssh-bastion-5fcf8d7d9b-4r7hj" terminating: global timeout reached: 20s, error when waiting for pod "service-test-gn7lf" terminating: global timeout reached: 20s, error when waiting for pod "alertmanager-main-2" terminating: global timeout reached: 20s, error when waiting for pod "certified-operators-lqwjk" terminating: global timeout reached: 20s, error when waiting for pod "kube-state-metrics-7bf99fb87-z2rz5" terminating: global timeout reached: 20s, error when waiting for pod "migrator-689747c59-krk4r" terminating: global timeout reached: 20s, error when waiting for pod "redhat-marketplace-6t282" terminating: global timeout reached: 20s, error when waiting for pod "router-default-5cbbbdfd4f-bczl6" terminating: global timeout reached: 20s, error when waiting for pod "openshift-state-metrics-5cd6797d47-kgwbs" terminating: global timeout reached: 20s, error when waiting for pod "image-registry-5b48fbdc88-x6qxb" terminating: global timeout reached: 20s]
E0201 03:07:06.930744       1 controller.go:216] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: failed to drain node for machine: requeue in: 20s
I0201 03:07:07.522583       1 controller.go:168] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: reconciling Machine
I0201 03:07:07.522601       1 controller.go:208] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: reconciling machine triggers delete
I0201 03:07:07.530173       1 controller.go:365] "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7": Node "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7" is unreachable, draining will wait 300 seconds after pod is signalled for deletion and skip after it
W0201 03:07:27.984581       1 controller.go:378] drain failed for machine "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7": [error when waiting for pod "redhat-marketplace-6t282" terminating: global timeout reached: 20s, error when waiting for pod "ssh-bastion-5fcf8d7d9b-4r7hj" terminating: global timeout reached: 20s, error when waiting for pod "router-default-5cbbbdfd4f-bczl6" terminating: global timeout reached: 20s, error when waiting for pod "service-test-gn7lf" terminating: global timeout reached: 20s, error when waiting for pod "migrator-689747c59-krk4r" terminating: global timeout reached: 20s, error when waiting for pod "kube-state-metrics-7bf99fb87-z2rz5" terminating: global timeout reached: 20s, error when waiting for pod "network-check-source-78cb79769b-5wfv2" terminating: global timeout reached: 20s, error when waiting for pod "openshift-state-metrics-5cd6797d47-kgwbs" terminating: global timeout reached: 20s, error when waiting for pod "image-registry-5b48fbdc88-x6qxb" terminating: global timeout reached: 20s, error when waiting for pod "alertmanager-main-2" terminating: global timeout reached: 20s, error when waiting for pod "certified-operators-lqwjk" terminating: global timeout reached: 20s]
E0201 03:07:27.984657       1 controller.go:216] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: failed to drain node for machine: requeue in: 20s
I0201 03:07:27.984760       1 controller.go:168] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: reconciling Machine
I0201 03:07:27.984783       1 controller.go:208] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: reconciling machine triggers delete
I0201 03:07:27.992003       1 controller.go:365] "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7": Node "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7" is unreachable, draining will wait 300 seconds after pod is signalled for deletion and skip after it
W0201 03:07:48.528550       1 controller.go:378] drain failed for machine "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7": [error when waiting for pod "kube-state-metrics-7bf99fb87-z2rz5" terminating: global timeout reached: 20s, error when waiting for pod "certified-operators-lqwjk" terminating: global timeout reached: 20s, error when waiting for pod "router-default-5cbbbdfd4f-bczl6" terminating: global timeout reached: 20s, error when waiting for pod "ssh-bastion-5fcf8d7d9b-4r7hj" terminating: global timeout reached: 20s, error when waiting for pod "openshift-state-metrics-5cd6797d47-kgwbs" terminating: global timeout reached: 20s, error when waiting for pod "service-test-gn7lf" terminating: global timeout reached: 20s, error when waiting for pod "migrator-689747c59-krk4r" terminating: global timeout reached: 20s, error when waiting for pod "alertmanager-main-2" terminating: global timeout reached: 20s, error when waiting for pod "redhat-marketplace-6t282" terminating: global timeout reached: 20s, error when waiting for pod "network-check-source-78cb79769b-5wfv2" terminating: global timeout reached: 20s, error when waiting for pod "image-registry-5b48fbdc88-x6qxb" terminating: global timeout reached: 20s]
E0201 03:07:48.528615       1 controller.go:216] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: failed to drain node for machine: requeue in: 20s
I0201 03:07:49.091890       1 controller.go:168] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: reconciling Machine
I0201 03:07:49.091919       1 controller.go:208] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: reconciling machine triggers delete
I0201 03:07:49.101731       1 controller.go:365] "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7": Node "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7" is unreachable, draining will wait 300 seconds after pod is signalled for deletion and skip after it
W0201 03:08:09.555703       1 controller.go:378] drain failed for machine "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7": [error when waiting for pod "certified-operators-lqwjk" terminating: global timeout reached: 20s, error when waiting for pod "migrator-689747c59-krk4r" terminating: global timeout reached: 20s, error when waiting for pod "ssh-bastion-5fcf8d7d9b-4r7hj" terminating: global timeout reached: 20s, error when waiting for pod "kube-state-metrics-7bf99fb87-z2rz5" terminating: global timeout reached: 20s, error when waiting for pod "service-test-gn7lf" terminating: global timeout reached: 20s, error when waiting for pod "router-default-5cbbbdfd4f-bczl6" terminating: global timeout reached: 20s, error when waiting for pod "network-check-source-78cb79769b-5wfv2" terminating: global timeout reached: 20s, error when waiting for pod "openshift-state-metrics-5cd6797d47-kgwbs" terminating: global timeout reached: 20s, error when waiting for pod "redhat-marketplace-6t282" terminating: global timeout reached: 20s, error when waiting for pod "alertmanager-main-2" terminating: global timeout reached: 20s, error when waiting for pod "image-registry-5b48fbdc88-x6qxb" terminating: global timeout reached: 20s]
E0201 03:08:09.555771       1 controller.go:216] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: failed to drain node for machine: requeue in: 20s
I0201 03:08:10.123325       1 controller.go:168] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: reconciling Machine
I0201 03:08:10.123381       1 controller.go:208] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: reconciling machine triggers delete
I0201 03:08:10.135554       1 controller.go:365] "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7": Node "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7" is unreachable, draining will wait 300 seconds after pod is signalled for deletion and skip after it
[... the same drain-failed / requeue cycle repeats roughly every 20 seconds, with the list of stuck pods gradually shrinking ...]
W0201 03:12:55.040572       1 controller.go:378] drain failed for machine "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7": error when waiting for pod "router-default-5cbbbdfd4f-bczl6" terminating: global timeout reached: 20s
E0201 03:12:55.040616       1 controller.go:216] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: failed to drain node for machine: requeue in: 20s
I0201 03:12:55.040721       1 controller.go:168] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: reconciling Machine
I0201 03:12:55.040742       1 controller.go:208] ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7: reconciling machine triggers delete
I0201 03:12:55.047522       1 controller.go:365] "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7": Node "ci-ln-yvpqy9b-f76d1-4cqs5-worker-d-ggpq7" is unreachable, draining will wait 300 seconds after pod is signalled for deletion and skip after it
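The loop above stalls because the drain waits out each pod's full termination grace period even though the kubelet is powered off and can never confirm termination. The attached PR ("Drain: ignore gracePeriod on unreachable kublets") points at the fix: skip the grace-period wait when the node is unreachable, while still honoring PodDisruptionBudgets. A minimal, self-contained Go sketch of that idea; the types and function names here are illustrative, not the actual controller code:

```go
package main

import "fmt"

// nodeCondition mirrors the fields of corev1.NodeCondition that matter here.
type nodeCondition struct {
	condType string // e.g. "Ready"
	status   string // "True", "False", or "Unknown"
}

// isUnreachable reports whether the node's Ready condition is Unknown,
// which is what the node controller sets when the kubelet stops
// heartbeating (for example, after a forced power-off).
func isUnreachable(conds []nodeCondition) bool {
	for _, c := range conds {
		if c.condType == "Ready" {
			return c.status == "Unknown"
		}
	}
	return true // no Ready condition reported at all: treat as unreachable
}

// evictionGracePeriod returns the grace period to use when evicting a pod
// during drain. On an unreachable node the pod can never run its
// termination handlers, so waiting out the full grace period only delays
// deletion; PodDisruptionBudgets are still enforced separately by the
// eviction API.
func evictionGracePeriod(unreachable bool, podGracePeriodSeconds int64) int64 {
	if unreachable {
		return 1
	}
	return podGracePeriodSeconds
}

func main() {
	down := []nodeCondition{{condType: "Ready", status: "Unknown"}}
	fmt.Println(isUnreachable(down))             // true
	fmt.Println(evictionGracePeriod(true, 300))  // 1
	fmt.Println(evictionGracePeriod(false, 300)) // 300
}
```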

Comment 29 Clayton Coleman 2021-02-01 03:17:10 UTC
4.7.0-0.ci-2021-01-31-192525 was the version I tried this on.

Comment 30 Michael Gugino 2021-02-01 13:57:13 UTC
(In reply to Clayton Coleman from comment #29)
> 4.7.0-0.ci-2021-01-31-192525 was the version I tried this on.

This change hasn't made it to all the providers yet; we still have to vendor it into each provider repository. I think a bot moved it to QA.

Comment 31 Joel Speed 2021-02-08 13:58:25 UTC
The vendor update should be done this week, at which point we can try to get this re-verified.
We will want to backport this fix to 4.7/4.6 once we are sure it works; moving the target to 4.8 for now.

Comment 32 Joel Speed 2021-03-26 11:17:17 UTC
The vendor update has now been done in at least GCP, AWS, Azure, and vSphere.

QE, could you check the vendor update and re-verify this?

Comment 33 sunzhaohua 2021-03-29 08:35:09 UTC
Verified on AWS and GCP
clusterversion: 4.8.0-0.nightly-2021-03-26-054333

1. Stop a master machine from the cloud provider
2. Delete machine via oc
$ oc delete machine zhsungcp-zk5jv-master-2
machine.machine.openshift.io "zhsungcp-zk5jv-master-2" deleted

3. The machine is deleted quickly
On AWS:
I0329 08:28:08.874687       1 controller.go:476] zhsunaws-hw24q-master-2: going into phase "Deleting"
I0329 08:28:08.887265       1 controller.go:218] zhsunaws-hw24q-master-2: reconciling machine triggers delete
I0329 08:28:08.894857       1 controller.go:412] "zhsunaws-hw24q-master-2": Node "ip-10-0-194-194.us-east-2.compute.internal" is unreachable, draining will ignore gracePeriod. PDBs are still honored.

....
I0329 08:29:03.867758       1 reconciler.go:255] zhsunaws-hw24q-master-2: Instance does not exist
I0329 08:29:03.867774       1 controller.go:256] zhsunaws-hw24q-master-2: deleting node "ip-10-0-194-194.us-east-2.compute.internal" for machine
I0329 08:29:03.868057       1 reflector.go:219] Starting reflector *v1.Node (9m44.363141236s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:225
I0329 08:29:03.868076       1 reflector.go:255] Listing and watching *v1.Node from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:225
I0329 08:29:03.968765       1 controller.go:442] Node "ip-10-0-194-194.us-east-2.compute.internal" not found
I0329 08:29:03.991072       1 controller.go:270] zhsunaws-hw24q-master-2: machine deletion successful

On GCP:
I0329 06:19:12.067755       1 controller.go:174] zhsungcp-zk5jv-master-2: reconciling Machine
I0329 06:19:12.080143       1 controller.go:218] zhsungcp-zk5jv-master-2: reconciling machine triggers delete
I0329 06:19:12.085978       1 controller.go:412] "zhsungcp-zk5jv-master-2": Node "zhsungcp-zk5jv-master-2.c.openshift-qe.internal" is unreachable, draining will ignore gracePeriod. PDBs are still honored.
E0329 06:19:12.192491       1 controller.go:637] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-csi-drivers/gcp-pd-csi-driver-node-jl8h4, openshift-cluster-node-tuning-operator/tuned-nbxgz, openshift-controller-manager/controller-manager-6t6wg, openshift-dns/dns-default-8f9lv, openshift-image-registry/node-ca-zsr8w, openshift-machine-config-operator/machine-config-daemon-hbxt4, openshift-machine-config-operator/machine-config-server-5lsw4, openshift-monitoring/node-exporter-7llw4, openshift-multus/multus-admission-controller-qztk2, openshift-multus/multus-hb7f8, openshift-multus/network-metrics-daemon-swrlq, openshift-network-diagnostics/network-check-target-g2jzt, openshift-sdn/ovs-7qtm5, openshift-sdn/sdn-controller-bthqq, openshift-sdn/sdn-mbnzv; deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: openshift-kube-apiserver/revision-pruner-7-zhsungcp-zk5jv-master-2.c.openshift-qe.internal
I0329 06:19:12.197520       1 controller.go:637] evicting pod openshift-kube-apiserver/revision-pruner-7-zhsungcp-zk5jv-master-2.c.openshift-qe.internal
I0329 06:19:15.218286       1 controller.go:637] pod "revision-pruner-7-zhsungcp-zk5jv-master-2.c.openshift-qe.internal" has DeletionTimestamp older than 1 seconds, skipping
I0329 06:19:15.218342       1 controller.go:432] drain successful for machine "zhsungcp-zk5jv-master-2"
I0329 06:19:15.218365       1 actuator.go:125] zhsungcp-zk5jv-master-2: Deleting machine
I0329 06:19:15.218539       1 recorder.go:98] controller-runtime/manager/events "msg"="Normal"  "message"="Node \"zhsungcp-zk5jv-master-2.c.openshift-qe.internal\" drained" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"zhsungcp-zk5jv-master-2","uid":"89041523-1e2c-4023-8f37-fe18a823c9a7","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"77208"} "reason"="Deleted"
I0329 06:19:16.009175       1 reconciler.go:370] zhsungcp-zk5jv-master-2: machine status is exists, requeuing...
E0329 06:19:16.009221       1 actuator.go:53] zhsungcp-zk5jv-master-2 error: zhsungcp-zk5jv-master-2: reconciler failed to Delete machine: requeue in: 20s
E0329 06:19:16.009269       1 controller.go:239] zhsungcp-zk5jv-master-2: failed to delete machine: zhsungcp-zk5jv-master-2: reconciler failed to Delete machine: requeue in: 20s
I0329 06:19:16.009281       1 controller.go:454] Actuator returned requeue-after error: requeue in: 20s
I0329 06:19:16.009368       1 recorder.go:98] controller-runtime/manager/events "msg"="Warning"  "message"="zhsungcp-zk5jv-master-2: reconciler failed to Delete machine: requeue in: 20s" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"zhsungcp-zk5jv-master-2","uid":"89041523-1e2c-4023-8f37-fe18a823c9a7","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"77208"} "reason"="FailedDelete"
I0329 06:19:36.009979       1 controller.go:174] zhsungcp-zk5jv-master-2: reconciling Machine
I0329 06:19:36.021798       1 controller.go:218] zhsungcp-zk5jv-master-2: reconciling machine triggers delete
I0329 06:19:36.029654       1 controller.go:384] Could not find node from noderef, it may have already been deleted: zhsungcp-zk5jv-master-2.c.openshift-qe.internal
I0329 06:19:36.029737       1 actuator.go:125] zhsungcp-zk5jv-master-2: Deleting machine
I0329 06:19:36.371875       1 reconciler.go:342] zhsungcp-zk5jv-master-2: Machine does not exist
I0329 06:19:36.371905       1 reconciler.go:359] zhsungcp-zk5jv-master-2: Machine not found during delete, skipping
I0329 06:19:36.371947       1 actuator.go:84] zhsungcp-zk5jv-master-2: Checking if machine exists
I0329 06:19:36.372221       1 recorder.go:98] controller-runtime/manager/events "msg"="Normal"  "message"="Deleted machine zhsungcp-zk5jv-master-2" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"zhsungcp-zk5jv-master-2","uid":"89041523-1e2c-4023-8f37-fe18a823c9a7","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"77208"} "reason"="Delete"
I0329 06:19:36.665156       1 reconciler.go:342] zhsungcp-zk5jv-master-2: Machine does not exist
I0329 06:19:36.665324       1 controller.go:256] zhsungcp-zk5jv-master-2: deleting node "zhsungcp-zk5jv-master-2.c.openshift-qe.internal" for machine
I0329 06:19:36.665823       1 reflector.go:219] Starting reflector *v1.Node (10h47m38.342094445s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:225
I0329 06:19:36.666000       1 reflector.go:255] Listing and watching *v1.Node from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:225
I0329 06:19:36.766773       1 controller.go:442] Node "zhsungcp-zk5jv-master-2.c.openshift-qe.internal" not found
I0329 06:19:36.794733       1 controller.go:270] zhsungcp-zk5jv-master-2: machine deletion successful
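The AWS and GCP logs above both walk the same delete path: drain the node (ignoring gracePeriod since it is unreachable), delete the backing cloud instance while requeueing until it is gone, then delete the Node object. A rough, self-contained Go sketch of that sequence, with every name invented for illustration rather than taken from the actual controller:

```go
package main

import "fmt"

// Fakes standing in for the cloud-provider and cluster APIs.
type fakeCloud struct{ instances map[string]bool }
type fakeCluster struct {
	nodes   map[string]bool
	drained []string
}

type machine struct {
	name, node      string
	nodeUnreachable bool
}

// reconcileDelete sketches the delete path visible in the logs above:
// drain the node, delete the backing instance (requeueing while it is
// still present), then delete the Node object.
func reconcileDelete(m machine, cloud *fakeCloud, cluster *fakeCluster) string {
	// 1. Drain; for an unreachable node the pod grace period is ignored,
	//    but PodDisruptionBudgets are still honored via the eviction API.
	if cluster.nodes[m.node] {
		cluster.drained = append(cluster.drained, m.node)
	}
	// 2. Delete the backing cloud instance; requeue while it still exists.
	if cloud.instances[m.name] {
		delete(cloud.instances, m.name)
		return "requeue" // instance was still present on this pass
	}
	// 3. Instance is gone: remove the Node object and finish.
	delete(cluster.nodes, m.node)
	return "deleted"
}

func main() {
	cloud := &fakeCloud{instances: map[string]bool{"master-2": true}}
	cluster := &fakeCluster{nodes: map[string]bool{"node-2": true}}
	m := machine{name: "master-2", node: "node-2", nodeUnreachable: true}
	fmt.Println(reconcileDelete(m, cloud, cluster)) // requeue
	fmt.Println(reconcileDelete(m, cloud, cluster)) // deleted
	fmt.Println(cluster.nodes["node-2"])            // false
}
```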

Comment 36 errata-xmlrpc 2021-07-27 22:35:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

