Bug 1916575

Summary: [gcp] Spot instance nodes are not marked with the Terminating condition
Product: OpenShift Container Platform
Component: Cloud Compute (sub component: Other Providers)
Reporter: sunzhaohua <zhsun>
Assignee: Alexander Demicev <ademicev>
QA Contact: sunzhaohua <zhsun>
Status: CLOSED DEFERRED
Severity: medium
Priority: medium
CC: aarapov, ademicev, mgugino, zzhao
Version: 4.7
Type: Bug
Last Closed: 2021-05-19 14:17:12 UTC
Attachments: zhsungcp114-z4t84-worker-f-h627p is stopped

Description sunzhaohua 2021-01-15 06:28:29 UTC
Created attachment 1747682 [details]
zhsungcp114-z4t84-worker-f-h627p is stopped

Description of problem:
Created a preemptible instance on GCP. After the preemptible instance ran for 24 hours, the machine was not deleted. Checking the node YAML, the node was not marked with the `Terminating` condition.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-01-13-124141

How reproducible:
Tried 2 times; same result both times.

Steps to Reproduce:
1. Create a preemptible instance on GCP with "preemptible: true" (a sketch follows these steps)
2. Let the preemptible instances run for 24 hours
3. Check machine, node status
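
For step 1, a minimal sketch via a MachineSet patch; the MachineSet name and replica count are illustrative (preemptible is the field in the GCP providerSpec):

$ oc -n openshift-machine-api patch machineset zhsungcp114-z4t84-worker-f \
    --type merge \
    -p '{"spec":{"template":{"spec":{"providerSpec":{"value":{"preemptible":true}}}}}}'
# scale up so a new, preemptible Machine is created from the updated template
$ oc -n openshift-machine-api scale machineset zhsungcp114-z4t84-worker-f --replicas=2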

Actual results:
Node is NotReady but is not marked with the Terminating condition. Checking from the GCP console, the VM instance is stopped.

$ oc get node
NAME                                                       STATUS     ROLES    AGE   VERSION
zhsungcp114-z4t84-master-0.c.openshift-qe.internal         Ready      master   26h   v1.20.0+31b56ef
zhsungcp114-z4t84-master-1.c.openshift-qe.internal         Ready      master   26h   v1.20.0+31b56ef
zhsungcp114-z4t84-master-2.c.openshift-qe.internal         Ready      master   26h   v1.20.0+31b56ef
zhsungcp114-z4t84-worker-a-762rp.c.openshift-qe.internal   Ready      worker   26h   v1.20.0+31b56ef
zhsungcp114-z4t84-worker-b-5htgn.c.openshift-qe.internal   Ready      worker   26h   v1.20.0+31b56ef
zhsungcp114-z4t84-worker-f-h627p.c.openshift-qe.internal   NotReady   worker   25h   v1.20.0+31b56ef

$ oc describe node zhsungcp114-z4t84-worker-f-h627p.c.openshift-qe.internal
Conditions:
  Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----                 ------    -----------------                 ------------------                ------              -------
  NetworkUnavailable   False     Mon, 01 Jan 0001 00:00:00 +0000   Thu, 14 Jan 2021 12:41:15 +0800   RouteCreated        openshift-sdn cleared kubelet-set NoRouteCreated
  MemoryPressure       Unknown   Fri, 15 Jan 2021 12:39:28 +0800   Fri, 15 Jan 2021 12:43:10 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure         Unknown   Fri, 15 Jan 2021 12:39:28 +0800   Fri, 15 Jan 2021 12:43:10 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure          Unknown   Fri, 15 Jan 2021 12:39:28 +0800   Fri, 15 Jan 2021 12:43:10 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready                Unknown   Fri, 15 Jan 2021 12:39:28 +0800   Fri, 15 Jan 2021 12:43:10 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  
$ oc get machine -o wide
NAME                               PHASE     TYPE            REGION        ZONE            AGE   NODE                                                       PROVIDERID                                                          STATE
zhsungcp114-z4t84-master-0         Running   n1-standard-4   us-central1   us-central1-a   26h   zhsungcp114-z4t84-master-0.c.openshift-qe.internal         gce://openshift-qe/us-central1-a/zhsungcp114-z4t84-master-0         RUNNING
zhsungcp114-z4t84-master-1         Running   n1-standard-4   us-central1   us-central1-b   26h   zhsungcp114-z4t84-master-1.c.openshift-qe.internal         gce://openshift-qe/us-central1-b/zhsungcp114-z4t84-master-1         RUNNING
zhsungcp114-z4t84-master-2         Running   n1-standard-4   us-central1   us-central1-c   26h   zhsungcp114-z4t84-master-2.c.openshift-qe.internal         gce://openshift-qe/us-central1-c/zhsungcp114-z4t84-master-2         RUNNING
zhsungcp114-z4t84-worker-a-762rp   Running   n1-standard-4   us-central1   us-central1-a   26h   zhsungcp114-z4t84-worker-a-762rp.c.openshift-qe.internal   gce://openshift-qe/us-central1-a/zhsungcp114-z4t84-worker-a-762rp   RUNNING
zhsungcp114-z4t84-worker-b-5htgn   Running   n1-standard-4   us-central1   us-central1-b   26h   zhsungcp114-z4t84-worker-b-5htgn.c.openshift-qe.internal   gce://openshift-qe/us-central1-b/zhsungcp114-z4t84-worker-b-5htgn   RUNNING
zhsungcp114-z4t84-worker-f-h627p   Running   n1-standard-4   us-central1   us-central1-f   25h   zhsungcp114-z4t84-worker-f-h627p.c.openshift-qe.internal   gce://openshift-qe/us-central1-f/zhsungcp114-z4t84-worker-f-h627p   RUNNING

Status excerpt for the affected machine:
  nodeRef:
    kind: Node
    name: zhsungcp114-z4t84-worker-f-h627p.c.openshift-qe.internal
    uid: 46815e08-71bb-45bb-82f0-e987200f31c7
  phase: Running
  providerStatus:
    conditions:
    - lastProbeTime: "2021-01-14T04:38:23Z"
      lastTransitionTime: "2021-01-14T04:38:23Z"
      message: machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreated
    instanceId: zhsungcp114-z4t84-worker-f-h627p
    instanceState: RUNNING
    metadata: {}

$ oc get ds
NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                  AGE
machine-api-termination-handler   0         0         0       0            0           machine.openshift.io/interruptible-instance=   26h

$ oc get mhc
NAME                              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
machine-api-termination-handler   100%           1                  1


Expected results:
The machine is deleted after the preemptible instance runs for 24 hours.

Additional info:
Running this case another time, the result is the same as above; the only difference is that the machine state is TERMINATED.
$ oc get machine -o wide
NAME                               PHASE     TYPE            REGION        ZONE            AGE   NODE                                                       PROVIDERID                                                          STATE
zhsungcp113-zgxg5-master-0         Running   n1-standard-4   us-central1   us-central1-a   25h   zhsungcp113-zgxg5-master-0.c.openshift-qe.internal         gce://openshift-qe/us-central1-a/zhsungcp113-zgxg5-master-0         RUNNING
zhsungcp113-zgxg5-master-1         Running   n1-standard-4   us-central1   us-central1-b   25h   zhsungcp113-zgxg5-master-1.c.openshift-qe.internal         gce://openshift-qe/us-central1-b/zhsungcp113-zgxg5-master-1         RUNNING
zhsungcp113-zgxg5-master-2         Running   n1-standard-4   us-central1   us-central1-c   25h   zhsungcp113-zgxg5-master-2.c.openshift-qe.internal         gce://openshift-qe/us-central1-c/zhsungcp113-zgxg5-master-2         RUNNING
zhsungcp113-zgxg5-worker-a-df8bc   Running   n1-standard-4   us-central1   us-central1-a   25h   zhsungcp113-zgxg5-worker-a-df8bc.c.openshift-qe.internal   gce://openshift-qe/us-central1-a/zhsungcp113-zgxg5-worker-a-df8bc   RUNNING
zhsungcp113-zgxg5-worker-b-dc7lb   Running   n1-standard-4   us-central1   us-central1-b   25h   zhsungcp113-zgxg5-worker-b-dc7lb.c.openshift-qe.internal   gce://openshift-qe/us-central1-b/zhsungcp113-zgxg5-worker-b-dc7lb   RUNNING
zhsungcp113-zgxg5-worker-c-928pf   Running   n1-standard-4   us-central1   us-central1-c   24h   zhsungcp113-zgxg5-worker-c-928pf.c.openshift-qe.internal   gce://openshift-qe/us-central1-c/zhsungcp113-zgxg5-worker-c-928pf   TERMINATED

Status excerpt for the affected machine:
  nodeRef:
    kind: Node
    name: zhsungcp113-zgxg5-worker-c-928pf.c.openshift-qe.internal
    uid: d55afed3-352b-4006-9549-496e0ef89d67
  phase: Running
  providerStatus:
    conditions:
    - lastProbeTime: "2021-01-13T02:51:11Z"
      lastTransitionTime: "2021-01-13T02:51:11Z"
      message: machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreated
    instanceId: zhsungcp113-zgxg5-worker-c-928pf
    instanceState: TERMINATED
    metadata: {}

Comment 2 Joel Speed 2021-01-19 12:15:03 UTC
I've done some manual testing of the termination handler this morning, and it appears to be working when the instance is preempted, based on the documentation GCP provides: https://cloud.google.com/compute/docs/instances/create-start-preemptible-instance#detecting_if_an_instance_was_preempted

I'm not sure whether the 24-hour limit counts as a preemption; I was under the impression that it would. Perhaps we need some way to remove the machines after 24 hours on GCP that isn't tied to these events.
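
For reference, the detection method in that document polls the instance metadata server from inside the VM; a minimal sketch (the endpoint returns TRUE once the instance has been preempted, FALSE otherwise):

$ curl -s "http://metadata.google.internal/computeMetadata/v1/instance/preempted" \
    -H "Metadata-Flavor: Google"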

Comment 3 Joel Speed 2021-02-24 13:52:06 UTC
Could I ask QE to try this on a recent 4.8 nightly? We don't currently have access to an environment where we would be able to run a 24 hour long test like this (I will ask about this).

I'd like to verify that this is a real/reproducible issue and wasn't a random occurrence.

Comment 5 sunzhaohua 2021-03-01 05:59:42 UTC
(In reply to Joel Speed from comment #3)
> Could I ask QE to try this on a recent 4.8 nightly? We don't currently have
> access to an environment where we would be able to run a 24 hour long test
> like this (I will ask about this).
> 
> I'd like to verify that this is a real/reproducible issue and wasn't a
> random occurrence.

sure, I will try this on 4.8 nightly

Comment 6 sunzhaohua 2021-03-07 07:14:46 UTC
I could reproduce this on 4.8. The times at which the spot instances were stopped were not the same: sometimes a few hours, sometimes more than ten. I tried 3 times; each time, some or all spot instances were stopped within 24 hours. After they run for 24 hours, all spot instances are stopped, but the nodes are not marked with "Terminating", and the machines couldn't be deleted.

clusterversion: 4.8.0-0.nightly-2021-03-04-014703

$ oc get node
NAME                                                      STATUS     ROLES    AGE   VERSION
zhsun34gcp-vhphm-master-0.c.openshift-qe.internal         Ready      master   2d    v1.20.0+2ce2be0
zhsun34gcp-vhphm-master-1.c.openshift-qe.internal         Ready      master   47h   v1.20.0+2ce2be0
zhsun34gcp-vhphm-master-2.c.openshift-qe.internal         Ready      master   47h   v1.20.0+2ce2be0
zhsun34gcp-vhphm-worker-a-shfcq.c.openshift-qe.internal   Ready      worker   47h   v1.20.0+2ce2be0
zhsun34gcp-vhphm-worker-b-prkbq.c.openshift-qe.internal   Ready      worker   47h   v1.20.0+2ce2be0
zhsun34gcp-vhphm-worker-c-2d2tf.c.openshift-qe.internal   NotReady   worker   32h   v1.20.0+2ce2be0
zhsun34gcp-vhphm-worker-f-lf84m.c.openshift-qe.internal   NotReady   worker   32h   v1.20.0+2ce2be0

must gather: http://file.rdu.redhat.com/~zhsun/must-gather.local.5742201258686345266.tar.gz

Comment 7 Joel Speed 2021-03-16 10:54:56 UTC
I don't think the must-gather uploaded successfully. I tried to download it, but it's a zero-byte archive according to the system. Can you double-check/re-upload, please?

Comment 8 sunzhaohua 2021-03-17 02:43:51 UTC
Sorry, the upload failed because the disk quota was exceeded. I have re-uploaded; please download again.

Comment 9 Joel Speed 2021-03-17 11:36:06 UTC
I've had a look through the must-gather and can see that the termination handler did indeed fail to mark the machine.
However, we have no logs (as the machine was gone), so there is not much to go on to debug this.
We will have to see if we can come up with some way to reproduce this and gather logs at the same time.

Comment 10 Joel Speed 2021-03-26 10:39:06 UTC
I managed to spend some time looking into this one and reproduced it.

When a preemptible instance reaches 24 hours, GCP shuts it down; it doesn't terminate it.
This means that you can, if you want to, restart the VM.

Importantly, the preemption event does not get sent to our termination handler, which means it doesn't mark the node for termination.

I think the only way we are reliably going to be able to handle this is to have a system uptime check and, if the machine has been up for 23:59, mark it as terminating so that it gets replaced.
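
A minimal sketch of that uptime check, assuming it would run on the instance itself (for example, from the termination handler):

# seconds since boot, from /proc/uptime
$ uptime_s=$(awk '{print int($1)}' /proc/uptime)
# 23h59m is 86340 seconds; flag the instance just before GCP's 24-hour limit
$ [ "$uptime_s" -ge 86340 ] && echo "approaching 24h preemptible limit; mark machine as terminating"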

Comment 11 Michael Gugino 2021-03-26 12:28:42 UTC
One suggestion that has come up before is a machine recycler that deletes machines after X time. We could build some opt-in logic into the MHC to automatically delete nodes after a certain time period. This would also be useful in the general case for non-spot instances.
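
To illustrate the machine-recycler idea using only existing CLI tooling, a rough sketch that deletes spot Machines older than 24 hours (this assumes spot Machines carry the same machine.openshift.io/interruptible-instance label that the termination-handler DaemonSet selects on):

$ cutoff=$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)
$ oc -n openshift-machine-api get machines \
    -l machine.openshift.io/interruptible-instance \
    -o json \
  | jq -r --arg cutoff "$cutoff" \
      '.items[] | select(.metadata.creationTimestamp < $cutoff) | .metadata.name' \
  | xargs -r -n1 oc -n openshift-machine-api delete machine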

Comment 12 Joel Speed 2021-03-26 12:41:13 UTC
Yeah, that seems like a reasonable approach; it means we don't have to run the thing on the node, which is a nice benefit.

We would only want to deploy this on GCP, though, so we would need to build that into the MAO so that it deploys the MHC only on GCP.

Do you have any links for prior discussion around machine recyclers?

Comment 13 Michael Gugino 2021-04-08 14:13:25 UTC
It was too long ago; I don't have any links to past discussions.

This can be opt-in behavior that is strongly suggested for GCP users via documentation. We should have some documentation on the specifics of these types of instances across the clouds we support. Some users might not care about this functionality at all, since an instance getting interrupted is not a big deal to them; but if you want a replacement machine, this would be the best way, IMO.
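
Purely to illustrate the opt-in shape under discussion, a hypothetical sketch (the maxAge field below does not exist in the MachineHealthCheck API today; it stands in for whatever knob would actually be built):

$ cat <<'EOF' | oc apply -f -
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: gcp-preemptible-recycler
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/interruptible-instance: ""
  # hypothetical opt-in knob: recycle machines once they reach this age
  maxAge: 23h55m
EOF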

Comment 14 Joel Speed 2021-05-19 14:17:12 UTC
We need to come up with a proper solution for this; this isn't a bug, but rather a feature gap.

We will track this now in https://issues.redhat.com/browse/OCPCLOUD-1177 and it will be prioritised with our other work.