Created attachment 1747682 [details]
zhsungcp114-z4t84-worker-f-h627p is stopped

Description of problem:
Create a preemptible instance on GCP. After the preemptible instance has run for 24 hours, the machine is not deleted. Checking the node YAML, the node is not marked with the `Terminating` condition.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-01-13-124141

How reproducible:
Tried 2 times, same result.

Steps to Reproduce:
1. Create a preemptible instance on GCP with "preemptible: true"
2. Let the preemptible instance run for 24 hours
3. Check the machine and node status

Actual results:
The node is NotReady but is not marked with the Terminating condition. The GCP console shows that the VM instance is stopped.

$ oc get node
NAME                                                       STATUS     ROLES    AGE   VERSION
zhsungcp114-z4t84-master-0.c.openshift-qe.internal         Ready      master   26h   v1.20.0+31b56ef
zhsungcp114-z4t84-master-1.c.openshift-qe.internal         Ready      master   26h   v1.20.0+31b56ef
zhsungcp114-z4t84-master-2.c.openshift-qe.internal         Ready      master   26h   v1.20.0+31b56ef
zhsungcp114-z4t84-worker-a-762rp.c.openshift-qe.internal   Ready      worker   26h   v1.20.0+31b56ef
zhsungcp114-z4t84-worker-b-5htgn.c.openshift-qe.internal   Ready      worker   26h   v1.20.0+31b56ef
zhsungcp114-z4t84-worker-f-h627p.c.openshift-qe.internal   NotReady   worker   25h   v1.20.0+31b56ef

$ oc describe node zhsungcp114-z4t84-worker-f-h627p.c.openshift-qe.internal
Conditions:
  Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----                 ------    -----------------                 ------------------                ------              -------
  NetworkUnavailable   False     Mon, 01 Jan 0001 00:00:00 +0000   Thu, 14 Jan 2021 12:41:15 +0800   RouteCreated        openshift-sdn cleared kubelet-set NoRouteCreated
  MemoryPressure       Unknown   Fri, 15 Jan 2021 12:39:28 +0800   Fri, 15 Jan 2021 12:43:10 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure         Unknown   Fri, 15 Jan 2021 12:39:28 +0800   Fri, 15 Jan 2021 12:43:10 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure          Unknown   Fri, 15 Jan 2021 12:39:28 +0800   Fri, 15 Jan 2021 12:43:10 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready                Unknown   Fri, 15 Jan 2021 12:39:28 +0800   Fri, 15 Jan 2021 12:43:10 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
$ oc get machine -o wide
NAME                               PHASE     TYPE            REGION        ZONE            AGE   NODE                                                       PROVIDERID                                                          STATE
zhsungcp114-z4t84-master-0         Running   n1-standard-4   us-central1   us-central1-a   26h   zhsungcp114-z4t84-master-0.c.openshift-qe.internal         gce://openshift-qe/us-central1-a/zhsungcp114-z4t84-master-0         RUNNING
zhsungcp114-z4t84-master-1         Running   n1-standard-4   us-central1   us-central1-b   26h   zhsungcp114-z4t84-master-1.c.openshift-qe.internal         gce://openshift-qe/us-central1-b/zhsungcp114-z4t84-master-1         RUNNING
zhsungcp114-z4t84-master-2         Running   n1-standard-4   us-central1   us-central1-c   26h   zhsungcp114-z4t84-master-2.c.openshift-qe.internal         gce://openshift-qe/us-central1-c/zhsungcp114-z4t84-master-2         RUNNING
zhsungcp114-z4t84-worker-a-762rp   Running   n1-standard-4   us-central1   us-central1-a   26h   zhsungcp114-z4t84-worker-a-762rp.c.openshift-qe.internal   gce://openshift-qe/us-central1-a/zhsungcp114-z4t84-worker-a-762rp   RUNNING
zhsungcp114-z4t84-worker-b-5htgn   Running   n1-standard-4   us-central1   us-central1-b   26h   zhsungcp114-z4t84-worker-b-5htgn.c.openshift-qe.internal   gce://openshift-qe/us-central1-b/zhsungcp114-z4t84-worker-b-5htgn   RUNNING
zhsungcp114-z4t84-worker-f-h627p   Running   n1-standard-4   us-central1   us-central1-f   25h   zhsungcp114-z4t84-worker-f-h627p.c.openshift-qe.internal   gce://openshift-qe/us-central1-f/zhsungcp114-z4t84-worker-f-h627p   RUNNING

Machine status for zhsungcp114-z4t84-worker-f-h627p:
  nodeRef:
    kind: Node
    name: zhsungcp114-z4t84-worker-f-h627p.c.openshift-qe.internal
    uid: 46815e08-71bb-45bb-82f0-e987200f31c7
  phase: Running
  providerStatus:
    conditions:
    - lastProbeTime: "2021-01-14T04:38:23Z"
      lastTransitionTime: "2021-01-14T04:38:23Z"
      message: machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreated
    instanceId: zhsungcp114-z4t84-worker-f-h627p
    instanceState: RUNNING
    metadata: {}

$ oc get ds
NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                  AGE
machine-api-termination-handler   0         0         0       0            0           machine.openshift.io/interruptible-instance=   26h

$ oc get mhc
NAME                              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
machine-api-termination-handler   100%           1                  1

Expected results:
The machine is deleted after the preemptible instance has run for 24 hours.

Additional info:
Running this case another time gave the same result as above; the only difference is that the machine state was TERMINATED.
$ oc get machine -o wide
NAME                               PHASE     TYPE            REGION        ZONE            AGE   NODE                                                       PROVIDERID                                                          STATE
zhsungcp113-zgxg5-master-0         Running   n1-standard-4   us-central1   us-central1-a   25h   zhsungcp113-zgxg5-master-0.c.openshift-qe.internal         gce://openshift-qe/us-central1-a/zhsungcp113-zgxg5-master-0         RUNNING
zhsungcp113-zgxg5-master-1         Running   n1-standard-4   us-central1   us-central1-b   25h   zhsungcp113-zgxg5-master-1.c.openshift-qe.internal         gce://openshift-qe/us-central1-b/zhsungcp113-zgxg5-master-1         RUNNING
zhsungcp113-zgxg5-master-2         Running   n1-standard-4   us-central1   us-central1-c   25h   zhsungcp113-zgxg5-master-2.c.openshift-qe.internal         gce://openshift-qe/us-central1-c/zhsungcp113-zgxg5-master-2         RUNNING
zhsungcp113-zgxg5-worker-a-df8bc   Running   n1-standard-4   us-central1   us-central1-a   25h   zhsungcp113-zgxg5-worker-a-df8bc.c.openshift-qe.internal   gce://openshift-qe/us-central1-a/zhsungcp113-zgxg5-worker-a-df8bc   RUNNING
zhsungcp113-zgxg5-worker-b-dc7lb   Running   n1-standard-4   us-central1   us-central1-b   25h   zhsungcp113-zgxg5-worker-b-dc7lb.c.openshift-qe.internal   gce://openshift-qe/us-central1-b/zhsungcp113-zgxg5-worker-b-dc7lb   RUNNING
zhsungcp113-zgxg5-worker-c-928pf   Running   n1-standard-4   us-central1   us-central1-c   24h   zhsungcp113-zgxg5-worker-c-928pf.c.openshift-qe.internal   gce://openshift-qe/us-central1-c/zhsungcp113-zgxg5-worker-c-928pf   TERMINATED

Machine status for zhsungcp113-zgxg5-worker-c-928pf:
  nodeRef:
    kind: Node
    name: zhsungcp113-zgxg5-worker-c-928pf.c.openshift-qe.internal
    uid: d55afed3-352b-4006-9549-496e0ef89d67
  phase: Running
  providerStatus:
    conditions:
    - lastProbeTime: "2021-01-13T02:51:11Z"
      lastTransitionTime: "2021-01-13T02:51:11Z"
      message: machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreated
    instanceId: zhsungcp113-zgxg5-worker-c-928pf
    instanceState: TERMINATED
    metadata: {}
I did some manual testing of the termination handler this morning, and it appears to work when the instance is preempted, based on the detection method GCP documents: https://cloud.google.com/compute/docs/instances/create-start-preemptible-instance#detecting_if_an_instance_was_preempted

I'm not sure whether the 24 hour limit counts as a preemption; I was under the impression that it would. Perhaps we need some way to remove machines after 24 hours on GCP that isn't tied to these preemption events.
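For context, here is a minimal Go sketch of the detection method described in the GCP documentation above: the instance metadata server exposes a preempted key that returns TRUE once the instance has been preempted. This is purely illustrative and is not the termination handler's actual code.

package main

import (
	"fmt"
	"io"
	"net/http"
)

// preempted asks the GCP metadata server whether this instance has been
// preempted. The endpoint returns the string "TRUE" or "FALSE".
func preempted() (bool, error) {
	req, err := http.NewRequest(http.MethodGet,
		"http://metadata.google.internal/computeMetadata/v1/instance/preempted", nil)
	if err != nil {
		return false, err
	}
	// The metadata server rejects requests without this header.
	req.Header.Set("Metadata-Flavor", "Google")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	return string(body) == "TRUE", nil
}

func main() {
	p, err := preempted()
	if err != nil {
		panic(err)
	}
	fmt.Println("preempted:", p)
}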
Could I ask QE to try this on a recent 4.8 nightly? We don't currently have access to an environment where we would be able to run a 24 hour long test like this (I will ask about this). I'd like to verify that this is a real/reproducible issue and wasn't a random occurrence.
(In reply to Joel Speed from comment #3)
> Could I ask QE to try this on a recent 4.8 nightly? We don't currently have
> access to an environment where we would be able to run a 24 hour long test
> like this (I will ask about this).
>
> I'd like to verify that this is a real/reproducible issue and wasn't a
> random occurrence.

Sure, I will try this on a 4.8 nightly.
I could reproduce this on 4.8. The times at which the spot instances were stopped are not consistent: sometimes a few hours, sometimes more than ten. I tried 3 times; each time, some or all of the spot instances were stopped within 24 hours. Once they have run for 24 hours, all spot instances are stopped, but the nodes are not marked with "Terminating" and the machines are not deleted.

clusterversion: 4.8.0-0.nightly-2021-03-04-014703

$ oc get node
NAME                                                      STATUS     ROLES    AGE   VERSION
zhsun34gcp-vhphm-master-0.c.openshift-qe.internal         Ready      master   2d    v1.20.0+2ce2be0
zhsun34gcp-vhphm-master-1.c.openshift-qe.internal         Ready      master   47h   v1.20.0+2ce2be0
zhsun34gcp-vhphm-master-2.c.openshift-qe.internal         Ready      master   47h   v1.20.0+2ce2be0
zhsun34gcp-vhphm-worker-a-shfcq.c.openshift-qe.internal   Ready      worker   47h   v1.20.0+2ce2be0
zhsun34gcp-vhphm-worker-b-prkbq.c.openshift-qe.internal   Ready      worker   47h   v1.20.0+2ce2be0
zhsun34gcp-vhphm-worker-c-2d2tf.c.openshift-qe.internal   NotReady   worker   32h   v1.20.0+2ce2be0
zhsun34gcp-vhphm-worker-f-lf84m.c.openshift-qe.internal   NotReady   worker   32h   v1.20.0+2ce2be0

must-gather: http://file.rdu.redhat.com/~zhsun/must-gather.local.5742201258686345266.tar.gz
I don't think the must-gather uploaded successfully. I tried to download it, but it's a zero-byte archive according to the system. Can you double-check and re-upload, please?
Sorry, the upload failed because the disk quota was exceeded. I have re-uploaded it; please download again.
I've had a look through the must-gather and can see that the termination handler did indeed fail to mark the machine. However, we have no logs (as the machine is gone), so there isn't much to go on to debug this. We'll have to see if we can come up with a way to reproduce this and gather logs at the same time.
I managed to spend some time looking into this and reproducing it. When a preemptible instance reaches 24 hours, GCP shuts it down rather than terminating it, which means you could, if you wanted to, restart the VM. Importantly, no preemption event is delivered to our termination handler, so it never marks the node for termination. I think the only way we can reliably handle this is with a system uptime check: if the machine has been up for 23:59, mark it as terminating so that it gets replaced.
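A rough sketch in Go of what such an uptime check could look like. The condition type "Terminating", the reason/message strings, and the 23:59 threshold are assumptions for illustration, not the current termination handler implementation; the idea is simply to set the node condition the termination-handler MHC reacts to once uptime approaches the limit.

package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// Mark the node just before GCP's 24 hour preemptible limit.
const maxPreemptibleUptime = 23*time.Hour + 59*time.Minute

// markIfExpiring adds a "Terminating" condition to the node once the instance
// has been up long enough that GCP is about to stop it.
func markIfExpiring(ctx context.Context, client kubernetes.Interface, nodeName string, bootTime time.Time) error {
	if time.Since(bootTime) < maxPreemptibleUptime {
		return nil // nothing to do yet
	}

	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("getting node %s: %w", nodeName, err)
	}

	for _, c := range node.Status.Conditions {
		if string(c.Type) == "Terminating" {
			return nil // already marked
		}
	}

	node.Status.Conditions = append(node.Status.Conditions, corev1.NodeCondition{
		Type:               corev1.NodeConditionType("Terminating"),
		Status:             corev1.ConditionTrue,
		Reason:             "UptimeLimitReached",
		Message:            "preemptible instance is approaching GCP's 24 hour limit",
		LastTransitionTime: metav1.Now(),
	})
	_, err = client.CoreV1().Nodes().UpdateStatus(ctx, node, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// In a real handler the node name and boot time would come from the
	// environment and /proc/uptime; they are hard-coded here for illustration.
	bootTime := time.Now().Add(-24 * time.Hour)
	if err := markIfExpiring(context.Background(), client, "example-node", bootTime); err != nil {
		panic(err)
	}
}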
One suggestion that has come up before is a machine-recycler that deletes machines after X time. We could build some opt-in logic for MHC to automatically delete nodes after a certain time period. This would also be useful in the general case for non-spot instances.
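For illustration, a minimal Go sketch of what such a machine recycler might look like: it deletes Machine objects older than a configurable maximum age so the machine controller replaces them. The label selector, namespace, and max-age value are assumptions for the sketch; nothing here corresponds to an existing MHC field.

package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

var machineGVR = schema.GroupVersionResource{
	Group:    "machine.openshift.io",
	Version:  "v1beta1",
	Resource: "machines",
}

// recycle deletes every matching Machine whose age exceeds maxAge. The machine
// controller then provisions a replacement via the owning MachineSet.
func recycle(ctx context.Context, client dynamic.Interface, namespace string, maxAge time.Duration) error {
	machines, err := client.Resource(machineGVR).Namespace(namespace).List(ctx, metav1.ListOptions{
		// Assumed: only recycle spot/preemptible machines, identified here by
		// the interruptible-instance label.
		LabelSelector: "machine.openshift.io/interruptible-instance=",
	})
	if err != nil {
		return err
	}
	for _, m := range machines.Items {
		if time.Since(m.GetCreationTimestamp().Time) > maxAge {
			if err := client.Resource(machineGVR).Namespace(namespace).Delete(ctx, m.GetName(), metav1.DeleteOptions{}); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	// Recycle spot machines a little before GCP's 24 hour preemptible limit.
	if err := recycle(context.Background(), client, "openshift-machine-api", 23*time.Hour+30*time.Minute); err != nil {
		panic(err)
	}
}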
Yeah, that seems like a reasonable approach; it means we don't have to run anything on the node, which is a nice benefit. We would only want to deploy this on GCP though, so we would need to build that into the MAO so it deploys this MHC only on GCP. Do you have any links to prior discussion around machine recyclers?
It was too long ago; I don't have any links to past discussions. This can be opt-in behavior that is strongly recommended for GCP users via documentation. We should have some documentation covering the specifics of these instance types across the clouds we support. Some users might not care about this functionality at all, since an instance getting interrupted is not a big deal to them, but if you want a replacement machine, this would be the best way to get one IMO.
We need to come up with a proper solution for this; it isn't a bug, but rather a feature gap. We will now track this in https://issues.redhat.com/browse/OCPCLOUD-1177, and it will be prioritised with our other work.