Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1908350

Summary:	[Azure Gov] machine-api-termination-handler CrashLoopBackOff
Product:	OpenShift Container Platform	Reporter:	Milind Yadav <miyadav>
Component:	Cloud Compute	Assignee:	Joel Speed <jspeed>
Cloud Compute sub component:	Other Providers	QA Contact:	sunzhaohua <zhsun>
Status:	CLOSED WORKSFORME	Docs Contact:
Severity:	medium
Priority:	high	CC:	jspeed, mgugino, zhsun
Version:	4.7
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1907286	Environment:
Last Closed:	2021-01-05 14:52:58 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Milind Yadav 2020-12-16 13:44:37 UTC

Description of problem:
The cluster autoscaler is not able to scale down spot instances on Azure Gov cloud.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2020-12-14-165231


Steps to Reproduce:
1. Create a machineset by copying an existing machineset and changing its name.
.
.
 To create spot instances, modify the copy as shown below:
    spec:
      metadata: {}
      providerSpec:
        value:
          spotVMOptions:
            maxPrice: 0.225
.
Expected and actual - machineset created successfully

2. Create the ClusterAutoscaler (CAS) using the YAML below:
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
Expected and actual - clusterautoscaler created successfully

3. Create a machineautoscaler that references the spot machineset, as shown below:
.
.
spec:
  maxReplicas: 6
  minReplicas: 1
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: whu47az4-pwzmb-worker-usgovvirginia-spot
status:
  lastTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: whu47az4-pwzmb-worker-usgovvirginia-spot

Actual and expected - machineautoscaler created successfully

4. Create a workload to scale up the spot machineset.
.
.
Create the workload using the Job below:
apiVersion: batch/v1
kind: Job
metadata:
  generateName: work-queue-
spec:
  template:
    spec:
      containers:
      - name: work
        image: quay.io/openshifttest/busybox@sha256:xyz
        command: ["sleep",  "300"]
        resources:
          requests:
            memory: 500Mi
            cpu: 500m
      restartPolicy: Never
  backoffLimit: 4
  completions: 50
  parallelism: 50
.

Actual and expected : workload created successfully 

5. Spot instances scaled up successfully.

6. Delete the workload and wait for some time (about 10 minutes).
Actual: machines could not scale down, as the spot instance termination handler is crashing
Expected: spot instances should be released
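
As a rough sanity check on why the Job in step 4 forces a scale-up, its CPU requests can be tallied against node capacity. The sketch below is an estimate only: the allocatable CPU per node (`ALLOCATABLE_MILLICPU`) is an assumed figure that was not measured on this cluster, the helper name `nodes_needed` is hypothetical, and memory requests and pods already running on the nodes are ignored.

```python
import math

# Values taken from the Job spec in step 4.
PARALLELISM = 50          # pods running at once
CPU_REQUEST_MILLI = 500   # cpu: 500m per pod

# ASSUMPTION: allocatable CPU on one Standard_D2s_v3 worker (2 vCPUs minus
# system reservations). 1500m is an illustrative guess, not a measured value.
ALLOCATABLE_MILLICPU = 1500


def nodes_needed(pods: int, request_milli: int, allocatable_milli: int) -> int:
    """Return the node count needed if CPU requests are the only constraint."""
    pods_per_node = allocatable_milli // request_milli
    if pods_per_node == 0:
        raise ValueError("a single pod does not fit on a node")
    return math.ceil(pods / pods_per_node)


if __name__ == "__main__":
    wanted = nodes_needed(PARALLELISM, CPU_REQUEST_MILLI, ALLOCATABLE_MILLICPU)
    print(f"total CPU requested: {PARALLELISM * CPU_REQUEST_MILLI}m")
    print(f"nodes needed (CPU only): {wanted}")
```

Under that assumption the demand exceeds the maxReplicas set on the machineautoscaler in step 3, so the spot machineset would be expected to grow to its maximum and stay there until the Job is deleted.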
 

Additional info:
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  46m                default-scheduler  Successfully assigned openshift-machine-api/machine-api-termination-handler-4s4ls to whu47az4-pwzmb-worker-usgovvirginia-spot-w4hl5
  Normal   Pulling    46m                kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:xyzf445734cf6d83a478bd"
  Normal   Pulled     46m                kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ee0846a75604e7602cfed2bf2e918ed3ccef2733e0a29f445734cf6d83a478bd" in 7.141122758s
  Normal   Pulled     25m (x2 over 25m)  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:xyzccef2733e0a29f445734cf6d83a478bd" already present on machine
  Normal   Created    24m (x3 over 46m)  kubelet            Created container termination-handler
  Normal   Started    24m (x3 over 46m)  kubelet            Started container termination-handler
  Warning  BackOff    24m (x3 over 25m)  kubelet            Back-off restarting failed container

Comment 1 Michael Gugino 2020-12-16 13:56:28 UTC

The termination handler shouldn't affect scale down.  That is only for cases where the cloud provider evicts the spot instance and notifies the machine-api that it needs to be deleted.

We'll need a must-gather, as always, to investigate.
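
To illustrate the kind of eviction notice described in this comment, here is a minimal sketch of detecting a spot preemption through Azure's Scheduled Events metadata endpoint. It is not the termination handler's actual code, which is not shown in this bug; the function names `fetch_scheduled_events` and `is_being_preempted` and the sample document are hypothetical, and the endpoint is only reachable from inside an Azure VM.

```python
import json
import urllib.request

# Azure Instance Metadata Service endpoint for Scheduled Events.
SCHEDULED_EVENTS_URL = (
    "http://169.254.169.254/metadata/scheduledevents?api-version=2019-08-01"
)


def fetch_scheduled_events(timeout: float = 5.0) -> dict:
    """Fetch the Scheduled Events document (works only on an Azure VM)."""
    req = urllib.request.Request(
        SCHEDULED_EVENTS_URL, headers={"Metadata": "true"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def is_being_preempted(document: dict, vm_name: str) -> bool:
    """Return True if the document lists a Preempt event for vm_name."""
    for event in document.get("Events", []):
        if (
            event.get("EventType") == "Preempt"
            and vm_name in event.get("Resources", [])
        ):
            return True
    return False


if __name__ == "__main__":
    # Hand-written sample in the Scheduled Events shape; NOT captured from
    # the cluster in this bug.
    sample = {
        "DocumentIncarnation": 1,
        "Events": [
            {
                "EventId": "hypothetical-id",
                "EventType": "Preempt",
                "ResourceType": "VirtualMachine",
                "Resources": ["example-spot-vm"],
                "EventStatus": "Scheduled",
            }
        ],
    }
    print(is_being_preempted(sample, "example-spot-vm"))
```

A scale-down started by the cluster autoscaler never produces such an event, which is why a crashing handler would not be expected to block it.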

Comment 3 Joel Speed 2020-12-18 10:35:35 UTC

Unfortunately, nothing in the must-gather gives any insight into why that pod is crash looping; for some reason there is no record of the pod in the must-gather at all. Can we try to gather some logs from a pod that's crashing like this?
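
One way to collect those logs is to pick out the unhealthy pods from `oc get pods` output and request the logs of the previous (crashed) container instance with `oc logs --previous`. The sketch below only parses text and prints the commands to run; the helper name `unhealthy_pods` and the sample output are hypothetical, while the namespace and container name are the ones that appear in this bug's events and log URLs.

```python
from typing import List


def unhealthy_pods(get_pods_output: str) -> List[str]:
    """Return names of pods that are failing or have restarted at least once."""
    names = []
    for line in get_pods_output.strip().splitlines()[1:]:  # skip header row
        fields = line.split()
        if len(fields) < 4:
            continue
        name, _ready, status, restarts = fields[:4]
        restarted = restarts.isdigit() and int(restarts) > 0
        if status not in ("Running", "Completed") or restarted:
            names.append(name)
    return names


if __name__ == "__main__":
    # Illustrative output only; pod names and counts are made up.
    sample = """\
NAME            READY   STATUS             RESTARTS   AGE
example-pod-a   1/1     Running            0          10m
example-pod-b   0/1     CrashLoopBackOff   3          46m
"""
    for pod in unhealthy_pods(sample):
        # --previous returns logs from the last terminated container instance.
        print(
            f"oc logs -n openshift-machine-api {pod} "
            "-c termination-handler --previous"
        )
```

Running the printed command while the pod is still in back-off is what would capture the crash output that the must-gather lacked.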

Comment 4 Milind Yadav 2020-12-18 13:53:02 UTC

Thanks Joel, here is some more info. I tried again but couldn't reproduce it; you can mark it as not a bug and review further if needed.

Status of the cluster before the workload was applied:

[miyadav@miyadav azure]$ oc get machines
NAME                                             PHASE         TYPE              REGION          ZONE   AGE
whu47az4-pwzmb-master-0                          Running       Standard_D8s_v3   usgovvirginia          2d10h
whu47az4-pwzmb-master-1                          Running       Standard_D8s_v3   usgovvirginia          2d10h
whu47az4-pwzmb-master-2                          Running       Standard_D8s_v3   usgovvirginia          2d10h
whu47az4-pwzmb-worker-usgovvirginia-cbqhb        Running       Standard_D2s_v3   usgovvirginia          28h
whu47az4-pwzmb-worker-usgovvirginia-gmpvc        Running       Standard_D2s_v3   usgovvirginia          29h
whu47az4-pwzmb-worker-usgovvirginia-jf97p        Running       Standard_D2s_v3   usgovvirginia          28h
whu47az4-pwzmb-worker-usgovvirginia-spot-v49bx   Provisioned   Standard_D2s_v3   usgovvirginia          3m13s

[miyadav@miyadav azure]$ oc get machines
NAME                                             PHASE     TYPE              REGION          ZONE   AGE
whu47az4-pwzmb-master-0                          Running   Standard_D8s_v3   usgovvirginia          2d10h
whu47az4-pwzmb-master-1                          Running   Standard_D8s_v3   usgovvirginia          2d10h
whu47az4-pwzmb-master-2                          Running   Standard_D8s_v3   usgovvirginia          2d10h
whu47az4-pwzmb-worker-usgovvirginia-cbqhb        Running   Standard_D2s_v3   usgovvirginia          28h
whu47az4-pwzmb-worker-usgovvirginia-gmpvc        Running   Standard_D2s_v3   usgovvirginia          29h
whu47az4-pwzmb-worker-usgovvirginia-jf97p        Running   Standard_D2s_v3   usgovvirginia          28h
whu47az4-pwzmb-worker-usgovvirginia-spot-v49bx   Running   Standard_D2s_v3   usgovvirginia          4m53s

[miyadav@miyadav azure]$ oc get mhc
NAME                              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
machine-api-termination-handler   100%           0                  0
[miyadav@miyadav azure]$ oc get pods
NAME                                          READY   STATUS    RESTARTS   AGE
cluster-autoscaler-default-7b4c46c4c6-9mvln   1/1     Running   0          10m
cluster-autoscaler-operator-cfdfb5877-l76r6   2/2     Running   0          2d9h
cluster-baremetal-operator-dc464c6f8-xkm86    1/1     Running   0          2d9h
machine-api-controllers-54948c7459-dnr54      7/7     Running   0          2d9h
machine-api-operator-7b4548454-4bnfd          2/2     Running   0          2d9h
[miyadav@miyadav azure]$ 


/////Increased Workload////
[miyadav@miyadav azure]$ oc get machineset
NAME                                       DESIRED   CURRENT   READY   AVAILABLE   AGE
whu47az4-pwzmb-worker-usgovvirginia        3         3         2       2           2d10h
whu47az4-pwzmb-worker-usgovvirginia-spot   12        12        1       1           13m
[miyadav@miyadav azure]$ oc get jobs
NAME               COMPLETIONS   DURATION   AGE
work-queue-25r2g   3/50          5m25s      5m25s
[miyadav@miyadav azure]$ oc get machines
NAME                                             PHASE         TYPE              REGION          ZONE   AGE
whu47az4-pwzmb-master-0                          Running       Standard_D8s_v3   usgovvirginia          2d10h
whu47az4-pwzmb-master-1                          Running       Standard_D8s_v3   usgovvirginia          2d10h
whu47az4-pwzmb-master-2                          Running       Standard_D8s_v3   usgovvirginia          2d10h
whu47az4-pwzmb-worker-usgovvirginia-cbqhb        Running       Standard_D2s_v3   usgovvirginia          28h
whu47az4-pwzmb-worker-usgovvirginia-gmpvc        Running       Standard_D2s_v3   usgovvirginia          30h
whu47az4-pwzmb-worker-usgovvirginia-jf97p        Running       Standard_D2s_v3   usgovvirginia          28h
whu47az4-pwzmb-worker-usgovvirginia-spot-57xf8   Provisioned   Standard_D2s_v3   usgovvirginia          104s
whu47az4-pwzmb-worker-usgovvirginia-spot-7g8hm   Provisioned   Standard_D2s_v3   usgovvirginia          104s
whu47az4-pwzmb-worker-usgovvirginia-spot-7p5kz   Provisioned   Standard_D2s_v3   usgovvirginia          104s
whu47az4-pwzmb-worker-usgovvirginia-spot-d42qh   Provisioned   Standard_D2s_v3   usgovvirginia          104s
whu47az4-pwzmb-worker-usgovvirginia-spot-fmkzw   Provisioned   Standard_D2s_v3   usgovvirginia          104s
whu47az4-pwzmb-worker-usgovvirginia-spot-m4t82   Provisioned   Standard_D2s_v3   usgovvirginia          104s
whu47az4-pwzmb-worker-usgovvirginia-spot-npd2q   Provisioned   Standard_D2s_v3   usgovvirginia          104s
whu47az4-pwzmb-worker-usgovvirginia-spot-rscqq   Provisioned   Standard_D2s_v3   usgovvirginia          104s
whu47az4-pwzmb-worker-usgovvirginia-spot-s89qn   Provisioned   Standard_D2s_v3   usgovvirginia          104s
whu47az4-pwzmb-worker-usgovvirginia-spot-t5dq4   Provisioned   Standard_D2s_v3   usgovvirginia          104s
whu47az4-pwzmb-worker-usgovvirginia-spot-v49bx   Running       Standard_D2s_v3   usgovvirginia          13m
whu47az4-pwzmb-worker-usgovvirginia-spot-zkblg   Provisioned   Standard_D2s_v3   usgovvirginia          104s

[miyadav@miyadav azure]$ oc get machineautoscaler
NAME   REF KIND     REF NAME                                   MIN   MAX   AGE
mas1   MachineSet   whu47az4-pwzmb-worker-usgovvirginia-spot   1     12    3m20s

[miyadav@miyadav azure]$ oc get pods | grep termination
machine-api-termination-handler-2f24w         1/1     Running     0          4m41s
machine-api-termination-handler-74d58         1/1     Running     0          4m30s
machine-api-termination-handler-gnx9b         1/1     Running     0          4m57s
machine-api-termination-handler-hqtnk         1/1     Running     0          5m9s
machine-api-termination-handler-j2gxq         1/1     Running     0          4m17s
machine-api-termination-handler-jnznh         1/1     Running     0          4m53s
machine-api-termination-handler-k9f2k         1/1     Running     0          4m26s
machine-api-termination-handler-mphfs         1/1     Running     0          17m
machine-api-termination-handler-nr66h         1/1     Running     0          4m18s
machine-api-termination-handler-q7sdb         1/1     Running     0          5m6s
machine-api-termination-handler-trkdr         1/1     Running     0          4m27s

After the jobs completed, logs from the cluster autoscaler:
.
.
I1218 13:31:20.319542       1 delete.go:193] Releasing taint {Key:DeletionCandidateOfClusterAutoscaler Value:1608298185 Effect:PreferNoSchedule TimeAdded:<nil>} on node whu47az4-pwzmb-worker-usgovvirginia-spot-v49bx
I1218 13:31:20.341474       1 delete.go:220] Successfully released DeletionCandidateTaint on node whu47az4-pwzmb-worker-usgovvirginia-spot-v49bx
.
.

It seems all spot instances brought up as a result of scaling were deleted successfully.

[miyadav@miyadav azure]$ oc get pods | grep termination
machine-api-termination-handler-2f24w         1/1     Running     0          21m
machine-api-termination-handler-74d58         1/1     Running     0          21m
machine-api-termination-handler-gnx9b         1/1     Running     0          22m
machine-api-termination-handler-hqtnk         1/1     Running     0          22m
machine-api-termination-handler-jnznh         1/1     Running     0          21m
machine-api-termination-handler-k9f2k         1/1     Running     0          21m
machine-api-termination-handler-mphfs         1/1     Running     0          34m
machine-api-termination-handler-nr66h         1/1     Running     0          21m
machine-api-termination-handler-q7sdb         1/1     Running     0          22m
[miyadav@miyadav azure]$ oc logs -f machine-api-termination-handler-2f24w
Error from server: Get "https://10.0.1.16:10250/containerLogs/openshift-machine-api/machine-api-termination-handler-2f24w/termination-handler?follow=true": dial tcp 10.0.1.16:10250: i/o timeout
[miyadav@miyadav azure]$ oc logs -f machine-api-termination-handler-2f24w
Error from server (NotFound): pods "whu47az4-pwzmb-worker-usgovvirginia-spot-7g8hm" not found
[miyadav@miyadav azure]$ oc logs -f machine-api-termination-handler-2f24w
Error from server (NotFound): pods "machine-api-termination-handler-2f24w" not found
[miyadav@miyadav azure]$ oc get pods | grep termination
machine-api-termination-handler-74d58         1/1     Running     0          22m
machine-api-termination-handler-gnx9b         1/1     Running     0          23m
machine-api-termination-handler-k9f2k         1/1     Running     0          22m
machine-api-termination-handler-mphfs         1/1     Running     0          35m
machine-api-termination-handler-nr66h         1/1     Running     0          22m

[miyadav@miyadav azure]$ oc get machines
NAME                                             PHASE      TYPE              REGION          ZONE   AGE
whu47az4-pwzmb-master-0                          Running    Standard_D8s_v3   usgovvirginia          2d10h
whu47az4-pwzmb-master-1                          Running    Standard_D8s_v3   usgovvirginia          2d10h
whu47az4-pwzmb-master-2                          Running    Standard_D8s_v3   usgovvirginia          2d10h
whu47az4-pwzmb-worker-usgovvirginia-cbqhb        Running    Standard_D2s_v3   usgovvirginia          29h
whu47az4-pwzmb-worker-usgovvirginia-gmpvc        Running    Standard_D2s_v3   usgovvirginia          30h
whu47az4-pwzmb-worker-usgovvirginia-jf97p        Running    Standard_D2s_v3   usgovvirginia          29h
whu47az4-pwzmb-worker-usgovvirginia-spot-7g8hm   Deleting   Standard_D2s_v3   usgovvirginia          30m
whu47az4-pwzmb-worker-usgovvirginia-spot-fmkzw   Deleting   Standard_D2s_v3   usgovvirginia          30m
whu47az4-pwzmb-worker-usgovvirginia-spot-m4t82   Deleting   Standard_D2s_v3   usgovvirginia          30m
whu47az4-pwzmb-worker-usgovvirginia-spot-npd2q   Deleting   Standard_D2s_v3   usgovvirginia          30m
whu47az4-pwzmb-worker-usgovvirginia-spot-t5dq4   Deleting   Standard_D2s_v3   usgovvirginia          30m
whu47az4-pwzmb-worker-usgovvirginia-spot-v49bx   Running    Standard_D2s_v3   usgovvirginia          42m
whu47az4-pwzmb-worker-usgovvirginia-spot-zkblg   Deleting   Standard_D2s_v3   usgovvirginia          30m

////State after all workload was removed///
[miyadav@miyadav azure]$ oc get machines
NAME                                             PHASE     TYPE              REGION          ZONE   AGE
whu47az4-pwzmb-master-0                          Running   Standard_D8s_v3   usgovvirginia          2d10h
whu47az4-pwzmb-master-1                          Running   Standard_D8s_v3   usgovvirginia          2d10h
whu47az4-pwzmb-master-2                          Running   Standard_D8s_v3   usgovvirginia          2d10h
whu47az4-pwzmb-worker-usgovvirginia-cbqhb        Running   Standard_D2s_v3   usgovvirginia          29h
whu47az4-pwzmb-worker-usgovvirginia-gmpvc        Running   Standard_D2s_v3   usgovvirginia          30h
whu47az4-pwzmb-worker-usgovvirginia-jf97p        Running   Standard_D2s_v3   usgovvirginia          29h
whu47az4-pwzmb-worker-usgovvirginia-spot-v49bx   Running   Standard_D2s_v3   usgovvirginia          44m


Additional info:
Couldn't reproduce it; I tried multiple times today. Not sure how it happened yesterday.

The only other detail I have about the pod that was crashing yesterday is in the events in yesterday's comment.

Comment 5 Joel Speed 2020-12-18 14:05:47 UTC

We did have another BZ in which the pods were crash looping; it could be that this was covered when that was fixed.

Out of interest, we had previously been unable to create spot instances on govcloud, but you seem to not be having any issues, are they definitely coming up as spot instances if you check the console?

Comment 6 Joel Speed 2021-01-05 13:48:40 UTC

Hey @miyadav, I think this issue is no longer needed based on the results you've seen above. Do you have a moment to answer my question below before we close it?

> Out of interest, we had previously been unable to create spot instances on govcloud, but you seem to not be having any issues, are they definitely coming up as spot instances if you check the console?

Comment 7 Milind Yadav 2021-01-05 13:58:59 UTC

@Joel, I don't remember checking the console, but the YAML should bring up spot instances only; all node references were correct and the workload was being executed on those nodes.

Earlier, when we were not able to create spot instances, it used to fail.

I will keep an eye out next time I bring up a govcloud cluster to make sure the spot instances shown in the Azure console are the ones created by us.

Comment 8 Joel Speed 2021-01-05 14:52:58 UTC

OK, thanks. If you find an issue with it, please open a new BZ; I'm going to close this one for now.