Bug 1908350
| Summary: | [Azure gov ]machine-api-termination-handler CrashLoopBackOff | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Milind Yadav <miyadav> |
| Component: | Cloud Compute | Assignee: | Joel Speed <jspeed> |
| Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED WORKSFORME | Docs Contact: | |
| Severity: | medium | ||
| Priority: | high | CC: | jspeed, mgugino, zhsun |
| Version: | 4.7 | ||
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | 1907286 | Environment: | |
| Last Closed: | 2021-01-05 14:52:58 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
The termination handler shouldn't affect scale down. It is only for cases where the cloud provider evicts the spot instance and notifies the machine-api that it needs to be deleted. We'll need a must-gather, as always, to investigate.

Unfortunately, there's nothing in the must-gather that gives any insight into why that pod is crash looping. There is no record of the pod in the must-gather for some reason. Can we try to gather some logs from a pod that's crashing like this?

Thanks Joel, here is some more info. I tried again but couldn't reproduce it; you can mark it as not a bug and review further if needed.
Cluster status before the workload was applied:
[miyadav@miyadav azure]$ oc get machines
NAME PHASE TYPE REGION ZONE AGE
whu47az4-pwzmb-master-0 Running Standard_D8s_v3 usgovvirginia 2d10h
whu47az4-pwzmb-master-1 Running Standard_D8s_v3 usgovvirginia 2d10h
whu47az4-pwzmb-master-2 Running Standard_D8s_v3 usgovvirginia 2d10h
whu47az4-pwzmb-worker-usgovvirginia-cbqhb Running Standard_D2s_v3 usgovvirginia 28h
whu47az4-pwzmb-worker-usgovvirginia-gmpvc Running Standard_D2s_v3 usgovvirginia 29h
whu47az4-pwzmb-worker-usgovvirginia-jf97p Running Standard_D2s_v3 usgovvirginia 28h
whu47az4-pwzmb-worker-usgovvirginia-spot-v49bx Provisioned Standard_D2s_v3 usgovvirginia 3m13s
[miyadav@miyadav azure]$ oc get machines
NAME PHASE TYPE REGION ZONE AGE
whu47az4-pwzmb-master-0 Running Standard_D8s_v3 usgovvirginia 2d10h
whu47az4-pwzmb-master-1 Running Standard_D8s_v3 usgovvirginia 2d10h
whu47az4-pwzmb-master-2 Running Standard_D8s_v3 usgovvirginia 2d10h
whu47az4-pwzmb-worker-usgovvirginia-cbqhb Running Standard_D2s_v3 usgovvirginia 28h
whu47az4-pwzmb-worker-usgovvirginia-gmpvc Running Standard_D2s_v3 usgovvirginia 29h
whu47az4-pwzmb-worker-usgovvirginia-jf97p Running Standard_D2s_v3 usgovvirginia 28h
whu47az4-pwzmb-worker-usgovvirginia-spot-v49bx Running Standard_D2s_v3 usgovvirginia 4m53s
[miyadav@miyadav azure]$ oc get mhc
NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY
machine-api-termination-handler 100% 0 0
[miyadav@miyadav azure]$ oc get pods
NAME READY STATUS RESTARTS AGE
cluster-autoscaler-default-7b4c46c4c6-9mvln 1/1 Running 0 10m
cluster-autoscaler-operator-cfdfb5877-l76r6 2/2 Running 0 2d9h
cluster-baremetal-operator-dc464c6f8-xkm86 1/1 Running 0 2d9h
machine-api-controllers-54948c7459-dnr54 7/7 Running 0 2d9h
machine-api-operator-7b4548454-4bnfd 2/2 Running 0 2d9h
[miyadav@miyadav azure]$
/////Increased Workload////
[miyadav@miyadav azure]$ oc get machineset
NAME DESIRED CURRENT READY AVAILABLE AGE
whu47az4-pwzmb-worker-usgovvirginia 3 3 2 2 2d10h
whu47az4-pwzmb-worker-usgovvirginia-spot 12 12 1 1 13m
[miyadav@miyadav azure]$ oc get jobs
NAME COMPLETIONS DURATION AGE
work-queue-25r2g 3/50 5m25s 5m25s
[miyadav@miyadav azure]$ oc get machines
NAME PHASE TYPE REGION ZONE AGE
whu47az4-pwzmb-master-0 Running Standard_D8s_v3 usgovvirginia 2d10h
whu47az4-pwzmb-master-1 Running Standard_D8s_v3 usgovvirginia 2d10h
whu47az4-pwzmb-master-2 Running Standard_D8s_v3 usgovvirginia 2d10h
whu47az4-pwzmb-worker-usgovvirginia-cbqhb Running Standard_D2s_v3 usgovvirginia 28h
whu47az4-pwzmb-worker-usgovvirginia-gmpvc Running Standard_D2s_v3 usgovvirginia 30h
whu47az4-pwzmb-worker-usgovvirginia-jf97p Running Standard_D2s_v3 usgovvirginia 28h
whu47az4-pwzmb-worker-usgovvirginia-spot-57xf8 Provisioned Standard_D2s_v3 usgovvirginia 104s
whu47az4-pwzmb-worker-usgovvirginia-spot-7g8hm Provisioned Standard_D2s_v3 usgovvirginia 104s
whu47az4-pwzmb-worker-usgovvirginia-spot-7p5kz Provisioned Standard_D2s_v3 usgovvirginia 104s
whu47az4-pwzmb-worker-usgovvirginia-spot-d42qh Provisioned Standard_D2s_v3 usgovvirginia 104s
whu47az4-pwzmb-worker-usgovvirginia-spot-fmkzw Provisioned Standard_D2s_v3 usgovvirginia 104s
whu47az4-pwzmb-worker-usgovvirginia-spot-m4t82 Provisioned Standard_D2s_v3 usgovvirginia 104s
whu47az4-pwzmb-worker-usgovvirginia-spot-npd2q Provisioned Standard_D2s_v3 usgovvirginia 104s
whu47az4-pwzmb-worker-usgovvirginia-spot-rscqq Provisioned Standard_D2s_v3 usgovvirginia 104s
whu47az4-pwzmb-worker-usgovvirginia-spot-s89qn Provisioned Standard_D2s_v3 usgovvirginia 104s
whu47az4-pwzmb-worker-usgovvirginia-spot-t5dq4 Provisioned Standard_D2s_v3 usgovvirginia 104s
whu47az4-pwzmb-worker-usgovvirginia-spot-v49bx Running Standard_D2s_v3 usgovvirginia 13m
whu47az4-pwzmb-worker-usgovvirginia-spot-zkblg Provisioned Standard_D2s_v3 usgovvirginia 104s
[miyadav@miyadav azure]$ oc get machineautoscaler
NAME REF KIND REF NAME MIN MAX AGE
mas1 MachineSet whu47az4-pwzmb-worker-usgovvirginia-spot 1 12 3m20s
[miyadav@miyadav azure]$ oc get pods | grep termination
machine-api-termination-handler-2f24w 1/1 Running 0 4m41s
machine-api-termination-handler-74d58 1/1 Running 0 4m30s
machine-api-termination-handler-gnx9b 1/1 Running 0 4m57s
machine-api-termination-handler-hqtnk 1/1 Running 0 5m9s
machine-api-termination-handler-j2gxq 1/1 Running 0 4m17s
machine-api-termination-handler-jnznh 1/1 Running 0 4m53s
machine-api-termination-handler-k9f2k 1/1 Running 0 4m26s
machine-api-termination-handler-mphfs 1/1 Running 0 17m
machine-api-termination-handler-nr66h 1/1 Running 0 4m18s
machine-api-termination-handler-q7sdb 1/1 Running 0 5m6s
machine-api-termination-handler-trkdr 1/1 Running 0 4m27s
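A pod listing like the one above can be checked for crash-looping termination handlers mechanically. The following is a minimal sketch, assuming the five-column `NAME READY STATUS RESTARTS AGE` layout shown in this report; `parse_pod_rows` and `suspicious_pods` are hypothetical helper names, not part of any OpenShift tooling, and rows in other layouts (for example a RESTARTS column that carries extra text) are skipped.

```python
from typing import List, NamedTuple


class PodRow(NamedTuple):
    name: str
    ready: str
    status: str
    restarts: int
    age: str


def parse_pod_rows(listing: str) -> List[PodRow]:
    """Parse rows shaped like `oc get pods` output (NAME READY STATUS RESTARTS AGE)."""
    rows = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything that is not a 5-column row.
        if len(fields) != 5 or fields[0] == "NAME":
            continue
        name, ready, status, restarts, age = fields
        if not restarts.isdigit():
            continue
        rows.append(PodRow(name, ready, status, int(restarts), age))
    return rows


def suspicious_pods(rows: List[PodRow], prefix: str) -> List[PodRow]:
    """Return pods with the given name prefix that are not Running or have restarted."""
    return [
        r for r in rows
        if r.name.startswith(prefix) and (r.status != "Running" or r.restarts > 0)
    ]


# Two rows copied from the listing in this report.
sample = """\
machine-api-termination-handler-2f24w 1/1 Running 0 4m41s
machine-api-termination-handler-mphfs 1/1 Running 0 17m
"""
print(suspicious_pods(parse_pod_rows(sample), "machine-api-termination-handler"))  # prints []
```

For the listing captured during this run the result is empty, which matches the observation that the crash loop could not be reproduced.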
After the jobs completed, logs from the cluster autoscaler:
.
.
I1218 13:31:20.319542 1 delete.go:193] Releasing taint {Key:DeletionCandidateOfClusterAutoscaler Value:1608298185 Effect:PreferNoSchedule TimeAdded:<nil>} on node whu47az4-pwzmb-worker-usgovvirginia-spot-v49bx
I1218 13:31:20.341474 1 delete.go:220] Successfully released DeletionCandidateTaint on node whu47az4-pwzmb-worker-usgovvirginia-spot-v49bx
.
.
It seems all spot instances brought up as a result of scaling were deleted successfully.
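The autoscaler log lines above use the klog header layout (`Lmmdd hh:mm:ss.uuuuuu threadid file:line] message`). The sketch below splits such a line into its fields and decodes the taint `Value`, on the assumption that the cluster autoscaler stores the time the taint was added as a Unix timestamp. `parse_klog_line` and `taint_added_at` are hypothetical helper names introduced only for illustration.

```python
import re
from datetime import datetime, timezone

# klog header layout: Lmmdd hh:mm:ss.uuuuuu threadid file:line] message
KLOG_RE = re.compile(
    r"^(?P<severity>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<thread>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] (?P<message>.*)$"
)
TAINT_VALUE_RE = re.compile(r"Value:(\d+)")


def parse_klog_line(line):
    """Return the klog header fields and message as a dict, or None if no match."""
    m = KLOG_RE.match(line.strip())
    return m.groupdict() if m else None


def taint_added_at(message):
    """Decode the taint Value as a UTC time (assumes it is a Unix timestamp)."""
    m = TAINT_VALUE_RE.search(message)
    if m is None:
        return None
    return datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc)


line = (
    "I1218 13:31:20.319542 1 delete.go:193] Releasing taint "
    "{Key:DeletionCandidateOfClusterAutoscaler Value:1608298185 "
    "Effect:PreferNoSchedule TimeAdded:<nil>} on node "
    "whu47az4-pwzmb-worker-usgovvirginia-spot-v49bx"
)
parsed = parse_klog_line(line)
print(parsed["file"], parsed["line"])            # prints: delete.go 193
print(taint_added_at(parsed["message"]))         # prints: 2020-12-18 13:29:45+00:00
```

If the pod's log clock is also UTC (an assumption; klog lines carry no year or time zone), the candidate taint was released roughly a minute and a half after it was added.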
[miyadav@miyadav azure]$ oc get pods | grep termination
machine-api-termination-handler-2f24w 1/1 Running 0 21m
machine-api-termination-handler-74d58 1/1 Running 0 21m
machine-api-termination-handler-gnx9b 1/1 Running 0 22m
machine-api-termination-handler-hqtnk 1/1 Running 0 22m
machine-api-termination-handler-jnznh 1/1 Running 0 21m
machine-api-termination-handler-k9f2k 1/1 Running 0 21m
machine-api-termination-handler-mphfs 1/1 Running 0 34m
machine-api-termination-handler-nr66h 1/1 Running 0 21m
machine-api-termination-handler-q7sdb 1/1 Running 0 22m
[miyadav@miyadav azure]$ oc logs -f machine-api-termination-handler-2f24w
Error from server: Get "https://10.0.1.16:10250/containerLogs/openshift-machine-api/machine-api-termination-handler-2f24w/termination-handler?follow=true": dial tcp 10.0.1.16:10250: i/o timeout
[miyadav@miyadav azure]$ oc logs -f machine-api-termination-handler-2f24w
Error from server (NotFound): pods "whu47az4-pwzmb-worker-usgovvirginia-spot-7g8hm" not found
[miyadav@miyadav azure]$ oc logs -f machine-api-termination-handler-2f24w
Error from server (NotFound): pods "machine-api-termination-handler-2f24w" not found
[miyadav@miyadav azure]$ oc get pods | grep termination
machine-api-termination-handler-74d58 1/1 Running 0 22m
machine-api-termination-handler-gnx9b 1/1 Running 0 23m
machine-api-termination-handler-k9f2k 1/1 Running 0 22m
machine-api-termination-handler-mphfs 1/1 Running 0 35m
machine-api-termination-handler-nr66h 1/1 Running 0 22m
[miyadav@miyadav azure]$ oc get machines
NAME PHASE TYPE REGION ZONE AGE
whu47az4-pwzmb-master-0 Running Standard_D8s_v3 usgovvirginia 2d10h
whu47az4-pwzmb-master-1 Running Standard_D8s_v3 usgovvirginia 2d10h
whu47az4-pwzmb-master-2 Running Standard_D8s_v3 usgovvirginia 2d10h
whu47az4-pwzmb-worker-usgovvirginia-cbqhb Running Standard_D2s_v3 usgovvirginia 29h
whu47az4-pwzmb-worker-usgovvirginia-gmpvc Running Standard_D2s_v3 usgovvirginia 30h
whu47az4-pwzmb-worker-usgovvirginia-jf97p Running Standard_D2s_v3 usgovvirginia 29h
whu47az4-pwzmb-worker-usgovvirginia-spot-7g8hm Deleting Standard_D2s_v3 usgovvirginia 30m
whu47az4-pwzmb-worker-usgovvirginia-spot-fmkzw Deleting Standard_D2s_v3 usgovvirginia 30m
whu47az4-pwzmb-worker-usgovvirginia-spot-m4t82 Deleting Standard_D2s_v3 usgovvirginia 30m
whu47az4-pwzmb-worker-usgovvirginia-spot-npd2q Deleting Standard_D2s_v3 usgovvirginia 30m
whu47az4-pwzmb-worker-usgovvirginia-spot-t5dq4 Deleting Standard_D2s_v3 usgovvirginia 30m
whu47az4-pwzmb-worker-usgovvirginia-spot-v49bx Running Standard_D2s_v3 usgovvirginia 42m
whu47az4-pwzmb-worker-usgovvirginia-spot-zkblg Deleting Standard_D2s_v3 usgovvirginia 30m
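To follow the scale-down without reading each row, the machine listing above can be reduced to a count per phase. This is a minimal sketch that assumes NAME and PHASE are the first two whitespace-separated columns, as in the output shown here; a machine with an empty phase would shift the columns and be miscounted. `phase_counts` is a hypothetical helper name.

```python
from collections import Counter


def phase_counts(listing, name_contains=""):
    """Count machines per PHASE in text shaped like `oc get machines` output."""
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        # Expect at least NAME and PHASE; skip blank lines and the header row.
        if len(fields) < 2 or fields[0] == "NAME":
            continue
        name, phase = fields[0], fields[1]
        if name_contains in name:
            counts[phase] += 1
    return counts


# A few rows copied from the listing in this report.
sample = """\
NAME PHASE TYPE REGION ZONE AGE
whu47az4-pwzmb-master-0 Running Standard_D8s_v3 usgovvirginia 2d10h
whu47az4-pwzmb-worker-usgovvirginia-spot-7g8hm Deleting Standard_D2s_v3 usgovvirginia 30m
whu47az4-pwzmb-worker-usgovvirginia-spot-fmkzw Deleting Standard_D2s_v3 usgovvirginia 30m
whu47az4-pwzmb-worker-usgovvirginia-spot-v49bx Running Standard_D2s_v3 usgovvirginia 42m
"""
print(dict(phase_counts(sample, "-spot-")))  # prints {'Deleting': 2, 'Running': 1}
```

Applied to the full listing above, the spot machines come out as six Deleting and one Running, which is consistent with the machine autoscaler's minimum of 1.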
////State after all workload was removed///
[miyadav@miyadav azure]$ oc get machines
NAME PHASE TYPE REGION ZONE AGE
whu47az4-pwzmb-master-0 Running Standard_D8s_v3 usgovvirginia 2d10h
whu47az4-pwzmb-master-1 Running Standard_D8s_v3 usgovvirginia 2d10h
whu47az4-pwzmb-master-2 Running Standard_D8s_v3 usgovvirginia 2d10h
whu47az4-pwzmb-worker-usgovvirginia-cbqhb Running Standard_D2s_v3 usgovvirginia 29h
whu47az4-pwzmb-worker-usgovvirginia-gmpvc Running Standard_D2s_v3 usgovvirginia 30h
whu47az4-pwzmb-worker-usgovvirginia-jf97p Running Standard_D2s_v3 usgovvirginia 29h
whu47az4-pwzmb-worker-usgovvirginia-spot-v49bx Running Standard_D2s_v3 usgovvirginia 44m
Additional info:
Couldn't reproduce it; I tried multiple times today and am not sure how it happened yesterday.
The only other detail I have about the pod that was crashing yesterday is in the events in yesterday's comment.
We did have another BZ which had the pods crash looping; it could be that this was covered when that was fixed. Out of interest, we had previously been unable to create spot instances on govcloud, but you don't seem to be having any issues. Are they definitely coming up as spot instances if you check the console?

Hey @miyadav, I think this issue is no longer needed based on the results you've seen above. Do you have a moment to answer my question below before we close it?
> Out of interest, we had previously been unable to create spot instances on govcloud, but you seem to not be having any issues, are they definitely coming up as spot instances if you check the console?
@Joel, I don't remember checking the console, but the YAML should bring up spot instances only; all node references were correct and the workload was being executed on those nodes. Earlier, when we were not able to create spot instances, it used to fail. Next time I bring up govcloud I will keep an eye on the Azure console to make sure the instances we create show up there as spot instances.

Ok thanks, if you find an issue with it please open a new BZ. Going to close this one for now.
Description of problem:
Cluster autoscaler not able to scale down spot instances on Azure Gov cloud

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2020-12-14-165231

Steps to Reproduce:
1. Create a machineset by copying an existing machineset and changing its name.
.
.
To create spot instances, modify the spec as below:
spec:
  metadata: {}
  providerSpec:
    value:
      spotVMOptions:
        maxPrice: 0.225
.
.
Expected and actual: machineset created successfully

2. Create the cluster autoscaler using the YAML below:
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
Expected and actual: clusterautoscaler created successfully

3. Create a machineautoscaler referring to the spot machineset, as below:
.
.
spec:
  maxReplicas: 6
  minReplicas: 1
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: whu47az4-pwzmb-worker-usgovvirginia-spot
status:
  lastTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: whu47az4-pwzmb-worker-usgovvirginia-spot
Actual and expected: machineautoscaler created successfully

4. Create a workload to scale the spot machineset.
.
.
Create the workload using the YAML below:
apiVersion: batch/v1
kind: Job
metadata:
  generateName: work-queue-
spec:
  template:
    spec:
      containers:
      - name: work
        image: quay.io/openshifttest/busybox@sha256:xyz
        command: ["sleep", "300"]
        resources:
          requests:
            memory: 500Mi
            cpu: 500m
      restartPolicy: Never
  backoffLimit: 4
  completions: 50
  parallelism: 50
.
.
Actual and expected: workload created successfully

5. Spot instances scaled up successfully.

6. Delete the workload and wait for some time (10 mins).
Actual: machines could not scale down, as the spot instance termination handler is crashing
Expected: spot instances should be released

Additional info:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 46m default-scheduler Successfully assigned openshift-machine-api/machine-api-termination-handler-4s4ls to whu47az4-pwzmb-worker-usgovvirginia-spot-w4hl5
Normal Pulling 46m kubelet Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:xyzf445734cf6d83a478bd"
Normal Pulled 46m kubelet Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ee0846a75604e7602cfed2bf2e918ed3ccef2733e0a29f445734cf6d83a478bd" in 7.141122758s
Normal Pulled 25m (x2 over 25m) kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:xyzccef2733e0a29f445734cf6d83a478bd" already present on machine
Normal Created 24m (x3 over 46m) kubelet Created container termination-handler
Normal Started 24m (x3 over 46m) kubelet Started container termination-handler
Warning BackOff 24m (x3 over 25m) kubelet Back-off restarting failed container
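The events above show the container restarting but not why it exited. One way to capture that, sketched here under stated assumptions, is to request the logs of the previous container instance with `oc logs --previous` and the pod description with `oc describe pod`. The pod, namespace, and container names are taken from this report; `previous_logs_cmd`, `describe_cmd`, and `run` are hypothetical helpers, and running them requires a logged-in `oc` client and a pod that still exists.

```python
import subprocess


def previous_logs_cmd(pod, namespace, container):
    """Build the `oc logs` call for the previous (crashed) container instance."""
    return ["oc", "logs", pod, "-n", namespace, "-c", container, "--previous"]


def describe_cmd(pod, namespace):
    """Build the `oc describe pod` call, which includes last state and events."""
    return ["oc", "describe", "pod", pod, "-n", namespace]


def run(cmd):
    """Run a command and return its stdout; raises CalledProcessError on failure."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


cmd = previous_logs_cmd(
    "machine-api-termination-handler-4s4ls",  # pod name from the events in this report
    "openshift-machine-api",
    "termination-handler",
)
print(" ".join(cmd))
```

Because spot nodes and their pods can disappear within minutes, as the `NotFound` errors earlier in this report show, the output would need to be collected while the pod is still in back-off.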