Description of problem:
Termination handler pod in CrashLoopBackOff.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-07-13-051854

How reproducible:
Always

Steps to Reproduce:
1. Create a new machineset with "preemptible: true"
2. Check the daemonset and pod

Actual results:
The termination handler pod is in CrashLoopBackOff.

$ oc get po
NAME                                           READY   STATUS             RESTARTS   AGE
cluster-autoscaler-operator-687b6fbc59-rlzkv   2/2     Running            0          16h
machine-api-controllers-6cd4ffdcf6-7btc5       7/7     Running            0          16h
machine-api-operator-7d4c85fd56-m9h4w          2/2     Running            0          16h
machine-api-termination-handler-fd8zr          0/1     CrashLoopBackOff   4          2m17s

$ oc get ds
NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                  AGE
machine-api-termination-handler   1         1         0       1            0           machine.openshift.io/interruptible-instance=   16h

$ oc describe po machine-api-termination-handler-fd8zr
Events:
  Type     Reason     Age                   From                                                               Message
  ----     ------     ----                  ----                                                               -------
  Normal   Scheduled  <unknown>             default-scheduler                                                  Successfully assigned openshift-machine-api/machine-api-termination-handler-fd8zr to zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal
  Normal   Pulling    13m                   kubelet, zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal  Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6339413d2db5ade826ae908c23f64f373255262f596d83c10d1e4a9264d42420"
  Normal   Pulled     13m                   kubelet, zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal  Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6339413d2db5ade826ae908c23f64f373255262f596d83c10d1e4a9264d42420"
  Normal   Created    11m (x5 over 13m)     kubelet, zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal  Created container termination-handler
  Normal   Started    11m (x5 over 13m)     kubelet, zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal  Started container termination-handler
  Normal   Pulled     11m (x4 over 13m)     kubelet, zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6339413d2db5ade826ae908c23f64f373255262f596d83c10d1e4a9264d42420" already present on machine
  Warning  BackOff    3m12s (x45 over 13m)  kubelet, zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal  Back-off restarting failed container

$ oc logs -f machine-api-termination-handler-fd8zr
I0714 02:24:55.358035 1 request.go:557] Throttling request took 84.573682ms, request: GET:https://172.30.0.1:443/apis/storage.k8s.io/v1?timeout=32s
I0714 02:24:55.408147 1 request.go:557] Throttling request took 134.681181ms, request: GET:https://172.30.0.1:443/apis/storage.k8s.io/v1beta1?timeout=32s
....
I0714 02:24:57.408197 1 request.go:557] Throttling request took 2.134326471s, request: GET:https://172.30.0.1:443/apis/machine.openshift.io/v1beta1?timeout=32s
I0714 02:24:57.458176 1 request.go:557] Throttling request took 2.184308594s, request: GET:https://172.30.0.1:443/apis/snapshot.storage.k8s.io/v1beta1?timeout=32s
E0714 02:24:57.462637 1 main.go:70] "msg"="Error starting termination handler" "error"="error fetching machine for node (\"zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal\"): error listing machines: no kind is registered for the type v1beta1.MachineList in scheme \"k8s.io/client-go/kubernetes/scheme/register.go:69\""

Expected results:
The termination handler pod runs without crashing.

Additional info:
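The fatal line at the end of the log is the actual crash reason: the handler's client was built with a scheme that has no registration for the machine.openshift.io/v1beta1 kinds, so listing machines fails immediately on startup. As a quick triage sketch for pulling that error out of a saved log (the log file here is stubbed with the exact error line from this report; on a live cluster the input would come from `oc logs machine-api-termination-handler-fd8zr`):

```shell
# Stub the pod log with the fatal line reported above (illustrative input).
cat > termination-handler.log <<'EOF'
E0714 02:24:57.462637 1 main.go:70] "msg"="Error starting termination handler" "error"="error fetching machine for node (\"zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal\"): error listing machines: no kind is registered for the type v1beta1.MachineList in scheme \"k8s.io/client-go/kubernetes/scheme/register.go:69\""
EOF

# Extract just the "error" value from the structured log line.
sed -n 's/.*"error"="\(.*\)"$/\1/p' termination-handler.log
```

The throttling messages above the error are routine client-side request throttling noise; only the `Error starting termination handler` line is relevant to the crash.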
This should have been fixed by https://github.com/openshift/cluster-api-provider-gcp/pull/100/commits/f023886d466c3493c0557c5109f3a2ace3c9c639. Does this build contain that commit?
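One way to answer that is `oc adm release info --commits <release-pullspec>`, which lists the repository commits baked into a release payload. A sketch of extracting the cluster-api-provider-gcp commit from that output (the saved output line below is a stub in the command's "name / repo / commit" shape, with a placeholder SHA; run the real command against the nightly's pullspec for live data):

```shell
# Stubbed `oc adm release info --commits` output (placeholder commit SHA).
cat > release-commits.txt <<'EOF'
cluster-api-provider-gcp  https://github.com/openshift/cluster-api-provider-gcp  0123456789abcdef0123456789abcdef01234567
EOF

# Pull out the commit recorded for the provider repo.
awk '/cluster-api-provider-gcp/ {print $NF}' release-commits.txt
```

Note that the payload commit is the repo HEAD at build time, so ancestry rather than equality is the right check, e.g. in a provider checkout: `git merge-base --is-ancestor f023886d466c3493c0557c5109f3a2ace3c9c639 <payload-commit>`.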
Sunzhaohua, can you please also share the ClusterOperator status for machine-api?
$ oc image info quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6339413d2db5ade826ae908c23f64f373255262f596d83c10d1e4a9264d42420
io.openshift.build.commit.id=245b58110669b3ae2859fe55b884c7d9d7d74379

So this build contains that commit. I will test this with the latest nightly build.

$ oc get co machine-api -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
  creationTimestamp: "2020-07-13T09:12:16Z"
  generation: 1
  managedFields:
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:exclude.release.openshift.io/internal-openshift-hosted: {}
      f:spec: {}
      f:status:
        .: {}
        f:extension: {}
    manager: cluster-version-operator
    operation: Update
    time: "2020-07-13T09:12:16Z"
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions: {}
        f:relatedObjects: {}
        f:versions: {}
    manager: machine-api-operator
    operation: Update
    time: "2020-07-14T10:02:33Z"
  name: machine-api
  resourceVersion: "899798"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/machine-api
  uid: d95ba359-c091-42c5-b614-cc734f9995cc
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-07-13T09:32:01Z"
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-07-13T09:28:29Z"
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2020-07-13T09:32:01Z"
    message: 'Cluster Machine API Operator is available at operator: 4.6.0-0.nightly-2020-07-13-051854'
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2020-07-13T09:28:29Z"
    status: "True"
    type: Upgradeable
  extension: null
  relatedObjects:
  - group: ""
    name: openshift-machine-api
    resource: namespaces
  - group: machine.openshift.io
    name: ""
    namespace: openshift-machine-api
    resource: machines
  - group: machine.openshift.io
    name: ""
    namespace: openshift-machine-api
    resource: machinesets
  - group: machine.openshift.io
    name: ""
    namespace: openshift-machine-api
    resource: machinehealthchecks
  - group: rbac.authorization.k8s.io
    name: ""
    namespace: openshift-machine-api
    resource: roles
  - group: rbac.authorization.k8s.io
    name: machine-api-operator
    resource: clusterroles
  - group: rbac.authorization.k8s.io
    name: machine-api-controllers
    resource: clusterroles
  - group: metal3.io
    name: ""
    namespace: openshift-machine-api
    resource: baremetalhosts
  versions:
  - name: operator
    version: 4.6.0-0.nightly-2020-07-13-051854
@sunzhaohua please hold off on re-verifying this for now; a new PR is needed, as the original fix did not resolve the issue.
Verified.
clusterversion: 4.6.0-0.nightly-2020-07-19-093912

$ oc get po
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-5f7f65668f-4lmz4   2/2     Running   0          56m
machine-api-controllers-8d59c9c6f-q29kw        7/7     Running   0          55m
machine-api-operator-f799f5dc5-lfnjh           2/2     Running   0          56m
machine-api-termination-handler-nmg8z          1/1     Running   0          12m

$ oc get ds
NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                  AGE
machine-api-termination-handler   1         1         1       1            1           machine.openshift.io/interruptible-instance=   51m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196