Bug 1856597 - [gcp] Termination Pod in CrashLoopBackOff
Summary: [gcp] Termination Pod in CrashLoopBackOff
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Alexander Demicev
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-14 03:23 UTC by sunzhaohua
Modified: 2020-10-27 16:14 UTC (History)
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:14:10 UTC
Target Upstream Version:
Embargoed:


Attachments: None


Links
Github openshift/cluster-api-provider-gcp pull 103 (closed): BUG 1856597: Pass scheme to client creation so that it uses scheme with Machine API (last updated 2020-08-05 10:10:12 UTC)
Red Hat Product Errata RHBA-2020:4196 (last updated 2020-10-27 16:14:24 UTC)

Description sunzhaohua 2020-07-14 03:23:54 UTC
Description of problem:
Termination Pod in CrashLoopBackOff

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-07-13-051854

How reproducible:
Always

Steps to Reproduce:
1. Create a new machineset with "preemptible: true"
2. Check the daemonset and pod

Actual results:
The machine-api-termination-handler pod is in CrashLoopBackOff:

$ oc get po
NAME                                           READY   STATUS             RESTARTS   AGE
cluster-autoscaler-operator-687b6fbc59-rlzkv   2/2     Running            0          16h
machine-api-controllers-6cd4ffdcf6-7btc5       7/7     Running            0          16h
machine-api-operator-7d4c85fd56-m9h4w          2/2     Running            0          16h
machine-api-termination-handler-fd8zr          0/1     CrashLoopBackOff   4          2m17s

$ oc get ds
NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                  AGE
machine-api-termination-handler   1         1         0       1            0           machine.openshift.io/interruptible-instance=   16h

$ oc describe po machine-api-termination-handler-fd8zr

Events:
  Type     Reason     Age                   From                                                               Message
  ----     ------     ----                  ----                                                               -------
  Normal   Scheduled  <unknown>             default-scheduler                                                  Successfully assigned openshift-machine-api/machine-api-termination-handler-fd8zr to zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal
  Normal   Pulling    13m                   kubelet, zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal  Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6339413d2db5ade826ae908c23f64f373255262f596d83c10d1e4a9264d42420"
  Normal   Pulled     13m                   kubelet, zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal  Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6339413d2db5ade826ae908c23f64f373255262f596d83c10d1e4a9264d42420"
  Normal   Created    11m (x5 over 13m)     kubelet, zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal  Created container termination-handler
  Normal   Started    11m (x5 over 13m)     kubelet, zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal  Started container termination-handler
  Normal   Pulled     11m (x4 over 13m)     kubelet, zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6339413d2db5ade826ae908c23f64f373255262f596d83c10d1e4a9264d42420" already present on machine
  Warning  BackOff    3m12s (x45 over 13m)  kubelet, zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal  Back-off restarting failed container

$ oc logs -f machine-api-termination-handler-fd8zr
I0714 02:24:55.358035       1 request.go:557] Throttling request took 84.573682ms, request: GET:https://172.30.0.1:443/apis/storage.k8s.io/v1?timeout=32s
I0714 02:24:55.408147       1 request.go:557] Throttling request took 134.681181ms, request: GET:https://172.30.0.1:443/apis/storage.k8s.io/v1beta1?timeout=32s
....
I0714 02:24:57.408197       1 request.go:557] Throttling request took 2.134326471s, request: GET:https://172.30.0.1:443/apis/machine.openshift.io/v1beta1?timeout=32s
I0714 02:24:57.458176       1 request.go:557] Throttling request took 2.184308594s, request: GET:https://172.30.0.1:443/apis/snapshot.storage.k8s.io/v1beta1?timeout=32s
E0714 02:24:57.462637       1 main.go:70]  "msg"="Error starting termination handler" "error"="error fetching machine for node (\"zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal\"): error listing machines: no kind is registered for the type v1beta1.MachineList in scheme \"k8s.io/client-go/kubernetes/scheme/register.go:69\""

Expected results:
The machine-api-termination-handler pod runs without crashing.


Additional info:
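The error in the handler log points at the client's runtime scheme: the client is built with only the default client-go scheme, which registers core Kubernetes kinds but not machine.openshift.io/v1beta1, so listing MachineList objects fails with "no kind is registered". The pull request in the Links section ("Pass scheme to client creation so that it uses scheme with Machine API") addresses exactly this. Below is a minimal, illustrative Go sketch of the pattern, not the actual change from that PR; the machinev1beta1 import path is assumed to be the 4.6-era github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1 package (later releases moved these types elsewhere).

package main

import (
	"context"
	"fmt"

	// Assumed 4.6-era import path for the Machine API v1beta1 types.
	machinev1beta1 "github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1"
	"k8s.io/apimachinery/pkg/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// newMachineAwareClient builds a controller-runtime client whose scheme
// knows both the core Kubernetes kinds and the machine.openshift.io/v1beta1
// kinds, so Machine/MachineList lookups resolve to a registered type.
func newMachineAwareClient() (client.Client, error) {
	s := runtime.NewScheme()
	if err := clientgoscheme.AddToScheme(s); err != nil {
		return nil, err
	}
	if err := machinev1beta1.AddToScheme(s); err != nil {
		return nil, err
	}
	cfg, err := ctrl.GetConfig()
	if err != nil {
		return nil, err
	}
	// Passing the scheme explicitly is the important part; a client built
	// without it falls back to a scheme that has no MachineList kind.
	return client.New(cfg, client.Options{Scheme: s})
}

func main() {
	c, err := newMachineAwareClient()
	if err != nil {
		panic(err)
	}
	// With the Machine API types registered, this List call no longer fails
	// with "no kind is registered for the type v1beta1.MachineList".
	machines := &machinev1beta1.MachineList{}
	if err := c.List(context.Background(), machines,
		client.InNamespace("openshift-machine-api")); err != nil {
		panic(err)
	}
	fmt.Printf("found %d machines\n", len(machines.Items))
}

The key detail is client.Options{Scheme: s}: if the Machine API types are not added to the scheme passed at client creation, any lookup of the machine for a node fails in the way shown in the log above.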

Comment 1 Alberto 2020-07-14 08:16:26 UTC
This should have been fixed by https://github.com/openshift/cluster-api-provider-gcp/pull/100/commits/f023886d466c3493c0557c5109f3a2ace3c9c639. Does this build contain that commit?

Comment 2 Alberto 2020-07-14 08:17:05 UTC
Sunzhaohua, can you please also share the ClusterOperator status for machine-api?

Comment 5 sunzhaohua 2020-07-14 10:13:31 UTC
$ oc image info quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6339413d2db5ade826ae908c23f64f373255262f596d83c10d1e4a9264d42420
 io.openshift.build.commit.id=245b58110669b3ae2859fe55b884c7d9d7d74379

So this build contains that commit. I will test this with the latest nightly build.

$ oc get co machine-api -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
  creationTimestamp: "2020-07-13T09:12:16Z"
  generation: 1
  managedFields:
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:exclude.release.openshift.io/internal-openshift-hosted: {}
      f:spec: {}
      f:status:
        .: {}
        f:extension: {}
    manager: cluster-version-operator
    operation: Update
    time: "2020-07-13T09:12:16Z"
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions: {}
        f:relatedObjects: {}
        f:versions: {}
    manager: machine-api-operator
    operation: Update
    time: "2020-07-14T10:02:33Z"
  name: machine-api
  resourceVersion: "899798"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/machine-api
  uid: d95ba359-c091-42c5-b614-cc734f9995cc
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-07-13T09:32:01Z"
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-07-13T09:28:29Z"
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2020-07-13T09:32:01Z"
    message: 'Cluster Machine API Operator is available at operator: 4.6.0-0.nightly-2020-07-13-051854'
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2020-07-13T09:28:29Z"
    status: "True"
    type: Upgradeable
  extension: null
  relatedObjects:
  - group: ""
    name: openshift-machine-api
    resource: namespaces
  - group: machine.openshift.io
    name: ""
    namespace: openshift-machine-api
    resource: machines
  - group: machine.openshift.io
    name: ""
    namespace: openshift-machine-api
    resource: machinesets
  - group: machine.openshift.io
    name: ""
    namespace: openshift-machine-api
    resource: machinehealthchecks
  - group: rbac.authorization.k8s.io
    name: ""
    namespace: openshift-machine-api
    resource: roles
  - group: rbac.authorization.k8s.io
    name: machine-api-operator
    resource: clusterroles
  - group: rbac.authorization.k8s.io
    name: machine-api-controllers
    resource: clusterroles
  - group: metal3.io
    name: ""
    namespace: openshift-machine-api
    resource: baremetalhosts
  versions:
  - name: operator
    version: 4.6.0-0.nightly-2020-07-13-051854

Comment 6 Joel Speed 2020-07-15 12:38:50 UTC
@sunzhaohua please hold off on re-verifying this for now; a new PR is needed because the original fix did not resolve the issue.

Comment 8 sunzhaohua 2020-07-20 03:20:20 UTC
Verified
clusterversion: 4.6.0-0.nightly-2020-07-19-093912

$ oc get po
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-5f7f65668f-4lmz4   2/2     Running   0          56m
machine-api-controllers-8d59c9c6f-q29kw        7/7     Running   0          55m
machine-api-operator-f799f5dc5-lfnjh           2/2     Running   0          56m
machine-api-termination-handler-nmg8z          1/1     Running   0          12m

$ oc get ds
NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                  AGE
machine-api-termination-handler   1         1         1       1            1           machine.openshift.io/interruptible-instance=   51m

Comment 10 errata-xmlrpc 2020-10-27 16:14:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

