Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1856597

Summary: [gcp] Termination Pod in CrashLoopBackOff
Product: OpenShift Container Platform
Component: Cloud Compute
Sub component: Other Providers
Reporter: sunzhaohua <zhsun>
Assignee: Alexander Demicev <ademicev>
QA Contact: sunzhaohua <zhsun>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
CC: agarcial
Version: 4.6
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-10-27 16:14:10 UTC
Type: Bug

Description sunzhaohua 2020-07-14 03:23:54 UTC
Description of problem:
Termination Pod in CrashLoopBackOff

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-07-13-051854

How reproducible:
Always

Steps to Reproduce:
1. Create a new machineset with "preemptible: true"
2. Check the daemonset and pod
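For reference, the repro hinges on a GCP MachineSet whose providerSpec sets `preemptible: true`. A minimal sketch of the relevant fragment (the name, machine type, and other fields here are placeholders, not taken from this report; only `preemptible: true` is the setting under test):

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: worker-preemptible          # placeholder name
  namespace: openshift-machine-api
spec:
  replicas: 1
  template:
    spec:
      providerSpec:
        value:
          apiVersion: gcpprovider.openshift.io/v1beta1
          kind: GCPMachineProviderSpec
          machineType: n1-standard-4   # placeholder
          preemptible: true            # marks the instance interruptible, so the
                                       # machine-api-termination-handler is scheduled to it
```

Nodes backing such machines get the `machine.openshift.io/interruptible-instance=` label, which is the node selector the termination-handler DaemonSet shown below matches on.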

Actual results:
Termination Pod in CrashLoopBackOff

$ oc get po
NAME                                           READY   STATUS             RESTARTS   AGE
cluster-autoscaler-operator-687b6fbc59-rlzkv   2/2     Running            0          16h
machine-api-controllers-6cd4ffdcf6-7btc5       7/7     Running            0          16h
machine-api-operator-7d4c85fd56-m9h4w          2/2     Running            0          16h
machine-api-termination-handler-fd8zr          0/1     CrashLoopBackOff   4          2m17s

$ oc get ds
NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                  AGE
machine-api-termination-handler   1         1         0       1            0           machine.openshift.io/interruptible-instance=   16h

$ oc describe po machine-api-termination-handler-fd8zr

Events:
  Type     Reason     Age                   From                                                               Message
  ----     ------     ----                  ----                                                               -------
  Normal   Scheduled  <unknown>             default-scheduler                                                  Successfully assigned openshift-machine-api/machine-api-termination-handler-fd8zr to zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal
  Normal   Pulling    13m                   kubelet, zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal  Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6339413d2db5ade826ae908c23f64f373255262f596d83c10d1e4a9264d42420"
  Normal   Pulled     13m                   kubelet, zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal  Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6339413d2db5ade826ae908c23f64f373255262f596d83c10d1e4a9264d42420"
  Normal   Created    11m (x5 over 13m)     kubelet, zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal  Created container termination-handler
  Normal   Started    11m (x5 over 13m)     kubelet, zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal  Started container termination-handler
  Normal   Pulled     11m (x4 over 13m)     kubelet, zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6339413d2db5ade826ae908c23f64f373255262f596d83c10d1e4a9264d42420" already present on machine
  Warning  BackOff    3m12s (x45 over 13m)  kubelet, zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal  Back-off restarting failed container

$ oc logs -f machine-api-termination-handler-fd8zr
I0714 02:24:55.358035       1 request.go:557] Throttling request took 84.573682ms, request: GET:https://172.30.0.1:443/apis/storage.k8s.io/v1?timeout=32s
I0714 02:24:55.408147       1 request.go:557] Throttling request took 134.681181ms, request: GET:https://172.30.0.1:443/apis/storage.k8s.io/v1beta1?timeout=32s
....
I0714 02:24:57.408197       1 request.go:557] Throttling request took 2.134326471s, request: GET:https://172.30.0.1:443/apis/machine.openshift.io/v1beta1?timeout=32s
I0714 02:24:57.458176       1 request.go:557] Throttling request took 2.184308594s, request: GET:https://172.30.0.1:443/apis/snapshot.storage.k8s.io/v1beta1?timeout=32s
E0714 02:24:57.462637       1 main.go:70]  "msg"="Error starting termination handler" "error"="error fetching machine for node (\"zhsun713gcp-l87m9-worker-f-p4cv5.c.openshift-qe.internal\"): error listing machines: no kind is registered for the type v1beta1.MachineList in scheme \"k8s.io/client-go/kubernetes/scheme/register.go:69\""
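The error indicates the handler built its client with client-go's default scheme, which only registers core Kubernetes kinds, so a List for machine.openshift.io/v1beta1 MachineList cannot be decoded. The mechanism can be illustrated with a stripped-down, self-contained stand-in for a runtime scheme (an illustrative sketch only, not the actual client-go types; the real fix registers the machine-api kinds, e.g. via an AddToScheme call, before constructing the client):

```go
package main

import "fmt"

// Scheme is a toy stand-in for apimachinery's runtime.Scheme:
// it maps "group/version, Kind" keys to constructors.
type Scheme struct {
	types map[string]func() interface{}
}

func NewScheme() *Scheme {
	return &Scheme{types: map[string]func() interface{}{}}
}

func (s *Scheme) Register(gvk string, ctor func() interface{}) {
	s.types[gvk] = ctor
}

// New fails the same way the termination handler did when a kind
// was never added to the scheme.
func (s *Scheme) New(gvk string) (interface{}, error) {
	ctor, ok := s.types[gvk]
	if !ok {
		return nil, fmt.Errorf("no kind is registered for the type %s", gvk)
	}
	return ctor(), nil
}

type MachineList struct{ Items []string }

func main() {
	s := NewScheme()
	// Only "core" kinds registered, as with client-go's default scheme.
	s.Register("v1, Kind=Pod", func() interface{} { return struct{}{} })

	gvk := "machine.openshift.io/v1beta1, Kind=MachineList"
	if _, err := s.New(gvk); err != nil {
		fmt.Println("before fix:", err)
	}

	// The fix: register the machine-api kinds before building the client.
	s.Register(gvk, func() interface{} { return &MachineList{} })
	if _, err := s.New(gvk); err == nil {
		fmt.Println("after fix: MachineList decodes")
	}
}
```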

Expected results:
Termination Pod works well.


Additional info:

Comment 1 Alberto 2020-07-14 08:16:26 UTC
This should have been fixed by https://github.com/openshift/cluster-api-provider-gcp/pull/100/commits/f023886d466c3493c0557c5109f3a2ace3c9c639. Does this build contain that commit?

Comment 2 Alberto 2020-07-14 08:17:05 UTC
Sunzhaohua can you please also share the clusterOperator status for machine-api?

Comment 5 sunzhaohua 2020-07-14 10:13:31 UTC
$ oc image info quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6339413d2db5ade826ae908c23f64f373255262f596d83c10d1e4a9264d42420
 io.openshift.build.commit.id=245b58110669b3ae2859fe55b884c7d9d7d74379

So this build contains that commit. I will test this with the latest nightly build.

$ oc get co machine-api -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
  creationTimestamp: "2020-07-13T09:12:16Z"
  generation: 1
  managedFields:
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:exclude.release.openshift.io/internal-openshift-hosted: {}
      f:spec: {}
      f:status:
        .: {}
        f:extension: {}
    manager: cluster-version-operator
    operation: Update
    time: "2020-07-13T09:12:16Z"
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions: {}
        f:relatedObjects: {}
        f:versions: {}
    manager: machine-api-operator
    operation: Update
    time: "2020-07-14T10:02:33Z"
  name: machine-api
  resourceVersion: "899798"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/machine-api
  uid: d95ba359-c091-42c5-b614-cc734f9995cc
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-07-13T09:32:01Z"
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-07-13T09:28:29Z"
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2020-07-13T09:32:01Z"
    message: 'Cluster Machine API Operator is available at operator: 4.6.0-0.nightly-2020-07-13-051854'
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2020-07-13T09:28:29Z"
    status: "True"
    type: Upgradeable
  extension: null
  relatedObjects:
  - group: ""
    name: openshift-machine-api
    resource: namespaces
  - group: machine.openshift.io
    name: ""
    namespace: openshift-machine-api
    resource: machines
  - group: machine.openshift.io
    name: ""
    namespace: openshift-machine-api
    resource: machinesets
  - group: machine.openshift.io
    name: ""
    namespace: openshift-machine-api
    resource: machinehealthchecks
  - group: rbac.authorization.k8s.io
    name: ""
    namespace: openshift-machine-api
    resource: roles
  - group: rbac.authorization.k8s.io
    name: machine-api-operator
    resource: clusterroles
  - group: rbac.authorization.k8s.io
    name: machine-api-controllers
    resource: clusterroles
  - group: metal3.io
    name: ""
    namespace: openshift-machine-api
    resource: baremetalhosts
  versions:
  - name: operator
    version: 4.6.0-0.nightly-2020-07-13-051854

Comment 6 Joel Speed 2020-07-15 12:38:50 UTC
@sunzhaohua please hold off on re-verifying this for now; a new PR is needed because the original fix did not resolve the issue.

Comment 8 sunzhaohua 2020-07-20 03:20:20 UTC
Verified
clusterversion: 4.6.0-0.nightly-2020-07-19-093912

$ oc get po
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-5f7f65668f-4lmz4   2/2     Running   0          56m
machine-api-controllers-8d59c9c6f-q29kw        7/7     Running   0          55m
machine-api-operator-f799f5dc5-lfnjh           2/2     Running   0          56m
machine-api-termination-handler-nmg8z          1/1     Running   0          12m

$ oc get ds
NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                  AGE
machine-api-termination-handler   1         1         1       1            1           machine.openshift.io/interruptible-instance=   51m

Comment 10 errata-xmlrpc 2020-10-27 16:14:10 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196