Bug 1929777 - oVirt CSI driver operator is constantly restarting
Summary: oVirt CSI driver operator is constantly restarting
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.z
Assignee: Benny Zlotnik
QA Contact: michal
URL:
Whiteboard:
Depends On: 1929733
Blocks:
 
Reported: 2021-02-17 16:05 UTC by Gal Zaidman
Modified: 2021-03-16 08:43 UTC (History)

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: Config informers added to the operator infrastructure in 4.7 were not started. Consequence: The oVirt CSI driver operator cannot sync its informer caches, so it shuts down after several minutes and restarts constantly.
Clone Of:
Environment:
Last Closed: 2021-03-16 08:42:49 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift ovirt-csi-driver-operator pull 49 | 0 | None | open | [release-4.7] Bug 1929777: Run config informers when starting the operator | 2021-02-22 16:18:38 UTC
Red Hat Product Errata | RHBA-2021:0749 | 0 | None | None | None | 2021-03-16 08:43:10 UTC

Description Gal Zaidman 2021-02-17 16:05:35 UTC
Description of problem:

The oVirt CSI driver operator has been restarting constantly since its inception:

Containers:
  ovirt-csi-driver-operator:
    Container ID:  cri-o://4aded21619cec53cd9c6c06ffd1988909059d66acc23720a0895431ce775968d
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:86a675ddbace0069c6d860629724f1dcebccc639fc032093afa04ec7e13b1940
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:86a675ddbace0069c6d860629724f1dcebccc639fc032093afa04ec7e13b1940
    Port:          <none>
    Host Port:     <none>
    Args:
      start
      --node=$(KUBE_NODE_NAME)
      -v=2
    State:          Running
      Started:      Wed, 17 Feb 2021 13:12:15 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 17 Feb 2021 13:01:08 +0200
      Finished:     Wed, 17 Feb 2021 13:12:14 +0200
    Ready:          True
    Restart Count:  945


This happens because the configInformers in the operator code were never started [1]. As a result, the ConfigObserver fails to sync its cache and the operator exits with an error.
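
To illustrate the failure mode, here is a minimal, stdlib-only sketch of an informer whose cache never syncs because Start was never called. It is a toy model, not the operator's code and not the client-go API: `toyInformer`, `waitForCacheSync`, and `errSyncTimeout` are hypothetical names, and the timeouts are arbitrary.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// toyInformer is a hypothetical stand-in for a shared informer.
// Its cache only reports as synced some time after Start is called.
type toyInformer struct {
	synced int32 // 0 = not synced, 1 = synced; accessed atomically
}

// Start simulates the initial list call that populates the cache.
func (i *toyInformer) Start() {
	go func() {
		time.Sleep(10 * time.Millisecond)
		atomic.StoreInt32(&i.synced, 1)
	}()
}

// HasSynced reports whether the simulated cache has been populated.
func (i *toyInformer) HasSynced() bool {
	return atomic.LoadInt32(&i.synced) == 1
}

var errSyncTimeout = errors.New("timed out waiting for caches to sync")

// waitForCacheSync polls until every informer reports synced,
// or returns errSyncTimeout once the timeout has elapsed.
func waitForCacheSync(timeout time.Duration, informers ...*toyInformer) error {
	deadline := time.Now().Add(timeout)
	for {
		allSynced := true
		for _, inf := range informers {
			if !inf.HasSynced() {
				allSynced = false
				break
			}
		}
		if allSynced {
			return nil
		}
		if time.Now().After(deadline) {
			return errSyncTimeout
		}
		time.Sleep(time.Millisecond)
	}
}

func main() {
	// Informer that was created but never started: the wait can only time out.
	forgotten := &toyInformer{}
	fmt.Println("not started:", waitForCacheSync(100*time.Millisecond, forgotten))

	// Informer that was started: the wait returns once the cache is populated.
	started := &toyInformer{}
	started.Start()
	fmt.Println("started:", waitForCacheSync(time.Second, started))
}
```

In this model, a controller that treats the timeout as fatal would exit and be restarted by its Deployment, which matches the restart loop reported here.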

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Should happen on any cluster running 4.7 or later with the oVirt CSI driver operator

Actual results:
The oVirt CSI driver operator pod keeps restarting

Expected results:
The operator should not restart unless there is a real issue


[1] https://github.com/openshift/ovirt-csi-driver-operator/blob/master/pkg/operator/starter.go#L128

Comment 1 michal 2021-02-17 19:02:16 UTC
steps to reproduce: 
1) oc project openshift-cluster-csi-drivers
2) oc status
In project openshift-cluster-csi-drivers on server https://api.primary.ocp.rhev.lab.eng.brq.redhat.com:6443

deployment/ovirt-csi-driver-controller deploys quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0feb29efe901393bf80594af53ec8bbef34bbc6303c71cdfb7c779bacc461531,quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:def80d6439c31c03f4d5e5bfa4f209bddfd3b7423d38d90f483a1ad1a10c0e01,quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b3a0f319143cdd04122e50490ffa60e93024e18ace3c105041c432f2daf961fa,quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f14455d69f404747e4458528744fd8ab9c2f5243004b3f7bff0323e73072b681
  deployment #1 running for 13 days - 1 pod

deployment/ovirt-csi-driver-operator deploys quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:86a675ddbace0069c6d860629724f1dcebccc639fc032093afa04ec7e13b1940
  deployment #1 running for 13 days - 1 pod (warning: 974 restarts)
3) The output above shows a restart warning for the operator deployment
4) [root@ocp-qe-1 primary]# oc get pods
NAME                                           READY   STATUS    RESTARTS   AGE
ovirt-csi-driver-controller-7db477884c-tflht   4/4     Running   0          7d14h
ovirt-csi-driver-node-8qnxw                    3/3     Running   0          13d
ovirt-csi-driver-node-h5xvc                    3/3     Running   0          13d
ovirt-csi-driver-node-jtf7s                    3/3     Running   1          13d
ovirt-csi-driver-node-lnxmx                    3/3     Running   0          13d
ovirt-csi-driver-node-sg2td                    3/3     Running   0          13d
ovirt-csi-driver-node-wvnbm                    3/3     Running   0          13d
ovirt-csi-driver-operator-89d7bb77b-rn2m5      1/1     Running   975        7d14h
5) oc logs pod/ovirt-csi-driver-operator-89d7bb77b-rn2m5 -n openshift-cluster-csi-drivers
6) oc describe pod/ovirt-csi-driver-operator-89d7bb77b-rn2m5 -n openshift-cluster-csi-drivers
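
The check in step 4 can be automated. The sketch below scans the tabular output of `oc get pods` and returns the pods whose RESTARTS column exceeds a threshold. It assumes the column layout shown above (NAME, READY, STATUS, RESTARTS, AGE); `highRestartPods`, `sample`, and the threshold of 10 are hypothetical and chosen only for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sample is a shortened copy of the `oc get pods` output from this comment.
const sample = `NAME                                           READY   STATUS    RESTARTS   AGE
ovirt-csi-driver-controller-7db477884c-tflht   4/4     Running   0          7d14h
ovirt-csi-driver-node-jtf7s                    3/3     Running   1          13d
ovirt-csi-driver-operator-89d7bb77b-rn2m5      1/1     Running   975        7d14h
`

// highRestartPods returns the names of pods whose restart count is
// greater than threshold. Lines that do not match the assumed
// NAME READY STATUS RESTARTS AGE layout are skipped.
func highRestartPods(output string, threshold int) []string {
	var flagged []string
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 5 || fields[0] == "NAME" {
			continue // header, blank, or malformed line
		}
		restarts, err := strconv.Atoi(fields[3])
		if err != nil {
			continue // RESTARTS column is not a plain integer
		}
		if restarts > threshold {
			flagged = append(flagged, fields[0])
		}
	}
	return flagged
}

func main() {
	for _, name := range highRestartPods(sample, 10) {
		fmt.Println("high restart count:", name)
	}
}
```

With the sample above, only the operator pod (975 restarts) exceeds the threshold; the node pod with a single restart does not.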

Comment 2 Peter Lauterbach 2021-02-22 15:27:53 UTC
This is not a blocker for OCP 4.7.0, but it needs to be fixed in the first available OCP 4.7.z release.

Comment 3 Peter Lauterbach 2021-02-22 19:22:56 UTC
Please suggest some reasonable language on this issue for the "known problems" section of the OCP 4.7 release notes.

Comment 6 michal 2021-03-08 19:59:36 UTC
Verified on: OCP 4.7.0-0.nightly-2021-03-06-183610
RHV 4.4.4.7

steps: 
1) oc project openshift-cluster-csi-drivers
2) oc status (check for warnings: none reported)
3) oc get pods

Results: no pod shows a high restart count.

Comment 8 errata-xmlrpc 2021-03-16 08:42:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.2 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0749

