Bug 1929777

Summary: oVirt CSI driver operator is constantly restarting
Product: OpenShift Container Platform
Reporter: Gal Zaidman <gzaidman>
Component: Storage
Assignee: Benny Zlotnik <bzlotnik>
Storage sub component: oVirt CSI Driver
QA Contact: michal <mgold>
Status: CLOSED ERRATA
Docs Contact:
Severity: urgent
Priority: urgent
CC: aos-bugs, bzlotnik, lleistne, mburman, mgold, pelauter
Version: 4.7
Target Milestone: ---
Target Release: 4.7.z
Hardware: Unspecified
OS: Unspecified
Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: Informers added to the operator infrastructure in 4.7 were not started.
Consequence: The operator's caches never sync, so the oVirt CSI operator shuts down after a few minutes, leading to constant restarts.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-03-16 08:42:49 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On: 1929733
Bug Blocks:

Description Gal Zaidman 2021-02-17 16:05:35 UTC
Description of problem:

The oVirt CSI driver operator has been constantly restarting since its inception:

    Container ID:  cri-o://4aded21619cec53cd9c6c06ffd1988909059d66acc23720a0895431ce775968d
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:86a675ddbace0069c6d860629724f1dcebccc639fc032093afa04ec7e13b1940
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:86a675ddbace0069c6d860629724f1dcebccc639fc032093afa04ec7e13b1940
    Port:          <none>
    Host Port:     <none>
    State:          Running
      Started:      Wed, 17 Feb 2021 13:12:15 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 17 Feb 2021 13:01:08 +0200
      Finished:     Wed, 17 Feb 2021 13:12:14 +0200
    Ready:          True
    Restart Count:  945

This happens because the configInformers in the operator code were not started [1]. As a result, the ConfigObserver failed to sync its cache.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Should happen on any cluster >=4.7 with the oVirt CSI driver operator

Actual results:
The oVirt CSI driver operator pod keeps restarting

Expected results:
The operator should not restart unless there is a real issue

[1] https://github.com/openshift/ovirt-csi-driver-operator/blob/master/pkg/operator/starter.go#L128

Comment 1 michal 2021-02-17 19:02:16 UTC
steps to reproduce: 
1) oc project openshift-cluster-csi-drivers
2) oc status
In project openshift-cluster-csi-drivers on server https://api.primary.ocp.rhev.lab.eng.brq.redhat.com:6443

deployment/ovirt-csi-driver-controller deploys quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0feb29efe901393bf80594af53ec8bbef34bbc6303c71cdfb7c779bacc461531,quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:def80d6439c31c03f4d5e5bfa4f209bddfd3b7423d38d90f483a1ad1a10c0e01,quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b3a0f319143cdd04122e50490ffa60e93024e18ace3c105041c432f2daf961fa,quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f14455d69f404747e4458528744fd8ab9c2f5243004b3f7bff0323e73072b681
  deployment #1 running for 13 days - 1 pod

deployment/ovirt-csi-driver-operator deploys quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:86a675ddbace0069c6d860629724f1dcebccc639fc032093afa04ec7e13b1940
  deployment #1 running for 13 days - 1 pod (warning: 974 restarts)
3) oc status reports a warning for the operator deployment (974 restarts)
4) [root@ocp-qe-1 primary]# oc get pods
NAME                                           READY   STATUS    RESTARTS   AGE
ovirt-csi-driver-controller-7db477884c-tflht   4/4     Running   0          7d14h
ovirt-csi-driver-node-8qnxw                    3/3     Running   0          13d
ovirt-csi-driver-node-h5xvc                    3/3     Running   0          13d
ovirt-csi-driver-node-jtf7s                    3/3     Running   1          13d
ovirt-csi-driver-node-lnxmx                    3/3     Running   0          13d
ovirt-csi-driver-node-sg2td                    3/3     Running   0          13d
ovirt-csi-driver-node-wvnbm                    3/3     Running   0          13d
ovirt-csi-driver-operator-89d7bb77b-rn2m5      1/1     Running   975        7d14h
5) oc logs pod/ovirt-csi-driver-operator-89d7bb77b-rn2m5 -n openshift-cluster-csi-drivers
6) oc describe pod/ovirt-csi-driver-operator-89d7bb77b-rn2m5 -n openshift-cluster-csi-drivers
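
The manual check in step 4 could be automated along these lines. This is a minimal Go sketch that assumes the default `oc get pods` column order (NAME, READY, STATUS, RESTARTS, AGE); the function name highRestartPods and the threshold of 10 are illustrative choices, not part of any tool.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// highRestartPods is a hypothetical helper. It returns the names of pods
// whose RESTARTS column is greater than threshold, assuming the default
// table layout: NAME READY STATUS RESTARTS AGE.
func highRestartPods(table string, threshold int) []string {
	var names []string
	sc := bufio.NewScanner(strings.NewReader(table))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Skip the header row and any line that is too short to be a pod row.
		if len(fields) < 5 || fields[0] == "NAME" {
			continue
		}
		restarts, err := strconv.Atoi(fields[3])
		if err != nil {
			continue // RESTARTS column is not a plain number; ignore the row
		}
		if restarts > threshold {
			names = append(names, fields[0])
		}
	}
	return names
}

func main() {
	// Rows copied from the `oc get pods` output above.
	sample := `NAME                                           READY   STATUS    RESTARTS   AGE
ovirt-csi-driver-controller-7db477884c-tflht   4/4     Running   0          7d14h
ovirt-csi-driver-node-jtf7s                    3/3     Running   1          13d
ovirt-csi-driver-operator-89d7bb77b-rn2m5      1/1     Running   975        7d14h`

	fmt.Println(highRestartPods(sample, 10)) // prints [ovirt-csi-driver-operator-89d7bb77b-rn2m5]
}
```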

Comment 2 Peter Lauterbach 2021-02-22 15:27:53 UTC
This is not a blocker for OCP 4.7.0, but it needs to be fixed in the first available OCP 4.7.z stream.

Comment 3 Peter Lauterbach 2021-02-22 19:22:56 UTC
Please suggest some reasonable language on this issue for the "known problems" section of the OCP 4.7 release notes.

Comment 6 michal 2021-03-08 19:59:36 UTC
Verified on: OCP 4.7.0-0.nightly-2021-03-06-183610

1) oc project openshift-cluster-csi-drivers
2) oc status - check for warnings -> no warnings
3) oc get pods

Results: no pod shows a high restart count

Comment 8 errata-xmlrpc 2021-03-16 08:42:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.2 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.