Bug 1929733

Summary: oVirt CSI driver operator is constantly restarting
Product: OpenShift Container Platform Reporter: Benny Zlotnik <bzlotnik>
Component: Storage Assignee: Benny Zlotnik <bzlotnik>
Storage sub component: oVirt CSI Driver QA Contact: michal <mgold>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: urgent CC: aos-bugs, bbennett, lleistne, mburman, mgold, pelauter
Version: 4.7   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 22:45:14 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1929777    

Description Benny Zlotnik 2021-02-17 14:19:05 UTC
Description of problem:

The oVirt CSI driver operator has been restarting constantly since its inception.

Containers:
  ovirt-csi-driver-operator:
    Container ID:  cri-o://4aded21619cec53cd9c6c06ffd1988909059d66acc23720a0895431ce775968d
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:86a675ddbace0069c6d860629724f1dcebccc639fc032093afa04ec7e13b1940
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:86a675ddbace0069c6d860629724f1dcebccc639fc032093afa04ec7e13b1940
    Port:          <none>
    Host Port:     <none>
    Args:
      start
      --node=$(KUBE_NODE_NAME)
      -v=2
    State:          Running
      Started:      Wed, 17 Feb 2021 13:12:15 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 17 Feb 2021 13:01:08 +0200
      Finished:     Wed, 17 Feb 2021 13:12:14 +0200
    Ready:          True
    Restart Count:  945


This happens because the configInformers in the operator code were never started[1]. As a result, the ConfigObserver is unable to sync its caches, and the operator container exits with an error (exit code 1) and is restarted.

The error in the operator log:
706171       1 shared_informer.go:266] stop requested
707427       1 base_controller.go:95] unable to sync caches for ConfigObserver
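To confirm that a given restart was caused by this failure, the log of the previously terminated container can be searched for the ConfigObserver message quoted above. The following is a minimal POSIX shell sketch, assuming the log text is piped in (for example from oc logs --previous); the function name check_configobserver_error is hypothetical and is not part of oc or the operator.

```shell
# Hypothetical helper: scans operator log text on stdin for the
# ConfigObserver cache-sync failure quoted in this bug report.
# Returns 0 (and prints a notice) if the message is present, 1 otherwise.
#
# Example usage (pod name taken from this report):
#   oc logs --previous pod/ovirt-csi-driver-operator-89d7bb77b-rn2m5 \
#       -n openshift-cluster-csi-drivers | check_configobserver_error
check_configobserver_error() {
    if grep -q 'unable to sync caches for ConfigObserver'; then
        echo "ConfigObserver cache-sync failure found"
        return 0
    fi
    echo "no ConfigObserver cache-sync failure in log"
    return 1
}
```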

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Run any OpenShift 4.7 or later cluster with the oVirt CSI driver operator installed; the restarts should occur on their own.

Actual results:
The oVirt CSI driver operator pod keeps restarting.

Expected results:
The operator should not restart unless there is a real issue


[1] https://github.com/openshift/ovirt-csi-driver-operator/blob/master/pkg/operator/starter.go#L128

Comment 2 michal 2021-02-17 16:57:53 UTC
steps to reproduce: 
1) oc project openshift-cluster-csi-drivers
2) oc status
In project openshift-cluster-csi-drivers on server https://api.primary.ocp.rhev.lab.eng.brq.redhat.com:6443

deployment/ovirt-csi-driver-controller deploys quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0feb29efe901393bf80594af53ec8bbef34bbc6303c71cdfb7c779bacc461531,quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:def80d6439c31c03f4d5e5bfa4f209bddfd3b7423d38d90f483a1ad1a10c0e01,quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b3a0f319143cdd04122e50490ffa60e93024e18ace3c105041c432f2daf961fa,quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f14455d69f404747e4458528744fd8ab9c2f5243004b3f7bff0323e73072b681
  deployment #1 running for 13 days - 1 pod

deployment/ovirt-csi-driver-operator deploys quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:86a675ddbace0069c6d860629724f1dcebccc639fc032093afa04ec7e13b1940
  deployment #1 running for 13 days - 1 pod (warning: 974 restarts)
3) The output shows a restart warning for the operator deployment (974 restarts)
4) [root@ocp-qe-1 primary]# oc get pods
NAME                                           READY   STATUS    RESTARTS   AGE
ovirt-csi-driver-controller-7db477884c-tflht   4/4     Running   0          7d14h
ovirt-csi-driver-node-8qnxw                    3/3     Running   0          13d
ovirt-csi-driver-node-h5xvc                    3/3     Running   0          13d
ovirt-csi-driver-node-jtf7s                    3/3     Running   1          13d
ovirt-csi-driver-node-lnxmx                    3/3     Running   0          13d
ovirt-csi-driver-node-sg2td                    3/3     Running   0          13d
ovirt-csi-driver-node-wvnbm                    3/3     Running   0          13d
ovirt-csi-driver-operator-89d7bb77b-rn2m5      1/1     Running   975        7d14h
5) oc logs pod/ovirt-csi-driver-operator-89d7bb77b-rn2m5 -n openshift-cluster-csi-drivers
6) oc describe pod/ovirt-csi-driver-operator-89d7bb77b-rn2m5 -n openshift-cluster-csi-drivers
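The manual check above (reading the RESTARTS column of oc get pods) can be scripted. The following is a minimal POSIX shell sketch, not an official tool: the function name flag_restarting_pods and the default threshold of 10 restarts are assumptions chosen for illustration. It expects the default table layout shown in step 4, with RESTARTS in the fourth column.

```shell
# Hypothetical helper: reads `oc get pods` output on stdin and prints the
# name and restart count of every pod whose RESTARTS column exceeds a
# threshold (first argument, default 10, an arbitrary illustrative value).
# Exits 1 if any such pod is found, 0 otherwise.
#
# Example usage:
#   oc get pods -n openshift-cluster-csi-drivers | flag_restarting_pods 10
flag_restarting_pods() {
    threshold=${1:-10}
    awk -v max="$threshold" '
        NR == 1 { next }                          # skip the header row
        $4 + 0 > max { print $1, $4; found = 1 }  # RESTARTS is column 4
        END { exit found }                        # 1 if any pod was flagged
    '
}
```

Applied to the output in step 4, it would print only ovirt-csi-driver-operator-89d7bb77b-rn2m5 975 and return a non-zero status.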

Comment 3 Peter Lauterbach 2021-02-22 15:24:46 UTC
This is not a blocker for OCP 4.7.0, but it needs to be fixed in the first available OCP 4.7.z stream.

Comment 4 michal 2021-02-22 15:57:12 UTC
ocp: 4.8.0-0.nightly-2021-02-22-111248 
ovirt: 4.4.2.6-1.el8

steps to reproduce: 
1) install 4.8 cluster
2) oc project openshift-cluster-csi-drivers
3) oc status -> no restart warning is shown
4) oc get pods
NAME                                           READY   STATUS    RESTARTS   AGE
ovirt-csi-driver-controller-5bcbbd4c47-7kvld   4/4     Running   0          127m
ovirt-csi-driver-node-4r2mg                    3/3     Running   0          127m
ovirt-csi-driver-node-6tx6f                    3/3     Running   0          127m
ovirt-csi-driver-node-7prgg                    3/3     Running   1          112m
ovirt-csi-driver-node-bf54l                    3/3     Running   0          113m
ovirt-csi-driver-node-r5jxd                    3/3     Running   0          127m
ovirt-csi-driver-operator-8487469d4f-j72ft     1/1     Running   1          127m
No pod shows an excessive number of restarts.

Comment 7 errata-xmlrpc 2021-07-27 22:45:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438