1929733 – oVirt CSI driver operator is constantly restarting

Summary: oVirt CSI driver operator is constantly restarting

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Storage
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Benny Zlotnik
QA Contact:	michal
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1929777

Reported:	2021-02-17 14:19 UTC by Benny Zlotnik
Modified:	2021-07-27 22:47 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-07-27 22:45:14 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 22:47:27 UTC

Description Benny Zlotnik 2021-02-17 14:19:05 UTC

Description of problem:

The oVirt CSI driver operator has been constantly restarting since its inception.

Containers:
  ovirt-csi-driver-operator:
    Container ID:  cri-o://4aded21619cec53cd9c6c06ffd1988909059d66acc23720a0895431ce775968d
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:86a675ddbace0069c6d860629724f1dcebccc639fc032093afa04ec7e13b1940
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:86a675ddbace0069c6d860629724f1dcebccc639fc032093afa04ec7e13b1940
    Port:          <none>
    Host Port:     <none>
    Args:
      start
      --node=$(KUBE_NODE_NAME)
      -v=2
    State:          Running
      Started:      Wed, 17 Feb 2021 13:12:15 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 17 Feb 2021 13:01:08 +0200
      Finished:     Wed, 17 Feb 2021 13:12:14 +0200
    Ready:          True
    Restart Count:  945


This happens because the configInformers in the operator code were never started [1]. As a result, the ConfigObserver is unable to sync its caches.

The error in the operator log:
706171       1 shared_informer.go:266] stop requested
707427       1 base_controller.go:95] unable to sync caches for ConfigObserver

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Should happen on any cluster >4.7 with the oVirt CSI driver operator

Actual results:
ovirt CSI driver operator pod keeps restarting

Expected results:
The operator should not restart unless there is a real issue


[1] https://github.com/openshift/ovirt-csi-driver-operator/blob/master/pkg/operator/starter.go#L128

Comment 2 michal 2021-02-17 16:57:53 UTC

steps to reproduce: 
1) oc project openshift-cluster-csi-drivers
2) oc status
In project openshift-cluster-csi-drivers on server https://api.primary.ocp.rhev.lab.eng.brq.redhat.com:6443

deployment/ovirt-csi-driver-controller deploys quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0feb29efe901393bf80594af53ec8bbef34bbc6303c71cdfb7c779bacc461531,quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:def80d6439c31c03f4d5e5bfa4f209bddfd3b7423d38d90f483a1ad1a10c0e01,quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b3a0f319143cdd04122e50490ffa60e93024e18ace3c105041c432f2daf961fa,quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f14455d69f404747e4458528744fd8ab9c2f5243004b3f7bff0323e73072b681
  deployment #1 running for 13 days - 1 pod

deployment/ovirt-csi-driver-operator deploys quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:86a675ddbace0069c6d860629724f1dcebccc639fc032093afa04ec7e13b1940
  deployment #1 running for 13 days - 1 pod (warning: 974 restarts)
3) The output shows a warning (974 restarts) for the operator deployment
4) [root@ocp-qe-1 primary]# oc get pods
NAME                                           READY   STATUS    RESTARTS   AGE
ovirt-csi-driver-controller-7db477884c-tflht   4/4     Running   0          7d14h
ovirt-csi-driver-node-8qnxw                    3/3     Running   0          13d
ovirt-csi-driver-node-h5xvc                    3/3     Running   0          13d
ovirt-csi-driver-node-jtf7s                    3/3     Running   1          13d
ovirt-csi-driver-node-lnxmx                    3/3     Running   0          13d
ovirt-csi-driver-node-sg2td                    3/3     Running   0          13d
ovirt-csi-driver-node-wvnbm                    3/3     Running   0          13d
ovirt-csi-driver-operator-89d7bb77b-rn2m5      1/1     Running   975        7d14h
5) oc logs pod/ovirt-csi-driver-operator-89d7bb77b-rn2m5 -n openshift-cluster-csi-drivers
6) oc describe pod/ovirt-csi-driver-operator-89d7bb77b-rn2m5 -n openshift-cluster-csi-drivers
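
The manual check in the steps above (reading the RESTARTS column of `oc get pods`) could be automated along these lines. This is a minimal sketch that parses saved command output as text; the highRestartPods helper and the threshold of 10 are assumptions made for illustration, not part of any oc or OpenShift API.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// podRestarts pairs a pod name with its restart count.
type podRestarts struct {
	Name     string
	Restarts int
}

// highRestartPods parses the tabular output of `oc get pods` and returns
// the pods whose restart count is greater than threshold.
func highRestartPods(output string, threshold int) []podRestarts {
	var flagged []podRestarts
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Expected columns: NAME READY STATUS RESTARTS AGE
		if len(fields) < 5 || fields[0] == "NAME" {
			continue
		}
		n, err := strconv.Atoi(fields[3])
		if err != nil {
			continue // skip lines that are not pod rows
		}
		if n > threshold {
			flagged = append(flagged, podRestarts{Name: fields[0], Restarts: n})
		}
	}
	return flagged
}

func main() {
	// Sample rows copied from the `oc get pods` output in this comment.
	out := `NAME                                           READY   STATUS    RESTARTS   AGE
ovirt-csi-driver-controller-7db477884c-tflht   4/4     Running   0          7d14h
ovirt-csi-driver-node-jtf7s                    3/3     Running   1          13d
ovirt-csi-driver-operator-89d7bb77b-rn2m5      1/1     Running   975        7d14h`

	for _, p := range highRestartPods(out, 10) {
		fmt.Printf("%s restarted %d times\n", p.Name, p.Restarts)
	}
}
```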

Comment 3 Peter Lauterbach 2021-02-22 15:24:46 UTC

This is not a blocker for OCP 4.7.0, but it needs to be fixed in the first available OCP 4.7.z stream.

Comment 4 michal 2021-02-22 15:57:12 UTC

ocp: 4.8.0-0.nightly-2021-02-22-111248 
ovirt: 4.4.2.6-1.el8

steps to reproduce: 
1) install 4.8 cluster
2) oc project openshift-cluster-csi-drivers
3) oc status -> no warning is shown
4) oc get pods
NAME                                           READY   STATUS    RESTARTS   AGE
ovirt-csi-driver-controller-5bcbbd4c47-7kvld   4/4     Running   0          127m
ovirt-csi-driver-node-4r2mg                    3/3     Running   0          127m
ovirt-csi-driver-node-6tx6f                    3/3     Running   0          127m
ovirt-csi-driver-node-7prgg                    3/3     Running   1          112m
ovirt-csi-driver-node-bf54l                    3/3     Running   0          113m
ovirt-csi-driver-node-r5jxd                    3/3     Running   0          127m
ovirt-csi-driver-operator-8487469d4f-j72ft     1/1     Running   1          127m
No pod shows a large number of restarts.

Comment 7 errata-xmlrpc 2021-07-27 22:45:14 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
