1929777 – oVirt CSI driver operator is constantly restarting

Summary: oVirt CSI driver operator is constantly restarting

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Storage
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.7.z
Assignee:	Benny Zlotnik
QA Contact:	michal
Docs Contact:
URL:
Whiteboard:
Depends On:	1929733
Blocks:

Reported:	2021-02-17 16:05 UTC by Gal Zaidman
Modified:	2021-03-16 08:43 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Known Issue
Doc Text:	Cause: Config informers added to the operator infrastructure in 4.7 were not started. Consequence: The oVirt CSI driver operator cannot sync its informer caches and shuts down after a few minutes, leading to constant restarts.
Clone Of:
Environment:
Last Closed:	2021-03-16 08:42:49 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift ovirt-csi-driver-operator pull 49	0	None	open	[release-4.7] Bug 1929777: Run config informers when starting the operator	2021-02-22 16:18:38 UTC
Red Hat Product Errata	RHBA-2021:0749	0	None	None	None	2021-03-16 08:43:10 UTC

Description Gal Zaidman 2021-02-17 16:05:35 UTC

Description of problem:

The oVirt CSI driver operator has been constantly restarting since its inception:

Containers:
  ovirt-csi-driver-operator:
    Container ID:  cri-o://4aded21619cec53cd9c6c06ffd1988909059d66acc23720a0895431ce775968d
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:86a675ddbace0069c6d860629724f1dcebccc639fc032093afa04ec7e13b1940
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:86a675ddbace0069c6d860629724f1dcebccc639fc032093afa04ec7e13b1940
    Port:          <none>
    Host Port:     <none>
    Args:
      start
      --node=$(KUBE_NODE_NAME)
      -v=2
    State:          Running
      Started:      Wed, 17 Feb 2021 13:12:15 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 17 Feb 2021 13:01:08 +0200
      Finished:     Wed, 17 Feb 2021 13:12:14 +0200
    Ready:          True
    Restart Count:  945


This happens because the configInformers in the operator code were not started [1]. As a result, the ConfigObserver failed to sync its cache.

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Should happen on any cluster >= 4.7 with the oVirt CSI driver operator

Actual results:
The oVirt CSI driver operator pod keeps restarting

Expected results:
The operator should not restart unless there is a real issue


[1] https://github.com/openshift/ovirt-csi-driver-operator/blob/master/pkg/operator/starter.go#L128

Comment 1 michal 2021-02-17 19:02:16 UTC

steps to reproduce: 
1) oc project openshift-cluster-csi-drivers
2) oc status
In project openshift-cluster-csi-drivers on server https://api.primary.ocp.rhev.lab.eng.brq.redhat.com:6443

deployment/ovirt-csi-driver-controller deploys quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0feb29efe901393bf80594af53ec8bbef34bbc6303c71cdfb7c779bacc461531,quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:def80d6439c31c03f4d5e5bfa4f209bddfd3b7423d38d90f483a1ad1a10c0e01,quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b3a0f319143cdd04122e50490ffa60e93024e18ace3c105041c432f2daf961fa,quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f14455d69f404747e4458528744fd8ab9c2f5243004b3f7bff0323e73072b681
  deployment #1 running for 13 days - 1 pod

deployment/ovirt-csi-driver-operator deploys quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:86a675ddbace0069c6d860629724f1dcebccc639fc032093afa04ec7e13b1940
  deployment #1 running for 13 days - 1 pod (warning: 974 restarts)
3) The output shows a warning about the operator restarts
4) [root@ocp-qe-1 primary]# oc get pods
NAME                                           READY   STATUS    RESTARTS   AGE
ovirt-csi-driver-controller-7db477884c-tflht   4/4     Running   0          7d14h
ovirt-csi-driver-node-8qnxw                    3/3     Running   0          13d
ovirt-csi-driver-node-h5xvc                    3/3     Running   0          13d
ovirt-csi-driver-node-jtf7s                    3/3     Running   1          13d
ovirt-csi-driver-node-lnxmx                    3/3     Running   0          13d
ovirt-csi-driver-node-sg2td                    3/3     Running   0          13d
ovirt-csi-driver-node-wvnbm                    3/3     Running   0          13d
ovirt-csi-driver-operator-89d7bb77b-rn2m5      1/1     Running   975        7d14h
5) oc logs pod/ovirt-csi-driver-operator-89d7bb77b-rn2m5 -n openshift-cluster-csi-drivers
6) oc describe pod/ovirt-csi-driver-operator-89d7bb77b-rn2m5 -n openshift-cluster-csi-drivers
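
The manual check in step 4 could be automated along the following lines. This is a rough Go sketch that assumes the default `oc get pods` column layout (NAME, READY, STATUS, RESTARTS, AGE). The function name `highRestartPods` and the threshold of 100 are hypothetical choices for illustration, and the sample rows are copied from the output above.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// highRestartPods parses tabular `oc get pods` output and returns the
// pods whose RESTARTS column exceeds threshold, keyed by pod name.
func highRestartPods(output string, threshold int) map[string]int {
	flagged := make(map[string]int)
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		// Expect NAME READY STATUS RESTARTS AGE; skip the header
		// row and any line that is too short to be a pod row.
		if len(fields) < 5 || fields[0] == "NAME" {
			continue
		}
		restarts, err := strconv.Atoi(fields[3])
		if err != nil {
			continue // RESTARTS column is not a plain integer
		}
		if restarts > threshold {
			flagged[fields[0]] = restarts
		}
	}
	return flagged
}

func main() {
	sample := `NAME                                           READY   STATUS    RESTARTS   AGE
ovirt-csi-driver-controller-7db477884c-tflht   4/4     Running   0          7d14h
ovirt-csi-driver-node-jtf7s                    3/3     Running   1          13d
ovirt-csi-driver-operator-89d7bb77b-rn2m5      1/1     Running   975        7d14h`

	for name, n := range highRestartPods(sample, 100) {
		fmt.Printf("%s restarted %d times\n", name, n)
	}
}
```

With the sample rows above, only the operator pod exceeds the threshold. The same function can be applied to the output captured during verification to confirm that no pod is flagged.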

Comment 2 Peter Lauterbach 2021-02-22 15:27:53 UTC

This is not a blocker for OCP 4.7.0, but it needs to be fixed in the first available OCP 4.7.z release.

Comment 3 Peter Lauterbach 2021-02-22 19:22:56 UTC

Please suggest some reasonable language on this issue for the "known problems" section of the OCP 4.7 release notes.

Comment 6 michal 2021-03-08 19:59:36 UTC

Verified on: OCP 4.7.0-0.nightly-2021-03-06-183610
RHV 4.4.4.7

Steps:
1) oc project openshift-cluster-csi-drivers
2) oc status (check for warnings: none reported)
3) oc get pods

Results: no pod shows a high restart count.

Comment 8 errata-xmlrpc 2021-03-16 08:42:49 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.2 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0749
