Bug 2010946
| Summary: | concurrent CRD from ovirt-csi-driver-operator gets reconciled by CVO after deployment, changing CR as well | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Andreas Bleischwitz <ableisch> |
| Component: | Storage | Assignee: | aos-storage-staff <aos-storage-staff> |
| Storage sub component: | oVirt CSI Driver | QA Contact: | Andreas Bleischwitz <a.bleischwitz> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | medium | CC: | a.bleischwitz, aos-bugs, jpasztor, mburman |
| Version: | 4.8 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.10.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-03-10 16:17:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2024491 | | |
Description (Andreas Bleischwitz, 2021-10-05 16:44:59 UTC)
Are you sure that this is happening on a regular installation? I'll verify it next week when I install a new cluster, but just from looking at it, this is how I explain the bug: all the operators other than oVirt have no manifests dir. They deleted it because it is simply not used by the driver, and it is hard to keep it in sync with whatever is being deployed by the storage operator; we kept it mainly for the day-2 use case.

Looking at our manifests directory, we still carry both the CRD [1] and the CR [2]. The CRD actually comes from the OCP API [3], so we probably just need to remove that file from our manifests dir. The CR comes from the storage operator, so I guess it was reconciled after you installed the driver, but that section was removed [4]. I will send a PR to update the manifests dir.

[1] https://github.com/openshift/ovirt-csi-driver-operator/blob/master/manifests/00_crd.yaml#L43-L54
[2] https://github.com/openshift/ovirt-csi-driver-operator/blob/master/manifests/08_cr.yaml#L7-L8
[3] https://docs.openshift.com/container-platform/4.6/rest_api/operator_apis/clustercsidriver-operator-openshift-io-v1.html
[4] https://github.com/openshift/cluster-storage-operator/commit/0463e2adc1a6c8cbbc578387baef341edc537097

Created a PR; please take a look, and if you see that it solves your issue, please comment:

/unhold
/lgtm

Hi Gal,

Not sure about a fresh RHV-IPI cluster, as I only have resources to run one OCP cluster at a time. But I applied the changes from the pull request to my deployment, and the driver is far less chatty about missing permissions and other issues. I'm still checking the functionality of the driver, but so far everything looks good: no pod restarts observed after 3 hours of runtime, and claiming and removing disks also works.

Will the /manifests directory remain in the image?
Currently I'm extracting the yaml files from there in order to do the deployment.

(In reply to Andreas Bleischwitz from comment #6)
> Hi Gal,
>
> Not sure about a fresh RHV-IPI cluster, as I only have resources to run one
> OCP-cluster at a time.
>
> But I applied the changes from the pull-request into my deployment and the
> driver is way less chatty about missing permissions and other stuff.
> Currently I'm still checking the functionality of the driver, but as of now
> all looks good.
>
> No restarts of pods after 3 hours of runtime realized. Claiming and removal
> of disks also works.

Great, update me whenever you are comfortable with it and I will merge and backport.

> Will the /manifests directory remain in the image?

Yes, I looked at your day-2 repo and it just updates the manifests directory.

> Currently I'm extracting the yaml files from there in order to do the deployment.

After having the deployment running for several days now, I haven't seen any pod restarts caused by missing permissions. CVO no longer reconciles the CRD, and the CR can be applied without providing "driverConfig":

```
% oc get crd/clustercsidrivers.operator.openshift.io -o yaml | grep "driverConfig" | wc -l
0
```

Setting to verified.

Forgot to mention: I tested based on ovirt-csi-driver-operator quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1a49f947d7bc4be906e931434ded09da176821f9ad58ee1dc2c0462777165b05 and manual adjustments of the deployments based on https://github.com/openshift/ovirt-csi-driver-operator/pull/72

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
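For context, the manifests fix discussed above amounts to deleting the duplicated CRD file (00_crd.yaml, since the CRD ships with the OCP API) and keeping a CR that carries no `driverConfig` stanza for CVO to fight over. A rough sketch of what such a trimmed ClusterCSIDriver CR might look like; the `metadata.name` and `spec` values here are assumptions for illustration, not copied from the PR:

```yaml
# Hypothetical sketch only: a ClusterCSIDriver CR with no driverConfig
# section. The name and spec fields are assumptions, not taken from the PR.
apiVersion: operator.openshift.io/v1
kind: ClusterCSIDriver
metadata:
  name: csi.ovirt.org
spec:
  managementState: Managed
```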