Bug 1896320

Summary: ovirt-csi-driver-operator pod crashes upon OCP upgrade from 4.5 to 4.6 on RHV platform
Product: OpenShift Container Platform Reporter: Oren Cohen <ocohen>
Component: Storage Assignee: Benny Zlotnik <bzlotnik>
Storage sub component: oVirt CSI Driver QA Contact: Lucie Leistnerova <lleistne>
Status: CLOSED WONTFIX Docs Contact:
Severity: medium    
Priority: medium CC: aos-bugs, danken, ellorent, gzaidman, hpopal, phoracek, ychoukse
Version: 4.6   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-02-25 14:46:26 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Oren Cohen 2020-11-10 09:38:18 UTC
Version: 4.6.1

Platform:
RHV 4.4.1.10-0.1.el8ev

Please specify:
IPI

What happened?
While on version 4.5.16, the cluster was stable with all cluster operators available. During the upgrade to 4.6.1, the "storage" cluster operator gets stuck on "Updating", with "Available=false" and "Progressing=true", and reports the following message:
OVirtCSIDriverOperatorCRProgressing: Waiting for OVirt operator to report status 
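
For reference, the stuck state described above can be checked programmatically. The following is a minimal Python sketch, assuming the ClusterOperator object has been exported as JSON (for example with `oc get clusteroperator storage -o json`). The helper names and the sample data are hypothetical and only mirror the conditions quoted in this report.

```python
import json


def summarize_conditions(cluster_operator: dict) -> dict:
    """Map each condition type to its (status, message) pair."""
    conditions = cluster_operator.get("status", {}).get("conditions", [])
    return {
        c["type"]: (c.get("status", "Unknown"), c.get("message", ""))
        for c in conditions
    }


def is_stuck_updating(cluster_operator: dict) -> bool:
    """True when the operator is not Available but still Progressing."""
    summary = summarize_conditions(cluster_operator)
    available = summary.get("Available", ("Unknown", ""))[0]
    progressing = summary.get("Progressing", ("Unknown", ""))[0]
    return available != "True" and progressing == "True"


# Hypothetical sample shaped like the state described in this report.
SAMPLE = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "Available", "status": "False"},
      {"type": "Progressing", "status": "True",
       "message": "OVirtCSIDriverOperatorCRProgressing: Waiting for OVirt operator to report status"}
    ]
  }
}
""")

if __name__ == "__main__":
    print(is_stuck_updating(SAMPLE))
    print(summarize_conditions(SAMPLE)["Progressing"][1])
```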

In the "openshift-cluster-csi-drivers" namespace, the pod "ovirt-csi-driver-operator-*" is crash-looping.
pod logs:
http://pastebin.test.redhat.com/917008

deployment manifest:
http://pastebin.test.redhat.com/917010

Note: the outcome is the same when I turn off the operator pod "cluster-storage-operator" in the "openshift-cluster-storage-operator" namespace and switch the image of ovirt-csi-driver-operator to:
quay.io/openshift/origin-ovirt-csi-driver-operator:latest
for both of its containers.

There are no connection issues between the pod and the ovirt engine (tested with curl when the pod was in debug mode).


What did you expect to happen?

I expected the upgrade to complete successfully, with the cluster operator "storage" in "Available" status on version 4.6.1.

How to reproduce it (as minimally and precisely as possible)?
Start with an OCP-over-RHV (IPI) cluster on version 4.5.16 and upgrade it to 4.6.1.

Anything else we need to know?

This prevents the cluster from completing the upgrade to 4.6.1.

Comment 1 Benny Zlotnik 2020-11-10 14:34:40 UTC
So the logs are extremely confusing, but the relevant error is:
E1110 09:21:15.641660       1 starter.go:36] yaml: line 3: found character that cannot start any token

Ultimately, the issue is caused by the ovirt-api password starting with a reserved character. It can be resolved by editing the ovirt-credentials object and wrapping the password in quotes.

I created a PR to fail earlier so that the logs aren't as difficult to read.
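
To illustrate why an unquoted password can trigger this parser error: YAML reserves a set of indicator characters, and a plain (unquoted) scalar generally may not begin with one of them ("@" and "`" are reserved outright). The following is a rough Python sketch of a conservative pre-check. The function name and the exact rule set are assumptions for illustration, not the check used by the driver or by any YAML library.

```python
# Indicator characters from the YAML spec. A plain (unquoted) scalar may not
# start with these, apart from narrow exceptions for '-', '?' and ':' that
# this conservative sketch deliberately ignores.
YAML_INDICATORS = set("-?:,[]{}#&*!|>'\"%@`")


def needs_quoting(value: str) -> bool:
    """Heuristic: True if the value is unsafe to emit as a plain YAML scalar.

    This is not a full YAML implementation; it only flags the common cases.
    """
    if value == "":
        return True
    if value[0] in YAML_INDICATORS:
        return True
    if value[0].isspace() or value[-1].isspace():
        return True
    # ": " starts a mapping value and " #" starts a comment inside a plain scalar.
    return ": " in value or " #" in value


if __name__ == "__main__":
    for candidate in ["@secret", "`secret", "secret123", "pass: word"]:
        print(repr(candidate), needs_quoting(candidate))
```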

Comment 3 Oren Cohen 2020-11-10 16:56:46 UTC
It turned out that the pods of the ovirt-csi-driver-node DaemonSet collide with the pods of the nmstate-handler DaemonSet (part of CNV).
Both listen on port 8080 at the host level.

This means the issue reproduces only on OCP-over-RHV clusters, version 4.6, with OpenShift Virtualization installed, at least from version 2.4.

From what I gathered from the CNV network team, nmstate uses this port for metrics, and it can be disabled.
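
A host-level port collision of this kind can be confirmed from a node with a small probe. The following is a minimal Python sketch, assuming it runs in the host network namespace; the function name is hypothetical, and the probe only reports whether the port can be bound, not which process holds it.

```python
import socket


def port_in_use(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket cannot be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            # SO_REUSEADDR is intentionally not set, so binding fails while
            # another socket is bound to or listening on the same address.
            probe.bind((host, port))
        except OSError:
            return True
        return False


if __name__ == "__main__":
    print("port 8080 in use:", port_in_use(8080))
```

Run on a node, a `True` result for port 8080 would be consistent with the collision described in this comment.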

Comment 4 Quique Llorente 2020-11-20 12:40:18 UTC
Hi, we are trying to release CNAO (https://github.com/kubevirt/cluster-network-addons-operator/pull/667), but it looks like we have some issues in CI. The release includes the kubernetes-nmstate fixes that close port 8080.

Comment 7 Dan Kenigsberg 2020-12-17 12:53:57 UTC
(In reply to Benny Zlotnik from comment #1)
> So the logs are extremely confusing, but the relevant error is:
> E1110 09:21:15.641660       1 starter.go:36] yaml: line 3: found character
> that cannot start any token
> 
> Ultimately the issue is caused by the ovirt-api password starting with a
> reserved character, this can be resolved by editing the ovirt-credentials
> object and wrap the password with quotes.
> 
> I created a PR to fail earlier so the logs aren't as difficult to read

Thanks for making the logs more readable. However, the important thing is that the ovirt_password secret must accept any printable character. It is encoded in base64 armor exactly to allow this. After ovirt-csi-driver reads it, it should quote it according to the destination. From what you say, it seems that I cannot have a password that starts with quotes, either.
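
As a sketch of what "quote it according to the destination" could look like when the destination is YAML: decode the base64 value, then emit it as a double-quoted scalar with proper escaping. The Python below relies on the fact that a JSON string literal is also a valid YAML double-quoted scalar. The function names and the destination key are hypothetical, and this is not the driver's actual code.

```python
import base64
import json


def yaml_quoted_from_secret(b64_value: str) -> str:
    """Decode a base64 secret value and return it as a quoted YAML scalar."""
    raw = base64.b64decode(b64_value).decode("utf-8")
    # json.dumps escapes embedded quotes and backslashes; ensure_ascii=False
    # keeps printable non-ASCII characters literal instead of \u escapes.
    return json.dumps(raw, ensure_ascii=False)


def render_config_line(key: str, b64_value: str) -> str:
    """Render one 'key: value' line (the key name is an assumption)."""
    return f"{key}: {yaml_quoted_from_secret(b64_value)}"


if __name__ == "__main__":
    # Hypothetical password that starts with a reserved character and
    # also contains a double quote.
    encoded = base64.b64encode('@pa"ss'.encode("utf-8")).decode("ascii")
    print(render_config_line("ovirt_password", encoded))
```

With this approach, a password that itself starts with a quote character is escaped rather than rejected.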

Comment 8 Gal Zaidman 2021-01-27 08:47:37 UTC
Due to capacity constraints, we will be revisiting this bug in the upcoming sprint.

Comment 9 Gal Zaidman 2021-02-25 14:46:26 UTC
Moving this Bug to https://bugzilla.redhat.com/show_bug.cgi?id=1933028