Bug 1896320 - ovirt-csi-driver-operator pod crashes upon OCP upgrade from 4.5 to 4.6 on RHV platform
Keywords:
Status: ASSIGNED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Benny Zlotnik
QA Contact: Lucie Leistnerova
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-11-10 09:38 UTC by Oren Cohen
Modified: 2021-02-11 10:26 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:


Links
System ID Private Priority Status Summary Last Updated
Github openshift ovirt-csi-driver-operator pull 33 0 None closed Bug 1896320: Bail immediately if we failed to create an OvirtClient 2021-02-03 08:39:34 UTC

Internal Links: 1896485

Description Oren Cohen 2020-11-10 09:38:18 UTC
Version: 4.6.1

$ openshift-install version
<your output here>

Platform:
RHV 4.4.1.10-0.1.el8ev

#Please specify the platform type: aws, libvirt, openstack or baremetal etc.

Please specify:
IPI

What happened?
On version 4.5.16 the cluster was stable, with all cluster operators available. During the upgrade to 4.6.1, the cluster operator "storage" is stuck on "Updating", with "Available=false" and "Progressing=true" and the following message:
OVirtCSIDriverOperatorCRProgressing: Waiting for OVirt operator to report status 

In the "openshift-cluster-csi-drivers" namespace, the pod "ovirt-csi-driver-operator-*" is in a crash loop.
pod logs:
http://pastebin.test.redhat.com/917008

deployment manifest:
http://pastebin.test.redhat.com/917010

Note: the outcome was the same when I turned off the operator pod "cluster-storage-operator" in the "openshift-cluster-storage-operator" namespace and switched the image of ovirt-csi-driver-operator to:
quay.io/openshift/origin-ovirt-csi-driver-operator:latest
for both of its containers.

There are no connection issues between the pod and the ovirt engine (tested with curl when the pod was in debug mode).


What did you expect to happen?

I expected the upgrade to complete successfully, with the cluster operator "storage" in "Available" status on version 4.6.1.

How to reproduce it (as minimally and precisely as possible)?
Start with an OCP-over-RHV (IPI) cluster, version 4.5.16, and perform an upgrade to 4.6.1 

Anything else we need to know?

This prevents the cluster from completing the upgrade to 4.6.1.

Comment 1 Benny Zlotnik 2020-11-10 14:34:40 UTC
So the logs are extremely confusing, but the relevant error is:
E1110 09:21:15.641660       1 starter.go:36] yaml: line 3: found character that cannot start any token

Ultimately, the issue is caused by the ovirt-api password starting with a reserved character. It can be resolved by editing the ovirt-credentials object and wrapping the password in quotes.
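
The quoting workaround can be sketched in Python as follows. This is only an illustration, not the operator's code. It relies on the fact that a JSON string is also a valid YAML 1.2 double-quoted scalar, so json.dumps can produce a value that is safe even when it starts with a reserved indicator such as "@" or "`". The key name ovirt_password and the helper names are assumptions made for the example.

```python
import json


def yaml_double_quoted(value: str) -> str:
    """Return value as a double-quoted scalar.

    JSON string syntax is accepted by YAML 1.2 as a double-quoted
    scalar, so embedded quotes and backslashes come out escaped.
    ensure_ascii=False keeps non-ASCII characters as-is.
    """
    return json.dumps(value, ensure_ascii=False)


def render_password_line(password: str, key: str = "ovirt_password") -> str:
    """Build one 'key: value' line; the key name is hypothetical."""
    return f"{key}: {yaml_double_quoted(password)}"


if __name__ == "__main__":
    # Passwords that would break an unquoted (plain) YAML scalar.
    for pw in ["@dmin123", '"quoted"', "back\\slash"]:
        print(render_password_line(pw))
```

Quoting by hand inside the credentials object works as well; the point of the sketch is that the escaping should be done by a serializer rather than by string concatenation.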

I created a PR to fail earlier, so the logs aren't as difficult to read.

Comment 3 Oren Cohen 2020-11-10 16:56:46 UTC
It turned out that the pods of the ovirt-csi-driver-node DaemonSet collide with the pods of the nmstate-handler DaemonSet (part of CNV).
Both listen on port 8080 at the host level.
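
The collision can be illustrated with a minimal Python sketch, assuming both processes try to bind the same TCP port in one network namespace (as pods on the host network would). It is not taken from either DaemonSet. The function names are hypothetical, and the sketch uses an OS-assigned port on 127.0.0.1 instead of 8080 so that it can run anywhere.

```python
import socket


def can_bind(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a new TCP socket can bind host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another listener already owns the port.
            return False
        return True


def second_bind_fails() -> bool:
    """Hold a listener on a port and report whether a second bind is rejected."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as first:
        first.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        first.listen()
        port = first.getsockname()[1]
        return not can_bind(port)


if __name__ == "__main__":
    print("second listener rejected:", second_bind_fails())
```

Running can_bind(8080) on an affected node (an assumption about how one might check) would show whether the port is already taken before the second DaemonSet starts.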

This means the issue reproduces only on OCP-over-RHV clusters, version 4.6, with OpenShift Virtualization installed, from at least version 2.4.

From what I gathered from the CNV network team, nmstate uses this port for metrics, and it can be disabled.

Comment 4 Quique Llorente 2020-11-20 12:40:18 UTC
Hi, we are trying to release CNAO (https://github.com/kubevirt/cluster-network-addons-operator/pull/667), but it looks like we have some issues in CI. The release includes the kubernetes-nmstate fixes that close port 8080.

Comment 7 Dan Kenigsberg 2020-12-17 12:53:57 UTC
(In reply to Benny Zlotnik from comment #1)
> So the logs are extremely confusing, but the relevant error is:
> E1110 09:21:15.641660       1 starter.go:36] yaml: line 3: found character
> that cannot start any token
> 
> Ultimately the issue is caused by the ovirt-api password starting with a
> reserved character, this can be resolved by editing the ovirt-credentials
> object and wrap the password with quotes.
> 
> I created a PR to fail earlier so the logs aren't as difficult to read

Thanks for making the logs more readable. However, the important thing is that the ovirt_password secret must accept any printable character. It is encoded in base64 precisely to allow this. After ovirt-csi-driver reads it, it should quote it as the destination requires. From what you say, it seems that I cannot have a password that starts with a quote character either.
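
The requirement above can be checked with a small round-trip test, sketched here in Python under stated assumptions. The secret value is base64-decoded, serialized with a quoting step, and parsed back. The JSON parser stands in for the driver's YAML parser, on the grounds that JSON strings are valid YAML 1.2 double-quoted scalars. A real test would use the driver's own parser. All function names are hypothetical.

```python
import base64
import json
import string


def decode_secret_value(encoded: str) -> str:
    """Decode a base64 secret value into the plain password."""
    return base64.b64decode(encoded).decode("utf-8")


def survives_round_trip(password: str) -> bool:
    """Encode, decode, quote, and re-parse; report whether the password is intact."""
    encoded = base64.b64encode(password.encode("utf-8")).decode("ascii")
    decoded = decode_secret_value(encoded)
    quoted = json.dumps(decoded, ensure_ascii=False)  # quoting for the destination
    return json.loads(quoted) == password


# Try every ASCII punctuation character as the first character of a password,
# including the quote characters mentioned above.
awkward = [ch + "secret" for ch in string.punctuation]
failures = [pw for pw in awkward if not survives_round_trip(pw)]

if __name__ == "__main__":
    print("passwords that did not survive:", failures)
```

With serializer-based quoting no leading character needs special treatment, which is the behavior the comment asks for.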

Comment 8 Gal Zaidman 2021-01-27 08:47:37 UTC
Due to capacity constraints, we will revisit this bug in the upcoming sprint.

