Bug 2064613 - [OCPonRHV]- after few days that cluster is alive we got error in storage operator
Summary: [OCPonRHV]- after few days that cluster is alive we got error in storage operator
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Evgeny Slutsky
QA Contact: michal
URL:
Whiteboard:
Depends On:
Blocks: 2070525
 
Reported: 2022-03-16 09:41 UTC by michal
Modified: 2022-08-10 10:54 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 10:54:28 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/ovirt-csi-driver-operator pull 92 (open): Bug 2064613: Recreate oVirt connection for every sync (last updated 2022-03-22 15:34:37 UTC)
Red Hat Knowledge Base (Solution) 6889161 (last updated 2022-04-11 20:12:51 UTC)
Red Hat Product Errata RHSA-2022:5069 (last updated 2022-08-10 10:54:54 UTC)

Internal Links: 2080888

Description michal 2022-03-16 09:41:32 UTC
Description of problem:
After the cluster has been alive for a few days, the storage operator reports an OVirtCSIDriverOperatorCRDegraded error.
Version-Release number of selected component (if applicable):
RHV: 4.4.10.6
OCP: 4.11.0-0.nightly-2022-02-27-122819

How reproducible:


Steps to Reproduce:
1. Install OpenShift on RHV (not tied to a specific version; it has happened on 4.10 and 4.11).
2. After a few days, run 'oc get co' to check the status of the cluster.
3. The error appears.

Actual results:
An error appears when running 'oc get co':
storage                                    4.11.0-0.nightly-2022-02-27-122819   True        False         True       5d16h   OVirtCSIDriverOperatorCRDegraded: OvirtStorageClassControllerDegraded: generic_error: non-retryable error encountered while listing disk attachments on VM f8f999d0-770e-4498-b86d-6445edc70045, giving up (failed to parse oVirt Engine fault response: <html><head><title>Error</title></head><body>invalid_grant: The provided authorization grant for the auth code has expired.</body></html> (Tag not matched: expect <fault> but got <html>))


Expected results:


Additional info:

Comment 1 Janos Bonic 2022-03-16 09:48:34 UTC
@eslutsky please investigate why the CSI driver operator is not restarted when there is an authentication failure. This should happen as part of a health check.

Comment 2 Peter Larsen 2022-03-16 22:25:54 UTC
This error happened to me because the LDAP authenticator on oVirt went into failure mode, causing most authentications that required LDAP/Kerberos lookups to fail randomly. You can reproduce the issue without OCP by simply trying to log in as a valid LDAP user on oVirt (for instance, use the username specified in the install "cloud-credentials"). In my case, logging in was not working, or, if you did log in, features would suddenly fail with authentication errors.

Once the authentication issue was resolved on the oVirt side, I could get this error to go away by bouncing the operator (4.11 required stopping all the ovirt-csi pods first, then the operator).

Comment 3 Evgeny Slutsky 2022-03-22 13:31:50 UTC
It appears this issue was caused by the port to go-ovirt-client [0].

This part, which handled reconnecting when the authorization grant is revoked, was removed:

func (o *Client) GetConnection() (*ovirtsdk.Connection, error) {
	if o.connection == nil || o.connection.Test() != nil {
		return newOvirtConnection()
	}

	return o.connection, nil
}

[0] https://github.com/openshift/ovirt-csi-driver-operator/commit/2813fbe80f8c244c643f1b06466461996e41c4eb#47
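
For illustration, a minimal sketch of the kind of reconnect handling that was removed, reusing the Client, newOvirtConnection and ovirtsdk.Connection names from the snippet above (getHealthyConnection is a hypothetical name; per the linked pull request 92, the actual fix instead recreates the oVirt connection for every sync):

// Hypothetical helper (a sketch, not the actual fix): reuse the cached
// connection only while it still passes Test(); otherwise build a fresh
// connection, which replaces the expired authorization grant with a new one.
func (o *Client) getHealthyConnection() (*ovirtsdk.Connection, error) {
	if o.connection == nil || o.connection.Test() != nil {
		conn, err := newOvirtConnection()
		if err != nil {
			return nil, err
		}
		o.connection = conn
	}
	return o.connection, nil
}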

Comment 4 Martin Perina 2022-03-22 14:51:42 UTC
(In reply to Peter Larsen from comment #2)

Could you please provide logs from ovirt-engine? From the description it seems to me that reauthentication in the client doesn't work as expected ...

Comment 5 Peter Larsen 2022-03-22 16:53:18 UTC
(In reply to Martin Perina from comment #4)
> Could you please provide logs from ovirt-engine? From the description it
> seems to me that reauthentication in the client doesn't work as expected ...

Not sure it's going to help here - I posted to indicate that in my case this wasn't an oVirt installer or machine config issue, but that oVirt/RHV itself was the root cause of the authentication issue. That doesn't exclude another issue, but I haven't seen expired tokens in my test lab. I no longer have those OCP installs around (4.10 and 4.11); my clusters mostly survive at most a day, perhaps a handful, before they're recreated.

I need to file a BZ/RFE to indicate that the account used for openshift-install on oVirt/RHV needs to be a service account that doesn't expire, or issues like the one shown in this ticket can happen. That said, the username/password token will expire, and the client MUST be able to regenerate a new token, so if that code is gone, that would be an error.

Comment 8 Clemente, Alex 2022-04-04 12:52:42 UTC
Hi guys,


I had the same problem; after a few days, the token to access the oVirt environment had expired.

My temporary solution is to delete the old pods in the openshift-cluster-csi-drivers project (namespace) for the ovirt-csi-driver-controller and ovirt-csi-driver-operator deployments: basically, scale the deployments down and back up to recreate the ovirt-csi-driver-controller-XXXXX and ovirt-csi-driver-operator-XXXXX pods (XXXXX is the generated suffix of the pod name).


OKD Release/Version: Cluster version is 4.10.0-0.okd-2022-03-07-131213

oVirt Release/Version: 4.4.9.5-1.el8


But after some days, I have the problem again.

Steps:
1 - oc get -o yaml clusteroperator storage
2 - oc project openshift-cluster-csi-drivers
3 - Edit or scale down/up the deployments ovirt-csi-driver-operator and ovirt-csi-driver-controller (see the example commands after this list)
4 - Wait a minute and check the status of the operators, or via the command line:
"$ oc adm upgrade
Cluster version is 4.10.0-0.okd-2022-03-07-131213
No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and result in downtime or data loss."
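
For reference, a rough example of the scale down/up workaround using the namespace and deployment names mentioned above (the replica count of 1 is an assumption about the default install, and the operator may itself reconcile the controller deployment once it is scaled back up):

$ oc -n openshift-cluster-csi-drivers scale deployment ovirt-csi-driver-operator --replicas=0
$ oc -n openshift-cluster-csi-drivers scale deployment ovirt-csi-driver-controller --replicas=0
$ oc -n openshift-cluster-csi-drivers scale deployment ovirt-csi-driver-operator --replicas=1
$ oc -n openshift-cluster-csi-drivers scale deployment ovirt-csi-driver-controller --replicas=1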

Comment 9 michal 2022-04-04 12:58:31 UTC
RHV: 4.5
OCP: 4.11.0-0.nightly-2022-03-29-152521

Steps:
1) Run 'oc get clusterversion' - no error appears.
2) Run 'oc get co' - no error appears.

Comment 10 Peter Larsen 2022-04-04 13:47:54 UTC
(In reply to Clemente, Alex from comment #8)
> I had the same problem; after a few days, the token to access the oVirt
> environment had expired.

Alex, can you clarify whether the oVirt credentials are still valid (i.e. you just need to sign in again to generate a new session token)? If so, it's not the issue I've seen. I would suggest including the ovirt-engine log data where the session handling is recorded, to help clarify the root cause. It absolutely could be an OCP/OKD issue - but it will help to be able to exclude oVirt if possible.

> "$ oc adm upgrade
> Cluster version is 4.10.0-0.okd-2022-03-07-131213
> No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and result in downtime or data loss."

Not sure how this relates to a lack of cloud credentials? This message is not a result of querying the oVirt API.

Comment 11 Clemente, Alex 2022-04-04 14:14:04 UTC
(In reply to Peter Larsen from comment #10)

The credentials are valid; I accessed oVirt with the same user and password. The log message indicates that OKD/OCP creates an access token for oVirt, but when the token expires, it does not automatically recreate it. It is necessary to recreate the pod, which forces it to create a new token in oVirt.

I understand that OKD/OCP should create a new token when the old one expires; it doesn't make much sense to have to recreate the pod manually just to generate a new token.

Engine.log of oVirt.

https://filetransfer.io/data-package/vzyCXGT6#link

Comment 12 Michael Burman 2022-04-05 12:09:21 UTC
(In reply to Clemente, Alex from comment #11)

Hi,

I suggest reporting a new issue.
This bug is verified, and it is about the CSI Driver operator becoming degraded after a few days.
Thank you

Comment 16 errata-xmlrpc 2022-08-10 10:54:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

