Description of problem:
A few days after the cluster is installed, the storage cluster operator reports the error OVirtCSIDriverOperatorCRDegraded.

Version-Release number of selected component (if applicable):
rhv: 4.4.10.6
ocp: 4.11.0-0.nightly-2022-02-27-122819

How reproducible:
Happens after a few days of cluster uptime.

Steps to Reproduce:
1. Install OpenShift on RHV (not tied to a specific version; it has happened on both 4.10 and 4.11).
2. After a few days, run 'oc get co' to check the status of the cluster.
3. The error appears.

Actual results:
'oc get co' shows:

storage   4.11.0-0.nightly-2022-02-27-122819   True   False   True   5d16h
OVirtCSIDriverOperatorCRDegraded: OvirtStorageClassControllerDegraded: generic_error: non-retryable error encountered while listing disk attachments on VM f8f999d0-770e-4498-b86d-6445edc70045, giving up (failed to parse oVirt Engine fault response: <html><head><title>Error</title></head><body>invalid_grant: The provided authorization grant for the auth code has expired.</body></html> (Tag not matched: expect <fault> but got <html>))

Expected results:
The storage cluster operator stays healthy; it should not become degraded just because the oVirt session token expired.

Additional info:
@eslutsky please investigate why the CSI driver operator is not restarted when there is an authentication failure. This should happen as part of a health check.
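For reference, a minimal sketch of the kind of health check being suggested, assuming the operator keeps an *ovirtsdk.Connection around. The interval, the failure threshold, and the choice to exit the process (so Kubernetes restarts the pod and it logs in again with the stored cloud credentials) are assumptions, not existing operator behaviour; the only real API relied on is Connection.Test() from the Go SDK.

// Hypothetical health check: periodically test the oVirt connection and exit
// the process after repeated failures so the restarted container
// re-authenticates with the stored cloud credentials.
package health

import (
	"log"
	"os"
	"time"

	ovirtsdk "github.com/ovirt/go-ovirt"
)

func watchConnection(conn *ovirtsdk.Connection) {
	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()

	failures := 0
	for range ticker.C {
		if err := conn.Test(); err != nil {
			failures++
			log.Printf("oVirt connection test failed (%d in a row): %v", failures, err)
			if failures >= 3 {
				os.Exit(1) // let the kubelet restart the container
			}
			continue
		}
		failures = 0
	}
}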
This error happened to me because the LDAP authenticator on oVirt went into failure mode, causing most authentications that required LDAP/Kerberos lookups to fail randomly. You can reproduce the issue without OCP by simply trying to log in to oVirt as a valid LDAP user (for instance, the username specified in the install "cloud-credentials"). In my case logging in either did not work at all, or, once logged in, features would suddenly fail with authentication errors.

Once the authentication issue was resolved on the oVirt side, I could get this error to go away by bouncing the operator (on 4.11 this required stopping all the ovirt-csi pods first, then the operator).
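To check the engine side independently of OCP, a small standalone program along these lines can tell you whether the installer credentials still authenticate. This is only a sketch: the URL, username and password are placeholders, and the connection options would need to match your environment (e.g. CAFile instead of Insecure).

// Standalone check, independent of OCP: try to authenticate against the
// engine with the same user that the install-time cloud-credentials use.
// URL, username and password below are placeholders.
package main

import (
	"fmt"

	ovirtsdk "github.com/ovirt/go-ovirt"
)

func main() {
	conn, err := ovirtsdk.NewConnectionBuilder().
		URL("https://engine.example.com/ovirt-engine/api").
		Username("ocpadmin@internal"). // replace with the user from cloud-credentials
		Password("changeme").
		Insecure(true). // use CAFile(...) instead for a proper check
		Build()
	if err != nil {
		fmt.Println("building the connection failed:", err)
		return
	}
	defer conn.Close()

	// Test() contacts the engine, so this is where errors such as
	// invalid_grant or failed LDAP lookups show up.
	if err := conn.Test(); err != nil {
		fmt.Println("authentication against the engine failed:", err)
		return
	}
	fmt.Println("credentials are still valid")
}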
It appears this issue was caused by the port to go-ovirt-client [0]; the part that handled reconnecting when the authorization grant is revoked was removed:

func (o *Client) GetConnection() (*ovirtsdk.Connection, error) {
	if o.connection == nil || o.connection.Test() != nil {
		return newOvirtConnection()
	}
	return o.connection, nil
}

[0] https://github.com/openshift/ovirt-csi-driver-operator/commit/2813fbe80f8c244c643f1b06466461996e41c4eb#47
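For illustration, a sketch of how that reconnect-on-expiry behaviour could be reinstated around the SDK connection. The type, field, and helper names mirror the removed snippet above and are placeholders, not the current operator code; the only real API relied on here is Connection.Test() from the Go SDK.

// Illustrative sketch, modelled on the removed snippet above: cache the SDK
// connection and rebuild it whenever Test() fails, for example because the
// authorization grant has expired. newConn stands in for whatever rebuilds
// the connection from the cloud-credentials secret.
package client

import (
	"sync"

	ovirtsdk "github.com/ovirt/go-ovirt"
)

type Client struct {
	mu         sync.Mutex
	connection *ovirtsdk.Connection
	newConn    func() (*ovirtsdk.Connection, error)
}

// GetConnection returns a working connection, re-authenticating when the
// cached one no longer passes Test().
func (o *Client) GetConnection() (*ovirtsdk.Connection, error) {
	o.mu.Lock()
	defer o.mu.Unlock()

	if o.connection == nil || o.connection.Test() != nil {
		conn, err := o.newConn()
		if err != nil {
			return nil, err
		}
		o.connection = conn
	}
	return o.connection, nil
}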
(In reply to Peter Larsen from comment #2)
> Once the authentication issue was resolved on the oVirt side, I could get
> this error to go away by bouncing the operator.

Could you please provide logs from ovirt-engine? From the description it seems to me that reauthentication in the client doesn't work as expected ...
(In reply to Martin Perina from comment #4)
> Could you please provide logs from ovirt-engine? From the description it
> seems to me that reauthentication in the client doesn't work as expected ...

Not sure it's going to help here - I posted to indicate that in my case this wasn't an oVirt installer or machine config issue, but that oVirt/RHV was the root cause of the authentication failures. That doesn't exclude another issue, but I haven't seen expired tokens in my test lab, and I no longer have those OCP installs around (4.10 and 4.11); my clusters mostly survive at most a day, perhaps a handful, before they're recreated. I need to file a BZ/RFE asking that the account used for openshift-install on oVirt/RHV be a service account that doesn't expire, or issues like the one this ticket shows can happen. That said, the username/password token will expire, and the client MUST be able to regenerate a new token, so if that code is gone, that would be an error.
Hi guys,

I had the same problem: after a few days the token used to access the oVirt environment expired. My (temporary) solution is to delete the old pods in the openshift-cluster-csi-drivers project (namespace) for the deployments ovirt-csi-driver-controller and ovirt-csi-driver-operator - basically scale the deployments down and back up so the ovirt-csi-driver-controller-XXXXX and ovirt-csi-driver-operator-XXXXX pods are recreated (XXXXX being the generated suffix of the pod name).

OKD Release/Version: Cluster version is 4.10.0-0.okd-2022-03-07-131213
oVirt Release/Version: 4.4.9.5-1.el8

But after some days I hit the problem again.

Steps:
1 - oc get -o yaml clusteroperator storage
2 - oc project openshift-cluster-csi-drivers
3 - Edit or scale down/up the deployments ovirt-csi-driver-operator and ovirt-csi-driver-controller
4 - Wait a minute and check the status of the operators, or via command line:

"$ oc adm upgrade
Cluster version is 4.10.0-0.okd-2022-03-07-131213
No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and result in downtime or data loss."
rhv: 4.5
ocp: 4.11.0-0.nightly-2022-03-29-152521

Steps:
1) Run 'oc get clusterversion' - no error appears.
2) Run 'oc get co' - no error appears.
(In reply to Clemente, Alex from comment #8)
> I had the same problem: after a few days the token used to access the oVirt
> environment expired.

Alex, can you clarify whether the oVirt credentials are still valid (i.e. you just need to sign in again to generate a new session token)? If so, it's not the issue I've seen. I would suggest including the ovirt-engine log data covering the session handling, to help clarify the root cause. It absolutely could be an OCP/OKD issue - but it will help if we can exclude oVirt where possible.

> "$ oc adm upgrade
> Cluster version is 4.10.0-0.okd-2022-03-07-131213
> No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and result in downtime or data loss."

I'm not sure how this relates to the missing cloud credentials; this message is not the result of querying the oVirt API.
(In reply to Peter Larsen from comment #10)
> Alex, can you clarify whether the oVirt credentials are still valid (i.e.
> you just need to sign in again to generate a new session token)?

The credentials are valid; I accessed oVirt with the same user and password. From the log messages it looks like OKD/OCP creates an access token for oVirt but, when the token's lifetime expires, it does not automatically recreate it, so the pod has to be recreated to force a new token to be created in oVirt.

I understand that OKD/OCP should create a new token when the current one has expired; it doesn't make much sense to have to recreate the pod manually just to generate a new token.

engine.log of oVirt:

https://filetransfer.io/data-package/vzyCXGT6#link
(In reply to Clemente, Alex from comment #11)
> I understand that OKD/OCP should create a new token when the current one has
> expired; it doesn't make much sense to have to recreate the pod manually
> just to generate a new token.

Hi, I suggest reporting this as a new issue. This bug is already verified, and it is about the CSI driver operator becoming degraded after a few days. Thank you.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069