Bug 2064613
| Summary: | [OCPonRHV]- after few days that cluster is alive we got error in storage operator | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | michal <mgold> |
| Component: | Storage | Assignee: | Evgeny Slutsky <eslutsky> |
| Storage sub component: | oVirt CSI Driver | QA Contact: | michal <mgold> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | alexbmw00, aos-bugs, eslutsky, fhirtz, jpasztor, lleistne, mburman, mkalinin, mperina, plarsen, wking |
| Version: | 4.11 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-10 10:54:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2070525 | | |
Description
michal
2022-03-16 09:41:32 UTC
@eslutsky please investigate why the CSI driver operator is not restarted when there is an authentication failure. This should happen as part of a health check.

Peter Larsen (comment #2):

This error happened to me because the LDAP authenticator on oVirt went into a failure mode, causing most authentications that required LDAP/Kerberos lookups to fail randomly. You can reproduce the issue without OCP by simply trying to log in as a valid LDAP user on oVirt (for instance, use the username specified in the install "cloud-credentials"). In my case logging in was not working, or if you did log in, features would suddenly fail with authentication errors.

Once the authentication issue was resolved on oVirt, I could make this error go away by bouncing the operator (4.11 required stopping all the ovirt-csi pods first, then the operator).

It appears this issue was caused by the port to go-ovirt-client [0]; this part, which handles reconnecting in case the authorization grant is revoked, was removed:

```go
func (o *Client) GetConnection() (*ovirtsdk.Connection, error) {
	if o.connection == nil || o.connection.Test() != nil {
		return newOvirtConnection()
	}
	return o.connection, nil
}
```

[0] https://github.com/openshift/ovirt-csi-driver-operator/commit/2813fbe80f8c244c643f1b06466461996e41c4eb#47

Martin Perina (comment #4):

(In reply to Peter Larsen from comment #2)
> This error happened to me because the LDAP authenticator on oVirt went
> into a failure mode, causing most authentications that required
> LDAP/Kerberos lookups to fail randomly. [...]

Could you please provide logs from ovirt-engine? From the description it seems to me that reauthentication in the client doesn't work as expected.

Peter Larsen (in reply to Martin Perina from comment #4):

Not sure it's going to help here. I posted to indicate that in my case this wasn't an oVirt installer or machine-config issue, but that oVirt/RHV itself was the root cause of the authentication failure. That doesn't exclude another issue, but I haven't seen expired tokens in my test lab. I no longer have those OCP installs around (4.10 and 4.11). My clusters mostly survive a day at most, perhaps a handful, before they're recreated.
I need to file a BZ/RFE to indicate that the account used for openshift-install on oVirt/RHV needs to be a service account that doesn't expire, or issues like the one this ticket shows can happen. That said, the username/password token will expire, and the client MUST be able to regenerate a new token, so if that code is gone, that is a bug.

Clemente, Alex (comment #8):

Hi guys, I had the same problem: after a few days, the token used to access the oVirt environment had expired. My temporary solution is to delete the old pods of the ovirt-csi-driver-controller and ovirt-csi-driver-operator deployments in the openshift-cluster-csi-drivers namespace; basically, scale the deployments down and back up (recreating the pods ovirt-csi-driver-controller-XXXXX and ovirt-csi-driver-operator-XXXXX, where XXXXX is the generated suffix of the pod name).

OKD Release/Version: Cluster version is 4.10.0-0.okd-2022-03-07-131213
oVirt Release/Version: 4.4.9.5-1.el8

But in a few days I will have the problem again.

Steps:
1 - oc get -o yaml clusteroperator storage
2 - oc project openshift-cluster-csi-drivers
3 - Edit or scale up/down the deployments ovirt-csi-driver-operator and ovirt-csi-driver-controller
4 - Wait a minute and check the status of the operators, or via the command line:

"$ oc adm upgrade
Cluster version is 4.10.0-0.okd-2022-03-07-131213
No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and result in downtime or data loss."

rhv: 4.5
ocp: 4.11.0-0.nightly-2022-03-29-152521
steps:
1) run oc get clusterversion - no error appears
2) run oc get co - no error appears

Peter Larsen (comment #10):

(In reply to Clemente, Alex from comment #8)
> I had the same problem: after a few days, the token used to access the
> oVirt environment had expired.

Alex, can you clarify whether the oVirt credentials are still valid (i.e. you just need to sign in again to generate a new session token)? If so, it's not the issue I've seen. I would suggest including the ovirt-engine log data where the session data/processing is found, to help clarify the root cause. It absolutely could be an OCP/OKD issue, but it's going to help if we can exclude oVirt where possible.

> "$ oc adm upgrade
> Cluster version is 4.10.0-0.okd-2022-03-07-131213
> No updates available. [...]"

Not sure how this relates to a lack of cloud credentials? This message is not the result of querying the oVirt API.
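For reference, a minimal sketch of the reconnect behavior the removed code in [0] provided, with session caching added for illustration. It assumes the go-ovirt SDK import path commonly used by the operator, and the engine URL and credentials shown are hypothetical placeholders; this is not the shipped fix, just an illustration of the re-authentication the thread is asking for.

```go
package ovirt

import (
	"fmt"

	ovirtsdk "github.com/ovirt/go-ovirt"
)

// Client wraps a cached, authenticated connection to the oVirt engine.
type Client struct {
	connection *ovirtsdk.Connection
}

// newOvirtConnection performs a fresh username/password login against
// the engine. URL and credentials here are hypothetical; the real
// operator reads them from the cloud-credentials secret.
func newOvirtConnection() (*ovirtsdk.Connection, error) {
	return ovirtsdk.NewConnectionBuilder().
		URL("https://engine.example.com/ovirt-engine/api").
		Username("admin@internal").
		Password("secret").
		Insecure(true).
		Build()
}

// GetConnection validates the cached session before handing it out and
// re-authenticates when the session token has expired or been revoked,
// mirroring the removed snippet quoted above (plus caching).
func (o *Client) GetConnection() (*ovirtsdk.Connection, error) {
	// Test() makes a cheap API call; it returns an error once the
	// engine no longer accepts the session token.
	if o.connection == nil || o.connection.Test() != nil {
		conn, err := newOvirtConnection()
		if err != nil {
			return nil, fmt.Errorf("re-authenticating against oVirt engine: %w", err)
		}
		o.connection = conn
	}
	return o.connection, nil
}
```

Because every caller goes through GetConnection, an expired token is repaired transparently on the next API call instead of leaving the operator degraded until its pods are recreated by hand.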
Clemente, Alex (comment #11):

(In reply to Peter Larsen from comment #10)
> Alex, can you clarify whether the oVirt credentials are still valid (i.e.
> you just need to sign in again to generate a new session token)? [...]

The credentials are valid; I accessed oVirt with the same user and password. The log messages indicate that OKD/OCP creates an access token for oVirt, but when the token's lifetime expires it does not recreate it automatically; it is necessary to recreate the pod, which forces it to create a new token in oVirt.

I understand that OKD/OCP should create a new token when the old one expires; it doesn't make much sense to have to recreate the pod manually just to generate a new token.

engine.log from oVirt: https://filetransfer.io/data-package/vzyCXGT6#link

(In reply to Clemente, Alex from comment #11)
> The credentials are valid; I accessed oVirt with the same user and
> password. [...]

Hi, I suggest reporting this as a new issue. This bug is verified, and it is about the CSI driver operator becoming degraded after a few days. Thank you.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
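To make the health-check idea from the description concrete: a rough sketch of how an operator process could expose its oVirt authentication state as a liveness endpoint, so that a livenessProbe pointed at /healthz makes the kubelet restart the pod on persistent auth failure, rather than someone scaling deployments by hand as in the workaround above. The endpoint, port, and wiring are assumptions for illustration, not the actual operator code; in practice the check closure would call GetConnection from the Client sketched earlier.

```go
package main

import (
	"log"
	"net/http"
)

// healthz turns any connectivity check into a liveness endpoint.
// With the Client sketched above, the check would be a closure such as:
//   func() error { _, err := client.GetConnection(); return err }
func healthz(check func() error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if err := check(); err != nil {
			log.Printf("oVirt auth health check failed: %v", err)
			http.Error(w, "ovirt authentication failing", http.StatusServiceUnavailable)
			return
		}
		w.Write([]byte("ok"))
	}
}

func main() {
	// Hypothetical wiring: replace the stub with a closure that probes
	// oVirt authentication via the CSI driver's client.
	check := func() error { return nil }
	http.HandleFunc("/healthz", healthz(check))
	log.Fatal(http.ListenAndServe(":10301", nil))
}
```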