Description of problem:
Authentication operator fails to become available during upgrade to 4.8.2

$ oc get co
NAME                                      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                            4.8.2     False       False         False      31m
baremetal                                 4.8.2     True        False         False      157d
cloud-credential                          4.8.2     True        False         False      352d
cluster-autoscaler                        4.8.2     True        False         False      352d
config-operator                           4.8.2     True        False         False      352d
console                                   4.8.2     True        False         False      75m
csi-snapshot-controller                   4.8.2     True        False         False      176m
dns                                       4.7.21    True        False         False      3h19m
etcd                                      4.8.2     True        False         False      352d
image-registry                            4.8.2     True        False         False      31h
ingress                                   4.8.2     True        False         False      30h
insights                                  4.8.2     True        False         False      352d
kube-apiserver                            4.8.2     True        False         False      352d
kube-controller-manager                   4.8.2     True        False         False      352d
kube-scheduler                            4.8.2     True        False         False      352d
kube-storage-version-migrator             4.8.2     True        False         False      129m
machine-api                               4.8.2     True        False         False      352d
machine-approver                          4.8.2     True        False         False      352d
machine-config                            4.7.21    True        False         False      140m
marketplace                               4.8.2     True        False         False      32h
monitoring                                4.8.2     True        False         False      95m
network                                   4.7.21    True        False         False      157d
node-tuning                               4.8.2     True        False         False      30h
openshift-apiserver                       4.8.2     True        False         False      63m
openshift-controller-manager              4.8.2     True        False         False      32h
openshift-samples                         4.8.2     True        False         False      30h
operator-lifecycle-manager                4.8.2     True        False         False      352d
operator-lifecycle-manager-catalog        4.8.2     True        False         False      352d
operator-lifecycle-manager-packageserver  4.8.2     True        False         False      42m
service-ca                                4.8.2     True        False         False      352d
storage                                   4.8.2     True        False         False      86m

Version-Release number of selected component (if applicable):
4.8.2 on AWS IPI

How reproducible:
Unknown

Steps to Reproduce:
1. Install 4.7.21
2. Upgrade to 4.8.2
3. Upgrade will stall when attempting to upgrade the authentication operator

Actual results:
Operator status should reflect that it is Available. If there is a problem it should reflect that it is Degraded.
Operator reports:
clusteroperator/authentication is not available (OAuthRouteCheckEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.xx.yy.org/healthz": context canceled) because All is well

Expected results:

Additional info:
oauth is functional. Can login via the console with a configured provider.
This is the vSphere CI build cluster. If needed access can be provided.
The underlying issue seems to be that the route endpoint check fails with a timeout:

OAuthRouteCheckEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.build01-us-west-2.vmc.ci.openshift.org/healthz": context canceled

This looks rather network related: the oauth pod cannot reach external routes.
Could you enable router access logs?

$ oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge --patch='{"spec":{"logging":{"access":{"destination":{"type":"Container"}}}}}'

That will restart the ingress router pods, and each pod will now have a "logs" container. I'd like to see if this helps provide any insight or correlation w.r.t. the auth GET failure mentioned in comment #1.

On top of that, would it be possible to run the curler binary (see attachments)? This is a wrapped-up version of curl that repeatedly makes a GET request. Usage would be:

$ O=1 ./curler https://oauth-openshift.apps.build01-us-west-2.vmc.ci.openshift.org/healthz
reopening stdout to "curler-R0-2021-08-02-133557.stdout"
reopening stderr to "curler-R0-2021-08-02-133557.stderr"

You can tail -f the .stdout file to watch the GET requests to the endpoint, looking for slow requests, requests that fail, DNS issues, et al. It will repeat the GET indefinitely.

Can we run this both external to the cluster and from a pod (or node?) within the cluster? I'd like to see the same failure from the curler binary that we see from the auth pod.
Created attachment 1810102 [details] curler binary - make repeated calls to an endpoint. Usage: O=1 ./curler <URL>
needinfo should go to OP
Performed 36 minutes of curl testing.

From the perspective of the auth pod:

sh-4.4# cat curler-R0-2021-08-02-130644.stderr
sh-4.4# cat curler-R0-2021-08-02-130644.stdout | grep "http_code 200" | wc -l
123425
sh-4.4# cat curler-R0-2021-08-02-130644.stdout | grep -v "http_code 200" | wc -l
0

From the perspective of an external caller:

$ cat curler-R0-2021-08-02-090657.stderr
$ cat curler-R0-2021-08-02-090657.stdout | grep "http_code 200" | wc -l
5808
$ cat curler-R0-2021-08-02-090657.stdout | grep -v "http_code 200" | wc -l
0

No failures were observed from the degraded operator pod or externally.
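For anyone repeating this test, the two grep counts above can be folded into one helper. This is only a sketch: it assumes, as the commands above do, that every successful request line in the curler .stdout file contains the string "http_code 200". The function name summarize_curler_log is made up for illustration and is not part of the curler attachment.

```shell
# summarize_curler_log FILE
# Prints "ok=<n> other=<m>" for a curler .stdout file.
#   ok    = lines containing "http_code 200"
#   other = every remaining line (same set that `grep -v "http_code 200"` counts)
# Assumption: one line per request, as implied by the grep/wc usage above.
summarize_curler_log() {
    awk '
        /http_code 200/  { ok++ }
        !/http_code 200/ { other++ }
        END { printf "ok=%d other=%d\n", ok, other }
    ' "$1"
}
```

Usage would be `summarize_curler_log curler-R0-2021-08-02-130644.stdout`. For the auth pod run above this should agree with the two counts already reported (123425 and 0).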
The issue is that the failing route status was set by an authentication-operator from version 4.7:
```
$ kubectl get co authentication -o yaml
...
  - lastTransitionTime: "2021-08-02T21:47:41Z"
    message: 'OAuthRouteCheckEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.build01-us-west-2.vmc.ci.openshift.org/healthz": context canceled'
    reason: OAuthRouteCheckEndpointAccessibleController_EndpointUnavailable
    status: "False"
    type: Available
```
However, the prefix OAuthRoute changed to OAuthServerRoute, and the stale status controller entries have not been updated correctly. Instead of referring to OAuthRouteCheckEndpointAccessibleController_Available_ they refer to OAuthRouteCheckEndpointAccessibleController_Degraded_:
https://github.com/openshift/cluster-authentication-operator/blob/4dfd59792e303282731f6120a22b042144901b39/pkg/operator/starter.go#L256
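To make the mismatch concrete, here is a toy model of a stale-condition cleanup. It is a sketch only, not the operator's Go code: it assumes the cleanup drops conditions whose type exactly matches an entry in a stale list, and prune_conditions is a hypothetical name. The two condition type names are the ones mentioned above.

```shell
# prune_conditions STALE_LIST_FILE
# Reads condition type names on stdin (one per line) and prints the ones that
# do NOT exactly match any line of STALE_LIST_FILE.
# Simplified illustration of a stale-condition cleanup, not the real controller.
prune_conditions() {
    # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file.
    # grep exits 1 when nothing is printed, which is not an error here.
    grep -v -x -F -f "$1" || true
}
```

With a stale list that only names OAuthRouteCheckEndpointAccessibleControllerDegraded, feeding in OAuthRouteCheckEndpointAccessibleControllerAvailable leaves it untouched, which would match what we see: the 4.7-era Available=False entry is never cleaned up and keeps the operator unavailable.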
Worked around this issue by appending an `Available: True` condition to the operator conditions in etcd. The upgrade is now proceeding.
edit: I had to remove the condition `OAuthRouteCheckEndpointAccessibleController` from the authentications/cluster resource.
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1
> > Customers upgrading from 4.7.z to 4.8.z during a period when the authentication operator is unable to reach the oauth route just as the authentication operator is rolling out 4.8.
>
> This should be prevented by CVO as upgrades should not be possible while operators report degraded. We are still not sure why the upgrade was still possible.

Because sometimes updating to a new release is how we want folks to fix a degraded operator. During updates, the CVO blocks on ClusterOperator manifests when they are Degraded=True [1], but that's generally (always?) after the manifest for the operator deployment. Blocking a move from release A to release B because A's operator X is Degraded=True isn't crisp, because we'll only block mid-update if B's operator X is also Degraded=True. Or maybe the degradation root is outside operator X completely, in which case, maybe it gets sorted out before we get around to ClusterOperator X, or maybe not, but that chance is still better than forcing folks to manually recover or force through a guard before they can attempt an update.

[1]: https://github.com/openshift/enhancements/blob/ac1c27da8307933263e5273bc087b407d79f713f/dev-guide/cluster-version-operator/user/reconciliation.md#clusteroperator
We have not seen this issue in Telemetry, and as per our discussion it seems to be a corner case. So we have decided not to remove the edge from 4.7 to 4.8 for this bug. However, if we get evidence that this bug is impacting more clusters, we will reconsider the decision.
It seems that has changed: https://github.com/openshift/cincinnati-graph-data/pull/987
*** Bug 1993712 has been marked as a duplicate of this bug. ***
It's been confirmed that a cluster which had run into this upgrade-halting problem completed the upgrade after applying 4.8.5, which contained the backported version of this fix. This bug primarily served as a prerequisite for backporting that change to 4.8, and as such I'm marking this CLOSED CURRENTRELEASE after the verification I just mentioned.
With a few more clusters getting stuck on this issue, and 4.8.5 now in fast-4.8 with the fix [1,2], we've blocked 4.7 -> 4.8[234] [3] to keep future updates from hitting this same problem.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1989587#c5
[2]: https://github.com/openshift/cincinnati-graph-data/pull/988#event-5164866620
[3]: https://github.com/openshift/cincinnati-graph-data/pull/987
Updated Impact Statement

Who is impacted?
  Some clusters upgrading from 4.7 to 4.8.2-4.8.4.

What is the impact? Is it serious enough to warrant blocking edges?
  The authentication operator incorrectly marks itself Available=False and the upgrade process halts once that happens. The upgrade will never complete, but absent other unrelated issues the cluster should be healthy. Additionally, since the CVO halts reconciliation, any out of band changes made will remain intact until this problem is resolved.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  We have shipped a fix for this issue in 4.8.5. Upgrading to that version will heal the cluster; the following command should work:

  oc adm upgrade --to=4.8.5 --allow-upgrade-with-warnings

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  Yes, it's a regression between 4.7 and 4.8.2-4.8.4 and has been fixed in 4.8.5.
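For scripts that pick an update target, a small guard can encode the affected range from the impact statement. A minimal sketch, assuming the target version is passed as a plain X.Y.Z string; is_affected_target is a hypothetical helper name and only covers the 4.8.2-4.8.4 targets named above.

```shell
# is_affected_target VERSION
# Returns 0 (true) if VERSION is one of the 4.8 targets affected by this bug
# when coming from 4.7 (4.8.2, 4.8.3, 4.8.4), and 1 otherwise.
is_affected_target() {
    case "$1" in
        4.8.2|4.8.3|4.8.4) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_affected_target 4.8.3 && echo "pick 4.8.5 or later instead"` would print the hint, while 4.8.5 passes through silently.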
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759