Bug 1988576
| Summary: | Authentication operator fails to become available during upgrade to 4.8.2 | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | rvanderp |
| Component: | apiserver-auth | Assignee: | Sergiusz Urbaniak <surbania> |
| Status: | CLOSED ERRATA | QA Contact: | Xingxing Xia <xxia> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.8 | CC: | alkazako, amcdermo, aos-bugs, cblecker, david.karlsen, fsoppels, g.parera, jscalf, liyao, lmohanty, mbargenq, mfojtik, mtleilia, mwhittin, nsu, rsandu, sdodson, slaznick, surbania, wking, xxia, yanyang, ychoukse |
| Target Milestone: | --- | Keywords: | Regression, Reopened, ServiceDeliveryBlocker, ServiceDeliveryImpact, UpgradeBlocker |
| Target Release: | 4.9.0 | Flags: | ychoukse: needinfo- |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | UpdateRecommendationsBlocked | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| : | 1989587 (view as bug list) | | |
| Last Closed: | 2021-10-18 17:43:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1989587 | | |
| Attachments: | | | |
Description
rvanderp
2021-07-30 21:37:36 UTC
The underlying issue seems to be that the route endpoint check fails with a timeout:

```
OAuthRouteCheckEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.build01-us-west-2.vmc.ci.openshift.org/healthz": context canceled
```

This is rather network related; the oauth pod cannot reach external routes. Could you enable router access logs?

```
$ oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge --patch='{"spec":{"logging":{"access":{"destination":{"type":"Container"}}}}}'
```

That will restart the ingress router pods, and each pod will then have a "logs" container. I'd like to see if this helps provide any insight or correlation w.r.t. the auth GET failure mentioned in comment #1.
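For reference, a minimal way to follow those access logs once the patch has rolled out; this is a hedged sketch, assuming the default ingresscontroller, whose router pods run in the openshift-ingress namespace behind the router-default deployment:

```
# Tail the access logs from the "logs" sidecar container added by the patch above.
# Assumes the default ingresscontroller; adjust the deployment name otherwise.
$ oc -n openshift-ingress logs deploy/router-default -c logs -f
```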
On top of that, would it be possible to run the curler binary (see attachments)? It is a wrapped-up version of curl that repeatedly makes a GET request. Usage would be:

```
$ O=1 ./curler https://oauth-openshift.apps.build01-us-west-2.vmc.ci.openshift.org/healthz
reopening stdout to "curler-R0-2021-08-02-133557.stdout"
reopening stderr to "curler-R0-2021-08-02-133557.stderr"
```

You can `tail -f` the .stdout file to watch the GET requests to the endpoint, looking for slow requests, requests that fail, DNS issues, et al. It will repeat the GET indefinitely. Can we run this external to the cluster and from a pod (or node?) within the cluster? I'd like to see the same failure from the curler binary that we do from the auth pod.

Created attachment 1810102 [details]
curler binary - make repeated calls to an endpoint.

Usage:

```
O=1 ./curler <URL>
```
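For readers without the attachment, a rough bash approximation of what such a loop does; this is a sketch, not the attached binary, and only mirrors the per-request `http_code` lines that are grepped in the test results below:

```
#!/usr/bin/env bash
# Stand-in for the curler attachment: GET the URL in a loop and log the HTTP
# status code and total time for each request, so failures, slow responses,
# or DNS hiccups stand out. Usage: ./curl-loop.sh <URL>
url="$1"
while true; do
  # -s silent, -k skip TLS verification, -o discard body, -m 10s timeout,
  # -w print the status code and total request time
  line=$(curl -sk -o /dev/null -m 10 \
         -w "http_code %{http_code} time_total %{time_total}" "$url" 2>&1)
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) ${line}"
  sleep 1
done
```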
needinfo should go to OP.

Performed 36 minutes of curl testing. From the perspective of the auth pod:

```
sh-4.4# cat curler-R0-2021-08-02-130644.stderr
sh-4.4# cat curler-R0-2021-08-02-130644.stdout | grep "http_code 200" | wc -l
123425
sh-4.4# cat curler-R0-2021-08-02-130644.stdout | grep -v "http_code 200" | wc -l
0
```

From the perspective of an external caller:

```
$ cat curler-R0-2021-08-02-090657.stderr
$ cat curler-R0-2021-08-02-090657.stdout | grep "http_code 200" | wc -l
5808
$ cat curler-R0-2021-08-02-090657.stdout | grep -v "http_code 200" | wc -l
0
```

No failures were observed from the degraded operator pod or externally.

The issue is that the failing route status was set by the version 4.7 authentication operator:

```
$ kubectl get co authentication -o yaml
...
  - lastTransitionTime: "2021-08-02T21:47:41Z"
    message: 'OAuthRouteCheckEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.build01-us-west-2.vmc.ci.openshift.org/healthz": context canceled'
    reason: OAuthRouteCheckEndpointAccessibleController_EndpointUnavailable
    status: "False"
    type: Available
```

However, the prefix OAuthRoute was changed to OAuthServerRoute, and the stale status controller entries have not been updated correctly. Instead of referring to OAuthRouteCheckEndpointAccessibleController_Available_, they refer to OAuthRouteCheckEndpointAccessibleController_Degraded_: https://github.com/openshift/cluster-authentication-operator/blob/4dfd59792e303282731f6120a22b042144901b39/pkg/operator/starter.go#L256

Worked around this issue by appending an `Available: True` condition to the operator conditions in etcd. The upgrade is proceeding. Edit: I had to remove the condition `OAuthRouteCheckEndpointAccessibleController` from the authentications/cluster resource.
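For anyone inspecting a cluster in this state, a hedged sketch of how the stale conditions can be viewed before changing anything by hand; the jq filter is illustrative and assumes jq is available:

```
# Aggregated view on the ClusterOperator (same data as the YAML above):
$ oc get clusteroperator authentication \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'

# Per-controller conditions on the operator config, where the stale
# OAuthRouteCheckEndpointAccessibleController* entries were observed:
$ oc get authentications.operator.openshift.io cluster -o json \
    | jq '.status.conditions[] | select(.type | startswith("OAuthRouteCheckEndpointAccessibleController"))'
```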
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?

- example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
- example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?

- example: Up to 2 minute disruption in edge routing
- example: Up to 90 seconds of API downtime
- example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

- example: Issue resolves itself after five minutes
- example: Admin uses oc to fix things
- example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

- example: No, it's always been like this, we just never noticed
- example: Yes, from 4.y.z to 4.y+1.z or 4.y.z to 4.y.z+1

> > Customers upgrading from 4.7.z to 4.8.z during a period when the authentication operator is unable to reach the oauth route just as the authentication operator is rolling out 4.8.
> >
> > This should be prevented by CVO as upgrades should not be possible while operators report degraded. We are still not sure why the upgrade was still possible.

Because sometimes updating to a new release is how we want folks to fix a degraded operator. During updates, the CVO blocks on ClusterOperator manifests when they are Degraded=True [1], but that's generally (always?) after the manifest for the operator deployment. Blocking a move from release A to release B because A's operator X is Degraded=True isn't crisp, because we'll only block mid-update if B's operator X is also Degraded=True. Or maybe the degradation root is outside operator X completely, in which case maybe it gets sorted out before we get around to ClusterOperator X, or maybe not, but that chance is still better than forcing folks to manually recover or force through a guard before they can attempt an update.

[1]: https://github.com/openshift/enhancements/blob/ac1c27da8307933263e5273bc087b407d79f713f/dev-guide/cluster-version-operator/user/reconciliation.md#clusteroperator

We have not seen this issue in Telemetry, and per our discussion it seems like a corner-case issue, so we have decided not to remove the edge from 4.7 to 4.8 for this bug. However, if we get evidence of this bug impacting more clusters, we will reconsider the decision.

It seems that has changed: https://github.com/openshift/cincinnati-graph-data/pull/987

*** Bug 1993712 has been marked as a duplicate of this bug. ***

It's been confirmed that a cluster which had run into this upgrade-halting problem completed the upgrade after applying 4.8.5, which contained the backported version of this fix. This bug primarily served as a prerequisite for backporting that change to 4.8, and as such I'm marking this CLOSED CURRENTRELEASE after the verification I just mentioned.

With a few more clusters getting stuck on this issue, and 4.8.5 now in fast-4.8 with the fix [1,2], we've blocked 4.7 -> 4.8[234] [3] to keep future updates from hitting this same problem.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1989587#c5
[2]: https://github.com/openshift/cincinnati-graph-data/pull/988#event-5164866620
[3]: https://github.com/openshift/cincinnati-graph-data/pull/987

Updated Impact Statement

Who is impacted?
Some clusters upgrading from 4.7 to 4.8.2-4.8.4.

What is the impact? Is it serious enough to warrant blocking edges?
The authentication operator incorrectly marks itself Available=False, and the upgrade process halts once that happens. The upgrade will never complete, but absent other unrelated issues the cluster should be healthy. Additionally, since the CVO halts reconciliation, any out-of-band changes made will remain intact until this problem is resolved.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
We have shipped a fix for this issue in 4.8.5; upgrading to that version will heal the cluster. The following command should work:

```
oc adm upgrade --to=4.8.5 --allow-upgrade-with-warnings
```

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
Yes, it's a regression between 4.7 and 4.8.2-4.8.4 and has been fixed in 4.8.5.
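A quick, hedged way to confirm recovery after moving to 4.8.5; these are standard oc commands and the exact output will vary by cluster:

```
# The update should resume and complete once 4.8.5 is applied
$ oc adm upgrade
$ oc get clusterversion

# The authentication ClusterOperator should report Available=True again
$ oc get clusteroperator authentication
```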
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759