Description of problem:

OSD v4 sets error, login, and providerSelection template overrides for branding in the OAuth cluster CR. On upgrade from 4.3.12 to 4.4.0-rc.8, OAuth fails even if that branding is removed prior to or after the upgrade. Observed in 4.4.0-rc.6 as well, but the data provided will be for 4.4.0-rc.8.

Version-Release number of selected component (if applicable):
4.4.0-rc.6
4.4.0-rc.8

How reproducible:
Every time.

Steps to Reproduce:
1. Create a 4.3.12 cluster.
2. Apply the OAuth templates.
3. Wait for authentication to finish processing the OAuth changes.
4. Remove the OAuth templates.
5. Wait for authentication to finish processing the OAuth changes.
6. Upgrade to 4.4.0-rc.8.

Actual results:
Any use of OAuth to access cluster web applications such as console, Prometheus, and Grafana fails.

Expected results:
OAuth functions with default branding and allows authentication to web applications such as console, Prometheus, and Grafana.

Additional info:

Apply OSD branding:

oc create -f https://raw.githubusercontent.com/openshift/managed-cluster-config/master/deploy/osd-oauth-templates-errors/osd-oauth-templates-errors.secret.yaml
oc create -f https://raw.githubusercontent.com/openshift/managed-cluster-config/master/deploy/osd-oauth-templates-login/osd-oauth-templates-login.secret.yaml
oc create -f https://raw.githubusercontent.com/openshift/managed-cluster-config/master/deploy/osd-oauth-templates-providers/osd-oauth-templates-providers.secret.yaml
oc patch oauth cluster --patch '{"spec":{"templates": {"login": {"name": "osd-oauth-templates-login"},"providerSelection": {"name": "osd-oauth-templates-providers"},"error": {"name": "osd-oauth-templates-errors"}}}}' --type merge

Remove OSD branding:

oc patch oauth cluster --patch '{"spec":{"templates":null}}' --type merge

I will attach must-gather from before and after the upgrade as soon as I can. The "before" will be captured after branding has been added and removed using the steps above. The "after" will be post-upgrade and after at least one failed attempt to log in to the console.
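Before upgrading, it's worth confirming the overrides were actually cleared and the authentication operator has settled. A minimal sketch using standard oc commands (nothing cluster-specific assumed):

```
# An empty result means .spec.templates has been removed from the OAuth CR.
oc get oauth cluster -o jsonpath='{.spec.templates}'

# Watch the authentication operator until it settles after the change.
oc get clusteroperator authentication -w
```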
https://issues.redhat.com/browse/OSD-3315
I happened to refresh the login for a cluster that had previously failed, and it is working now. I didn't change anything, and the openshift-authentication pods had completed rolling out. I will poll oauth periodically post-upgrade during the test I was planning for posting the must-gather...
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?
  Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non-standard admin activities
Is this a regression?
  No, it's always been like this, we just never noticed
  Yes, from 4.y.z to 4.y+1.z, or 4.y.z to 4.y.z+1
This is frustrating. I tested OCP 4.3.12 to 4.4.0-rc.8 with no problems; even the OSD branding worked when applied. I then turned hive syncing on to make it OSD 4.4.0-rc.8, the OSD 4.3 branding landed, and oauth was unhappy. But when the branding is removed with "oc patch oauth cluster --patch '{"spec":{"templates":null}}' --type merge", oauth still does not work; I get "Application is not available". This is what I experienced before and why I was trying to remove the branding first. The frustrating part is that this takes a long time to test and I don't have automation around it. I think we need it (https://issues.redhat.com/browse/OSD-3432) but it's not there yet. I will do more testing Monday to attempt to reproduce: start with a fresh OSD 4.3.12 installed cluster, remove branding, and upgrade. If upgrades consistently work, I'll be happy to close this.
4.4.0-rc.9 just landed in candidate-4.4 if you want to try that as well.
Thanks, good point. I will try further tests with whatever the most recent 4.4.0-rc in candidate-4.4 is with an upgrade edge from 4.3.12 (the next target version of OSD).
If you encounter the same issue again, please provide some logs, preferably a must-gather. And don't set your comments to private if they don't have to be. Also, you need to explain why you set the blocker keywords; while this may be concerning, I don't see how what you described is a common use case.
Reproduced with OSD 4.3.12 upgraded to 4.4.0-rc.9:

1. Provision an OSD 4.3.12 cluster.
2. Stop config management from syncing changes (so we can do step 3 and have it stick).
3. Edit OAuth and remove .spec.templates.
4. Wait for openshift-authentication to apply the change.
5. Verify login works via console.
6. Change channel to candidate-4.4.
7. oc adm upgrade --to 4.4.0-rc.9
8. Attempt to login via console.

Failure at step 8: "Application is not available". Screenshot will be attached and must-gather linked in a moment.
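For reference, the scripted steps above as commands (a sketch; the clusterversion patch is just one way to do step 6):

```
# Step 3: clear the template overrides on the OAuth cluster CR.
oc patch oauth cluster --patch '{"spec":{"templates":null}}' --type merge

# Step 6: switch the upgrade channel.
oc patch clusterversion version --type merge --patch '{"spec":{"channel":"candidate-4.4"}}'

# Step 7: start the upgrade.
oc adm upgrade --to 4.4.0-rc.9
```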
Created attachment 1680335 [details] attempt to view console
@sdodson:

Who is impacted?
  Customers upgrading from 4.3.z to 4.4.z that have set up OAuth template overrides prior to upgrade, even if those templates are removed prior to upgrade.
What is the impact?
  No user login that requires oauth works.
How involved is remediation?
  Unknown how to remediate.
Is this a regression?
  Yes, this functionality has worked since 4.1.0-rc.7.
Defaulting to blocking based on the repro identified above. Severity can be reduced with an explanation/workaround.
(In reply to Naveen Malik from comment #11)
> Who is impacted?
> Customers upgrading from 4.3.z to 4.4.z that have set up OAuth template
> overrides prior to upgrade, even if removing those templates prior to
> upgrade.

How can we identify these clusters? Can we look at something in Telemetry/Insights? Do we have to look at something on the cluster itself? Should an operator be setting Upgradeable=False on these clusters until we sort out a fix?
The screenshot appears to be a page indicating that all endpoints are down. However, the must-gather indicates that the endpoints are up, and the authentication and ingress operators are reporting healthy. Not sure what to debug next to figure out why the route cannot make contact. Moving to network-edge.
A couple of interesting things in the recent test: I upgraded a 4.3.12 OSD cluster to 4.4.0-rc.9 and was still not able to access oauth using Chrome. For fun I tried Firefox, which I don't use for anything. It worked! I asked someone else to try, and it didn't work in Vivaldi or Safari.

Noticed this error for https://oauth-openshift.apps..../favicon.ico:

```
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/favicon.ico\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}
```

My Firefox setup has a proxy to access an internal Jenkins host. When I turn off the proxy I get errors again. I see the favicon.ico error in Chrome; I haven't dug into Firefox yet.
(In reply to W. Trevor King from comment #13)
> (In reply to Naveen Malik from comment #11)
> > Who is impacted?
> > Customers upgrading from 4.3.z to 4.4.z that have set up OAuth template
> > overrides prior to upgrade, even if removing those templates prior to
> > upgrade.
>
> How can we identify these clusters? Can we look at something in
> Telemetry/Insights? Do we have to look at something on the cluster itself?
> Should an operator be setting Upgradeable=False on these clusters until we
> sort out a fix?

There may be some flag in telemeter to say they're "managed"; I can't recall without digging. SRE fully controls upgrades on these clusters, the customer cannot do the upgrade, so we'll make sure upgrades are not done until this is resolved.
> SRE fully controls upgrades on these clusters...

But this issue is not specific to SRE/OSD, is it? Presumably this is a generic OpenShift product issue that other customers could also hit, which is why Scott filed comment 3 asking the assigned team (since changed) to fill out an impact statement.
Created attachment 1680703 [details] prom-oauth-backend-metrics-example
I've spent time looking into the must-gather. The ingress operator, ingresscontroller, routes, services, endpoints, and pods all look good. Nothing stands out in the router pod logs. I compared the provided haproxy.config with one from my 4.5 cluster and I don't see any differences that would indicate an issue. I suggest checking the Prometheus metrics for the oauth route (see attachment for details), which should indicate whether the backend (i.e. the endpoints) is up.

To isolate the issue, it would be helpful if you can force the connection to use HTTP/1.1. [1] provides details on how to force Firefox to use HTTP/1.1.

[1] https://superuser.com/questions/1042053/how-to-make-firefox-use-http-1-1
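In addition to the Firefox setting, curl can force the protocol from the command line. A minimal sketch (hostname hypothetical; even a 403 from the oauth-server would prove the route reached a healthy endpoint):

```
# Pin the request to HTTP/1.1; if this succeeds where a browser fails,
# the problem is specific to HTTP/2 negotiation rather than the backend.
# Hostname below is hypothetical.
curl -k -I --http1.1 https://oauth-openshift.apps.example.com/
```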
My local Firefox was already set to HTTP/1.1.
My observations from Firefox today: I use Firefox on Arch, and the config Daneyon mentions is the default. However, per the logs, Firefox still uses HTTP/2 to access the console. When it gets the redirect to the oauth-server, it attempts to reuse the same connection. This is the request I see right after the redirect:

```
Request URL: https://oauth-openshift.apps.nmalik-4312u.b2j4.s1.devshift.org/oauth/authorize?client_id=console&redirect_uri=https%3A%2F%2Fconsole-openshift-console.apps.nmalik-4312u.b2j4.s1.devshift.org%2Fauth%2Fcallback&response_type=code&scope=user%3Afull&state=9c11bcdf
Request Method: GET
Remote Address: 54.81.231.37:443
Status Code: 503
Version: HTTP/2
Referrer Policy: strict-origin-when-cross-origin
```

However, wait 30 seconds for the cache to time out, and everything starts working (just hit refresh after 30s; hitting it repeatedly does not work, as that apparently refreshes the cache timeout as well):

```
Request URL: https://oauth-openshift.apps.nmalik-4312u.b2j4.s1.devshift.org/oauth/authorize?client_id=console&redirect_uri=https%3A%2F%2Fconsole-openshift-console.apps.nmalik-4312u.b2j4.s1.devshift.org%2Fauth%2Fcallback&response_type=code&scope=user%3Afull&state=9c11bcdf
Request Method: GET
Remote Address: 54.81.231.37:443
Status Code: 200
Version: HTTP/1.1
Referrer Policy: strict-origin-when-cross-origin
```

I wonder what changed from 4.3 to 4.4 that breaks this all of a sudden.
It's not 100% clear if this is related, but in 4.4 HTTP/2 was enabled on the router: https://github.com/openshift/router/pull/75
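If that change is in play, it should show up in the TLS handshake. A hedged sketch against the test cluster above (curl's ALPN log wording varies across versions):

```
# Verbose output includes the ALPN negotiation, e.g.
# "ALPN, server accepted to use h2" when the router offers HTTP/2.
curl -skv https://oauth-openshift.apps.nmalik-4312u.b2j4.s1.devshift.org/ -o /dev/null 2>&1 | grep -i alpn
```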
Nice find! And it makes sense of my previous testing, where I saw a mix of this not working and then working later; I must have refreshed a tab after the timeout. Confirmed locally in Chrome: waiting and refreshing after 30 seconds (40 in my specific test) works.
Looking more into HTTP/2 in OCP 4.4: it cannot be disabled per the PR noted above; it's always on. This means we have no remediation for the issue.

An additional point of context: OSD SRE is required to log in via the console to access any production OSD cluster. We are looking for an alternate solution, but this is where we are right now. Therefore, any production issue for OSD that requires SRE to access a cluster will require SRE to know to wait 30 seconds and refresh. This is going to be messy and result in human error and misunderstanding, meaning longer outages and issues for customer clusters.
Is this all reproducible from fresh/new incognito browser tabs too?
Yes, reproduced in incognito in Chrome 100% of the time. Confirmed incognito works after waiting the 30 seconds for the cache to time out and then refreshing.
I am spinning up a fresh 4.3.12 cluster and will NOT deploy any oauth templates. Will then upgrade to 4.4.0-rc.9 and check behavior. I expect it will work but will post details here when I have them in a few hours.
I removed the changes introduced by https://github.com/openshift/router/pull/75 in the test cluster that Naveen provided. I can now consistently hit https://console-openshift-console.apps.nmalik-4312u.b2j4.s1.devshift.org/. Prior to my changes, I would consistently hit the no endpoints 503 page.
(In reply to Daneyon Hansen from comment #32)
> I removed the changes introduced by
> https://github.com/openshift/router/pull/75 in the test cluster that Naveen
> provided. I can now consistently hit
> https://console-openshift-console.apps.nmalik-4312u.b2j4.s1.devshift.org/.
> Prior to my changes, I would consistently hit the no endpoints 503 page.

But why does everything work (presumably) when oauth templates are not in use? Or am I misunderstanding the problem?
Dan, I am spinning up a fresh 4.3.12 cluster, upgrading it to 4.4.0-rc.9, and have not applied any oauth customizations. Will collect must-gather and haproxy.config and update here once it's all done. I can provide the same cluster-admin access to those who have access to the other test cluster.
I should note that for this second cluster I am not applying any OSD customizations at all; it's the base install provided by hive. If we suspect additional OSD configs may impact oauth-server, I can enumerate the list of resources we apply to a cluster day-2 to make it OSD, or look at new and modified CRs in the previously provided must-gathers.
New cluster is up without any customizations; oauth-server works just fine and I can log in to the console without issue. Will add a reference to the must-gather in a sec.
In OpenShift 4.4, we enabled HTTP/2 on the frontend for the ingress controller; that is, we allow clients to ask the ingress controller to use HTTP/2, using ALPN.

Because the console and oauth-server are both behind the ingress controller's balancer and are using the same serving certificate (namely the ingress controller's default certificate), a browser might perform connection coalescing [1], meaning the browser re-uses the connection that it used to connect to the console route to connect to the oauth route. Because we enabled HTTP/2 on the frontend, this means the browser may connect to the console route using HTTP/2 and then re-use the HTTP/2 connection to try to connect to the oauth route, which fails.

In general, we cannot support HTTP/2 ALPN on routes that use the default certificate without risk of connection re-use/coalescing causing problems of this nature. To unblock this issue, we can disable HTTP/2 on the frontend. Later on, in order to support HTTP/2, we will need a solution that enables HTTP/2 only for routes that have custom certificates (which should prevent browsers from coalescing connections).

1. https://daniel.haxx.se/blog/2016/08/18/http2-connection-coalescing/
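The coalescing precondition is easy to check from a shell. A minimal sketch with hypothetical hostnames: if both routes present the same certificate and the server accepts h2 via ALPN, a browser is entitled to re-use one connection for both hosts.

```
# Matching fingerprints mean both routes serve the same (default wildcard)
# certificate, so an h2 connection to one may be coalesced onto the other.
# Hostnames below are hypothetical.
for host in console-openshift-console.apps.example.com \
            oauth-openshift.apps.example.com; do
  echo "== $host"
  openssl s_client -connect "$host:443" -servername "$host" \
    -alpn 'h2,http/1.1' </dev/null 2>/dev/null |
    openssl x509 -noout -fingerprint -sha256
done
```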
https://bugzilla.redhat.com/show_bug.cgi?id=1826994 filed to add oauth template e2e coverage.
Also tried without any oauth customizations and upgraded the cluster from 4.3.12 to 4.4.0-rc.8; I didn't reproduce the issue.

Browser: Firefox 74.0 (Build ID: 20200309095159)
Daneyon / Miciah, now that the problem is understood and a fix has been proposed, can you please answer the questions in comment 3 to the best of your ability? Key here would be the expected failure rate and the pre-conditions that trigger failure without the fix for this bug.
Clearing Stefan's needinfo now that Daneyon / Miciah are on the hook.
Who is impacted?
  Customers running 4.x or greater that expose multiple HTTP/2 services through a Route that uses the default (wildcard) certificate.
What is the impact?
  Inability to access those services without forcing the client to use HTTP/1.1.
How involved is remediation?
  Force clients to use HTTP/1.1 so ALPN does not choose HTTP/2.
  Disable the ALPN/HTTP/2 support that was added (but not officially supported) in 4.4. PR: https://github.com/openshift/router/pull/121
Is this a regression?
  Yes, from 4.3 to 4.x or greater.
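On the router side, whether the frontend still offers h2 is visible in the generated haproxy config. A hedged sketch (the config path and the presence of alpn options on the bind lines are the router image's usual layout and may differ):

```
# Bind lines carrying "alpn h2,http/1.1" indicate HTTP/2 is enabled on
# the frontend; no matches after the fix means ALPN h2 is off.
# Config path assumed from the usual router image layout.
oc -n openshift-ingress rsh deploy/router-default \
  grep -n alpn /var/lib/haproxy/conf/haproxy.config
```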
> Who is impacted?
> Customers running 4.x or greater that expose multiple HTTP/2 services
> through a Route that uses the default (wildcard) certificate.

Do we have any guess at how many customer clusters this is? My guess would be "lots", in which case getting the HTTP/1.1-force back into 4.4.0 (bug 1826990?) would remain a 4.4.0 blocker, right? And not even as an update-scoped blocker, just a blocker, period, right?
Verified with 4.5.0-0.nightly-2020-04-25-170442 that all routes are using HTTP/1.1 now.

```
$ curl https://console-openshift-console.apps.hongli-pl442.qe.azure.devcluster.openshift.com -k -I
HTTP/1.1 200 OK

$ curl https://oauth-openshift.apps.hongli-pl442.qe.azure.devcluster.openshift.com -k -I
HTTP/1.1 403 Forbidden

$ curl https://my-edge-hongli.apps.hongli-pl442.qe.azure.devcluster.openshift.com -k -I
HTTP/1.1 200 OK

$ curl http://service-unsecure-hongli.apps.hongli-pl442.qe.azure.devcluster.openshift.com -I
HTTP/1.1 200 OK
```
We opened this 4.5.0 bug to disable HTTP/2 so that we could disable HTTP/2 in 4.4.0 to fix bug 1826990. We later re-enabled HTTP/2 in 4.5.0, with some changes to avoid breaking OAuth again, so no doc update is needed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475