Description of problem:

OSD v4 sets error, login, and providerSelection template overrides for branding in the OAuth cluster CR. On upgrade from 4.3.12 to 4.4.0-rc.8, OAuth fails even if that branding is removed prior to or after the upgrade. Observed in 4.4.0-rc.6 as well, but the data provided will be for 4.4.0-rc.8.

Version-Release number of selected component (if applicable):
4.4.0-rc.6
4.4.0-rc.8

How reproducible:
Every time.

Steps to Reproduce:
1. Create a 4.3.12 cluster.
2. Apply the OAuth templates.
3. Wait for authentication to finish processing the OAuth changes.
4. Remove the OAuth templates.
5. Wait for authentication to finish processing the OAuth changes.
6. Upgrade to 4.4.0-rc.8.

Actual results:
Any use of OAuth to access cluster web applications such as console, Prometheus, and Grafana fails.

Expected results:
OAuth functions with default branding and allows authentication to web applications such as console, Prometheus, and Grafana.

Additional info:

Apply OSD branding:

oc create -f https://raw.githubusercontent.com/openshift/managed-cluster-config/master/deploy/osd-oauth-templates-errors/osd-oauth-templates-errors.secret.yaml
oc create -f https://raw.githubusercontent.com/openshift/managed-cluster-config/master/deploy/osd-oauth-templates-login/osd-oauth-templates-login.secret.yaml
oc create -f https://raw.githubusercontent.com/openshift/managed-cluster-config/master/deploy/osd-oauth-templates-providers/osd-oauth-templates-providers.secret.yaml
oc patch oauth cluster --patch '{"spec":{"templates": {"login": {"name": "osd-oauth-templates-login"},"providerSelection": {"name": "osd-oauth-templates-providers"},"error": {"name": "osd-oauth-templates-errors"}}}}' --type merge

Remove OSD branding:

oc patch oauth cluster --patch '{"spec":{"templates":null}}' --type merge

I will attach must-gather from before and after the upgrade as soon as I can. The "before" will be captured after branding has been added and removed using the steps above. The "after" will be post-upgrade and after at least one failed attempt to log in to the console.
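Before upgrading, it's worth confirming the overrides were actually cleared and the authentication operator has settled. A minimal sketch using standard oc commands (nothing cluster-specific assumed):

```
# An empty result means .spec.templates has been removed from the OAuth CR.
oc get oauth cluster -o jsonpath='{.spec.templates}'

# Watch the authentication operator until it settles after the change.
oc get clusteroperator authentication -w
```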
https://issues.redhat.com/browse/OSD-3315
I happened to refresh the login for a cluster that had previously failed, and it is working now. I didn't change anything, and the openshift-authentication pods had completed rolling out. I will poll oauth periodically post-upgrade during the test I was planning for posting the must-gather...
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?
  Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non-standard admin activities
Is this a regression?
  No, it's always been like this, we just never noticed
  Yes, from 4.y.z to 4.y+1.z, or 4.y.z to 4.y.z+1
This is frustrating. I tested OCP 4.3.12 to 4.4.0-rc.8 with no problems; even the OSD branding worked when applied. I then turned hive syncing on to make it OSD 4.4.0-rc.8, the OSD 4.3 branding landed, and oauth was unhappy. But when the branding is removed with "oc patch oauth cluster --patch '{"spec":{"templates":null}}' --type merge", oauth still does not work; I get "Application is not available". This is what I experienced before and why I was trying to remove the branding first. The frustrating part is that this takes a long time to test and I don't have automation around it. I think we need it (https://issues.redhat.com/browse/OSD-3432) but it's not there yet. I will do more testing Monday to attempt to reproduce: start with a fresh OSD 4.3.12 installed cluster, remove branding, and upgrade. If upgrades consistently work, I'll be happy to close this.
4.4.0-rc.9 just landed in candidate-4.4 if you want to try that as well.
Thanks, good point. I will try further tests with whatever the most recent 4.4.0-rc in candidate-4.4 is with an upgrade edge from 4.3.12 (the next target version of OSD).
If you encounter the same issue again, please provide some logs, preferably a must-gather. And don't set your comments to private if they don't have to be. Also, you need to explain why you set the blocker keywords; while this may be concerning, I don't see how what you described is a common use case.
Reproduced with OSD 4.3.12 upgraded to 4.4.0-rc.9:

1. Provision an OSD 4.3.12 cluster.
2. Stop config management from syncing changes (so we can do step 3 and have it stick).
3. Edit OAuth and remove .spec.templates.
4. Wait for openshift-authentication to apply the change.
5. Verify login works via console.
6. Change channel to candidate-4.4.
7. oc adm upgrade --to 4.4.0-rc.9
8. Attempt to login via console.

Failure at step 8: "Application is not available". Screenshot will be attached and must-gather linked in a moment.
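For reference, the scripted steps above as commands (a sketch; the clusterversion patch is just one way to do step 6):

```
# Step 3: clear the template overrides on the OAuth cluster CR.
oc patch oauth cluster --patch '{"spec":{"templates":null}}' --type merge

# Step 6: switch the upgrade channel.
oc patch clusterversion version --type merge --patch '{"spec":{"channel":"candidate-4.4"}}'

# Step 7: start the upgrade.
oc adm upgrade --to 4.4.0-rc.9
```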
Created attachment 1680335 [details] attempt to view console
@sdodson:

Who is impacted?
  Customers upgrading from 4.3.z to 4.4.z that have set up OAuth template overrides prior to upgrade, even if those templates are removed prior to upgrade.
What is the impact?
  No user login that requires oauth works.
How involved is remediation?
  Unknown how to remediate.
Is this a regression?
  Yes, this functionality has worked since 4.1.0-rc.7.
Defaulting to blocking based on the repro identified above. Severity can be reduced with an explanation/workaround.
(In reply to Naveen Malik from comment #11)
> Who is impacted?
> Customers upgrading from 4.3.z to 4.4.z that have set up OAuth template
> overrides prior to upgrade, even if removing those templates prior to
> upgrade.

How can we identify these clusters? Can we look at something in Telemetry/Insights? Do we have to look at something on the cluster itself? Should an operator be setting Upgradeable=False on these clusters until we sort out a fix?
The screenshot appears to be a page indicating that all endpoints are down. However, the must-gather indicates that the endpoints are up, and the authentication and ingress operators are reporting healthy. Not sure what to debug next to figure out why the route cannot make contact. Moving to network-edge.
A couple of interesting things in the recent test: I upgraded a 4.3.12 OSD cluster to 4.4.0-rc.9 and was still not able to access oauth using Chrome. For fun I tried Firefox, which I don't use for anything. It worked! I asked someone else to try, and it didn't work in Vivaldi or Safari.

Noticed this error for https://oauth-openshift.apps..../favicon.ico:

```
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/favicon.ico\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}
```

My Firefox setup has a proxy to access an internal Jenkins host. When I turn off the proxy I get errors again. I see the favicon.ico error in Chrome; I haven't dug into Firefox yet.
(In reply to W. Trevor King from comment #13)
> (In reply to Naveen Malik from comment #11)
> > Who is impacted?
> > Customers upgrading from 4.3.z to 4.4.z that have set up OAuth template
> > overrides prior to upgrade, even if removing those templates prior to
> > upgrade.
>
> How can we identify these clusters? Can we look at something in
> Telemetry/Insights? Do we have to look at something on the cluster itself?
> Should an operator be setting Upgradeable=False on these clusters until we
> sort out a fix?

There may be some flag in telemeter to say they're "managed"; I can't recall without digging. SRE fully controls upgrades on these clusters, the customer cannot do the upgrade, so we'll make sure upgrades are not done until this is resolved.
> SRE fully controls upgrades on these clusters...

But this issue is not specific to SRE/OSD, is it? Presumably this is a generic OpenShift product issue that other customers could also hit, which is why Scott filed comment 3 asking the assigned team (since changed) to fill out an impact statement.
Created attachment 1680703 [details] prom-oauth-backend-metrics-example
I've spent time looking into the must-gather. The ingress operator, ingresscontroller, routes, services, endpoints, and pods all look good. Nothing stands out in the router pod logs. I compared the provided haproxy.config with one from my 4.5 cluster and I don't see any differences that would indicate an issue. I suggest checking the Prometheus metrics for the oauth route (see attachment for details), which should indicate whether the backend (i.e. the endpoints) is up.

To isolate the issue, it would be helpful if you can force the connection to use HTTP/1.1. [1] provides details on how to force Firefox to use HTTP/1.1.

[1] https://superuser.com/questions/1042053/how-to-make-firefox-use-http-1-1
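In addition to the Firefox setting, curl can force the protocol from the command line. A minimal sketch (hostname hypothetical; even a 403 from the oauth-server would prove the route reached a healthy endpoint):

```
# Pin the request to HTTP/1.1; if this succeeds where a browser fails,
# the problem is specific to HTTP/2 negotiation rather than the backend.
# Hostname below is hypothetical.
curl -k -I --http1.1 https://oauth-openshift.apps.example.com/
```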
My local Firefox was already set to HTTP/1.1.
My observations from Firefox today: I use Firefox on Arch, and the config Daneyon mentions is the default. However, per the logs, Firefox still uses HTTP/2 to access the console. When it gets the redirect to the oauth-server, it attempts to reuse the same connection. This is the request I see right after the redirect:

```
Request URL: https://oauth-openshift.apps.nmalik-4312u.b2j4.s1.devshift.org/oauth/authorize?client_id=console&redirect_uri=https%3A%2F%2Fconsole-openshift-console.apps.nmalik-4312u.b2j4.s1.devshift.org%2Fauth%2Fcallback&response_type=code&scope=user%3Afull&state=9c11bcdf
Request Method: GET
Remote Address: 54.81.231.37:443
Status Code: 503
Version: HTTP/2
Referrer Policy: strict-origin-when-cross-origin
```

However, wait 30 seconds for the cache to time out, and everything starts working (just hit refresh after 30s; hitting it repeatedly does not work, as that apparently refreshes the cache timeout as well):

```
Request URL: https://oauth-openshift.apps.nmalik-4312u.b2j4.s1.devshift.org/oauth/authorize?client_id=console&redirect_uri=https%3A%2F%2Fconsole-openshift-console.apps.nmalik-4312u.b2j4.s1.devshift.org%2Fauth%2Fcallback&response_type=code&scope=user%3Afull&state=9c11bcdf
Request Method: GET
Remote Address: 54.81.231.37:443
Status Code: 200
Version: HTTP/1.1
Referrer Policy: strict-origin-when-cross-origin
```

I wonder what changed from 4.3 to 4.4 that breaks this all of a sudden.
It's not 100% clear if this is related, but in 4.4 HTTP/2 was enabled on the router: https://github.com/openshift/router/pull/75
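If that change is in play, it should show up in the TLS handshake. A hedged sketch against the test cluster above (curl's ALPN log wording varies across versions):

```
# Verbose output includes the ALPN negotiation, e.g.
# "ALPN, server accepted to use h2" when the router offers HTTP/2.
curl -skv https://oauth-openshift.apps.nmalik-4312u.b2j4.s1.devshift.org/ -o /dev/null 2>&1 | grep -i alpn
```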
Nice find! And it makes sense of my previous testing, where I saw a mix of this not working and then working later; I must have refreshed a tab after the timeout. Confirmed locally in Chrome: waiting and refreshing after 30 seconds (40 in my specific test) works.
Looking more into HTTP/2 in OCP 4.4: it cannot be disabled per the PR noted above; it's always on. This means we have no remediation for the issue.

An additional point of context: OSD SRE is required to log in via the console to access any production OSD cluster. We are looking for an alternate solution, but this is where we are right now. Therefore, any production issue for OSD that requires SRE to access a cluster will require SRE to know to wait 30 seconds and refresh. This is going to be messy and result in human error and misunderstanding, meaning longer outages and issues for customer clusters.
Is this all reproducible from fresh/new incognito browser tabs too?
Yes, reproduced in incognito in Chrome 100% of the time. Confirmed incognito works after waiting the 30 seconds for the cache to time out and then refreshing.
I am spinning up a fresh 4.3.12 cluster and will NOT deploy any oauth templates. Will then upgrade to 4.4.0-rc.9 and check behavior. I expect it will work but will post details here when I have them in a few hours.
I removed the changes introduced by https://github.com/openshift/router/pull/75 in the test cluster that Naveen provided. I can now consistently hit https://console-openshift-console.apps.nmalik-4312u.b2j4.s1.devshift.org/. Prior to my changes, I would consistently hit the no endpoints 503 page.
(In reply to Daneyon Hansen from comment #32)
> I removed the changes introduced by
> https://github.com/openshift/router/pull/75 in the test cluster that Naveen
> provided. I can now consistently hit
> https://console-openshift-console.apps.nmalik-4312u.b2j4.s1.devshift.org/.
> Prior to my changes, I would consistently hit the no endpoints 503 page.

But why does everything work (presumably) when oauth templates are not in use? Or am I misunderstanding the problem?
Dan, I am spinning up a fresh 4.3.12 cluster, upgrading it to 4.4.0-rc.9, and have not applied any oauth customizations. Will collect must-gather and haproxy.config and update here once it's all done. I can provide the same cluster-admin access to those who have access to the other test cluster.
I should note that for this second cluster I am not applying any OSD customizations at all; it's the base install provided by hive. If we suspect additional OSD configs may impact oauth-server, I can enumerate the list of resources we apply to a cluster day-2 to make it OSD, or look at new and modified CRs in the previously provided must-gathers.
New cluster is up without any customizations; oauth-server works just fine and I can log in to the console without issue. Will add a reference to the must-gather in a sec.
In OpenShift 4.4, we enabled HTTP/2 on the frontend for the ingress controller; that is, we allow clients to ask the ingress controller to use HTTP/2, using ALPN.

Because the console and oauth-server are both behind the ingress controller's balancer and are using the same serving certificate (namely the ingress controller's default certificate), a browser might perform connection coalescing [1], meaning the browser re-uses the connection that it used to connect to the console route to connect to the oauth route. Because we enabled HTTP/2 on the frontend, this means the browser may connect to the console route using HTTP/2 and then re-use the HTTP/2 connection to try to connect to the oauth route, which fails.

In general, we cannot support HTTP/2 ALPN on routes that use the default certificate without risk of connection re-use/coalescing causing problems of this nature. To unblock this issue, we can disable HTTP/2 on the frontend. Later on, in order to support HTTP/2, we will need a solution that enables HTTP/2 only for routes that have custom certificates (which should prevent browsers from coalescing connections).

1. https://daniel.haxx.se/blog/2016/08/18/http2-connection-coalescing/
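The coalescing precondition is easy to check from a shell. A minimal sketch with hypothetical hostnames: if both routes present the same certificate and the server accepts h2 via ALPN, a browser is entitled to re-use one connection for both hosts.

```
# Matching fingerprints mean both routes serve the same (default wildcard)
# certificate, so an h2 connection to one may be coalesced onto the other.
# Hostnames below are hypothetical.
for host in console-openshift-console.apps.example.com \
            oauth-openshift.apps.example.com; do
  echo "== $host"
  openssl s_client -connect "$host:443" -servername "$host" \
    -alpn 'h2,http/1.1' </dev/null 2>/dev/null |
    openssl x509 -noout -fingerprint -sha256
done
```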
https://bugzilla.redhat.com/show_bug.cgi?id=1826994 filed to add oauth template e2e coverage.
Also tried without any oauth customizations and upgraded the cluster from 4.3.12 to 4.4.0-rc.8; I didn't reproduce the issue.

Browser: Firefox 74.0 (Build ID: 20200309095159)
Daneyon / Miciah, now that the problem is understood and a fix has been proposed, can you please answer the questions in comment 3 to the best of your ability? Key here would be the expected failure rate and the pre-conditions that trigger failure without the fix for this bug.
Clearing Stefan's needinfo now that Daneyon / Miciah are on the hook.
Who is impacted?
  Customers running 4.x or greater that expose multiple HTTP/2 services through a Route that uses the default (wildcard) certificate.
What is the impact?
  Inability to access those services without forcing the client to use HTTP/1.1.
How involved is remediation?
  Force clients to use HTTP/1.1 so ALPN does not choose HTTP/2.
  Disable the ALPN/HTTP/2 support that was added (but not officially supported) in 4.4. PR: https://github.com/openshift/router/pull/121
Is this a regression?
  Yes, from 4.3 to 4.x or greater.
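On the router side, whether the frontend still offers h2 is visible in the generated haproxy config. A hedged sketch (the config path and the presence of alpn options on the bind lines are the router image's usual layout and may differ):

```
# Bind lines carrying "alpn h2,http/1.1" indicate HTTP/2 is enabled on
# the frontend; no matches after the fix means ALPN h2 is off.
# Config path assumed from the usual router image layout.
oc -n openshift-ingress rsh deploy/router-default \
  grep -n alpn /var/lib/haproxy/conf/haproxy.config
```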
> Who is impacted?
> Customers running 4.x or greater that expose multiple HTTP/2 services
> through a Route that uses the default (wildcard) certificate.

Do we have any guess at how many customer clusters this is? My guess would be "lots", in which case getting the HTTP/1.1-force back into 4.4.0 (bug 1826990?) would remain a 4.4.0 blocker, right? And not even as an update-scoped blocker, just a blocker, period, right?
Verified with 4.5.0-0.nightly-2020-04-25-170442 that all routes are using HTTP/1.1 now.

```
$ curl https://console-openshift-console.apps.hongli-pl442.qe.azure.devcluster.openshift.com -k -I
HTTP/1.1 200 OK

$ curl https://oauth-openshift.apps.hongli-pl442.qe.azure.devcluster.openshift.com -k -I
HTTP/1.1 403 Forbidden

$ curl https://my-edge-hongli.apps.hongli-pl442.qe.azure.devcluster.openshift.com -k -I
HTTP/1.1 200 OK

$ curl http://service-unsecure-hongli.apps.hongli-pl442.qe.azure.devcluster.openshift.com -I
HTTP/1.1 200 OK
```
We opened this 4.5.0 bug to disable HTTP/2 so that we could disable HTTP/2 in 4.4.0 to fix bug 1826990. We later re-enabled HTTP/2 in 4.5.0, with some changes to avoid breaking OAuth again, so no doc update is needed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475