1956879 – authentication errors with "square/go-jose: error in cryptographic primitive" are observed in the CI

Bug 1956879 - authentication errors with "square/go-jose: error in cryptographic primitive" are observed in the CI [NEEDINFO]

Summary: authentication errors with "square/go-jose: error in cryptographic primitive"...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	4.9.0
Assignee:	Sergiusz Urbaniak
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:	tag-ci LifecycleReset
Depends On:
Blocks:	2008414
TreeView+	depends on / blocked

Reported:	2021-05-04 15:22 UTC by Standa Laznicka
Modified:	2021-10-18 17:30 UTC (History)
CC List:	6 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	2008414 (view as bug list)
Environment:
Last Closed:	2021-10-18 17:30:50 UTC
Target Upstream Version:
Embargoed:
Flags:	mfojtik: needinfo?

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-etcd-operator pull 650	0	None	None	None	2021-08-26 07:34:10 UTC
Red Hat Bugzilla	1907728	1	low	CLOSED	[invalid bearer token, square/go-jose: error in cryptographic primitive, token lookup failed] in some kube-rbac-proxy co...	2022-03-21 19:53:14 UTC
Red Hat Product Errata	RHSA-2021:3759	0	None	None	None	2021-10-18 17:30:53 UTC

Description Standa Laznicka 2021-05-04 15:22:18 UTC

Description of problem:
Some services that delegate authentication, like kube-rbac-proxy and group-b operators seem to be reporting authentication failures in CI clusters with 
`square/go-jose: error in cryptographic primitive`.

We need to figure out why that happens.

https://search.ci.openshift.org/?search=square%2Fgo-jose%3A+error+in+cryptographic+primitive&maxAge=336h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 1 Michal Fojtik 2021-06-03 15:29:09 UTC

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 2 Sebastian Łaskawiec 2021-06-15 06:30:09 UTC

There are 2 subtypes of this error:
- square/go-jose: error in cryptographic primitive, old, insecure token format
- square/go-jose: error in cryptographic primitive, token lookup failed

The former might be related to some leftovers after forbidding the old token format [1]. Sergiusz Urbaniak - I've seen this happening in the monitoring Pods, can I kindly ask you to let the Monitoring Team know about this? Here are two examples extracted from [2]:
- Jun 14 19:53:09.227 E ns/openshift-monitoring pod/thanos-querier-74b7584698-7c7cq node/ip-10-0-227-143.us-west-1.compute.internal container/oauth-proxy reason/ContainerExit code/2 cause/Error format]\n2021/06/14 18:50:55 oauthproxy.go:793: requestauth: 10.128.0.7:39240 [invalid bearer token, square/go-jose: error in cryptographic primitive, old, insecure token format]\n2021/06/14 18:50:57 [...]
- Jun 13 22:56:27.417 E ns/openshift-monitoring pod/alertmanager-main-1 node/ip-10-0-238-162.ec2.internal container/alertmanager-proxy reason/ContainerExit code/2 cause/Error /06/13 22:31:28 oauthproxy.go:793: requestauth: 10.128.2.16:36640 [invalid bearer token, square/go-jose: error in cryptographic primitive, old, insecure token format][...]

The latter is more interesting and happens when the Token Authenticator can not get Tokens [3][4]. Analyzing one of the failed builds [5] I found the kube-apiserver was emitting this error at the time shown below:

2021-06-15T02:28:51.789845375Z E0615 02:28:51.788905      19 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, Token has been invalidated, token lookup failed]"
2021-06-15T02:28:51.789845375Z E0615 02:28:51.789294      19 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, Token has been invalidated, token lookup failed]"
2021-06-15T02:28:51.789845375Z E0615 02:28:51.789419      19 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, Token has been invalidated, token lookup failed]"
2021-06-15T02:28:51.789845375Z E0615 02:28:51.789573      19 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, Token has been invalidated, token lookup failed]"

After this time, the error stopped appearing. Interestingly, the API Server Pods started earlier than that but it took some time until they connected to Etcd and started serving requests:

- apiserver-58b64fd885-5gg7b: ~02:40:09.150111
- apiserver-58b64fd885-9lw55: ~02:40:09.150174208Z
- apiserver-58b64fd885-hnmxf: ~02:28:44.705157016Z

Based on the timestamps the above, I believe this is a timing issue. Things are booting up and the oAuth API Server temporarily can not obtain Tokens. Standa - if you agree with me, that will probably be a "won't fix".

[1] https://github.com/openshift/oauth-apiserver/pull/44
[2] https://search.ci.openshift.org/?search=square%2Fgo-jose%3A+error+in+cryptographic+primitive%2C+old%2C+insecure+token+format&maxAge=336h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
[3] https://github.com/openshift/oauth-apiserver/blob/09435a5dd505b3b90eb7ce355ab41c8e4c1a349c/pkg/tokenvalidation/tokenauthenticator.go#L53
[4] https://github.com/openshift/oauth-apiserver/blob/09435a5dd505b3b90eb7ce355ab41c8e4c1a349c/pkg/tokenvalidation/tokenauthenticator.go#L62
[5] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/574/pull-ci-openshift-ovn-kubernetes-master-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1404592194467139584

Comment 3 Michal Fojtik 2021-06-15 07:06:02 UTC

The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 4 Sebastian Łaskawiec 2021-06-18 08:50:49 UTC

It turns out my previous explanation was entirely incorrect. Standa clarified that Kube API Server is one of the first things that we start. Such a timing error is simply impossible in this case.

So far I've verified:
- This is not a new problem, it started happening in 4.7: https://bugzilla.redhat.com/show_bug.cgi?id=1907728
- The square/go-jose code suggests that this error when verifying the token signature
- The SA keys/certs haven't been rotated
- The failure happened in a Pod that has been restarted, so I can't compare mounted certs if they match the API server
- Couldn't find anything in events
- Couldn't find anything in audit logs

I've created 2 debugging PRs that might help me investigate this failure further:
- https://github.com/openshift/kubernetes/pull/816
- https://github.com/openshift/oauth-apiserver/pull/57

Comment 5 Sebastian Łaskawiec 2021-06-22 09:43:08 UTC

Closing as there's not enough data to sort this problem out. Both PRs (mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1956879#c4) didn't catch anything suspicious. 

Since this error also appeared in 4.7, it seems it's not related anyhow with Bounded Service Account Tokens and key rotation.

In order to tell anything more about it, I'd need a stable way to reproduce it.

Comment 6 Sebastian Łaskawiec 2021-06-28 09:12:49 UTC

Together with Sergiusz and Standa we decided to keep this bug around. 

Unfortunately we do not have enough data to debug it further. Logging tokens and cryptographic keys anywhere is simply a no-go solution. So far we also noticed that this error happens only in the monitoring stack by the oauth-proxy.

For now we only know that a bearer token that is coming through oauth-proxy is invalid. Once we find a stable way to reproduce it, we can probably track the root cause.

Comment 7 Sebastian Łaskawiec 2021-06-28 09:53:11 UTC

Might be related to https://bugzilla.redhat.com/show_bug.cgi?id=1953264

Comment 8 Michal Fojtik 2021-07-28 10:09:00 UTC

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 9 Sergiusz Urbaniak 2021-08-16 12:43:11 UTC

sprint review: we have not found the root cause yet but the issue is being worked on.

Comment 10 Michal Fojtik 2021-08-16 12:53:42 UTC

The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 11 Sergiusz Urbaniak 2021-08-26 06:54:59 UTC

We found that the error is caused by clients sending invalid jwt tokens against api server. In this concrete case etcd-operator was identified.

Comment 14 Sandeep 2021-09-03 10:18:25 UTC

The etcd-operator logs are being observed. But issue is yet to be encountered.

Comment 15 Standa Laznicka 2021-09-03 10:51:00 UTC

Please link a 4.9 test run that was run after the merge and shows the symptoms.

Comment 16 Sandeep 2021-09-03 11:24:02 UTC

Its not reproducible now. Will need to wait for some more time (probably a week) to see if the issue is encountered.

Comment 17 Sandeep 2021-09-03 15:31:12 UTC

Moving it to Verified since its not reproducible. This issue is not seen any more.

Comment 20 errata-xmlrpc 2021-10-18 17:30:50 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

Note You need to log in before you can comment on or make changes to this bug.