Description of problem:
Some services that delegate authentication, such as kube-rbac-proxy and group-b operators, seem to be reporting authentication failures in CI clusters with
`square/go-jose: error in cryptographic primitive`.
We need to figure out why that happens.
There are 2 subtypes of this error:
- square/go-jose: error in cryptographic primitive, old, insecure token format
- square/go-jose: error in cryptographic primitive, token lookup failed
The former might be related to some leftovers after forbidding the old token format. Sergiusz Urbaniak - I've seen this happening in the monitoring Pods; can I kindly ask you to let the Monitoring Team know about this? Here are two examples extracted from :
- Jun 14 19:53:09.227 E ns/openshift-monitoring pod/thanos-querier-74b7584698-7c7cq node/ip-10-0-227-143.us-west-1.compute.internal container/oauth-proxy reason/ContainerExit code/2 cause/Error format]\n2021/06/14 18:50:55 oauthproxy.go:793: requestauth: 10.128.0.7:39240 [invalid bearer token, square/go-jose: error in cryptographic primitive, old, insecure token format]\n2021/06/14 18:50:57 [...]
- Jun 13 22:56:27.417 E ns/openshift-monitoring pod/alertmanager-main-1 node/ip-10-0-238-162.ec2.internal container/alertmanager-proxy reason/ContainerExit code/2 cause/Error /06/13 22:31:28 oauthproxy.go:793: requestauth: 10.128.2.16:36640 [invalid bearer token, square/go-jose: error in cryptographic primitive, old, insecure token format][...]
The latter is more interesting and happens when the Token Authenticator cannot get tokens. Analyzing one of the failed builds, I found that the kube-apiserver was emitting this error at the time shown below:
2021-06-15T02:28:51.789845375Z E0615 02:28:51.788905 19 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, Token has been invalidated, token lookup failed]"
2021-06-15T02:28:51.789845375Z E0615 02:28:51.789294 19 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, Token has been invalidated, token lookup failed]"
2021-06-15T02:28:51.789845375Z E0615 02:28:51.789419 19 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, Token has been invalidated, token lookup failed]"
2021-06-15T02:28:51.789845375Z E0615 02:28:51.789573 19 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, Token has been invalidated, token lookup failed]"
After this time, the error stopped appearing. Interestingly, the API Server Pods started earlier than that, but it took some time until they connected to etcd and started serving requests:
- apiserver-58b64fd885-5gg7b: ~02:40:09.150111
- apiserver-58b64fd885-9lw55: ~02:40:09.150174208Z
- apiserver-58b64fd885-hnmxf: ~02:28:44.705157016Z
Based on the timestamps above, I believe this is a timing issue. Things are booting up and the OAuth API Server temporarily cannot obtain tokens. Standa - if you agree with me, this will probably be a "won't fix".
It turns out my previous explanation was entirely incorrect. Standa clarified that the Kube API Server is one of the first things we start, so such a timing error is simply impossible in this case.
So far I've verified:
- This is not a new problem, it started happening in 4.7: https://bugzilla.redhat.com/show_bug.cgi?id=1907728
- The square/go-jose code suggests that this error occurs when verifying the token signature
- The SA keys/certs haven't been rotated
- The failure happened in a Pod that has been restarted, so I can't check whether the mounted certs match the API server's
- Couldn't find anything in events
- Couldn't find anything in audit logs
I've created 2 debugging PRs that might help me investigate this failure further:
Closing as there's not enough data to sort this problem out. Neither of the PRs (mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1956879#c4) caught anything suspicious.
Since this error also appeared in 4.7, it does not seem to be related to Bounded Service Account Tokens or key rotation.
In order to tell anything more about it, I'd need a stable way to reproduce it.
Together with Sergiusz and Standa we decided to keep this bug around.
Unfortunately, we do not have enough data to debug it further; logging tokens and cryptographic keys anywhere is simply a no-go. So far we have also noticed that this error happens only in the monitoring stack, reported by oauth-proxy.
For now we only know that a bearer token coming through oauth-proxy is invalid. Once we find a stable way to reproduce it, we can probably track down the root cause.
Might be related to https://bugzilla.redhat.com/show_bug.cgi?id=1953264
sprint review: we have not found the root cause yet but the issue is being worked on.
We found that the error is caused by clients sending invalid JWT tokens to the API server. In this concrete case, the etcd-operator was identified.
The etcd-operator logs are being observed, but the issue has not been encountered again yet.
Please link a 4.9 test run that was run after the merge and shows the symptoms.
It's not reproducible now. We will need to wait some more time (probably a week) to see whether the issue recurs.
Moving it to Verified since it's not reproducible; the issue is no longer seen.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.