Bug 1779388 - container oauth-proxy has crashlooping: unable to retrieve authentication information for tokens
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: apiserver-auth
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Standa Laznicka
QA Contact: scheng
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-12-03 21:32 UTC by W. Trevor King
Modified: 2020-07-13 17:12 UTC (History)
CC List: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The oauth-proxy container exits with an error when it fails to reach the kube-apiserver during its configuration phase.
Consequence: Container restarts were observed in CI when the kube-apiserver or controllers were not stable or fast enough.
Fix: Allow multiple attempts at the checks performed against the kube-apiserver when oauth-proxy starts.
Result: The oauth-proxy container should fail only when the underlying infrastructure is really broken.
Clone Of:
Environment:
Last Closed: 2020-07-13 17:12:18 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift oauth-proxy pull 175 None closed Bug 1779388: Repeat TokenAccess/SubjectAccess reviews when starting up 2020-09-02 09:55:29 UTC
Red Hat Product Errata RHBA-2020:2409 None None None 2020-07-13 17:12:47 UTC

Description W. Trevor King 2019-12-03 21:32:34 UTC
CI flake [1]:

    {
      "name": "oauth-proxy",
      "state": {
        "running": {
          "startedAt": "2019-12-03T08:31:14Z"
        }
      },
      "lastState": {
        "terminated": {
          "exitCode": 1,
          "reason": "Error",
          "message": "2019/12/03 08:28:32 provider.go:117: Defaulting client-id to system:serviceaccount:openshift-monitoring:thanos-querier\n2019/12/03 08:28:32 provider.go:122: Defaulting client-secret to service account token /var/run/secrets/kubernetes.io/serviceaccount/token\n2019/12/03 08:28:32 provider.go:310: Delegation of authentication and authorization to OpenShift is enabled for bearer tokens and client certificates.\n2019/12/03 08:28:32 main.go:138: Invalid configuration:\n  unable to load OpenShift configuration: unable to retrieve authentication information for tokens: Unauthorized\n",
          "startedAt": "2019-12-03T08:28:32Z",
          "finishedAt": "2019-12-03T08:28:32Z",
          "containerID": "cri-o://ec1364f90ecd6a3b1ee9eedc117badfce8e1196701dd6850999e79f830c40e29"
        }
      },
      "ready": true,
      "restartCount": 6,
      "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2592294b8965d8a767f3a52dc3a1406e8e814e8fb762df0bef941470d39403cc",
      "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2592294b8965d8a767f3a52dc3a1406e8e814e8fb762df0bef941470d39403cc",
      "containerID": "cri-o://69c59743a68da03ea16c27b195183c7232a5f8fd491573877ddc25060e787032",
      "started": true
    },
...
fail [github.com/openshift/origin/test/extended/operators/cluster.go:122]: Expected
    <[]string | len:2, cap:2>: [
        "Pod openshift-monitoring/thanos-querier-6589b497cb-p4hvj is not healthy: container oauth-proxy has restarted more than 5 times",
        "Pod openshift-monitoring/thanos-querier-6589b497cb-rqrnw is not healthy: container oauth-proxy has restarted more than 5 times",
    ]
to be empty

failed: (2m8s) 2019-12-03T08:43:22 "[Feature:Platform] Managed cluster should have no crashlooping pods in core namespaces over two minutes [Suite:openshift/conformance/parallel]"

Happened 6 times in the past 24h [2], so not very common.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/510
[2]: https://search.svc.ci.openshift.org/chart?search=unable%20to%20retrieve%20authentication%20information%20for%20tokens
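
For readers unfamiliar with the failing check, the condition implied by the failure message ("has restarted more than 5 times") can be sketched as follows. This is not the origin test itself, which is Go code in test/extended/operators/cluster.go. It is a minimal Python illustration, and the names `crashlooping_containers` and `RESTART_THRESHOLD` are hypothetical.

```python
import json

# Threshold taken from the failure message "restarted more than 5 times".
RESTART_THRESHOLD = 5


def crashlooping_containers(pod_name, container_statuses, threshold=RESTART_THRESHOLD):
    """Return one message per container whose restartCount exceeds the threshold."""
    problems = []
    for status in container_statuses:
        if status.get("restartCount", 0) > threshold:
            problems.append(
                f"Pod {pod_name} is not healthy: container {status['name']} "
                f"has restarted more than {threshold} times"
            )
    return problems


# Trimmed copy of the container status quoted in this comment.
statuses = json.loads('[{"name": "oauth-proxy", "ready": true, "restartCount": 6}]')
print(crashlooping_containers("openshift-monitoring/thanos-querier-6589b497cb-p4hvj", statuses))
```

Note that the container is reported as ready at collection time. The check looks only at the accumulated restart count, so a pod that eventually recovered still fails the test.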

Comment 1 W. Trevor King 2019-12-03 21:34:15 UTC
Similar error mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1734704#c4, but the GCP job I saw this in is not using (and does not need) a proxy.

Comment 7 W. Trevor King 2020-02-04 17:28:21 UTC
11 of these in the past 24h [1] (that's the same link I used in comment 0).  Clicking through to the most recent failure [2] (or just hovering over the dot in [1]), the matching line from that job's build log is:

  Feb 04 14:43:57.008 E ns/openshift-monitoring pod/thanos-querier-8955bc494-9d8zs node/ip-10-0-138-134.ec2.internal container=oauth-proxy container exited with code 1 (Error): 2020/02/04 14:43:22 provider.go:118: Defaulting client-id to system:serviceaccount:openshift-monitoring:thanos-querier\n2020/02/04 14:43:22 provider.go:123: Defaulting client-secret to service account token /var/run/secrets/kubernetes.io/serviceaccount/token\n2020/02/04 14:43:22 provider.go:311: Delegation of authentication and authorization to OpenShift is enabled for bearer tokens and client certificates.\n2020/02/04 14:43:56 main.go:138: Invalid configuration:\n  unable to load OpenShift configuration: unable to retrieve authentication information for tokens: Timeout: request did not complete within requested timeout 34s\n

so, yeah, still happening.

I have no idea how to reproduce it.  But if folks have any idea about what might be going on, you can add additional debugging (either increasing what templates gather after an update run or landing a PR with increased logging for a particular component), and I'm pretty sure we'll have an additional handful or two of hits in the next 24 hours that include the updated logging.

[1]: https://search.svc.ci.openshift.org/chart?search=unable%20to%20retrieve%20authentication%20information%20for%20tokens
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/16321

Comment 9 Standa Laznicka 2020-02-10 13:39:21 UTC
I am having trouble finding the root cause of this. It usually happens before the final revision of kube-apiserver is deployed, which means all the potentially useful logs are gone by the time must-gather is performed.

Please note that the link for finding failing jobs in https://bugzilla.redhat.com/show_bug.cgi?id=1779388#c7 is wrong, as it leads to all the jobs where the control plane or networking failed horribly. The correct link is https://search.svc.ci.openshift.org/?search=unable+to+retrieve+authentication+information+for+tokens%3A+Unauthorized

Maybe if you point me to the code where you add the SA and its cluster-role and cluster-rolebinding, I could find something. Otherwise, this seems to be quite a dead end.

Comment 11 Standa Laznicka 2020-02-10 15:18:30 UTC
Looking at the code, I would indeed propose to add a wait with a subject access review to avoid the pods' crashlooping. Ideally, you would add these static roles, rolebindings and SAs to your manifests so you don't have to deal with them at all.
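
A minimal sketch of such a wait, in Python for illustration only. The `check` callable is a hypothetical stand-in for whatever performs the subject access review and returns True once access is granted; no real Kubernetes client call is shown, and `wait_for_access` and its parameters are assumptions.

```python
import time


def wait_for_access(check, timeout=60.0, interval=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or `timeout` seconds have elapsed.

    `check` is a hypothetical callable standing in for a subject access
    review. Exceptions raised by it (for example while the role binding
    does not exist yet) are treated as "not ready yet".
    """
    deadline = clock() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            pass  # treat errors as "not allowed yet" and keep polling
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller would start the dependent pods only after `wait_for_access(...)` returns True, and report a clear error if it returns False.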

Comment 15 Michal Fojtik 2020-05-12 10:54:00 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

As such, we're marking this bug as "LifecycleStale".

If you have further information on the current state of the bug, please update it, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 17 Michal Fojtik 2020-05-14 11:31:16 UTC
Moving back to investigate.

Comment 18 Standa Laznicka 2020-05-19 09:11:25 UTC
I've checked the thanos-querier pods in most of the deployments and haven't found

```
Invalid configuration:\n  unable to load OpenShift configuration: unable to retrieve authentication information for tokens: Unauthorized
```

Two of the runs had restarts because of `i/o timeout` or just `timed out`, so at least one of the issues has gone away, however magical that might have been.
 
I agree that retries during the start-up would be desirable to prevent the proxy from dying.
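
The idea of retrying during start-up can be sketched roughly as below. This is not the code from the linked oauth-proxy pull request (the proxy is written in Go). It is a Python illustration, and the function name, attempt count and delays are all assumptions.

```python
import time


def retry_startup_check(check, attempts=5, initial_delay=1.0, factor=2.0,
                        sleep=time.sleep):
    """Run check() up to `attempts` times with exponential backoff.

    Returns the result of the first successful call. If every attempt
    raises, the last exception propagates, so the process still fails
    when the underlying infrastructure is really broken.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return check()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= factor
```

Bounding the number of attempts keeps the failure visible: a transient `Unauthorized` or timeout is absorbed, while a persistent one still exits with an error.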

Comment 22 W. Trevor King 2020-05-21 04:58:49 UTC
ON_QA, so no need to punt it to future sprints.

Comment 26 errata-xmlrpc 2020-07-13 17:12:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

