CI flake:
"message": "2019/12/03 08:28:32 provider.go:117: Defaulting client-id to system:serviceaccount:openshift-monitoring:thanos-querier\n2019/12/03 08:28:32 provider.go:122: Defaulting client-secret to service account token /var/run/secrets/kubernetes.io/serviceaccount/token\n2019/12/03 08:28:32 provider.go:310: Delegation of authentication and authorization to OpenShift is enabled for bearer tokens and client certificates.\n2019/12/03 08:28:32 main.go:138: Invalid configuration:\n unable to load OpenShift configuration: unable to retrieve authentication information for tokens: Unauthorized\n",
fail [github.com/openshift/origin/test/extended/operators/cluster.go:122]: Expected
    <[]string | len:2, cap:2>: [
        "Pod openshift-monitoring/thanos-querier-6589b497cb-p4hvj is not healthy: container oauth-proxy has restarted more than 5 times",
        "Pod openshift-monitoring/thanos-querier-6589b497cb-rqrnw is not healthy: container oauth-proxy has restarted more than 5 times",
    ]
to be empty
failed: (2m8s) 2019-12-03T08:43:22 "[Feature:Platform] Managed cluster should have no crashlooping pods in core namespaces over two minutes [Suite:openshift/conformance/parallel]"
Happened 6 times in the past 24h, so not very common.
A similar error is mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1734704#c4, but the GCP job I saw this in is not using (and does not need) a proxy.
11 of these in the past 24h (that's the same link I used in comment 0). Clicking through to the most recent failure (or just hovering over the dot), the matching line from that job's build log is:
Feb 04 14:43:57.008 E ns/openshift-monitoring pod/thanos-querier-8955bc494-9d8zs node/ip-10-0-138-134.ec2.internal container=oauth-proxy container exited with code 1 (Error): 2020/02/04 14:43:22 provider.go:118: Defaulting client-id to system:serviceaccount:openshift-monitoring:thanos-querier\n2020/02/04 14:43:22 provider.go:123: Defaulting client-secret to service account token /var/run/secrets/kubernetes.io/serviceaccount/token\n2020/02/04 14:43:22 provider.go:311: Delegation of authentication and authorization to OpenShift is enabled for bearer tokens and client certificates.\n2020/02/04 14:43:56 main.go:138: Invalid configuration:\n unable to load OpenShift configuration: unable to retrieve authentication information for tokens: Timeout: request did not complete within requested timeout 34s\n
So, yeah, still happening.
I have no idea how to reproduce it. But if folks have any idea about what might be going on, you can add additional debugging (either increasing what templates gather after an update run or landing a PR with increased logging for a particular component), and I'm pretty sure we'll have an additional handful or two of hits in the next 24 hours that include the updated logging.
I am having trouble finding the root cause of this. It usually happens before the final revision of kube-apiserver is deployed, which means all the potentially useful logs are gone by the time must-gather runs.
Please note that the link to find failing jobs in https://bugzilla.redhat.com/show_bug.cgi?id=1779388#c7 is wrong, as it will lead you to all the jobs where the control plane/networking failed horribly. The correct link is https://search.svc.ci.openshift.org/?search=unable+to+retrieve+authentication+information+for+tokens%3A+Unauthorized
Maybe if you point me to the code where you add the SA and its cluster-role and cluster-rolebinding, I could find something; otherwise this seems to be quite a dead end.
Looking at the code, I would indeed propose adding a wait with a subject access review to prevent the pods from crashlooping. Ideally, you would add these static roles, rolebindings, and SAs to your manifests so you don't have to deal with them at all.
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.
As such, we're marking this bug as "LifecycleStale".
If you have further information on the current state of the bug, please update it, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.
Still seeing these occasionally in a number of job flavors:
$ curl -sL 'https://search.svc.ci.openshift.org/search?search=unable%20to%20retrieve%20authentication%20information%20for%20tokens&type=build-log&maxAge=96h' | jq -r '. | keys'
Moving back to investigate.
I've checked the thanos-querier pods in most of the deployments and haven't found the original error:
Invalid configuration:\n unable to load OpenShift configuration: unable to retrieve authentication information for tokens: Unauthorized
Two of the runs had restarts because of `i/o timeout` or just `timed out`, so at least one of the issues has gone away, however magical that might have been.
I agree that retries during the start-up would be desirable to prevent the proxy from dying.
ON_QA, so there is no need to punt it to future sprints.
This issue happened again on 4.5 test lane: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/2315
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.