Description of problem:

The telemeter-client pod fails to authenticate with the telemeter server after the pull secret is changed or updated (for example, when the owner of a cluster changes). The telemeter-client pod endlessly throws the following error in its logs:

level=error caller=forwarder.go:268 ts=2022-08-03T06:07:27.588588761Z component=forwarder/worker msg="unable to forward results" err="Post \"https://infogw.api.openshift.com/upload\": unable to exchange initial token for a long lived token: 409:\nthe provided cluster identifier is already in use under a different account or is not sufficiently random\n"

How reproducible:

Consistently

Steps to Reproduce:
1. Change the pull secret in a cluster so that the old pull secret is no longer valid.
2. The telemeter-client pod starts failing to authenticate with the telemeter server.

Actual results:

The telemeter-client pod fails to forward metrics after a pull secret update.

Expected results:

telemeter-client should detect the updated pull secret and use it automatically instead of failing. It probably needs to reconcile the new pull secret when it is changed.

Additional info:

The workaround is to restart the telemeter-client pod, which forces it to reconcile the new pull secret.
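To make the reproduction step concrete, here is a hedged shell sketch of invalidating the "cloud.openshift.com" auth entry. It operates on a local, made-up sample .dockerconfigjson rather than a live cluster; on a real cluster the input would come from `oc -n openshift-config get secret pull-secret` and the re-encoded output would be written back to the pull-secret secret (e.g. with `oc set data`, per the image pull secret docs).

```shell
# Sketch only: simulate the pull-secret change on a local sample file.
# The JSON below is a made-up stand-in for the cluster's .dockerconfigjson,
# which on a real cluster comes from:
#   oc -n openshift-config get secret pull-secret \
#     -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
cat > /tmp/dockerconfig.json <<'EOF'
{"auths":{"cloud.openshift.com":{"auth":"b2xkLXZhbGlkLXRva2Vu","email":"user@example.com"}}}
EOF

# Overwrite the cloud.openshift.com auth with an invalid value and
# base64-encode the whole document, invalidating the old token.
jq '.auths."cloud.openshift.com".auth = "aW52YWxpZA=="' /tmp/dockerconfig.json \
  | base64 -w0 > /tmp/dockerconfig.b64

# The content of /tmp/dockerconfig.b64 is what would be written back into
# the pull-secret secret. Sanity check the round trip:
base64 -d /tmp/dockerconfig.b64 | jq -r '.auths."cloud.openshift.com".auth'
# prints: aW52YWxpZA==
```

After this change, telemeter-client would keep using the stale token and fail as described above until it is restarted.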
[1] It looks like the monitoring operator grabs the pull secret once and then assumes it remains unchanged. Although, if it's the monitoring operator that isn't noticing the change, I'm not clear on why comment 0's telemeter-client restart alone was sufficient to recover.

[1]: https://github.com/openshift/cluster-monitoring-operator/blob/fcc377d33b5c41bcdacecb5838ac5d60fd5010ac/pkg/operator/operator.go#L845-L857
Hello @kramraj, do you know if attention was given to the "IMPORTANT" warning in the OpenShift docs about this issue [1]? Was this procedure followed?

[1] https://docs.openshift.com/container-platform/4.10/openshift_images/managing_images/using-image-pull-secrets.html#images-update-global-pull-secret_using-image-pull-secrets
For managed OpenShift clusters (OSD/ROSA), the ownership transfer is carried out by SRE following an internal SOP; see https://access.redhat.com/solutions/6126691

Do you know if the OCM process (for self-managed OCP clusters) includes a step (under the hood?) that tells the in-cluster telemeter client to use the new, updated pull secret?
The telemeter-client pod now uses the updated pull secret when it is changed. Verification steps:

1. Get the current pull secret:

# oc -n openshift-config get secret pull-secret -o jsonpath="{.data.\.dockerconfigjson}" | base64 -d

Change "cloud.openshift.com"."auth" to an invalid value, base64-encode the whole pull secret, and update the pull-secret secret with it.

2. Wait for the telemeter-client pod to restart; the error appears in its logs:

# oc -n openshift-monitoring logs -c telemeter-client $(oc -n openshift-monitoring get pod --no-headers | grep telemeter-client | awk '{print $1}')
level=info caller=main.go:97 ts=2022-09-30T03:00:40.608472233Z msg="telemeter client initialized"
level=warn caller=forwarder.go:137 ts=2022-09-30T03:00:40.608657708Z component=forwarder msg="not anonymizing any labels"
level=info caller=main.go:292 ts=2022-09-30T03:00:40.62568095Z msg="starting telemeter-client" from=https://prometheus-k8s.openshift-monitoring.svc:9091 to=https://infogw.api.openshift.com/ listen=localhost:8080
level=error caller=forwarder.go:276 ts=2022-09-30T03:00:40.854837755Z component=forwarder/worker msg="unable to forward results" err="unable to authorize to server: unable to exchange initial token for a long lived token: 404:\nnot found\n"
level=warn caller=forwarder.go:137 ts=2022-09-30T03:00:42.968917576Z component=forwarder msg="not anonymizing any labels"
level=error caller=forwarder.go:276 ts=2022-09-30T03:00:43.112688892Z component=forwarder/worker msg="unable to forward results" err="unable to authorize to server: unable to exchange initial token for a long lived token: 404:\nnot found\n"

3. Restore "cloud.openshift.com"."auth" to a valid value and wait for the telemeter-client pod to restart. No error appears in the logs:

# oc -n openshift-monitoring logs -c telemeter-client $(oc -n openshift-monitoring get pod --no-headers | grep telemeter-client | awk '{print $1}')
level=info caller=main.go:97 ts=2022-09-30T03:10:43.361386255Z msg="telemeter client initialized"
level=warn caller=forwarder.go:137 ts=2022-09-30T03:10:43.361537316Z component=forwarder msg="not anonymizing any labels"
level=info caller=main.go:292 ts=2022-09-30T03:10:43.380925875Z msg="starting telemeter-client" from=https://prometheus-k8s.openshift-monitoring.svc:9091 to=https://infogw.api.openshift.com/ listen=localhost:8080
level=warn caller=forwarder.go:137 ts=2022-09-30T03:10:45.699834751Z component=forwarder msg="not anonymizing any labels"

4. The token stored in the telemeter-client secret is also updated; its value is the same as "cloud.openshift.com"."auth":

# oc -n openshift-monitoring get secret telemeter-client -o jsonpath="{.data.token}" | base64 -d

5. Check on the telemeter server that metrics are pushed from client to server.
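The check in step 4 — that the token stored in the telemeter-client secret now equals the pull secret's "cloud.openshift.com" auth — can be sketched as a comparison of the two decoded values. The sample values below are made up for illustration; on a real cluster each would come from the oc commands shown in the steps above.

```shell
# Made-up stand-ins for the two decoded values; on a real cluster they come from:
#   oc -n openshift-config get secret pull-secret \
#     -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
#   oc -n openshift-monitoring get secret telemeter-client \
#     -o jsonpath='{.data.token}' | base64 -d
pull_secret='{"auths":{"cloud.openshift.com":{"auth":"bmV3LXZhbGlkLXRva2Vu"}}}'
telemeter_token='bmV3LXZhbGlkLXRva2Vu'

# Extract the auth entry from the pull secret and compare it with the
# token the telemeter-client secret holds.
auth=$(printf '%s' "$pull_secret" | jq -r '.auths."cloud.openshift.com".auth')
if [ "$auth" = "$telemeter_token" ]; then
  echo "telemeter-client token matches the pull secret"   # expected after the fix
else
  echo "telemeter-client token is stale"                  # the pre-fix symptom
fi
```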
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399