Bug 2114721
Summary: | telemeter-client pod does not use the updated pull secret when it is changed | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Karthik Perumal <kramraja> |
Component: | Monitoring | Assignee: | Joao Marcal <jmarcal> |
Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
Severity: | medium | Docs Contact: | Brian Burt <bburt> |
Priority: | medium | ||
Version: | 4.10 | CC: | anpicker, bburt, jmarcal, kgordeev, spasquie, tremes, wking |
Target Milestone: | --- | Keywords: | ServiceDeliveryImpact |
Target Release: | 4.12.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
* Before this update, the Telemeter Client (TC) loaded new pull secrets only when it was manually restarted. As a result, if a pull secret was changed or updated and the TC had not been restarted, the TC failed to authenticate with the server. With this update, when the secret is rotated, the deployment restarts automatically and uses the updated token to authenticate.
(link:https://bugzilla.redhat.com/show_bug.cgi?id=2114721[*BZ#2114721*])
|
Story Points: | --- |
Clone Of: | | Environment: |
Last Closed: | 2023-01-17 19:54:14 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
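The Doc Text above says that when the pull secret is rotated, the deployment is restarted automatically so it picks up the new token. A common way operators achieve this is to hash the secret's contents into a pod-template annotation, so that any change to the secret changes the pod template and the Deployment controller rolls out new pods. The sketch below is a minimal Python illustration of that pattern, not the actual cluster-monitoring-operator code; the annotation key `checksum/telemeter-client-secret` is hypothetical.

```python
import hashlib
import json

def secret_checksum(data: dict) -> str:
    """Deterministic hash of a Secret's data map (keys sorted)."""
    payload = json.dumps(data, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def annotate_pod_template(deployment: dict, secret_data: dict) -> dict:
    """Stamp the secret checksum into the pod template's annotations.

    Any change to the annotation value changes the pod template, which
    makes the Deployment controller roll out new pods -- so a rotated
    secret restarts the client without manual intervention."""
    annotations = deployment["spec"]["template"]["metadata"].setdefault("annotations", {})
    annotations["checksum/telemeter-client-secret"] = secret_checksum(secret_data)
    return deployment

deployment = {"spec": {"template": {"metadata": {}}}}
old_sum = annotate_pod_template(deployment, {"token": "old-token"})[
    "spec"]["template"]["metadata"]["annotations"]["checksum/telemeter-client-secret"]
new_sum = annotate_pod_template(deployment, {"token": "rotated-token"})[
    "spec"]["template"]["metadata"]["annotations"]["checksum/telemeter-client-secret"]
assert old_sum != new_sum  # rotation changes the pod template, forcing a restart
```

Reconciling the checksum on every sync, rather than watching for a restart signal, keeps the behavior level-triggered: the deployment always reflects the current secret.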
Description (Karthik Perumal, 2022-08-03 07:01:42 UTC)
[1] looks like the monitoring operator is grabbing the pull secret once and then assuming it remains unchanged, although if it's the monitoring operator that's not noticing, I'm not clear on why comment 0's telemeter-client restart alone was sufficient to recover.

[1]: https://github.com/openshift/cluster-monitoring-operator/blob/fcc377d33b5c41bcdacecb5838ac5d60fd5010ac/pkg/operator/operator.go#L845-L857

Hello @kramraj, do you know whether attention was given to the "IMPORTANT" warning about this issue in the OpenShift docs [1]? Was that procedure followed?

[1] https://docs.openshift.com/container-platform/4.10/openshift_images/managing_images/using-image-pull-secrets.html#images-update-global-pull-secret_using-image-pull-secrets

For managed OpenShift clusters (OSD/ROSA), the ownership transfer is carried out by SRE following an internal SOP; see https://access.redhat.com/solutions/6126691. Do you know if the OCM process (for self-managed OCP clusters) includes a step, perhaps under the hood, that tells the in-cluster telemeter client to use the new, updated pull secret?

The telemeter-client pod now uses the updated pull secret when it is changed. Verification steps:

1. Dump the current pull secret:

```
# oc -n openshift-config get secret pull-secret -o jsonpath="{.data.\.dockerconfigjson}" | base64 -d
```

Change the `"cloud.openshift.com"."auth"` entry to an invalid value, base64-encode the whole pull secret, and update the `pull-secret` secret with it.

2. Wait for the telemeter pod to restart; the error appears in its logs:

```
# oc -n openshift-monitoring logs -c telemeter-client $(oc -n openshift-monitoring get pod --no-headers | grep telemeter-client | awk '{print $1}')
level=info caller=main.go:97 ts=2022-09-30T03:00:40.608472233Z msg="telemeter client initialized"
level=warn caller=forwarder.go:137 ts=2022-09-30T03:00:40.608657708Z component=forwarder msg="not anonymizing any labels"
level=info caller=main.go:292 ts=2022-09-30T03:00:40.62568095Z msg="starting telemeter-client" from=https://prometheus-k8s.openshift-monitoring.svc:9091 to=https://infogw.api.openshift.com/ listen=localhost:8080
level=error caller=forwarder.go:276 ts=2022-09-30T03:00:40.854837755Z component=forwarder/worker msg="unable to forward results" err="unable to authorize to server: unable to exchange initial token for a long lived token: 404:\nnot found\n"
level=warn caller=forwarder.go:137 ts=2022-09-30T03:00:42.968917576Z component=forwarder msg="not anonymizing any labels"
level=error caller=forwarder.go:276 ts=2022-09-30T03:00:43.112688892Z component=forwarder/worker msg="unable to forward results" err="unable to authorize to server: unable to exchange initial token for a long lived token: 404:\nnot found\n"
```

3. Update the `"cloud.openshift.com"."auth"` entry back to a valid value and wait for the telemeter pod to restart.
No errors appear in the logs:

```
# oc -n openshift-monitoring logs -c telemeter-client $(oc -n openshift-monitoring get pod --no-headers | grep telemeter-client | awk '{print $1}')
level=info caller=main.go:97 ts=2022-09-30T03:10:43.361386255Z msg="telemeter client initialized"
level=warn caller=forwarder.go:137 ts=2022-09-30T03:10:43.361537316Z component=forwarder msg="not anonymizing any labels"
level=info caller=main.go:292 ts=2022-09-30T03:10:43.380925875Z msg="starting telemeter-client" from=https://prometheus-k8s.openshift-monitoring.svc:9091 to=https://infogw.api.openshift.com/ listen=localhost:8080
level=warn caller=forwarder.go:137 ts=2022-09-30T03:10:45.699834751Z component=forwarder msg="not anonymizing any labels"
```

4. Confirm that the telemeter-client token is also updated; its value matches the `"cloud.openshift.com"."auth"` entry:

```
# oc -n openshift-monitoring get secret telemeter-client -o jsonpath="{.data.token}" | base64 -d
```

5. Check on the telemeter server that metrics can be pushed from the client to the server.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399
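Step 1 of the verification above edits the pull secret by hand: decode the `.dockerconfigjson`, change the `cloud.openshift.com` auth, and re-encode the result before updating the Secret. The decode/modify/re-encode round trip can be sketched in Python as follows; the token values are hypothetical placeholders, not real credentials.

```python
import base64
import json

def set_cloud_auth(dockerconfigjson_b64: str, new_auth: str) -> str:
    """Decode a base64 pull secret, replace the cloud.openshift.com
    auth entry, and re-encode it for updating the Secret."""
    cfg = json.loads(base64.b64decode(dockerconfigjson_b64))
    cfg["auths"]["cloud.openshift.com"]["auth"] = new_auth
    return base64.b64encode(json.dumps(cfg).encode()).decode()

# Hypothetical pull secret containing a single auth entry.
original = base64.b64encode(json.dumps(
    {"auths": {"cloud.openshift.com": {"auth": "valid-token"}}}
).encode()).decode()

# Break the auth (verification step 1) ...
broken = set_cloud_auth(original, "invalid-token")
decoded = json.loads(base64.b64decode(broken))
assert decoded["auths"]["cloud.openshift.com"]["auth"] == "invalid-token"

# ... then restore it (verification step 3).
restored = set_cloud_auth(broken, "valid-token")
assert json.loads(base64.b64decode(restored)) == json.loads(base64.b64decode(original))
```

With the fix in place, each of these updates should be followed by an automatic telemeter-client restart, as the logs in steps 2 and 3 show.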