Description of problem:

The telemeter-client pod fails to authenticate with the telemeter server after the pull secret is changed or updated (for example, when the owner of a cluster changes). The telemeter-client pod endlessly throws the following error in its logs:

level=error caller=forwarder.go:268 ts=2022-08-03T06:07:27.588588761Z component=forwarder/worker msg="unable to forward results" err="Post \"https://infogw.api.openshift.com/upload\": unable to exchange initial token for a long lived token: 409:\nthe provided cluster identifier is already in use under a different account or is not sufficiently random\n"

How reproducible:

Consistently

Steps to Reproduce:
1. Change the pull secret in a cluster so that the old pull secret is no longer valid.
2. The telemeter-client pod starts failing to authenticate with the telemeter server.

Actual results:

The telemeter-client pod fails to forward metrics after a pull secret update.

Expected results:

telemeter-client should detect the updated pull secret and use it automatically instead of failing. It probably needs to reconcile the new pull secret when it is changed.

Additional info:

The workaround is to restart the telemeter-client pod, which forces it to reconcile the new pull secret.
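To make the reproduction step concrete, here is a hedged shell sketch of invalidating the "cloud.openshift.com" auth entry. It operates on a local, made-up sample .dockerconfigjson rather than a live cluster; on a real cluster the input would come from `oc -n openshift-config get secret pull-secret` and the re-encoded output would be written back to the pull-secret secret (e.g. with `oc set data`, per the image pull secret docs).

```shell
# Sketch only: simulate the pull-secret change on a local sample file.
# The JSON below is a made-up stand-in for the cluster's .dockerconfigjson,
# which on a real cluster comes from:
#   oc -n openshift-config get secret pull-secret \
#     -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
cat > /tmp/dockerconfig.json <<'EOF'
{"auths":{"cloud.openshift.com":{"auth":"b2xkLXZhbGlkLXRva2Vu","email":"user@example.com"}}}
EOF

# Overwrite the cloud.openshift.com auth with an invalid value and
# base64-encode the whole document, invalidating the old token.
jq '.auths."cloud.openshift.com".auth = "aW52YWxpZA=="' /tmp/dockerconfig.json \
  | base64 -w0 > /tmp/dockerconfig.b64

# The content of /tmp/dockerconfig.b64 is what would be written back into
# the pull-secret secret. Sanity check the round trip:
base64 -d /tmp/dockerconfig.b64 | jq -r '.auths."cloud.openshift.com".auth'
# prints: aW52YWxpZA==
```

After this change, telemeter-client would keep using the stale token and fail as described above until it is restarted.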
[1] It looks like the monitoring operator grabs the pull secret once and then assumes it remains unchanged. Although, if it's the monitoring operator that isn't noticing the change, I'm not clear on why comment 0's telemeter-client restart alone was sufficient to recover.

[1]: https://github.com/openshift/cluster-monitoring-operator/blob/fcc377d33b5c41bcdacecb5838ac5d60fd5010ac/pkg/operator/operator.go#L845-L857
Hello @kramraj, do you know if attention was given to the "IMPORTANT" warning in the OpenShift docs about this issue [1]? Was this procedure followed?

[1] https://docs.openshift.com/container-platform/4.10/openshift_images/managing_images/using-image-pull-secrets.html#images-update-global-pull-secret_using-image-pull-secrets
For managed OpenShift clusters (OSD/ROSA), the ownership transfer is carried out by SRE following an internal SOP; see https://access.redhat.com/solutions/6126691

Do you know if the OCM process (for self-managed OCP clusters) includes a step (under the hood?) that tells the in-cluster telemeter client to use the new, updated pull secret?
The telemeter-client pod now uses the updated pull secret when it is changed. Verification steps:

1. Get the current pull secret:

# oc -n openshift-config get secret pull-secret -o jsonpath="{.data.\.dockerconfigjson}" | base64 -d

Change "cloud.openshift.com"."auth" to an invalid value, base64-encode the whole pull secret, and update the pull-secret secret with it.

2. Wait for the telemeter-client pod to restart; the error appears in its logs:

# oc -n openshift-monitoring logs -c telemeter-client $(oc -n openshift-monitoring get pod --no-headers | grep telemeter-client | awk '{print $1}')
level=info caller=main.go:97 ts=2022-09-30T03:00:40.608472233Z msg="telemeter client initialized"
level=warn caller=forwarder.go:137 ts=2022-09-30T03:00:40.608657708Z component=forwarder msg="not anonymizing any labels"
level=info caller=main.go:292 ts=2022-09-30T03:00:40.62568095Z msg="starting telemeter-client" from=https://prometheus-k8s.openshift-monitoring.svc:9091 to=https://infogw.api.openshift.com/ listen=localhost:8080
level=error caller=forwarder.go:276 ts=2022-09-30T03:00:40.854837755Z component=forwarder/worker msg="unable to forward results" err="unable to authorize to server: unable to exchange initial token for a long lived token: 404:\nnot found\n"
level=warn caller=forwarder.go:137 ts=2022-09-30T03:00:42.968917576Z component=forwarder msg="not anonymizing any labels"
level=error caller=forwarder.go:276 ts=2022-09-30T03:00:43.112688892Z component=forwarder/worker msg="unable to forward results" err="unable to authorize to server: unable to exchange initial token for a long lived token: 404:\nnot found\n"

3. Restore "cloud.openshift.com"."auth" to a valid value and wait for the telemeter-client pod to restart. No error appears in the logs:

# oc -n openshift-monitoring logs -c telemeter-client $(oc -n openshift-monitoring get pod --no-headers | grep telemeter-client | awk '{print $1}')
level=info caller=main.go:97 ts=2022-09-30T03:10:43.361386255Z msg="telemeter client initialized"
level=warn caller=forwarder.go:137 ts=2022-09-30T03:10:43.361537316Z component=forwarder msg="not anonymizing any labels"
level=info caller=main.go:292 ts=2022-09-30T03:10:43.380925875Z msg="starting telemeter-client" from=https://prometheus-k8s.openshift-monitoring.svc:9091 to=https://infogw.api.openshift.com/ listen=localhost:8080
level=warn caller=forwarder.go:137 ts=2022-09-30T03:10:45.699834751Z component=forwarder msg="not anonymizing any labels"

4. The token stored in the telemeter-client secret is also updated; its value is the same as "cloud.openshift.com"."auth":

# oc -n openshift-monitoring get secret telemeter-client -o jsonpath="{.data.token}" | base64 -d

5. Check on the telemeter server that metrics are pushed from client to server.
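The check in step 4 — that the token stored in the telemeter-client secret now equals the pull secret's "cloud.openshift.com" auth — can be sketched as a comparison of the two decoded values. The sample values below are made up for illustration; on a real cluster each would come from the oc commands shown in the steps above.

```shell
# Made-up stand-ins for the two decoded values; on a real cluster they come from:
#   oc -n openshift-config get secret pull-secret \
#     -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
#   oc -n openshift-monitoring get secret telemeter-client \
#     -o jsonpath='{.data.token}' | base64 -d
pull_secret='{"auths":{"cloud.openshift.com":{"auth":"bmV3LXZhbGlkLXRva2Vu"}}}'
telemeter_token='bmV3LXZhbGlkLXRva2Vu'

# Extract the auth entry from the pull secret and compare it with the
# token the telemeter-client secret holds.
auth=$(printf '%s' "$pull_secret" | jq -r '.auths."cloud.openshift.com".auth')
if [ "$auth" = "$telemeter_token" ]; then
  echo "telemeter-client token matches the pull secret"   # expected after the fix
else
  echo "telemeter-client token is stale"                  # the pre-fix symptom
fi
```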
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399