The fix for bug 2095719 landed [1], introducing ServiceAccount PATCH calls. However, in 4.12 CI on the pull request's presubmit, we can see more PATCH calls than necessary. Digging into [2]: as an update job, it should be very calm under the old operator and only show the churny PATCHes once the new operator comes in. Drilling into the gathered assets gets us audit logs [3]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/gather-audit-logs/artifacts/audit-logs.tar | tar xz --strip-components=2
$ zgrep -h thanos-querier kube-apiserver/*log.gz | jq -r 'select(.user.username == "system:serviceaccount:openshift-monitoring:cluster-monitoring-operator" and .verb == "patch") | .stageTimestamp + " " + .requestURI'
2022-07-06T15:00:19.463198Z /api/v1/namespaces/openshift-monitoring/serviceaccounts/thanos-querier
2022-07-06T15:14:53.970100Z /api/v1/namespaces/openshift-monitoring/serviceaccounts/thanos-querier
2022-07-06T15:15:42.052818Z /api/v1/namespaces/openshift-monitoring/serviceaccounts/thanos-querier
2022-07-06T15:23:07.392031Z /api/v1/namespaces/openshift-monitoring/serviceaccounts/thanos-querier
2022-07-06T15:24:19.827991Z /api/v1/namespaces/openshift-monitoring/serviceaccounts/thanos-querier
2022-07-06T15:39:19.360960Z /api/v1/namespaces/openshift-monitoring/serviceaccounts/thanos-querier
parse error: Invalid numeric literal at line 1623, column 6

Timestamp of monitoring completing its portion of the update:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'clusteroperator/monitoring.*version'
Jul 06 15:03:33.705 I clusteroperator/monitoring versions: operator 4.12.0-0.ci.test-2022-07-06-134429-ci-op-s076v2wy-initial -> 4.12.0-0.ci.test-2022-07-06-135548-ci-op-s076v2wy-latest

So the monitoring operator updated and rev'ed its ClusterOperator versions to claim it had completed its portion of the update at 15:03:33. And every 15m or so (with some more frequent rounds) after that, it was PATCHing the thanos-querier ServiceAccount.

Dropping into the monitoring operator's logs [4], they start at 15:24:11, just before the 15:24:19 PATCH:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-5f58cdbcdb-5pl8s_cluster-monitoring-operator.log | head -n1
W0706 15:24:11.867255 1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.

But I see nothing about the 15:24:19 PATCH, or the 15:39:19 PATCH, that explains why the monitoring operator feels a PATCH is necessary:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-5f58cdbcdb-5pl8s_cluster-monitoring-operator.log | grep 15:24:19
I0706 15:24:19.405009 1 tasks.go:74] ran task 2 of 15: Updating Cluster Monitoring Operator
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-5f58cdbcdb-5pl8s_cluster-monitoring-operator.log | tail -n1
I0706 15:27:03.624952 1 operator.go:678] Updating ClusterOperator status to done.

To avoid the Secret churn (which I still need to pin down), I think we should revert or adjust [1] so that it does not issue periodic PATCHes when no changes are actually required.

[1]: https://github.com/openshift/cluster-monitoring-operator/pull/1704
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/gather-audit-logs/artifacts/
[4]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-5f58cdbcdb-5pl8s_cluster-monitoring-operator.log
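For the "no changes actually required" direction, here is a minimal sketch of what I have in mind. It is illustrative only: the function and variable names (ensureServiceAccount, etc.) are made up, it uses Update where the operator uses PATCH, and it is not the operator's actual code. The idea is to get the live ServiceAccount, merge the desired labels/annotations into a copy of it, and skip the write entirely when the merge is a no-op:

package reconcile

import (
	"context"

	v1 "k8s.io/api/core/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureServiceAccount (hypothetical name) creates the ServiceAccount if it is
// missing, and otherwise writes it back only when the desired labels or
// annotations are not already present on the live object.
func ensureServiceAccount(ctx context.Context, c kubernetes.Interface, desired *v1.ServiceAccount) error {
	sas := c.CoreV1().ServiceAccounts(desired.Namespace)

	current, err := sas.Get(ctx, desired.Name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		_, err = sas.Create(ctx, desired, metav1.CreateOptions{})
		return err
	}
	if err != nil {
		return err
	}

	// Merge desired metadata into a copy of the live object; the live
	// secrets and imagePullSecrets lists are deliberately left untouched.
	merged := current.DeepCopy()
	if merged.Labels == nil {
		merged.Labels = map[string]string{}
	}
	for k, v := range desired.Labels {
		merged.Labels[k] = v
	}
	if merged.Annotations == nil {
		merged.Annotations = map[string]string{}
	}
	for k, v := range desired.Annotations {
		merged.Annotations[k] = v
	}

	// No diff, no write: this is what would stop the ~15m PATCH churn seen
	// in the audit log above.
	if apiequality.Semantic.DeepEqual(current, merged) {
		return nil
	}

	_, err = sas.Update(ctx, merged, metav1.UpdateOptions{})
	return err
}

The same "compare before writing" idea applies if the operator keeps using PATCH: compute the patch against the live object first, and skip the request when the patch would be empty.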
Observed this sequence of updates to the thanos-querier ServiceAccount (picked arbitrarily):

$ kubectl get -n openshift-monitoring serviceaccount thanos-querier -o yaml -w
apiVersion: v1
imagePullSecrets:
- name: thanos-querier-dockercfg-rklg9
kind: ServiceAccount
metadata:
  annotations:
    serviceaccounts.openshift.io/oauth-redirectreference.thanos-querier: '{"kind":"OAuthRedirectReference","apiVersion":"v1","reference":{"kind":"Route","name":"thanos-querier"}}'
  creationTimestamp: "2022-08-04T20:59:16Z"
  labels:
    app.kubernetes.io/component: query-layer
    app.kubernetes.io/instance: thanos-querier
    app.kubernetes.io/name: thanos-query
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.27.0
  name: thanos-querier
  namespace: openshift-monitoring
  resourceVersion: "28720"
  uid: 69268a0f-12b0-4b35-8512-91880e8cf819
secrets:
- name: thanos-querier-dockercfg-rklg9
---
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    serviceaccounts.openshift.io/oauth-redirectreference.thanos-querier: '{"kind":"OAuthRedirectReference","apiVersion":"v1","reference":{"kind":"Route","name":"thanos-querier"}}'
  creationTimestamp: "2022-08-04T20:59:16Z"
  labels:
    app.kubernetes.io/component: query-layer
    app.kubernetes.io/instance: thanos-querier
    app.kubernetes.io/name: thanos-query
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.27.0
  name: thanos-querier
  namespace: openshift-monitoring
  resourceVersion: "30882"
  uid: 69268a0f-12b0-4b35-8512-91880e8cf819
---
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    openshift.io/create-dockercfg-secrets.pending-token: thanos-querier-token-fhw5x
    serviceaccounts.openshift.io/oauth-redirectreference.thanos-querier: '{"kind":"OAuthRedirectReference","apiVersion":"v1","reference":{"kind":"Route","name":"thanos-querier"}}'
  creationTimestamp: "2022-08-04T20:59:16Z"
  labels:
    app.kubernetes.io/component: query-layer
    app.kubernetes.io/instance: thanos-querier
    app.kubernetes.io/name: thanos-query
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.27.0
  name: thanos-querier
  namespace: openshift-monitoring
  resourceVersion: "30883"
  uid: 69268a0f-12b0-4b35-8512-91880e8cf819
---
apiVersion: v1
imagePullSecrets:
- name: thanos-querier-dockercfg-lx4bt
kind: ServiceAccount
metadata:
  annotations:
    serviceaccounts.openshift.io/oauth-redirectreference.thanos-querier: '{"kind":"OAuthRedirectReference","apiVersion":"v1","reference":{"kind":"Route","name":"thanos-querier"}}'
  creationTimestamp: "2022-08-04T20:59:16Z"
  labels:
    app.kubernetes.io/component: query-layer
    app.kubernetes.io/instance: thanos-querier
    app.kubernetes.io/name: thanos-query
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.27.0
  name: thanos-querier
  namespace: openshift-monitoring
  resourceVersion: "30891"
  uid: 69268a0f-12b0-4b35-8512-91880e8cf819
secrets:
- name: thanos-querier-dockercfg-lx4bt

Effectively:

1) Initially, secrets and imagePullSecrets are both populated with the names of existing secrets.
2) Both fields are wiped out by an update.
3) The dockercfg secret controller in openshift-controller-manager sees the update and decides it needs to act, because the serviceaccount doesn't have appropriate items in its secrets and imagePullSecrets lists. It annotates the serviceaccount with "openshift.io/create-dockercfg-secrets.pending-token" to indicate that it has generated a new token name.
4) The same controller creates two secrets and adds their names back to the serviceaccount.
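To make step 2 concrete, here is a minimal sketch, assuming a client-go style reconciler (an illustrative helper, not the operator's actual code), of how a write can carry the controller-owned fields over from the live object, so it never empties them and never re-triggers the dockercfg secret controller:

package reconcile

import (
	v1 "k8s.io/api/core/v1"
)

// preserveTokenFields (hypothetical helper) returns a copy of desired with the
// fields owned by openshift-controller-manager carried over from the live
// ServiceAccount, so a subsequent PATCH/Update does not wipe them.
func preserveTokenFields(desired, live *v1.ServiceAccount) *v1.ServiceAccount {
	out := desired.DeepCopy()

	// secrets and imagePullSecrets are populated by the dockercfg secret
	// controller; emptying them (step 2 above) is what prompts it to mint a
	// fresh token/dockercfg pair every time.
	out.Secrets = live.Secrets
	out.ImagePullSecrets = live.ImagePullSecrets

	// The pending-token annotation is also controller-owned state.
	const pending = "openshift.io/create-dockercfg-secrets.pending-token"
	if v, ok := live.Annotations[pending]; ok {
		if out.Annotations == nil {
			out.Annotations = map[string]string{}
		}
		out.Annotations[pending] = v
	}
	return out
}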
And here is the Secret leak showing up in the presubmit on the original pull request:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/gather-must-gather/artifacts/must-gather.tar | tar xOz registry-build03-ci-openshift-org-ci-op-s076v2wy-stable-sha256-662ef5efb9a8489a5283d1f94cbf470f763da37d4b284e30efdbc035fa5a6b1c/namespaces/openshift-monitoring/core/secrets.yaml | yaml2json | jq -r '.items[].metadata | .name + " " + .creationTimestamp' | sort
...
telemeter-client-token-xh5tp 2022-07-06T15:01:09Z
telemeter-client-token-xj6tr 2022-07-06T14:14:36Z
thanos-querier-dockercfg-b6dmf 2022-07-06T15:15:42Z
thanos-querier-dockercfg-d8fbm 2022-07-06T15:01:09Z
thanos-querier-dockercfg-l7dt4 2022-07-06T14:15:37Z
thanos-querier-dockercfg-t92dm 2022-07-06T15:24:19Z
thanos-querier-dockercfg-vm6c2 2022-07-06T15:23:46Z
thanos-querier-dockercfg-w2m2n 2022-07-06T15:14:54Z
thanos-querier-grpc-tls-r8qd6n80gdn2 2022-07-06T14:15:38Z
thanos-querier-kube-rbac-proxy 2022-07-06T14:15:36Z
thanos-querier-kube-rbac-proxy-metrics 2022-07-06T14:15:36Z
thanos-querier-kube-rbac-proxy-rules 2022-07-06T14:15:36Z
thanos-querier-oauth-cookie 2022-07-06T14:15:36Z
thanos-querier-tls 2022-07-06T14:13:27Z
thanos-querier-token-d68sk 2022-07-06T15:24:19Z
thanos-querier-token-gbwng 2022-07-06T15:01:09Z
thanos-querier-token-ksczb 2022-07-06T15:23:46Z
thanos-querier-token-m26kv 2022-07-06T14:15:36Z
thanos-querier-token-sfjh6 2022-07-06T15:15:42Z
thanos-querier-token-vj57h 2022-07-06T15:14:53Z
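As a cross-check, here is a small diagnostic sketch (a hypothetical helper, not part of any shipped tool) that surfaces exactly this pattern: dockercfg/token Secrets in a namespace that no ServiceAccount references any more.

package diagnose

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// orphanedSATokenSecrets (hypothetical helper) lists service-account-owned
// Secrets in the namespace that are no longer referenced by any
// ServiceAccount's secrets or imagePullSecrets lists.
func orphanedSATokenSecrets(ctx context.Context, c kubernetes.Interface, ns string) ([]string, error) {
	secrets, err := c.CoreV1().Secrets(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	sas, err := c.CoreV1().ServiceAccounts(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}

	// Collect the Secret names still referenced by some ServiceAccount.
	referenced := map[string]bool{}
	for _, sa := range sas.Items {
		for _, s := range sa.Secrets {
			referenced[s.Name] = true
		}
		for _, s := range sa.ImagePullSecrets {
			referenced[s.Name] = true
		}
	}

	var orphans []string
	for _, s := range secrets.Items {
		saOwned := s.Type == v1.SecretTypeServiceAccountToken || s.Type == v1.SecretTypeDockercfg
		if saOwned && !referenced[s.Name] {
			orphans = append(orphans, fmt.Sprintf("%s %s", s.Name, s.CreationTimestamp.UTC().Format("2006-01-02T15:04:05Z")))
		}
	}
	return orphans, nil
}

Run against openshift-monitoring in the cluster above, the older thanos-querier-dockercfg-*/-token-* pairs that the ServiceAccount no longer lists are the ones that would show up as orphans.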
I'm setting this high/high and blocker+: leaking multiple Secrets into openshift-monitoring in batches, every 15m or more frequently, in a way that never gets garbage collected puts all 4.12 clusters at risk until we fix the leak, possibly by reverting [1].

[1]: https://github.com/openshift/cluster-monitoring-operator/pull/1704
Tested with 4.12.0-0.nightly-2022-08-09-114621; no Secret leak now.

# kubectl get -n openshift-monitoring serviceaccount thanos-querier -o yaml -w
apiVersion: v1
imagePullSecrets:
- name: thanos-querier-dockercfg-rn7qz
kind: ServiceAccount
metadata:
  annotations:
    serviceaccounts.openshift.io/oauth-redirectreference.thanos-querier: '{"kind":"OAuthRedirectReference","apiVersion":"v1","reference":{"kind":"Route","name":"thanos-querier"}}'
  creationTimestamp: "2022-08-10T01:06:18Z"
  labels:
    app.kubernetes.io/component: query-layer
    app.kubernetes.io/instance: thanos-querier
    app.kubernetes.io/name: thanos-query
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.27.0
  name: thanos-querier
  namespace: openshift-monitoring
  resourceVersion: "23367"
  uid: 2e6aeeb8-58de-4e85-b92b-7433442d3c78
secrets:
- name: thanos-querier-dockercfg-rn7qz

and

# cat must-gather.local.3853711220397963982/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-9ac372d87be122862b54104f11de19bf6d23688c3c382546c3767c8677539ba6/namespaces/openshift-monitoring/core/secrets.yaml | yaml2json | jq -r '.items[].metadata | .name + " " + .creationTimestamp' | sort
.....
telemeter-client 2022-08-10T01:07:12Z
telemeter-client-dockercfg-q86jb 2022-08-10T01:08:50Z
telemeter-client-kube-rbac-proxy-config 2022-08-10T01:07:12Z
telemeter-client-tls 2022-08-10T01:07:12Z
telemeter-client-token-72gz2 2022-08-10T01:08:50Z
thanos-querier-dockercfg-rn7qz 2022-08-10T01:08:50Z
thanos-querier-grpc-tls-ak3b6m82k2aqf 2022-08-10T01:06:20Z
thanos-querier-kube-rbac-proxy 2022-08-10T01:06:17Z
thanos-querier-kube-rbac-proxy-metrics 2022-08-10T01:06:17Z
thanos-querier-kube-rbac-proxy-rules 2022-08-10T01:06:17Z
thanos-querier-oauth-cookie 2022-08-10T01:06:17Z
thanos-querier-tls 2022-08-10T01:06:18Z
thanos-querier-token-6nmjg 2022-08-10T01:08:50Z
*** Bug 2115376 has been marked as a duplicate of this bug. ***
The issue is fixed in 4.12.0-ec.1:

$ oc --context build02 get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-ec.1   True        False         18h     Cluster version is 4.12.0-ec.1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399