Bug 2115527

Summary: ServiceAccounts PATCH noise leads to Secret leakage
Product: OpenShift Container Platform
Component: Monitoring
Version: 4.12
Target Release: 4.12.0
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Status: CLOSED ERRATA
Reporter: W. Trevor King <wking>
Assignee: Joao Marcal <jmarcal>
QA Contact: Junqi Zhao <juzhao>
Docs Contact: Brian Burt <bburt>
CC: anpicker, bburt, bluddy, hongkliu, jmarcal, juzhao, tremes, xxia
Doc Type: No Doc Update
Clone Of: 2095719
Last Closed: 2023-01-17 19:54:29 UTC

Description W. Trevor King 2022-08-04 21:12:28 UTC
Bug 2095719 landed [1], introducing ServiceAccount PATCH calls.  However, in 4.12 CI on the pull request's presubmit, we can see more PATCH calls than necessary.  Digging into [2] is useful because, as an update job, it should be very calm under the old operator and then show churny PATCHes once the new operator comes in.  Drilling into the gathered assets gets us the audit logs [3]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/gather-audit-logs/artifacts/audit-logs.tar | tar xz --strip-components=2
  $ zgrep -h thanos-querier kube-apiserver/*log.gz | jq -r 'select(.user.username == "system:serviceaccount:openshift-monitoring:cluster-monitoring-operator" and .verb == "patch") | .stageTimestamp + " " + .requestURI'
  2022-07-06T15:00:19.463198Z /api/v1/namespaces/openshift-monitoring/serviceaccounts/thanos-querier
  2022-07-06T15:14:53.970100Z /api/v1/namespaces/openshift-monitoring/serviceaccounts/thanos-querier
  2022-07-06T15:15:42.052818Z /api/v1/namespaces/openshift-monitoring/serviceaccounts/thanos-querier
  2022-07-06T15:23:07.392031Z /api/v1/namespaces/openshift-monitoring/serviceaccounts/thanos-querier
  2022-07-06T15:24:19.827991Z /api/v1/namespaces/openshift-monitoring/serviceaccounts/thanos-querier
  2022-07-06T15:39:19.360960Z /api/v1/namespaces/openshift-monitoring/serviceaccounts/thanos-querier
  parse error: Invalid numeric literal at line 1623, column 6
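
The trailing "parse error" is presumably jq choking on an audit-log line that isn't valid JSON (e.g. a truncated entry); the PATCH entries above had already been emitted before it bailed.  For anyone reproducing this, a variant I didn't use above, which reads raw lines and uses fromjson? so unparseable lines are silently skipped, would be:

  $ zgrep -h thanos-querier kube-apiserver/*log.gz | jq -rR 'fromjson? | select(.user.username == "system:serviceaccount:openshift-monitoring:cluster-monitoring-operator" and .verb == "patch") | .stageTimestamp + " " + .requestURI'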

Timestamp of monitoring completing its portion of the update:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'clusteroperator/monitoring.*version'
  Jul 06 15:03:33.705 I clusteroperator/monitoring versions: operator 4.12.0-0.ci.test-2022-07-06-134429-ci-op-s076v2wy-initial -> 4.12.0-0.ci.test-2022-07-06-135548-ci-op-s076v2wy-latest

So the monitoring operator updated shortly before 15:03:33, when it rev'ed its ClusterOperator versions to claim it had completed its portion of the update.  And every 15m or so after that (with some more frequent rounds), it was PATCHing the thanos-querier ServiceAccount.
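
As a rough sanity-check of that cadence (a sketch I didn't run for this report; it reuses the audit-log pipeline above and assumes GNU date), one could print the gap between consecutive PATCHes:

  $ zgrep -h thanos-querier kube-apiserver/*log.gz \
      | jq -rR 'fromjson? | select(.user.username == "system:serviceaccount:openshift-monitoring:cluster-monitoring-operator" and .verb == "patch") | .stageTimestamp' \
      | sort \
      | while read -r t; do date -d "${t%.*}Z" +%s; done \
      | awk 'NR > 1 {print $1 - prev " seconds since the previous PATCH"} {prev = $1}'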

Dropping into the monitoring-operator logs [4], the earliest entry is from 15:24:11, just before the 15:24:19 PATCH:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-5f58cdbcdb-5pl8s_cluster-monitoring-operator.log | head -n1
  W0706 15:24:11.867255       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.

But I see nothing about the 15:24:19 PATCH or the 15:39:19 PATCH that explains why the monitoring operator feels a PATCH is necessary:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-5f58cdbcdb-5pl8s_cluster-monitoring-operator.log | grep 15:24:19
  I0706 15:24:19.405009       1 tasks.go:74] ran task 2 of 15: Updating Cluster Monitoring Operator
  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-5f58cdbcdb-5pl8s_cluster-monitoring-operator.log | tail -n1
  I0706 15:27:03.624952       1 operator.go:678] Updating ClusterOperator status to done.
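
To correlate those PATCHes with reconcile runs, one thing that might help (a sketch I haven't run, against the same log [4]) is counting the "ran task" lines per minute, which I'd expect to roughly line up with the PATCH timestamps above:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-5f58cdbcdb-5pl8s_cluster-monitoring-operator.log | grep 'ran task' | awk '{print substr($2, 1, 5)}' | uniq -c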

To avoid the Secret churn (which I still need to pin down), I think we should revert or adjust [1] so that it stops issuing periodic PATCHes in situations where no changes are actually required.

[1]: https://github.com/openshift/cluster-monitoring-operator/pull/1704
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/gather-audit-logs/artifacts/
[4]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-5f58cdbcdb-5pl8s_cluster-monitoring-operator.log

Comment 1 Ben Luddy 2022-08-04 21:18:54 UTC
Observed this sequence of updates to the thanos-querier ServiceAccount (picked arbitrarily):

$ kubectl get -n openshift-monitoring serviceaccount thanos-querier -o yaml -w
apiVersion: v1
imagePullSecrets:
- name: thanos-querier-dockercfg-rklg9
kind: ServiceAccount
metadata:
  annotations:
    serviceaccounts.openshift.io/oauth-redirectreference.thanos-querier: '{"kind":"OAuthRedirectReference","apiVersion":"v1","reference":{"kind":"Route","name":"thanos-querier"}}'
  creationTimestamp: "2022-08-04T20:59:16Z"
  labels:
    app.kubernetes.io/component: query-layer
    app.kubernetes.io/instance: thanos-querier
    app.kubernetes.io/name: thanos-query
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.27.0
  name: thanos-querier
  namespace: openshift-monitoring
  resourceVersion: "28720"
  uid: 69268a0f-12b0-4b35-8512-91880e8cf819
secrets:
- name: thanos-querier-dockercfg-rklg9
---
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    serviceaccounts.openshift.io/oauth-redirectreference.thanos-querier: '{"kind":"OAuthRedirectReference","apiVersion":"v1","reference":{"kind":"Route","name":"thanos-querier"}}'
  creationTimestamp: "2022-08-04T20:59:16Z"
  labels:
    app.kubernetes.io/component: query-layer
    app.kubernetes.io/instance: thanos-querier
    app.kubernetes.io/name: thanos-query
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.27.0
  name: thanos-querier
  namespace: openshift-monitoring
  resourceVersion: "30882"
  uid: 69268a0f-12b0-4b35-8512-91880e8cf819
---
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    openshift.io/create-dockercfg-secrets.pending-token: thanos-querier-token-fhw5x
    serviceaccounts.openshift.io/oauth-redirectreference.thanos-querier: '{"kind":"OAuthRedirectReference","apiVersion":"v1","reference":{"kind":"Route","name":"thanos-querier"}}'
  creationTimestamp: "2022-08-04T20:59:16Z"
  labels:
    app.kubernetes.io/component: query-layer
    app.kubernetes.io/instance: thanos-querier
    app.kubernetes.io/name: thanos-query
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.27.0
  name: thanos-querier
  namespace: openshift-monitoring
  resourceVersion: "30883"
  uid: 69268a0f-12b0-4b35-8512-91880e8cf819
---
apiVersion: v1
imagePullSecrets:
- name: thanos-querier-dockercfg-lx4bt
kind: ServiceAccount
metadata:
  annotations:
    serviceaccounts.openshift.io/oauth-redirectreference.thanos-querier: '{"kind":"OAuthRedirectReference","apiVersion":"v1","reference":{"kind":"Route","name":"thanos-querier"}}'
  creationTimestamp: "2022-08-04T20:59:16Z"
  labels:
    app.kubernetes.io/component: query-layer
    app.kubernetes.io/instance: thanos-querier
    app.kubernetes.io/name: thanos-query
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.27.0
  name: thanos-querier
  namespace: openshift-monitoring
  resourceVersion: "30891"
  uid: 69268a0f-12b0-4b35-8512-91880e8cf819
secrets:
- name: thanos-querier-dockercfg-lx4bt

Effectively:

1) Initially, secrets and imagePullSecrets are both populated with the names of existing secrets.
2) Both fields are wiped out by an update (see the sketch after this list for one way a PATCH can do that).
3) The dockercfg secret controller in openshift-controller-manager sees the update and decides it needs to act because the serviceaccount doesn't have appropriate items in its secrets and imagePullSecrets lists. It annotates the serviceaccount with "openshift.io/create-dockercfg-secrets.pending-token" to indicate that it has generated a new token name.
4) The same controller creates two secrets and adds their names back to the serviceaccount.
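
For anyone who wants to see the wipe-and-regenerate cycle in isolation, a hypothetical way to reproduce step 2 by hand on a throwaway cluster is a JSON merge patch that sets both list fields to null, which deletes them; the dockercfg secret controller then repopulates them with freshly generated Secret names.  This is illustration only: it is not necessarily the exact PATCH body cluster-monitoring-operator sends, and it forces new dockercfg/token Secrets to be minted, so don't run it on a cluster you care about.

$ kubectl patch -n openshift-monitoring serviceaccount thanos-querier --type=merge -p '{"secrets": null, "imagePullSecrets": null}'
$ kubectl get -n openshift-monitoring serviceaccount thanos-querier -o yaml

After a moment, secrets and imagePullSecrets come back pointing at new randomly suffixed names, matching the watch output above.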

Comment 2 W. Trevor King 2022-08-04 21:20:04 UTC
And here is the Secret leak showing up in the presubmit on the original pull request:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/gather-must-gather/artifacts/must-gather.tar | tar xOz registry-build03-ci-openshift-org-ci-op-s076v2wy-stable-sha256-662ef5efb9a8489a5283d1f94cbf470f763da37d4b284e30efdbc035fa5a6b1c/namespaces/openshift-monitoring/core/secrets.yaml | yaml2json | jq -r '.items[].metadata | .name + " " + .creationTimestamp' | sort
...
telemeter-client-token-xh5tp 2022-07-06T15:01:09Z
telemeter-client-token-xj6tr 2022-07-06T14:14:36Z
thanos-querier-dockercfg-b6dmf 2022-07-06T15:15:42Z
thanos-querier-dockercfg-d8fbm 2022-07-06T15:01:09Z
thanos-querier-dockercfg-l7dt4 2022-07-06T14:15:37Z
thanos-querier-dockercfg-t92dm 2022-07-06T15:24:19Z
thanos-querier-dockercfg-vm6c2 2022-07-06T15:23:46Z
thanos-querier-dockercfg-w2m2n 2022-07-06T15:14:54Z
thanos-querier-grpc-tls-r8qd6n80gdn2 2022-07-06T14:15:38Z
thanos-querier-kube-rbac-proxy 2022-07-06T14:15:36Z
thanos-querier-kube-rbac-proxy-metrics 2022-07-06T14:15:36Z
thanos-querier-kube-rbac-proxy-rules 2022-07-06T14:15:36Z
thanos-querier-oauth-cookie 2022-07-06T14:15:36Z
thanos-querier-tls 2022-07-06T14:13:27Z
thanos-querier-token-d68sk 2022-07-06T15:24:19Z
thanos-querier-token-gbwng 2022-07-06T15:01:09Z
thanos-querier-token-ksczb 2022-07-06T15:23:46Z
thanos-querier-token-m26kv 2022-07-06T14:15:36Z
thanos-querier-token-sfjh6 2022-07-06T15:15:42Z
thanos-querier-token-vj57h 2022-07-06T15:14:53Z
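
A quick way to spot this kind of leak at a glance (not part of the original triage, just a convenience filter on top of the same pipeline) is to strip the random suffix from the generated dockercfg/token Secret names and count duplicates; a ServiceAccount should normally have only one of each:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/gather-must-gather/artifacts/must-gather.tar | tar xOz registry-build03-ci-openshift-org-ci-op-s076v2wy-stable-sha256-662ef5efb9a8489a5283d1f94cbf470f763da37d4b284e30efdbc035fa5a6b1c/namespaces/openshift-monitoring/core/secrets.yaml | yaml2json | jq -r '.items[].metadata.name' | grep -E '(dockercfg|token)-' | sed -E 's/-[^-]+$//' | sort | uniq -c | sort -rn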

Comment 3 W. Trevor King 2022-08-04 21:22:55 UTC
I'm setting this high/high and blocker+, because leaking multiple Secrets into openshift-monitoring in batches every 15m (or more frequently), in ways that don't get garbage collected, puts all 4.12 clusters at risk until we fix the leak, possibly by reverting [1].

[1]: https://github.com/openshift/cluster-monitoring-operator/pull/1704

Comment 6 Junqi Zhao 2022-08-10 03:42:34 UTC
Tested with 4.12.0-0.nightly-2022-08-09-114621; no Secret leak now:
# kubectl get -n openshift-monitoring serviceaccount thanos-querier -o yaml -w
apiVersion: v1
imagePullSecrets:
- name: thanos-querier-dockercfg-rn7qz
kind: ServiceAccount
metadata:
  annotations:
    serviceaccounts.openshift.io/oauth-redirectreference.thanos-querier: '{"kind":"OAuthRedirectReference","apiVersion":"v1","reference":{"kind":"Route","name":"thanos-querier"}}'
  creationTimestamp: "2022-08-10T01:06:18Z"
  labels:
    app.kubernetes.io/component: query-layer
    app.kubernetes.io/instance: thanos-querier
    app.kubernetes.io/name: thanos-query
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.27.0
  name: thanos-querier
  namespace: openshift-monitoring
  resourceVersion: "23367"
  uid: 2e6aeeb8-58de-4e85-b92b-7433442d3c78
secrets:
- name: thanos-querier-dockercfg-rn7qz

and the Secret listing from a must-gather:
# cat must-gather.local.3853711220397963982/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-9ac372d87be122862b54104f11de19bf6d23688c3c382546c3767c8677539ba6/namespaces/openshift-monitoring/core/secrets.yaml | yaml2json | jq -r '.items[].metadata | .name + " " + .creationTimestamp' | sort
.....
telemeter-client 2022-08-10T01:07:12Z
telemeter-client-dockercfg-q86jb 2022-08-10T01:08:50Z
telemeter-client-kube-rbac-proxy-config 2022-08-10T01:07:12Z
telemeter-client-tls 2022-08-10T01:07:12Z
telemeter-client-token-72gz2 2022-08-10T01:08:50Z
thanos-querier-dockercfg-rn7qz 2022-08-10T01:08:50Z
thanos-querier-grpc-tls-ak3b6m82k2aqf 2022-08-10T01:06:20Z
thanos-querier-kube-rbac-proxy 2022-08-10T01:06:17Z
thanos-querier-kube-rbac-proxy-metrics 2022-08-10T01:06:17Z
thanos-querier-kube-rbac-proxy-rules 2022-08-10T01:06:17Z
thanos-querier-oauth-cookie 2022-08-10T01:06:17Z
thanos-querier-tls 2022-08-10T01:06:18Z
thanos-querier-token-6nmjg 2022-08-10T01:08:50Z
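
For completeness, another check that could be run on a live cluster (not part of the verification above) is to query the control-plane audit logs directly and confirm the operator has stopped PATCHing its ServiceAccounts outside of genuine changes; <master-node> is a placeholder for one of the master node names, and each master keeps its own audit log, so it is worth looping over all of them:
# oc adm node-logs <master-node> --path=kube-apiserver/audit.log | jq -rR 'fromjson? | select(.user.username == "system:serviceaccount:openshift-monitoring:cluster-monitoring-operator" and .verb == "patch" and (.requestURI | test("/serviceaccounts/"))) | .stageTimestamp + " " + .requestURI'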

Comment 7 Xingxing Xia 2022-08-11 10:37:32 UTC
*** Bug 2115376 has been marked as a duplicate of this bug. ***

Comment 10 Hongkai Liu 2022-08-16 13:40:41 UTC
The issue is fixed in 4.12.0-ec.1, as seen on build02:

oc --context build02 get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-ec.1   True        False         18h     Cluster version is 4.12.0-ec.1

Comment 13 errata-xmlrpc 2023-01-17 19:54:29 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399