Bug 2115527 - ServiceAccounts PATCH noise leads to Secret leakage
Summary: ServiceAccounts PATCH noise leads to Secret leakage
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.12
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Joao Marcal
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 2115376
Depends On:
Blocks:
Reported: 2022-08-04 21:12 UTC by W. Trevor King
Modified: 2022-08-16 13:40 UTC (History)
7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2095719
Environment:
Last Closed:
Target Upstream Version:




Links:
Github openshift/cluster-monitoring-operator pull 1734 (open): Bug 2115527: reverts #1704, last updated 2022-08-05 08:18:05 UTC

Description W. Trevor King 2022-08-04 21:12:28 UTC
Bug 2095719 landed [1], introducing ServiceAccount PATCH calls.  However, in 4.12 CI on the pull request's presubmit, we can see more PATCH calls than necessary.  Take [2], which, as an update job, should be very calm under the old operator and only show churny PATCHes once the new operator comes in.  Drilling into the gathered assets gets us audit logs [3]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/gather-audit-logs/artifacts/audit-logs.tar | tar xz --strip-components=2
  $ zgrep -h thanos-querier kube-apiserver/*log.gz | jq -r 'select(.user.username == "system:serviceaccount:openshift-monitoring:cluster-monitoring-operator" and .verb == "patch") | .stageTimestamp + " " + .requestURI'
  2022-07-06T15:00:19.463198Z /api/v1/namespaces/openshift-monitoring/serviceaccounts/thanos-querier
  2022-07-06T15:14:53.970100Z /api/v1/namespaces/openshift-monitoring/serviceaccounts/thanos-querier
  2022-07-06T15:15:42.052818Z /api/v1/namespaces/openshift-monitoring/serviceaccounts/thanos-querier
  2022-07-06T15:23:07.392031Z /api/v1/namespaces/openshift-monitoring/serviceaccounts/thanos-querier
  2022-07-06T15:24:19.827991Z /api/v1/namespaces/openshift-monitoring/serviceaccounts/thanos-querier
  2022-07-06T15:39:19.360960Z /api/v1/namespaces/openshift-monitoring/serviceaccounts/thanos-querier
  parse error: Invalid numeric literal at line 1623, column 6
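As an aside, the trailing parse error comes from jq aborting on the first non-JSON line in the concatenated logs. A sketch of a more tolerant variant: read raw lines with -R and parse with fromjson?, which silently drops anything unparsable. The two input lines here are inline stand-ins (not real audit entries) so the example is self-contained:

```shell
# Read raw lines (-R) and parse each with fromjson?; lines that are not
# valid JSON are skipped instead of aborting the whole pipeline.
printf '%s\n' \
  '{"user":{"username":"system:serviceaccount:openshift-monitoring:cluster-monitoring-operator"},"verb":"patch","stageTimestamp":"2022-07-06T15:00:19.463198Z","requestURI":"/api/v1/namespaces/openshift-monitoring/serviceaccounts/thanos-querier"}' \
  'this line is not JSON and would normally abort jq' \
  | jq -rR 'fromjson?
            | select(.user.username == "system:serviceaccount:openshift-monitoring:cluster-monitoring-operator" and .verb == "patch")
            | .stageTimestamp + " " + .requestURI'
# only the valid JSON entry is printed; the garbage line is dropped silently
```

The same `jq -rR 'fromjson? | …'` filter should drop into the zgrep pipeline above unchanged.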

Timestamp of monitoring completing its portion of the update:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'clusteroperator/monitoring.*version'
  Jul 06 15:03:33.705 I clusteroperator/monitoring versions: operator 4.12.0-0.ci.test-2022-07-06-134429-ci-op-s076v2wy-initial -> 4.12.0-0.ci.test-2022-07-06-135548-ci-op-s076v2wy-latest

So the monitoring operator rev'ed its ClusterOperator versions to claim it had completed its portion of the update at 15:03:33.  And every 15m or so (with some more frequent rounds) after that, it was PATCHing the thanos-querier ServiceAccount.
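For reference, the gaps between the audit-log PATCH timestamps above can be computed mechanically (this sketch assumes GNU date, which accepts ISO 8601 input via -d; fractional seconds are trimmed):

```shell
# Print the gap, in minutes, between each pair of consecutive PATCH calls
# taken from the audit-log timestamps above.
printf '%s\n' \
  2022-07-06T15:00:19Z 2022-07-06T15:14:53Z 2022-07-06T15:15:42Z \
  2022-07-06T15:23:07Z 2022-07-06T15:24:19Z 2022-07-06T15:39:19Z \
  | while read -r t; do date -u -d "$t" +%s; done \
  | awk 'NR > 1 { printf "%.1f min\n", ($1 - prev) / 60 } { prev = $1 }'
# → 14.6 min, 0.8 min, 7.4 min, 1.2 min, 15.0 min
```

That matches the "every 15m or so, with some more frequent rounds" cadence described above.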

Dropping into the monitoring-operator logs [4], they begin at 15:24:11, just before the 15:24:19 PATCH:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-5f58cdbcdb-5pl8s_cluster-monitoring-operator.log | head -n1
  W0706 15:24:11.867255       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.

But I see nothing about the 15:24:19 PATCH, or the 15:39:19 PATCH, that would explain why the monitoring operator feels a PATCH is necessary:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-5f58cdbcdb-5pl8s_cluster-monitoring-operator.log | grep 15:24:19
  I0706 15:24:19.405009       1 tasks.go:74] ran task 2 of 15: Updating Cluster Monitoring Operator
  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-5f58cdbcdb-5pl8s_cluster-monitoring-operator.log | tail -n1
  I0706 15:27:03.624952       1 operator.go:678] Updating ClusterOperator status to done.

To avoid Secret churn (which I still need to pin down), I think we should revert or adjust [1] to avoid the periodic PATCHing in situations where no changes are actually required.

[1]: https://github.com/openshift/cluster-monitoring-operator/pull/1704
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/gather-audit-logs/artifacts/
[4]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-5f58cdbcdb-5pl8s_cluster-monitoring-operator.log

Comment 1 Ben Luddy 2022-08-04 21:18:54 UTC
Observed this sequence of updates to the thanos-querier ServiceAccount (picked arbitrarily):

$ kubectl get -n openshift-monitoring serviceaccount thanos-querier -o yaml -w
apiVersion: v1
imagePullSecrets:
- name: thanos-querier-dockercfg-rklg9
kind: ServiceAccount
metadata:
  annotations:
    serviceaccounts.openshift.io/oauth-redirectreference.thanos-querier: '{"kind":"OAuthRedirectReference","apiVersion":"v1","reference":{"kind":"Route","name":"thanos-querier"}}'
  creationTimestamp: "2022-08-04T20:59:16Z"
  labels:
    app.kubernetes.io/component: query-layer
    app.kubernetes.io/instance: thanos-querier
    app.kubernetes.io/name: thanos-query
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.27.0
  name: thanos-querier
  namespace: openshift-monitoring
  resourceVersion: "28720"
  uid: 69268a0f-12b0-4b35-8512-91880e8cf819
secrets:
- name: thanos-querier-dockercfg-rklg9
---
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    serviceaccounts.openshift.io/oauth-redirectreference.thanos-querier: '{"kind":"OAuthRedirectReference","apiVersion":"v1","reference":{"kind":"Route","name":"thanos-querier"}}'
  creationTimestamp: "2022-08-04T20:59:16Z"
  labels:
    app.kubernetes.io/component: query-layer
    app.kubernetes.io/instance: thanos-querier
    app.kubernetes.io/name: thanos-query
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.27.0
  name: thanos-querier
  namespace: openshift-monitoring
  resourceVersion: "30882"
  uid: 69268a0f-12b0-4b35-8512-91880e8cf819
---
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    openshift.io/create-dockercfg-secrets.pending-token: thanos-querier-token-fhw5x
    serviceaccounts.openshift.io/oauth-redirectreference.thanos-querier: '{"kind":"OAuthRedirectReference","apiVersion":"v1","reference":{"kind":"Route","name":"thanos-querier"}}'
  creationTimestamp: "2022-08-04T20:59:16Z"
  labels:
    app.kubernetes.io/component: query-layer
    app.kubernetes.io/instance: thanos-querier
    app.kubernetes.io/name: thanos-query
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.27.0
  name: thanos-querier
  namespace: openshift-monitoring
  resourceVersion: "30883"
  uid: 69268a0f-12b0-4b35-8512-91880e8cf819
---
apiVersion: v1
imagePullSecrets:
- name: thanos-querier-dockercfg-lx4bt
kind: ServiceAccount
metadata:
  annotations:
    serviceaccounts.openshift.io/oauth-redirectreference.thanos-querier: '{"kind":"OAuthRedirectReference","apiVersion":"v1","reference":{"kind":"Route","name":"thanos-querier"}}'
  creationTimestamp: "2022-08-04T20:59:16Z"
  labels:
    app.kubernetes.io/component: query-layer
    app.kubernetes.io/instance: thanos-querier
    app.kubernetes.io/name: thanos-query
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.27.0
  name: thanos-querier
  namespace: openshift-monitoring
  resourceVersion: "30891"
  uid: 69268a0f-12b0-4b35-8512-91880e8cf819
secrets:
- name: thanos-querier-dockercfg-lx4bt

Effectively:

1) Initially, secrets and imagePullSecrets are both populated with the names of existing Secrets.
2) Both fields are wiped out by an update.
3) The dockercfg secret controller in openshift-controller-manager sees the update and decides it needs to act, because the ServiceAccount no longer has the appropriate entries in its secrets and imagePullSecrets lists. It annotates the ServiceAccount with "openshift.io/create-dockercfg-secrets.pending-token" to indicate that it has generated a new token name.
4) The same controller creates two new Secrets and adds their names back to the ServiceAccount.
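The wipe in step 2 is the effect a JSON merge patch (RFC 7386) that sets both list fields to null would have; whether the operator's apply logic actually sends exactly that patch is an assumption here, but the field deletion can be simulated with jq:

```shell
# Illustrative only: simulate RFC 7386 merge-patch deletion with jq.
# Keys whose patch value is null end up removed from the target object,
# matching the step-2 state where secrets/imagePullSecrets vanish.
echo '{"metadata":{"name":"thanos-querier"},
       "secrets":[{"name":"thanos-querier-dockercfg-rklg9"}],
       "imagePullSecrets":[{"name":"thanos-querier-dockercfg-rklg9"}]}' \
  | jq -c '. * {"secrets": null, "imagePullSecrets": null}
           | with_entries(select(.value != null))'
# → {"metadata":{"name":"thanos-querier"}}
```

Once the fields are gone, the dockercfg secret controller has no way to know the old Secrets are still valid, so it mints fresh ones every time.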

Comment 2 W. Trevor King 2022-08-04 21:20:04 UTC
And here is the Secret leak showing up in the presubmit on the original pull request:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1704/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1544678570972942336/artifacts/e2e-agnostic-upgrade/gather-must-gather/artifacts/must-gather.tar | tar xOz registry-build03-ci-openshift-org-ci-op-s076v2wy-stable-sha256-662ef5efb9a8489a5283d1f94cbf470f763da37d4b284e30efdbc035fa5a6b1c/namespaces/openshift-monitoring/core/secrets.yaml | yaml2json | jq -r '.items[].metadata | .name + " " + .creationTimestamp' | sort
...
telemeter-client-token-xh5tp 2022-07-06T15:01:09Z
telemeter-client-token-xj6tr 2022-07-06T14:14:36Z
thanos-querier-dockercfg-b6dmf 2022-07-06T15:15:42Z
thanos-querier-dockercfg-d8fbm 2022-07-06T15:01:09Z
thanos-querier-dockercfg-l7dt4 2022-07-06T14:15:37Z
thanos-querier-dockercfg-t92dm 2022-07-06T15:24:19Z
thanos-querier-dockercfg-vm6c2 2022-07-06T15:23:46Z
thanos-querier-dockercfg-w2m2n 2022-07-06T15:14:54Z
thanos-querier-grpc-tls-r8qd6n80gdn2 2022-07-06T14:15:38Z
thanos-querier-kube-rbac-proxy 2022-07-06T14:15:36Z
thanos-querier-kube-rbac-proxy-metrics 2022-07-06T14:15:36Z
thanos-querier-kube-rbac-proxy-rules 2022-07-06T14:15:36Z
thanos-querier-oauth-cookie 2022-07-06T14:15:36Z
thanos-querier-tls 2022-07-06T14:13:27Z
thanos-querier-token-d68sk 2022-07-06T15:24:19Z
thanos-querier-token-gbwng 2022-07-06T15:01:09Z
thanos-querier-token-ksczb 2022-07-06T15:23:46Z
thanos-querier-token-m26kv 2022-07-06T14:15:36Z
thanos-querier-token-sfjh6 2022-07-06T15:15:42Z
thanos-querier-token-vj57h 2022-07-06T15:14:53Z
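Counting the generated names per base makes the leak in the listing above stand out. A sketch (the thanos-querier secret names are pasted inline so the example is self-contained; the random five-character suffixes are stripped before counting):

```shell
# Strip the generated suffix from each secret name and count occurrences
# per base name; a healthy cluster has exactly one of each.
printf '%s\n' \
  thanos-querier-dockercfg-b6dmf thanos-querier-dockercfg-d8fbm \
  thanos-querier-dockercfg-l7dt4 thanos-querier-dockercfg-t92dm \
  thanos-querier-dockercfg-vm6c2 thanos-querier-dockercfg-w2m2n \
  thanos-querier-token-d68sk thanos-querier-token-gbwng \
  thanos-querier-token-ksczb thanos-querier-token-m26kv \
  thanos-querier-token-sfjh6 thanos-querier-token-vj57h \
  | sed 's/-[a-z0-9]*$//' | sort | uniq -c
# each base name shows a count of 6 where a healthy cluster would have 1
```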

Comment 3 W. Trevor King 2022-08-04 21:22:55 UTC
I'm setting this high/high and blocker+, because leaking multiple Secrets into openshift-monitoring in batches every 15m or more frequently, in ways that don't get garbage collected, puts all 4.12 clusters at risk until we fix the leak, possibly by reverting [1].

[1]: https://github.com/openshift/cluster-monitoring-operator/pull/1704

Comment 6 Junqi Zhao 2022-08-10 03:42:34 UTC
Tested with 4.12.0-0.nightly-2022-08-09-114621; no Secret leak now:
# kubectl get -n openshift-monitoring serviceaccount thanos-querier -o yaml -w
apiVersion: v1
imagePullSecrets:
- name: thanos-querier-dockercfg-rn7qz
kind: ServiceAccount
metadata:
  annotations:
    serviceaccounts.openshift.io/oauth-redirectreference.thanos-querier: '{"kind":"OAuthRedirectReference","apiVersion":"v1","reference":{"kind":"Route","name":"thanos-querier"}}'
  creationTimestamp: "2022-08-10T01:06:18Z"
  labels:
    app.kubernetes.io/component: query-layer
    app.kubernetes.io/instance: thanos-querier
    app.kubernetes.io/name: thanos-query
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.27.0
  name: thanos-querier
  namespace: openshift-monitoring
  resourceVersion: "23367"
  uid: 2e6aeeb8-58de-4e85-b92b-7433442d3c78
secrets:
- name: thanos-querier-dockercfg-rn7qz

and
# cat must-gather.local.3853711220397963982/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-9ac372d87be122862b54104f11de19bf6d23688c3c382546c3767c8677539ba6/namespaces/openshift-monitoring/core/secrets.yaml | yaml2json | jq -r '.items[].metadata | .name + " " + .creationTimestamp' | sort
.....
telemeter-client 2022-08-10T01:07:12Z
telemeter-client-dockercfg-q86jb 2022-08-10T01:08:50Z
telemeter-client-kube-rbac-proxy-config 2022-08-10T01:07:12Z
telemeter-client-tls 2022-08-10T01:07:12Z
telemeter-client-token-72gz2 2022-08-10T01:08:50Z
thanos-querier-dockercfg-rn7qz 2022-08-10T01:08:50Z
thanos-querier-grpc-tls-ak3b6m82k2aqf 2022-08-10T01:06:20Z
thanos-querier-kube-rbac-proxy 2022-08-10T01:06:17Z
thanos-querier-kube-rbac-proxy-metrics 2022-08-10T01:06:17Z
thanos-querier-kube-rbac-proxy-rules 2022-08-10T01:06:17Z
thanos-querier-oauth-cookie 2022-08-10T01:06:17Z
thanos-querier-tls 2022-08-10T01:06:18Z
thanos-querier-token-6nmjg 2022-08-10T01:08:50Z

Comment 7 Xingxing Xia 2022-08-11 10:37:32 UTC
*** Bug 2115376 has been marked as a duplicate of this bug. ***

Comment 10 Hongkai Liu 2022-08-16 13:40:41 UTC
The issue is fixed in 4.12.0-ec.1:

oc --context build02 get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-ec.1   True        False         18h     Cluster version is 4.12.0-ec.1

