Bug 1882450 - kube garbage collector picks deployments that still have active ownerReferences to a custom resource
Summary: kube garbage collector picks deployments that still have active ownerReferences to a custom resource
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Lukasz Szaszkiewicz
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-09-24 15:21 UTC by Periklis Tsirakidis
Modified: 2024-03-25 16:34 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-24 10:53:37 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/cluster-logging-operator pull 713 (closed): Bug 1880926: Fix owner references for ClusterLogging CR (last updated 2021-01-14 09:47:11 UTC)
Github openshift/elasticsearch-operator pull 498 (closed): Bug 1880926: Fix owner references for elasticsearch CR (last updated 2021-01-14 09:47:12 UTC)
Red Hat Knowledge Base (Solution) 5439621 (last updated 2020-10-05 17:06:20 UTC)

Description Periklis Tsirakidis 2020-09-24 15:21:06 UTC
Description of problem:

The investigation of the following BZ [1], on the cluster described in [2], uncovered that the mentioned deployment resources are getting garbage collected although their ownerReferences to the Elasticsearch CR are present and the CR itself is alive.

Checking the audit logs, one can spot the delete events for the deployments only. There are no delete events for the custom resources, e.g. Elasticsearch:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"80e425e0-2165-4e55-a705-4fca893430ee","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/openshift-logging/deployments/elasticsearch-cdm-2wg9lezz-3","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"54762d94-cf4c-4103-9813-4dc2c7a9f944","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["10.0.0.6"],"userAgent":"kube-controller-manager/v1.19.0+f5121a6 (linux/amd64) kubernetes/f5121a6/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"deployments","namespace":"openshift-logging","name":"elasticsearch-cdm-2wg9lezz-3","apiGroup":"apps","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-21T02:40:18.348424Z","stageTimestamp":"2020-09-21T02:40:18.415860Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:kube-system:generic-garbage-collector","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"9ff43d02-7740-4423-a0cb-d8e057dc1e83","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/openshift-logging/deployments/elasticsearch-cdm-2wg9lezz-2","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"54762d94-cf4c-4103-9813-4dc2c7a9f944","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["10.0.0.6"],"userAgent":"kube-controller-manager/v1.19.0+f5121a6 (linux/amd64) kubernetes/f5121a6/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"deployments","namespace":"openshift-logging","name":"elasticsearch-cdm-2wg9lezz-2","apiGroup":"apps","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-21T02:40:18.458157Z","stageTimestamp":"2020-09-21T02:40:18.502307Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:kube-system:generic-garbage-collector","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"683e3884-6971-4b85-8a44-1571e7e8fedb","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/openshift-logging/deployments/elasticsearch-cdm-2wg9lezz-1","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"54762d94-cf4c-4103-9813-4dc2c7a9f944","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["10.0.0.6"],"userAgent":"kube-controller-manager/v1.19.0+f5121a6 (linux/amd64) kubernetes/f5121a6/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"deployments","namespace":"openshift-logging","name":"elasticsearch-cdm-2wg9lezz-1","apiGroup":"apps","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-21T02:40:18.554628Z","stageTimestamp":"2020-09-21T02:40:18.624505Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:kube-system:generic-garbage-collector","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}


Version-Release number of selected component (if applicable): 4.6.0


How reproducible:
It is not clear why this keeps happening, but we spotted similar cases like [3] and [4]; both lack evidence from the audit logs.

Additional info:

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1880926
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1880926#c1
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1868300
[4] https://bugzilla.redhat.com/show_bug.cgi?id=1873652

Comment 1 Periklis Tsirakidis 2020-09-24 15:43:41 UTC
Regarding the customer case attached by Andre Costa: here are audit logs showing the case where the Elasticsearch CR as well as the deployments get garbage collected. For the record, the ownerReferences hierarchy in the cluster-logging stack is:

ClusterLogging (CR) -> Elasticsearch (CR) -> Deployments/Services/etc.

- Elasticsearch CR
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"201bbb81-0a33-4450-8843-28cf43db36f8","stage":"ResponseComplete","requestURI":"/apis/logging.openshift.io/v1/namespaces/openshift-logging/elasticsearches/elasticsearch","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"05546566-bb88-44bd-9cf2-417f54ac7e21","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["::1"],"userAgent":"kube-controller-manager/v1.18.3+b0068a8 (linux/amd64) kubernetes/b0068a8/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"elasticsearches","namespace":"openshift-logging","name":"elasticsearch","apiGroup":"logging.openshift.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-24T14:31:05.885138Z","stageTimestamp":"2020-09-24T14:31:05.921780Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}



- Deployments for elasticsearch nodes:
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"11282bde-03d9-46c9-8886-7fe6f71c4d0d","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/openshift-logging/deployments/elasticsearch-cdm-gwanpq2e-1","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"05546566-bb88-44bd-9cf2-417f54ac7e21","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["::1"],"userAgent":"kube-controller-manager/v1.18.3+b0068a8 (linux/amd64) kubernetes/b0068a8/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"deployments","namespace":"openshift-logging","name":"elasticsearch-cdm-gwanpq2e-1","apiGroup":"apps","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-24T14:31:07.156835Z","stageTimestamp":"2020-09-24T14:31:07.170630Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"e2b0d445-df80-4c20-867c-1e47d980b7e7","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/openshift-logging/deployments/elasticsearch-cdm-gwanpq2e-3","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"05546566-bb88-44bd-9cf2-417f54ac7e21","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["::1"],"userAgent":"kube-controller-manager/v1.18.3+b0068a8 (linux/amd64) kubernetes/b0068a8/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"deployments","namespace":"openshift-logging","name":"elasticsearch-cdm-gwanpq2e-3","apiGroup":"apps","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-24T14:31:07.343443Z","stageTimestamp":"2020-09-24T14:31:07.363623Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"5b4dd823-3d10-402f-b232-40a3b466889d","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/openshift-logging/deployments/elasticsearch-cdm-gwanpq2e-2","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"05546566-bb88-44bd-9cf2-417f54ac7e21","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["::1"],"userAgent":"kube-controller-manager/v1.18.3+b0068a8 (linux/amd64) kubernetes/b0068a8/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"deployments","namespace":"openshift-logging","name":"elasticsearch-cdm-gwanpq2e-2","apiGroup":"apps","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-24T14:31:07.356816Z","stageTimestamp":"2020-09-24T14:31:07.370929Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}

Comment 3 Periklis Tsirakidis 2020-09-25 09:28:32 UTC
To help de-escalate, I believe I have identified the issue with garbage collection of elasticsearch-operator-managed resources.

Thanks to @deads2k's hint on [1], and considering the linked issues and user reports, I conclude that the issue described above happens because the elasticsearch-operator puts owner references on cluster-scoped child resources (e.g. ClusterRole, ClusterRoleBinding) that link to a namespace-scoped resource, i.e. the Elasticsearch CR. I believe this conclusion is also supported by the official docs [2]. A short sketch of the disallowed pattern follows the links below.

To mitigate this I've created a PR for the elasticsearch-operator [3].

Thus my ask here: can someone confirm or refute my conclusion and the proposed solution for the elasticsearch-operator?

[1] https://github.com/kubernetes/kubernetes/issues/65200
[2] https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents
[3] https://github.com/openshift/elasticsearch-operator/pull/498
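
For illustration only, a minimal Go sketch of the disallowed pattern (resource names are hypothetical, not taken from the operator code): a cluster-scoped ClusterRole carrying an ownerReference that points at the namespace-scoped Elasticsearch CR, which [2] rules out because cluster-scoped dependents may only name cluster-scoped owners.

// Sketch of the disallowed pattern: a cluster-scoped ClusterRole whose
// ownerReference points at the namespace-scoped Elasticsearch CR.
package example

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// invalidClusterRole shows the problematic wiring: the dependent has no
// namespace at all, yet its owner lives in openshift-logging.
func invalidClusterRole(ownerName string, ownerUID types.UID) *rbacv1.ClusterRole {
	return &rbacv1.ClusterRole{
		ObjectMeta: metav1.ObjectMeta{
			Name: "elasticsearch-metrics-reader", // hypothetical name
			OwnerReferences: []metav1.OwnerReference{{
				// Invalid: cluster-scoped dependents may only reference
				// cluster-scoped owners; Elasticsearch is namespace-scoped.
				APIVersion: "logging.openshift.io/v1",
				Kind:       "Elasticsearch",
				Name:       ownerName,
				UID:        ownerUID,
			}},
		},
	}
}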

Comment 5 Lukasz Szaszkiewicz 2020-10-02 10:22:14 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 6 Masaki Hatada 2020-10-05 10:26:02 UTC
Dear Red Hat,

We have faced this issue in a customer environment.
https://github.com/openshift/elasticsearch-operator/pull/498 covers only the elasticsearch-operator, but we think the same issue also exists in the cluster-logging-operator.

The following cluster-scoped resources have the ClusterLogging CR (a namespace-scoped object) in their ownerReferences:

- ClusterRole/metadata-reader
- ClusterRoleBinding/cluster-logging-metadata-reader

Could Red Hat fix the cluster-logging-operator together with the elasticsearch-operator?

Best Regards,
Masaki Hatada

Comment 7 Periklis Tsirakidis 2020-10-05 10:51:14 UTC
(In reply to Masaki Hatada from comment #6)
> Dear Red Hat,
> 
> We have faced this issue in a customer environment.
> https://github.com/openshift/elasticsearch-operator/pull/498 covers only the
> elasticsearch-operator, but we think the same issue also exists in the
> cluster-logging-operator.
> 
> The following cluster-scoped resources have the ClusterLogging CR (a
> namespace-scoped object) in their ownerReferences:
> 
> - ClusterRole/metadata-reader
> - ClusterRoleBinding/cluster-logging-metadata-reader
> 
> Could Red Hat fix the cluster-logging-operator together with the
> elasticsearch-operator?
> 
> Best Regards,
> Masaki Hatada

@Masaki Hatada

There is already a fix for the cluster-logging-operator in [1], and it is going to be backported to 4.5.z in [2].

[1] https://github.com/openshift/cluster-logging-operator/pull/713
[2] https://github.com/openshift/cluster-logging-operator/pull/718

Comment 8 Masaki Hatada 2020-10-05 10:58:38 UTC
> @Masaki Hatada
> 
> There is already a fix for the cluster-logging-operator in [1], and it is
> going to be backported to 4.5.z in [2].
> 
> [1] https://github.com/openshift/cluster-logging-operator/pull/713
> [2] https://github.com/openshift/cluster-logging-operator/pull/718

Thank you! It's a very good info for us!

Comment 9 Lukasz Szaszkiewicz 2020-10-23 07:40:37 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 10 Lukasz Szaszkiewicz 2020-11-05 16:38:53 UTC
Periklis Tsirakidis, many thanks for finding the root cause and providing the fixes. If I read this correctly, both the elasticsearch-operator and the cluster-logging-operator were setting cross-scope owner references (namespace-scoped owners on cluster-scoped dependents), which is disallowed.
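
For contrast, a minimal sketch of the allowed pattern (assumed names, not the literal PR diff): the namespace-scoped CR owns only namespace-scoped children in its own namespace, such as the deployments, while cluster-scoped objects are left without owner references and cleaned up explicitly by the operator instead.

// Sketch of the allowed pattern: a namespace-scoped owner referencing a
// namespace-scoped dependent in the same namespace.
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// ownedDeployment sets a same-namespace controller reference, which the
// garbage collector resolves correctly: the deployment is removed only when
// the Elasticsearch CR itself is deleted.
func ownedDeployment(name, namespace, ownerName string, ownerUID types.UID) *appsv1.Deployment {
	controller := true
	return &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: namespace, // same namespace as the owning CR
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "logging.openshift.io/v1",
				Kind:       "Elasticsearch",
				Name:       ownerName,
				UID:        ownerUID,
				Controller: &controller,
			}},
		},
	}
}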

Comment 12 Xingxing Xia 2020-11-24 10:02:16 UTC
I investigated this bug by checking the symptoms and the related PRs, upstream docs, and links. The cause seems to be that cluster-scoped resources such as ClusterRole set ownerReferences to the namespace-scoped elasticsearch and clusterlogging CRs, which is disallowed by design; accordingly, the fix PRs are https://github.com/openshift/cluster-logging-operator/pull/713 and https://github.com/openshift/elasticsearch-operator/pull/498 rather than changes to the kube-apiserver repos. If this is true, the bug needs to be verified in a cluster that deploys logging, and it would be better to move it to the "Logging" component, like bug 1880926. If so, could you move it? Thanks :)

Comment 13 Periklis Tsirakidis 2020-11-24 10:08:02 UTC
(In reply to Xingxing Xia from comment #12)
> I investigated this bug by checking the symptoms and the related PRs,
> upstream docs, and links. The cause seems to be that cluster-scoped resources
> such as ClusterRole set ownerReferences to the namespace-scoped elasticsearch
> and clusterlogging CRs, which is disallowed by design; accordingly, the fix
> PRs are https://github.com/openshift/cluster-logging-operator/pull/713 and
> https://github.com/openshift/elasticsearch-operator/pull/498 rather than
> changes to the kube-apiserver repos. If this is true, the bug needs to be
> verified in a cluster that deploys logging, and it would be better to move it
> to the "Logging" component, like bug 1880926. If so, could you move it?
> Thanks :)

We don't need to move this to Logging. If it is verified that the issue is only related to cluster-scoped resources such as ClusterRole setting ownerReferences to the namespace-scoped elasticsearch and clusterlogging CRs, then [1] has already verified this behaviour based on the two PRs you mentioned. We can close this for kube-apiserver as WORKSFORME or NOTABUG.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1880926

