Bug 1882450
Summary: | kube garbage collector picks deployments that still have active ownerReferences to a custom resource | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Periklis Tsirakidis <periklis>
Component: | kube-apiserver | Assignee: | Lukasz Szaszkiewicz <lszaszki>
Status: | CLOSED WORKSFORME | QA Contact: | Xingxing Xia <xxia>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 4.5 | CC: | andcosta, aos-bugs, jnordell, kewang, mas-hatada, mfojtik, mfuruta, ssadhale, stwalter, xxia
Target Milestone: | --- | |
Target Release: | 4.7.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-11-24 10:53:37 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Periklis Tsirakidis
2020-09-24 15:21:06 UTC
Regarding the customer case attached by Andre Costa: here are audit logs showing that the Elasticsearch CR as well as the deployments get garbage collected. For the record, the ownerReferences hierarchy in the cluster-logging stack is: ClusterLogging (CR) -> Elasticsearch (CR) -> Deployments/Services/etc.

- Elasticsearch CR:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"201bbb81-0a33-4450-8843-28cf43db36f8","stage":"ResponseComplete","requestURI":"/apis/logging.openshift.io/v1/namespaces/openshift-logging/elasticsearches/elasticsearch","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"05546566-bb88-44bd-9cf2-417f54ac7e21","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["::1"],"userAgent":"kube-controller-manager/v1.18.3+b0068a8 (linux/amd64) kubernetes/b0068a8/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"elasticsearches","namespace":"openshift-logging","name":"elasticsearch","apiGroup":"logging.openshift.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-24T14:31:05.885138Z","stageTimestamp":"2020-09-24T14:31:05.921780Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}

- Deployments for the elasticsearch nodes:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"11282bde-03d9-46c9-8886-7fe6f71c4d0d","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/openshift-logging/deployments/elasticsearch-cdm-gwanpq2e-1","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"05546566-bb88-44bd-9cf2-417f54ac7e21","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["::1"],"userAgent":"kube-controller-manager/v1.18.3+b0068a8 (linux/amd64) kubernetes/b0068a8/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"deployments","namespace":"openshift-logging","name":"elasticsearch-cdm-gwanpq2e-1","apiGroup":"apps","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-24T14:31:07.156835Z","stageTimestamp":"2020-09-24T14:31:07.170630Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"e2b0d445-df80-4c20-867c-1e47d980b7e7","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/openshift-logging/deployments/elasticsearch-cdm-gwanpq2e-3","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"05546566-bb88-44bd-9cf2-417f54ac7e21","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["::1"],"userAgent":"kube-controller-manager/v1.18.3+b0068a8 (linux/amd64) kubernetes/b0068a8/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"deployments","namespace":"openshift-logging","name":"elasticsearch-cdm-gwanpq2e-3","apiGroup":"apps","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-24T14:31:07.343443Z","stageTimestamp":"2020-09-24T14:31:07.363623Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"5b4dd823-3d10-402f-b232-40a3b466889d","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/openshift-logging/deployments/elasticsearch-cdm-gwanpq2e-2","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"05546566-bb88-44bd-9cf2-417f54ac7e21","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["::1"],"userAgent":"kube-controller-manager/v1.18.3+b0068a8 (linux/amd64) kubernetes/b0068a8/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"deployments","namespace":"openshift-logging","name":"elasticsearch-cdm-gwanpq2e-2","apiGroup":"apps","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-24T14:31:07.356816Z","stageTimestamp":"2020-09-24T14:31:07.370929Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}

To help de-escalate, I believe I have identified the issue with garbage collection for elasticsearch-operator-managed resources. Thanks to @deads2k's hint in [1], and considering the linked issues and user reports, I conclude that the issue described above happens because the elasticsearch-operator puts owner references on cluster-scoped child resources (e.g. ClusterRole, ClusterRoleBinding) that link to a namespace-scoped resource, namely the Elasticsearch CR. I believe this conclusion is also supported by the official docs [2]. To mitigate this I have created a PR for the elasticsearch-operator [3].

Thus my ask here: can someone confirm or refute my conclusion and the proposed solution for the elasticsearch-operator?

[1] https://github.com/kubernetes/kubernetes/issues/65200
[2] https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents
[3] https://github.com/openshift/elasticsearch-operator/pull/498
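To make the disallowed pattern concrete, here is a minimal Go sketch of what such a cross-scoped reference looks like. It is illustrative only and not taken from the operator code; the ClusterRole name and the owner UID are placeholders.

```go
package main

import (
	"fmt"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

func main() {
	// Placeholder UID standing in for the UID the operator would read back
	// from the Elasticsearch CR it created.
	ownerUID := types.UID("00000000-0000-0000-0000-000000000000")

	// The disallowed pattern: a cluster-scoped dependent (ClusterRole) whose
	// ownerReference points at a namespace-scoped owner (the Elasticsearch CR
	// in openshift-logging). An OwnerReference carries no namespace field, so
	// for a cluster-scoped dependent the garbage collector can only resolve
	// the owner at cluster scope; per the upstream docs, cluster-scoped
	// dependents may only name cluster-scoped owners.
	clusterRole := rbacv1.ClusterRole{
		ObjectMeta: metav1.ObjectMeta{
			Name: "elasticsearch-metrics-reader", // placeholder name
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "logging.openshift.io/v1",
				Kind:       "Elasticsearch",
				Name:       "elasticsearch",
				UID:        ownerUID,
			}},
		},
	}

	fmt.Printf("ClusterRole %q owner refs: %+v\n", clusterRole.Name, clusterRole.OwnerReferences)
}
```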
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Dear Red Hat,

We have faced this issue in our customer env. https://github.com/openshift/elasticsearch-operator/pull/498 is just for the elasticsearch-operator, but we think the same issue exists in the cluster-logging-operator as well. The following cluster-scoped resources have the ClusterLogging CR (a namespace-scoped object) in their ownerReferences:

- ClusterRole/metadata-reader
- ClusterRoleBinding/cluster-logging-metadata-reader

Could Red Hat fix the cluster-logging-operator together with the elasticsearch-operator?

Best Regards,
Masaki Hatada

(In reply to Masaki Hatada from comment #6)
> Dear Red Hat,
>
> We have faced this issue in our customer env.
> https://github.com/openshift/elasticsearch-operator/pull/498 is just for
> the elasticsearch-operator, but we think the same issue exists in the
> cluster-logging-operator as well.
>
> The following cluster-scoped resources have the ClusterLogging CR
> (a namespace-scoped object) in their ownerReferences:
>
> - ClusterRole/metadata-reader
> - ClusterRoleBinding/cluster-logging-metadata-reader
>
> Could Red Hat fix the cluster-logging-operator together with the
> elasticsearch-operator?
>
> Best Regards,
> Masaki Hatada

@Masaki Hatada

There is already a fix for the cluster-logging-operator in [1] and it is going to be backported to 4.5.z in [2].

[1] https://github.com/openshift/cluster-logging-operator/pull/713
[2] https://github.com/openshift/cluster-logging-operator/pull/718

> @Masaki Hatada
>
> There is already a fix for cluster-logging-operator in [1] and is going to
> be backported for 4.5.z in [2]
>
> [1] https://github.com/openshift/cluster-logging-operator/pull/713
> [2] https://github.com/openshift/cluster-logging-operator/pull/718
Thank you! That's very good info for us!
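For contrast, a minimal sketch of the direction of ownership that remains valid after these fixes: a namespace-scoped dependent (a Deployment) owned by the namespace-scoped Elasticsearch CR in the same namespace. Again illustrative only; the Deployment name and the owner UID are placeholders.

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

func main() {
	isController := true

	// Allowed direction of ownership in this stack: a namespace-scoped
	// dependent (Deployment) owned by a namespace-scoped owner (Elasticsearch
	// CR) in the same namespace. The garbage collector resolves the owner in
	// the dependent's namespace, so the Deployment is only collected once the
	// Elasticsearch CR is actually gone.
	deployment := appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "elasticsearch-cdm-example-1", // placeholder name
			Namespace: "openshift-logging",
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "logging.openshift.io/v1",
				Kind:       "Elasticsearch",
				Name:       "elasticsearch",
				UID:        types.UID("00000000-0000-0000-0000-000000000000"), // placeholder UID
				Controller: &isController,
			}},
		},
	}

	fmt.Printf("%s owned by %s/%s\n", deployment.Name,
		deployment.OwnerReferences[0].Kind, deployment.OwnerReferences[0].Name)
}
```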
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Periklis Tsirakidis, many thanks for finding the root cause and providing the fixes. If I read this correctly, both the elasticsearch-operator and the cluster-logging-operator were setting owner references across scopes (cluster-scoped dependents pointing at namespace-scoped owners), which is disallowed.

I investigated this bug by checking the symptoms and the related PRs, upstream docs and links. The cause seems to be: cluster-scoped resources like ClusterRole set ownerReferences to the namespace-scoped elasticsearch and clusterlogging CRs, which is disallowed by design, and therefore the fix PRs are https://github.com/openshift/cluster-logging-operator/pull/713 and https://github.com/openshift/elasticsearch-operator/pull/498 rather than changes in the kube-apiserver repos. If this is true, the bug needs to be verified in a cluster that deploys logging, and it would be better moved to the "Logging" component like bug 1880926. If so, could you move it? Thanks :)

(In reply to Xingxing Xia from comment #12)
> I investigated this bug by checking the symptoms and the related PRs,
> upstream docs and links. The cause seems to be: cluster-scoped resources
> like ClusterRole set ownerReferences to the namespace-scoped elasticsearch
> and clusterlogging CRs, which is disallowed by design, and therefore the
> fix PRs are https://github.com/openshift/cluster-logging-operator/pull/713
> and https://github.com/openshift/elasticsearch-operator/pull/498 rather
> than changes in the kube-apiserver repos. If this is true, the bug needs to
> be verified in a cluster that deploys logging, and it would be better moved
> to the "Logging" component like bug 1880926. If so, could you move it?
> Thanks :)

We don't need to move this to logging. If you have verified that the issue is only related to cluster-scoped resources like ClusterRole setting ownerReferences to the namespace-scoped elasticsearch and clusterlogging CRs, then [1] already verified this behaviour based on the two PRs you mentioned. We can close this for kube-apiserver as WORKSFORME or NOTABUG.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1880926
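For anyone checking a cluster for this class of problem, here is a small client-go sketch. It is not part of the fix or the verification done for this bug; `namespacedOwnerKinds` and the kubeconfig path are assumptions for illustration, and only ClusterRoles and ClusterRoleBindings are scanned.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

// namespacedOwnerKinds lists kinds known to be namespace-scoped in this
// stack; a cluster-scoped dependent must not reference them as owners.
var namespacedOwnerKinds = map[string]bool{
	"Elasticsearch":  true,
	"ClusterLogging": true,
}

// flagInvalid prints any ownerReference on a cluster-scoped object that
// points at a namespace-scoped kind.
func flagInvalid(kind, name string, refs []metav1.OwnerReference) {
	for _, ref := range refs {
		if namespacedOwnerKinds[ref.Kind] {
			fmt.Printf("%s/%s has a namespace-scoped owner %s/%s (disallowed)\n",
				kind, name, ref.Kind, ref.Name)
		}
	}
}

func main() {
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()

	roles, err := client.RbacV1().ClusterRoles().List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, cr := range roles.Items {
		flagInvalid("ClusterRole", cr.Name, cr.OwnerReferences)
	}

	bindings, err := client.RbacV1().ClusterRoleBindings().List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, crb := range bindings.Items {
		flagInvalid("ClusterRoleBinding", crb.Name, crb.OwnerReferences)
	}
}
```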