Bug 1882450
Summary: | kube garbage collector picks deployments that still have active ownerReferences to a custom resource | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Periklis Tsirakidis <periklis>
Component: | kube-apiserver | Assignee: | Lukasz Szaszkiewicz <lszaszki>
Status: | CLOSED WORKSFORME | QA Contact: | Xingxing Xia <xxia>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 4.5 | CC: | andcosta, aos-bugs, jnordell, kewang, mas-hatada, mfojtik, mfuruta, ssadhale, stwalter, xxia
Target Milestone: | --- | |
Target Release: | 4.7.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-11-24 10:53:37 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Periklis Tsirakidis
2020-09-24 15:21:06 UTC
Regarding the customer case attached by Andre Costa: here are audit logs showing that the Elasticsearch CR as well as the deployments get garbage collected. For the record, the ownerReferences hierarchy in the cluster-logging stack is: ClusterLogging (CR) -> Elasticsearch (CR) -> Deployments/Services/etc.

- Elasticsearch CR:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"201bbb81-0a33-4450-8843-28cf43db36f8","stage":"ResponseComplete","requestURI":"/apis/logging.openshift.io/v1/namespaces/openshift-logging/elasticsearches/elasticsearch","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"05546566-bb88-44bd-9cf2-417f54ac7e21","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["::1"],"userAgent":"kube-controller-manager/v1.18.3+b0068a8 (linux/amd64) kubernetes/b0068a8/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"elasticsearches","namespace":"openshift-logging","name":"elasticsearch","apiGroup":"logging.openshift.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-24T14:31:05.885138Z","stageTimestamp":"2020-09-24T14:31:05.921780Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}

- Deployments for the elasticsearch nodes:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"11282bde-03d9-46c9-8886-7fe6f71c4d0d","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/openshift-logging/deployments/elasticsearch-cdm-gwanpq2e-1","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"05546566-bb88-44bd-9cf2-417f54ac7e21","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["::1"],"userAgent":"kube-controller-manager/v1.18.3+b0068a8 (linux/amd64) kubernetes/b0068a8/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"deployments","namespace":"openshift-logging","name":"elasticsearch-cdm-gwanpq2e-1","apiGroup":"apps","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-24T14:31:07.156835Z","stageTimestamp":"2020-09-24T14:31:07.170630Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"e2b0d445-df80-4c20-867c-1e47d980b7e7","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/openshift-logging/deployments/elasticsearch-cdm-gwanpq2e-3","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"05546566-bb88-44bd-9cf2-417f54ac7e21","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["::1"],"userAgent":"kube-controller-manager/v1.18.3+b0068a8 (linux/amd64) kubernetes/b0068a8/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"deployments","namespace":"openshift-logging","name":"elasticsearch-cdm-gwanpq2e-3","apiGroup":"apps","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-24T14:31:07.343443Z","stageTimestamp":"2020-09-24T14:31:07.363623Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"5b4dd823-3d10-402f-b232-40a3b466889d","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/openshift-logging/deployments/elasticsearch-cdm-gwanpq2e-2","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"05546566-bb88-44bd-9cf2-417f54ac7e21","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["::1"],"userAgent":"kube-controller-manager/v1.18.3+b0068a8 (linux/amd64) kubernetes/b0068a8/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"deployments","namespace":"openshift-logging","name":"elasticsearch-cdm-gwanpq2e-2","apiGroup":"apps","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-24T14:31:07.356816Z","stageTimestamp":"2020-09-24T14:31:07.370929Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}

To help de-escalate, I believe I have identified the issue with garbage collection for elasticsearch-operator-managed resources. Thanks to @deads2k's hint in [1], and considering the linked issues and user reports, I conclude that the issue described above happens because the elasticsearch-operator puts owner references on cluster-scoped child resources (e.g. ClusterRole, ClusterRoleBinding) that link to a namespace-scoped resource, namely the Elasticsearch CR. I believe this conclusion is also supported by the official docs [2]. To mitigate this I have created a PR for the elasticsearch-operator [3].

Thus my ask here: can someone confirm or refute my conclusion and the proposed solution for the elasticsearch-operator?

[1] https://github.com/kubernetes/kubernetes/issues/65200
[2] https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents
[3] https://github.com/openshift/elasticsearch-operator/pull/498
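To make the disallowed pattern concrete, here is a minimal Go sketch of what such a cross-scoped reference looks like. It is illustrative only and not taken from the operator code; the ClusterRole name and the owner UID are placeholders.

```go
package main

import (
	"fmt"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

func main() {
	// Placeholder UID standing in for the UID the operator would read back
	// from the Elasticsearch CR it created.
	ownerUID := types.UID("00000000-0000-0000-0000-000000000000")

	// The disallowed pattern: a cluster-scoped dependent (ClusterRole) whose
	// ownerReference points at a namespace-scoped owner (the Elasticsearch CR
	// in openshift-logging). An OwnerReference carries no namespace field, so
	// for a cluster-scoped dependent the garbage collector can only resolve
	// the owner at cluster scope; per the upstream docs, cluster-scoped
	// dependents may only name cluster-scoped owners.
	clusterRole := rbacv1.ClusterRole{
		ObjectMeta: metav1.ObjectMeta{
			Name: "elasticsearch-metrics-reader", // placeholder name
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "logging.openshift.io/v1",
				Kind:       "Elasticsearch",
				Name:       "elasticsearch",
				UID:        ownerUID,
			}},
		},
	}

	fmt.Printf("ClusterRole %q owner refs: %+v\n", clusterRole.Name, clusterRole.OwnerReferences)
}
```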
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Dear Red Hat,

We have faced this issue in our customer env. https://github.com/openshift/elasticsearch-operator/pull/498 is just for the elasticsearch-operator, but we think the same issue exists in the cluster-logging-operator as well. The following cluster-scoped resources have the ClusterLogging CR (a namespace-scoped object) in their ownerReferences:

- ClusterRole/metadata-reader
- ClusterRoleBinding/cluster-logging-metadata-reader

Could Red Hat fix the cluster-logging-operator together with the elasticsearch-operator?

Best Regards,
Masaki Hatada

(In reply to Masaki Hatada from comment #6)
> Dear Red Hat,
>
> We have faced this issue in our customer env.
> https://github.com/openshift/elasticsearch-operator/pull/498 is just for
> the elasticsearch-operator, but we think the same issue exists in the
> cluster-logging-operator as well.
>
> The following cluster-scoped resources have the ClusterLogging CR
> (a namespace-scoped object) in their ownerReferences:
>
> - ClusterRole/metadata-reader
> - ClusterRoleBinding/cluster-logging-metadata-reader
>
> Could Red Hat fix the cluster-logging-operator together with the
> elasticsearch-operator?
>
> Best Regards,
> Masaki Hatada

@Masaki Hatada

There is already a fix for the cluster-logging-operator in [1] and it is going to be backported to 4.5.z in [2].

[1] https://github.com/openshift/cluster-logging-operator/pull/713
[2] https://github.com/openshift/cluster-logging-operator/pull/718

> @Masaki Hatada
>
> There is already a fix for cluster-logging-operator in [1] and is going to
> be backported for 4.5.z in [2]
>
> [1] https://github.com/openshift/cluster-logging-operator/pull/713
> [2] https://github.com/openshift/cluster-logging-operator/pull/718
Thank you! That's very good info for us!
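For contrast, a minimal sketch of the direction of ownership that remains valid after these fixes: a namespace-scoped dependent (a Deployment) owned by the namespace-scoped Elasticsearch CR in the same namespace. Again illustrative only; the Deployment name and the owner UID are placeholders.

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

func main() {
	isController := true

	// Allowed direction of ownership in this stack: a namespace-scoped
	// dependent (Deployment) owned by a namespace-scoped owner (Elasticsearch
	// CR) in the same namespace. The garbage collector resolves the owner in
	// the dependent's namespace, so the Deployment is only collected once the
	// Elasticsearch CR is actually gone.
	deployment := appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "elasticsearch-cdm-example-1", // placeholder name
			Namespace: "openshift-logging",
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "logging.openshift.io/v1",
				Kind:       "Elasticsearch",
				Name:       "elasticsearch",
				UID:        types.UID("00000000-0000-0000-0000-000000000000"), // placeholder UID
				Controller: &isController,
			}},
		},
	}

	fmt.Printf("%s owned by %s/%s\n", deployment.Name,
		deployment.OwnerReferences[0].Kind, deployment.OwnerReferences[0].Name)
}
```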
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Periklis Tsirakidis, many thanks for finding the root cause and providing the fixes. If I read this correctly, both the elasticsearch-operator and the cluster-logging-operator were setting owner references across scopes (cluster-scoped dependents pointing at namespace-scoped owners), which is disallowed.

I investigated this bug by checking the symptoms and the related PRs, upstream docs and links. The cause seems to be: cluster-scoped resources like ClusterRole set ownerReferences to the namespace-scoped elasticsearch and clusterlogging CRs, which is disallowed by design, and therefore the fix PRs are https://github.com/openshift/cluster-logging-operator/pull/713 and https://github.com/openshift/elasticsearch-operator/pull/498 rather than changes in the kube-apiserver repos. If this is true, the bug needs to be verified in a cluster that deploys logging, and it would be better moved to the "Logging" component like bug 1880926. If so, could you move it? Thanks :)

(In reply to Xingxing Xia from comment #12)
> I investigated this bug by checking the symptoms and the related PRs,
> upstream docs and links. The cause seems to be: cluster-scoped resources
> like ClusterRole set ownerReferences to the namespace-scoped elasticsearch
> and clusterlogging CRs, which is disallowed by design, and therefore the
> fix PRs are https://github.com/openshift/cluster-logging-operator/pull/713
> and https://github.com/openshift/elasticsearch-operator/pull/498 rather
> than changes in the kube-apiserver repos. If this is true, the bug needs to
> be verified in a cluster that deploys logging, and it would be better moved
> to the "Logging" component like bug 1880926. If so, could you move it?
> Thanks :)

We don't need to move this to logging. If you have verified that the issue is only related to cluster-scoped resources like ClusterRole setting ownerReferences to the namespace-scoped elasticsearch and clusterlogging CRs, then [1] already verified this behaviour based on the two PRs you mentioned. We can close this for kube-apiserver as WORKSFORME or NOTABUG.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1880926
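For anyone checking a cluster for this class of problem, here is a small client-go sketch. It is not part of the fix or the verification done for this bug; `namespacedOwnerKinds` and the kubeconfig path are assumptions for illustration, and only ClusterRoles and ClusterRoleBindings are scanned.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

// namespacedOwnerKinds lists kinds known to be namespace-scoped in this
// stack; a cluster-scoped dependent must not reference them as owners.
var namespacedOwnerKinds = map[string]bool{
	"Elasticsearch":  true,
	"ClusterLogging": true,
}

// flagInvalid prints any ownerReference on a cluster-scoped object that
// points at a namespace-scoped kind.
func flagInvalid(kind, name string, refs []metav1.OwnerReference) {
	for _, ref := range refs {
		if namespacedOwnerKinds[ref.Kind] {
			fmt.Printf("%s/%s has a namespace-scoped owner %s/%s (disallowed)\n",
				kind, name, ref.Kind, ref.Name)
		}
	}
}

func main() {
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()

	roles, err := client.RbacV1().ClusterRoles().List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, cr := range roles.Items {
		flagInvalid("ClusterRole", cr.Name, cr.OwnerReferences)
	}

	bindings, err := client.RbacV1().ClusterRoleBindings().List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, crb := range bindings.Items {
		flagInvalid("ClusterRoleBinding", crb.Name, crb.OwnerReferences)
	}
}
```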