Bug 1890182
| Summary: | DaemonSet with existing owner garbage collected | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Matthew Robson <mrobson> |
| Component: | kube-controller-manager | Assignee: | Maciej Szulik <maszulik> |
| Status: | CLOSED ERRATA | QA Contact: | zhou ying <yinzhou> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.5 | CC: | aos-bugs, karsharm, maszulik, mfojtik, rugouvei |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.7.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-24 15:27:22 UTC | Type: | Bug |
The processing of the TridentProvisioner right before the delete is odd in that the namespace is blank, which suggests the garbage collector was still resolving the owner through a "virtual" graph node it had not yet observed directly:

    October 19th 2020, 12:24:44.736 I1019 19:24:44.736416 1 garbagecollector.go:404] processing item [trident.netapp.io/v1/TridentProvisioner, namespace: , name: trident, uid: 2e0df0a5-0b40-4f9e-a356-a56c997b82d4]

Here is the garbage collector processing the TridentProvisioner 2s after the delete of the DaemonSet occurred:

    <massive amount of data> Operation cannot be fulfilled on daemonsets.apps "trident-csi": the object has been modified; please apply your changes to the latest version and try again
    I1019 19:24:47.881119 1 garbagecollector.go:404] processing item [trident.netapp.io/v1/TridentProvisioner, namespace: trident, name: trident, uid: 2e0df0a5-0b40-4f9e-a356-a56c997b82d4]
    I1019 19:24:47.912415 1 garbagecollector.go:447] object [trident.netapp.io/v1/TridentProvisioner, namespace: trident, name: trident, uid: 2e0df0a5-0b40-4f9e-a356-a56c997b82d4]'s doesn't have an owner, continue on next item
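By this point the owner clearly resolves, so the ownership chain can be re-checked by hand. A minimal sketch, assuming access to the live cluster (object names and namespace are taken from the logs above):

```sh
# Owner references recorded on the dependent DaemonSet:
oc get daemonset trident-csi -n trident \
  -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}/{.name} uid={.uid}{"\n"}{end}'

# The owner the GC was resolving; its live UID should match the reference above:
oc get tridentprovisioner trident -n trident -o jsonpath='{.metadata.uid}{"\n"}'
```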
The logs also show a lot of messages like:

    October 19th 2020, 12:24:48.369 E1019 19:24:48.369066 1 daemon_controller.go:332] trident/trident-csi failed with : error storing status for daemon set &v1.DaemonSet
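These are ordinary optimistic-concurrency conflicts on status updates, but their volume is easy to gauge. A rough sketch (the kube-controller-manager pod name is cluster-specific; substitute your control-plane node's pod):

```sh
# Count daemon_controller status-update failures in one KCM instance's log.
oc logs -n openshift-kube-controller-manager kube-controller-manager-<node-name> \
  | grep -c 'error storing status for daemon set'
```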
The new DaemonSet references the same uid '2e0df0a5-0b40-4f9e-a356-a56c997b82d4' for the TridentProvisioner:
    kind: DaemonSet
    metadata:
      annotations:
        deprecated.daemonset.template.generation: "1"
      creationTimestamp: "2020-10-19T19:24:46Z"
      generation: 1
      ...
      name: trident-csi
      namespace: trident
      ownerReferences:
      - apiVersion: trident.netapp.io/v1
        controller: true
        kind: TridentProvisioner
        name: trident
        uid: 2e0df0a5-0b40-4f9e-a356-a56c997b82d4
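Comparing creation timestamps makes the recreate visible: the DaemonSet is seconds old, while the owner it points at is unchanged since September. A hedged sketch against the live objects:

```sh
# New DaemonSet: created 2020-10-19T19:24:46Z per the YAML above.
oc get daemonset trident-csi -n trident -o jsonpath='{.metadata.creationTimestamp}{"\n"}'

# Its owner: created 2020-09-30T22:20:32Z and never replaced, so the
# ownerReference UID still resolves to a live object.
oc get tridentprovisioner trident -n trident -o jsonpath='{.metadata.creationTimestamp}{"\n"}'
```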
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

From reading https://github.com/NetApp/trident/issues/474 it looks like the problem is actually similar to the one described in bug 1883946. There's already a WIP PR fixing the races in https://github.com/kubernetes/kubernetes/pull/92743; hopefully that should land in k8s 1.20 and we'll get it with the next k8s bump.

I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

This landed along with the k8s 1.20 bump, moving to QA.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
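A quick way to confirm a given cluster has picked up the fix is to check that it is on 4.7.0, which ships Kubernetes 1.20. A sketch only; it verifies versions, not the GC behavior itself:

```sh
# OCP 4.7.0 carries the GC race fix from kubernetes/kubernetes#92743.
oc get clusterversion
oc version   # the Kubernetes (server) version should report v1.20.x
```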
Description of problem:

Trident deployment in a customer's pre-production OCP 4.5.13 cluster.

Typical message where the object will not get GC'd:

    I1019 19:20:36.029255 1 garbagecollector.go:459] object garbagecollector.objectReference{OwnerReference:v1.OwnerReference{APIVersion:"apps/v1", Kind:"DaemonSet", Name:"trident-csi", UID:"26deb7d0-7dc7-4907-924d-1444813b8fbe", Controller:(*bool)(0xc00177adf0), BlockOwnerDeletion:(*bool)(0xc00177adf1)}, Namespace:"trident"} has at least one existing owner: []v1.OwnerReference{v1.OwnerReference{APIVersion:"trident.netapp.io/v1", Kind:"TridentProvisioner", Name:"trident", UID:"2e0df0a5-0b40-4f9e-a356-a56c997b82d4", Controller:(*bool)(0xc005c36627), BlockOwnerDeletion:(*bool)(nil)}}, will not garbage collect

4 minutes later, it gets deleted:

    I1019 19:24:45.442874 1 garbagecollector.go:517] delete object [apps/v1/DaemonSet, namespace: trident, name: trident-csi, uid: 26deb7d0-7dc7-4907-924d-1444813b8fbe] with propagation policy Background

The TridentProvisioner still exists and has not been modified:

    # oc get TridentProvisioner -n trident -o yaml trident
    apiVersion: trident.netapp.io/v1
    kind: TridentProvisioner
    metadata:
      creationTimestamp: "2020-09-30T22:20:32Z"
      generation: 1
      name: trident
      namespace: trident
      resourceVersion: "88013962"
      selfLink: /apis/trident.netapp.io/v1/namespaces/trident/tridentprovisioners/trident
      uid: 2e0df0a5-0b40-4f9e-a356-a56c997b82d4
    spec:
      debug: false
      silenceAutosupport: true
    status:
      currentInstallationParams:
        IPv6: "false"
        autosupportImage: netapp/trident-autosupport:20.07.0
        autosupportProxy: ""
        debug: "false"
        imagePullSecrets: []
        imageRegistry: quay.io
        k8sTimeout: "30"
        kubeletDir: /var/lib/kubelet
        logFormat: text
        silenceAutosupport: "true"
        tridentImage: netapp/trident:20.07.1
      message: Trident installed
      status: Installed
      version: v20.07.1

I can't find any reference to uid 26deb7d0-7dc7-4907-924d-1444813b8fbe / the trident-csi daemonset being removed in the audit logs. The only thing I see is 2 create events for the cluster role and binding a little over 2 minutes after the GC.
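A hedged sketch of how that audit-log search might look against an extracted must-gather (the directory layout and file names are assumptions; the GC's delete would normally be recorded in the kube-apiserver audit logs). Only the two create events below show up:

```sh
# Look for any audited delete of the trident-csi DaemonSet, then for any
# event at all mentioning the old UID.
grep -rh '"verb":"delete"' ./kube-apiserver/ ./openshift-apiserver/ 2>/dev/null \
  | grep '"resource":"daemonsets"' | grep trident-csi
grep -rh '26deb7d0-7dc7-4907-924d-1444813b8fbe' ./kube-apiserver/ ./openshift-apiserver/ 2>/dev/null
```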
    ./openshift-apiserver/{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"367e2719-f3da-4f14-9f9d-e4f93c442da8","stage":"ResponseComplete","requestURI":"/apis/authorization.openshift.io/v1/clusterroles","verb":"create","user":{"username":"system:serviceaccount:trident:trident-operator","groups":["system:serviceaccounts","system:serviceaccounts:trident","system:authenticated"]},"sourceIPs":["142.34.194.9","10.97.0.1"],"userAgent":"trident-operator/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"clusterroles","name":"trident-csi","apiGroup":"authorization.openshift.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":201},"requestReceivedTimestamp":"2020-10-19T19:27:00.681438Z","stageTimestamp":"2020-10-19T19:27:00.721815Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"trident-operator\" of ClusterRole \"trident-operator\" to ServiceAccount \"trident-operator/trident\""}}

    ./openshift-apiserver/{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"a317a0dd-cad3-4070-984b-be95982d1a91","stage":"ResponseComplete","requestURI":"/apis/authorization.openshift.io/v1/clusterrolebindings","verb":"create","user":{"username":"system:serviceaccount:trident:trident-operator","groups":["system:serviceaccounts","system:serviceaccounts:trident","system:authenticated"]},"sourceIPs":["142.34.194.9","10.97.0.1"],"userAgent":"trident-operator/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"clusterrolebindings","name":"trident-csi","apiGroup":"authorization.openshift.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":201},"requestReceivedTimestamp":"2020-10-19T19:27:00.734474Z","stageTimestamp":"2020-10-19T19:27:00.784549Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"trident-operator\" of ClusterRole \"trident-operator\" to ServiceAccount \"trident-operator/trident\""}}

Version-Release number of selected component (if applicable): OCP 4.5.13

How reproducible:
Random - based on the symptoms of trident / storage breaking, it has happened before, but this is the first time we've captured data.

Steps to Reproduce:
1. Unknown

Actual results:
Trident objects are recreated, causing existing PVCs to disconnect / fail. Data is not lost, but the secrets need to be updated so the pods can remount their storage.

Expected results:
The DaemonSet should not be garbage collected while its owner still exists.

Additional info: