Description of problem:

The trident-csi DaemonSet in a customer's pre-production OCP 4.5.13 cluster is being garbage collected even though its owning TridentProvisioner still exists and is unmodified. Typical message where the object will not get GC'd:

I1019 19:20:36.029255 1 garbagecollector.go:459] object garbagecollector.objectReference{OwnerReference:v1.OwnerReference{APIVersion:"apps/v1", Kind:"DaemonSet", Name:"trident-csi", UID:"26deb7d0-7dc7-4907-924d-1444813b8fbe", Controller:(*bool)(0xc00177adf0), BlockOwnerDeletion:(*bool)(0xc00177adf1)}, Namespace:"trident"} has at least one existing owner: []v1.OwnerReference{v1.OwnerReference{APIVersion:"trident.netapp.io/v1", Kind:"TridentProvisioner", Name:"trident", UID:"2e0df0a5-0b40-4f9e-a356-a56c997b82d4", Controller:(*bool)(0xc005c36627), BlockOwnerDeletion:(*bool)(nil)}}, will not garbage collect

Four minutes later, it gets deleted:

I1019 19:24:45.442874 1 garbagecollector.go:517] delete object [apps/v1/DaemonSet, namespace: trident, name: trident-csi, uid: 26deb7d0-7dc7-4907-924d-1444813b8fbe] with propagation policy Background

The TridentProvisioner still exists and has not been modified:

# oc get TridentProvisioner -n trident -o yaml trident
apiVersion: trident.netapp.io/v1
kind: TridentProvisioner
metadata:
  creationTimestamp: "2020-09-30T22:20:32Z"
  generation: 1
  name: trident
  namespace: trident
  resourceVersion: "88013962"
  selfLink: /apis/trident.netapp.io/v1/namespaces/trident/tridentprovisioners/trident
  uid: 2e0df0a5-0b40-4f9e-a356-a56c997b82d4
spec:
  debug: false
  silenceAutosupport: true
status:
  currentInstallationParams:
    IPv6: "false"
    autosupportImage: netapp/trident-autosupport:20.07.0
    autosupportProxy: ""
    debug: "false"
    imagePullSecrets: []
    imageRegistry: quay.io
    k8sTimeout: "30"
    kubeletDir: /var/lib/kubelet
    logFormat: text
    silenceAutosupport: "true"
    tridentImage: netapp/trident:20.07.1
  message: Trident installed
  status: Installed
  version: v20.07.1
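To rule out a stale owner UID, the ownerReference on the (recreated) DaemonSet can be compared against the live TridentProvisioner. A rough sketch using the names from this report (the jsonpath expressions are my own, not taken from the must-gather):

# UID the DaemonSet's ownerReference points at
oc get daemonset trident-csi -n trident -o jsonpath='{.metadata.ownerReferences[0].uid}{"\n"}'
# UID the TridentProvisioner actually has -- the two match in this case
oc get tridentprovisioner trident -n trident -o jsonpath='{.metadata.uid}{"\n"}'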
./openshift-apiserver/{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"367e2719-f3da-4f14-9f9d-e4f93c442da8","stage":"ResponseComplete","requestURI":"/apis/authorization.openshift.io/v1/clusterroles","verb":"create","user":{"username":"system:serviceaccount:trident:trident-operator","groups":["system:serviceaccounts","system:serviceaccounts:trident","system:authenticated"]},"sourceIPs":["142.34.194.9","10.97.0.1"],"userAgent":"trident-operator/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"clusterroles","name":"trident-csi","apiGroup":"authorization.openshift.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":201},"requestReceivedTimestamp":"2020-10-19T19:27:00.681438Z","stageTimestamp":"2020-10-19T19:27:00.721815Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"trident-operator\" of ClusterRole \"trident-operator\" to ServiceAccount \"trident-operator/trident\""}} ./openshift-apiserver/{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"a317a0dd-cad3-4070-984b-be95982d1a91","stage":"ResponseComplete","requestURI":"/apis/authorization.openshift.io/v1/clusterrolebindings","verb":"create","user":{"username":"system:serviceaccount:trident:trident-operator","groups":["system:serviceaccounts","system:serviceaccounts:trident","system:authenticated"]},"sourceIPs":["142.34.194.9","10.97.0.1"],"userAgent":"trident-operator/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"clusterrolebindings","name":"trident-csi","apiGroup":"authorization.openshift.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":201},"requestReceivedTimestamp":"2020-10-19T19:27:00.734474Z","stageTimestamp":"2020-10-19T19:27:00.784549Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"trident-operator\" of ClusterRole \"trident-operator\" to ServiceAccount \"trident-operator/trident\""}} Version-Release number of selected component (if applicable): How reproducible: Random - based on the symptoms of trident / storage breaking it has happened before, but this is the first time we've captutred data Steps to Reproduce: 1. Unknown 2. 3. Actual results: Trident objects are recreated causing existing PVCs to disconnect / fail. Data is not lost, the secrets need to be updated so the pods can remount their storage again. Expected results: Should not be garbage collecting. Additional info:
Version-Release number of selected component (if applicable):

How reproducible:
Random - based on the symptoms of Trident / storage breaking, it has happened before, but this is the first time we've captured data.

Steps to Reproduce:
1. Unknown

Actual results:
Trident objects are recreated, causing existing PVCs to disconnect / fail. Data is not lost, but the secrets need to be updated so the pods can remount their storage.

Expected results:
Objects whose owner still exists should not be garbage collected.

Additional info:
The processing of the TridentProvisioner right before the delete is odd in that the namespace is blank:

October 19th 2020, 12:24:44.736
I1019 19:24:44.736416 1 garbagecollector.go:404] processing item [trident.netapp.io/v1/TridentProvisioner, namespace: , name: trident, uid: 2e0df0a5-0b40-4f9e-a356-a56c997b82d4]

Here is the garbage collector processing the TridentProvisioner 2s after the delete of the DaemonSet occurred:

<massive amount of data>
Operation cannot be fulfilled on daemonsets.apps "trident-csi": the object has been modified; please apply your changes to the latest version and try again
I1019 19:24:47.881119 1 garbagecollector.go:404] processing item [trident.netapp.io/v1/TridentProvisioner, namespace: trident, name: trident, uid: 2e0df0a5-0b40-4f9e-a356-a56c997b82d4]
I1019 19:24:47.912415 1 garbagecollector.go:447] object [trident.netapp.io/v1/TridentProvisioner, namespace: trident, name: trident, uid: 2e0df0a5-0b40-4f9e-a356-a56c997b82d4]'s doesn't have an owner, continue on next item

The logs also contain a lot of messages like:

October 19th 2020, 12:24:48.369
E1019 19:24:48.369066 1 daemon_controller.go:332] trident/trident-csi failed with : error storing status for daemon set &v1.DaemonSet

The new DaemonSet references the same uid '2e0df0a5-0b40-4f9e-a356-a56c997b82d4' for the TridentProvisioner:

kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
  creationTimestamp: "2020-10-19T19:24:46Z"
  generation: 1
  ...
  name: trident-csi
  namespace: trident
  ownerReferences:
  - apiVersion: trident.netapp.io/v1
    controller: true
    kind: TridentProvisioner
    name: trident
    uid: 2e0df0a5-0b40-4f9e-a356-a56c997b82d4
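The blank namespace in that first processing item looks like the distinguishing symptom, and it is easy to scan for. A rough sketch (log file name is assumed; the match strings come straight from the lines above):

# GC "processing item" entries whose namespace field is empty
grep 'processing item' kube-controller-manager.log | grep 'namespace: ,'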
I'm adding UpcomingSprint because I was occupied with fixing higher-priority/severity bugs, developing higher-priority features, or developing features that improve stability at a macro level. I will revisit this bug next sprint.
From reading https://github.com/NetApp/trident/issues/474 it looks like the problem is actually similar to the one described in bug 1883946. There's already a WIP PR fixing the races, https://github.com/kubernetes/kubernetes/pull/92743; hopefully it lands in k8s 1.20 and we'll pick it up with the next k8s bump.
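Once a rebase with that fix is available, a quick sanity check is the Kubernetes level the cluster actually reports (assuming the fix lands in 1.20 as hoped):

# The server-side Kubernetes version as printed by recent oc clients
oc version | grep -i 'kubernetes version'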
This landed along with the k8s 1.20 bump; moving to QA.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633