Bug 1890182 - DaemonSet with existing owner garbage collected
Summary: DaemonSet with existing owner garbage collected
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.5
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Maciej Szulik
QA Contact: zhou ying
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-21 15:29 UTC by Matthew Robson
Modified: 2023-12-15 19:51 UTC
CC List: 5 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:27:22 UTC
Target Upstream Version:
Embargoed:




Links
System ID                              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata RHSA-2020:5633  0        None      None    None     2021-02-24 15:27:46 UTC

Description Matthew Robson 2020-10-21 15:29:28 UTC
Description of problem:

Trident deployment in a customer's pre-production OCP 4.5.13 cluster.

Typical message where the object will not get GC'd:

I1019 19:20:36.029255       1 garbagecollector.go:459] object garbagecollector.objectReference{OwnerReference:v1.OwnerReference{APIVersion:"apps/v1", Kind:"DaemonSet", Name:"trident-csi", UID:"26deb7d0-7dc7-4907-924d-1444813b8fbe", Controller:(*bool)(0xc00177adf0), BlockOwnerDeletion:(*bool)(0xc00177adf1)}, Namespace:"trident"} has at least one existing owner: []v1.OwnerReference{v1.OwnerReference{APIVersion:"trident.netapp.io/v1", Kind:"TridentProvisioner", Name:"trident", UID:"2e0df0a5-0b40-4f9e-a356-a56c997b82d4", Controller:(*bool)(0xc005c36627), BlockOwnerDeletion:(*bool)(nil)}}, will not garbage collect

Four minutes later, it gets deleted:

I1019 19:24:45.442874       1 garbagecollector.go:517] delete object [apps/v1/DaemonSet, namespace: trident, name: trident-csi, uid: 26deb7d0-7dc7-4907-924d-1444813b8fbe] with propagation policy Background
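
For reference, the owner reference the garbage collector is evaluating lives on the DaemonSet itself; one way to inspect it (a sketch only, reusing the object names from the logs above) would be:

# oc get daemonset trident-csi -n trident -o jsonpath='{.metadata.ownerReferences}'

The uid in that ownerReferences entry should match the TridentProvisioner's metadata.uid shown below.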

TridentProvisioner still exists and has not been modified:

# oc get TridentProvisioner -n trident -o yaml trident
apiVersion: trident.netapp.io/v1
kind: TridentProvisioner
metadata:
  creationTimestamp: "2020-09-30T22:20:32Z"
  generation: 1
  name: trident
  namespace: trident
  resourceVersion: "88013962"
  selfLink: /apis/trident.netapp.io/v1/namespaces/trident/tridentprovisioners/trident
  uid: 2e0df0a5-0b40-4f9e-a356-a56c997b82d4
spec:
  debug: false
  silenceAutosupport: true
status:
  currentInstallationParams:
    IPv6: "false"
    autosupportImage: netapp/trident-autosupport:20.07.0
    autosupportProxy: ""
    debug: "false"
    imagePullSecrets: []
    imageRegistry: quay.io
    k8sTimeout: "30"
    kubeletDir: /var/lib/kubelet
    logFormat: text
    silenceAutosupport: "true"
    tridentImage: netapp/trident:20.07.1
  message: Trident installed
  status: Installed
  version: v20.07.1


I can't find any reference in the audit logs to uid 26deb7d0-7dc7-4907-924d-1444813b8fbe / the trident-csi DaemonSet being removed (a search sketch follows the excerpts below).

The only thing I see is two create events, for a cluster role and a cluster role binding, a little over 2 minutes after the GC:

./openshift-apiserver/{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"367e2719-f3da-4f14-9f9d-e4f93c442da8","stage":"ResponseComplete","requestURI":"/apis/authorization.openshift.io/v1/clusterroles","verb":"create","user":{"username":"system:serviceaccount:trident:trident-operator","groups":["system:serviceaccounts","system:serviceaccounts:trident","system:authenticated"]},"sourceIPs":["142.34.194.9","10.97.0.1"],"userAgent":"trident-operator/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"clusterroles","name":"trident-csi","apiGroup":"authorization.openshift.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":201},"requestReceivedTimestamp":"2020-10-19T19:27:00.681438Z","stageTimestamp":"2020-10-19T19:27:00.721815Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"trident-operator\" of ClusterRole \"trident-operator\" to ServiceAccount \"trident-operator/trident\""}}

./openshift-apiserver/{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"a317a0dd-cad3-4070-984b-be95982d1a91","stage":"ResponseComplete","requestURI":"/apis/authorization.openshift.io/v1/clusterrolebindings","verb":"create","user":{"username":"system:serviceaccount:trident:trident-operator","groups":["system:serviceaccounts","system:serviceaccounts:trident","system:authenticated"]},"sourceIPs":["142.34.194.9","10.97.0.1"],"userAgent":"trident-operator/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"clusterrolebindings","name":"trident-csi","apiGroup":"authorization.openshift.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":201},"requestReceivedTimestamp":"2020-10-19T19:27:00.734474Z","stageTimestamp":"2020-10-19T19:27:00.784549Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"trident-operator\" of ClusterRole \"trident-operator\" to ServiceAccount \"trident-operator/trident\""}}
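
One way to search the audit logs for the deleted DaemonSet's UID (a sketch only; the retrieval command and log paths are assumptions and can differ per cluster and release):

# oc adm node-logs --role=master --path=kube-apiserver/audit.log | grep 26deb7d0-7dc7-4907-924d-1444813b8fbe
# oc adm node-logs --role=master --path=openshift-apiserver/audit.log | grep 26deb7d0-7dc7-4907-924d-1444813b8fbe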



Version-Release number of selected component (if applicable):


How reproducible:

Random - based on the symptoms of Trident / storage breaking, it has happened before, but this is the first time we've captured data.

Steps to Reproduce:
1. Unknown
2.
3.

Actual results:
Trident objects are recreated, causing existing PVCs to disconnect / fail. Data is not lost, but the secrets need to be updated so the pods can remount their storage.


Expected results:

The DaemonSet should not be garbage collected while its owner (the TridentProvisioner) still exists.


Additional info:

Comment 1 Matthew Robson 2020-10-21 17:17:09 UTC
The processing of the TridentProvisioner right before the delete is odd in that the namespace field is blank:

October 19th 2020, 12:24:44.736	I1019 19:24:44.736416       1 garbagecollector.go:404] processing item [trident.netapp.io/v1/TridentProvisioner, namespace: , name: trident, uid: 2e0df0a5-0b40-4f9e-a356-a56c997b82d4]

Here is the garbage collector processing the TridentProvisioner 2s after the delete of the DaemonSet occurred; the log also contains (following <massive amount of data>): Operation cannot be fulfilled on daemonsets.apps "trident-csi": the object has been modified; please apply your changes to the latest version and try again

I1019 19:24:47.881119       1 garbagecollector.go:404] processing item [trident.netapp.io/v1/TridentProvisioner, namespace: trident, name: trident, uid: 2e0df0a5-0b40-4f9e-a356-a56c997b82d4]

I1019 19:24:47.912415       1 garbagecollector.go:447] object [trident.netapp.io/v1/TridentProvisioner, namespace: trident, name: trident, uid: 2e0df0a5-0b40-4f9e-a356-a56c997b82d4]'s doesn't have an owner, continue on next item

The logs also have a lot of messages like:

October 19th 2020, 12:24:48.369	E1019 19:24:48.369066       1 daemon_controller.go:332] trident/trident-csi failed with : error storing status for daemon set &v1.DaemonSet

The new DS references the same uid '2e0df0a5-0b40-4f9e-a356-a56c997b82d4' for the TridentProvisioner (see the cross-check sketch after the snippet):

  kind: DaemonSet
  metadata:
    annotations:
      deprecated.daemonset.template.generation: "1"
    creationTimestamp: "2020-10-19T19:24:46Z"
    generation: 1
    ...
    name: trident-csi
    namespace: trident
    ownerReferences:
    - apiVersion: trident.netapp.io/v1
      controller: true
      kind: TridentProvisioner
      name: trident
      uid: 2e0df0a5-0b40-4f9e-a356-a56c997b82d4
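
A quick cross-check (a sketch, using the same objects as above) that this uid still belongs to the existing TridentProvisioner:

# oc get tridentprovisioner trident -n trident -o jsonpath='{.metadata.uid}'

This should return 2e0df0a5-0b40-4f9e-a356-a56c997b82d4, i.e. the same uid recorded in the new DaemonSet's ownerReferences.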

Comment 6 Maciej Szulik 2020-10-23 10:53:07 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 7 Maciej Szulik 2020-11-03 09:38:15 UTC
From reading https://github.com/NetApp/trident/issues/474 it looks like the problem is actually similar to the one described in bug 1883946.
There's already a WIP PR fixing the races in https://github.com/kubernetes/kubernetes/pull/92743; hopefully that will land in k8s 1.20 and we'll get it with the next k8s bump.

Comment 8 Maciej Szulik 2020-12-04 16:30:03 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 9 Maciej Szulik 2021-01-15 13:44:00 UTC
This landed along with the k8s 1.20 bump, moving to QA.

Comment 14 errata-xmlrpc 2021-02-24 15:27:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

