Description of problem:
NetApp has asked us to investigate a possible issue on our side: their Trident CSI deployments are getting deleted, and their operator then creates a broken deployment because a secret is regenerated but the pods are not restarted. NetApp has opened an issue for this: https://github.com/NetApp/trident/issues/444

I'm currently checking the audit logs to see if I can find anything. Since there is also a case open with NetApp for this issue, should we open a TSANet case?

Version-Release number of selected component (if applicable):
OCP 4.5.z

How reproducible:
Happens sometimes on the customer's cluster

Steps to Reproduce:
Unknown
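For the audit-log check mentioned above, a minimal sketch of one way to look for the deletions, assuming the API server audit log is available as JSON lines in the standard audit.k8s.io/v1 event format. The function name and the "trident" namespace are assumptions for illustration; adjust the namespace to wherever Trident is installed.

```python
import json


def find_deployment_deletes(lines, namespace="trident"):
    """Yield (timestamp, username, name) for delete requests on Deployments.

    `lines` is any iterable of audit log lines (one JSON event per line).
    `namespace` defaults to "trident", which is an assumption, not a given.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        ref = event.get("objectRef") or {}
        if (
            event.get("verb") in ("delete", "deletecollection")
            and ref.get("resource") == "deployments"
            and ref.get("namespace") == namespace
        ):
            yield (
                event.get("requestReceivedTimestamp"),
                (event.get("user") or {}).get("username"),
                ref.get("name"),
            )


if __name__ == "__main__":
    import sys

    # Usage: python find_deletes.py < audit.log
    for ts, user, name in find_deployment_deletes(sys.stdin):
        print(ts, user, name)
```

The username on each hit shows who issued the delete, which should help tell apart a deletion by the operator, by a human, or by the garbage collector.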
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
From reading https://github.com/NetApp/trident/issues/444 and https://github.com/NetApp/trident/issues/474 it looks like the problem is on the Trident side and should be fixed in a newer release (see https://github.com/NetApp/trident/issues/444#issuecomment-718059956 ). For the k8s GC issue, there is a WIP PR fixing the races in https://github.com/kubernetes/kubernetes/pull/92743 ; hopefully that will land in k8s 1.20 and we'll get it with the next k8s bump.
NetApp release 20.10.0 fixes the broken deployment. That means the deployment is now done correctly when parts of it have been removed. There is a second hint, about an incorrect ownerReference; see GitHub issue 474 for details. This is not yet fixed. I cannot judge whether or not the ownerReference is the real cause of the problem. But if it isn't, I'd expect a backport of the GC fix for at least 4.6.
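As an illustration of what an incorrect ownerReference can look like to the garbage collector, here is a minimal sketch that checks one of the documented Kubernetes scoping rules: a cluster-scoped dependent may only name cluster-scoped owners. Whether this is the exact defect discussed in issue 474 is not confirmed here. The function name and the `namespaced_kinds` parameter are hypothetical; the caller has to supply which kinds are namespaced.

```python
def invalid_owner_refs(obj, namespaced_kinds):
    """Return the ownerReferences of `obj` that break the scoping rule
    "cluster-scoped dependents can only have cluster-scoped owners".

    `obj` is a Kubernetes object as a plain dict (e.g. parsed from JSON).
    `namespaced_kinds` is a caller-supplied set of kinds known to be
    namespaced (an assumption of this sketch, not looked up from the API).
    """
    meta = obj.get("metadata", {})
    # An object without metadata.namespace is treated as cluster-scoped.
    dependent_is_cluster_scoped = not meta.get("namespace")
    bad = []
    for ref in meta.get("ownerReferences", []):
        owner_is_namespaced = ref.get("kind") in namespaced_kinds
        if dependent_is_cluster_scoped and owner_is_namespaced:
            bad.append(ref)
    return bad


if __name__ == "__main__":
    # Made-up example: a ClusterRole that names a Deployment as its owner.
    cluster_role = {
        "kind": "ClusterRole",
        "metadata": {
            "name": "example",
            "ownerReferences": [{"kind": "Deployment", "name": "example-operator"}],
        },
    }
    print(invalid_owner_refs(cluster_role, {"Deployment"}))
```

An ownerReference carries no namespace field, so the other scoping rule (a namespaced owner must live in the same namespace as its dependent) cannot be checked from the dependent alone and is left out of this sketch.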
With k8s 1.20 already available in 4.7 and the Trident bug fixed, I'm moving this to QA.
FYI: NetApp fixed GitHub issue 474 in Trident 20.10.1. We're still testing this version. If it fixes the issue completely, we're fine. If not, our expectation is that it might still be necessary to backport the GC fix to 4.6, because it is a long-term support release.
(In reply to Armin Kunaschik from comment #12)
> FYI: Netapp fixed the github issue 474 in Trident 20.10.1. We're still
> testing this version. If this fixes the issue completely, we're fine. If
> not, it's our expectation: it might still necessary to backport the gc fix
> to 4.6 because it's a longterm support release.

I don't expect the GC fix to be backported, since it is too big and too risky a change.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633