Bug 1883946 - Understand why trident CSI pods are getting deleted by OCP
Summary: Understand why trident CSI pods are getting deleted by OCP
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.5
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Maciej Szulik
QA Contact: zhou ying
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-09-30 14:44 UTC by Andre Costa
Modified: 2024-03-25 16:37 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:21:54 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:22:21 UTC

Description Andre Costa 2020-09-30 14:44:42 UTC
Description of problem:
NetApp has asked us to investigate, on our side, a possible issue where their Trident CSI deployments are getting deleted. Their operator then creates a broken deployment, because a secret is regenerated but the pods are not restarted. NetApp has opened an issue for this: https://github.com/NetApp/trident/issues/444

I'm currently checking the audit logs to see whether I can find anything. Since there is also a case open with NetApp for this issue, should we open a TSANet case?
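As an illustration of the audit-log check mentioned above, here is a minimal sketch that scans Kubernetes API audit log lines (one JSON event per line) for delete requests against workload objects in one namespace. The field names (`verb`, `objectRef`, `user.username`, `requestReceivedTimestamp`) come from the upstream `audit.k8s.io/v1` Event format. The namespace `trident`, the object names, and the sample events are assumptions made up for this example, not data from the customer case.

```python
import json

def find_deletions(lines, namespace,
                   resources=("deployments", "daemonsets", "pods")):
    """Return (timestamp, username, resource, name) for each delete event
    that targets one of `resources` in `namespace`."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if event.get("verb") not in ("delete", "deletecollection"):
            continue
        ref = event.get("objectRef") or {}
        if ref.get("namespace") != namespace:
            continue
        if ref.get("resource") not in resources:
            continue
        user = event.get("user") or {}
        hits.append((event.get("requestReceivedTimestamp"),
                     user.get("username"),
                     ref.get("resource"),
                     ref.get("name")))
    return hits

# Hypothetical sample events; real logs contain many more fields.
SAMPLE_LINES = [
    json.dumps({
        "verb": "delete",
        "requestReceivedTimestamp": "2020-09-29T10:00:00.000000Z",
        "user": {"username": "system:serviceaccount:kube-system:generic-garbage-collector"},
        "objectRef": {"resource": "deployments", "namespace": "trident",
                      "name": "trident-csi"},
    }),
    json.dumps({
        "verb": "get",
        "requestReceivedTimestamp": "2020-09-29T10:00:01.000000Z",
        "user": {"username": "some-user"},
        "objectRef": {"resource": "deployments", "namespace": "trident",
                      "name": "trident-csi"},
    }),
    json.dumps({
        "verb": "delete",
        "requestReceivedTimestamp": "2020-09-29T10:00:02.000000Z",
        "user": {"username": "some-user"},
        "objectRef": {"resource": "pods", "namespace": "other",
                      "name": "unrelated"},
    }),
]

if __name__ == "__main__":
    for hit in find_deletions(SAMPLE_LINES, "trident"):
        print(hit)
```

The `username` of each hit shows which client issued the delete, which is the piece of information needed to tell a garbage-collector deletion apart from one issued by an operator or a person.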

Version-Release number of selected component (if applicable):
OCP 4.5.z

How reproducible:
Happens intermittently in the customer's environment

Steps to Reproduce:
Unknown

Comment 7 Maciej Szulik 2020-10-23 10:46:49 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 8 Maciej Szulik 2020-11-03 09:23:06 UTC
From reading https://github.com/NetApp/trident/issues/444 and https://github.com/NetApp/trident/issues/474 it looks like the problem is on the Trident side, which should be fixed in a newer release - https://github.com/NetApp/trident/issues/444#issuecomment-718059956

For the k8s GC issue, there's a WIP PR fixing the races in https://github.com/kubernetes/kubernetes/pull/92743. Hopefully that will land in k8s 1.20 and we'll get it with the next k8s bump.

Comment 9 Armin Kunaschik 2020-11-03 13:18:45 UTC
NetApp release 20.10.0 fixes the broken deployment. That means the deployment is now recreated correctly when parts of it are removed.
There is a second hint about an incorrect ownerReference. See GitHub issue 474 for details. This is not yet fixed.
I cannot judge whether or not the ownerReference is the real cause of the problem. But if it isn't, I'd expect a backport of the GC fix to at least 4.6.
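This thread does not say what exactly was wrong with the ownerReference; GitHub issue 474 has the details. As general background, the Kubernetes garbage collector only accepts an owner that is either cluster-scoped or in the same namespace as the dependent, and a cluster-scoped dependent may only name cluster-scoped owners. The sketch below checks that one rule on plain dictionaries shaped like object manifests. The function name and the sample objects are hypothetical and are not taken from Trident.

```python
def owner_reference_problems(dependent, owners_by_uid):
    """Return a list of scope problems found in dependent's ownerReferences.

    `owners_by_uid` maps an owner UID to the owner's manifest (a dict).
    An object without metadata.namespace is treated as cluster-scoped.
    """
    problems = []
    meta = dependent["metadata"]
    dep_ns = meta.get("namespace")
    for ref in meta.get("ownerReferences", []):
        label = f"{ref['kind']}/{ref['name']}"
        owner = owners_by_uid.get(ref["uid"])
        if owner is None:
            problems.append(f"{label}: owner not found")
            continue
        owner_ns = owner["metadata"].get("namespace")
        if owner_ns is None:
            continue  # a cluster-scoped owner is allowed for any dependent
        if dep_ns is None:
            problems.append(
                f"{label}: cluster-scoped dependent has a namespaced owner")
        elif dep_ns != owner_ns:
            problems.append(
                f"{label}: owner is in namespace {owner_ns!r}, "
                f"dependent is in {dep_ns!r}")
    return problems

# Hypothetical objects for illustration only.
OWNER = {"kind": "ExampleOwner",
         "metadata": {"name": "owner", "namespace": "ns-a", "uid": "uid-1"}}
REF = {"kind": "ExampleOwner", "name": "owner", "uid": "uid-1"}
SAME_NS = {"metadata": {"name": "ok", "namespace": "ns-a",
                        "ownerReferences": [REF]}}
CLUSTER_SCOPED = {"metadata": {"name": "bad", "ownerReferences": [REF]}}

if __name__ == "__main__":
    owners = {"uid-1": OWNER}
    print(owner_reference_problems(SAME_NS, owners))
    print(owner_reference_problems(CLUSTER_SCOPED, owners))
```

A check like this covers only the scope rule; it says nothing about the GC races addressed by the upstream PR mentioned in comment 8.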

Comment 10 Maciej Szulik 2020-12-04 16:29:28 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 11 Maciej Szulik 2021-01-15 12:49:13 UTC
With k8s 1.20 already available in 4.7 and the Trident bug fixed, I'm moving this to QA.

Comment 12 Armin Kunaschik 2021-01-15 13:03:14 UTC
FYI: NetApp fixed GitHub issue 474 in Trident 20.10.1. We're still testing this version. If it fixes the issue completely, we're fine. If not, we expect that it might still be necessary to backport the GC fix to 4.6, because it's a long-term support release.

Comment 13 Maciej Szulik 2021-01-15 14:03:08 UTC
(In reply to Armin Kunaschik from comment #12)
> FYI: NetApp fixed GitHub issue 474 in Trident 20.10.1. We're still
> testing this version. If it fixes the issue completely, we're fine. If
> not, we expect that it might still be necessary to backport the GC fix
> to 4.6, because it's a long-term support release.

I don't expect the GC fix to be backported, since it is too big and too risky a change.

Comment 18 errata-xmlrpc 2021-02-24 15:21:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

