1883946 – Understand why trident CSI pods are getting deleted by OCP

Bug 1883946 - Understand why trident CSI pods are getting deleted by OCP

Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	kube-controller-manager
Version:	4.5
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Maciej Szulik
QA Contact:	zhou ying

Reported:	2020-09-30 14:44 UTC by Andre Costa
Modified:	2024-03-25 16:37 UTC
CC List:	5 users
Doc Type:	No Doc Update
Last Closed:	2021-02-24 15:21:54 UTC

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2020:5633	0	None	None	None	2021-02-24 15:22:21 UTC

Description Andre Costa 2020-09-30 14:44:42 UTC

Description of problem:
We have a case in which NetApp asked us to investigate a possible issue on our side: their Trident CSI deployments are getting deleted, and their operator then creates a broken deployment because a secret is regenerated but the pods are not restarted. NetApp has opened an issue for this: https://github.com/NetApp/trident/issues/444

I'm currently checking the audit logs to see whether I can find anything. Since there is also a case open with NetApp for this issue, should we open a TSANet case?
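
As an illustration of the audit-log check mentioned above, here is a minimal sketch that scans API server audit events (one JSON object per line) for delete requests in a given namespace and reports who issued them. It assumes the standard audit event fields `verb`, `objectRef`, `user.username` and `requestReceivedTimestamp`; the function name, the default resource list and the `trident` namespace in the usage note are hypothetical and not taken from this report.

```python
import json


def find_deletions(audit_lines, namespace,
                   resources=("deployments", "daemonsets", "pods")):
    """Return (timestamp, username, resource, name) for delete events.

    audit_lines: iterable of strings, one JSON audit event per line.
    namespace:   namespace to inspect (hypothetical; use the real one).
    """
    hits = []
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if event.get("verb") not in ("delete", "deletecollection"):
            continue
        ref = event.get("objectRef") or {}
        if ref.get("namespace") != namespace:
            continue
        if ref.get("resource") not in resources:
            continue
        hits.append((
            event.get("requestReceivedTimestamp"),
            (event.get("user") or {}).get("username"),
            ref.get("resource"),
            ref.get("name"),
        ))
    return hits
```

It could be called as `find_deletions(open(path), "trident")`, where `path` points to an audit log file collected from the cluster. The username on each hit shows whether a deletion came from a human user, an operator, or a controller service account.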

Version-Release number of selected component (if applicable):
OCP 4.5.z

How reproducible:
Happens intermittently in the customer's environment

Steps to Reproduce:
Unknown

Comment 7 Maciej Szulik 2020-10-23 10:46:49 UTC

I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 8 Maciej Szulik 2020-11-03 09:23:06 UTC

From reading https://github.com/NetApp/trident/issues/444 and https://github.com/NetApp/trident/issues/474, it looks like the problem is on the Trident side and should be fixed in a newer release: https://github.com/NetApp/trident/issues/444#issuecomment-718059956

For the k8s gc issue, there's a WIP PR fixing the races in https://github.com/kubernetes/kubernetes/pull/92743. Hopefully that will land in k8s 1.20, and we'll get it with the next k8s bump.

Comment 9 Armin Kunaschik 2020-11-03 13:18:45 UTC

NetApp release 20.10.0 fixes the broken deployment. That means the deployment is now restored correctly when parts of it are removed.
There is a second hint about an incorrect ownerReference; see GitHub issue 474 for details. This is not yet fixed.
I cannot judge whether the ownerReference is the real cause of the problem. But if it isn't, I'd expect a backport of the gc fix for at least 4.6.
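
For context on the ownerReference hint, here is a minimal sketch of the scoping rules the Kubernetes garbage collector documents for owner references: a namespaced dependent may only point at a cluster-scoped owner or an owner in its own namespace, and a cluster-scoped dependent may only point at cluster-scoped owners. Which rule, if any, Trident violated is not stated in this report (see issue 474). The function name and the manifest dictionaries are hypothetical illustrations, not Trident or Kubernetes code.

```python
def owner_reference_problems(dependent, owners_by_uid):
    """List scoping problems in a dependent's ownerReferences.

    dependent:     manifest as a dict (metadata.namespace is absent for
                   cluster-scoped objects).
    owners_by_uid: dict mapping owner UID -> owner manifest.
    """
    problems = []
    meta = dependent.get("metadata", {})
    dep_ns = meta.get("namespace")
    for ref in meta.get("ownerReferences", []):
        label = f"{ref.get('kind')}/{ref.get('name')}"
        owner = owners_by_uid.get(ref.get("uid"))
        if owner is None:
            problems.append(f"owner {label} not found")
            continue
        owner_ns = owner.get("metadata", {}).get("namespace")
        if owner_ns is None:
            continue  # cluster-scoped owners are valid for any dependent
        if dep_ns is None:
            problems.append(
                f"cluster-scoped dependent has namespaced owner {label}")
        elif dep_ns != owner_ns:
            problems.append(
                f"cross-namespace owner {label} ({owner_ns} != {dep_ns})")
    return problems
```

An empty result means the references pass these scoping checks. It does not rule out the garbage collector races discussed in comment 8.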

Comment 10 Maciej Szulik 2020-12-04 16:29:28 UTC

I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 11 Maciej Szulik 2021-01-15 12:49:13 UTC

With k8s 1.20 already available in 4.7 and the Trident bug fixed, I'm moving this to QA.

Comment 12 Armin Kunaschik 2021-01-15 13:03:14 UTC

FYI: NetApp fixed GitHub issue 474 in Trident 20.10.1. We're still testing this version. If it fixes the issue completely, we're fine. If not, we expect it might still be necessary to backport the gc fix to 4.6, because it's a long-term support release.

Comment 13 Maciej Szulik 2021-01-15 14:03:08 UTC

(In reply to Armin Kunaschik from comment #12)
> FYI: NetApp fixed GitHub issue 474 in Trident 20.10.1. We're still
> testing this version. If it fixes the issue completely, we're fine. If
> not, we expect it might still be necessary to backport the gc fix
> to 4.6, because it's a long-term support release.

I don't expect the GC fix to be backported, since it is too big and too risky a change.

Comment 18 errata-xmlrpc 2021-02-24 15:21:54 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
