2170138 – CSI pod hangs permanently on some error conditions

Summary: CSI pod hangs permanently on some error conditions

Keywords:
Status:	ASSIGNED
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	csi-driver
Sub Component:
Version:	unspecified
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Madhu Rajanna
QA Contact:	krishnaram Karthick
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2023-02-15 18:43 UTC by Greg Farnum
Modified:	2024-09-12 12:18 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	ceph ceph-csi issues 3657	0	None	open	Add timeout to Ceph GET API calls	2023-08-25 12:57:58 UTC

Description Greg Farnum 2023-02-15 18:43:21 UTC

Description of problem (please be as detailed as possible and provide log
snippets):

The CSI pod can hang "if something is wrong in the ceph cluster or some network problem. There is no auto recovery for this one. The csi pod restart is the only manual fix for it.

Nothing can be done at cephcsi to fix it, adding timeout is an option, but that can lead to many unexpected cases it was avoided by design based on suggestions from upstream."

See https://bugzilla.redhat.com/show_bug.cgi?id=2162403#c44

This seems to occur if there's a network issue which prevents requests from reaching other services like the ceph manager, or if there are issues in any of Ceph's services which delay API responses for long enough. Even if these issues resolve themselves in the other systems, the CSI pod continues to fail until it is restarted, and it does not auto-recover in any way. This leads to end-user issues like failures to mount PVCs.

Version of all relevant components (if applicable):
All


It's not clear to me what the underlying problem is here, though it has been discussed to some extent upstream. Madhu opened an upstream issue narrowly focused on timeouts: https://github.com/ceph/ceph-csi/issues/3657. Maybe that's all it takes?
