Bug 2170138 - CSI pod hangs permanently on some error conditions
Summary: CSI pod hangs permanently on some error conditions
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: csi-driver
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Karthik U S
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-02-15 18:43 UTC by Greg Farnum
Modified: 2023-08-09 16:37 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | ceph ceph-csi issues 3657 | 0 | None | open | Add timeout to Ceph GET API calls | 2023-02-15 18:43:20 UTC

Description Greg Farnum 2023-02-15 18:43:21 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

The CSI pod can hang "if something is wrong in the ceph cluster or some network problem. There is no auto recovery for this one. The csi pod restart is the only manual fix for it.

Nothing can be done at cephcsi to fix it, adding timeout is an option, but that can lead to many unexpected cases it was avoided by design based on suggestions from upstream."

See https://bugzilla.redhat.com/show_bug.cgi?id=2162403#c44

This seems to occur if there's a network issue which prevents requests from reaching other services like the ceph manager, or if there are issues in any of Ceph's services which delay API responses for long enough. Even if these issues resolve themselves in the other systems, the CSI pod continues to fail until it is restarted, and it does not auto-recover in any way. This leads to end-user issues like failures to mount PVCs.

Version of all relevant components (if applicable):
All


It's not clear to me what the underlying problem is here, though it has been discussed upstream to some extent. Madhu opened an upstream issue narrowly focused on timeouts: https://github.com/ceph/ceph-csi/issues/3657. Maybe that's all it takes?

