Description of problem (please be as detailed as possible and provide log snippets):

The CSI pod can hang "if something is wrong in the ceph cluster or some network problem. There is no auto recovery for this one. The csi pod restart is the only manual fix for it. Nothing can be done at cephcsi to fix it, adding timeout is an option, but that can lead to many unexpected cases it was avoided by design based on suggestions from upstream." See https://bugzilla.redhat.com/show_bug.cgi?id=2162403#c44

This seems to occur when a network issue prevents requests from reaching other services such as the Ceph manager, or when an issue in any of Ceph's services delays API responses for long enough. Even if these issues resolve themselves in the other systems, the CSI pod does not recover on its own and continues to fail until it is restarted. This leads to end-user issues such as failures to mount PVCs.

Version of all relevant components (if applicable):

All

It's not clear to me what the underlying problem is here, though it has been discussed to some extent upstream. Madhu opened an upstream issue narrowly focused on timeouts: https://github.com/ceph/ceph-csi/issues/3657. Maybe that's all it takes?