Description of problem:
On OSD we have seen monitoring volumes (Prometheus & Alertmanager) stuck detaching or attaching, usually as part of an upgrade. I don't have a must-gather for the problem right now, but in general any time a PVC is stuck in these states we should automatically force a detach so things can recover. SRE has a script to do this quickly, but it is a manual SOP at this time.

Version-Release number of selected component (if applicable):
4.4.11 and earlier

How reproducible:
A couple of clusters per upgrade cycle.

Steps to Reproduce:
1. Configure persistent storage for Prometheus & Alertmanager: https://github.com/openshift/managed-cluster-config/blob/master/deploy/cluster-monitoring-config/cluster-monitoring-config.yaml#L8-L36
2. Upgrade the cluster.

Actual results:
Sometimes the PVC for Prometheus or Alertmanager is stuck attaching/detaching. This blocks deployment.

Expected results:
Stuck volumes are automatically force detached, allowing the platform to try again.

Additional info:
Experience with this is on AWS; there are no known cases of this on GCP OSD clusters.
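The SRE SOP itself is not attached to this bug, but a minimal sketch of what such a force-detach might look like is below. All object names are placeholders (they assume the stuck volume belongs to the Prometheus PVC in openshift-monitoring), and with DRY_RUN=1 (the default) the commands are only printed, not run against a cluster:

```shell
#!/usr/bin/env bash
# Hypothetical force-detach sketch; not the actual SRE script.
# PVC_NAMESPACE / PVC_NAME / VOLUME_ATTACHMENT_NAME are placeholders.
set -euo pipefail

PVC_NAMESPACE="${PVC_NAMESPACE:-openshift-monitoring}"
PVC_NAME="${PVC_NAME:-prometheus-k8s-db-prometheus-k8s-0}"
DRY_RUN="${DRY_RUN:-1}"

# Print commands instead of executing them unless DRY_RUN=0.
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# 1. Find the PV bound to the stuck PVC.
run oc get pvc "${PVC_NAME}" -n "${PVC_NAMESPACE}" \
  -o jsonpath='{.spec.volumeName}'

# 2. List VolumeAttachments and find the one referencing that PV.
run oc get volumeattachments

# 3. Delete the matching VolumeAttachment so the attach/detach
#    controller drops the stale attachment and retries.
run oc delete volumeattachment "${VOLUME_ATTACHMENT_NAME:-<attachment-name>}"
```

Deleting the VolumeAttachment object is one way to force the controller to reconcile; the real SOP may differ.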
Do we have any logs from the node or masters from the time of the failure? It's not possible to guess what is going wrong without them, though I would suspect an API quota hit: the installer issues many AWS API calls, and it is possible the volume operations are getting throttled by AWS API quota limits.
It is possible that this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1866843, which has similar symptoms. Obviously we can't be sure until we see logs or events from the pod.
@Hemant, that does look like a similar problem. I will ask the team to provide a must-gather the next time we see it happen.
@Naveen - to verify whether this is similar to the bug I linked above, I don't think a must-gather is required. The events from the failing pod/PVC should be enough.
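For reference, commands along these lines would capture the requested events. The namespace and object names are assumptions based on the OSD monitoring stack, and the script only prints the commands (remove the echo to run them against a live cluster):

```shell
#!/usr/bin/env bash
# Illustrative event-collection sketch; names are placeholders, not
# taken from a specific failing cluster.
set -euo pipefail

NS="openshift-monitoring"
PVC="prometheus-k8s-db-prometheus-k8s-0"   # placeholder stuck PVC
POD="prometheus-k8s-0"                     # placeholder pod mounting it

# Printed rather than executed so the sketch works without a cluster.
echo "oc describe pvc ${PVC} -n ${NS}"
echo "oc describe pod ${POD} -n ${NS}"
echo "oc get events -n ${NS} --sort-by=.lastTimestamp"
```

`oc describe` includes the Events section for each object, which should show the attach/detach errors Hemant is asking about.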