Bug 1872842

Summary: Monitoring volumes stuck detaching/attaching
Product: OpenShift Container Platform
Component: Storage
Version: 4.4
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: medium
Status: CLOSED DUPLICATE
Keywords: ServiceDeliveryImpact
Reporter: Naveen Malik <nmalik>
Assignee: Tomas Smetana <tsmetana>
QA Contact: Qin Ping <piqin>
CC: aos-bugs, hekumar, jeder, khnguyen, wking
Flags: nmalik: needinfo-
Type: Bug
Last Closed: 2020-09-08 10:57:17 UTC

Description Naveen Malik 2020-08-26 17:35:24 UTC
Description of problem:
On OSD we have seen monitoring volumes (Prometheus & Alertmanager) stuck detaching or attaching, usually as part of an upgrade.  I don't have a must-gather for the problem right now, but in general, any time a PVC is stuck in these states we should automatically force a detach so things can recover.  SRE has a script to do this quickly, but it is a manual SOP at this time.
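
For reference, a minimal sketch of what such a force detach amounts to on AWS (this is not the actual SRE script; the region and volume ID below are placeholders, and boto3 credentials for the cluster's account are assumed):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region
VOLUME_ID = "vol-0123456789abcdef0"                 # placeholder stuck volume

# Check the current attachment state before forcing anything.
vol = ec2.describe_volumes(VolumeIds=[VOLUME_ID])["Volumes"][0]
for att in vol.get("Attachments", []):
    print(att["InstanceId"], att["State"])  # e.g. stuck in "detaching"

# Force the detach; AWS documents that this can corrupt data if the volume is still in use.
ec2.detach_volume(VolumeId=VOLUME_ID, Force=True)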

Version-Release number of selected component (if applicable):
4.4.11 and earlier

How reproducible:
A couple of clusters per upgrade cycle.

Steps to Reproduce:
1. Configure persistent storage for prom & AM
https://github.com/openshift/managed-cluster-config/blob/master/deploy/cluster-monitoring-config/cluster-monitoring-config.yaml#L8-L36
2. Upgrade cluster

Actual results:
Sometimes the PVC for Prometheus or Alertmanager is stuck attaching/detaching.  This blocks the deployment.
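
A quick way to see the stuck state from the cluster side is to look at the VolumeAttachment objects; a rough sketch with the Python kubernetes client (assumes a cluster-admin kubeconfig is available locally):

from kubernetes import client, config

config.load_kube_config()
storage = client.StorageV1Api()

# List all VolumeAttachments and flag any attach/detach errors.
for va in storage.list_volume_attachment().items:
    status = va.status
    print(va.metadata.name,
          va.spec.source.persistent_volume_name,
          va.spec.node_name,
          "attached=%s" % (status.attached if status else "unknown"))
    if status and status.attach_error:
        print("  attach error:", status.attach_error.message)
    if status and status.detach_error:
        print("  detach error:", status.detach_error.message)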

Expected results:
Stuck volumes are automatically force-detached, allowing the platform to try again.

Additional info:
Experience with this is on AWS; there are no known cases of this for GCP OSD clusters.

Comment 1 Tomas Smetana 2020-08-27 08:36:56 UTC
Do we have any logs from the node or masters from the time of the failure? It's not possible to guess what is going wrong, though I would suspect an API quota hit: the installer issues more AWS API calls, and it is possible the volume operations are hitting the quota limits.
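
One way to check the throttling theory, assuming CloudTrail is enabled for the account (the region is a placeholder), is to look at recent DetachVolume calls for error codes such as RequestLimitExceeded:

import json
from datetime import datetime, timedelta

import boto3

ct = boto3.client("cloudtrail", region_name="us-east-1")  # placeholder region

resp = ct.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "DetachVolume"}],
    StartTime=datetime.utcnow() - timedelta(hours=2),
    EndTime=datetime.utcnow(),
    MaxResults=50,
)
for ev in resp["Events"]:
    detail = json.loads(ev["CloudTrailEvent"])
    if detail.get("errorCode"):  # "RequestLimitExceeded" would point at rate limiting
        print(ev["EventTime"], detail["errorCode"], detail.get("errorMessage"))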

Comment 3 Hemant Kumar 2020-08-27 20:47:52 UTC
It is possible that this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1866843, which has similar symptoms. Obviously we can't be sure until we see logs or events from the pod.

Comment 4 Naveen Malik 2020-08-27 21:06:30 UTC
@Hemant, that does look like a similar problem.  I will ask the team to provide a must-gather the next time we see it happen.

Comment 5 Hemant Kumar 2020-08-28 02:17:06 UTC
@Naveen - to verify whether this is similar to the bug I linked above, I don't think a must-gather is required. Just the events from the failing pod/PVC should be enough.
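
A minimal sketch of pulling those events with the Python kubernetes client (assumes cluster-admin access and that the monitoring stack runs in the openshift-monitoring namespace):

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Dump the pod and PVC events from the monitoring namespace so they can be attached here.
for ev in v1.list_namespaced_event("openshift-monitoring").items:
    obj = ev.involved_object
    if obj.kind in ("Pod", "PersistentVolumeClaim"):
        print(ev.last_timestamp, obj.kind, obj.name, ev.reason, ev.message)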