Bug 1872842

Summary: Monitoring volumes stuck detaching/attaching
Product: OpenShift Container Platform
Component: Storage
Version: 4.4
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: medium
Status: CLOSED DUPLICATE
Keywords: ServiceDeliveryImpact
Reporter: Naveen Malik <nmalik>
Assignee: Tomas Smetana <tsmetana>
QA Contact: Qin Ping <piqin>
CC: aos-bugs, hekumar, jeder, khnguyen, wking
Flags: nmalik: needinfo-
Type: Bug
Last Closed: 2020-09-08 10:57:17 UTC

Description Naveen Malik 2020-08-26 17:35:24 UTC
Description of problem:
On OSD we have seen monitoring volumes (Prometheus & Alertmanager) stuck detaching or attaching, usually as part of an upgrade.  I don't have a must-gather for the problem right now, but in general, any time a PVC is stuck in these states we should automatically force a detach so things can recover.  SRE has a script to do this quickly, but it is a manual SOP at this time.
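
For reference, a minimal sketch of what such a force detach amounts to on AWS (this is not the actual SRE script; the region and volume ID below are placeholders, and boto3 credentials for the cluster's account are assumed):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region
VOLUME_ID = "vol-0123456789abcdef0"                 # placeholder stuck volume

# Check the current attachment state before forcing anything.
vol = ec2.describe_volumes(VolumeIds=[VOLUME_ID])["Volumes"][0]
for att in vol.get("Attachments", []):
    print(att["InstanceId"], att["State"])  # e.g. stuck in "detaching"

# Force the detach; AWS documents that this can corrupt data if the volume is still in use.
ec2.detach_volume(VolumeId=VOLUME_ID, Force=True)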

Version-Release number of selected component (if applicable):
4.4.11 and earlier

How reproducible:
A couple of clusters per upgrade cycle.

Steps to Reproduce:
1. Configure persistent storage for prom & AM
https://github.com/openshift/managed-cluster-config/blob/master/deploy/cluster-monitoring-config/cluster-monitoring-config.yaml#L8-L36
2. Upgrade cluster

Actual results:
Sometimes the PVC for Prometheus or Alertmanager is stuck attaching/detaching.  This blocks the deployment.
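
A quick way to see the stuck state from the cluster side is to look at the VolumeAttachment objects; a rough sketch with the Python kubernetes client (assumes a cluster-admin kubeconfig is available locally):

from kubernetes import client, config

config.load_kube_config()
storage = client.StorageV1Api()

# List all VolumeAttachments and flag any attach/detach errors.
for va in storage.list_volume_attachment().items:
    status = va.status
    print(va.metadata.name,
          va.spec.source.persistent_volume_name,
          va.spec.node_name,
          "attached=%s" % (status.attached if status else "unknown"))
    if status and status.attach_error:
        print("  attach error:", status.attach_error.message)
    if status and status.detach_error:
        print("  detach error:", status.detach_error.message)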

Expected results:
Stuck volumes are automatically force-detached, allowing the platform to try again.

Additional info:
Experience with this is on AWS; there are no known cases of this for GCP OSD clusters.

Comment 1 Tomas Smetana 2020-08-27 08:36:56 UTC
Do we have any logs from the node or masters from the time of the failure? It's not possible to guess what is going wrong, though I would suspect an API quota hit: the installer issues more AWS API calls, and it is possible the volume operations are hitting the quota limits.
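
One way to check the throttling theory, assuming CloudTrail is enabled for the account (the region is a placeholder), is to look at recent DetachVolume calls for error codes such as RequestLimitExceeded:

import json
from datetime import datetime, timedelta

import boto3

ct = boto3.client("cloudtrail", region_name="us-east-1")  # placeholder region

resp = ct.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "DetachVolume"}],
    StartTime=datetime.utcnow() - timedelta(hours=2),
    EndTime=datetime.utcnow(),
    MaxResults=50,
)
for ev in resp["Events"]:
    detail = json.loads(ev["CloudTrailEvent"])
    if detail.get("errorCode"):  # "RequestLimitExceeded" would point at rate limiting
        print(ev["EventTime"], detail["errorCode"], detail.get("errorMessage"))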

Comment 3 Hemant Kumar 2020-08-27 20:47:52 UTC
It is possible that this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1866843, which has similar symptoms. Obviously we can't be sure until we see logs or events from the pod.

Comment 4 Naveen Malik 2020-08-27 21:06:30 UTC
@Hemant, that does look like a similar problem.  I will ask the team to provide a must-gather the next time we see it happen.

Comment 5 Hemant Kumar 2020-08-28 02:17:06 UTC
@Naveen - to verify whether this is similar to the bug I linked above, I don't think a must-gather is required. Just the events from the failing pod/PVC should be enough.
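
A minimal sketch of pulling those events with the Python kubernetes client (assumes cluster-admin access and that the monitoring stack runs in the openshift-monitoring namespace):

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Dump the pod and PVC events from the monitoring namespace so they can be attached here.
for ev in v1.list_namespaced_event("openshift-monitoring").items:
    obj = ev.involved_object
    if obj.kind in ("Pod", "PersistentVolumeClaim"):
        print(ev.last_timestamp, obj.kind, obj.name, ev.reason, ev.message)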