Description of problem:
On OSD we have seen monitoring volumes (Prometheus & Alertmanager) stuck detaching or attaching, usually as part of an upgrade. I don't have a must-gather for the problem right now, but in general any time a PVC is stuck in these states we should automatically force a detach so things can recover. SRE has a script to do this quickly, but it is a manual SOP at this time.

Version-Release number of selected component (if applicable):
4.4.11 and earlier

How reproducible:
A couple of clusters per upgrade cycle.

Steps to Reproduce:
1. Configure persistent storage for Prometheus & Alertmanager: https://github.com/openshift/managed-cluster-config/blob/master/deploy/cluster-monitoring-config/cluster-monitoring-config.yaml#L8-L36
2. Upgrade the cluster.

Actual results:
Sometimes the PVC for Prometheus or Alertmanager is stuck attaching/detaching. This blocks deployment.

Expected results:
Stuck volumes are automatically force detached, allowing the platform to try again.

Additional info:
Experience with this is on AWS; there are no known cases of this on GCP OSD clusters.
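The SRE SOP itself is not attached to this bug, but a minimal sketch of what such a force-detach might look like is below. All object names are placeholders (they assume the stuck volume belongs to the Prometheus PVC in openshift-monitoring), and with DRY_RUN=1 (the default) the commands are only printed, not run against a cluster:

```shell
#!/usr/bin/env bash
# Hypothetical force-detach sketch; not the actual SRE script.
# PVC_NAMESPACE / PVC_NAME / VOLUME_ATTACHMENT_NAME are placeholders.
set -euo pipefail

PVC_NAMESPACE="${PVC_NAMESPACE:-openshift-monitoring}"
PVC_NAME="${PVC_NAME:-prometheus-k8s-db-prometheus-k8s-0}"
DRY_RUN="${DRY_RUN:-1}"

# Print commands instead of executing them unless DRY_RUN=0.
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# 1. Find the PV bound to the stuck PVC.
run oc get pvc "${PVC_NAME}" -n "${PVC_NAMESPACE}" \
  -o jsonpath='{.spec.volumeName}'

# 2. List VolumeAttachments and find the one referencing that PV.
run oc get volumeattachments

# 3. Delete the matching VolumeAttachment so the attach/detach
#    controller drops the stale attachment and retries.
run oc delete volumeattachment "${VOLUME_ATTACHMENT_NAME:-<attachment-name>}"
```

Deleting the VolumeAttachment object is one way to force the controller to reconcile; the real SOP may differ.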
Do we have any logs from the node or masters from the time of the failure? It's not possible to guess what is going wrong without them, though I would suspect an API quota hit: the installer issues many AWS API calls, and it is possible the volume operations are getting throttled by AWS API quota limits.
It is possible that this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1866843, which has similar symptoms. Obviously we can't be sure until we see logs or events from the pod.
@Hemant, that does look like a similar problem. I will ask the team to provide a must-gather the next time we see it happen.
@Naveen - to verify whether this is similar to the bug I linked above, I don't think a must-gather is required. The events from the failing pod/PVC should be enough.
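For reference, commands along these lines would capture the requested events. The namespace and object names are assumptions based on the OSD monitoring stack, and the script only prints the commands (remove the echo to run them against a live cluster):

```shell
#!/usr/bin/env bash
# Illustrative event-collection sketch; names are placeholders, not
# taken from a specific failing cluster.
set -euo pipefail

NS="openshift-monitoring"
PVC="prometheus-k8s-db-prometheus-k8s-0"   # placeholder stuck PVC
POD="prometheus-k8s-0"                     # placeholder pod mounting it

# Printed rather than executed so the sketch works without a cluster.
echo "oc describe pvc ${PVC} -n ${NS}"
echo "oc describe pod ${POD} -n ${NS}"
echo "oc get events -n ${NS} --sort-by=.lastTimestamp"
```

`oc describe` includes the Events section for each object, which should show the attach/detach errors Hemant is asking about.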