Bug 1872842 - Monitoring volumes stuck detaching/attaching
Summary: Monitoring volumes stuck detaching/attaching
Keywords:
Status: CLOSED DUPLICATE of bug 1866843
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Tomas Smetana
QA Contact: Qin Ping
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-08-26 17:35 UTC by Naveen Malik
Modified: 2020-09-24 20:16 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-08 10:57:17 UTC
Target Upstream Version:
Embargoed:
nmalik: needinfo-


Description Naveen Malik 2020-08-26 17:35:24 UTC
Description of problem:
On OSD we have seen monitoring volumes (Prometheus and Alertmanager (AM)) stuck detaching or attaching, usually during an upgrade. I don't have a must-gather for the problem right now, but in general, any time a PVC is stuck in one of these states, the platform should automatically force a detach so things can recover. SRE has a script to do this quickly, but it is currently a manual SOP.

Version-Release number of selected component (if applicable):
4.4.11 and earlier

How reproducible:
A couple of clusters per upgrade cycle.

Steps to Reproduce:
1. Configure persistent storage for Prometheus and Alertmanager (AM)
https://github.com/openshift/managed-cluster-config/blob/master/deploy/cluster-monitoring-config/cluster-monitoring-config.yaml#L8-L36
2. Upgrade cluster
3.

Actual results:
Sometimes the PVC for Prometheus or AM is stuck attaching/detaching. This blocks the deployment.

Expected results:
Stuck volumes are automatically force detached, allowing the platform to try again.

Additional info:
Experience with this is on AWS; there are no known cases on GCP OSD clusters.
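
For illustration only, the force detach requested above could look roughly like the following on AWS. This is a sketch under stated assumptions, not the SRE script mentioned in the description: the function names, the 15-minute wait and the "poll twice" definition of "stuck" are all hypothetical. It uses the boto3 EC2 client calls describe_volumes and detach_volume (with Force=True) and treats an attachment as stuck when it is in the same "attaching" or "detaching" state in two polls.

```python
import time

# Attachment states that should only be seen briefly on a healthy volume.
TRANSITIONAL_STATES = {"attaching", "detaching"}


def transitional_attachments(volumes):
    """Return {(volume_id, instance_id, state)} for attachments in a transitional state.

    `volumes` is the "Volumes" list of an EC2 DescribeVolumes response.
    """
    found = set()
    for volume in volumes:
        for attachment in volume.get("Attachments", []):
            if attachment.get("State") in TRANSITIONAL_STATES:
                found.add((volume["VolumeId"], attachment["InstanceId"], attachment["State"]))
    return found


def stuck_attachments(first_poll, second_poll):
    """Attachments seen in the same transitional state in both polls."""
    return transitional_attachments(first_poll) & transitional_attachments(second_poll)


def force_detach_stuck(region, wait_seconds=900, dry_run=True):
    """Poll twice, `wait_seconds` apart, and force detach whatever is still stuck.

    The 900 second default is an arbitrary assumption, not a recommended value.
    """
    import boto3  # third-party; imported here so the helpers above work without it

    ec2 = boto3.client("ec2", region_name=region)

    def poll():
        volumes = []
        for page in ec2.get_paginator("describe_volumes").paginate():
            volumes.extend(page["Volumes"])
        return volumes

    first = poll()
    time.sleep(wait_seconds)
    stuck = stuck_attachments(first, poll())
    for volume_id, instance_id, state in sorted(stuck):
        print(f"{volume_id} on {instance_id} stuck {state}")
        if not dry_run:
            # Force=True can lose data that the instance has not flushed yet.
            ec2.detach_volume(VolumeId=volume_id, InstanceId=instance_id, Force=True)
    return stuck
```

Calling force_detach_stuck("us-east-1") with the default dry_run=True only reports candidates. Since a forced detach can lose unflushed data or corrupt the file system, a real automation would probably want additional checks before passing dry_run=False.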

Comment 1 Tomas Smetana 2020-08-27 08:36:56 UTC
Do we have any logs from the node or masters from the time of the failure? It's not possible to guess what is going wrong, though I would suspect an API quota hit: the installer issues more AWS API calls, and it is possible the volume operations get hit by quota limits.
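
If logs do turn up, one quick way to test the quota theory is to look for the EC2 throttling error code, RequestLimitExceeded, in them. A minimal sketch, assuming plain-text log lines; the function name is hypothetical and no particular log format is implied:

```python
def throttling_lines(log_lines, marker="RequestLimitExceeded"):
    """Return the log lines that mention the EC2 API throttling error code."""
    return [line.rstrip("\n") for line in log_lines if marker in line]
```

For example, throttling_lines(open("controller.log")) would return the matching lines from a saved log; the file name here is only a placeholder.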

Comment 3 Hemant Kumar 2020-08-27 20:47:52 UTC
It is possible that this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1866843, which has similar symptoms. Obviously we can't be sure until we see logs or events from the Pod.

Comment 4 Naveen Malik 2020-08-27 21:06:30 UTC
@Hemant that does look like a similar problem. I will ask the team to provide a must-gather the next time it happens.

Comment 5 Hemant Kumar 2020-08-28 02:17:06 UTC
@Naveen - to verify whether this is similar to the bug I linked above, I don't think a must-gather is required. Events from the failing pod/PVC should be enough.
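
As a rough illustration of pulling out just those events, the sketch below filters the JSON produced by something like `oc get events -n openshift-monitoring -o json` for the Kubernetes event reasons FailedAttachVolume and FailedMount. The function name is hypothetical, and the namespace and the choice of reasons are assumptions; other reasons may be relevant too.

```python
import json

# Event reasons that usually accompany volume attach/mount problems (assumed set).
VOLUME_EVENT_REASONS = {"FailedAttachVolume", "FailedMount"}


def volume_failure_events(event_list_json):
    """Return (kind, name, reason, message) for volume-related failure events.

    `event_list_json` is the JSON text of a Kubernetes event list.
    """
    hits = []
    for event in json.loads(event_list_json).get("items", []):
        if event.get("reason") in VOLUME_EVENT_REASONS:
            obj = event.get("involvedObject", {})
            hits.append((obj.get("kind"), obj.get("name"), event["reason"], event.get("message", "")))
    return hits
```

The returned tuples can be pasted into the bug as-is; they carry the object, the reason and the full message, which is typically what is needed to compare against the linked bug.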

