Bug 1881144

Summary: [CI] 4.6 OpenStack jobs leaking volumes
Product: OpenShift Container Platform
Reporter: Martin André <m.andre>
Component: Installer
Assignee: Jan Safranek <jsafrane>
Installer sub component: OpenShift on OpenStack
QA Contact: David Sanz <dsanzmor>
Status: CLOSED ERRATA
Severity: urgent
Priority: high
CC: adduarte, jsafrane, pprinett, wking
Version: 4.6
Keywords: UpcomingSprint
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-10-27 16:43:35 UTC
Type: Bug

Description Martin André 2020-09-21 15:45:07 UTC
Description of problem: There is a large number of orphaned volumes in the Vexxhost CI tenant, not tied to any running cluster. MOC does not seem to have the problem (MOC runs some 4.6 presubmit jobs, while Vexxhost runs the 4.6 periodic jobs plus some presubmits).

It appears the job successfully deleted the cluster but somehow left volumes behind. We need to understand what causes the leaks and fix them.

We've observed the 4.6 jobs leaking about 75 volumes per day since roughly September 16.

We have a quota of 200 volumes on Vexxhost. Hitting the quota causes the jobs to fail with:

Sep 21 13:19:56.303: INFO: cinder output:
ERROR: VolumeLimitExceeded: Maximum number of volumes allowed (200) exceeded for quota 'volumes'. (HTTP 413) (Request-ID: req-55b7a487-8663-4376-836c-9349bf30ea92)
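As a hedged illustration (not part of the actual CI tooling), the headroom remaining under the volume quota can be checked by parsing the absolute limits that the openstack CLI reports. The limit names used below (`maxTotalVolumes`, `totalVolumesUsed`) are the standard Cinder absolute-limit keys, but verify them against your client version:

```shell
# Sketch: compute remaining volume-quota headroom for the current tenant.
# volume_headroom reads `Name Value` pairs on stdin and prints max - used.
volume_headroom() {
  awk '$1 == "totalVolumesUsed" { used = $2 }
       $1 == "maxTotalVolumes"  { max  = $2 }
       END { print max - used }'
}

# Usage (assumes openstack CLI with credentials sourced):
#   openstack limits show --absolute -f value -c Name -c Value | volume_headroom
```

With ~75 leaked volumes a day, this headroom would reach zero within a couple of days of CI runs, which matches the VolumeLimitExceeded failures above.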

Comment 1 Pierre Prinetti 2020-09-22 12:09:12 UTC
MOC does have the same problem; I probably pruned them right before you checked.

Comment 4 W. Trevor King 2020-09-23 21:07:32 UTC
[1] in flight upstream to help with debugging suggests it may be some time before we have a handle on this.  Leaking volumes is not great, but also seems unlikely to be severe enough to block 4.6 going GA.  Punting to 4.7, and fixes can be backported to 4.6.z.

[1]: https://github.com/kubernetes/kubernetes/pull/95003

Comment 5 Martin André 2020-09-24 07:49:28 UTC
I'm hopeful that [1] is going to fix the volume leak.

The upstream tests remove volumes with the `cinder delete <volume_name>` command, which appears to have started failing when we switched the cinder client from the Stein release to Train [2]. The error coming back is:

  Delete for volume e2e-volumemode-9872 failed: Invalid filters all_tenants,name are found in query options

According to Jan, `cinder delete <volume_id>` doesn't produce this error, and he has switched to volume IDs in his patch. It's good practice in OpenStack anyway to reference resources by ID rather than by name.

[1] https://github.com/kubernetes/kubernetes/pull/95003
[2] https://github.com/openshift/installer/pull/4175
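The fix direction described above (resolve the test volume's name to its ID first, then delete by ID) can be sketched as follows. This is an illustrative shell sketch, not the actual e2e test code; `name_to_ids` is a hypothetical helper:

```shell
# Sketch: resolve a volume name to its ID(s), since deleting by name
# is rejected by the Train cinderclient ("Invalid filters all_tenants,name").
# name_to_ids reads `ID Name` pairs on stdin and prints IDs whose
# name exactly matches $1 (names are not unique in OpenStack, so
# several IDs may come back).
name_to_ids() {
  awk -v name="$1" '$2 == name { print $1 }'
}

# Usage (assumes openstack/cinder CLI with credentials sourced):
#   openstack volume list -f value -c ID -c Name \
#     | name_to_ids e2e-volumemode-9872 \
#     | xargs -r -n1 cinder delete
```

Deleting by ID also sidesteps the ambiguity of duplicate names, which is one reason IDs are preferred for OpenStack resource operations.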

Comment 11 Martin André 2020-10-05 06:16:57 UTC
OpenStack CI is no longer leaking volumes since 2020-10-02T19:23:37.000000. Verified against both MOC and Vexxhost.

Comment 14 errata-xmlrpc 2020-10-27 16:43:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196