Bug 1881144 - [CI] 4.6 OpenStack jobs leaking volumes
Summary: [CI] 4.6 OpenStack jobs leaking volumes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Jan Safranek
QA Contact: David Sanz
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-21 15:45 UTC by Martin André
Modified: 2020-10-27 16:43 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:43:35 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Github openshift/kubernetes pull 380 (closed): Bug 1881144: UPSTREAM: 95003: Fail a test on pre-provisioned Cinder volume deletion error (last updated 2020-10-07 14:44:13 UTC)
Github openshift/origin pull 25579 (closed): Bug 1881144: Fix Cinder e2e tests not to leak volumes (last updated 2020-10-07 14:44:21 UTC)
Red Hat Product Errata RHBA-2020:4196 (last updated 2020-10-27 16:43:51 UTC)

Description Martin André 2020-09-21 15:45:07 UTC
Description of problem: There is a large number of orphaned volumes in the Vexxhost CI tenant, not tied to any running cluster. MOC does not seem to have the problem (MOC runs some 4.6 presubmit jobs, while Vexxhost runs the 4.6 periodic jobs plus some presubmits).

It appears the job successfully deleted the cluster but somehow left volumes behind. We need to understand what causes the leaks and fix them.

We've observed the 4.6 jobs leaking about 75 volumes per day since roughly September 16.

We have a quota of 200 volumes on Vexxhost. Hitting the quota causes the jobs to fail with:

Sep 21 13:19:56.303: INFO: cinder output:
ERROR: VolumeLimitExceeded: Maximum number of volumes allowed (200) exceeded for quota 'volumes'. (HTTP 413) (Request-ID: req-55b7a487-8663-4376-836c-9349bf30ea92)
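Putting the reported numbers together: with a 200-volume quota and roughly 75 leaked volumes per day, even a freshly pruned tenant exhausts its quota in under three days. A minimal sketch of that arithmetic (the helper name is hypothetical; the defaults come from the figures in this report):

```python
# Illustrative sketch: estimate how long until the tenant hits its volume
# quota, using the figures reported above (quota of 200 volumes on Vexxhost,
# ~75 volumes leaked per day). The function name is hypothetical.
def days_until_quota(current_volumes: int, quota: int = 200, leak_per_day: float = 75.0) -> float:
    """Days of headroom left before VolumeLimitExceeded, at the observed leak rate."""
    return max(quota - current_volumes, 0) / leak_per_day

if __name__ == "__main__":
    # Even starting from an empty tenant, the quota is exhausted in ~2.7 days.
    print(round(days_until_quota(0), 1))  # → 2.7
```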

Comment 1 Pierre Prinetti 2020-09-22 12:09:12 UTC
MOC does have the same problem; I probably pruned them right before you checked.

Comment 4 W. Trevor King 2020-09-23 21:07:32 UTC
[1], in flight upstream to help with debugging, suggests it may be some time before we have a handle on this.  Leaking volumes is not great, but it also seems unlikely to be severe enough to block 4.6 going GA.  Punting to 4.7; fixes can be backported to 4.6.z.

[1]: https://github.com/kubernetes/kubernetes/pull/95003

Comment 5 Martin André 2020-09-24 07:49:28 UTC
I'm hopeful that [1] is going to fix the volume leak.

The upstream tests remove the volumes using the `cinder delete <volume_name>` command, and this appears to have started failing when we switched the cinder client from the Stein release to Train [2]. The error coming back is:

  Delete for volume e2e-volumemode-9872 failed: Invalid filters all_tenants,name are found in query options

According to Jan, `cinder delete <volume_id>` doesn't yield this error, and he has switched to volume IDs in his patch. It's good practice in OpenStack anyway to use resource IDs rather than names.

[1] https://github.com/kubernetes/kubernetes/pull/95003
[2] https://github.com/openshift/installer/pull/4175
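The delete-by-ID workaround described above can be sketched as follows. This is a hypothetical Python illustration of resolving a volume's ID from the table that `cinder list` prints, not the actual patch (which changed the Go e2e code in kubernetes/kubernetes#95003); the sample table, its fabricated ID, and the parsing helper are assumptions.

```python
# Hypothetical illustration of the fix: resolve the volume ID from its name,
# then delete by ID. Deleting by ID avoids the Train client's
# "Invalid filters all_tenants,name are found in query options" error.
from typing import Optional

# Sample `cinder list`-style table (fabricated ID; the volume name comes
# from the error message quoted above).
SAMPLE_OUTPUT = """\
+--------------------------------------+-----------+---------------------+------+
| ID                                   | Status    | Name                | Size |
+--------------------------------------+-----------+---------------------+------+
| 11111111-2222-3333-4444-555555555555 | available | e2e-volumemode-9872 |    1 |
+--------------------------------------+-----------+---------------------+------+
"""

def volume_id_by_name(table: str, name: str) -> Optional[str]:
    """Return the ID of the first volume whose Name column equals `name`."""
    for line in table.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # Data rows split into ['', <id>, <status>, <name>, <size>, ''];
        # skip the header row and the +---+ separator rows.
        if len(cells) >= 4 and cells[3] == name and cells[1] not in ("", "ID"):
            return cells[1]
    return None

if __name__ == "__main__":
    vol_id = volume_id_by_name(SAMPLE_OUTPUT, "e2e-volumemode-9872")
    print(f"cinder delete {vol_id}")  # delete by ID, not by name
```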

Comment 11 Martin André 2020-10-05 06:16:57 UTC
OpenStack CI is no longer leaking volumes since 2020-10-02T19:23:37.000000. Verified against both MOC and Vexxhost.

Comment 14 errata-xmlrpc 2020-10-27 16:43:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

