Bug 1881144 - [CI] 4.6 OpenStack jobs leaking volumes
Summary: [CI] 4.6 OpenStack jobs leaking volumes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Jan Safranek
QA Contact: David Sanz
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-21 15:45 UTC by Martin André
Modified: 2020-10-27 16:43 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:43:35 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Github openshift/kubernetes pull 380 (closed): Bug 1881144: UPSTREAM: 95003: Fail a test on pre-provisioned Cinder volume deletion error (last updated 2020-10-07 14:44:13 UTC)
Github openshift/origin pull 25579 (closed): Bug 1881144: Fix Cinder e2e tests not to leak volumes (last updated 2020-10-07 14:44:21 UTC)
Red Hat Product Errata RHBA-2020:4196 (last updated 2020-10-27 16:43:51 UTC)

Description Martin André 2020-09-21 15:45:07 UTC
Description of problem: There is a large number of orphaned volumes in the Vexxhost CI tenant, not tied to any running cluster. MOC does not seem to have the problem (MOC runs some 4.6 presubmit jobs, while Vexxhost runs the 4.6 periodic jobs plus some presubmits).

It appears the job successfully deleted the cluster but somehow left volumes behind. We need to understand what causes the leaks and fix them.

We've observed the 4.6 jobs leaking about 75 volumes per day since roughly September 16.

We have a quota of 200 volumes on Vexxhost. Hitting the quota causes the jobs to fail with:

Sep 21 13:19:56.303: INFO: cinder output:
ERROR: VolumeLimitExceeded: Maximum number of volumes allowed (200) exceeded for quota 'volumes'. (HTTP 413) (Request-ID: req-55b7a487-8663-4376-836c-9349bf30ea92)
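Putting the reported numbers together: with a 200-volume quota and roughly 75 leaked volumes per day, even a freshly pruned tenant exhausts its quota in under three days. A minimal sketch of that arithmetic (the helper name is hypothetical; the defaults come from the figures in this report):

```python
# Illustrative sketch: estimate how long until the tenant hits its volume
# quota, using the figures reported above (quota of 200 volumes on Vexxhost,
# ~75 volumes leaked per day). The function name is hypothetical.
def days_until_quota(current_volumes: int, quota: int = 200, leak_per_day: float = 75.0) -> float:
    """Days of headroom left before VolumeLimitExceeded, at the observed leak rate."""
    return max(quota - current_volumes, 0) / leak_per_day

if __name__ == "__main__":
    # Even starting from an empty tenant, the quota is exhausted in ~2.7 days.
    print(round(days_until_quota(0), 1))  # → 2.7
```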

Comment 1 Pierre Prinetti 2020-09-22 12:09:12 UTC
MOC does have the same problem; I probably pruned them right before you checked.

Comment 4 W. Trevor King 2020-09-23 21:07:32 UTC
[1], in flight upstream to help with debugging, suggests it may be some time before we have a handle on this.  Leaking volumes is not great, but it also seems unlikely to be severe enough to block 4.6 going GA.  Punting to 4.7; fixes can be backported to 4.6.z.

[1]: https://github.com/kubernetes/kubernetes/pull/95003

Comment 5 Martin André 2020-09-24 07:49:28 UTC
I'm hopeful that [1] is going to fix the volume leak.

The upstream tests remove the volumes using the `cinder delete <volume_name>` command, and this appears to have started failing when we switched the cinder client from the Stein release to Train [2]. The error coming back is:

  Delete for volume e2e-volumemode-9872 failed: Invalid filters all_tenants,name are found in query options

According to Jan, `cinder delete <volume_id>` doesn't yield this error, and he has switched to volume IDs in his patch. It's good practice in OpenStack anyway to use resource IDs rather than names.

[1] https://github.com/kubernetes/kubernetes/pull/95003
[2] https://github.com/openshift/installer/pull/4175
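The delete-by-ID workaround described above can be sketched as follows. This is a hypothetical Python illustration of resolving a volume's ID from the table that `cinder list` prints, not the actual patch (which changed the Go e2e code in kubernetes/kubernetes#95003); the sample table, its fabricated ID, and the parsing helper are assumptions.

```python
# Hypothetical illustration of the fix: resolve the volume ID from its name,
# then delete by ID. Deleting by ID avoids the Train client's
# "Invalid filters all_tenants,name are found in query options" error.
from typing import Optional

# Sample `cinder list`-style table (fabricated ID; the volume name comes
# from the error message quoted above).
SAMPLE_OUTPUT = """\
+--------------------------------------+-----------+---------------------+------+
| ID                                   | Status    | Name                | Size |
+--------------------------------------+-----------+---------------------+------+
| 11111111-2222-3333-4444-555555555555 | available | e2e-volumemode-9872 |    1 |
+--------------------------------------+-----------+---------------------+------+
"""

def volume_id_by_name(table: str, name: str) -> Optional[str]:
    """Return the ID of the first volume whose Name column equals `name`."""
    for line in table.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # Data rows split into ['', <id>, <status>, <name>, <size>, ''];
        # skip the header row and the +---+ separator rows.
        if len(cells) >= 4 and cells[3] == name and cells[1] not in ("", "ID"):
            return cells[1]
    return None

if __name__ == "__main__":
    vol_id = volume_id_by_name(SAMPLE_OUTPUT, "e2e-volumemode-9872")
    print(f"cinder delete {vol_id}")  # delete by ID, not by name
```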

Comment 11 Martin André 2020-10-05 06:16:57 UTC
OpenStack CI is no longer leaking volumes since 2020-10-02T19:23:37.000000. Verified against both MOC and Vexxhost.

Comment 14 errata-xmlrpc 2020-10-27 16:43:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

