Bug 1702346 - [3.11] Image pruning on api.ci is wedged due to too many images
Summary: [3.11] Image pruning on api.ci is wedged due to too many images
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: ImageStreams
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.11.z
Assignee: Oleg Bulatov
QA Contact: XiuJuan Wang
URL:
Whiteboard:
Depends On:
Blocks: 1702757 1710561
 
Reported: 2019-04-23 14:31 UTC by Clayton Coleman
Modified: 2019-09-03 15:56 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: the pruner was fetching all images in a single request. Consequence: the request took too long and timed out. Fix: use a pager to fetch images in chunks (see the sketch under the Description below). Result: the pruner can retrieve all images without hitting the timeout.
Clone Of:
Clones: 1702757
Environment:
Last Closed: 2019-09-03 15:56:02 UTC
Target Upstream Version:


Attachments: None


Links:
GitHub openshift/origin pull 23617 (closed): [release-3.11] Bug 1702346: use pager to get images for pruning (last updated 2020-05-29 14:21:20 UTC)
Red Hat Product Errata RHBA-2019:2580 (last updated 2019-09-03 15:56:21 UTC)

Internal Links: 1702757 1710561

Description Clayton Coleman 2019-04-23 14:31:54 UTC
Approximately 55 days ago, image pruning stopped working on api.ci, probably due to either a transient failure or hitting some limit.

The current error is:

oc logs jobs/image-pruner-clayton-debug
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get images.image.openshift.io)

We have 200k images on the cluster, and it looks like the images call times out trying to load them all into memory. When I ran a paged call locally with oc get, it took several minutes and eventually hit the compaction window, so I was unable to complete the listing. Still testing locally to gauge the total size.

The cluster needs to be able to prune, and we will need to take action to get it back under the threshold.  We then need to ensure that this failure mode doesn't happen in the future.
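
The fix that merged (linked PR 23617) moves the pruner onto a pager so images are fetched in chunks. Below is a minimal sketch of the mechanism, paged LIST requests via the Limit and Continue fields of ListOptions, assuming a 3.11-era openshift/client-go where List takes only ListOptions; the package, helper name, and chunk size are illustrative, not the exact upstream code.

package prune

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	imagev1 "github.com/openshift/api/image/v1"
	imageclient "github.com/openshift/client-go/image/clientset/versioned"
)

// listAllImages fetches every Image in chunks of 500 instead of one
// giant List call. Each loop iteration is an independent request
// (?limit=500&continue=<token>), so no single request has to serialize
// all 200k images before the server-side timeout fires.
func listAllImages(client imageclient.Interface) ([]imagev1.Image, error) {
	var images []imagev1.Image
	opts := metav1.ListOptions{Limit: 500}
	for {
		list, err := client.ImageV1().Images().List(opts)
		if err != nil {
			return nil, err
		}
		images = append(images, list.Items...)
		if list.Continue == "" {
			// An empty continue token marks the last chunk.
			return images, nil
		}
		opts.Continue = list.Continue
	}
}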

Comment 1 Clayton Coleman 2019-04-23 17:33:29 UTC
Testing directly against the API server, the full images list was 39M of JSON and took about 3m20s to retrieve. Compaction is set to 5m, so we are close to the point of being unable to read all images before the compaction window closes.
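
Why the compaction window matters for a paged read: the continue token pins the list to the resource version of the first chunk, and once etcd compacts that revision the server answers 410 Gone ("too old resource version") and the listing must restart. Below is a hypothetical retry wrapper around the listAllImages helper sketched above (same package and imports, plus the ones shown); the attempt count is arbitrary.

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// listAllImagesWithRetry restarts the paged list from scratch when its
// continue token expires mid-flight. apierrors.IsResourceExpired matches
// the 410 Gone / "too old resource version" status returned after
// compaction invalidates the token.
func listAllImagesWithRetry(client imageclient.Interface) ([]imagev1.Image, error) {
	const attempts = 3 // arbitrary; each restart reads from a fresh snapshot
	var lastErr error
	for i := 0; i < attempts; i++ {
		images, err := listAllImages(client)
		if err == nil {
			return images, nil
		}
		if !apierrors.IsResourceExpired(err) {
			return nil, err
		}
		lastErr = err
	}
	return nil, fmt.Errorf("paged list kept expiring against the compaction window: %v", lastErr)
}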

Comment 4 XiuJuan Wang 2019-08-26 12:02:50 UTC
Verified this with:
./oc version
oc v3.11.141
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ec2-54-80-207-203.compute-1.amazonaws.com:443
openshift v3.11.141
kubernetes v1.11.0+d4cacc0

Could prune 200k images from a pod without hitting the timeout.

cat 20kimageprune-1.log | grep ImageStreamList
I0826 11:57:35.595904     148 request.go:897] Response Body: {"kind":"ImageStreamList","apiVersion":"image.openshift.io/v1","metadata":{"selfLink":"/apis/image.openshift.io/v1/imagestreams","resourceVersion":"246416"},"items":[{"metadata":{"name":"nodejs-mongodb-example","namespace":"install-test","selfLink":"/apis/image.openshift.io/v1/namespaces/install-test/imagestreams/nodejs-mongodb-example","uid":"0ffb7d02-c7d5-11e9-b89e-0e8918b91460","resourceVersion":"5912","generation":1,"creationTimestamp":"2019-08-26T07:42:32Z","labels":{"app":"nodejs-mongodb-example","template":"nodejs-mongodb-example"},"annotations":{"description":"Keeps track of changes in the application image","openshift.io/generated-by":"OpenShiftNewApp"}},"spec":{"lookupPolicy":{"local":false}},"status":{"dockerImageRepository":"docker-registry.default.svc:5000/install-test/nodejs-mongodb-example","tags":[{"tag":"latest","items":[{"created":"2019-08-26T07:43:18Z","dockerImageReference":"docker-registry.default.svc:5000/install-test/nodejs-mongodb-example@sha256:e2059198fbc704c5bbd3b672482d0f6fada954e5 [truncated 300040 chars]

Comment 6 errata-xmlrpc 2019-09-03 15:56:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2580

