Bug 1702757

Summary: Image pruning on api.ci is wedged due to too many images
Product: OpenShift Container Platform
Reporter: Adam Kaplan <adam.kaplan>
Component: ImageStreams
Assignee: Oleg Bulatov <obulatov>
Status: CLOSED ERRATA
QA Contact: XiuJuan Wang <xiuwang>
Severity: low
Docs Contact:
Priority: unspecified
Version: 3.11.0
CC: adam.kaplan, aos-bugs, ccoleman, jokerman, mmccomas, obulatov, wzheng, xiuwang
Target Milestone: ---   
Target Release: 4.2.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: the pruner fetched all images in a single request. Consequence: that request took too long and timed out. Fix: use a pager to fetch the images in chunks. Result: the pruner can retrieve all images without hitting the timeout.
Story Points: ---
Clone Of: 1702346
: 1710561 (view as bug list)
Environment:
Last Closed: 2019-10-16 06:28:06 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1702346    
Bug Blocks: 1710561    

Description Adam Kaplan 2019-04-24 17:07:25 UTC
+++ This bug was initially created as a clone of Bug #1702346 +++

Approximately 55 days ago, image pruning stopped working on api.ci, probably due either to a transient failure or to hitting some limit.

The current error is

oc logs jobs/image-pruner-clayton-debug
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get images.image.openshift.io)

We have 200k images on the cluster, and it looks like the images call times out trying to load them all into memory. When I ran a paged call (locally with oc get) it took several minutes and we eventually hit the compaction window, so I was unable to complete it. Testing locally to see the size.

The cluster needs to be able to prune, and we will need to take action to get it back under the threshold.  We then need to ensure that this failure mode doesn't happen in the future.
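
For reference, the same list can also be read in pages from the CLI. A hypothetical invocation (--chunk-size is a standard kubectl/oc flag; 500 is an arbitrary page size), each request then returns at most one page instead of the whole 200k-image list:

$ oc get images --chunk-size=500 --no-headers | wc -l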

--- Additional comment from Clayton Coleman on 2019-04-23 17:33:29 UTC ---

Testing against the API server directly, the full images list was 39M of JSON and took about 3m20s to retrieve. Compaction is set to 5m, so we are close to being unable to read all images within the compaction window.

Comment 1 Adam Kaplan 2019-04-24 17:09:28 UTC
PR: https://github.com/openshift/origin/pull/22655
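
For illustration only (the actual change is in the PR above): the "use a pager to fetch the images in chunks" fix described in the Doc Text amounts to replacing one huge List call with client-go's pager, which issues a series of limit/continue requests so no single response has to carry all 200k images. A minimal standalone sketch, assuming a recent openshift/client-go where List takes a context; this is not the pruner's real code, and the client setup is hypothetical:

package main

import (
	"context"
	"fmt"

	imagev1 "github.com/openshift/api/image/v1"
	imageclient "github.com/openshift/client-go/image/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/pager"
)

func main() {
	// Hypothetical standalone client setup; the real pruner gets its clients from oc.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := imageclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// The pager turns one huge List into a series of limit/continue requests,
	// so each response stays small and completes well inside the compaction window.
	p := pager.New(pager.SimplePageFunc(func(opts metav1.ListOptions) (runtime.Object, error) {
		return client.ImageV1().Images().List(context.TODO(), opts)
	}))

	count := 0
	err = p.EachListItem(context.TODO(), metav1.ListOptions{Limit: 500}, func(obj runtime.Object) error {
		_ = obj.(*imagev1.Image) // here the pruner would add the image to its graph
		count++
		return nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("listed %d images\n", count)
}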

Comment 2 Adam Kaplan 2019-04-24 17:12:43 UTC
Backport request to 3.11: https://bugzilla.redhat.com/show_bug.cgi?id=1702346

Comment 4 XiuJuan Wang 2019-06-25 08:34:35 UTC
The pruning operation over 200K images could be completed in a pod with the 4.2 version (4.2.0-0.nightly-2019-06-25-003324).
$ ./oc version 
Client Version: version.Info{Major:"4", Minor:"2+", GitVersion:"v4.2.0-201906241832+7a0a2f2-dirty", GitCommit:"7a0a2f2", GitTreeState:"dirty", BuildDate:"2019-06-24T23:20:08Z", GoVersion:"go1.12.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.0+952fea3", GitCommit:"952fea3", GitTreeState:"clean", BuildDate:"2019-06-24T23:20:31Z", GoVersion:"go1.12.6", Compiler:"gc", Platform:"linux/amd64"}

$ time ./oc get images  | wc -l 
200210

real	2m55.400s
user	0m14.086s
sys	0m3.886s

$./oc adm prune images --registry-url=default-route-openshift-image-registry.apps.xiuwang-42-largeimages.qe.devcluster.openshift.com --certificate-authority=ca.crt --all --loglevel=8 2>> 20kimageprune-2.log >> 20kimageprune-2.log
===========================snip==============================
I0625 08:14:38.409367     174 round_trippers.go:423] Request Headers:
I0625 08:14:38.409374     174 round_trippers.go:426]     Accept: application/json, */*
I0625 08:14:38.409381     174 round_trippers.go:426]     User-Agent: oc/v1.14.0+7a0a2f2 (linux/amd64) kubernetes/7a0a2f2
I0625 08:14:38.409388     174 round_trippers.go:426]     Authorization: Bearer 9QZf_Y30gHVBa1FW6eqdp7124c1i7nr_fxlytnV6o88
I0625 08:14:39.945956     174 round_trippers.go:441] Response Status: 200 OK in 1536 milliseconds
I0625 08:14:39.945993     174 round_trippers.go:444] Response Headers:
I0625 08:14:39.945999     174 round_trippers.go:447]     Content-Type: application/json
I0625 08:14:39.946012     174 round_trippers.go:447]     Date: Tue, 25 Jun 2019 08:14:39 GMT
I0625 08:14:39.946016     174 round_trippers.go:447]     Audit-Id: f8b1c280-c595-457f-b2dc-8c9cdefaeb56
I0625 08:14:39.946021     174 round_trippers.go:447]     Cache-Control: no-store
I0625 08:14:39.946027     174 round_trippers.go:447]     Cache-Control: no-store
I0625 08:14:39.996312     174 request.go:942] Response Body: {"kind":"ImageStreamList","apiVersion":"image.openshift.io/v1","metadata":{"selfLink":"/apis/image.openshift.io/v1/imagestreams","resourceVersion":"272039"},"items":[{"metadata":{"name":"apicast-gateway","namespace":"openshift","selfLink":"/apis/image.openshift.io/v1/namespaces/openshift/imagestreams/apicast-gateway","uid":"23493dea-96fd-11e9-825f-0a580a820014","resourceVersion":"8161","generation":2,"creationTimestamp":"2019-06-25T03:55:57Z","labels":{"samples.operator.openshift.io/managed":"true"},"annotations":{"openshift.io/display-name":"3scale APIcast API Gateway","openshift.io/image.dockerRepositoryCheck":"2019-06-25T03:56:14Z","samples.operator.openshift.io/version":"4.2.0-0.nightly-2019-06-25-003324"}},"spec":{"lookupPolicy":{"local":false},"tags":[{"name":"2.1.0.GA","annotations":{"description":"3scale's APIcast is an NGINX based API gateway used to integrate your internal and external API services with 3scale's API Management Platform. It supports OpenID connect to integrate with external Identity  [truncated 283646 chars]
I0625 08:14:40.001577     174 prune.go:277] Creating image pruner with keepYoungerThan=1h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
I0625 08:14:40.001604     174 prune.go:356] Adding image "sha256:0089883f8e4387618946cd24378a447b8cf7e5dfaa146b94acab27fc5e170a14" to graph
I0625 08:14:40.001842     174 prune.go:378] Adding image layer "sha256:26e5ed6899dbf4b1e93e0898255e8aaf43465cecd3a24910f26edb5d43dafa3c" to graph
===================================snip=====================================

Comment 5 errata-xmlrpc 2019-10-16 06:28:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922