Bug 1702346

Summary: [3.11] Image pruning on api.ci is wedged due to too many images
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: ImageStreams
Assignee: Oleg Bulatov <obulatov>
Status: CLOSED ERRATA
QA Contact: XiuJuan Wang <xiuwang>
Severity: urgent
Priority: unspecified
Version: 3.11.0
CC: adam.kaplan, aos-bugs, jokerman, mmccomas, wzheng
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Cause: the pruner fetched all images in a single request. Consequence: the request took too long and timed out. Fix: use a pager to fetch the images in chunks. Result: the pruner can retrieve all images without hitting the timeout.
Story Points: ---
Clones: 1702757
Last Closed: 2019-09-03 15:56:02 UTC
Type: Bug
Regression: ---
Bug Blocks: 1702757, 1710561

Description Clayton Coleman 2019-04-23 14:31:54 UTC
Approximately 55 days ago image pruning stopped working on api.ci, probably due either to a transient failure or to hitting some limit.

The current error is

oc logs jobs/image-pruner-clayton-debug
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get images.image.openshift.io)

We have 200k images on the cluster, and it looks like the images call times out trying to load them all into memory. When I ran a paged call (locally with oc get) it took several minutes, and we eventually hit the compaction window, so I was unable to complete it. I'm testing locally to determine the response size.

The cluster needs to be able to prune, and we will need to take action to get it back under the threshold.  We then need to ensure that this failure mode doesn't happen in the future.

Comment 1 Clayton Coleman 2019-04-23 17:33:29 UTC
Testing against the API server directly, the full images list was 39M of JSON and took about 3m20s to retrieve. Compaction is set to 5m, so we are close to being "unable to read all images before the compaction window".
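The eventual fix replaces the single monolithic list with chunked requests: ask the server for a bounded page, then follow its continue token until it is exhausted, so no individual call can outlast the request timeout. The real pruner does this in Go via client-go's pager; the following is only a minimal Python sketch of the pattern, and the names (FakeImageAPI, list_all_paged) are hypothetical stand-ins, not the actual OpenShift API:

```python
class FakeImageAPI:
    """Stand-in for the images.image.openshift.io list endpoint."""

    def __init__(self, total):
        self.items = [f"sha256:{i:064x}" for i in range(total)]

    def list(self, limit, cont=0):
        """Return one page of at most `limit` items plus a continue token
        (None once the collection is exhausted), mimicking Kubernetes
        limit/continue list chunking."""
        page = self.items[cont:cont + limit]
        next_cont = cont + limit if cont + limit < len(self.items) else None
        return page, next_cont


def list_all_paged(api, chunk_size=500):
    """Fetch every image in bounded chunks instead of one huge request,
    so no single call can exceed the server-side timeout."""
    images, cont = [], 0
    while cont is not None:
        page, cont = api.list(chunk_size, cont)
        images.extend(page)
    return images


api = FakeImageAPI(total=2000)
assert len(list_all_paged(api)) == 2000
```

With a chunk size of 500, the 200k images above would come back in roughly 400 small responses rather than one 39M payload that races the compaction window.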

Comment 4 XiuJuan Wang 2019-08-26 12:02:50 UTC
Verified this with:
./oc version  
oc v3.11.141
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ec2-54-80-207-203.compute-1.amazonaws.com:443
openshift v3.11.141
kubernetes v1.11.0+d4cacc0

Could prune 200K images without timeout in a pod.

cat 20kimageprune-1.log | grep  ImageStreamLi
I0826 11:57:35.595904     148 request.go:897] Response Body: {"kind":"ImageStreamList","apiVersion":"image.openshift.io/v1","metadata":{"selfLink":"/apis/image.openshift.io/v1/imagestreams","resourceVersion":"246416"},"items":[{"metadata":{"name":"nodejs-mongodb-example","namespace":"install-test","selfLink":"/apis/image.openshift.io/v1/namespaces/install-test/imagestreams/nodejs-mongodb-example","uid":"0ffb7d02-c7d5-11e9-b89e-0e8918b91460","resourceVersion":"5912","generation":1,"creationTimestamp":"2019-08-26T07:42:32Z","labels":{"app":"nodejs-mongodb-example","template":"nodejs-mongodb-example"},"annotations":{"description":"Keeps track of changes in the application image","openshift.io/generated-by":"OpenShiftNewApp"}},"spec":{"lookupPolicy":{"local":false}},"status":{"dockerImageRepository":"docker-registry.default.svc:5000/install-test/nodejs-mongodb-example","tags":[{"tag":"latest","items":[{"created":"2019-08-26T07:43:18Z","dockerImageReference":"docker-registry.default.svc:5000/install-test/nodejs-mongodb-example@sha256:e2059198fbc704c5bbd3b672482d0f6fada954e5 [truncated 300040 chars]

Comment 6 errata-xmlrpc 2019-09-03 15:56:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.