Bug 1702346 - [3.11] Image pruning on api.ci is wedged due to too many images
Summary: [3.11] Image pruning on api.ci is wedged due to too many images
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: ImageStreams
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.11.z
Assignee: Oleg Bulatov
QA Contact: XiuJuan Wang
URL:
Whiteboard:
Depends On:
Blocks: 1702757 1710561
 
Reported: 2019-04-23 14:31 UTC by Clayton Coleman
Modified: 2019-09-03 15:56 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: the pruner was fetching all images in a single request. Consequence: the request took too long and timed out. Fix: use a pager to fetch images in chunks (see the sketch under the Description below). Result: the pruner can retrieve all images without hitting the timeout.
Clone Of:
Clones: 1702757
Environment:
Last Closed: 2019-09-03 15:56:02 UTC
Target Upstream Version:


Attachments: None


Links:
GitHub openshift/origin pull 23617 (closed): [release-3.11] Bug 1702346: use pager to get images for pruning (last updated 2020-05-29 14:21:20 UTC)
Red Hat Product Errata RHBA-2019:2580 (last updated 2019-09-03 15:56:21 UTC)

Internal Links: 1702757 1710561

Description Clayton Coleman 2019-04-23 14:31:54 UTC
Approximately 55 days ago, image pruning stopped working on api.ci, probably due to either a transient failure or hitting some limit.

The current error is:

oc logs jobs/image-pruner-clayton-debug
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get images.image.openshift.io)

We have 200k images on the cluster, and it looks like the images call times out trying to load them all into memory. When I ran a paged call locally with oc get, it took several minutes and eventually hit the compaction window, so I was unable to complete the listing. Still testing locally to gauge the total size.

The cluster needs to be able to prune, and we will need to take action to get it back under the threshold.  We then need to ensure that this failure mode doesn't happen in the future.
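
The fix that merged (linked PR 23617) moves the pruner onto a pager so images are fetched in chunks. Below is a minimal sketch of the mechanism, paged LIST requests via the Limit and Continue fields of ListOptions, assuming a 3.11-era openshift/client-go where List takes only ListOptions; the package, helper name, and chunk size are illustrative, not the exact upstream code.

package prune

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	imagev1 "github.com/openshift/api/image/v1"
	imageclient "github.com/openshift/client-go/image/clientset/versioned"
)

// listAllImages fetches every Image in chunks of 500 instead of one
// giant List call. Each loop iteration is an independent request
// (?limit=500&continue=<token>), so no single request has to serialize
// all 200k images before the server-side timeout fires.
func listAllImages(client imageclient.Interface) ([]imagev1.Image, error) {
	var images []imagev1.Image
	opts := metav1.ListOptions{Limit: 500}
	for {
		list, err := client.ImageV1().Images().List(opts)
		if err != nil {
			return nil, err
		}
		images = append(images, list.Items...)
		if list.Continue == "" {
			// An empty continue token marks the last chunk.
			return images, nil
		}
		opts.Continue = list.Continue
	}
}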

Comment 1 Clayton Coleman 2019-04-23 17:33:29 UTC
Testing directly against the API server, the full images list was 39M of JSON and took about 3m20s to retrieve. Compaction is set to 5m, so we are close to the point of being unable to read all images before the compaction window closes.
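
Why the compaction window matters for a paged read: the continue token pins the list to the resource version of the first chunk, and once etcd compacts that revision the server answers 410 Gone ("too old resource version") and the listing must restart. Below is a hypothetical retry wrapper around the listAllImages helper sketched above (same package and imports, plus the ones shown); the attempt count is arbitrary.

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// listAllImagesWithRetry restarts the paged list from scratch when its
// continue token expires mid-flight. apierrors.IsResourceExpired matches
// the 410 Gone / "too old resource version" status returned after
// compaction invalidates the token.
func listAllImagesWithRetry(client imageclient.Interface) ([]imagev1.Image, error) {
	const attempts = 3 // arbitrary; each restart reads from a fresh snapshot
	var lastErr error
	for i := 0; i < attempts; i++ {
		images, err := listAllImages(client)
		if err == nil {
			return images, nil
		}
		if !apierrors.IsResourceExpired(err) {
			return nil, err
		}
		lastErr = err
	}
	return nil, fmt.Errorf("paged list kept expiring against the compaction window: %v", lastErr)
}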

Comment 4 XiuJuan Wang 2019-08-26 12:02:50 UTC
Verified this with:
./oc version
oc v3.11.141
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ec2-54-80-207-203.compute-1.amazonaws.com:443
openshift v3.11.141
kubernetes v1.11.0+d4cacc0

Could prune 200k images from a pod without hitting the timeout.

cat 20kimageprune-1.log | grep ImageStreamList
I0826 11:57:35.595904     148 request.go:897] Response Body: {"kind":"ImageStreamList","apiVersion":"image.openshift.io/v1","metadata":{"selfLink":"/apis/image.openshift.io/v1/imagestreams","resourceVersion":"246416"},"items":[{"metadata":{"name":"nodejs-mongodb-example","namespace":"install-test","selfLink":"/apis/image.openshift.io/v1/namespaces/install-test/imagestreams/nodejs-mongodb-example","uid":"0ffb7d02-c7d5-11e9-b89e-0e8918b91460","resourceVersion":"5912","generation":1,"creationTimestamp":"2019-08-26T07:42:32Z","labels":{"app":"nodejs-mongodb-example","template":"nodejs-mongodb-example"},"annotations":{"description":"Keeps track of changes in the application image","openshift.io/generated-by":"OpenShiftNewApp"}},"spec":{"lookupPolicy":{"local":false}},"status":{"dockerImageRepository":"docker-registry.default.svc:5000/install-test/nodejs-mongodb-example","tags":[{"tag":"latest","items":[{"created":"2019-08-26T07:43:18Z","dockerImageReference":"docker-registry.default.svc:5000/install-test/nodejs-mongodb-example@sha256:e2059198fbc704c5bbd3b672482d0f6fada954e5 [truncated 300040 chars]

Comment 6 errata-xmlrpc 2019-09-03 15:56:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2580

