Bug 1710561 - [4.1.z] Image pruning on api.ci is wedged due to too many images
Summary: [4.1.z] Image pruning on api.ci is wedged due to too many images
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.1.z
Assignee: Oleg Bulatov
QA Contact: XiuJuan Wang
URL:
Whiteboard:
Depends On: 1702346 1702757
Blocks:
 
Reported: 2019-05-15 19:10 UTC by Adam Kaplan
Modified: 2019-06-26 16:17 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: the pruner was fetching all images in a single request.
Consequence: the request took too long and timed out.
Fix: use a pager to fetch the images in chunks.
Result: the pruner can retrieve all images without hitting the timeout.
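The fix (openshift/origin#22655) replaces the single giant List call with chunked listing driven by the API's `limit`/`continue` parameters, which is what client-go's `tools/pager` package does for the pruner. The sketch below is a stdlib-only simulation of that loop, not the real client-go code; `page`, `listPage`, and the numeric continue tokens are illustrative stand-ins:

```go
package main

import "fmt"

// page is a hypothetical stand-in for one chunk of an ImageList
// response: the items plus the continue token for the next chunk.
type page struct {
	items []string
	next  string
}

// listPage simulates the API server answering one chunked list call,
// the way ?limit=N&continue=TOKEN works for real resources.
func listPage(all []string, token string, limit int) page {
	start := 0
	if token != "" {
		fmt.Sscanf(token, "%d", &start)
	}
	end := start + limit
	if end > len(all) {
		end = len(all)
	}
	next := ""
	if end < len(all) {
		next = fmt.Sprintf("%d", end)
	}
	return page{items: all[start:end], next: next}
}

// listAll drains every page in a loop. Each request stays small, so
// no single request can hit the server-side timeout the way one
// all-at-once List of 200k images did.
func listAll(all []string, limit int) []string {
	var out []string
	token := ""
	for {
		p := listPage(all, token, limit)
		out = append(out, p.items...)
		if p.next == "" {
			return out
		}
		token = p.next
	}
}

func main() {
	images := make([]string, 10)
	for i := range images {
		images[i] = fmt.Sprintf("sha256:%04d", i)
	}
	fmt.Println(len(listAll(images, 3))) // prints 10
}
```

The trade-off is latency for reliability: many small requests take longer end to end, but each one completes well inside the server's per-request timeout.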
Clone Of: 1702757
Environment:
Last Closed: 2019-06-26 16:17:10 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1702346 None CLOSED [3.11] Image pruning on api.ci is wedged due to too many images 2019-09-04 10:01:48 UTC
Red Hat Bugzilla 1702757 None VERIFIED Image pruning on api.ci is wedged due to too many images 2019-09-04 10:01:49 UTC

Description Adam Kaplan 2019-05-15 19:10:13 UTC
+++ This bug was initially created as a clone of Bug #1702757 +++

+++ This bug was initially created as a clone of Bug #1702346 +++

Approximately 55 days ago image pruning stopped working on api.ci, probably due to either a transient failure or hitting some limit.

The current error is

oc logs jobs/image-pruner-clayton-debug
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get images.image.openshift.io)

We have 200k images on the cluster, and it looks like the images call times out trying to load them all into memory.  When I ran a paged call (locally via oc get) it took several minutes and we eventually hit the compaction window, so I was unable to complete it.  Testing locally to gauge the size.

The cluster needs to be able to prune, and we will need to take action to get it back under the threshold.  We then need to ensure that this failure mode doesn't happen in the future.

--- Additional comment from Clayton Coleman on 2019-04-23 17:33:29 UTC ---

Testing against the API server directly, the full images list was 39M of JSON and took about 3m20s to retrieve.  Compaction is set to 5m, so we are close to the "unable to read all images before the compaction window" failure mode.

--- Additional comment from Adam Kaplan on 2019-04-24 17:09:28 UTC ---

PR: https://github.com/openshift/origin/pull/22655

--- Additional comment from Adam Kaplan on 2019-04-24 17:12:43 UTC ---

Backport request to 3.11: https://bugzilla.redhat.com/show_bug.cgi?id=1702346

Comment 1 Oleg Bulatov 2019-05-16 11:22:43 UTC
https://github.com/openshift/origin/pull/22845

Comment 2 W. Trevor King 2019-06-12 23:01:24 UTC
PR merged: https://github.com/openshift/origin/pull/22845#event-2406198483

Comment 4 XiuJuan Wang 2019-06-13 06:52:00 UTC
Oleg,
Could I use the steps below to check this issue?
1. Set up a cluster with limited resources (1 master, 1 worker on m4.large, which has 2 vCPUs and 8G memory).
2. Build lots of images or create lots of imagestreams.
3. Prune all images using '--all'.

While watching worker memory usage, if there is no interruption during pruning, can I mark this verified?

Comment 5 XiuJuan Wang 2019-06-14 12:37:16 UTC
Met the server timeout again with 420K images:
$ oc adm prune images --registry-url=default-route-openshift-image-registry.apps.baremetal-lab-02.qe.devcluster.openshift.com --certificate-authority=ca-bm.crt --all
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get images.image.openshift.io)

$ oc get clusterversion 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-06-12-205455   True        False         26h     Cluster version is 4.1.0-0.nightly-2019-06-12-205455

Comment 6 Oleg Bulatov 2019-06-14 12:47:39 UTC
What is the output of `oc version`? And please attach the output of `oc adm prune images` with --loglevel=6.

Comment 8 XiuJuan Wang 2019-06-14 13:05:44 UTC
With the 4.1.2 oc client, I can't reproduce this bug with 420K images now.
Moving the bug to verified.

Comment 9 XiuJuan Wang 2019-06-14 13:10:50 UTC
 ./oc version 
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.2-201906121056+3569a06-dirty", GitCommit:"3569a06", GitTreeState:"dirty", BuildDate:"2019-06-12T15:47:26Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+d4417a7", GitCommit:"d4417a7", GitTreeState:"clean", BuildDate:"2019-06-12T15:47:30Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}


Sorry, several minutes later, while pruning was still executing, I hit the server timeout again.

Comment 15 XiuJuan Wang 2019-06-18 04:29:58 UTC
I could get images successfully from the master or worker nodes, but from my localhost I sometimes hit this error:

[xiuwang@dhcp-141-173 ~]$  time oc get images  | wc  -l 

I0618 11:46:28.102849     843 helpers.go:198] server response object: [{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {
    "continue": "eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6LTEsInN0YXJ0IjoieGl1d2FuZy0zZmFrZXNiam5tXHUwMDAwIn0"
  },
  "status": "Failure",
  "message": "The provided continue parameter is too old to display a consistent list result. You can start a new list without the continue parameter, or use the continue token in this response to retrieve the remainder of the results. Continuing with the provided token results in an inconsistent list - objects that were created, modified, or deleted between the time the first chunk was returned and now may show up in the list.",
  "reason": "Expired",
  "code": 410
}]
F0618 11:46:28.102885     843 helpers.go:116] Error from server (Expired): The provided continue parameter is too old to display a consistent list result. You can start a new list without the continue parameter, or use the continue token in this response to retrieve the remainder of the results. Continuing with the provided token results in an inconsistent list - objects that were created, modified, or deleted between the time the first chunk was returned and now may show up in the list.
186833

real	7m6.966s
user	0m19.941s
sys	0m6.651s
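The 410 "Expired" above is the compaction window showing up at the protocol level: the client's `continue` token fell behind etcd compaction, and the server offers two ways out, restarting the list from scratch or resuming with the fresh (inconsistent) token it returns. The sketch below simulates that exchange with a toy server; `errExpired`, the `start@issuedAt` token format, and the clock are all invented for illustration and are not the real API machinery:

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// errExpired mimics the HTTP 410 "Expired" status: the continue token
// is too old, and the server includes a fresh token that resumes the
// list at the same key, at the cost of a possibly inconsistent result.
type errExpired struct{ resumeToken string }

func (e errExpired) Error() string { return "Expired: continue token is too old" }

// server simulates chunked listing. A token encodes "start@issuedAt";
// tokens older than maxAge clock ticks are rejected, standing in for
// the etcd compaction window.
type server struct {
	items  []string
	clock  int
	maxAge int
}

func (s *server) list(token string, limit int) ([]string, string, error) {
	s.clock++
	start := 0
	if token != "" {
		parts := strings.SplitN(token, "@", 2)
		start, _ = strconv.Atoi(parts[0])
		issued, _ := strconv.Atoi(parts[1])
		if s.clock-issued > s.maxAge {
			return nil, "", errExpired{resumeToken: fmt.Sprintf("%d@%d", start, s.clock)}
		}
	}
	end := start + limit
	if end > len(s.items) {
		end = len(s.items)
	}
	next := ""
	if end < len(s.items) {
		next = fmt.Sprintf("%d@%d", end, s.clock)
	}
	return s.items[start:end], next, nil
}

// listAll fetches every page; between simulates slow client-side work
// (e.g. building the prune graph) between pages. On a 410 it resumes
// with the token from the error body instead of restarting the list.
func listAll(s *server, limit int, between func()) ([]string, int) {
	var out []string
	token := ""
	resumes := 0
	for {
		items, next, err := s.list(token, limit)
		var exp errExpired
		if errors.As(err, &exp) {
			resumes++
			token = exp.resumeToken
			continue
		}
		out = append(out, items...)
		if next == "" {
			return out, resumes
		}
		token = next
		between()
	}
}

func main() {
	srv := &server{items: []string{"a", "b", "c", "d", "e"}, maxAge: 2}
	slowOnce := true
	got, resumes := listAll(srv, 2, func() {
		if slowOnce {
			srv.clock += 10 // fall behind the "compaction window" once
			slowOnce = false
		}
	})
	fmt.Println(len(got), resumes) // prints 5 1
}
```

Resuming inconsistently is fine for `oc get | wc -l`-style inspection; a pruner that needs a consistent view would restart the list instead.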

Comment 16 Oleg Bulatov 2019-06-18 20:55:02 UTC
Can you try to run pruning on a worker node?

Comment 17 XiuJuan Wang 2019-06-19 03:42:12 UTC
It works on the master node. Is the timeout error from localhost an etcd or networking issue? Should we continue working on this issue?

$ time oc adm prune images --registry-url=default-route-openshift-image-registry.apps.baremetal-lab-02.qe.devcluster.openshift.com --certificate-authority=ca-bm.crt --all --loglevel=8 2>>20kimageprune-1.log 1>>20kimageprune-1.log
==========snip==================
I0619 02:59:35.404417  113970 request.go:942] Response Body: {"kind":"ImageList","apiVersion":"image.openshift.io/v1","metadata":{"selfLink":"/apis/image.openshift.io/v1/images","resourceVersion":"24422052"},"items":[{"metadata":{"name":"xiuwang-fakezw9lq","generateName":"xiuwang-fake","selfLink":"/apis/image.openshift.io/v1/images/xiuwang-fakezw9lq","uid":"cddf4469-8e98-11e9-814b-0a580a820048","resourceVersion":"18569957","creationTimestamp":"2019-06-14T11:37:35Z","annotations":{"image.openshift.io/dockerLayersOrder":"ascending"}},"dockerImageReference":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fff64ab64434424ac7c9f74f745e47f60138a1f401dd175a3de0694940dc5910","dockerImageMetadata":{"kind":"DockerImage","apiVersion":"1.0","Id":"sha256:fd82b0a3cfd648a6ad83e7ddfcd2b7c60691461d824dd24ec82ceaaae75e4007","Created":"2019-05-09T16:22:25Z","Container":"a5c82782437a4c5fa5ebb31fd501df5d94fbf0652f60627833b52c9c4d8c6a4c","ContainerConfig":{"Hostname":"a5c82782437a","Env":["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin","container=oci"],"Cmd":["slee [truncated 497139 chars]
I0619 02:59:35.942696  113970 round_trippers.go:416] GET https://10.1.10.123:6443/apis/image.openshift.io/v1/images?watch=true
I0619 02:59:35.942735  113970 round_trippers.go:423] Request Headers:
I0619 02:59:35.942740  113970 round_trippers.go:426]     Authorization: Bearer VFFbQmu3auOY3LQEeSoX0v5OnYgzDcNXYt9TXw5M9Pk
I0619 02:59:35.942744  113970 round_trippers.go:426]     Accept: application/json, */*
I0619 02:59:35.942748  113970 round_trippers.go:426]     User-Agent: oc/v1.13.4+9252851b0 (linux/amd64) kubernetes/9252851
I0619 02:59:36.146578  113970 round_trippers.go:441] Response Status: 200 OK in 203 milliseconds
I0619 02:59:36.146608  113970 round_trippers.go:444] Response Headers:
I0619 02:59:36.146613  113970 round_trippers.go:447]     Audit-Id: 125a8bb9-e141-44c1-ad36-fb32ae93397a
I0619 02:59:36.146617  113970 round_trippers.go:447]     Cache-Control: no-store
I0619 02:59:36.146621  113970 round_trippers.go:447]     Cache-Control: no-store
I0619 02:59:36.146625  113970 round_trippers.go:447]     Content-Type: application/json
I0619 02:59:36.146629  113970 round_trippers.go:447]     Date: Wed, 19 Jun 2019 02:59:35 GMT
I0619 02:59:36.146715  113970 round_trippers.go:416] GET https://10.1.10.123:6443/apis/image.openshift.io/v1/imagestreams?watch=true
I0619 02:59:36.146734  113970 round_trippers.go:423] Request Headers:
I0619 02:59:36.146739  113970 round_trippers.go:426]     Accept: application/json, */*
I0619 02:59:36.146744  113970 round_trippers.go:426]     User-Agent: oc/v1.13.4+9252851b0 (linux/amd64) kubernetes/9252851
I0619 02:59:36.146749  113970 round_trippers.go:426]     Authorization: Bearer VFFbQmu3auOY3LQEeSoX0v5OnYgzDcNXYt9TXw5M9Pk
I0619 02:59:36.153447  113970 round_trippers.go:441] Response Status: 200 OK in 6 milliseconds
I0619 02:59:36.153462  113970 round_trippers.go:444] Response Headers:
I0619 02:59:36.153467  113970 round_trippers.go:447]     Content-Type: application/json
I0619 02:59:36.153472  113970 round_trippers.go:447]     Date: Wed, 19 Jun 2019 02:59:36 GMT
I0619 02:59:36.153476  113970 round_trippers.go:447]     Audit-Id: 650fe98a-3e04-4f7b-a453-78e8f8bc59e1
I0619 02:59:36.153480  113970 round_trippers.go:447]     Cache-Control: no-store
I0619 02:59:36.153483  113970 round_trippers.go:447]     Cache-Control: no-store
I0619 02:59:36.153537  113970 round_trippers.go:416] GET https://10.1.10.123:6443/apis/image.openshift.io/v1/imagestreams
I0619 02:59:36.153544  113970 round_trippers.go:423] Request Headers:
I0619 02:59:36.153549  113970 round_trippers.go:426]     Accept: application/json, */*
I0619 02:59:36.153554  113970 round_trippers.go:426]     User-Agent: oc/v1.13.4+9252851b0 (linux/amd64) kubernetes/9252851
I0619 02:59:36.153558  113970 round_trippers.go:426]     Authorization: Bearer VFFbQmu3auOY3LQEeSoX0v5OnYgzDcNXYt9TXw5M9Pk
I0619 02:59:36.177086  113970 round_trippers.go:441] Response Status: 200 OK in 23 milliseconds
I0619 02:59:36.177115  113970 round_trippers.go:444] Response Headers:
I0619 02:59:36.177125  113970 round_trippers.go:447]     Audit-Id: 2bfc416d-9831-4569-9a39-4a3ac4fe96d2
I0619 02:59:36.177141  113970 round_trippers.go:447]     Cache-Control: no-store
I0619 02:59:36.177145  113970 round_trippers.go:447]     Cache-Control: no-store
I0619 02:59:36.177148  113970 round_trippers.go:447]     Content-Type: application/json
I0619 02:59:36.177153  113970 round_trippers.go:447]     Date: Wed, 19 Jun 2019 02:59:36 GMT
I0619 02:59:36.181379  113970 request.go:942] Response Body: {"kind":"ImageStreamList","apiVersion":"image.openshift.io/v1","metadata":{"selfLink":"/apis/image.openshift.io/v1/imagestreams","resourceVersion":"24423994"},"items":[{"metadata":{"name":"ruby-ex","namespace":"devexpupgrade","selfLink":"/apis/image.openshift.io/v1/namespaces/devexpupgrade/imagestreams/ruby-ex","uid":"170caa8d-8da3-11e9-bf19-0a580a83001b","resourceVersion":"17237769","generation":1,"creationTimestamp":"2019-06-13T06:18:41Z","labels":{"app":"ruby-ex"},"annotations":{"openshift.io/generated-by":"OpenShiftNewApp"}},"spec":{"lookupPolicy":{"local":false}},"status":{"dockerImageRepository":"image-registry.openshift-image-registry.svc:5000/devexpupgrade/ruby-ex","publicDockerImageRepository":"default-route-openshift-image-registry.apps.baremetal-lab-02.qe.devcluster.openshift.com/devexpupgrade/ruby-ex","tags":[{"tag":"latest","items":[{"created":"2019-06-13T06:19:46Z","dockerImageReference":"image-registry.openshift-image-registry.svc:5000/devexpupgrade/ruby-ex@sha256:5a1d5e269eb9360290edfe3933daa4 [truncated 310413 chars]
I0619 02:59:36.188808  113970 prune.go:278] Creating image pruner with keepYoungerThan=1h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
I0619 02:59:36.188831  113970 prune.go:357] Adding image "sha256:0089883f8e4387618946cd24378a447b8cf7e5dfaa146b94acab27fc5e170a14" to graph
I0619 02:59:36.189134  113970 prune.go:379] Adding image layer "sha256:26e5ed6899dbf4b1e93e0898255e8aaf43465cecd3a24910f26edb5d43dafa3c" to graph
I0619 02:59:36.189152  113970 prune.go:379] Adding image layer "sha256:66dbe984a319ca6d40dc10c2c561821128a0bd8967e0cbd8cc2a302736041ffb" to graph
==================================snip====================================

Comment 18 Oleg Bulatov 2019-06-19 12:03:05 UTC
I'd say it's an architectural limitation. We should recommend running pruning as a pod inside the cluster. Otherwise, round trips between the master API and the pruner may hit the compaction window, i.e. the pruner may fail to get all images.
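One way to follow that recommendation is a CronJob that runs the pruner from inside the cluster, so every round trip to the API stays on the cluster network. This is a hypothetical sketch, not a manifest OpenShift ships: the namespace, schedule, service account, and CLI image are all assumptions, and the service account would need the image-pruner cluster role bound to it.

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: image-pruner
  namespace: default
spec:
  schedule: "0 0 * * *"        # nightly; pick a window that suits the cluster
  concurrencyPolicy: Forbid    # never run two pruners at once
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          serviceAccountName: pruner               # assumed SA with pruning rights
          containers:
          - name: image-pruner
            image: quay.io/openshift/origin-cli:4.1   # assumed image providing oc
            command:
            - oc
            - adm
            - prune
            - images
            - --all
            - --confirm
```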

Comment 19 XiuJuan Wang 2019-06-20 02:29:45 UTC
Oleg, Thanks
I could prune 200K images in a pod without hitting the timeout error.
Verified with oc client v4.1.2.

