+++ This bug was initially created as a clone of Bug #1702757 +++
+++ This bug was initially created as a clone of Bug #1702346 +++

Approximately 55 days ago, image pruning stopped working on api.ci, probably due to either a transient failure or hitting a certain limit. The current error is:

oc logs jobs/image-pruner-clayton-debug
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get images.image.openshift.io)

We have 200k images on the cluster, and it looks like the images call times out trying to load them all into memory. When I ran a paged call (locally with oc get) it took several minutes and we eventually hit the compaction window, so I was unable to complete the listing. Testing locally to see the size.

The cluster needs to be able to prune, and we will need to take action to get it back under the threshold. We then need to ensure that this failure mode doesn't happen in the future.

--- Additional comment from Clayton Coleman on 2019-04-23 17:33:29 UTC ---

Testing against the API server directly, the images list was 39M in JSON and took about 3m20s to retrieve. Compaction is set to 5m, so we are close to "unable to read all images before the compaction window".

--- Additional comment from Adam Kaplan on 2019-04-24 17:09:28 UTC ---

PR: https://github.com/openshift/origin/pull/22655

--- Additional comment from Adam Kaplan on 2019-04-24 17:12:43 UTC ---

Backport request to 3.11: https://bugzilla.redhat.com/show_bug.cgi?id=1702346
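For reference, a paged listing of this kind can be attempted from a client with oc's --chunk-size flag, which makes the server return the list in pages instead of one giant response. A rough sketch (the chunk size and the byte-count check are arbitrary choices for illustration, not taken from this bug):

$ time oc get images --chunk-size=1000 --no-headers | wc -l    # count images while fetching page by page
$ oc get images -o json | wc -c                                # rough size of the full list in bytes

If fetching all pages takes longer than the etcd compaction interval, the continue token can expire before the last page is retrieved, which is the failure mode described above.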
https://github.com/openshift/origin/pull/22845
PR merged: https://github.com/openshift/origin/pull/22845#event-2406198483
Oleg, could I use the steps below to check this issue?

1. Set up a cluster with limited resources (1 master, 1 worker on m4.large, which has 2 vCPUs and 8G memory).
2. Build lots of images or create lots of imagestreams.
3. Prune all images using '--all' while watching worker memory usage (see the sketch after this list).

If there is no interruption during pruning, could I mark this verified?
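As a rough sketch of steps 2 and 3 (the project name, loop count, and prune flags below are hypothetical choices for this sketch, not from this bug), one cheap way to create load is a loop of imagestream creations, with node memory watched while the prune runs:

$ oc new-project prune-test
$ for i in $(seq 1 1000); do oc create imagestream "fake-${i}" -n prune-test; done
$ oc adm top nodes        # check worker memory before and during pruning
$ oc adm prune images --all --keep-tag-revisions=0 --keep-younger-than=0 --confirm

Note that an imagestream loop only grows the imagestreams list; producing a comparable number of distinct Image objects requires pushing or importing images with distinct digests.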
Met server timeout again with 420K images:

$ oc adm prune images --registry-url=default-route-openshift-image-registry.apps.baremetal-lab-02.qe.devcluster.openshift.com --certificate-authority=ca-bm.crt --all
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get images.image.openshift.io)

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-06-12-205455   True        False         26h     Cluster version is 4.1.0-0.nightly-2019-06-12-205455
What is the output of `oc version`? And please attach the output from `oc adm prune images` with --loglevel=6.
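For completeness, the requested information can be collected along these lines (the registry URL placeholder and log file name are just examples):

$ oc version
$ oc adm prune images --registry-url=<registry route> --certificate-authority=ca-bm.crt --all --loglevel=6 2>&1 | tee prune-loglevel6.log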
With the 4.1.2 oc client, I can't reproduce this bug with 420K images now. Moving the bug to verified.
./oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.2-201906121056+3569a06-dirty", GitCommit:"3569a06", GitTreeState:"dirty", BuildDate:"2019-06-12T15:47:26Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+d4417a7", GitCommit:"d4417a7", GitTreeState:"clean", BuildDate:"2019-06-12T15:47:30Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

Sorry, several minutes into the prune run I hit the server timeout again.
I could get images successfully from the master or worker nodes, but from my localhost I sometimes hit this error:

[xiuwang@dhcp-141-173 ~]$ time oc get images | wc -l
I0618 11:46:28.102849     843 helpers.go:198] server response object: [{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {
    "continue": "eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6LTEsInN0YXJ0IjoieGl1d2FuZy0zZmFrZXNiam5tXHUwMDAwIn0"
  },
  "status": "Failure",
  "message": "The provided continue parameter is too old to display a consistent list result. You can start a new list without the continue parameter, or use the continue token in this response to retrieve the remainder of the results. Continuing with the provided token results in an inconsistent list - objects that were created, modified, or deleted between the time the first chunk was returned and now may show up in the list.",
  "reason": "Expired",
  "code": 410
}]
F0618 11:46:28.102885     843 helpers.go:116] Error from server (Expired): The provided continue parameter is too old to display a consistent list result. You can start a new list without the continue parameter, or use the continue token in this response to retrieve the remainder of the results. Continuing with the provided token results in an inconsistent list - objects that were created, modified, or deleted between the time the first chunk was returned and now may show up in the list.
186833

real    7m6.966s
user    0m19.941s
sys     0m6.651s
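The 410 Expired above comes from the chunked list protocol: oc get requests the list in pages using limit/continue query parameters, and the continue token is tied to a resourceVersion that etcd eventually compacts away. The raw calls look roughly like this (the limit value and the token placeholder are for illustration, not from this log):

$ oc get --raw '/apis/image.openshift.io/v1/images?limit=500'
$ oc get --raw '/apis/image.openshift.io/v1/images?limit=500&continue=<token from the previous response>'

If all pages cannot be fetched before compaction, the server rejects the stale token with Expired, which matches the slow retrieval seen from localhost.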
Can you try to run pruning on a worker node?
It works on the master node. Is the timeout error we get from localhost an etcd or networking issue? Should we continue trying to resolve this?

$ time oc adm prune images --registry-url=default-route-openshift-image-registry.apps.baremetal-lab-02.qe.devcluster.openshift.com --certificate-authority=ca-bm.crt --all --loglevel=8 2>>20kimageprune-1.log 1>>20kimageprune-1.log

==========snip==================
I0619 02:59:35.404417 113970 request.go:942] Response Body: {"kind":"ImageList","apiVersion":"image.openshift.io/v1","metadata":{"selfLink":"/apis/image.openshift.io/v1/images","resourceVersion":"24422052"},"items":[{"metadata":{"name":"xiuwang-fakezw9lq","generateName":"xiuwang-fake","selfLink":"/apis/image.openshift.io/v1/images/xiuwang-fakezw9lq","uid":"cddf4469-8e98-11e9-814b-0a580a820048","resourceVersion":"18569957","creationTimestamp":"2019-06-14T11:37:35Z","annotations":{"image.openshift.io/dockerLayersOrder":"ascending"}},"dockerImageReference":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fff64ab64434424ac7c9f74f745e47f60138a1f401dd175a3de0694940dc5910","dockerImageMetadata":{"kind":"DockerImage","apiVersion":"1.0","Id":"sha256:fd82b0a3cfd648a6ad83e7ddfcd2b7c60691461d824dd24ec82ceaaae75e4007","Created":"2019-05-09T16:22:25Z","Container":"a5c82782437a4c5fa5ebb31fd501df5d94fbf0652f60627833b52c9c4d8c6a4c","ContainerConfig":{"Hostname":"a5c82782437a","Env":["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin","container=oci"],"Cmd":["slee [truncated 497139 chars]
I0619 02:59:35.942696 113970 round_trippers.go:416] GET https://10.1.10.123:6443/apis/image.openshift.io/v1/images?watch=true
I0619 02:59:35.942735 113970 round_trippers.go:423] Request Headers:
I0619 02:59:35.942740 113970 round_trippers.go:426] Authorization: Bearer VFFbQmu3auOY3LQEeSoX0v5OnYgzDcNXYt9TXw5M9Pk
I0619 02:59:35.942744 113970 round_trippers.go:426] Accept: application/json, */*
I0619 02:59:35.942748 113970 round_trippers.go:426] User-Agent: oc/v1.13.4+9252851b0 (linux/amd64) kubernetes/9252851
I0619 02:59:36.146578 113970 round_trippers.go:441] Response Status: 200 OK in 203 milliseconds
I0619 02:59:36.146608 113970 round_trippers.go:444] Response Headers:
I0619 02:59:36.146613 113970 round_trippers.go:447] Audit-Id: 125a8bb9-e141-44c1-ad36-fb32ae93397a
I0619 02:59:36.146617 113970 round_trippers.go:447] Cache-Control: no-store
I0619 02:59:36.146621 113970 round_trippers.go:447] Cache-Control: no-store
I0619 02:59:36.146625 113970 round_trippers.go:447] Content-Type: application/json
I0619 02:59:36.146629 113970 round_trippers.go:447] Date: Wed, 19 Jun 2019 02:59:35 GMT
I0619 02:59:36.146715 113970 round_trippers.go:416] GET https://10.1.10.123:6443/apis/image.openshift.io/v1/imagestreams?watch=true
I0619 02:59:36.146734 113970 round_trippers.go:423] Request Headers:
I0619 02:59:36.146739 113970 round_trippers.go:426] Accept: application/json, */*
I0619 02:59:36.146744 113970 round_trippers.go:426] User-Agent: oc/v1.13.4+9252851b0 (linux/amd64) kubernetes/9252851
I0619 02:59:36.146749 113970 round_trippers.go:426] Authorization: Bearer VFFbQmu3auOY3LQEeSoX0v5OnYgzDcNXYt9TXw5M9Pk
I0619 02:59:36.153447 113970 round_trippers.go:441] Response Status: 200 OK in 6 milliseconds
I0619 02:59:36.153462 113970 round_trippers.go:444] Response Headers:
I0619 02:59:36.153467 113970 round_trippers.go:447] Content-Type: application/json
I0619 02:59:36.153472 113970 round_trippers.go:447] Date: Wed, 19 Jun 2019 02:59:36 GMT
I0619 02:59:36.153476 113970 round_trippers.go:447] Audit-Id: 650fe98a-3e04-4f7b-a453-78e8f8bc59e1
I0619 02:59:36.153480 113970 round_trippers.go:447] Cache-Control: no-store
I0619 02:59:36.153483 113970 round_trippers.go:447] Cache-Control: no-store
I0619 02:59:36.153537 113970 round_trippers.go:416] GET https://10.1.10.123:6443/apis/image.openshift.io/v1/imagestreams
I0619 02:59:36.153544 113970 round_trippers.go:423] Request Headers:
I0619 02:59:36.153549 113970 round_trippers.go:426] Accept: application/json, */*
I0619 02:59:36.153554 113970 round_trippers.go:426] User-Agent: oc/v1.13.4+9252851b0 (linux/amd64) kubernetes/9252851
I0619 02:59:36.153558 113970 round_trippers.go:426] Authorization: Bearer VFFbQmu3auOY3LQEeSoX0v5OnYgzDcNXYt9TXw5M9Pk
I0619 02:59:36.177086 113970 round_trippers.go:441] Response Status: 200 OK in 23 milliseconds
I0619 02:59:36.177115 113970 round_trippers.go:444] Response Headers:
I0619 02:59:36.177125 113970 round_trippers.go:447] Audit-Id: 2bfc416d-9831-4569-9a39-4a3ac4fe96d2
I0619 02:59:36.177141 113970 round_trippers.go:447] Cache-Control: no-store
I0619 02:59:36.177145 113970 round_trippers.go:447] Cache-Control: no-store
I0619 02:59:36.177148 113970 round_trippers.go:447] Content-Type: application/json
I0619 02:59:36.177153 113970 round_trippers.go:447] Date: Wed, 19 Jun 2019 02:59:36 GMT
I0619 02:59:36.181379 113970 request.go:942] Response Body: {"kind":"ImageStreamList","apiVersion":"image.openshift.io/v1","metadata":{"selfLink":"/apis/image.openshift.io/v1/imagestreams","resourceVersion":"24423994"},"items":[{"metadata":{"name":"ruby-ex","namespace":"devexpupgrade","selfLink":"/apis/image.openshift.io/v1/namespaces/devexpupgrade/imagestreams/ruby-ex","uid":"170caa8d-8da3-11e9-bf19-0a580a83001b","resourceVersion":"17237769","generation":1,"creationTimestamp":"2019-06-13T06:18:41Z","labels":{"app":"ruby-ex"},"annotations":{"openshift.io/generated-by":"OpenShiftNewApp"}},"spec":{"lookupPolicy":{"local":false}},"status":{"dockerImageRepository":"image-registry.openshift-image-registry.svc:5000/devexpupgrade/ruby-ex","publicDockerImageRepository":"default-route-openshift-image-registry.apps.baremetal-lab-02.qe.devcluster.openshift.com/devexpupgrade/ruby-ex","tags":[{"tag":"latest","items":[{"created":"2019-06-13T06:19:46Z","dockerImageReference":"image-registry.openshift-image-registry.svc:5000/devexpupgrade/ruby-ex@sha256:5a1d5e269eb9360290edfe3933daa4 [truncated 310413 chars]
I0619 02:59:36.188808 113970 prune.go:278] Creating image pruner with keepYoungerThan=1h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
I0619 02:59:36.188831 113970 prune.go:357] Adding image "sha256:0089883f8e4387618946cd24378a447b8cf7e5dfaa146b94acab27fc5e170a14" to graph
I0619 02:59:36.189134 113970 prune.go:379] Adding image layer "sha256:26e5ed6899dbf4b1e93e0898255e8aaf43465cecd3a24910f26edb5d43dafa3c" to graph
I0619 02:59:36.189152 113970 prune.go:379] Adding image layer "sha256:66dbe984a319ca6d40dc10c2c561821128a0bd8967e0cbd8cc2a302736041ffb" to graph
==================================snip====================================
I'd say it's an architectural limitation. We should recommend running pruning as a pod inside the cluster. Otherwise the round trips between the master API and the pruner may hit the compaction window, i.e. the pruner may fail to get all images.
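As a rough illustration of that recommendation (the service account name, namespace, container image, and prune flags below are hypothetical choices for this sketch, not an officially shipped manifest), pruning can be run in-cluster by giving a service account the system:image-pruner cluster role and running oc inside a Job, where it picks up the in-cluster credentials and reaches the API over the cluster network:

$ oc create serviceaccount image-pruner -n openshift-image-registry
$ oc adm policy add-cluster-role-to-user system:image-pruner -z image-pruner -n openshift-image-registry
$ oc create -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: image-pruner-manual
  namespace: openshift-image-registry
spec:
  template:
    spec:
      serviceAccountName: image-pruner
      restartPolicy: OnFailure
      containers:
      - name: prune
        # any image containing the oc binary works; this reference is only an example
        image: quay.io/openshift/origin-cli:4.1
        command:
        - oc
        - adm
        - prune
        - images
        - --all
        - --keep-tag-revisions=3
        - --confirm
EOF

Running from a pod keeps the list traffic on the cluster network, so the pruner has a much better chance of retrieving all image pages within the compaction window.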
Oleg, thanks. I could prune 200K images from a pod without hitting the timeout error. Verified with oc client v4.1.2.