Bug 1710561
| Summary: | [4.1.z] Image pruning on api.ci is wedged due to too many images | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Adam Kaplan <adam.kaplan> |
| Component: | ImageStreams | Assignee: | Oleg Bulatov <obulatov> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | XiuJuan Wang <xiuwang> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.11.0 | CC: | adam.kaplan, aos-bugs, ccoleman, jokerman, mmccomas, obulatov, vlaad, wking, wzheng, xiuwang |
| Target Milestone: | --- | | |
| Target Release: | 4.1.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: the pruner fetched all images in a single request.<br>Consequence: the request took too long and timed out.<br>Fix: use a pager to fetch the images in chunks.<br>Result: the pruner can get all images without hitting the server timeout. | | |
| Story Points: | --- | | |
| Clone Of: | 1702757 | Environment: | |
| Last Closed: | 2019-06-26 16:17:10 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1702346, 1702757 | | |
| Bug Blocks: | | | |
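The fix the Doc Text describes — replacing one monolithic LIST with continue-token paging — can be sketched as below. This is a minimal illustration of the technique, not the actual oc implementation: it assumes a recent openshift/client-go where `List` takes a context, and the helper name `listAllImages` and the 500-item page size are choices of this sketch.

```go
package main

import (
	"context"
	"fmt"

	imagev1 "github.com/openshift/api/image/v1"
	imageclient "github.com/openshift/client-go/image/clientset/versioned/typed/image/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

// listAllImages pages through images.image.openshift.io in fixed-size
// chunks, so no single request has to serialize the entire image set.
func listAllImages(ctx context.Context, c imageclient.ImageV1Interface) ([]imagev1.Image, error) {
	var all []imagev1.Image
	opts := metav1.ListOptions{Limit: 500} // arbitrary page size for this sketch
	for {
		page, err := c.Images().List(ctx, opts)
		if err != nil {
			return nil, err
		}
		all = append(all, page.Items...)
		if page.Continue == "" { // empty token: last page reached
			return all, nil
		}
		opts.Continue = page.Continue // resume where the previous page ended
	}
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	images, err := listAllImages(context.Background(), imageclient.NewForConfigOrDie(cfg))
	if err != nil {
		panic(err)
	}
	fmt.Printf("fetched %d images\n", len(images))
}
```

Each page is a small, fast request, so the apiserver's per-request timeout no longer bounds the total number of images the pruner can read.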
Description
Adam Kaplan
2019-05-15 19:10:13 UTC
Oleg, could I use the steps below to check this issue?

1. Set up a cluster with limited resources (1 master, 1 worker on m4.large, which has 2 vCPUs and 8 GB of memory).
2. Build lots of images or create lots of imagestreams (a sketch for generating fake images follows the version output below).
3. Prune all images using '--all'.

While watching worker memory usage, if nothing interrupts the pruning, can I mark this verified?

Hit the server timeout again with 420K images:

$ oc adm prune images --registry-url=default-route-openshift-image-registry.apps.baremetal-lab-02.qe.devcluster.openshift.com --certificate-authority=ca-bm.crt --all
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get images.image.openshift.io)

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-06-12-205455   True        False         26h     Cluster version is 4.1.0-0.nightly-2019-06-12-205455

What is the output of `oc version`? And please attach the output from `oc adm prune images` with --loglevel=6.

With the 4.1.2 oc client, I can't reproduce this bug with 420K images now. Moving the bug to verified.

$ ./oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.2-201906121056+3569a06-dirty", GitCommit:"3569a06", GitTreeState:"dirty", BuildDate:"2019-06-12T15:47:26Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+d4417a7", GitCommit:"d4417a7", GitTreeState:"clean", BuildDate:"2019-06-12T15:47:30Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
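For step 2 of the reproduction above, one way to mass-produce images without building anything is to create Image objects directly. This is a hedged sketch consistent with the `xiuwang-fake` generateName and pullspec visible in the logs further down, not necessarily how QA populated the cluster; `createFakeImages` is a hypothetical helper that reuses the imports from the pager sketch above and needs rights to create images.image.openshift.io at cluster scope.

```go
// createFakeImages mass-creates synthetic Image objects so the pruner has a
// large corpus to page through.
func createFakeImages(ctx context.Context, c imageclient.ImageV1Interface, n int) error {
	for i := 0; i < n; i++ {
		img := &imagev1.Image{
			// generateName matches the "xiuwang-fake" images seen in the logs below.
			ObjectMeta: metav1.ObjectMeta{GenerateName: "xiuwang-fake"},
			// Pullspec copied from the log excerpt below; any digest-pinned
			// reference would do for a synthetic image.
			DockerImageReference: "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fff64ab64434424ac7c9f74f745e47f60138a1f401dd175a3de0694940dc5910",
		}
		if _, err := c.Images().Create(ctx, img, metav1.CreateOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```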
Sorry, several minutes later, while pruning was still executing, I hit the server timeout again.
Images can be fetched successfully from the master or worker nodes, but from my localhost I sometimes hit this error:
[xiuwang@dhcp-141-173 ~]$ time oc get images | wc -l
I0618 11:46:28.102849 843 helpers.go:198] server response object: [{
"kind": "Status",
"apiVersion": "v1",
"metadata": {
"continue": "eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6LTEsInN0YXJ0IjoieGl1d2FuZy0zZmFrZXNiam5tXHUwMDAwIn0"
},
"status": "Failure",
"message": "The provided continue parameter is too old to display a consistent list result. You can start a new list without the continue parameter, or use the continue token in this response to retrieve the remainder of the results. Continuing with the provided token results in an inconsistent list - objects that were created, modified, or deleted between the time the first chunk was returned and now may show up in the list.",
"reason": "Expired",
"code": 410
}]
F0618 11:46:28.102885 843 helpers.go:116] Error from server (Expired): The provided continue parameter is too old to display a consistent list result. You can start a new list without the continue parameter, or use the continue token in this response to retrieve the remainder of the results. Continuing with the provided token results in an inconsistent list - objects that were created, modified, or deleted between the time the first chunk was returned and now may show up in the list.
186833
real 7m6.966s
user 0m19.941s
sys 0m6.651s
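The 410 response above means the continue token outlived the etcd compaction window, so the server can no longer produce a consistent continuation. Per the error message itself, a client can recover by restarting the list without the token. A minimal sketch building on the hypothetical `listAllImages` helper above; the `apierrors` alias refers to k8s.io/apimachinery/pkg/api/errors:

```go
// listWithExpiryRetry restarts the paged list from scratch when the server
// reports the continue token as expired (HTTP 410, reason "Expired").
func listWithExpiryRetry(ctx context.Context, c imageclient.ImageV1Interface) ([]imagev1.Image, error) {
	images, err := listAllImages(ctx, c)
	if apierrors.IsResourceExpired(err) {
		// The token aged past the etcd compaction window; make one fresh attempt.
		images, err = listAllImages(ctx, c)
	}
	return images, err
}
```

A slow client on a distant network makes each page take longer, which is why the token is more likely to expire when listing from a laptop than from a node in the cluster.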
Can you try to run pruning on a worker node? It works on the master node. Is the timeout error from localhost an etcd or networking issue? Should we continue to pursue this issue?
$time oc adm prune images --registry-url=default-route-openshift-image-registry.apps.baremetal-lab-02.qe.devcluster.openshift.com --certificate-authority=ca-bm.crt --all --loglevel=8 2>>20kimageprune-1.log 1>>20kimageprune-1.log
==========snip==================
I0619 02:59:35.404417 113970 request.go:942] Response Body: {"kind":"ImageList","apiVersion":"image.openshift.io/v1","metadata":{"selfLink":"/apis/image.openshift.io/v1/images","resourceVersion":"24422052"},"items":[{"metadata":{"name":"xiuwang-fakezw9lq","generateName":"xiuwang-fake","selfLink":"/apis/image.openshift.io/v1/images/xiuwang-fakezw9lq","uid":"cddf4469-8e98-11e9-814b-0a580a820048","resourceVersion":"18569957","creationTimestamp":"2019-06-14T11:37:35Z","annotations":{"image.openshift.io/dockerLayersOrder":"ascending"}},"dockerImageReference":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fff64ab64434424ac7c9f74f745e47f60138a1f401dd175a3de0694940dc5910","dockerImageMetadata":{"kind":"DockerImage","apiVersion":"1.0","Id":"sha256:fd82b0a3cfd648a6ad83e7ddfcd2b7c60691461d824dd24ec82ceaaae75e4007","Created":"2019-05-09T16:22:25Z","Container":"a5c82782437a4c5fa5ebb31fd501df5d94fbf0652f60627833b52c9c4d8c6a4c","ContainerConfig":{"Hostname":"a5c82782437a","Env":["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin","container=oci"],"Cmd":["slee [truncated 497139 chars]
I0619 02:59:35.942696 113970 round_trippers.go:416] GET https://10.1.10.123:6443/apis/image.openshift.io/v1/images?watch=true
I0619 02:59:35.942735 113970 round_trippers.go:423] Request Headers:
I0619 02:59:35.942740 113970 round_trippers.go:426] Authorization: Bearer VFFbQmu3auOY3LQEeSoX0v5OnYgzDcNXYt9TXw5M9Pk
I0619 02:59:35.942744 113970 round_trippers.go:426] Accept: application/json, */*
I0619 02:59:35.942748 113970 round_trippers.go:426] User-Agent: oc/v1.13.4+9252851b0 (linux/amd64) kubernetes/9252851
I0619 02:59:36.146578 113970 round_trippers.go:441] Response Status: 200 OK in 203 milliseconds
I0619 02:59:36.146608 113970 round_trippers.go:444] Response Headers:
I0619 02:59:36.146613 113970 round_trippers.go:447] Audit-Id: 125a8bb9-e141-44c1-ad36-fb32ae93397a
I0619 02:59:36.146617 113970 round_trippers.go:447] Cache-Control: no-store
I0619 02:59:36.146621 113970 round_trippers.go:447] Cache-Control: no-store
I0619 02:59:36.146625 113970 round_trippers.go:447] Content-Type: application/json
I0619 02:59:36.146629 113970 round_trippers.go:447] Date: Wed, 19 Jun 2019 02:59:35 GMT
I0619 02:59:36.146715 113970 round_trippers.go:416] GET https://10.1.10.123:6443/apis/image.openshift.io/v1/imagestreams?watch=true
I0619 02:59:36.146734 113970 round_trippers.go:423] Request Headers:
I0619 02:59:36.146739 113970 round_trippers.go:426] Accept: application/json, */*
I0619 02:59:36.146744 113970 round_trippers.go:426] User-Agent: oc/v1.13.4+9252851b0 (linux/amd64) kubernetes/9252851
I0619 02:59:36.146749 113970 round_trippers.go:426] Authorization: Bearer VFFbQmu3auOY3LQEeSoX0v5OnYgzDcNXYt9TXw5M9Pk
I0619 02:59:36.153447 113970 round_trippers.go:441] Response Status: 200 OK in 6 milliseconds
I0619 02:59:36.153462 113970 round_trippers.go:444] Response Headers:
I0619 02:59:36.153467 113970 round_trippers.go:447] Content-Type: application/json
I0619 02:59:36.153472 113970 round_trippers.go:447] Date: Wed, 19 Jun 2019 02:59:36 GMT
I0619 02:59:36.153476 113970 round_trippers.go:447] Audit-Id: 650fe98a-3e04-4f7b-a453-78e8f8bc59e1
I0619 02:59:36.153480 113970 round_trippers.go:447] Cache-Control: no-store
I0619 02:59:36.153483 113970 round_trippers.go:447] Cache-Control: no-store
I0619 02:59:36.153537 113970 round_trippers.go:416] GET https://10.1.10.123:6443/apis/image.openshift.io/v1/imagestreams
I0619 02:59:36.153544 113970 round_trippers.go:423] Request Headers:
I0619 02:59:36.153549 113970 round_trippers.go:426] Accept: application/json, */*
I0619 02:59:36.153554 113970 round_trippers.go:426] User-Agent: oc/v1.13.4+9252851b0 (linux/amd64) kubernetes/9252851
I0619 02:59:36.153558 113970 round_trippers.go:426] Authorization: Bearer VFFbQmu3auOY3LQEeSoX0v5OnYgzDcNXYt9TXw5M9Pk
I0619 02:59:36.177086 113970 round_trippers.go:441] Response Status: 200 OK in 23 milliseconds
I0619 02:59:36.177115 113970 round_trippers.go:444] Response Headers:
I0619 02:59:36.177125 113970 round_trippers.go:447] Audit-Id: 2bfc416d-9831-4569-9a39-4a3ac4fe96d2
I0619 02:59:36.177141 113970 round_trippers.go:447] Cache-Control: no-store
I0619 02:59:36.177145 113970 round_trippers.go:447] Cache-Control: no-store
I0619 02:59:36.177148 113970 round_trippers.go:447] Content-Type: application/json
I0619 02:59:36.177153 113970 round_trippers.go:447] Date: Wed, 19 Jun 2019 02:59:36 GMT
I0619 02:59:36.181379 113970 request.go:942] Response Body: {"kind":"ImageStreamList","apiVersion":"image.openshift.io/v1","metadata":{"selfLink":"/apis/image.openshift.io/v1/imagestreams","resourceVersion":"24423994"},"items":[{"metadata":{"name":"ruby-ex","namespace":"devexpupgrade","selfLink":"/apis/image.openshift.io/v1/namespaces/devexpupgrade/imagestreams/ruby-ex","uid":"170caa8d-8da3-11e9-bf19-0a580a83001b","resourceVersion":"17237769","generation":1,"creationTimestamp":"2019-06-13T06:18:41Z","labels":{"app":"ruby-ex"},"annotations":{"openshift.io/generated-by":"OpenShiftNewApp"}},"spec":{"lookupPolicy":{"local":false}},"status":{"dockerImageRepository":"image-registry.openshift-image-registry.svc:5000/devexpupgrade/ruby-ex","publicDockerImageRepository":"default-route-openshift-image-registry.apps.baremetal-lab-02.qe.devcluster.openshift.com/devexpupgrade/ruby-ex","tags":[{"tag":"latest","items":[{"created":"2019-06-13T06:19:46Z","dockerImageReference":"image-registry.openshift-image-registry.svc:5000/devexpupgrade/ruby-ex@sha256:5a1d5e269eb9360290edfe3933daa4 [truncated 310413 chars]
I0619 02:59:36.188808 113970 prune.go:278] Creating image pruner with keepYoungerThan=1h0m0s, keepTagRevisions=3, pruneOverSizeLimit=<nil>, allImages=true
I0619 02:59:36.188831 113970 prune.go:357] Adding image "sha256:0089883f8e4387618946cd24378a447b8cf7e5dfaa146b94acab27fc5e170a14" to graph
I0619 02:59:36.189134 113970 prune.go:379] Adding image layer "sha256:26e5ed6899dbf4b1e93e0898255e8aaf43465cecd3a24910f26edb5d43dafa3c" to graph
I0619 02:59:36.189152 113970 prune.go:379] Adding image layer "sha256:66dbe984a319ca6d40dc10c2c561821128a0bd8967e0cbd8cc2a302736041ffb" to graph
==================================snip====================================
I'd say it's an architectural limitation. We should recommend running pruning as a pod inside the cluster; otherwise, round trips between the master API and the pruner may hit the compaction window, i.e. the pruner may fail to get all images.

Oleg, thanks. I could prune 200K images from a pod without hitting the timeout error. Verified with oc client v4.1.2.
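One hedged way to follow that recommendation — a sketch, not a documented procedure for this release — is to run the pruner from a Job whose service account is bound to the `system:image-pruner` cluster role. The "pruner" service account name, the "openshift-image-pruning" namespace, and the CLI image tag below are all assumptions of this sketch; oc inside the pod is expected to pick up the in-cluster service-account configuration.

```go
package main

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "image-pruner-"},
		Spec: batchv1.JobSpec{
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					// "pruner" is a hypothetical service account; it must be
					// bound to the system:image-pruner cluster role.
					ServiceAccountName: "pruner",
					RestartPolicy:      corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name: "prune",
						// Image and tag are assumptions; any image carrying a
						// matching oc binary would do.
						Image: "quay.io/openshift/origin-cli:4.1",
						// No kubeconfig is mounted: oc is expected to use the
						// in-cluster service-account credentials.
						Command: []string{"oc", "adm", "prune", "images", "--all", "--confirm"},
					}},
				},
			},
		},
	}

	// "openshift-image-pruning" is a hypothetical namespace.
	_, err = cs.BatchV1().Jobs("openshift-image-pruning").Create(context.Background(), job, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
}
```

Running inside the cluster keeps every page-sized request on the cluster network, so the whole list completes well within the compaction window.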