Bug 1589994
| Summary: | Large image stream unable to pull image | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Image Registry | Assignee: | Ben Parees <bparees> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Mike Fiedler <mifiedle> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.10.0 | CC: | aos-bugs, bparees, mifiedle, wsun, wzheng |
| Target Milestone: | --- | | |
| Target Release: | 3.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: Pulling an image from the registry requires scanning all images associated with the image's imagestream to find the correct layers. Consequence: When an imagestream contains a large number of images, this operation can take an excessive amount of time and result in a failure to pull the image due to timeouts. Fix: The mechanism for scanning the imagestream to find the relevant layers has been improved to reduce the number of API calls and to scan more efficiently, speeding up the operation. Result: Timeouts should no longer occur when pulling images from imagestreams that contain a large number of images. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-12-21 15:16:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Clayton Coleman
2018-06-11 19:29:58 UTC
The current search algorithm in HasBlob penalizes less frequently updated tags. Because all tag events are sorted by generation, a single frequently updated tag can push every event of a less frequently updated tag to the end of the list. I.e.:

tag1: 1h, 2h, 3h, 4h, 5h, ...
tag2: 1d, 2d, 3d

The first 24 tag1 events will be scanned before the first tag2 event is even considered. The algorithm should probably be:

1. All current tagged images first
2. All non-current tagged images, in order of generation

In the case above we have the equivalent of images pushed from origin builds (every hour) being checked against images pushed from docker-registry builds (every week), so docker-registry is "losing" to all origin images.

Something is really wrong with caching. I confirmed repository caching was on. On a ~5 minute run with several accesses, I got the following stats from the blobDescriptor service:
```
○ grep "starting with digest"             5635
○ grep "could not stat layer link"         332
○ grep "exists in the global blob store"   269
○ grep "found cached blob"                   9
○ grep "verifying presence"                260
○ grep "neither empty nor referenced"      238
```
9/332 cache hits on the repository seems really low. Of the 260 times we attempted to look up the blob in the image stream, we only found an image 22 times (probably head requests?).
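The scan order proposed above (current tagged images first, then historical tag events by generation) can be sketched as follows. This is a minimal illustration, not the actual image-registry implementation: `TagEvent` and its fields are simplified stand-ins for the real imagestream types.

```python
from dataclasses import dataclass

@dataclass
class TagEvent:
    tag: str
    image: str        # image SHA (placeholder string here)
    generation: int   # higher = more recent push
    current: bool     # True if this is the tag's latest event

def scan_order(events):
    """Order tag events for a blob lookup: the current image of every
    tag first, then historical events by descending generation."""
    current = [e for e in events if e.current]
    history = [e for e in events if not e.current]
    history.sort(key=lambda e: e.generation, reverse=True)
    return current + history

# tag1 pushed hourly (24 events), tag2 pushed daily (3 events).
# With generation-only ordering, all 24 tag1 events would be scanned
# before tag2; with this ordering, tag2's current image comes second.
events = (
    [TagEvent("tag1", f"sha-t1-{g}", g, g == 24) for g in range(1, 25)]
    + [TagEvent("tag2", f"sha-t2-{g}", g, g == 3) for g in (1, 2, 3)]
)
order = scan_order(events)
print([e.image for e in order[:3]])  # → ['sha-t1-24', 'sha-t2-3', 'sha-t1-23']
```

Note the design point: the fix that actually merged also reduced API calls, but the ordering change alone is what stops a busy tag from starving a quiet one.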
https://github.com/openshift/image-registry/pull/100 will hopefully make this manageable while additional solutions are pursued post 3.10.

https://github.com/openshift/image-registry/pull/100 has merged to help mitigate this in 3.10, but there are some longer term efforts underway for 3.11: https://github.com/openshift/image-registry/pull/101

Fixed in 3.11.

Ben, could you give some suggestions on how to verify this bug?

I'd recreate the scenario Clayton described in the initial report. Create an imagestream that has a large number of tags (~100), and make each tag have a large number of SHAs (~10 each, representing the image tag having been updated/pushed 10 times). This results in an imagestream with a total of around 1000 images in its history. Once you have that imagestream created, attempt to pull various SHAs from the imagestream. All pulls should succeed and not time out.

Hi Mike, could you help check whether this bug can be verified? Thanks!

Verified on 3.11.0-0.28.0 per comment 8. Created an imagestream with 100 tags, each with 10 SHAs; the imagestream showed a total of 1000 images. Then ran a loop to docker pull all 1000 SHAs from the registry, and all pulls were successful and took the expected amount of time.

Closing bugs that were verified and targeted for GA but for some reason were not picked up by errata. This bug fix should be present in current 3.11 release content.
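The verification recipe described in the comments (100 tags × 10 SHAs, then pull every image in the stream's history) can be sketched as below. This is a hedged illustration: the registry host, project name, stream name, and the digest placeholders are all assumptions, not values from the bug; in a real run the digests would come from the imagestream's actual history (e.g. via `oc`).

```python
import subprocess

REGISTRY = "docker-registry.default.svc:5000"   # hypothetical registry host
NAMESPACE = "bigtest"                           # hypothetical project name

def image_refs(stream="bigstream", tags=100, pushes_per_tag=10):
    """Yield one pull reference per image in the stream's history:
    100 tags x 10 pushes each = 1000 images."""
    for t in range(tags):
        for p in range(pushes_per_tag):
            # Placeholder digests; real ones come from the imagestream status.
            yield f"{REGISTRY}/{NAMESPACE}/{stream}@sha256:tag{t}-push{p}"

def pull_all(refs):
    """Pull every historical image; a registry timeout surfaces as a
    non-zero exit from docker, which check=True turns into an error."""
    for ref in refs:
        subprocess.run(["docker", "pull", ref], check=True)

refs = list(image_refs())
print(len(refs))  # → 1000
# pull_all(refs) would be run against a live cluster; all 1000 pulls
# should succeed without timing out once the fix is in place.
```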