Bug 1567657

Summary: Very large repositories make no progress pruning images if any image has an issue
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: ImageStreams
Assignee: Michal Minar <miminar>
Status: CLOSED ERRATA
QA Contact: XiuJuan Wang <xiuwang>
Severity: unspecified
Docs Contact:
Priority: high
Version: 3.9.0
CC: aos-bugs, bparees, jokerman, mifiedle, miminar, mmccomas, wzheng
Target Milestone: ---
Target Release: 3.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Image pruning stopped on encountering any unexpected error while deleting blobs. Consequence: On an image deletion error, image pruning failed to delete any image objects from etcd. Fix: Images are now pruned concurrently in separate jobs. Result: Image pruning does not stop on a single unexpected blob deletion failure.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-10-11 07:19:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
output of oc adm prune (flags: none)

Description Clayton Coleman 2018-04-15 21:08:57 UTC
The current pruning logic avoids pruning images if we hit any issues pruning image streams, blobs, or manifests. However, if a user falls significantly behind on pruning, the presence of any error compounds and prevents images from being deleted, which means subsequent prunes take longer and longer.

Right now on api.ci we have 10k images but only ~100 image streams, and we appear to have some issue that prevents a small number of image streams from being pruned (will file a separate issue). As a consequence, the number of images continues to grow without pruning, which is making everything else slower.

We should either incrementally prune images as we purge manifests and blobs (change the algorithm to walk the list of images to prune, delete each image's blobs, then its manifests, then the image itself, and track the ones we have already deleted), or perform a simple retry at the end. The former is probably better, since it makes retries or accidental cancellations more efficient. The latter is simpler.
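
A minimal sketch of that incremental loop, using hypothetical stand-ins (image, deleteBlob, deleteManifest, deleteImageObject) rather than the actual origin pruner API: failures are recorded but do not abort the run, already-deleted blobs are tracked so a retry does not redo finished work, and the image object is only removed from etcd once its storage is gone.

  package main

  import "fmt"

  type image struct {
          name     string
          manifest string
          blobs    []string
  }

  // Stubs standing in for the real registry/etcd calls.
  func deleteBlob(digest string) error      { fmt.Println("deleting blob", digest); return nil }
  func deleteManifest(digest string) error  { fmt.Println("deleting manifest", digest); return nil }
  func deleteImageObject(name string) error { fmt.Println("deleting image", name); return nil }

  // pruneIncrementally walks the prune candidates one by one: blobs first,
  // then the manifest, then the image object.
  func pruneIncrementally(images []image) {
          deleted := map[string]bool{}
          for _, img := range images {
                  ok := true
                  for _, b := range img.blobs {
                          if deleted[b] {
                                  continue
                          }
                          if err := deleteBlob(b); err != nil {
                                  fmt.Printf("error deleting blob %s: %v\n", b, err)
                                  ok = false
                                  continue // record the failure but keep pruning the rest
                          }
                          deleted[b] = true
                  }
                  if err := deleteManifest(img.manifest); err != nil {
                          fmt.Printf("error deleting manifest %s: %v\n", img.manifest, err)
                          ok = false
                  }
                  // Only drop the image object from etcd once its storage is gone.
                  if ok {
                          if err := deleteImageObject(img.name); err != nil {
                                  fmt.Printf("error deleting image %s: %v\n", img.name, err)
                          }
                  }
          }
  }

  func main() {
          pruneIncrementally([]image{
                  {name: "sha256:abc", manifest: "sha256:abc", blobs: []string{"sha256:l1", "sha256:l2"}},
          })
  }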

Also, pruning appears to be rate limited: the client that performs user and subject access reviews tops out at around 12 requests/s, which makes layer deletion a lot slower than it should be. We could also consider pruning blobs in parallel, although the incremental prune described above would have more of an impact.
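
For the parallel option, a rough worker-pool sketch (again with a stubbed deleteBlob, not the real client) showing how blob deletions could be fanned out across a bounded number of workers:

  package main

  import (
          "fmt"
          "sync"
  )

  // deleteBlob is a stub; the real pruner would call the registry's delete endpoint.
  func deleteBlob(digest string) error { fmt.Println("deleting blob", digest); return nil }

  // pruneBlobsParallel fans blob deletions out to a fixed number of workers so
  // one slow or failing deletion does not serialize the whole run.
  func pruneBlobsParallel(blobs []string, workers int) {
          jobs := make(chan string)
          var wg sync.WaitGroup
          for i := 0; i < workers; i++ {
                  wg.Add(1)
                  go func() {
                          defer wg.Done()
                          for b := range jobs {
                                  if err := deleteBlob(b); err != nil {
                                          fmt.Printf("error deleting blob %s: %v\n", b, err)
                                  }
                          }
                  }()
          }
          for _, b := range blobs {
                  jobs <- b
          }
          close(jobs)
          wg.Wait()
  }

  func main() {
          pruneBlobsParallel([]string{"sha256:a", "sha256:b", "sha256:c"}, 3)
  }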

Comment 1 Ben Parees 2018-04-16 00:38:25 UTC
While fixing this, I'd also like to add an "--ignore-invalid-references" flag which would default to false (current behavior) but, when set to true, would allow pruning to proceed even if there are deploymentconfigs/etc. that have invalid image references.

I believe that today a single invalid image reference in a DC prevents pruning from proceeding because we're afraid that we might prune something the user was trying to reference.

Comment 2 Michal Minar 2018-04-16 07:37:11 UTC
I'd be in favor of the iterative deletions: first an image's blobs, then its layers ... then the image itself, while sorting the images by the number of unique layers they have (from highest to lowest). In each iteration we would delete just the blobs referenced only by the image being deleted.
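
A small illustration of that ordering, counting layer references only among the prune candidates themselves (a simplification; the real pruner would also need to account for layers referenced by images it keeps):

  package main

  import (
          "fmt"
          "sort"
  )

  type image struct {
          name   string
          layers []string
  }

  // uniqueLayerCount returns how many of img's layers no other candidate references.
  func uniqueLayerCount(img image, refs map[string]int) int {
          n := 0
          for _, l := range img.layers {
                  if refs[l] == 1 {
                          n++
                  }
          }
          return n
  }

  // orderForPruning sorts the candidates so images with the most unique layers come first.
  func orderForPruning(images []image) []image {
          refs := map[string]int{}
          for _, img := range images {
                  for _, l := range img.layers {
                          refs[l]++
                  }
          }
          sort.Slice(images, func(i, j int) bool {
                  return uniqueLayerCount(images[i], refs) > uniqueLayerCount(images[j], refs)
          })
          return images
  }

  func main() {
          for _, img := range orderForPruning([]image{
                  {name: "a", layers: []string{"l1", "l2"}},
                  {name: "b", layers: []string{"l1", "l3", "l4"}},
          }) {
                  fmt.Println(img.name)
          }
  }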

I find retries much harder to design and reason about. Are we aware of all the errors that may happen now or in the future? Which errors can be retried and which cannot? Which errors should always be retried, and which only a few times? How many times should that be?

Comment 3 Michal Minar 2018-04-23 10:35:50 UTC
WIP PR: https://github.com/openshift/origin/pull/19468

Comment 5 Mike Fiedler 2018-08-30 12:00:07 UTC
@michal - Advice on how to verify this?  How to force or simulate errors deleting image blobs?

Comment 6 Michal Minar 2018-08-30 12:47:26 UTC
@mike, making part of the registry's storage read-only will do the job.

The algorithm should attempt to prune all of it and not stop on the first permission error.

Comment 7 Wenjing Zheng 2018-08-31 09:47:48 UTC
I am trying to verify this bug with the steps below:
1. Configure the registry storage backend to use emptyDir.
2. Push an image to the internal registry.
3. Locate the image on disk, for example:
/var/lib/origin/openshift.local.volumes/pods/82d40d38-acff-11e8-9e95-00163e008c9e/volumes/kubernetes.io~empty-dir
4. Change registry-storage to read-only, e.g. from:
drwxrwsrwx. 3 root 1000000000 37 Aug 31 05:29 registry-storage
to
dr--r-Sr--. 3 root 1000000000 37 Aug 31 05:29 registry-storage
5. Try to prune images: no errors appear and the image prune finishes in the end.

Does this mean the bug has been fixed? If not, could you correct my steps if there is any mistake? Thanks!

Comment 8 Michal Minar 2018-08-31 09:50:53 UTC
Hi Wenjing,

could you be more specific about the options you pass to the pruner and the output you see?

There should definitely be some errors printed resulting from the blobs not being pruned.

Comment 9 Wenjing Zheng 2018-08-31 09:52:25 UTC
I just ran the command below to prune images:
# oc adm prune images --token=dfc0j99FNURtMfZylGzGHGg6ziMWIvNGrL7E1I3D40g

Comment 10 Wenjing Zheng 2018-08-31 10:33:56 UTC
After I run the command below, I can see some warnings and errors, but I am not sure whether they are related; I will attach the output:
oc adm prune images --token=dfc0j99FNURtMfZylGzGHGg6ziMWIvNGrL7E1I3D40g --confirm=true --keep-tag-revisions=0 --keep-younger-than=0 

Also, I have the following questions:
What errors am I supposed to see?
The pushed images should not be pruned while the storage is read-only, right?

Comment 11 Wenjing Zheng 2018-08-31 10:34:38 UTC
Created attachment 1480073 [details]
output of oc adm prune

Comment 12 Michal Minar 2018-08-31 14:10:49 UTC
On the read-only storage, I see a lot of errors like this:

  Deleting layer link sha256:39fe8b1d3a9cb13a361204c23cf4e342d53184b4440492fa724f4aeb4eb1d64f in repository openshift/jboss-webserver30-tomcat7-openshift
  error deleting repository openshift/jboss-webserver30-tomcat7-openshift layer link sha256:39fe8b1d3a9cb13a361204c23cf4e342d53184b4440492fa724f4aeb4eb1d64f from the registry: 500 Internal Server Error
  Deleting layer link sha256:39fe8b1d3a9cb13a361204c23cf4e342d53184b4440492fa724f4aeb4eb1d64f in repository openshift/redhat-sso71-openshift
  error deleting repository openshift/redhat-sso71-openshift layer link sha256:39fe8b1d3a9cb13a361204c23cf4e342d53184b4440492fa724f4aeb4eb1d64f from the registry: 500 Internal Server Error
  Deleting layer link sha256:39fe8b1d3a9cb13a361204c23cf4e342d53184b4440492fa724f4aeb4eb1d64f in repository openshift/jboss-amq-62
  error deleting repository openshift/jboss-amq-62 layer link sha256:39fe8b1d3a9cb13a361204c23cf4e342d53184b4440492fa724f4aeb4eb1d64f from the registry: 500 Internal Server Error
  Deleting layer link sha256:39fe8b1d3a9cb13a361204c23cf4e342d53184b4440492fa724f4aeb4eb1d64f in repository openshift/jboss-webserver31-tomcat8-openshift
  error deleting repository openshift/jboss-webserver31-tomcat8-openshift layer link sha256:39fe8b1d3a9cb13a361204c23cf4e342d53184b4440492fa724f4aeb4eb1d64f from the registry: 500 Internal Server Error
  Deleting layer link sha256:39fe8b1d3a9cb13a361204c23cf4e342d53184b4440492fa724f4aeb4eb1d64f in repository openshift/jboss-eap70-openshift
  error deleting repository openshift/jboss-eap70-openshift layer link sha256:39fe8b1d3a9cb13a361204c23cf4e342d53184b4440492fa724f4aeb4eb1d64f from the registry: 500 Internal Server Error
  Deleting layer link sha256:39fe8b1d3a9cb13a361204c23cf4e342d53184b4440492fa724f4aeb4eb1d64f in repository openshift/redhat-openjdk18-openshift
  error deleting repository openshift/redhat-openjdk18-openshift layer link sha256:39fe8b1d3a9cb13a361204c23cf4e342d53184b4440492fa724f4aeb4eb1d64f from the registry: 500 Internal Server Error
  Deleting layer link sha256:6b6ea5a6c4ac85e235d63e1326c3a6f624c8d83a1ae27429a34ebecd90cbe52c in repository openshift/redhat-openjdk18-openshift
  error deleting repository openshift/redhat-openjdk18-openshift layer link sha256:6b6ea5a6c4ac85e235d63e1326c3a6f624c8d83a1ae27429a34ebecd90cbe52c from the registry: 500 Internal Server Error
  Deleting blob sha256:6b6ea5a6c4ac85e235d63e1326c3a6f624c8d83a1ae27429a34ebecd90cbe52c
  error deleting blob sha256:6b6ea5a6c4ac85e235d63e1326c3a6f624c8d83a1ae27429a34ebecd90cbe52c from the registry: 500 Internal Server Error
  Deleting blob sha256:5ffd5b1ec8e4264cdd62a3063ee56e370a973e0777da9c8c6f3a5f12e22fe6d5
  error deleting blob sha256:5ffd5b1ec8e4264cdd62a3063ee56e370a973e0777da9c8c6f3a5f12e22fe6d5 from the registry: 500 Internal Server Error
  Deleting manifest link sha256:5ffd5b1ec8e4264cdd62a3063ee56e370a973e0777da9c8c6f3a5f12e22fe6d5 in repository openshift/redhat-openjdk18-openshift
  error deleting manifest link sha256:5ffd5b1ec8e4264cdd62a3063ee56e370a973e0777da9c8c6f3a5f12e22fe6d5 from repository openshift/redhat-openjdk18-openshift: 500 Internal Server Error
  Deleted 42 objects out of 5213.
  Failed to delete 5171 objects.
  error: failed

I have a recent oc client but an older release of docker-registry deployed (not sure if it still returns 500 or some other error), but the output should be quite similar. Your standard error output seems to be redirected somewhere else.

Anyway, if the pruner doesn't stop at the first error and continues to delete the other blobs, it means the bug has been addressed. Please compare with an older version of the oc client.

Comment 13 Wenjing Zheng 2018-09-03 06:14:53 UTC
When I use oc v3.9.40 to test against the read-only repository, it stopped after a while (but not at the first error) and did not print the ending below, which is different from oc v3.11.0-0.25.0:
Deleted 728 objects out of 776.
Failed to delete 48 objects.
error: failed

Can this be regarded as the issue having been fixed?

Comment 14 Wenjing Zheng 2018-09-03 06:36:05 UTC
Please ignore my comment #13 above; I tried several times and found that oc v3.9.40 gets stuck at the first error, like below:
error pruning manifest sha256:5e8e0509e829bb8f990249135a36e81a3ecbe94294e7a185cc14616e5fad96bd in the repository docker-registry.default.svc:5000/sunny/myimage6: 500 Internal Server Error

But oc v3.11.0-0.25.0 finishes pruning after the error appears:
error deleting manifest link sha256:5e8e0509e829bb8f990249135a36e81a3ecbe94294e7a185cc14616e5fad96bd from repository sunny/myimage7: 500 Internal Server Error
Deleting blob sha256:09cf45760aea204766c1668c497f8571c67fbfa8d81ec03e4293f3fa0b9945d6
W0903 02:30:09.020965    6113 prune.go:1681] Unable to prune layer https://docker-registry.default.svc:5000/v2/openshift/postgresql/blobs/sha256:8275392acc4a34b880bc61b1482eec5049e67ae82ddd10ac9450ad6fdfdf3b74, returned 404 Not Found
Deleting blob sha256:7bd78273b66657ac8b3e800506047866ce94eea0b50e23ecdb76b0a8fbc5cdcc
Deleting blob sha256:642d3edf81580395cbafe161ea49ff5d988134d3ff8fe2240a5e30dc884cfcc8
Deleting blob sha256:610da2480f27448225a79ca668b755d8a90ecd698d85044f3902e4461b9bbfe2
Deleting blob sha256:c196631bd9ac47f0e62cd3b0160159ccf34a88b47a9487a0c3dd3c55b457d607
Deleting manifest link sha256:642d3edf81580395cbafe161ea49ff5d988134d3ff8fe2240a5e30dc884cfcc8 in repository openshift/python 

So I will verify this bug now. Thanks for your help, Michal!

Comment 15 Michal Minar 2018-09-03 08:32:49 UTC
> But oc v3.11.0-0.25.0 finishes pruning after the error appears:

Yes, that's what I meant. Glad it worked out.
Cheers.

Comment 17 errata-xmlrpc 2018-10-11 07:19:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652