Description of problem:
When a user performs an image build, the following error is received when the build tries to push the image to the registry:

error: build error: Failed to push image: Put https://172.30.136.32:5000/v2/mobility-service/mobility-service/manifests/latest: EOF

Image pruning is performed on the cluster every 3 hours with the following options:
--image-keep-younger-than '24h' --image-keep-tag-revisions '5'

Version-Release number of selected component (if applicable):
OpenShift v3.9.41

How reproducible:

Steps to Reproduce:
1. Log into a node where failed build pods exist.
2. Re-tag an image (referenced by the build pods) found in the docker image cache.
3. Push the new image back into the registry.

Actual results:
- An EOF error is generated, matching the error the customer is receiving.

Expected results:
- The image push into the registry should succeed.

Additional info:
- The registry pods were also redeployed to fix any inconsistency issues with registry storage, which didn't fix the problem.
- Clearing the docker image cache on the node, pulling a fresh copy of the image from the registry, retagging it, and pushing the new version worked several times.
- How could we ensure image pruning consistently updates the docker image cache across the nodes (or, better yet, removes the images from the docker image cache across all nodes)?
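A minimal sketch of the reproduction steps above, assuming the registry address and repository from the error message and that the image is already present in the node's docker cache (the new tag name is just an example):

  # Re-tag the cached image under a new tag and push it back to the registry.
  docker tag 172.30.136.32:5000/mobility-service/mobility-service:latest \
             172.30.136.32:5000/mobility-service/mobility-service:retag-test
  docker push 172.30.136.32:5000/mobility-service/mobility-service:retag-test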
Registry log shows the following error after the push is attempted:

time="2018-10-16T17:42:42.794137832Z" level=panic msg="runtime error: invalid memory address or nil pointer dereference"
I1016 17:42:42.794377 1 logs.go:41] http: panic serving 10.1.30.1:55790: &{0xc420152000 map[] 2018-10-16 17:42:42.794137832 +0000 UTC m=+66007.148067707 panic runtime error: invalid memory address or nil pointer dereference <nil>}

We are working with Oleg Bulatov and SRE ops to troubleshoot this at this time, which is why I've set the needinfo flag to him.
> How could we ensure image pruning consistently updates the docker image cache across the nodes (or, better yet, removes the images from the docker image cache across all nodes)?

This question reflects a misunderstanding of how these components work. The copies of images that exist on the nodes are unaffected by pruning and unrelated to it. The only cache that's relevant during pruning is the registry cache, and since you restarted all the registries, that cache was cleared (and apparently that didn't resolve the issue).
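For reference, the registry redeploy that clears that cache is just a rollout of the registry deployment config; a sketch, assuming the default docker-registry deployment config in the default project:

  oc rollout latest dc/docker-registry -n default
  oc rollout status dc/docker-registry -n default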
It seems the image stream maximum size was hit. Further troubleshooting showed that an excessive number of image stream tags existed for the given image stream. Even with image pruning active, the tags themselves are not cleaned up (by design). Would it be possible to include cleaning image stream tags as part of the image pruning process (as a feature enhancement)?
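A quick way to confirm how many tags the image stream carries; a sketch, with the image stream and project names assumed from the error message above:

  oc get is mobility-service -n mobility-service \
      -o jsonpath='{range .status.tags[*]}{.tag}{"\n"}{end}' | wc -l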
So, the customer has two problems:

1. Creation of an ImageStreamMapping in a certain image stream fails because the image stream is already too big and nothing more can be added to it.
2. The registry can't handle this error, because the master API sends an error without the Details field, and in 3.9 we don't expect it to be nil.

We definitely should back-port the fix for the second problem.

Ben, what is our position on image streams that have too many tags? Should we say that it's the customer's burden to delete obsolete tags?
> Ben, what is our position on image streams that have too many tags? Should we say that it's the customer's burden to delete obsolete tags?

I'd say yes; I don't know what else we'd do about it. How many are "too many", though? Is this a configured limit somewhere, or is stuff just breaking because the object is too big?

> Would it be possible to include cleaning image stream tags as part of the image pruning process (as a feature enhancement)?

You can open it as an RFE; we'd have to discuss how feasible it is. Pruning today works by identifying images that aren't being referenced because no tag points to them. Removing tags is more complicated, because it's harder to determine what's a good candidate for removal.
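For the record, deleting obsolete tags by hand is straightforward; a sketch, with the tag name as a placeholder:

  # Removes the tag entry and its history from the image stream:
  oc delete istag mobility-service:obsolete-tag -n mobility-service
  # Or, for tags defined in the image stream spec:
  oc tag -d mobility-service/mobility-service:obsolete-tag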
Stuff is just breaking because the object is too big; I don't think we need any additional limits there.
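A rough way to see how big the object actually is; a sketch, keeping in mind that the serialized image stream has to fit under etcd's request-size cap (about 1.5 MiB by default):

  oc get is mobility-service -n mobility-service -o json | wc -c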
How many tags are in this imagestream? I seem to recall we have some general guidance around not exceeding ~100 tags per imagestream.
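To spot other image streams drifting past that guidance, something like the following works; a sketch, assuming jq is available and cluster-admin access:

  oc get is --all-namespaces -o json \
      | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \(.status.tags | length)"' \
      | sort -k2 -nr | head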
Also, if that is the issue, why did removing the image from the node, re-pulling it, and then pushing it succeed?
This image stream has 3084 tags. Pulling an image and pushing it back with the same tag will work because it doesn't create additional records in the image stream history. An attempt to pull it, give it a new tag, and push it back failed.
I've got the full output. The latest tag's history has 1234 records. As this cluster runs the pruner in the background, it can delete something from the history and make room to upload something new.
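For reference, the per-tag history depth can be read straight out of the image stream status; a sketch, assuming jq and the same placeholder names as above:

  oc get is mobility-service -n mobility-service -o json \
      | jq '.status.tags[] | select(.tag=="latest") | .items | length'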
Pruning should never remove a tag completely since they are passing "--image-keep-tag-revisions '5'", right? Cleaning up the history may help for now, since that also reduces the imagestream size, but if they continue adding new tags, eventually even pruning will not help, since we retain at least one entry for every tag (and in their case, five). The long-term solution is that they need to remove tags and change the workflow/process that led to the creation of so many unique tags.
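If they do go down the manual-cleanup route, the tag removal can be scripted; a sketch that keeps only latest (all names are placeholders, and any real cleanup would need a project-specific keep list):

  oc get is mobility-service -n mobility-service -o json \
      | jq -r '.status.tags[].tag' \
      | grep -v '^latest$' \
      | while read -r t; do
          oc delete istag "mobility-service:${t}" -n mobility-service
        done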
Also, what are the odds that pruning is currently working? It seems likely that pruning is failing due to API timeouts when trying to fetch the imagestream. Has it actually been confirmed that pruning is working? (If it were, how are there 1234 entries in the tag's history currently?)
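One way to check is to run the pruner by hand with verbose logging; a sketch using the stock oc adm prune images flags that correspond to the settings quoted in the description (dry run first, then with --confirm):

  oc adm prune images --keep-younger-than=24h --keep-tag-revisions=5 --loglevel=4
  oc adm prune images --keep-younger-than=24h --keep-tag-revisions=5 --confirm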
That was just my guess as to why pushing might sometimes succeed. It hasn't been confirmed that pruning is actually working, that's right. Though does it really matter why it sometimes works? Whether pruning is working or not, they have the problems I described in comment 5.
https://github.com/openshift/image-registry/pull/132
Marking this VERIFIED per comment 26. Verified on 3.9.57:

- Created/pushed 4000 builds
- Tagged all of the images in the image stream
- Attempted to push incremental builds

The push failed with "error creating ImageStreamMapping: etcdserver: request is too large", which is expected. See earlier comments (comment 8, comment 15) in this bz re: the need to prevent imagestreams from growing to such a huge number of images and tags. The registry logs showed no panics (the focus of the fix in comment 16) and no restarts.
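For anyone repeating the verification, a quick way to spot-check the registry for panics and pod restarts; a sketch, assuming the default docker-registry deployment in the default project:

  oc logs dc/docker-registry -n default | grep -ci panic
  oc get pods -n default -l deploymentconfig=docker-registry \
      -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount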
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3748