Description of problem:
When a user performs an image build, the following error is received when the build tries to push the image to the registry:

error: build error: Failed to push image: Put https://172.30.136.32:5000/v2/mobility-service/mobility-service/manifests/latest: EOF

Image pruning is performed on the cluster every 3 hours with the following options:
--image-keep-younger-than '24h' --image-keep-tag-revisions '5'

Version-Release number of selected component (if applicable):
OpenShift v3.9.41

How reproducible:

Steps to Reproduce:
1. Log into a node where failed build pods exist.
2. Re-tag an image (referenced by the build pods) found in the docker image cache.
3. Push the new image back into the registry.

Actual results:
- An EOF error is generated, matching the error the customer is receiving.

Expected results:
- The image push into the registry should succeed.

Additional info:
- The registry pods were also redeployed to fix any inconsistency issues with registry storage, which didn't fix the problem.
- Clearing the docker image cache on the node, pulling a fresh copy of the image from the registry, retagging it, and pushing the new version worked several times.
- How could we ensure image pruning consistently updates the docker image cache across the nodes (or, better yet, removes the images from the docker image cache across all nodes)?
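A minimal sketch of the reproduction steps above, assuming the registry address and repository from the error message and that the image is already present in the node's docker cache (the new tag name is just an example):

  # Re-tag the cached image under a new tag and push it back to the registry.
  docker tag 172.30.136.32:5000/mobility-service/mobility-service:latest \
             172.30.136.32:5000/mobility-service/mobility-service:retag-test
  docker push 172.30.136.32:5000/mobility-service/mobility-service:retag-test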
Registry log shows the following error after the push is attempted:

time="2018-10-16T17:42:42.794137832Z" level=panic msg="runtime error: invalid memory address or nil pointer dereference"
I1016 17:42:42.794377 1 logs.go:41] http: panic serving 10.1.30.1:55790: &{0xc420152000 map[] 2018-10-16 17:42:42.794137832 +0000 UTC m=+66007.148067707 panic runtime error: invalid memory address or nil pointer dereference <nil>}

We are working with Oleg Bulatov and SRE ops to troubleshoot this at this time, which is why I've set the needinfo flag to him.
> How could we ensure image pruning consistently updates the docker image cache across the nodes (or, better yet, removes the images from the docker image cache across all nodes)?

This question reflects a misunderstanding of how these components work. The copies of images that exist on the nodes are unaffected by pruning and unrelated to it. The only cache that's relevant during pruning is the registry cache, and since you restarted all the registries, that cache was cleared (and apparently that didn't resolve the issue).
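For reference, the registry redeploy that clears that cache is just a rollout of the registry deployment config; a sketch, assuming the default docker-registry deployment config in the default project:

  oc rollout latest dc/docker-registry -n default
  oc rollout status dc/docker-registry -n default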
It seems the image stream maximum size was hit. Further troubleshooting showed that an excessive number of image stream tags existed for the given image stream. Even with image pruning active, the tags themselves are not cleaned up (by design). Would it be possible to include cleaning image stream tags as part of the image pruning process (as a feature enhancement)?
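A quick way to confirm how many tags the image stream carries; a sketch, with the image stream and project names assumed from the error message above:

  oc get is mobility-service -n mobility-service \
      -o jsonpath='{range .status.tags[*]}{.tag}{"\n"}{end}' | wc -l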
So, the customer has two problems:

1. Creation of an ImageStreamMapping in a certain image stream fails because the image stream is already too big and nothing more can be added to it.
2. The registry can't handle this error, because the master API sends an error without the Details field, and in 3.9 we don't expect it to be nil.

We definitely should back-port the fix for the second problem.

Ben, what is our position on image streams that have too many tags? Should we say that it's the customer's burden to delete obsolete tags?
> Ben, what is our position on image streams that have too many tags? Should we say that it's the customer's burden to delete obsolete tags?

I'd say yes; I don't know what else we'd do about it. How many are "too many", though? Is this a configured limit somewhere, or is stuff just breaking because the object is too big?

> Would it be possible to include cleaning image stream tags as part of the image pruning process (as a feature enhancement)?

You can open it as an RFE; we'd have to discuss how feasible it is. Pruning today works by identifying images that aren't being referenced because no tag points to them. Removing tags is more complicated, because it's harder to determine what's a good candidate for removal.
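For the record, deleting obsolete tags by hand is straightforward; a sketch, with the tag name as a placeholder:

  # Removes the tag entry and its history from the image stream:
  oc delete istag mobility-service:obsolete-tag -n mobility-service
  # Or, for tags defined in the image stream spec:
  oc tag -d mobility-service/mobility-service:obsolete-tag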
Stuff is just breaking because the object is too big; I don't think we need any additional limits there.
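A rough way to see how big the object actually is; a sketch, keeping in mind that the serialized image stream has to fit under etcd's request-size cap (about 1.5 MiB by default):

  oc get is mobility-service -n mobility-service -o json | wc -c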
How many tags are in this imagestream? I seem to recall we have some general guidance around not exceeding ~100 tags per imagestream.
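To spot other image streams drifting past that guidance, something like the following works; a sketch, assuming jq is available and cluster-admin access:

  oc get is --all-namespaces -o json \
      | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \(.status.tags | length)"' \
      | sort -k2 -nr | head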
Also, if that is the issue, why did removing the image from the node, re-pulling it, and then pushing it succeed?
This image stream has 3084 tags. Pulling an image and pushing it back with the same tag will work because it doesn't create additional records in the image stream history. An attempt to pull it, give it a new tag, and push it back failed.
I've got the full output. The latest tag's history has 1234 records. As this cluster runs the pruner in the background, it can delete something from the history and make room to upload something new.
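For reference, the per-tag history depth can be read straight out of the image stream status; a sketch, assuming jq and the same placeholder names as above:

  oc get is mobility-service -n mobility-service -o json \
      | jq '.status.tags[] | select(.tag=="latest") | .items | length'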
Pruning should never remove a tag completely since they are passing "--image-keep-tag-revisions '5'", right? Cleaning up the history may help for now, since that also reduces the imagestream size, but if they continue adding new tags, eventually even pruning will not help, since we retain at least one entry for every tag (and in their case, five). The long-term solution is that they need to remove tags and change the workflow/process that led to the creation of so many unique tags.
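If they do go down the manual-cleanup route, the tag removal can be scripted; a sketch that keeps only latest (all names are placeholders, and any real cleanup would need a project-specific keep list):

  oc get is mobility-service -n mobility-service -o json \
      | jq -r '.status.tags[].tag' \
      | grep -v '^latest$' \
      | while read -r t; do
          oc delete istag "mobility-service:${t}" -n mobility-service
        done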
Also, what are the odds that pruning is currently working? It seems likely that pruning is failing due to API timeouts when trying to fetch the imagestream. Has it actually been confirmed that pruning is working? (If it were, how are there 1234 entries in the tag's history currently?)
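One way to check is to run the pruner by hand with verbose logging; a sketch using the stock oc adm prune images flags that correspond to the settings quoted in the description (dry run first, then with --confirm):

  oc adm prune images --keep-younger-than=24h --keep-tag-revisions=5 --loglevel=4
  oc adm prune images --keep-younger-than=24h --keep-tag-revisions=5 --confirm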
That was just my guess as to why pushing might sometimes succeed. It hasn't been confirmed that pruning is actually working, that's right. Though does it really matter why it sometimes works? Whether pruning is working or not, they have the problems I described in comment 5.
https://github.com/openshift/image-registry/pull/132
Marking this VERIFIED per comment 26. Verified on 3.9.57:

- Created/pushed 4000 builds
- Tagged all of the images in the image stream
- Attempted to push incremental builds

The push failed with "error creating ImageStreamMapping: etcdserver: request is too large", which is expected. See earlier comments (comment 8, comment 15) in this bz re: the need to prevent imagestreams from growing to such a huge number of images and tags. The registry logs showed no panics (the focus of the fix in comment 16) and no restarts.
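For anyone repeating the verification, a quick way to spot-check the registry for panics and pod restarts; a sketch, assuming the default docker-registry deployment in the default project:

  oc logs dc/docker-registry -n default | grep -ci panic
  oc get pods -n default -l deploymentconfig=docker-registry \
      -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount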
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3748