Bug 1574379

Summary: Unable to pull image with error "Manifest unknown" when docker-registry memory is near pod memory limit.
Product: OpenShift Container Platform
Component: Image Registry
Version: 3.5.0
Target Release: 3.5.z
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Status: CLOSED WORKSFORME
Type: Bug
Reporter: Jaspreet Kaur <jkaur>
Assignee: Alexey Gladkov <agladkov>
QA Contact: Wenjing Zheng <wzheng>
CC: aos-bugs, bparees, jkaur, ktadimar, miminar, obulatov
Last Closed: 2018-08-22 02:39:15 UTC

Description Jaspreet Kaur 2018-05-03 07:32:46 UTC
Description of problem: Pods are failing to start because their image cannot be pulled; the pull fails with the error:

Failed to pull image "172.30.150.150:5000/ci/mac-slave2.nexus:1.0": manifest unknown: manifest unknown

We have confirmed that the image is valid and has worked previously. The image is hosted in an external docker registry and is used via the image stream "mac-slave2.nexus" which has the docker pull spec: "172.30.150.150:5000/ci/mac-slave2.nexus". 
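
To illustrate, the failing manifest fetch can be reproduced outside the kubelet. The sketch below assumes skopeo is available on a client host and that the integrated registry accepts the session token as registry credentials:

  # Fetch the manifest directly, using the pull spec from this report.
  # --tls-verify=false is assumed appropriate only because this is the
  # internal service IP of the integrated registry.
  skopeo inspect --tls-verify=false \
    --creds "unused:$(oc whoami -t)" \
    docker://172.30.150.150:5000/ci/mac-slave2.nexus:1.0
  # Against the failing registry this returns the same "manifest unknown" error.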

We have confirmed that the tag 1.0 has been resolved in the image stream, as both the image hash and the pull spec are populated (previously, with an invalid image name or tag, the image hash and pull spec would not be populated after creating the image stream).
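
As a minimal sketch of that check (the image stream name and the "ci" namespace are taken from the pull spec above):

  # Show the resolved tags, image hashes, and pull specs
  oc describe imagestream mac-slave2.nexus -n ci
  # Or read the resolved image reference for tag 1.0 directly:
  oc get istag mac-slave2.nexus:1.0 -n ci \
    -o jsonpath='{.image.metadata.name}{"\n"}{.image.dockerImageReference}{"\n"}'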

This issue only happens when the integrated docker-registry's memory usage is near the docker-registry pod's memory limit.

Re-deploying the docker-registry pod resolves the issue.
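
For reference, a minimal sketch of that workaround, assuming the default registry deployment config in the default project:

  # Trigger a fresh deployment of the integrated registry
  oc rollout latest dc/docker-registry -n default
  # Or simply delete the pod and let the deployment config recreate it:
  oc delete pod -n default -l deploymentconfig=docker-registry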

1. Memory usage on the docker-registry pod reaches the limit within a day or two, depending on how many pods are deployed that day; more pod deployments exhaust the memory even sooner (see the sketch after this list).
2. The memory that is used up is never freed unless the pod is restarted.
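
A hedged way to watch the registry's memory usage against its configured limit (resource names assume the default installation):

  # Name of the running registry pod
  POD=$(oc get pod -n default -l deploymentconfig=docker-registry \
    -o jsonpath='{.items[0].metadata.name}')
  # Current memory usage as seen by the pod's cgroup
  oc exec -n default "$POD" -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes
  # Configured memory limit on the registry container
  oc get dc/docker-registry -n default \
    -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}{"\n"}'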



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results: Image pulls start failing once the registry pod's memory usage reaches its limit, typically within one or two days.


Expected results: Deployments should not start failing after such a short span of registry uptime.


Additional info:

Comment 14 Venkata Tadimarri 2018-07-16 02:29:40 UTC
That's right. Restarting the docker-registry pod (and thereby flushing its memory) resolves the issue, without having to rebuild the image or recreate the image stream.