
Bug 1873586

Summary: Many "blob unknown to registry" errors during migration of images
Product: OpenShift Container Platform
Component: Image Registry
Version: 4.4
Status: CLOSED NOTABUG
Severity: unspecified
Priority: unspecified
Reporter: Robert Bost <rbost>
Assignee: Oleg Bulatov <obulatov>
QA Contact: Wenjing Zheng <wzheng>
CC: alpatel, aos-bugs, dymurray, ernelson, jmatthew, jmontleo, kelly.brown1, obulatov, rmarasch, sseago
Keywords: Reopened
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2021-04-07 14:50:46 UTC

Description Robert Bost 2020-08-28 17:01:15 UTC
This issue started in bz1816180 where there is still some review happening.

This new bug is being opened to determine whether a problem in the Image Registry itself is causing the "blob unknown" errors.

These errors happen too often and lead to failures during migration using the OpenShift CAM Tool.

Comment 2 Oleg Bulatov 2020-08-29 11:58:51 UTC
Can you at least describe the problem? The BZ template really helps to make a better description: Description of problem, How reproducible, Steps to Reproduce, Actual results, Expected results.

Comment 3 Robert Bost 2020-08-31 16:29:02 UTC
I didn't work this weekend, so I didn't see the requests for more information. Reopening.

Comment 5 Robert Bost 2020-08-31 21:11:12 UTC
Unfortunately, we do not seem to have any registry logs that overlap with the 'blob unknown' error seen in the migration pod. If it's a problem to leave this bug open while we await that information from the customer, feel free to close the bug and I'll reopen it.

Some more detail:

- Customer is using NFS for registry storage and is unable to move to another storage anytime soon.
- Customer is performing a migration using Cluster Application Migration (CAM) tool 1.2.5 from OCP 3.11 -> OCP 4.4
- The CAM tool initiates the copy operations and can retry a copy if previous attempts fail. The number of retries is limited to 5.
  https://github.com/konveyor/openshift-migration-plugin/blob/release-1.2.5/velero-plugins/migimagestream/shared.go#L36-L60
- Sometimes these copy operations hit their retry limit and the migration fails.
- It has helped the customer to prune their old image registry before starting the migration. This might point to load issues on the target registry, but it is certainly not conclusive.
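
The bounded-retry behavior described in the list above can be sketched roughly as follows. This is only an illustration, not the plugin's actual Go code: `copy_with_retries`, `copy_once`, and the optional delay are hypothetical names, and it assumes "5 retries" means one initial attempt plus up to five more.

```python
import time


def copy_with_retries(copy_once, max_retries=5, delay_seconds=0.0):
    """Call copy_once() until it succeeds or the retry budget is spent.

    Returns the number of attempts made. Re-raises the last error when
    every attempt fails. All names here are hypothetical.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            copy_once()
            return attempts
        except Exception:
            # Assumption: one initial attempt plus max_retries retries.
            if attempts > max_retries:
                raise
            time.sleep(delay_seconds)
```

With a fixed budget like this, a copy that keeps hitting "blob unknown to registry" exhausts its retries and the whole migration fails, which matches the behavior reported above.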

Comment 6 Robert Bost 2020-09-01 23:20:51 UTC
Considering the issue in bz1816180 and c#5 in this bug, is there anything you would recommend looking for or gathering when the issue comes back up? That way I can reopen the bug with some more helpful information.

Comment 7 Oleg Bulatov 2020-09-02 14:12:41 UTC
The registry logs from all replicas.

One thing that is interesting in BZ1816180: if you were doing a migration from 3.11 to 4.x, why did you try to PUT a manifest into the 3.11 registry?

But let's say you see a problem in the 4.x logs similar to the one in BZ1816180:

time="2020-03-21T14:56:21.84283751Z" level=error msg="response completed with error" err.code="manifest blob unknown" err.detail=sha256:b97d26121a76202c69136d5426c485adebe3b190bb6ee30a316673cf18b73745 err.message="blob unknown to registry"

It means the manifest uses the blob sha256:b97d26121a76202c69136d5426c485adebe3b190bb6ee30a316673cf18b73745, but the registry cannot find it.
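
When collecting logs from all replicas, a small script can pull the affected digests out of lines like the one above. The following is a minimal sketch, assuming the logs keep this key=value layout with double-quoted values; `parse_logfmt` and `missing_blob` are hypothetical helper names.

```python
import shlex


def parse_logfmt(line):
    """Split a key=value log line into a dict, honoring double quotes."""
    try:
        tokens = shlex.split(line)
    except ValueError:
        # Unbalanced quotes: treat the line as unparseable.
        return {}
    fields = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def missing_blob(line):
    """Return the digest if the line reports an unknown blob, else None."""
    fields = parse_logfmt(line)
    if fields.get("err.message") == "blob unknown to registry":
        return fields.get("err.detail")
    return None
```

Running `missing_blob` over each replica's log and keeping the `time` field alongside the digest makes it easier to line up the registry errors with the failures seen in the migration pod.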

Was the blob uploaded before the manifest? It should be; otherwise there is a bug in the migration tool. Was it uploaded successfully? Was it uploaded to another replica? And so on.

That might be an indication of NFS problems: the blob was uploaded to one replica, and the manifest was uploaded to another replica that doesn't see the blob because its NFS client cached the blob's absence.
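
One way to test this hypothesis is to ask each replica whether it can see the blob. The sketch below uses the blob existence check from the Docker Registry HTTP API V2 (`HEAD /v2/<name>/blobs/<digest>`, which returns 200 when the blob exists and 404 when it does not). It rests on several assumptions: that each replica can be reached at its own URL, that the registry accepts a bearer token, and that the TLS certificate is trusted by the client. The function names and replica URLs are hypothetical.

```python
import urllib.error
import urllib.request


def blob_url(base_url, repository, digest):
    """Build the Registry V2 blob URL for a repository and digest."""
    return f"{base_url.rstrip('/')}/v2/{repository}/blobs/{digest}"


def blob_exists(base_url, repository, digest, token=None, timeout=10):
    """Return True on 200, False on 404; other HTTP errors propagate."""
    request = urllib.request.Request(
        blob_url(base_url, repository, digest), method="HEAD"
    )
    if token:
        # Assumption: the registry accepts this token as a bearer token.
        request.add_header("Authorization", f"Bearer {token}")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise


def replicas_missing_blob(replica_urls, repository, digest, token=None):
    """List the replica URLs (hypothetical) that do not see the blob."""
    return [
        url
        for url in replica_urls
        if not blob_exists(url, repository, digest, token)
    ]
```

If one replica reports the blob and another does not shortly after an upload, that would be consistent with the NFS caching explanation, though it would not prove it.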

But again, first let's figure out what's going on and why the logs are from the source registry.