Bug 1816180

Summary:	Image migration fails with: manifest blob unknown: blob unknown to registry
Product:	Migration Toolkit for Containers	Reporter:	spandura
Component:	General	Assignee:	Scott Seago <sseago>
Status:	CLOSED NOTABUG	QA Contact:	Xin jiang <xjiang>
Severity:	low	Docs Contact:	Avital Pinnick <apinnick>
Priority:	low
Version:	1.3.0	CC:	alpatel, apjagtap, chezhang, ernelson, fgiloux, jmatthew, mberube, mduasope, rbost, rjohnson, sregidor, sseago, whu
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1831614		Environment:
Last Closed:	2021-06-30 15:20:58 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1831614
Bug Blocks:	1759663

Description spandura 2020-03-23 13:55:17 UTC

Description of problem:
===========================
Migration of apps from OCP 3.11 GA to OCP 4.3.5 failed because the InitialBackup could not be created.

Error copying image: Error writing manifest: Error uploading manifest latest to 172.30.73.7:5000/project34/httpd-example-0: errors:\nmanifest blob unknown: blob unknown to registry\nmanifest blob unknown: blob unknown to registry\n" backup=openshift-migration/migmigration-05397-d98gg cmd=/plugins/velero-plugins logSource="/go/src/github.com/fusor/openshift-migration-plugin/velero-plugins/migimagestream/backup.go:84" pluginName=velero-plugins
time="2020-03-21T14:56:21Z" level=info msg="1 errors encountered backup up item" backup=openshift-migration/migmigration-05397-d98gg group=image.openshift.io/v1 logSource="pkg/backup/resource_backupper.go:284" name=httpd-example-0 namespace=project34 resource=imagestreams


Version-Release number of selected component (if applicable):
================================================================
OCP 3.11 GA:
=============
[root@dell-per630-05 ~]# rpm -qa | grep openshift
atomic-openshift-hyperkube-3.11.188-1.git.0.db0eaa8.el7.x86_64
atomic-openshift-clients-3.11.188-1.git.0.db0eaa8.el7.x86_64
atomic-openshift-docker-excluder-3.11.188-1.git.0.db0eaa8.el7.noarch
atomic-openshift-excluder-3.11.188-1.git.0.db0eaa8.el7.noarch
atomic-openshift-node-3.11.188-1.git.0.db0eaa8.el7.x86_64

OCP 4.3.5:
=============
ocp_installers_index_url: https://mirror.openshift.com/pub/openshift-v4/clients/ocp/4.3.3/
ocp_rhcos_index_url: https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.3/latest/

How reproducible:

Steps to Reproduce:
============================
1. Install OCP 3.11 GA. Create projects with apps.

2. Install OCP 4.3.5 

3. Start migration from OCP 3.11 to OCP 4.3.5

Actual results:
===============

Expected results:
==================

Additional info:
=============================
Slack channel discussion link: https://coreos.slack.com/archives/CHWBWE8LD/p1584803348279400


Registry logs from OCP 3.11 cluster:
==========================================
time="2020-03-21T14:56:21.68391271Z" level=debug msg="s3aws.GetContent("/docker/registry/v2/repositories/project34/httpd-example-0/_layers/sha256/f1e56db67514d64aacc14367d514a44098[207/1837]039444b90d1ea76c8fb4/link")" go.version=go1.13.4 http.request.contenttype="application/vnd.docker.distribution.manifest.v2+json" http.request.host="172.30.73.7:5000" http.request.id=75bcee29-cfe8-4cc1-b2d5-6c857b9bbac0 http.request.method=PUT http.request.remoteaddr="10.130.0.1:53848" http.request.uri="/v2/project34/httpd-example-0/manifests/latest" http.request.useragent="Go-http-client/1.1" trace.duration=17.915026ms trace.file="/go/src/github.com/docker/distribution/registry/storage/driver/base/base.go" trace.func="github.com/docker/distribution/registry/storage/driver/base.(*Base).GetContent" trace.id=0635a6c3-8ca5-4c7e-8b68-95a053b6c694 trace.line=95 vars.name="project34/httpd-example-0" vars.reference=latest
time="2020-03-21T14:56:21.708862566Z" level=debug msg="s3aws.Stat("/docker/registry/v2/blobs/sha256/f1/f1e56db67514d64aacc14367d514a44098bcafe117d4039444b90d1ea76c8fb4/data")" go.version=go1.13.4 http.request.contenttype="application/vnd.docker.distribution.manifest.v2+json" http.request.host="172.30.73.7:5000" http.request.id=75bcee29-cfe8-4cc1-b2d5-6c857b9bbac0 http.request.method=PUT http.request.remoteaddr="10.130.0.1:53848" http.request.uri="/v2/project34/httpd-example-0/manifests/latest" http.request.useragent="Go-http-client/1.1" trace.duration=24.903232ms trace.file="/go/src/github.com/docker/distribution/registry/storage/driver/base/base.go" trace.func="github.com/docker/distribution/registry/storage/driver/base.(*Base).Stat" trace.id=513e6669-ff42-436d-925a-8014f9b2e206 trace.line=155 vars.name="project34/httpd-example-0" vars.reference=latest
time="2020-03-21T14:56:21.724017808Z" level=debug msg="s3aws.GetContent("/docker/registry/v2/repositories/project34/httpd-example-0/_layers/sha256/0e6748108ed650611fc6918b6319f0665398cb219bea0d8a9d23ba7a01b26a48/link")" go.version=go1.13.4 http.request.contenttype="application/vnd.docker.distribution.manifest.v2+json" http.request.host="172.30.73.7:5000" http.request.id=75bcee29-cfe8-4cc1-b2d5-6c857b9bbac0 http.request.method=PUT http.request.remoteaddr="10.130.0.1:53848" http.request.uri="/v2/project34/httpd-example-0/manifests/latest" http.request.useragent="Go-http-client/1.1" trace.duration=15.10567ms trace.file="/go/src/github.com/docker/distribution/registry/storage/driver/base/base.go" trace.func="github.com/docker/distribution/registry/storage/driver/base.(*Base).GetContent" trace.id=5bda04dd-fcc5-4da3-81ab-f980d0738167 trace.line=95 vars.name="project34/httpd-example-0" vars.reference=latest
time="2020-03-21T14:56:21.842753473Z" level=debug msg="s3aws.Stat("/docker/registry/v2/blobs/sha256/0e/0e6748108ed650611fc6918b6319f0665398cb219bea0d8a9d23ba7a01b26a48/data")" go.version=go1.13.4 http.request.contenttype="application/vnd.docker.distribution.manifest.v2+json" http.request.host="172.30.73.7:5000" http.request.id=75bcee29-cfe8-4cc1-b2d5-6c857b9bbac0 http.request.method=PUT http.request.remoteaddr="10.130.0.1:53848" http.request.uri="/v2/project34/httpd-example-0/manifests/latest" http.request.useragent="Go-http-client/1.1" trace.duration=118.683178ms trace.file="/go/src/github.com/docker/distribution/registry/storage/driver/base/base.go" trace.func="github.com/docker/distribution/registry/storage/driver/base.(*Base).Stat" trace.id=86b6ac74-4004-42c2-9ef4-f44dcd8740a8 trace.line=155 vars.name="project34/httpd-example-0" vars.reference=latest
time="2020-03-21T14:56:21.84283751Z" level=error msg="response completed with error" err.code="manifest blob unknown" err.detail=sha256:b97d26121a76202c69136d5426c485adebe3b190bb6ee30a316673cf18b73745 err.message="blob unknown to registry" go.version=go1.13.4 http.request.contenttype="application/vnd.docker.distribution.manifest.v2+json" http.request.host="172.30.73.7:5000" http.request.id=75bcee29-cfe8-4cc1-b2d5-6c857b9bbac0 http.request.method=PUT http.request.remoteaddr="10.130.0.1:53848" http.request.uri="/v2/project34/httpd-example-0/manifests/latest" http.request.useragent="Go-http-client/1.1" http.response.contenttype="application/json; charset=utf-8" http.response.duration=436.671418ms http.response.status=400 http.response.written=319 vars.name="project34/httpd-example-0" vars.reference=latest
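
A minimal sketch of how the rejected digest could be extracted from registry log entries like the ones above, assuming each entry sits on a single line in key=value form. The helper names `parse_logfmt` and `missing_blob` are hypothetical and not part of any registry or MTC tooling:

```python
import shlex


def parse_logfmt(line):
    """Split a key=value log line into a dict.

    shlex handles double-quoted values that contain spaces,
    e.g. err.code="manifest blob unknown".
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens without an '='
            fields[key] = value
    return fields


def missing_blob(line):
    """Return the digest from err.detail if the entry reports
    'manifest blob unknown', otherwise None."""
    fields = parse_logfmt(line)
    if fields.get("err.code") == "manifest blob unknown":
        return fields.get("err.detail")
    return None
```

Applied to the final error entry above, `missing_blob` would return the sha256 digest that the registry could not find while handling the manifest PUT.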

Comment 1 spandura 2020-03-23 14:10:39 UTC

Velero pod logs: http://css-storinator-02.css.lab.eng.rdu2.redhat.com/storage/Bugzilla_info/1816180/velero_logs_from_ocp_3_11_cluster

Comment 2 spandura 2020-03-23 14:20:41 UTC

(In reply to spandura from comment #1)
> Velero pod logs:
> http://css-storinator-02.css.lab.eng.rdu2.redhat.com/storage/Bugzilla_info/
> 1816180/velero_logs_from_ocp_3_11_cluster

Including all logs here: http://css-storinator-02.css.lab.eng.rdu2.redhat.com/storage/Bugzilla_info/1816180/

Comment 3 Xin jiang 2020-03-23 15:29:20 UTC

Would you please run 'oc get pods -n openshift-migration' on the 3.11 cluster? You should see the pod 'registry-migplan-k4bpb-1-vvk8j'. Then please run 'oc describe pod registry-migplan-k4bpb-1-vvk8j'.

Comment 4 Xin jiang 2020-03-23 15:46:49 UTC

Today we hit a similar problem. The registry-migplan-k4bpb-1-vvk8j pod is probably in a failed state on the 3.11 cluster side.

Comment 5 spandura 2020-03-24 10:59:43 UTC

(In reply to Xin jiang from comment #3)
> would you please execute command 'oc get pods -n openshift-migration'  on
> 3.11 cluster? you should see pod 'registry-migplan-k4bpb-1-vvk8j', then
> execute 'oc describe pod registry-migplan-k4bpb-1-vvk8j'?

[root@dell-per630-05 ~]# oc get pods -n openshift-migration
NAME                                  READY     STATUS    RESTARTS   AGE
migration-operator-5997688469-984sg   2/2       Running   0          4h
registry-migplan-k4bpb-1-vvk8j        1/1       Running   0          15m
restic-5hnvm                          1/1       Running   0          21m
restic-c2ncd                          1/1       Running   0          21m
restic-fln9p                          1/1       Running   0          21m
restic-hvpsd                          1/1       Running   0          21m
restic-mkkqv                          1/1       Running   0          21m
restic-tshvr                          1/1       Running   0          21m
velero-6bc8b85bf-mvrt7                1/1       Running   0          21m
[root@dell-per630-05 ~]# 

We have torn down the setup, so we don't have the output of the "describe" command. All the logs related to this are at http://css-storinator-02.css.lab.eng.rdu2.redhat.com/storage/Bugzilla_info/1816180/

Comment 7 John Matthews 2020-07-08 11:44:01 UTC

*** Bug 1831614 has been marked as a duplicate of this bug. ***

Comment 12 Robert Bost 2020-08-28 17:01:39 UTC

@alay here's the new bug to focus on issues in the registry: bz1873586

Comment 14 Frederic Giloux 2020-12-11 09:22:56 UTC

I have a customer facing similar issues.

I would like to clarify what the errors "manifest unknown: manifest unknown" and "manifest blob unknown: blob unknown to registry" mean.

What I think is happening is that the migration tool goes through the images referenced in the imagestream tags and tries to pull them one after the other. For some of them it fails. It does not fail because it is sending a wrong command. It fails for one of two reasons. Either the image manifest referenced in the imagestream (stored in etcd) does not exist in the image registry ("manifest unknown: manifest unknown"), or the manifest exists but references an image layer that does not exist or is corrupted ("manifest blob unknown: blob unknown to registry"). If you tried to pull the same image using podman or docker, you would get the same result.

I wish the migration tool used skopeo rather than pulling and pushing images:
- it would be much quicker
- it is able to preserve sha digests
- it would not require starting an intermediate registry

In any case, it would be much better if the migration tool did not hang completely when it cannot pull an image. It should flag the image as failed and carry on.

To mitigate the issue, I recommended that my customer aggressively prune objects before migrating a project. Besides speeding up the process, pruning may also remove most of the imagestream tags that are not in use. The tags that are in use will most probably be migrated successfully, because the migration tool uses the same pull command that is used to deploy the image and run it as a container. If the pull command did not work, the image would not be running as a container.
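
The second failure mode described in this comment (a manifest that exists but references a blob the registry does not have) can be sketched as a check over the manifest's digests. This assumes a Docker image manifest v2, schema 2, which carries a `config` object and a `layers` list, each entry with a `digest`. `blob_exists` is a hypothetical callable supplied by the caller; against a real registry it could be backed by a HEAD request to /v2/<name>/blobs/<digest>. This is an illustration only, not how MTC or the registry performs the check:

```python
import json


def referenced_digests(manifest_json):
    """Return the config digest followed by the layer digests
    of a schema 2 image manifest."""
    manifest = json.loads(manifest_json)
    digests = []
    config = manifest.get("config")
    if config and "digest" in config:
        digests.append(config["digest"])
    for layer in manifest.get("layers", []):
        digests.append(layer["digest"])
    return digests


def missing_blobs(manifest_json, blob_exists):
    """Return the digests for which blob_exists(digest) is false.

    blob_exists is a hypothetical callable (digest -> bool).
    """
    return [d for d in referenced_digests(manifest_json) if not blob_exists(d)]
```

Running such a check over the imagestream tags before a migration would show which tags are likely to fail and are candidates for pruning.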

Comment 15 Erik Nelson 2021-06-16 01:39:42 UTC

ew

Comment 16 Erik Nelson 2021-06-30 15:20:58 UTC

Closing this BZ against MTC. It's clear the underlying issue is related to a registry with a backing store on NFS, with a root cause that is outside the scope of MTC. To improve this, retry logic has been added to MTC to make the transfer process more resilient, so that if this error does show up, a retry may be able to resolve it transparently. Additionally, since this was filed, MTC has added direct image migration (DIM), which may have lessened the impact of this.

Please reopen with a comment if this continues to surface and there's more work to be done here, specific to MTC.
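
For illustration, retry logic of the kind mentioned above might look like the following sketch. It is not MTC's actual implementation; the function name `with_retries`, the attempt count, and the exponential backoff are all assumptions:

```python
import time


def with_retries(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or attempts are exhausted.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between tries.
    All names and defaults here are hypothetical.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as err:  # a real tool would catch only transient errors
            last_error = err
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

A transient "blob unknown to registry" response would then be retried a few times before the image is reported as failed.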

Comment 17 Red Hat Bugzilla 2023-09-15 00:30:30 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days