
Bug 1605072

Summary: skopeo image copy fails on first attempt within Jenkins pipeline
Product: OpenShift Container Platform
Reporter: Luke Stanton <lstanton>
Component: Image Registry
Assignee: Alexey Gladkov <agladkov>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Dongbo Yan <dyan>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.6.0
CC: aos-bugs, bparees, jokerman, lstanton, mitr, mmccomas, mpatel
Target Milestone: ---
Target Release: 3.6.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1745743 (view as bug list) Environment:
Last Closed: 2018-09-27 16:31:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1745743, 1805500

Description Luke Stanton 2018-07-20 06:21:56 UTC
Description of problem:
Copying images using the skopeo tool within a Jenkins pipeline fails on first try. A retry in the pipeline fixes it. The version of skopeo is 0.1.28-dev.

How reproducible:
Consistently

Steps to Reproduce:
Take action in Jenkins to cause an image copy via skopeo...

-----
skopeo copy --screds=claims-jenkins:**** --dcreds=claims-jenkins:**** docker://registry.****.com/b****/b****:quas docker://registry.****.com/b****/b****:prod
-----

Actual results:
The following skopeo error occurs on the first copy attempt:

-----
10:12:55 Writing manifest to image destination
10:12:55 time="2018-06-05T10:12:55-04:00" level=fatal msg="Error writing manifest: Error uploading manifest to /v2/bids-prod/bids-ao-claims-draft-service/manifests/prod: manifest blob unknown: blob unknown to registry"
-----

Expected results:
skopeo copy should complete successfully the first time.
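
The retry that currently works around the failure in the pipeline can be sketched as a small shell wrapper. This is only an illustration of the workaround, not the actual pipeline code: the `retry` function, its arguments, and the `SRC_CREDS`, `DST_CREDS`, `SRC_IMAGE`, and `DST_IMAGE` variables in the usage comment are hypothetical names. The wrapper deliberately does not echo the command line, so credentials are not written to the build log.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Hypothetical helper; written without `local` so it also runs under plain sh.
retry() {
    retry_max=$1
    retry_delay=$2
    shift 2
    retry_attempt=1
    until "$@"; do
        if [ "$retry_attempt" -ge "$retry_max" ]; then
            echo "giving up after $retry_attempt attempts" >&2
            return 1
        fi
        echo "attempt $retry_attempt failed; retrying in ${retry_delay}s" >&2
        sleep "$retry_delay"
        retry_attempt=$((retry_attempt + 1))
    done
}

# Example usage (all four variables are placeholders, not values from this bug):
# retry 3 5 skopeo copy --screds="$SRC_CREDS" --dcreds="$DST_CREDS" \
#     "docker://$SRC_IMAGE" "docker://$DST_IMAGE"
```

A retry only masks the problem; the expectation remains that a single copy succeeds.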

Comment 2 Ben Parees 2018-07-20 14:53:01 UTC
Moving this to containers as they *might* own skopeo; I'm not sure. If not, hopefully they can push it in the right direction.

Comment 3 Antonio Murdaca 2018-08-01 19:34:10 UTC
Miloslav, have you ever seen something like this in skopeo?

Comment 4 Miloslav Trmač 2018-08-01 20:04:42 UTC
No, I haven’t ever seen anything like this.

What registry implementation is this? Anything else unusual, perhaps the backing layer storage?

Overall, this looks as if the layer storage were somehow unable to read a layer immediately after successfully writing it and confirming to the client that it has been written — but the layer becomes readable a few seconds later.

(The “manifest blob unknown” error is lacking a bit of detail, but the most likely explanation is that while uploading a manifest, the registry is checking whether all referenced blobs (layers, config…) exist already, and finds that they don’t.)


Looking at skopeo_-_manifestunknownlog.txt, compare the handling of layer blobs b538cc6febe635e011f69d724aa31744ad50a0caee5347221874afa25629ca51, 944b324912445e934ad17a152e23805fb75fe70e7b5bf6775d83420376fb43c9, and 96eb74fb2f1d0f1ea94247cdcc4f11dc6df79ebca46a9af12cf26c726b709c9b:

b538… is, when running the command for the first time, not present, so it is uploaded; on second invocation, it is detected as already present at the destination.

944b… is likewise not present when running the command for the first time, and uploaded; on second invocation, it is _detected as missing_, so the command starts to upload it again, and at that time re-checks the presence at the destination [which is an inefficiency in skopeo, arguably], and _then_ the server reports that the layer already exists.

And 96eb… is not present the first time, uploaded, then not present the second time (in _both_ checks), and uploaded anew.

The 944b… layer seems to rule out an overzealous GC: the client merely asks twice “do you have this layer” and is told “no” the first time; less than a second later the server changes its mind and replies “yes”.
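
To observe the flip-flopping presence checks independently of skopeo, one option is to repeat the blob existence check from the Docker Registry HTTP API V2 (`HEAD /v2/<name>/blobs/<digest>`, where 200 means present and 404 means unknown) and log each answer with a timestamp. The sketch below makes several assumptions that are not confirmed in this bug: that the registry accepts HTTP basic authentication over HTTPS (many registries require a bearer token instead), that credentials are supplied in a hypothetical `REGISTRY_CREDS` variable, and that the digests above are SHA-256 and therefore need a `sha256:` prefix. The function names are hypothetical as well.

```shell
# Build the blob URL: https://<registry>/v2/<repository>/blobs/<digest>
blob_url() {
    printf 'https://%s/v2/%s/blobs/%s\n' "$1" "$2" "$3"
}

# Print the HTTP status code of a HEAD request for the blob.
# REGISTRY_CREDS (user:password) is an assumed variable; basic auth is an assumption.
blob_status() {
    curl --silent --head --output /dev/null --write-out '%{http_code}' \
        --user "$REGISTRY_CREDS" "$(blob_url "$1" "$2" "$3")"
}

# probe_blob REGISTRY REPOSITORY DIGEST COUNT INTERVAL_SECONDS
# Repeats the check COUNT times and prints "HH:MM:SS <status>" for each attempt.
probe_blob() {
    probe_count=$4
    probe_interval=$5
    probe_i=0
    while [ "$probe_i" -lt "$probe_count" ]; do
        printf '%s %s\n' "$(date +%T)" "$(blob_status "$1" "$2" "$3")"
        sleep "$probe_interval"
        probe_i=$((probe_i + 1))
    done
}
```

Running `probe_blob` right after an upload, with the registry host, repository, and `sha256:`-prefixed digest filled in, would show whether the answer changes between 404 and 200 over a few seconds without any further writes.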

Comment 8 Ben Parees 2018-09-05 22:16:37 UTC
This was reported against 3.6. Do we know whether it occurs with any newer version?

Comment 9 Luke Stanton 2018-09-06 20:51:31 UTC
I'm not sure if it occurs in any newer versions. This is the only case that I'm aware of where the error has shown up.

Comment 10 Luke Stanton 2018-09-11 16:04:57 UTC
NFS is being used as the backing storage.