Bug 1907421 - [4.6.1]oc-image-mirror command failed on "error: unable to copy layer"
Summary: [4.6.1]oc-image-mirror command failed on "error: unable to copy layer"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Ricardo Maraschini
QA Contact: Wenjing Zheng
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-14 13:48 UTC by Hongkai Liu
Modified: 2021-02-24 15:43 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Lack of retries on an http2-protocol-related retryable error. Consequence: Mirroring was aborted with an error message. Fix: Added a retry when the error message corresponds to the http2-protocol-related error. Result: On such failures, the mirror operation is aborted only after multiple attempts.
Clone Of:
Environment:
Last Closed: 2021-02-24 15:43:41 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift oc pull 706 0 None closed Bug 1907421: Implementing retries during for image mirror 2021-02-17 18:23:11 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:43:59 UTC

Description Hongkai Liu 2020-12-14 13:48:01 UTC
Description of problem:
oc-image-mirror command failed on "error: unable to copy layer"
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/branch-ci-openshift-ci-tools-master-images/1337450242785677312#1:build-log.txt%3A381

error: unable to copy layer sha256:6f737d158d2b73972c2f217f01bb3cd0287c6ef50e2b68729b639c8ba61ae12a to registry.ci.openshift.org/ci/autopublicizeconfig: Patch "https://registry.ci.openshift.org/v2/ci/autopublicizeconfig/blobs/uploads/48c69c11-6e15-49a1-8343-0117f16b58de?_state=hx0djzzlswVpZCPaASn2B15d_9ce740d4jnM7_40wkB7Ik5hbWUiOiJjaS9hdXRvcHVibGljaXplY29uZmlnIiwiVVVJRCI6IjQ4YzY5YzExLTZlMTUtNDlhMS04MzQzLTAxMTdmMTZiNThkZSIsIk9mZnNldCI6MCwiU3RhcnRlZEF0IjoiMjAyMC0xMi0xMVQxNzo1NTo0MS40NjQzOTQ5NjRaIn0%3D": http2: Transport: cannot retry err [stream error: stream ID 19; REFUSED_STREAM] after Request.Body was written; define Request.GetBody to avoid this error


Version-Release number of selected component (if applicable):
oc --context app.ci get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.1     True        False         51d     Cluster version is 4.6.1

How reproducible:
Very likely: over 2/3 of the runs of the branch-ci-openshift-ci-tools-master-images job failed after we switched to using the registry on api.ci as the target of the oc-image-mirror command on 2020-12-11.

https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/branch-ci-openshift-ci-tools-master-images?buildId=1337613618056794112



Steps to Reproduce:
Rerun the CI job above.

Comment 2 Clayton Coleman 2020-12-16 15:40:44 UTC
We'd have to look to see whether this is the http2 implementation in golang, the registry golang http2 server implementation, or improper error handling in the go client stack.  This should be a retriable error at a minimum in the go registry client based on a reading of the message (REFUSED_STREAM implies we tried to open a new parallel stream to the registry, which is roughly equivalent to a 429 "retry later due to rate limit").  I suspect this is leaking from the underlying golang client stack through the docker registryclient stack and then up.
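For illustration only, a minimal Go sketch (not the actual oc or docker/distribution code) of the kind of retry described above: treat the http2 REFUSED_STREAM failure as retryable by matching the error text and re-attempting the layer upload a bounded number of times. The uploadLayer callback and the package name are hypothetical stand-ins.

// Hypothetical sketch: retry a layer upload when the failure looks like the
// retryable http2 error from this report. uploadLayer is a stand-in for the
// real registry-client call; it is not part of oc or docker/distribution.
package mirror

import (
	"fmt"
	"strings"
	"time"
)

func isRetryableHTTP2Err(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "http2: Transport") &&
		strings.Contains(msg, "REFUSED_STREAM")
}

func uploadWithRetry(uploadLayer func() error, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = uploadLayer(); err == nil || !isRetryableHTTP2Err(err) {
			return err
		}
		// Back off briefly before retrying the refused stream.
		time.Sleep(time.Second * time.Duration(i+1))
	}
	return fmt.Errorf("upload failed after %d attempts: %w", attempts, err)
}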

Comment 3 Ricardo Maraschini 2020-12-22 16:01:02 UTC
IIUC this happens because we can't retry, as the body has been partially read. The error asks us to define GetBody on the request so the transport can get a fresh copy of the body for a proper retry. I will work on two fronts here: try to make a fix upstream [1] and work on a retry mechanism in oc.

[1] https://github.com/docker/distribution/blob/master/registry/client/blob_writer.go#L39
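As a hedged illustration of the GetBody hint in the error message (this is a sketch, not the actual docker/distribution patch): when the payload is available as a byte slice, the client can build the request so Request.GetBody is populated, which lets the Go http2 transport replay the body and retry after a REFUSED_STREAM.

// Sketch only: build a PATCH request whose body can be replayed by the
// http2 transport. With a plain io.Reader, GetBody is nil and the transport
// cannot retry once the body has been written, which matches the reported error.
package mirror

import (
	"bytes"
	"context"
	"io"
	"net/http"
)

func newReplayablePatch(ctx context.Context, url string, chunk []byte) (*http.Request, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPatch, url, bytes.NewReader(chunk))
	if err != nil {
		return nil, err
	}
	// http.NewRequest already sets GetBody for *bytes.Reader bodies; setting it
	// explicitly here just makes the replay requirement obvious.
	req.GetBody = func() (io.ReadCloser, error) {
		return io.NopCloser(bytes.NewReader(chunk)), nil
	}
	return req, nil
}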

Comment 6 Ricardo Maraschini 2021-01-05 15:55:50 UTC
I have been trying to replicate this behavior without success. I managed to replicate it through a Go http server by artificially lowering the number of http2 streams allowed, but not by issuing "oc image mirror" commands. I tried both 4.6.1 and 4.6.7. What I have done is:

1) Created randomly generated images filled with "garbage" (200 different images of 100MB each)
2) Hosted these images in a registry deployed externally to the cluster
3) Spawned 4 pods simultaneously within the cluster, mirroring these images (50 images per pod, exactly as in the provided example) from the remote registry into the internal registry through the external route
4) Rinsed and repeated multiple times with different sets of 200 images
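For reference, a minimal sketch, under my own assumptions, of the "artificially lowering the number of http2 streams" server mentioned at the top of this comment, using golang.org/x/net/http2 with h2c; it is not the exact test server used here.

// Sketch: an h2c server that advertises only one concurrent stream, so a
// client that opens more streams than advertised can be answered with
// REFUSED_STREAM (per RFC 7540 the server MAY refuse the excess stream).
package main

import (
	"log"
	"net/http"
	"time"

	"golang.org/x/net/http2"
	"golang.org/x/net/http2/h2c"
)

func main() {
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(2 * time.Second) // hold the stream open to force contention
		w.WriteHeader(http.StatusAccepted)
	})
	h2s := &http2.Server{MaxConcurrentStreams: 1}
	srv := &http.Server{
		Addr:    ":8080",
		Handler: h2c.NewHandler(handler, h2s),
	}
	log.Fatal(srv.ListenAndServe())
}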

So far, everything is working. Could you provide me with more input?

1) How many simultaneous mirror commands were you running (pods running those 50 mirrors)?
2) Are there any resource limits in place on the image registry pod?

Comment 7 Hongkai Liu 2021-01-05 18:51:07 UTC
> 1) How many simultaneous mirror commands were you running (pods running those 50 mirrors)?
The oc-image-mirror commands are combined with `&&` (see comment 1). They are executed in sequence, aren't they?

> 2) Are there any resource limits in place on the image registry pod?

No limits, only requests:

$ oc --context app.ci get pod -n openshift-image-registry image-registry-66c68574bc-gjcf5 -o yaml | yq -y '.spec.containers[].resources'
requests:
  cpu: 100m
  memory: 256Mi

Comment 8 Ricardo Maraschini 2021-01-06 08:31:41 UTC
What I asked is how many of these pods were running simultaneously. That would help me understand how much I need to stretch my tests to cause a failure.

Comment 9 Hongkai Liu 2021-01-06 17:01:41 UTC
> What I asked is how many of these pods were running simultaneously?

https://github.com/openshift/release/blob/master/ci-operator/jobs/openshift/ci-tools/openshift-ci-tools-master-postsubmits.yaml#L13
One pod at a time for this particular job.
We hit this many times for this job.

We could also have other promotion jobs (promotion is implemented by oc-image-mirror) for other repos running at the same time.
We have not received any complaints from other jobs.

From what I see, this promotion job has way more oc-image-mirror commands (over 40) than others (1 or 2 mostly).

Comment 10 Ricardo Maraschini 2021-01-07 10:14:23 UTC
I have spent the last two days attempting to reproduce this, but I could not. I have increased the load specified in Comment 6 by a factor of 3 (mirroring 600 images in 12 pods, 50 images each, or in 24 pods, 25 images each), and I am still not able to hit this error myself. I think the load on our CI cluster registry might be much higher, and as I can't reproduce the failure I can't validate a pseudo-fix I have in mind.

Would it be easier for you to reproduce it? If so, could you give my pseudo-fix a try? I have prepared a branch [1] with the fix; you would need to compile the oc binary and use it. As this fix needs to go upstream, you need to compile oc using the vendored packages (-mod vendor), as the patch [2] is part of them.

I am not even sure this fix is an acceptable one, but I would like to check whether it fixes the issue before thinking about different approaches.


[1] https://github.com/ricardomaraschini/oc/tree/cache-to-local-file
[2] https://github.com/ricardomaraschini/oc/commit/a818c6f9fcf165a7020714411250423395e466c9
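Purely as an illustration of what a "cache the body so it can be replayed" approach could look like (my own sketch under stated assumptions, not the linked patch): spool the blob stream to a temporary file and expose it through GetBody so the transport can retry.

// Hypothetical sketch of a "cache to local file" body: the incoming blob
// stream is spooled to a temp file, and GetBody reopens that file so the
// http2 transport can replay the request on a retryable failure.
// ContentLength handling and temp-file cleanup are omitted for brevity.
package mirror

import (
	"io"
	"net/http"
	"os"
)

func cacheBodyToFile(req *http.Request, body io.Reader) error {
	tmp, err := os.CreateTemp("", "blob-upload-*")
	if err != nil {
		return err
	}
	if _, err := io.Copy(tmp, body); err != nil {
		tmp.Close()
		os.Remove(tmp.Name())
		return err
	}
	if _, err := tmp.Seek(0, io.SeekStart); err != nil {
		return err
	}
	req.Body = tmp
	req.GetBody = func() (io.ReadCloser, error) {
		// Reopen the cached file from the start for each retry attempt.
		return os.Open(tmp.Name())
	}
	return nil
}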

Comment 18 errata-xmlrpc 2021-02-24 15:43:41 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

