Description of problem:

The oc-image-mirror command failed with "error: unable to copy layer":

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/branch-ci-openshift-ci-tools-master-images/1337450242785677312#1:build-log.txt%3A381

error: unable to copy layer sha256:6f737d158d2b73972c2f217f01bb3cd0287c6ef50e2b68729b639c8ba61ae12a to registry.ci.openshift.org/ci/autopublicizeconfig: Patch "https://registry.ci.openshift.org/v2/ci/autopublicizeconfig/blobs/uploads/48c69c11-6e15-49a1-8343-0117f16b58de?_state=hx0djzzlswVpZCPaASn2B15d_9ce740d4jnM7_40wkB7Ik5hbWUiOiJjaS9hdXRvcHVibGljaXplY29uZmlnIiwiVVVJRCI6IjQ4YzY5YzExLTZlMTUtNDlhMS04MzQzLTAxMTdmMTZiNThkZSIsIk9mZnNldCI6MCwiU3RhcnRlZEF0IjoiMjAyMC0xMi0xMVQxNzo1NTo0MS40NjQzOTQ5NjRaIn0%3D": http2: Transport: cannot retry err [stream error: stream ID 19; REFUSED_STREAM] after Request.Body was written; define Request.GetBody to avoid this error

Version-Release number of selected component (if applicable):

$ oc --context app.ci get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.1     True        False         51d     Cluster version is 4.6.1

How reproducible:

Very likely: over 2/3 of runs of the branch-ci-openshift-ci-tools-master-images job failed after we switched to using the registry on api.ci as the target of the oc-image-mirror command on 2020-12-11.

https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/branch-ci-openshift-ci-tools-master-images?buildId=1337613618056794112

Steps to Reproduce:

1. Rerun the CI job above.
We'd have to look to see whether this is the http2 implementation in golang, the registry golang http2 server implementation, or improper error handling in the go client stack. This should be a retriable error at a minimum in the go registry client based on a reading of the message (REFUSED_STREAM implies we tried to open a new parallel stream to the registry, which is roughly equivalent to a 429 "retry later due to rate limit"). I suspect this is leaking from the underlying golang client stack through the docker registryclient stack and then up.
IIUC this happens because we can't retry, as the body has already been partially written. The error asks us to define GetBody on the request so the transport can get a fresh copy of the body for a proper retry. I will work on two fronts here: try to make a fix upstream [1], and add a retry mechanism in oc.

[1] https://github.com/docker/distribution/blob/master/registry/client/blob_writer.go#L39
I have been trying to replicate this behavior, without success. I managed to replicate it through a Go HTTP server by artificially lowering the number of allowed HTTP/2 streams, but not by issuing "oc image mirror" commands. I tried on both 4.6.1 and 4.6.7. What I have done is:

1) Created randomly generated images filled with "garbage" (200 different images of 100 MB each)
2) Hosted these images in a registry deployed externally to the cluster
3) Spawned 4 pods simultaneously within the cluster, mirroring these images (50 images per pod, exactly as in the provided example) from the remote registry into the internal registry through the external route
4) Rinsed and repeated multiple times with a different set of 200 images each time

So far everything works. Could you provide me with more input?

1) How many simultaneous mirror commands were you running (pods running those 50 mirrors)?
2) Are there any resource limits in place on the image registry pod?
> 1) How many simultaneous mirror commands were you running (pods running those 50 mirrors)?

The oc-image-mirror commands are combined with `&&` (see comment 1). They are executed in sequence, aren't they?

> 2) Are there any resource limits in place on the image registry pod?

No limits, only requests:

$ oc --context app.ci get pod -n openshift-image-registry image-registry-66c68574bc-gjcf5 -o yaml | yq -y '.spec.containers[].resources'
requests:
  cpu: 100m
  memory: 256Mi
What I asked is how many of these pods were running simultaneously. That will help me understand how far I need to stretch my tests to cause a failure.
> What I asked is how many of these pods were running simultaneously?

https://github.com/openshift/release/blob/master/ci-operator/jobs/openshift/ci-tools/openshift-ci-tools-master-postsubmits.yaml#L13

One pod at a time for this particular job, and we hit this many times for this job. We could also have had other promotion jobs (promotion is implemented by oc-image-mirror) running for other repos at the same time, but we have not received any complaints from other jobs. From what I see, this promotion job runs far more oc-image-mirror commands (over 40) than the others (mostly 1 or 2).
I have spent the last two days attempting to reproduce this, but I could not. I increased the load specified in comment 6 by a factor of 3 (mirroring 600 images in 12 pods with 50 images each, or in 24 pods with 25 images each), and I am still not able to hit this error myself. I think the load on our CI cluster registry might be much higher, and as I can't reproduce the failure I can't validate a pseudo-fix I have in mind.

Would it be easier for you to reproduce it? If so, could you give my pseudo-fix a try? I have prepared a branch [1] with the fix; you would need to compile the oc binary and use it. As this fix needs to go upstream, you need to compile oc leveraging the vendored packages (-mod vendor), as the patch [2] is part of them. I am not even sure this fix is an acceptable one, but I would like to check whether it fixes the issue before thinking about different approaches.

[1] https://github.com/ricardomaraschini/oc/tree/cache-to-local-file
[2] https://github.com/ricardomaraschini/oc/commit/a818c6f9fcf165a7020714411250423395e466c9
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633