Description of problem:

Running `oc adm release new ...` fails with:

```
uploading: registry.svc.ci.openshift.org/ci-op-hs1rgqjk/release-scratch sha256:c267ab2e6cb9696348a0cd8deb7892c45d83ef580f1848ed1ccab6b832a967af 24.83MiB
error: unable to push quay.io/openshift-release-dev/ocp-v4.0-art-dev: failed to upload blob sha256:794e9e8f732745cd3301be325c2579b68749ab41c96cbfe9adfee0112ee83157: received unexpected HTTP status: 500 Internal Server Error
info: Mirroring completed in 2m23.49s (34.79MB/s)
```

This is ultimately caused by an error uploading the blob to GCS storage:

```
docker-registry-13-bdklf.log:time="2019-09-19T04:12:02.774422669Z" level=error msg="unknown error completing upload: EOF" go.version=go1.10.3 http.request.host=registry.svc.ci.openshift.org http.request.id=08bcccf0-4c71-4e09-b244-e525187eff59 http.request.method=PUT http.request.remoteaddr=172.16.128.1 http.request.uri="/v2/ci-op-glb491np/release-scratch/blobs/uploads/922b5d5c-f3a6-428c-818e-1b6fd4d7e252?_state=ZqsaOQcFTnKCwyc1p4OEBVXLfMRaEZn0W4x4CgyEv-R7Ik5hbWUiOiJjaS1vcC1nbGI0OTFucC9yZWxlYXNlLXNjcmF0Y2giLCJVVUlEIjoiOTIyYjVkNWMtZjNhNi00MjhjLTgxOGUtMWI2ZmQ0ZDdlMjUyIiwiT2Zmc2V0IjozMjY5MzY5NywiU3RhcnRlZEF0IjoiMjAxOS0wOS0xOVQwNDoxMTo0OVoifQ%3D%3D&digest=sha256%3Aeb56c3e87377c893574ff6b82ce5a5bfac10fb3b7c472dcbb2bcca74ee0cc5ae" http.request.useragent=Go-http-client/1.1 instance.id=14108979-2fb5-4443-b967-8c2240a4bb5c openshift.auth.user="system:serviceaccount:ci-op-glb491np:default" openshift.auth.userid=1782dc6a-da93-11e9-bd1c-42010a8e0003 vars.name=ci-op-glb491np/release-scratch vars.uuid=922b5d5c-f3a6-428c-818e-1b6fd4d7e252
```

Version-Release number of selected component (if applicable):
3.11.z (registry)
4.2.0 (oc client)

How reproducible:
Sometimes

Steps to Reproduce:
1. Run the release mirror command, as done in CI [1]

Actual results:
Sometimes flakes with the upload blob error.

Expected results:
Succeeds.
Additional info:

[1] https://github.com/openshift/release/commit/29bb995a05bf00c09add552c6ebd7116f7424d68#diff-e9ff14653190666556f490397d19c553
CC Trevor King - it may be more realistic for `oc adm release new` to be updated to retry in situations like this. There's not much the registry can do if the backend storage drops out.
We need to update all the image commands to retry a few times. There's some retry logic in mirror and some here, but in general this is a broad, sweeping problem when dealing with problematic registries.
One problem is that if you get a 500 while uploading a 1 GB file, do you *really* want to retry it a second time?
Absolutely, just resume your push [1]. [1]: https://github.com/opencontainers/distribution-spec/blob/v1.0.0-rc0/spec.md#initiate-resumable-blob-upload
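Concretely, the spec lets a client ask the registry how much of an interrupted upload survived: a GET on the upload URL returns a `Range` header like `0-32693696`, and the next PATCH chunk starts at the byte after that. A small sketch of that offset calculation (`resumeOffset` is a hypothetical helper, not part of oc; note the result matches the `Offset: 32693697` encoded in the upload `_state` in the registry log above):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// resumeOffset parses the Range header returned by
// GET /v2/<name>/blobs/uploads/<uuid> (e.g. "0-32693696") and returns
// the offset at which the next PATCH chunk should start. The range is
// inclusive, so the next byte to send is last+1.
func resumeOffset(rangeHeader string) (int64, error) {
	parts := strings.SplitN(rangeHeader, "-", 2)
	if len(parts) != 2 {
		return 0, fmt.Errorf("malformed Range header %q", rangeHeader)
	}
	last, err := strconv.ParseInt(parts[1], 10, 64)
	if err != nil {
		return 0, err
	}
	return last + 1, nil
}

func main() {
	off, err := resumeOffset("0-32693696")
	if err != nil {
		panic(err)
	}
	fmt.Println(off) // 32693697
}
```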
Clients usually send a layer as a single chunk, so uploads can't really be resumed (technically they can be, but only from the first byte).
> Clients usually send a layer as a single chunk... Which clients? Looks like containers/image doesn't support this yet [1], but I don't see why it couldn't learn about it. I don't have the Quay or OpenShift registry code handy to grep through for server-side support, but Blob-Upload-UUID is clearly part of the distribution spec. [1]: https://github.com/containers/image/search?utf8=%E2%9C%93&q=Blob-Upload-UUID&type=
All clients that I know of.

- https://github.com/docker/distribution/blob/master/registry/client/blob_writer.go (used by Docker; ReadFrom sends the entire blob and doesn't support the Content-Range header)
- https://github.com/containers/image/blob/a911b201c9edde74bfaabc85a61e85cb60466dff/docker/docker_image_dest.go#L141 (chunked upload is not supported either; also the io.Reader is not seekable, so if the last few bytes are lost on the server side they can't be re-read and re-sent)

On the server side, Distribution has problems as well due to the eventually consistent nature of S3.
Yeah, without registry/client support I'm hesitant to go down that path. Note that we *should*, it's just not a good use of time right now. I think we could easily do a time-based retry: track how long the previous attempt took, then retry if we're still below some bounded number of attempts. The point of retries is to reduce the overall probability of failure.
Hi Oleg, I can trigger the 500 internal error, but I'm not sure whether oc retries: https://privatebin-it-iso.int.open.paas.redhat.com/?dacaec7d6f1952e1#7a9UQOx7a+RxQzTZtFmKirEfduiWwWTBxS1NRVN4RKs= Is testing with just the 500 error enough? It seems hard to trigger other error codes.
Created attachment 1633599 [details] Request twice log
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062
oc PR landed in 99a4f81 [1]. Checking to see when this went out:

```
$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.3.0-rc.0 | grep cli-artifacts
  cli-artifacts https://github.com/openshift/oc 6a937dfe56ff26255d09702c69b8406040c14505
$ git log --oneline origin/release-4.3 | grep -n '6a937df\|99a4f81'
5:6a937dfe5 Merge pull request #208 from openshift-cherrypick-robot/cherry-pick-207-to-release-4.3
124:99a4f8118 Merge pull request #112 from smarterclayton/retry_error
```

So yeah, this was well before 4.3.0-rc.0. And as yet no backport to 4.2.

[1]: https://github.com/openshift/oc/pull/112#event-2686554337