Bug 1754042 - [gcp] `oc adm release new` fails if upload to GCS storage fails
Summary: [gcp] `oc adm release new` fails if upload to GCS storage fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.3.0
Assignee: Clayton Coleman
QA Contact: Wenjing Zheng
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-09-20 15:59 UTC by Adam Kaplan
Modified: 2020-03-06 09:22 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:06:39 UTC
Target Upstream Version:
Embargoed:


Attachments
Request twice log (251.05 KB, text/plain), 2019-11-07 09:51 UTC, Wenjing Zheng


Links
Github openshift/library-go PR 534 (closed): Bug 1754042: Detect more retriable errors from general registry clients (last updated 2020-09-29 18:01:10 UTC)
Github openshift/oc PR 112 (closed): Bug 1754042: Retry 429 and 503 errors in registry client (last updated 2020-09-29 18:01:10 UTC)
Red Hat Product Errata RHBA-2020:0062 (last updated 2020-01-23 11:07:04 UTC)

Description Adam Kaplan 2019-09-20 15:59:45 UTC
Description of problem:
Running `oc adm release new...` fails with error
```
uploading: registry.svc.ci.openshift.org/ci-op-hs1rgqjk/release-scratch sha256:c267ab2e6cb9696348a0cd8deb7892c45d83ef580f1848ed1ccab6b832a967af 24.83MiB
error: unable to push quay.io/openshift-release-dev/ocp-v4.0-art-dev: failed to upload blob sha256:794e9e8f732745cd3301be325c2579b68749ab41c96cbfe9adfee0112ee83157: received unexpected HTTP status: 500 Internal Server Error
info: Mirroring completed in 2m23.49s (34.79MB/s)
```

This is ultimately caused by an error uploading the blob to GCS storage:

```
docker-registry-13-bdklf.log:time="2019-09-19T04:12:02.774422669Z" level=error msg="unknown error completing upload: EOF" go.version=go1.10.3 http.request.host=registry.svc.ci.openshift.org http.request.id=08bcccf0-4c71-4e09-b244-e525187eff59 http.request.method=PUT http.request.remoteaddr=172.16.128.1 http.request.uri="/v2/ci-op-glb491np/release-scratch/blobs/uploads/922b5d5c-f3a6-428c-818e-1b6fd4d7e252?_state=ZqsaOQcFTnKCwyc1p4OEBVXLfMRaEZn0W4x4CgyEv-R7Ik5hbWUiOiJjaS1vcC1nbGI0OTFucC9yZWxlYXNlLXNjcmF0Y2giLCJVVUlEIjoiOTIyYjVkNWMtZjNhNi00MjhjLTgxOGUtMWI2ZmQ0ZDdlMjUyIiwiT2Zmc2V0IjozMjY5MzY5NywiU3RhcnRlZEF0IjoiMjAxOS0wOS0xOVQwNDoxMTo0OVoifQ%3D%3D&digest=sha256%3Aeb56c3e87377c893574ff6b82ce5a5bfac10fb3b7c472dcbb2bcca74ee0cc5ae" http.request.useragent=Go-http-client/1.1 instance.id=14108979-2fb5-4443-b967-8c2240a4bb5c openshift.auth.user="system:serviceaccount:ci-op-glb491np:default" openshift.auth.userid=1782dc6a-da93-11e9-bd1c-42010a8e0003 vars.name=ci-op-glb491np/release-scratch vars.uuid=922b5d5c-f3a6-428c-818e-1b6fd4d7e252
```

Version-Release number of selected component (if applicable): 3.11.z (registry)
4.2.0 (oc client)


How reproducible: Sometimes


Steps to Reproduce:
1. Run release mirror command, as done in CI [1]

Actual results:

The mirror sometimes flakes with a blob upload error.

Expected results:

The release mirror succeeds.


Additional info:

[1] https://github.com/openshift/release/commit/29bb995a05bf00c09add552c6ebd7116f7424d68#diff-e9ff14653190666556f490397d19c553

Comment 1 Adam Kaplan 2019-09-20 16:03:36 UTC
CC Trevor King - it may be more realistic for `oc adm release new` to be updated to retry in situations like this. There's not much the registry can do if the backend storage drops out.

Comment 2 Clayton Coleman 2019-09-20 17:30:10 UTC
We need to update all the image commands to retry a few times. There is some retry logic in mirror and some here, but in general this is a broad, sweeping problem when dealing with problematic registries.

Comment 3 Clayton Coleman 2019-09-20 17:30:33 UTC
One problem is that if you get a 500 uploading a 1 GB file, do you *really* want to retry that a second time?

Comment 4 W. Trevor King 2019-09-20 17:35:14 UTC
Absolutely, just resume your push [1].

[1]: https://github.com/opencontainers/distribution-spec/blob/v1.0.0-rc0/spec.md#initiate-resumable-blob-upload
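
For reference, the resumable flow in the spec linked above is: POST to /v2/<name>/blobs/uploads/ to open a session, PATCH the data in chunks with a Content-Range header, then PUT with the digest to finalize; a failed chunk can be re-sent against the same session instead of restarting the whole blob. The following Go sketch of that flow is illustrative only (function name, chunk size, and error handling are made up here, not code from oc or containers/image):

```
package blobpush

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// pushBlobChunked opens an upload session, PATCHes the blob in chunks, and
// finalizes with a PUT carrying the digest. Because each chunk is addressed
// by a Content-Range against the same session, a failed chunk can be
// re-sent without restarting from byte zero.
func pushBlobChunked(registry, repo, digest string, blob io.ReaderAt, size int64) error {
	// 1. Initiate: POST /v2/<name>/blobs/uploads/ returns 202 plus a
	// Location header identifying the upload session.
	resp, err := http.Post(fmt.Sprintf("%s/v2/%s/blobs/uploads/", registry, repo), "", nil)
	if err != nil {
		return err
	}
	resp.Body.Close()
	location := resp.Header.Get("Location") // real clients resolve this relative to the registry URL

	const chunkSize = 8 << 20 // 8 MiB per PATCH; the size is arbitrary
	buf := make([]byte, chunkSize)
	for off := int64(0); off < size; {
		n, err := blob.ReadAt(buf, off)
		if err != nil && err != io.EOF {
			return err
		}
		// 2. Upload one chunk: PATCH <location> with Content-Range.
		req, _ := http.NewRequest(http.MethodPatch, location, bytes.NewReader(buf[:n]))
		req.Header.Set("Content-Type", "application/octet-stream")
		req.Header.Set("Content-Range", fmt.Sprintf("%d-%d", off, off+int64(n)-1))
		req.ContentLength = int64(n)
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return err
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusAccepted {
			// A retrying client would re-send this same range against the
			// same session instead of failing the whole push.
			return fmt.Errorf("chunk at offset %d: %s", off, resp.Status)
		}
		location = resp.Header.Get("Location") // the registry may rotate the session URL
		off += int64(n)
	}

	// 3. Finalize: PUT <location>?digest=<digest> with an empty body.
	u, err := url.Parse(location)
	if err != nil {
		return err
	}
	q := u.Query()
	q.Set("digest", digest)
	u.RawQuery = q.Encode()
	req, _ := http.NewRequest(http.MethodPut, u.String(), nil)
	resp, err = http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	resp.Body.Close()
	if resp.StatusCode != http.StatusCreated {
		return fmt.Errorf("finalize: %s", resp.Status)
	}
	return nil
}
```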

Comment 5 Oleg Bulatov 2019-09-22 20:37:29 UTC
Clients usually send a layer as a single chunk, so uploads can't really be resumed (technically they can be resumed, but only from the first byte).

Comment 6 W. Trevor King 2019-09-23 16:33:21 UTC
> Clients usually send a layer as a single chunk...

Which clients?  Looks like containers/image doesn't support this yet [1], but I don't see why it couldn't learn about it.  I don't have the Quay or OpenShift registry code handy to grep through for server-side support, but Blob-Upload-UUID is clearly part of the distribution spec.

[1]: https://github.com/containers/image/search?utf8=%E2%9C%93&q=Blob-Upload-UUID&type=

Comment 7 Oleg Bulatov 2019-09-24 10:10:09 UTC
All clients that I know of.

https://github.com/docker/distribution/blob/master/registry/client/blob_writer.go (this is used by Docker; ReadFrom sends the entire blob and it doesn't support the Content-Range header)
https://github.com/containers/image/blob/a911b201c9edde74bfaabc85a61e85cb60466dff/docker/docker_image_dest.go#L141 (chunked upload is not supported either; also, io.Reader is not seekable, so if the last bytes are lost on the server side, they can't be reread and resent)

On the server side Distribution has problems as well due to the eventually consistent nature of S3.
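
To make the io.Reader constraint concrete: because the destination code linked above hands the registry client a one-shot stream, a failed PUT cannot simply be replayed; a retrying client would need a rewindable source (or a buffered copy of the blob) first. This is a hypothetical sketch of that requirement, not code from containers/image:

```
package retrybody

import (
	"fmt"
	"io"
	"net/http"
)

// putWithRetry re-sends the whole blob on a retriable failure. It demands
// an io.ReadSeeker so every attempt can Seek back to the start; with the
// bare io.Reader that containers/image passes today, a second attempt
// would see an already-drained stream and upload a truncated blob.
func putWithRetry(uploadURL string, body io.ReadSeeker, size int64, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if _, err := body.Seek(0, io.SeekStart); err != nil {
			return err
		}
		req, err := http.NewRequest(http.MethodPut, uploadURL, body)
		if err != nil {
			return err
		}
		req.ContentLength = size
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			lastErr = err
			continue
		}
		resp.Body.Close()
		if resp.StatusCode < 300 {
			return nil
		}
		if resp.StatusCode < 500 && resp.StatusCode != http.StatusTooManyRequests {
			return fmt.Errorf("non-retriable: %s", resp.Status) // a 4xx will not improve on retry
		}
		lastErr = fmt.Errorf("attempt %d: %s", i+1, resp.Status)
	}
	return lastErr
}
```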

Comment 8 Clayton Coleman 2019-09-24 23:09:05 UTC
Yeah, without registry/client support I'm hesitant to go down that path. Note that we *should*, it's just not really a good use of time right now. I think we could easily do a time-based retry (track how long the previous attempt took, then retry if it's below some bound). The point of retries is to reduce the overall probability of failure.
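
The change that eventually merged (the library-go and oc PRs linked above) takes roughly this route: re-run the request on throttling and transient server errors rather than resuming the upload. The following is a hypothetical sketch of such a bounded, time-aware policy; the names and thresholds are illustrative, not the actual oc implementation:

```
package retrypolicy

import (
	"net/http"
	"time"
)

// retriable reports whether an upload that failed with the given HTTP
// status is worth re-running: throttling (429), temporary unavailability
// (503), and generic server errors (500) are transient, while 4xx errors
// are not.
func retriable(status int) bool {
	switch status {
	case http.StatusTooManyRequests, http.StatusServiceUnavailable, http.StatusInternalServerError:
		return true
	}
	return false
}

// withRetry runs push up to maxAttempts times, but stops retrying once a
// single attempt has cost more than maxAttemptCost: the "if you got a 500
// uploading a 1 GB file, do you really want to do it again?" guard.
func withRetry(push func() (int, error), maxAttempts int, maxAttemptCost time.Duration) error {
	var lastErr error
	for i := 0; i < maxAttempts; i++ {
		start := time.Now()
		status, err := push()
		if err == nil {
			return nil
		}
		lastErr = err
		if !retriable(status) || time.Since(start) > maxAttemptCost {
			return lastErr
		}
		time.Sleep(time.Duration(i+1) * time.Second) // simple linear backoff
	}
	return lastErr
}
```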

Comment 11 Wenjing Zheng 2019-10-25 09:55:42 UTC
Hi Oleg, I can get a 500 internal error but am not sure whether oc retries: https://privatebin-it-iso.int.open.paas.redhat.com/?dacaec7d6f1952e1#7a9UQOx7a+RxQzTZtFmKirEfduiWwWTBxS1NRVN4RKs=

Is verifying against a 500 error alone enough? It seems hard to trigger the other error codes.
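
One way to exercise the retry path without waiting for the backend storage to flake is to put a small fault-injecting proxy in front of a registry and point the mirror command at it. This is only a suggestion for reproducing the 500; the upstream address and the failure rule below are arbitrary:

```
// Minimal fault-injecting proxy: it fails the final blob PUT of each upload
// session exactly once with a 500 and proxies everything else to the real
// registry, so a retrying client should succeed on the second attempt.
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
	"sync"
)

func main() {
	upstream, _ := url.Parse("http://localhost:5000") // the real registry
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	var mu sync.Mutex
	failed := map[string]bool{} // upload paths we have already failed once

	http.ListenAndServe(":5001", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method == http.MethodPut && strings.Contains(r.URL.Path, "/blobs/uploads/") {
			mu.Lock()
			seen := failed[r.URL.Path]
			failed[r.URL.Path] = true
			mu.Unlock()
			if !seen {
				http.Error(w, "injected failure", http.StatusInternalServerError)
				return
			}
		}
		proxy.ServeHTTP(w, r)
	}))
}
```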

Comment 18 Wenjing Zheng 2019-11-07 09:51:15 UTC
Created attachment 1633599 [details]
Request twice log

Comment 20 errata-xmlrpc 2020-01-23 11:06:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

Comment 21 W. Trevor King 2020-02-06 20:20:06 UTC
oc PR landed in 99a4f81 [1].  Checking to see when this went out:

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.3.0-rc.0 | grep cli-artifacts
  cli-artifacts                                 https://github.com/openshift/oc                                            6a937dfe56ff26255d09702c69b8406040c14505
$ git log --oneline origin/release-4.3 | grep -n '6a937df\|99a4f81'
5:6a937dfe5 Merge pull request #208 from openshift-cherrypick-robot/cherry-pick-207-to-release-4.3
124:99a4f8118 Merge pull request #112 from smarterclayton/retry_error

So yeah, this was well before 4.3.0-rc.0. There is as yet no backport to 4.2.

[1]: https://github.com/openshift/oc/pull/112#event-2686554337

