Bug 1754042

Summary: [gcp] `oc adm release new` fails if upload to GCS storage fails
Product: OpenShift Container Platform
Reporter: Adam Kaplan <adam.kaplan>
Component: Image Registry
Assignee: Clayton Coleman <ccoleman>
Status: CLOSED ERRATA
QA Contact: Wenjing Zheng <wzheng>
Severity: low
Priority: unspecified
Version: 4.3.0
CC: aos-bugs, ccoleman, hwmwm, obulatov, wking
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-01-23 11:06:39 UTC
Type: Bug
Attachments:
Request twice log

Description Adam Kaplan 2019-09-20 15:59:45 UTC
Description of problem:
Running `oc adm release new ...` fails with an error:
```
uploading: registry.svc.ci.openshift.org/ci-op-hs1rgqjk/release-scratch sha256:c267ab2e6cb9696348a0cd8deb7892c45d83ef580f1848ed1ccab6b832a967af 24.83MiB
error: unable to push quay.io/openshift-release-dev/ocp-v4.0-art-dev: failed to upload blob sha256:794e9e8f732745cd3301be325c2579b68749ab41c96cbfe9adfee0112ee83157: received unexpected HTTP status: 500 Internal Server Error
info: Mirroring completed in 2m23.49s (34.79MB/s)
```

This is ultimately caused by an error uploading the blob to GCS storage:

```
docker-registry-13-bdklf.log:time="2019-09-19T04:12:02.774422669Z" level=error msg="unknown error completing upload: EOF" go.version=go1.10.3 http.request.host=registry.svc.ci.openshift.org http.request.id=08bcccf0-4c71-4e09-b244-e525187eff59 http.request.method=PUT http.request.remoteaddr=172.16.128.1 http.request.uri="/v2/ci-op-glb491np/release-scratch/blobs/uploads/922b5d5c-f3a6-428c-818e-1b6fd4d7e252?_state=ZqsaOQcFTnKCwyc1p4OEBVXLfMRaEZn0W4x4CgyEv-R7Ik5hbWUiOiJjaS1vcC1nbGI0OTFucC9yZWxlYXNlLXNjcmF0Y2giLCJVVUlEIjoiOTIyYjVkNWMtZjNhNi00MjhjLTgxOGUtMWI2ZmQ0ZDdlMjUyIiwiT2Zmc2V0IjozMjY5MzY5NywiU3RhcnRlZEF0IjoiMjAxOS0wOS0xOVQwNDoxMTo0OVoifQ%3D%3D&digest=sha256%3Aeb56c3e87377c893574ff6b82ce5a5bfac10fb3b7c472dcbb2bcca74ee0cc5ae" http.request.useragent=Go-http-client/1.1 instance.id=14108979-2fb5-4443-b967-8c2240a4bb5c openshift.auth.user="system:serviceaccount:ci-op-glb491np:default" openshift.auth.userid=1782dc6a-da93-11e9-bd1c-42010a8e0003 vars.name=ci-op-glb491np/release-scratch vars.uuid=922b5d5c-f3a6-428c-818e-1b6fd4d7e252
```

Version-Release number of selected component (if applicable):
3.11.z (registry)
4.2.0 (oc client)


How reproducible: Sometimes


Steps to Reproduce:
1. Run the release mirror command, as done in CI [1]

Actual results:

The mirror sometimes flakes with a blob upload error.

Expected results:

Succeeds.


Additional info:

[1] https://github.com/openshift/release/commit/29bb995a05bf00c09add552c6ebd7116f7424d68#diff-e9ff14653190666556f490397d19c553

Comment 1 Adam Kaplan 2019-09-20 16:03:36 UTC
CC Trevor King - it may be more realistic for `oc adm release new` to be updated to retry in situations like this. There's not much the registry can do if the backend storage drops out.

Comment 2 Clayton Coleman 2019-09-20 17:30:10 UTC
We need to update all the image commands to retry a few times. There's some retry logic in mirror and some here, but in general this is a broad, sweeping problem when dealing with problematic registries.

Comment 3 Clayton Coleman 2019-09-20 17:30:33 UTC
One problem is that if you get a 500 uploading a 1 GB file, do you *really* want to retry that a second time?

Comment 4 W. Trevor King 2019-09-20 17:35:14 UTC
Absolutely, just resume your push [1].

[1]: https://github.com/opencontainers/distribution-spec/blob/v1.0.0-rc0/spec.md#initiate-resumable-blob-upload
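
For reference, a minimal Go sketch of the resumable flow that spec describes: open an upload session with a POST, PATCH chunks with Content-Range, and after a dropped chunk GET the session to learn the committed offset before resuming. The function and its parameters are illustrative, not an existing oc or containers/image API; error handling is abbreviated and the Location header is assumed to be an absolute URL.

```
// Sketch of the resumable blob upload flow from the distribution spec.
// `registry` is an absolute base URL, `repo` the repository, `data` the
// blob, and `digest` its sha256 digest. Illustrative names throughout.
package blobpush

import (
	"bytes"
	"fmt"
	"net/http"
)

const chunkSize = 10 << 20 // 10 MiB per PATCH

func pushResumable(registry, repo string, data []byte, digest string) error {
	client := &http.Client{}

	// Open an upload session: the registry answers 202 Accepted with a
	// Location header (and Blob-Upload-UUID) identifying the session.
	resp, err := client.Post(registry+"/v2/"+repo+"/blobs/uploads/", "", nil)
	if err != nil {
		return err
	}
	resp.Body.Close()
	location := resp.Header.Get("Location")

	for offset := 0; offset < len(data); {
		end := offset + chunkSize
		if end > len(data) {
			end = len(data)
		}
		req, _ := http.NewRequest(http.MethodPatch, location, bytes.NewReader(data[offset:end]))
		req.Header.Set("Content-Type", "application/octet-stream")
		req.Header.Set("Content-Range", fmt.Sprintf("%d-%d", offset, end-1))
		resp, err := client.Do(req)
		if err == nil {
			resp.Body.Close()
		}
		if err != nil || resp.StatusCode != http.StatusAccepted {
			// The chunk was dropped. Ask the registry how far it got
			// (204 No Content with "Range: 0-<last committed byte>")
			// and resume from there instead of re-sending everything.
			status, serr := client.Get(location)
			if serr != nil {
				return serr
			}
			status.Body.Close()
			var committed int
			fmt.Sscanf(status.Header.Get("Range"), "0-%d", &committed)
			offset = committed + 1 // the Range header is inclusive
			continue
		}
		location = resp.Header.Get("Location") // the session URL may rotate
		offset = end
	}

	// Close the session with a zero-length PUT carrying the digest,
	// assuming the session URL already has a query string (as in the
	// log in the description above).
	req, _ := http.NewRequest(http.MethodPut, location+"&digest="+digest, nil)
	if resp, err = client.Do(req); err != nil {
		return err
	}
	resp.Body.Close()
	return nil
}
```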

Comment 5 Oleg Bulatov 2019-09-22 20:37:29 UTC
Clients usually send a layer as a single chunk, so uploads can't really be resumed (technically they can be resumed, but only from the first byte).

Comment 6 W. Trevor King 2019-09-23 16:33:21 UTC
> Clients usually send a layer as a single chunk...

Which clients?  Looks like containers/image doesn't support this yet [1], but I don't see why it couldn't learn about it.  I don't have the Quay or OpenShift registry code handy to grep through for server-side support, but Blob-Upload-UUID is clearly part of the distribution spec.

[1]: https://github.com/containers/image/search?utf8=%E2%9C%93&q=Blob-Upload-UUID&type=

Comment 7 Oleg Bulatov 2019-09-24 10:10:09 UTC
All clients that I know of.

https://github.com/docker/distribution/blob/master/registry/client/blob_writer.go (this is used by Docker; ReadFrom sends the entire blob, and it doesn't support the Content-Range header)
https://github.com/containers/image/blob/a911b201c9edde74bfaabc85a61e85cb60466dff/docker/docker_image_dest.go#L141 (chunked upload is not supported either; also, the io.Reader is not seekable, so if the last few bytes are lost on the server side, they can't be re-read and re-sent)

On the server side Distribution has problems as well due to the eventually consistent nature of S3.
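
To make that limitation concrete: resuming means repositioning the blob source at the registry's committed offset, which a bare io.Reader (as PutBlob receives in containers/image) cannot do. A hypothetical helper, not an actual containers/image API:

```
// Hypothetical sketch of why a plain io.Reader can't back a resume:
// bytes already consumed from the stream cannot be produced again, so
// only a seekable source can rewind to the server's committed offset.
package blobpush

import (
	"errors"
	"io"
)

func resumeFrom(src io.Reader, committed int64) (io.Reader, error) {
	seeker, ok := src.(io.ReadSeeker)
	if !ok {
		// The containers/image case: PutBlob gets a plain io.Reader,
		// so the only "resume" available is a restart from byte zero.
		return nil, errors.New("blob source is not seekable; restart from offset 0")
	}
	if _, err := seeker.Seek(committed, io.SeekStart); err != nil {
		return nil, err
	}
	return seeker, nil
}
```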

Comment 8 Clayton Coleman 2019-09-24 23:09:05 UTC
Yeah, without registry/client support I'm hesitant to go down that path. Note that we *should*, it's just not really a good use of time. I think we could easily do a time-based retry (track how long the previous attempt took, then retry if we're still below some bounded number of attempts). The point of retries is to reduce the overall probability of failure.
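
A sketch of that idea, not the patch that eventually landed in openshift/oc#112: cap the attempt count, and use the previous attempt's duration to decide whether another try still fits a time budget. The names and the budget scheme are illustrative.

```
// Sketch of a time-bounded retry: track how long the previous attempt
// took and only retry if a similar attempt would still fit inside the
// overall budget. Illustrative only; not the code that landed in oc.
package blobpush

import (
	"fmt"
	"time"
)

func uploadWithRetry(upload func() error, maxAttempts int, budget time.Duration) error {
	deadline := time.Now().Add(budget)
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		start := time.Now()
		if err = upload(); err == nil {
			return nil
		}
		// If another attempt of roughly the same duration would blow
		// the budget, fail now rather than retrying a huge upload.
		if took := time.Since(start); time.Now().Add(took).After(deadline) {
			break
		}
	}
	return fmt.Errorf("upload did not succeed within the retry budget: %w", err)
}
```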

Comment 11 Wenjing Zheng 2019-10-25 09:55:42 UTC
Hi Oleg, I can get a 500 internal error, but I'm not sure whether oc retries: https://privatebin-it-iso.int.open.paas.redhat.com/?dacaec7d6f1952e1#7a9UQOx7a+RxQzTZtFmKirEfduiWwWTBxS1NRVN4RKs=

Is verifying with just a 500 error enough? It seems hard to trigger other error codes.
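
One way to exercise this without waiting for a real registry flake is a local test server that injects a single 500: a retrying client should leave exactly two requests in the log. A generic sketch, not the actual oc test; `pushBlob` is a stand-in for the client under test.

```
// Generic sketch of verifying retry-on-500 with a local test server;
// pushBlob stands in for the client under test and is not oc code.
package blobpush

import (
	"errors"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"testing"
)

// pushBlob is a stand-in client that retries once after a 5xx response.
func pushBlob(url string) error {
	for attempt := 0; attempt < 2; attempt++ {
		resp, err := http.Post(url+"/v2/test/blobs/uploads/", "application/octet-stream", nil)
		if err != nil {
			return err
		}
		resp.Body.Close()
		if resp.StatusCode < 500 {
			return nil
		}
	}
	return errors.New("upload kept failing with a server error")
}

func TestRetriesOn500(t *testing.T) {
	var hits int64
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Fail the first request with the error from this bug report,
		// then let every later request succeed.
		if atomic.AddInt64(&hits, 1) == 1 {
			http.Error(w, "injected failure", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusAccepted)
	}))
	defer srv.Close()

	if err := pushBlob(srv.URL); err != nil {
		t.Fatalf("push failed even with a retry: %v", err)
	}
	if hits != 2 {
		t.Fatalf("expected exactly one retry (2 requests), saw %d", hits)
	}
}
```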

Comment 18 Wenjing Zheng 2019-11-07 09:51:15 UTC
Created attachment 1633599 [details]
Request twice log

Comment 20 errata-xmlrpc 2020-01-23 11:06:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

Comment 21 W. Trevor King 2020-02-06 20:20:06 UTC
oc PR landed in 99a4f81 [1].  Checking to see when this went out:

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.3.0-rc.0 | grep cli-artifacts
  cli-artifacts                                 https://github.com/openshift/oc                                            6a937dfe56ff26255d09702c69b8406040c14505
$ git log --oneline origin/release-4.3 | grep -n '6a937df\|99a4f81'
5:6a937dfe5 Merge pull request #208 from openshift-cherrypick-robot/cherry-pick-207-to-release-4.3
124:99a4f8118 Merge pull request #112 from smarterclayton/retry_error

So yeah, this was well before 4.3.0-rc.0.  And as yet no backport to 4.2.

[1]: https://github.com/openshift/oc/pull/112#event-2686554337