Description of problem:

Experiencing spurious errors while pushing images to an OpenShift 3.7 cluster. Some of the errors seen include:

- failed to upload schema2 manifest: received unexpected HTTP status: 500 Internal Server Error - falling back to schema1
- Attempting next endpoint for push after error: received unexpected HTTP status: 500 Internal Server Error
- Attempting next endpoint for push after error: error parsing HTTP 400 response body: unexpected end of JSON input: ""
- Upload failed, retrying: read tcp X.X.X.X:XXXX->Y.Y.Y.Y:443: read: connection reset by peer
- Attempting next endpoint for push after error: Patch https://registry.example.com/v2/XXX/XXX/blobs/uploads/XXX?_state=…: EOF

Can also be reproduced on a 3.9 cluster.

Version-Release number of selected component (if applicable):

OpenShift 3.7
OpenShift 3.9
ose-haproxy-router:v3.9.33

How reproducible:

Very reproducible

Steps to Reproduce:

1. Set the router reload interval to 1s, the lowest allowed value:

oc -n default env dc/router RELOAD_INTERVAL=1s

2. Create a project for tests:

oc -n default new-project --skip-config-write test-project

3. Create a route to trigger a configuration change and router reload:

oc -n test-project create route edge test --service test --port 80

4. Start patching the route regularly:

while sleep 1; do oc -n test-project patch -p '{"spec": {"to": {"weight": '"$((1 + (RANDOM % 250)))"'}}}' route/test; done

5. Observe "oc -n default logs -f dc/router". The router should reload once a second or thereabouts.

6. Start a cleanup job in case quotas and/or limitranges are active on the project:

while sleep 60s; do oc -n test-project delete istag --all; done

7. Start watching Docker logs on the client (command for upstream-provided packages, may vary depending on the test environment):

sudo journalctl -f -u docker

8. Allow the Docker daemon to push to the registry:

oc whoami -t | \
  sudo docker login -u "$(oc whoami)" --password-stdin registry.example.com

9. Write the reproduction script:

$ cat >test-app.bash <<'EOF' && chmod +x test-app.bash
#!/bin/bash

set -e -u -o pipefail

rev="${1:?}"

tmpdir=$(mktemp -d)
trap 'date >&2; rm -rf "$tmpdir"' EXIT

tag="registry.example.com/test-project/test-app:${rev}"

cd "$tmpdir"

while true; do
  {
    echo "pid=$$"
    echo "random=$RANDOM"
    date
  } > data

  {
    echo 'FROM scratch'
    echo 'COPY ["data", "/"]'
  } > Dockerfile

  docker build -t "$tag" .

  date >&2
  docker push "$tag"
  date >&2
done

# vim: set sw=2 sts=2 et :
EOF

10. Start the reproduction script. I used 4 instances:

./test-app.bash latest1
./test-app.bash latest2
etc.

Actual results:

Eventually one or more of the reproduction scripts fails with an error message, possibly one of those listed at the beginning.

Expected results:

Images are pushed correctly every time.

Additional info:
It may be that the actual issue is within the Docker client code, not with the registry or OpenShift. When the OpenShift application router (HAProxy) reloads its configuration, requests may end up failing. The Docker client code has provisions to retry some failing requests, but looking at the code, not all requests made while pushing an image are retried.
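As a possible client-side workaround (a minimal sketch, not part of the original report; the script name and retry count are made up for illustration), failing pushes can be retried from outside the Docker client, since re-running "docker push" for the same tag only re-uploads whatever is still missing:

#!/bin/bash
# push-with-retry.bash -- hypothetical workaround sketch: retry `docker push`
# a few times when it fails, e.g. when a request is dropped during a router
# reload. Script name and retry count are illustrative only.
set -e -u -o pipefail

tag="${1:?usage: push-with-retry.bash IMAGE[:TAG]}"
max_attempts=5

for ((attempt = 1; attempt <= max_attempts; ++attempt)); do
  if docker push "$tag"; then
    exit 0
  fi
  echo "push attempt ${attempt}/${max_attempts} failed, retrying in 2s" >&2
  sleep 2
done

echo "giving up after ${max_attempts} failed push attempts" >&2
exit 1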
Pili, based on your last comment, what would you like us to do with this bug? Do the registry logs show these errors?
> Do the registry logs show these errors?

The registry pod(s) show nothing. The errors stem from failing HTTP requests in the Docker client which aren't retried.
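One way to cross-check this (a rough diagnostic sketch, not from the original report; the output file name is arbitrary) is to prefix the router log stream with local timestamps and compare it against the "date" lines that test-app.bash prints around each push:

# Prefix each router log line with a local timestamp so reload events can be
# compared with the `date` output printed around each push by test-app.bash.
oc -n default logs -f dc/router 2>&1 |
  while IFS= read -r line; do
    printf '%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S%z')" "$line"
  done | tee router-with-timestamps.log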
ok. I can either close this as "can't fix", or we can send it to the router/networking team if you think the router should be handling this differently. Up to you.
I don't think the router in OpenShift or the Docker registry can handle it differently. Changes would be required in the Docker client, which isn't under your control.

Context: I work for the company which opened the support case that, via Pili, led to this report. On our end we received multiple reports from customers who experienced these issues. It was only after in-depth debugging that we discovered it is a client-side issue, not one in OpenShift; as such, this bug was a follow-up to the support case, which assumed an OpenShift-specific issue.

Please go ahead and close as WONTFIX. We'll have to report it upstream if it continues to be an issue.
Thanks
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days