Bug 1618657 - Random errors pushing images to docker-registry on router reload
Summary: Random errors pushing images to docker-registry on router reload
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.9.z
Assignee: Ben Parees
QA Contact: Dongbo Yan
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-08-17 08:55 UTC by Pili Guerra
Modified: 2023-09-15 00:11 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-17 14:03:32 UTC
Target Upstream Version:
Embargoed:



Description Pili Guerra 2018-08-17 08:55:01 UTC
Description of problem:

Experiencing spurious errors while pushing images to an OpenShift 3.7 cluster.

Some of the errors seen include:

- failed to upload schema2 manifest: received unexpected HTTP status: 500 Internal Server Error - falling back to schema1
- Attempting next endpoint for push after error: received unexpected HTTP status: 500 Internal Server Error
- Attempting next endpoint for push after error: error parsing HTTP 400 response body: unexpected end of JSON input: ""
- Upload failed, retrying: read tcp X.X.X.X:XXXX->Y.Y.Y.Y:443: read: connection reset by peer
- Attempting next endpoint for push after error: Patch https://registry.example.com/v2/XXX/XXX/blobs/uploads/XXX?_state=…: EOF

This can also be reproduced on a 3.9 cluster.

Version-Release number of selected component (if applicable):

OpenShift 3.7
OpenShift 3.9
ose-haproxy-router:v3.9.33


How reproducible:

Very reproducible

Steps to Reproduce:

1. Set the router reload interval to 1s, the lowest allowed value:

  oc -n default env dc/router RELOAD_INTERVAL=1s

2. Create project for tests:

  oc -n default new-project --skip-config-write test-project

3. Create a route whose changes will trigger router configuration reloads:

  oc -n test-project create route edge test --service test --port 80

4. Start patching the route regularly:

  while sleep 1; do oc -n test-project patch -p '{"spec": {"to": {"weight": '"$((1 + (RANDOM % 250)))"'}}}' route/test; done

5. Observe "oc -n default logs -f dc/router". Router should reload once a second or thereabout.

6. Start a cleanup job in case quotas and/or limit ranges are active on the project:

  while sleep 60s; do oc -n test-project delete istag --all; done

7. Start watching Docker logs on client (command for upstream-provided packages, may vary depending on test environment):

  sudo journalctl -f -u docker

8. Allow the Docker daemon to push to the registry:

  oc whoami -t | \
  sudo docker login -u "$(oc whoami)" --password-stdin registry.example.com

9. Write the reproduction script:

  $ cat >test-app.bash <<'EOF' && chmod +x test-app.bash
#!/bin/bash

set -e -u -o pipefail

rev="${1:?}"

tmpdir=$(mktemp -d)
trap 'date >&2; rm -rf "$tmpdir"' EXIT

tag="registry.example.com/test-project/test-app:${rev}"

cd "$tmpdir"

while true; do
  {
    echo "pid=$$"
    echo "random=$RANDOM"
    date
  } > data

  {
    echo 'FROM scratch'
    echo 'COPY ["data", "/"]'
  } > Dockerfile

  docker build -t "$tag" .
  date >&2
  docker push "$tag"
  date >&2
done

# vim: set sw=2 sts=2 et :
EOF

10. Start several instances of the reproduction script. I used 4:

  ./test-app.bash latest1
  ./test-app.bash latest2
  etc.
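
Starting the instances by hand gets tedious, so a small launcher along the following lines may help. This is only a sketch: the function name run_instances and the per-instance log file names are made up here and are not part of the original reproduction. It assumes the command takes the tag suffix as its last argument, as test-app.bash does.

```bash
#!/bin/bash

# run_instances LOGDIR COUNT CMD [ARGS...]
# Hypothetical helper: starts COUNT background copies of CMD, passing
# "latest1" .. "latestCOUNT" as the final argument, and writes the output
# of each copy to LOGDIR/latestN.log. Returns once all copies have exited.
run_instances() {
  local logdir="$1" count="$2" i
  shift 2
  mkdir -p "$logdir"
  for ((i = 1; i <= count; i++)); do
    "$@" "latest${i}" >"${logdir}/latest${i}.log" 2>&1 &
  done
  # Without arguments, wait blocks until every background child has exited.
  wait
}
```

For example, "run_instances ./logs 4 ./test-app.bash" would start four instances. Because test-app.bash loops until a command fails, the call returns only after every instance has hit an error; the log files show which instance failed and when.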


Actual results:

Eventually one or more of the reproduction scripts fail with an error message, possibly one of those listed at the beginning.
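
To see which of the messages listed at the beginning actually occur, the Docker daemon log from step 7 can be tallied by error signature. The following is a minimal sketch: the function name count_push_errors is made up for illustration, and the signatures are shortened substrings of the messages quoted in this report, so they may also match unrelated log lines.

```bash
#!/bin/bash

# count_push_errors
# Hypothetical helper: reads log text on stdin and prints
# "<count> <signature>" for every signature that occurs at least once.
# The signatures are substrings of the error messages quoted in this report.
count_push_errors() {
  local log sig n
  log=$(cat)
  for sig in \
    'received unexpected HTTP status: 500 Internal Server Error' \
    'error parsing HTTP 400 response body' \
    'connection reset by peer' \
    ': EOF'; do
    # grep -c exits non-zero when nothing matches; "|| true" keeps the "0".
    n=$(grep -c -F -- "$sig" <<<"$log" || true)
    if [ "$n" -gt 0 ]; then
      printf '%s %s\n' "$n" "$sig"
    fi
  done
}
```

A possible invocation on the client would be: sudo journalctl -u docker --since "1 hour ago" | count_push_errors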

Expected results:

Images are pushed correctly every time.

Additional info:

Comment 1 Pili Guerra 2018-08-17 08:57:40 UTC
It may be that the actual issue is within the Docker client code, not with the registry or OpenShift.

When the OpenShift application router (HAProxy) reloads its configuration, in-flight requests may fail. The Docker client code has provisions to retry some failing requests, but judging by the code, not all requests made while pushing an image are retried.

Comment 2 Ben Parees 2018-08-17 11:53:36 UTC
Pili, based on your last comment, what would you like us to do with this bug?


Do the registry logs show these errors?

Comment 3 hansmi 2018-08-17 12:20:58 UTC
> Do the registry logs show these errors?

The registry pod(s) show nothing. The errors stem from failing HTTP requests in the Docker client which aren't retried.
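
Until the client retries these requests itself, one possible mitigation is to retry the whole push from the calling script. The following is a sketch under assumptions, not something verified in this report: the helper name retry and its arguments are made up for illustration, and it assumes that a repeated "docker push" is safe and that the failure is transient.

```bash
#!/bin/bash

# retry MAX_ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Hypothetical helper: runs CMD until it succeeds or MAX_ATTEMPTS is reached.
# DELAY_SECONDS must be a whole number; it doubles after every failed attempt.
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed, retrying in ${delay}s: $*" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}
```

A build pipeline could then call, for example, retry 5 1 docker push "$tag". Using this inside test-app.bash would hide the very failures the reproduction is meant to show, so it is only suitable as a workaround for real pushes.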

Comment 4 Ben Parees 2018-08-17 13:43:02 UTC
ok.  I can either close this as "can't fix", or we can send it to the router/networking team if you think the router should be handling this differently.  Up to you.

Comment 5 hansmi 2018-08-17 13:51:01 UTC
I don't think the router in OpenShift or the Docker registry can handle it differently. Changes would be required in the Docker client which isn't under your control.

Context: I work for the company which opened the support case which led to this report via Pili. On our end we got multiple reports from customers who experienced these issues. It was only after in-depth debugging that we discovered that it was a client-side issue, not in OpenShift. As such it was a follow-up to the support case which assumed that it's an OpenShift-specific issue.

Please go ahead and close as WONTFIX. We'll have to report it to upstream if it continues to be an issue.

Comment 6 Ben Parees 2018-08-17 14:03:32 UTC
Thanks

Comment 7 Red Hat Bugzilla 2023-09-15 00:11:41 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

