Description of problem:

Experiencing spurious errors while pushing images to an OpenShift 3.7 cluster. Some of the errors seen include:

- failed to upload schema2 manifest: received unexpected HTTP status: 500 Internal Server Error - falling back to schema1
- Attempting next endpoint for push after error: received unexpected HTTP status: 500 Internal Server Error
- Attempting next endpoint for push after error: error parsing HTTP 400 response body: unexpected end of JSON input: ""
- Upload failed, retrying: read tcp X.X.X.X:XXXX->Y.Y.Y.Y:443: read: connection reset by peer
- Attempting next endpoint for push after error: Patch https://registry.example.com/v2/XXX/XXX/blobs/uploads/XXX?_state=…: EOF

Can also be reproduced on a 3.9 cluster.

Version-Release number of selected component (if applicable):

OpenShift 3.7
OpenShift 3.9
ose-haproxy-router:v3.9.33

How reproducible:

Very reproducible

Steps to Reproduce:

1. Set the router reload interval to 1s, the lowest allowed value:

oc -n default env dc/router RELOAD_INTERVAL=1s

2. Create a project for tests:

oc -n default new-project --skip-config-write test-project

3. Create a route to trigger a configuration change and router reload:

oc -n test-project create route edge test --service test --port 80

4. Start patching the route regularly:

while sleep 1; do oc -n test-project patch -p '{"spec": {"to": {"weight": '"$((1 + (RANDOM % 250)))"'}}}' route/test; done

5. Observe "oc -n default logs -f dc/router". The router should reload once a second or thereabouts.

6. Start a cleanup job in case quotas and/or limitranges are active on the project:

while sleep 60s; do oc -n test-project delete istag --all; done

7. Start watching Docker logs on the client (command for upstream-provided packages, may vary depending on the test environment):

sudo journalctl -f -u docker

8. Allow the Docker daemon to push to the registry:

oc whoami -t | \
  sudo docker login -u "$(oc whoami)" --password-stdin registry.example.com

9. Write the reproduction script:

$ cat >test-app.bash <<'EOF' && chmod +x test-app.bash
#!/bin/bash

set -e -u -o pipefail

rev="${1:?}"

tmpdir=$(mktemp -d)
trap 'date >&2; rm -rf "$tmpdir"' EXIT

tag="registry.example.com/test-project/test-app:${rev}"

cd "$tmpdir"

while true; do
  {
    echo "pid=$$"
    echo "random=$RANDOM"
    date
  } > data

  {
    echo 'FROM scratch'
    echo 'COPY ["data", "/"]'
  } > Dockerfile

  docker build -t "$tag" .

  date >&2
  docker push "$tag"
  date >&2
done

# vim: set sw=2 sts=2 et :
EOF

10. Start the reproduction script. I used 4 instances:

./test-app.bash latest1
./test-app.bash latest2
etc.

Actual results:

Eventually one or more of the reproduction scripts fails with an error message, possibly one of those listed at the beginning.

Expected results:

Images are pushed correctly every time.

Additional info:
It may be that the actual issue is within the Docker client code, not with the registry or OpenShift. When the OpenShift application router (HAProxy) reloads its configuration, requests may end up failing. The Docker client code has provisions to retry some failing requests, but looking at the code, not all requests made while pushing an image are retried.
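As a possible client-side workaround (a minimal sketch, not part of the original report; the script name and retry count are made up for illustration), failing pushes can be retried from outside the Docker client, since re-running "docker push" for the same tag only re-uploads whatever is still missing:

#!/bin/bash
# push-with-retry.bash -- hypothetical workaround sketch: retry `docker push`
# a few times when it fails, e.g. when a request is dropped during a router
# reload. Script name and retry count are illustrative only.
set -e -u -o pipefail

tag="${1:?usage: push-with-retry.bash IMAGE[:TAG]}"
max_attempts=5

for ((attempt = 1; attempt <= max_attempts; ++attempt)); do
  if docker push "$tag"; then
    exit 0
  fi
  echo "push attempt ${attempt}/${max_attempts} failed, retrying in 2s" >&2
  sleep 2
done

echo "giving up after ${max_attempts} failed push attempts" >&2
exit 1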
Pili, based on your last comment, what would you like us to do with this bug? Do the registry logs show these errors?
> Do the registry logs show these errors?

The registry pod(s) show nothing. The errors stem from failing HTTP requests in the Docker client which aren't retried.
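One way to cross-check this (a rough diagnostic sketch, not from the original report; the output file name is arbitrary) is to prefix the router log stream with local timestamps and compare it against the "date" lines that test-app.bash prints around each push:

# Prefix each router log line with a local timestamp so reload events can be
# compared with the `date` output printed around each push by test-app.bash.
oc -n default logs -f dc/router 2>&1 |
  while IFS= read -r line; do
    printf '%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S%z')" "$line"
  done | tee router-with-timestamps.log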
ok. I can either close this as "can't fix", or we can send it to the router/networking team if you think the router should be handling this differently. Up to you.
I don't think the router in OpenShift or the Docker registry can handle it differently. Changes would be required in the Docker client, which isn't under your control.

Context: I work for the company which opened the support case that, via Pili, led to this report. On our end we received multiple reports from customers who experienced these issues. It was only after in-depth debugging that we discovered it is a client-side issue, not one in OpenShift; as such, this bug was a follow-up to the support case, which assumed an OpenShift-specific issue.

Please go ahead and close as WONTFIX. We'll have to report it upstream if it continues to be an issue.
Thanks
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days