Bug 1895107 - docker rate limiting causing image pull failures
Summary: docker rate limiting causing image pull failures
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: ImageStreams
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Oleg Bulatov
QA Contact: Wenjing Zheng
URL:
Whiteboard:
Depends On: 1901982 1904679 1904682 1904683 1904684 2051984
Blocks:
 
Reported: 2020-11-05 18:42 UTC by jamo luhrsen
Modified: 2022-02-08 13:34 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-13 17:06:26 UTC
Target Upstream Version:
Embargoed:



Description jamo luhrsen 2020-11-05 18:42:04 UTC
Description of problem:

Docker Hub recently started rate limiting image pulls [0] (Nov. 2nd), and
CI jobs are seeing image pull failures because of this.

You would see something like this in the build log from this example job [1]:

* 3x kubelet: Failed to pull image "busybox": rpc error: code = Unknown desc = Error reading manifest latest in docker.io/library/busybox: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
* 3x kubelet: Error: ErrImagePull
* 5x kubelet: Back-off pulling image "busybox"
* 5x kubelet: Error: ImagePullBackOff



[0] https://www.docker.com/blog/what-you-need-to-know-about-upcoming-docker-hub-rate-limiting/
[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/852/pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn/1324047536943534080
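
Docker's rate-limit documentation describes a way to check how much of the anonymous pull quota remains for a given source IP, which can help confirm that a CI node is actually being throttled. The sketch below is illustrative only, not something the failing jobs run; it assumes curl and jq are available and uses the ratelimitpreview/test repository from Docker's documentation:

$ TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
$ # HEAD the test manifest and inspect the ratelimit-limit / ratelimit-remaining headers
$ curl -s --head -H "Authorization: Bearer $TOKEN" "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest" | grep -i '^ratelimit'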



Version-Release number of selected component (if applicable):


How reproducible:

Very frequently. A search over a two-day period at the time this BZ was
created showed 313 jobs affected:

https://search.ci.openshift.org/?search=Failed+to+pull+image+%22busybox%22&maxAge=48h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job


Additional info:

Discussions in Slack mentioned two alternative images (a quick verification sketch follows the list):

gcr.io/google-containers/busybox
quay.io/quay/busybox
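
A quick, disposable way to confirm that one of the mirrors pulls cleanly in a cluster could look like the sketch below (the pod name, tag, and sleep command are arbitrary placeholders, and it assumes the quay.io mirror publishes a latest tag):

$ oc run busybox-pull-test --image=quay.io/quay/busybox:latest --restart=Never -- sleep 30
$ oc get pod busybox-pull-test    # should reach Running rather than ErrImagePull / ImagePullBackOff
$ oc delete pod busybox-pull-test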

Comment 1 Seth Jennings 2020-11-06 14:35:55 UTC
Looking in openshift/origin (openshift-tests), these are some of the files I found referencing the docker.io busybox image:

test/extended/builds/hooks.go
test/extended/builds/multistage.go
test/extended/cli/compat.go
test/extended/images/extract.go
test/extended/images/layers.go
test/extended/images/mirror.go
test/extended/images/oc_tag.go
test/extended/images/resolve.go
test/extended/testdata/bindata.go
test/extended/testdata/builds/build-postcommit/docker.yaml
test/extended/testdata/builds/test-cds-dockerbuild.json
test/extended/testdata/builds/test-docker-build.json
test/extended/testdata/cmd/test/cmd/testdata/test-docker-build.json
test/extended/testdata/test-cli-debug.yaml

Most tests that use the docker.io busybox image are in the Builds and Images test suites.
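
For reference, a list like the one above can be regenerated with a repository-wide search from an openshift/origin checkout; the pattern below is an approximation and may match a few unrelated files:

$ grep -rlE 'docker\.io/library/busybox|"busybox"' test/extended/ | sort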

Comment 2 Seth Jennings 2020-11-06 15:11:03 UTC
This is being tracked for the upstream e2e tests:
https://github.com/kubernetes/test-infra/issues/19477
https://github.com/kubernetes/kubernetes/issues/94018

Comment 3 Seth Jennings 2020-11-06 15:15:49 UTC
Meant to assign this to ImageStreams, though there are some tests from Build that also need to be changed.

Comment 4 Seth Jennings 2020-11-06 16:05:03 UTC
Trying to address this for upstream e2e
https://github.com/openshift/release/pull/13460

Comment 5 W. Trevor King 2020-11-09 22:21:29 UTC
Still a popular failure mode:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=www.docker.com/increase-rate-limit' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ocp-4.5-e2e-vsphere-upi - 6 runs, 100% failed, 17% of failures match
periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-proxy - 7 runs, 86% failed, 17% of failures match
periodic-ci-openshift-release-master-ocp-4.7-e2e-vsphere - 14 runs, 100% failed, 14% of failures match
pull-ci-cri-o-cri-o-master-e2e-aws - 29 runs, 41% failed, 8% of failures match
...
pull-ci-openshift-sriov-network-operator-master-e2e-aws - 22 runs, 100% failed, 5% of failures match
release-openshift-ocp-installer-e2e-aws-4.7 - 11 runs, 45% failed, 20% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.7 - 12 runs, 67% failed, 25% of failures match
release-openshift-okd-installer-e2e-aws-4.6 - 8 runs, 63% failed, 80% of failures match
release-openshift-okd-installer-e2e-aws-upgrade - 7 runs, 14% failed, 100% of failures match

Comment 8 Oleg Bulatov 2020-12-05 15:11:00 UTC
Clayton's PR has landed, but we still have this problem:

https://search.ci.openshift.org/?search=You+have+reached+your+pull+rate+limit&maxAge=48h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job

I created a few BZs for tests that still use docker.io:

BZ 1904679
BZ 1904682
BZ 1904683
BZ 1904684

Comment 11 Oleg Bulatov 2021-01-13 17:06:26 UTC
https://search.ci.openshift.org/?search=You+have+reached+your+pull+rate+limit&maxAge=48h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job

"matched 1.56% of failing runs"

I think we've mostly mitigated the problem, but some tests still use Docker Hub (mostly e2e-cmd). Let's create additional BZs for the remaining tests, as they are owned by different teams.

