1895107 – docker rate limiting causing image pull failures

Keywords:
Status:	CLOSED DEFERRED
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	ImageStreams
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Oleg Bulatov
QA Contact:	Wenjing Zheng
Docs Contact:
URL:
Whiteboard:
Depends On:	1901982 1904679 1904682 1904683 1904684 2051984
Blocks:

Reported:	2020-11-05 18:42 UTC by jamo luhrsen
Modified:	2022-02-08 13:34 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-01-13 17:06:26 UTC
Target Upstream Version:
Embargoed:

Description jamo luhrsen 2020-11-05 18:42:04 UTC

Description of problem:

Docker Hub began rate limiting image pulls on Nov. 2nd [0], and CI jobs are
seeing image pull failures as a result.

You would see something like this in the build log of this example job [1]:

* 3x kubelet: Failed to pull image "busybox": rpc error: code = Unknown desc = Error reading manifest latest in docker.io/library/busybox: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
* 3x kubelet: Error: ErrImagePull
* 5x kubelet: Back-off pulling image "busybox"
* 5x kubelet: Error: ImagePullBackOff



[0] https://www.docker.com/blog/what-you-need-to-know-about-upcoming-docker-hub-rate-limiting/
[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/852/pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn/1324047536943534080
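
For triage, the failure quoted above can be picked out of a build log mechanically. The following is a minimal sketch, assuming the log lines keep the same wording as the kubelet message in this report; the function name and the regular expression are illustrative and not part of any OpenShift tooling.

```python
import re

# Matches the kubelet pull failure quoted in this report. The registry error
# code "toomanyrequests" is what marks the failure as a rate-limit problem.
RATE_LIMIT_RE = re.compile(
    r'Failed to pull image "(?P<image>[^"]+)".*?'
    r'Error reading manifest (?P<tag>\S+) in (?P<repo>\S+): toomanyrequests'
)


def find_rate_limited_pulls(log_lines):
    """Return (requested image, resolved repository, tag) per rate-limited pull."""
    hits = []
    for line in log_lines:
        m = RATE_LIMIT_RE.search(line)
        if m:
            hits.append((m.group("image"), m.group("repo"), m.group("tag")))
    return hits


sample = [
    'kubelet: Failed to pull image "busybox": rpc error: code = Unknown desc = '
    'Error reading manifest latest in docker.io/library/busybox: toomanyrequests: '
    'You have reached your pull rate limit.',
    'kubelet: Error: ErrImagePull',
]
print(find_rate_limited_pulls(sample))
```

The ErrImagePull and ImagePullBackOff lines are deliberately ignored, since on their own they do not say why the pull failed.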



Version-Release number of selected component (if applicable):


How reproducible:

Very frequently. A search over the 48 hours before this BZ was filed showed
313 affected jobs:

https://search.ci.openshift.org/?search=Failed+to+pull+image+%22busybox%22&maxAge=48h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job


Additional info:

Discussions in Slack mentioned two alternative image sources:

gcr.io/google-containers/busybox
quay.io/quay/busybox
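
As a rough illustration of what switching to one of these alternatives involves, the sketch below rewrites busybox references that resolve to Docker Hub so that they point at quay.io/quay/busybox instead. It assumes that an unqualified name such as "busybox" resolves to docker.io/library/busybox, as the kubelet log in this report shows. The ALTERNATIVES table and the helper names are hypothetical, and whether the mirror carries the same tags has not been verified here.

```python
# Hypothetical mapping from Docker Hub repositories to a mirror mentioned in
# this bug. Tag availability on the mirror is NOT verified by this sketch.
ALTERNATIVES = {
    "docker.io/library/busybox": "quay.io/quay/busybox",
}


def split_tag(image):
    """Split 'name:tag' into (name, tag); a ':' in a registry port is not a tag."""
    last_component = image.rsplit("/", 1)[-1]
    if ":" in last_component:
        name, tag = image.rsplit(":", 1)
        return name, tag
    return image, None


def normalize_repo(name):
    """Expand short names following the usual Docker Hub conventions."""
    parts = name.split("/")
    first = parts[0]
    if first == "docker.io" and len(parts) == 2:
        return "docker.io/library/" + parts[1]
    has_registry = len(parts) > 1 and (
        "." in first or ":" in first or first == "localhost"
    )
    if has_registry:
        return name
    if len(parts) == 1:
        return "docker.io/library/" + name
    return "docker.io/" + name


def rewrite(image):
    """Return the mirrored reference, or the input unchanged if no mirror is known."""
    if "@" in image:
        return image  # digest references are left alone in this sketch
    name, tag = split_tag(image)
    target = ALTERNATIVES.get(normalize_repo(name))
    if target is None:
        return image
    return target if tag is None else f"{target}:{tag}"


for ref in ("busybox", "docker.io/library/busybox:latest", "quay.io/foo/bar"):
    print(ref, "->", rewrite(ref))
```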

Comment 1 Seth Jennings 2020-11-06 14:35:55 UTC

Looking in openshift/origin (openshift-tests) some of the files I found referencing the docker.io busybox image:

test/extended/builds/hooks.go
test/extended/builds/multistage.go
test/extended/cli/compat.go
test/extended/images/extract.go
test/extended/images/layers.go
test/extended/images/mirror.go
test/extended/images/oc_tag.go
test/extended/images/resolve.go
test/extended/testdata/bindata.go
test/extended/testdata/builds/build-postcommit/docker.yaml
test/extended/testdata/builds/test-cds-dockerbuild.json
test/extended/testdata/builds/test-docker-build.json
test/extended/testdata/cmd/test/cmd/testdata/test-docker-build.json
test/extended/testdata/test-cli-debug.yaml

Most tests that use the docker.io busybox image are in the Builds and Images tests.
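
A list like the one above can be regenerated with a small scan of the source tree. This is a sketch under stated assumptions: the regular expression is a heuristic (explicit docker.io references plus a quoted bare "busybox"), the file suffixes are a guess at what matters in test/extended, and unquoted YAML values such as `image: busybox` would be missed.

```python
import os
import re

# Explicit Docker Hub references, plus a quoted bare "busybox" image name.
# The second alternative is a heuristic and may produce false positives.
DOCKER_HUB_RE = re.compile(
    r'docker\.io/[\w./-]+(?::[\w.-]+)?|"busybox(?::[\w.-]+)?"'
)


def find_docker_hub_refs(root, suffixes=(".go", ".yaml", ".json")):
    """Yield (relative path, line number, match) for suspected Docker Hub refs."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in sorted(filenames):
            if not fname.endswith(suffixes):
                continue
            path = os.path.join(dirpath, fname)
            with open(path, encoding="utf-8", errors="replace") as fh:
                for lineno, line in enumerate(fh, start=1):
                    for m in DOCKER_HUB_RE.finditer(line):
                        yield os.path.relpath(path, root), lineno, m.group(0)


if __name__ == "__main__":
    # Assumes the script is run from a checkout of openshift/origin.
    for path, lineno, ref in find_docker_hub_refs("test/extended"):
        print(f"{path}:{lineno}: {ref}")
```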

Comment 2 Seth Jennings 2020-11-06 15:11:03 UTC

Being tracked for upstream e2e tests
https://github.com/kubernetes/test-infra/issues/19477
https://github.com/kubernetes/kubernetes/issues/94018

Comment 3 Seth Jennings 2020-11-06 15:15:49 UTC

I meant to assign this to ImageStreams, though some Build tests also need to be changed.

Comment 4 Seth Jennings 2020-11-06 16:05:03 UTC

Trying to address this for upstream e2e
https://github.com/openshift/release/pull/13460

Comment 5 W. Trevor King 2020-11-09 22:21:29 UTC

Still a popular failure mode:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=www.docker.com/increase-rate-limit' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ocp-4.5-e2e-vsphere-upi - 6 runs, 100% failed, 17% of failures match
periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-proxy - 7 runs, 86% failed, 17% of failures match
periodic-ci-openshift-release-master-ocp-4.7-e2e-vsphere - 14 runs, 100% failed, 14% of failures match
pull-ci-cri-o-cri-o-master-e2e-aws - 29 runs, 41% failed, 8% of failures match
...
pull-ci-openshift-sriov-network-operator-master-e2e-aws - 22 runs, 100% failed, 5% of failures match
release-openshift-ocp-installer-e2e-aws-4.7 - 11 runs, 45% failed, 20% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.7 - 12 runs, 67% failed, 25% of failures match
release-openshift-okd-installer-e2e-aws-4.6 - 8 runs, 63% failed, 80% of failures match
release-openshift-okd-installer-e2e-aws-upgrade - 7 runs, 14% failed, 100% of failures match
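
To rank jobs by how many runs the rate limit likely broke, the three numbers on each line can be multiplied together: runs x failed% x match%. The sketch below does that for lines in the format shown above. It is only an estimate, since the percentages in the search output are already rounded, and the function name is illustrative.

```python
import re

# One summary line: "<job> - N runs, X% failed, Y% of failures match"
LINE_RE = re.compile(
    r"^(?P<job>\S+) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match$"
)


def estimate_matching_runs(lines):
    """Return (job, estimated runs hit by the rate limit), highest first."""
    rows = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skips the "..." elision and any other non-matching line
        runs = int(m.group("runs"))
        failed_pct = int(m.group("failed"))
        match_pct = int(m.group("match"))
        estimate = runs * failed_pct * match_pct / 10000
        rows.append((m.group("job"), round(estimate, 2)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


sample = [
    "pull-ci-cri-o-cri-o-master-e2e-aws - 29 runs, 41% failed, 8% of failures match",
    "...",
    "release-openshift-okd-installer-e2e-aws-4.6 - 8 runs, 63% failed, 80% of failures match",
]
for job, estimate in estimate_matching_runs(sample):
    print(f"{estimate:5.2f}  {job}")
```

For example, the okd 4.6 job works out to 8 x 0.63 x 0.80, or roughly 4 of its 8 runs.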

Comment 8 Oleg Bulatov 2020-12-05 15:11:00 UTC

Clayton's PR has landed. But we still have this problem:

https://search.ci.openshift.org/?search=You+have+reached+your+pull+rate+limit&maxAge=48h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job

I created a few BZs for tests that still use docker.io:

BZ 1904679
BZ 1904682
BZ 1904683
BZ 1904684

Comment 11 Oleg Bulatov 2021-01-13 17:06:26 UTC

https://search.ci.openshift.org/?search=You+have+reached+your+pull+rate+limit&maxAge=48h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job

"matched 1.56% of failing runs"

I think we've mostly mitigated the problem, but there are some tests that still use Docker Hub (mostly it's e2e-cmd). Let's create additional BZs for the remaining tests as they are owned by different teams.
