Bug 1895107

Summary: Docker rate limiting causing image pull failures
Product: OpenShift Container Platform
Reporter: jamo luhrsen <jluhrsen>
Component: ImageStreams
Assignee: Oleg Bulatov <obulatov>
Status: CLOSED DEFERRED
QA Contact: Wenjing Zheng <wzheng>
Severity: medium
Priority: high
Version: 4.7
CC: aos-bugs, eparis, jcallen, jokerman, rmarasch, sjenning, wking
Target Milestone: ---
Keywords: UpcomingSprint
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Last Closed: 2021-01-13 17:06:26 UTC
Type: Bug
Bug Depends On: 1901982, 1904679, 1904682, 1904683, 1904684, 2051984    
Bug Blocks:    

Description jamo luhrsen 2020-11-05 18:42:04 UTC
Description of problem:

Docker Hub recently started rate limiting image pulls [0] (Nov. 2nd), and
CI jobs are seeing image pull failures because of this.

You would see something like the following in the build log of this example job [1]:

* 3x kubelet: Failed to pull image "busybox": rpc error: code = Unknown desc = Error reading manifest latest in docker.io/library/busybox: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
* 3x kubelet: Error: ErrImagePull
* 5x kubelet: Back-off pulling image "busybox"
* 5x kubelet: Error: ImagePullBackOff



[0] https://www.docker.com/blog/what-you-need-to-know-about-upcoming-docker-hub-rate-limiting/
[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/852/pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn/1324047536943534080
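For reference, the failure mode can be reproduced outside of CI with something
like the following (a rough sketch; it assumes an oc session against a cluster
whose nodes have already exhausted the anonymous Docker Hub pull quota, and the
pod name is only an example):

  # Schedule a pod that pulls "busybox" anonymously from docker.io
  oc run rate-limit-check --image=busybox --restart=Never -- sleep 3600

  # The toomanyrequests error shows up in the pod's events
  oc describe pod rate-limit-check | grep -A2 'Failed to pull image'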



Version-Release number of selected component (if applicable):


How reproducible:

Very frequently. A search over a two-day period at the time this BZ was created
showed 313 jobs affected:

https://search.ci.openshift.org/?search=Failed+to+pull+image+%22busybox%22&maxAge=48h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job


Additional info:

Discussions in Slack mentioned two alternative images (see the sketch below):

gcr.io/google-containers/busybox
quay.io/quay/busybox
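
A quick sanity check of the alternatives (a sketch; assumes podman is available
on the host and that both repositories remain public) is to pull them directly,
since neither request goes through docker.io and so neither counts against the
Docker Hub limit:

  # Neither pull touches docker.io, so neither is subject to the Docker Hub rate limit
  podman pull gcr.io/google-containers/busybox
  podman pull quay.io/quay/busybox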

Comment 1 Seth Jennings 2020-11-06 14:35:55 UTC
Looking in openshift/origin (openshift-tests) some of the files I found referencing the docker.io busybox image:

test/extended/builds/hooks.go
test/extended/builds/multistage.go
test/extended/cli/compat.go
test/extended/images/extract.go
test/extended/images/layers.go
test/extended/images/mirror.go
test/extended/images/oc_tag.go
test/extended/images/resolve.go
test/extended/testdata/bindata.go
test/extended/testdata/builds/build-postcommit/docker.yaml
test/extended/testdata/builds/test-cds-dockerbuild.json
test/extended/testdata/builds/test-docker-build.json
test/extended/testdata/cmd/test/cmd/testdata/test-docker-build.json
test/extended/testdata/test-cli-debug.yaml

Most tests that use the docker.io busybox image are in the Builds and Images test suites.
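
For reference, a search along these lines (an approximation, run from an
openshift/origin checkout) turns up roughly the same set of files:

  # Files under test/extended that reference busybox either by its fully
  # qualified Docker Hub name or by the bare image name
  grep -rlE 'docker\.io/library/busybox|"busybox"' test/extended/ | sort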

Comment 2 Seth Jennings 2020-11-06 15:11:03 UTC
Being tracked for upstream e2e tests
https://github.com/kubernetes/test-infra/issues/19477
https://github.com/kubernetes/kubernetes/issues/94018

Comment 3 Seth Jennings 2020-11-06 15:15:49 UTC
Meant to assign this to ImageStreams, though some tests from Builds also need to be changed.

Comment 4 Seth Jennings 2020-11-06 16:05:03 UTC
Trying to address this for upstream e2e
https://github.com/openshift/release/pull/13460

Comment 5 W. Trevor King 2020-11-09 22:21:29 UTC
Still a popular failure mode:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=www.docker.com/increase-rate-limit' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ocp-4.5-e2e-vsphere-upi - 6 runs, 100% failed, 17% of failures match
periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-proxy - 7 runs, 86% failed, 17% of failures match
periodic-ci-openshift-release-master-ocp-4.7-e2e-vsphere - 14 runs, 100% failed, 14% of failures match
pull-ci-cri-o-cri-o-master-e2e-aws - 29 runs, 41% failed, 8% of failures match
...
pull-ci-openshift-sriov-network-operator-master-e2e-aws - 22 runs, 100% failed, 5% of failures match
release-openshift-ocp-installer-e2e-aws-4.7 - 11 runs, 45% failed, 20% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.7 - 12 runs, 67% failed, 25% of failures match
release-openshift-okd-installer-e2e-aws-4.6 - 8 runs, 63% failed, 80% of failures match
release-openshift-okd-installer-e2e-aws-upgrade - 7 runs, 14% failed, 100% of failures match

Comment 8 Oleg Bulatov 2020-12-05 15:11:00 UTC
Clayton's PR has landed. But we still have this problem:

https://search.ci.openshift.org/?search=You+have+reached+your+pull+rate+limit&maxAge=48h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job

I created a few BZs for tests that still use docker.io:

BZ 1904679
BZ 1904682
BZ 1904683
BZ 1904684

Comment 11 Oleg Bulatov 2021-01-13 17:06:26 UTC
https://search.ci.openshift.org/?search=You+have+reached+your+pull+rate+limit&maxAge=48h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job

"matched 1.56% of failing runs"

I think we've mostly mitigated the problem, but some tests still use Docker Hub (mostly e2e-cmd). Let's create additional BZs for the remaining tests, as they are owned by different teams.
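
For the follow-up BZs, the mechanical part of the change could look like the
following (a sketch only; quay.io/quay/busybox stands in for whichever mirror
each owning team settles on, and generated files such as bindata.go would still
need to be regenerated):

  # Rewrite remaining hard-coded Docker Hub busybox references to a mirror
  grep -rl 'docker.io/library/busybox' test/ | xargs sed -i 's|docker.io/library/busybox|quay.io/quay/busybox|g'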