Description of problem:
Kubelet cannot pull the k8s.gcr.io/pause:3.1 image on the bootstrap node. In China, we cannot access k8s.gcr.io or other Google services.

Version-Release number of selected component (if applicable):
cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="410.8.20190516.0"
VERSION_ID="4.1"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 410.8.20190516.0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.1"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.1"
OSTREE_VERSION=410.8.20190516.0

How reproducible:

Steps to Reproduce:
1. Execute journalctl -b -u kubelet on the bootstrap node:

May 18 16:35:48 localhost.localdomain hyperkube[842]: E0518 16:35:48.093504 842 remote_runtime.go:96] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_bootstrap-machine-config-operator-localhost.localdomain_default_e40148da6106c3374d503347551ea5a4_0": Error determining manifest MIME type for docker://k8s.gcr.io/pause:3.1: pinging docker registry returned: Get https://k8s.gcr.io/v2/: dial tcp 108.177.125.82:443: i/o timeout
May 18 16:35:48 localhost.localdomain hyperkube[842]: E0518 16:35:48.093604 842 kuberuntime_sandbox.go:68] CreatePodSandbox for pod "bootstrap-machine-config-operator-localhost.localdomain_default(e40148da6106c3374d503347551ea5a4)" failed: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_bootstrap-machine-config-operator-localhost.localdomain_default_e40148da6106c3374d503347551ea5a4_0": Error determining manifest MIME type for docker://k8s.gcr.io/pause:3.1: pinging docker registry returned: Get https://k8s.gcr.io/v2/: dial tcp 108.177.125.82:443: i/o timeout
May 18 16:35:48 localhost.localdomain hyperkube[842]: E0518 16:35:48.093619 842 kuberuntime_manager.go:661] createPodSandbox for pod "bootstrap-machine-config-operator-localhost.localdomain_default(e40148da6106c3374d503347551ea5a4)" failed: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_bootstrap-machine-config-operator-localhost.localdomain_default_e40148da6106c3374d503347551ea5a4_0": Error determining manifest MIME type for docker://k8s.gcr.io/pause:3.1: pinging docker registry returned: Get https://k8s.gcr.io/v2/: dial tcp 108.177.125.82:443: i/o timeout
May 18 16:35:48 localhost.localdomain hyperkube[842]: E0518 16:35:48.093715 842 pod_workers.go:190] Error syncing pod e40148da6106c3374d503347551ea5a4 ("bootstrap-machine-config-operator-localhost.localdomain_default(e40148da6106c3374d503347551ea5a4)"), skipping: failed to "CreatePodSandbox" for "bootstrap-machine-config-operator-localhost.localdomain_default(e40148da6106c3374d503347551ea5a4)" with CreatePodSandboxError: "CreatePodSandbox for pod \"bootstrap-machine-config-operator-localhost.localdomain_default(e40148da6106c3374d503347551ea5a4)\" failed: rpc error: code = Unknown desc = error creating pod sandbox with name \"k8s_bootstrap-machine-config-operator-localhost.localdomain_default_e40148da6106c3374d503347551ea5a4_0\": Error determining manifest MIME type for docker://k8s.gcr.io/pause:3.1: pinging docker registry returned: Get https://k8s.gcr.io/v2/: dial tcp 108.177.125.82:443: i/o timeout"

Actual results:

Expected results:

Additional info:
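
The kubelet log repeats the same pull failure many times, so it can help to reduce it to the distinct image references that failed. The following is a minimal sketch, not part of the original report; failed_image_refs is a hypothetical helper name, and it assumes the error wording shown in the log above.

```shell
# Sketch: list the distinct image references that the kubelet failed to pull.
# failed_image_refs is a hypothetical helper; it reads log text on stdin.
failed_image_refs() {
    # Keep only the pull-failure lines, extract the docker:// reference,
    # drop the trailing colon that the error message appends, and de-duplicate.
    grep 'Error determining manifest MIME type' \
        | grep -oE 'docker://[^ ]+' \
        | sed -e 's/:$//' \
        | sort -u
}
```

On the bootstrap node this could be used as: journalctl -b -u kubelet | failed_image_refs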
Possible solutions:

1) Bake *a* pause image into the OS, use it by default (but installed system would use it)
2) Put up a pause image at quay.io that doesn't require authentication
3) Change the installer to pull the release image and update the kubelet config on bootstrap

Of these, 3) is probably easiest, perhaps even a one-liner - but 1) has come up in other contexts too.

(Is there any more detail on the China firewall in this aspect? Is it blocking gcr.io but not quay.io? Does it only block unauthenticated pulls?)
The MCO isn't setting or using that image; it comes from the default cri-o package configuration. The payload image should be available on the bootstrap node, though, so we should change that setting to whatever pause image the payload specifies (maybe in bootkube.sh?). Alternatively, we could override it some other way (ship the cri-o RPM with a config file that uses a RH-hosted image?). This sounds like an installer issue, though.

Maybe we should just patch the crio configuration to use the payload pause image directly before starting kube/openshift on bootstrap? https://github.com/openshift/installer/blob/master/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template
bootkube already "reads" the pause image:

MACHINE_CONFIG_INFRA_IMAGE=$(podman run --quiet --rm ${release} image pod)

As a stopgap, bootkube could just sed /etc/crio/crio.conf to use that image on the bootstrap node (and restart crio).
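
A minimal sketch of that stopgap, assuming GNU sed and a crio.conf in which the pause_image key starts at the beginning of its line. set_pause_image is a hypothetical helper name, not something in the installer; the config path is a parameter so the substitution can be tried on a copy first.

```shell
# Sketch of the stopgap: point CRI-O's pause_image at the payload's pod image.
# set_pause_image is a hypothetical helper; usage: set_pause_image CONF IMAGE
set_pause_image() {
    conf=$1
    image=$2
    # Rewrite only the "pause_image = ..." line. "pause_image_auth_file" is not
    # touched because the pattern requires "=" (after optional spaces) directly
    # after the key. "," is the sed delimiter since image references contain "/".
    sed -i -e 's,^pause_image *=.*,pause_image = "'"${image}"'",' "${conf}"
}
```

On the bootstrap node this would be followed by a CRI-O restart, for example: set_pause_image /etc/crio/crio.conf "${MACHINE_CONFIG_INFRA_IMAGE}" && systemctl restart crio (the unit name crio is an assumption here).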
Another option is that we push a pause image (that doesn't require auth) to quay.io that we default to in crio.
(In reply to Colin Walters from comment #1)
> Possible solutions;
>
> 1) Bake *a* pause image into the OS, use it by default (but installed system
> would use it)
> 2) Put up a pause image at quay.io that doesn't require authentication
> 3) Change the installer to pull the release image and update the kubelet
> config on bootstrap
>
> Of these, 3) is probably easiest, perhaps even a one-liner - but 1) has come
> up in other contexts too.
>
> (Is there any more detail on the China firewall in this aspect? Is it
> blocking gcr.io but not quay.io? Does it only block unauthenticated pulls?)

The Great Firewall of China blocks gcr.io because it's owned by Google. quay.io is OK, but pulling images from quay.io is slow and unstable in China; it is normally slower than pulling from docker.io.

Is there any way to provide a disconnected installation method, such as image lists? Thanks.
Disconnected installs are part of 4.2 deliverables and this will be fixed then.
This must be fixed for 4.1.0; we cannot depend on images from gcr.io.
Taking a look at a patch for this now:

diff --git a/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template b/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template
index 9346f4b5f..9d2aa403f 100755
--- a/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template
+++ b/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template
@@ -38,6 +38,19 @@ OPENSHIFT_HYPERKUBE_IMAGE=$(podman run --quiet --rm ${release} image hyperkube)
 
 CLUSTER_BOOTSTRAP_IMAGE=$(podman run --quiet --rm ${release} image cluster-bootstrap)
 
+# Verify at this point we have no images from docker.io or gcr.io
+repos=$(mktemp --suffix='bootkube')
+podman images --sort repository --format '{{"{{"}}.Repository {{"}}"}}' | sort -u > ${repos}
+if grep -E '^(docker\.io|gcr\.io)/' ${repos}; then
+    echo "Disallowed registries found!"
+fi
+
+# Now, as early as possible we replace the pause image and restart crio to use it, to ensure
+# that we're using the pause image from our payload just like the primary cluster.
+# Nothing should have created a pod yet.
+sed -i -e 's,pause_image *=.*,pause_image = "'"${MACHINE_CONFIG_INFRA_IMAGE}"'",' /etc/crio/crio.conf
+systemctl restart cri-o
+
 mkdir --parents ./{bootstrap-manifests,manifests}
 
 if [ ! -f cvo-bootstrap.done ]
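
One caveat with a registry check anchored on ^gcr\.io/ is that it does not match subdomains such as k8s.gcr.io, which is the registry this bug is about. A minimal sketch of a check that also covers subdomains follows; disallowed_repos is a hypothetical helper name and is not taken from the patch or the final PR.

```shell
# Sketch: report repositories hosted on registries we do not want to depend on.
# disallowed_repos is a hypothetical helper; it reads one repository per line
# on stdin, prints the offending ones, and returns non-zero when none match.
disallowed_repos() {
    # The optional leading group matches subdomains such as k8s.gcr.io,
    # which a plain ^gcr\.io/ pattern would miss.
    grep -E '^([a-z0-9.-]+\.)?(docker\.io|gcr\.io)/'
}

# Example use in a plain shell script (in a Go template such as
# bootkube.sh.template the braces would need escaping):
#   if podman images --format '{{.Repository}}' | sort -u | disallowed_repos; then
#       echo "Disallowed registries found!" >&2
#   fi
```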
Moving this to https://github.com/openshift/installer/pull/1761
Backport landed: https://github.com/openshift/installer/pull/1762#event-2354467704
Verified this bug with payload 4.1.0-0.nightly-2019-05-20-233429.

# oc adm release info registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-05-20-233429 --commits | grep installer
installer https://github.com/openshift/installer 6a98be1f3d853b277a004d91e7eede8d21947491

On a cluster installed by the previous installer, check /etc/crio/crio.conf on the bootstrap node:

[core@ip-10-0-5-192 ~]$ grep ^pause /etc/crio/crio.conf
pause_image = "k8s.gcr.io/pause:3.1"
pause_image_auth_file = ""
pause_command = "/pause"

With 4.1.0-0.nightly-2019-05-20-233429:

[core@ip-10-0-6-197 ~]$ grep ^pause /etc/crio/crio.conf
pause_image = "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5175e08492ca7affb85497bff1f74beb663b6e608cc953a48737325ea88de324"
pause_image_auth_file = ""
pause_command = "/usr/bin/pod"

So k8s.gcr.io/pause is no longer pulled. I also checked the masters and workers: no images from unofficial registries are used during a fresh install.
The fix PR has been merged into 4.1.0-0.nightly-2019-05-21-060354, and verification PASSED.

[root@bootstrap-0 ~]# cat /etc/crio/crio.conf |grep "pause_image ="
pause_image = "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5175e08492ca7affb85497bff1f74beb663b6e608cc953a48737325ea88de324"
[root@bootstrap-0 ~]# crictl images|grep pause
[root@bootstrap-0 ~]#

[root@control-plane-0 ~]# cat /etc/crio/crio.conf |grep "pause_image ="
pause_image = "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5175e08492ca7affb85497bff1f74beb663b6e608cc953a48737325ea88de324"
[root@control-plane-0 ~]# crictl images|grep pause
[root@control-plane-0 ~]#
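
The manual checks above could be scripted when many nodes need to be verified. This is a minimal sketch under the assumption that crio.conf keeps the pause_image key at the start of a line with a double-quoted value, as in the output shown; pause_image_of is a hypothetical helper name.

```shell
# Sketch: print the configured pause image from a crio.conf-style file.
# pause_image_of is a hypothetical helper; usage: pause_image_of CONF
pause_image_of() {
    # Print the value between the double quotes on the "pause_image = ..." line.
    # Lines such as "pause_image_auth_file = ..." do not match the pattern.
    sed -n -e 's/^pause_image *= *"\(.*\)".*/\1/p' "$1"
}

# Example use on a node:
#   case "$(pause_image_of /etc/crio/crio.conf)" in
#       *gcr.io/*) echo "pause image still comes from gcr.io" >&2 ;;
#   esac
```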
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758