1711844 – [UPI] [METAL] Kubelet cannot pull k8s.gcr.io/pause:3.1 image on bootstrap node

Summary: [UPI] [METAL] Kubelet cannot pull k8s.gcr.io/pause:3.1 image on bootstrap node

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Abhinav Dahiya
QA Contact:	Johnny Liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-05-20 08:47 UTC by xpflying
Modified:	2019-10-17 07:09 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-06-04 10:48:49 UTC
Target Upstream Version:
Embargoed:

Attachments

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift installer pull 1762	0	'None'	closed	Bug 1711844: bootkube.sh: Use pause image from payload	2020-12-24 05:33:26 UTC
Red Hat Product Errata	RHBA-2019:0758	0	None	None	None	2019-06-04 10:49:44 UTC

Description xpflying 2019-05-20 08:47:32 UTC

Description of problem:
Kubelet cannot pull the k8s.gcr.io/pause:3.1 image on the bootstrap node. In China, we cannot access k8s.gcr.io or other Google services.

Version-Release number of selected component (if applicable):
cat /etc/os-release 
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="410.8.20190516.0"
VERSION_ID="4.1"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 410.8.20190516.0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.1"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.1"
OSTREE_VERSION=410.8.20190516.0

How reproducible:


Steps to Reproduce:
1. Run journalctl -b -u kubelet on the bootstrap node


May 18 16:35:48 localhost.localdomain hyperkube[842]: E0518 16:35:48.093504     842 remote_runtime.go:96] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_bootstrap-machine-config-operator-localhost.localdomain_default_e40148da6106c3374d503347551ea5a4_0": Error determining manifest MIME type for docker://k8s.gcr.io/pause:3.1: pinging docker registry returned: Get https://k8s.gcr.io/v2/: dial tcp 108.177.125.82:443: i/o timeout
May 18 16:35:48 localhost.localdomain hyperkube[842]: E0518 16:35:48.093604     842 kuberuntime_sandbox.go:68] CreatePodSandbox for pod "bootstrap-machine-config-operator-localhost.localdomain_default(e40148da6106c3374d503347551ea5a4)" failed: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_bootstrap-machine-config-operator-localhost.localdomain_default_e40148da6106c3374d503347551ea5a4_0": Error determining manifest MIME type for docker://k8s.gcr.io/pause:3.1: pinging docker registry returned: Get https://k8s.gcr.io/v2/: dial tcp 108.177.125.82:443: i/o timeout
May 18 16:35:48 localhost.localdomain hyperkube[842]: E0518 16:35:48.093619     842 kuberuntime_manager.go:661] createPodSandbox for pod "bootstrap-machine-config-operator-localhost.localdomain_default(e40148da6106c3374d503347551ea5a4)" failed: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_bootstrap-machine-config-operator-localhost.localdomain_default_e40148da6106c3374d503347551ea5a4_0": Error determining manifest MIME type for docker://k8s.gcr.io/pause:3.1: pinging docker registry returned: Get https://k8s.gcr.io/v2/: dial tcp 108.177.125.82:443: i/o timeout
May 18 16:35:48 localhost.localdomain hyperkube[842]: E0518 16:35:48.093715     842 pod_workers.go:190] Error syncing pod e40148da6106c3374d503347551ea5a4 ("bootstrap-machine-config-operator-localhost.localdomain_default(e40148da6106c3374d503347551ea5a4)"), skipping: failed to "CreatePodSandbox" for "bootstrap-machine-config-operator-localhost.localdomain_default(e40148da6106c3374d503347551ea5a4)" with CreatePodSandboxError: "CreatePodSandbox for pod \"bootstrap-machine-config-operator-localhost.localdomain_default(e40148da6106c3374d503347551ea5a4)\" failed: rpc error: code = Unknown desc = error creating pod sandbox with name \"k8s_bootstrap-machine-config-operator-localhost.localdomain_default_e40148da6106c3374d503347551ea5a4_0\": Error determining manifest MIME type for docker://k8s.gcr.io/pause:3.1: pinging docker registry returned: Get https://k8s.gcr.io/v2/: dial tcp 108.177.125.82:443: i/o timeout"
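
To see at a glance which external images the kubelet is failing on, the journal output can be filtered for docker:// references. This is a minimal sketch that assumes the log format shown above; the helper name failed_images is hypothetical, and the pattern does not handle registries that carry a port number.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read kubelet log lines from stdin and print the unique
# image references that appear in "docker://<image>[:<tag>]" form.
failed_images() {
  grep -oE 'docker://[^: ]+(:[^: ]+)?' | sed 's,^docker://,,' | sort -u
}

# Usage on the bootstrap node (journalctl invocation taken from the report):
#   journalctl -b -u kubelet | failed_images
```

For the four log lines above, this should collapse the repeated errors into the single reference k8s.gcr.io/pause:3.1.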

Actual results:


Expected results:


Additional info:

Comment 1 Colin Walters 2019-05-20 13:42:47 UTC

Possible solutions:

1) Bake *a* pause image into the OS, use it by default (but installed system would use it)
2) Put up a pause image at quay.io that doesn't require authentication
3) Change the installer to pull the release image and update the kubelet config on bootstrap

Of these, 3) is probably easiest, perhaps even a one-liner - but 1) has come up in other contexts too.

(Is there any more detail on the China firewall in this aspect?  Is it blocking gcr.io but not quay.io?  Does it only block unauthenticated pulls?)

Comment 2 Antonio Murdaca 2019-05-20 13:47:01 UTC

The MCO isn't setting/using that image. That comes as part of the default cri-o package.
The payload image should be available on the bootstrap node though, so we should change that to whatever pause image the payload specifies (maybe in bootkube.sh?)
Or also, we could override that somehow (ship the cri-o RPM with a config file using a RH hosted image?)

Sounds like an installer issue though. Maybe we should just patch the crio configuration to use the payload pause image directly here before starting kube/openshift on bootstrap? https://github.com/openshift/installer/blob/master/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template

Comment 3 Antonio Murdaca 2019-05-20 13:49:12 UTC

bootkube already "reads" the pause image also:

MACHINE_CONFIG_INFRA_IMAGE=$(podman run --quiet --rm ${release} image pod)

As a stopgap, bootkube could just sed /etc/crio/crio.conf to use that on the bootstrap node (and restart crio)
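
The stopgap described here could look roughly like the following. This is a sketch under stated assumptions, not the change that was merged: the function name set_pause_image is hypothetical, and it assumes GNU sed and a crio.conf that already contains a pause_image line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: point pause_image in a crio.conf-style file at a new image.
# $1 = image reference, $2 = config file (defaults to /etc/crio/crio.conf).
set_pause_image() {
  local image="$1"
  local conf="${2:-/etc/crio/crio.conf}"
  # Commas delimit the sed expression because image references contain slashes.
  # The pattern requires "=" right after optional spaces, so the neighbouring
  # pause_image_auth_file setting is left untouched.
  sed -i -e 's,^pause_image *=.*,pause_image = "'"${image}"'",' "${conf}"
}

# Usage on the bootstrap node, followed by a restart of crio as suggested above:
#   set_pause_image "${MACHINE_CONFIG_INFRA_IMAGE}"
```

Taking the config path as a parameter keeps the rewrite easy to try against a copy of the file before touching the live configuration.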

Comment 4 Mrunal Patel 2019-05-20 15:04:32 UTC

Another option is that we push a pause image (that doesn't require auth) to quay.io that we default to in crio.

Comment 5 xpflying 2019-05-20 16:06:55 UTC

(In reply to Colin Walters from comment #1)
> Possible solutions:
> 
> 1) Bake *a* pause image into the OS, use it by default (but installed system
> would use it)
> 2) Put up a pause image at quay.io that doesn't require authentication
> 3) Change the installer to pull the release image and update the kubelet
> config on bootstrap
> 
> Of these, 3) is probably easiest, perhaps even a one-liner - but 1) has come
> up in other contexts too.
> 
> (Is there any more detail on the China firewall in this aspect?  Is it
> blocking gcr.io but not quay.io?  Does it only block unauthenticated pulls?)

The Great Firewall of China blocks gcr.io because it's owned by Google. quay.io is OK, but pulling images from quay.io seems slow and unstable in China; it is normally slower than pulling from docker.io. Is there any way to provide a disconnected installation method, such as image lists? Thanks.

Comment 6 Abhinav Dahiya 2019-05-20 16:10:25 UTC

Disconnected installs are part of 4.2 deliverables and this will be fixed then.

Comment 7 Derek Carr 2019-05-20 16:48:26 UTC

This must be fixed for 4.1.0, we cannot depend on images from gcr.io.

Comment 8 Colin Walters 2019-05-20 17:17:51 UTC

Taking a look at a patch for this now:

diff --git a/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template b/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template
index 9346f4b5f..9d2aa403f 100755
--- a/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template
+++ b/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template
@@ -38,6 +38,19 @@ OPENSHIFT_HYPERKUBE_IMAGE=$(podman run --quiet --rm ${release} image hyperkube)
 
 CLUSTER_BOOTSTRAP_IMAGE=$(podman run --quiet --rm ${release} image cluster-bootstrap)
 
+# Verify at this point we have no images from docker.io or gcr.io
+repos=$(mktemp --suffix='bootkube')
+podman images --sort repository --format '{{"{{"}}.Repository {{"}}"}}' | sort -u > ${repos}
+if grep -E '^(docker\.io|gcr\.io)/' "${repos}"; then
+  echo "Disallowed registries found!"
+fi
+
+# Now, as early as possible we replace the pause image and restart crio to use it, to ensure
+# that we're using the pause image from our payload just like the primary cluster.
+# Nothing should have created a pod yet.
+sed -i -e 's,pause_image *=.*,pause_image = "'"${MACHINE_CONFIG_INFRA_IMAGE}"'",' /etc/crio/crio.conf
+systemctl restart cri-o
+
 mkdir --parents ./{bootstrap-manifests,manifests}
 
 if [ ! -f cvo-bootstrap.done ]
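
The registry check in this patch can also be sketched as a standalone filter. The helper name disallowed_repos is hypothetical, and matching subdomains such as k8s.gcr.io is an assumption added here, on the grounds that the image named in this bug lives on a subdomain.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read image repository names from stdin and print those
# hosted on docker.io or gcr.io, including subdomains such as k8s.gcr.io.
# Like grep, it exits 0 only when at least one such repository is found.
disallowed_repos() {
  grep -E '^([^/]+\.)?(docker\.io|gcr\.io)/'
}

# Usage outside the Go-templated bootkube.sh.template, so the braces are unescaped:
#   podman images --format '{{.Repository}}' | sort -u | disallowed_repos \
#     && echo "Disallowed registries found!"
```

Printing the offending repositories, rather than only a warning, shows which image still needs to move into the payload.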

Comment 9 Colin Walters 2019-05-20 17:23:48 UTC

Moving this to https://github.com/openshift/installer/pull/1761

Comment 10 W. Trevor King 2019-05-20 22:49:29 UTC

Backport landed: https://github.com/openshift/installer/pull/1762#event-2354467704

Comment 12 Gaoyun Pei 2019-05-21 07:26:22 UTC

Verified this bug with payload 4.1.0-0.nightly-2019-05-20-233429

# oc adm release info registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-05-20-233429 --commits | grep installer
  installer                                     https://github.com/openshift/installer                                     6a98be1f3d853b277a004d91e7eede8d21947491


On a cluster installed by the previous installer, check /etc/crio/crio.conf on the bootstrap node:
[core@ip-10-0-5-192 ~]$ grep ^pause /etc/crio/crio.conf
pause_image = "k8s.gcr.io/pause:3.1"
pause_image_auth_file = ""
pause_command = "/pause"


With 4.1.0-0.nightly-2019-05-20-233429
[core@ip-10-0-6-197 ~]$ grep ^pause /etc/crio/crio.conf
pause_image = "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5175e08492ca7affb85497bff1f74beb663b6e608cc953a48737325ea88de324"
pause_image_auth_file = ""
pause_command = "/usr/bin/pod"

So k8s.gcr.io/pause is no longer pulled. I also checked the masters and workers: no images from unofficial registries are used during a fresh install.

Comment 13 Johnny Liu 2019-05-21 10:26:11 UTC

The fix PR is already merged into 4.1.0-0.nightly-2019-05-21-060354, and verification passes.


[root@bootstrap-0 ~]# cat /etc/crio/crio.conf |grep "pause_image ="
pause_image = "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5175e08492ca7affb85497bff1f74beb663b6e608cc953a48737325ea88de324"
[root@bootstrap-0 ~]# crictl  images|grep pause
[root@bootstrap-0 ~]#

[root@control-plane-0 ~]# cat /etc/crio/crio.conf |grep "pause_image ="
pause_image = "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5175e08492ca7affb85497bff1f74beb663b6e608cc953a48737325ea88de324"
[root@control-plane-0 ~]# crictl  images|grep pause
[root@control-plane-0 ~]#
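
The manual checks above could be scripted along these lines. This is a sketch only: the helper names pause_image_of and pause_image_ok are hypothetical, and it assumes the quoted "pause_image = ..." layout shown in the transcripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the pause_image value from a crio.conf-style file.
# $1 = config file (defaults to /etc/crio/crio.conf).
pause_image_of() {
  sed -n 's,^pause_image *= *"\(.*\)" *$,\1,p' "${1:-/etc/crio/crio.conf}"
}

# Hypothetical check: succeed only if a pause image is set and it is not hosted
# on gcr.io or one of its subdomains.
pause_image_ok() {
  local image
  image="$(pause_image_of "$1")"
  [ -n "${image}" ] && ! printf '%s\n' "${image}" | grep -qE '^([^/]+\.)?gcr\.io/'
}

# Usage on a node:
#   pause_image_ok && echo "pause image OK: $(pause_image_of)"
```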

Comment 15 errata-xmlrpc 2019-06-04 10:48:49 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
