Created attachment 1551293 [details]
journalctl -b -u kubelet

Description of problem:
The hyperkube.sh initialization process never finishes because it cannot create pods using the image registry.svc.ci.openshift.org/openshift/origin. When it asks for the image status, it receives several errors (see the attached logs for bootkube.service and kubelet).

Version-Release number of the following components:

# /root/bin/openshift-install-4.0.0-0.nightly-2019-04-02-081046 --dir /root/installation/baremetal-0204 version
/root/bin/openshift-install-4.0.0-0.nightly-2019-04-02-081046 v4.0.22-201903311754-dirty
built from commit 977c4db80a8005fd0fd0cea26996a455d526201f

$ cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="410.8.20190401.0"
VERSION_ID="4.1"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 410.8.20190401.0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.1"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.1"
OSTREE_VERSION=410.8.20190401.0

How reproducible:

Steps to Reproduce:
1. Execute openshift-install following the guide for UPI on bare metal
2. Access the bootstrap node by ssh
3. Check the kubelet journal (journalctl -b -u kubelet) and search for lines like:

abr 02 14:52:51 dell-r730-068.dsal.lab.eng.rdu2.redhat.com hyperkube[8343]: E0402 14:52:51.760701 8343 remote_image.go:87] ImageStatus "registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-02-133041@sha256:e9496cbe8432ed8378221fd940552a1160c495e937dd1b3a7f50ccd761a1c7cd" from image service failed: rpc error: code = Unknown desc = layer not known
abr 02 14:52:51 dell-r730-068.dsal.lab.eng.rdu2.redhat.com hyperkube[8343]: E0402 14:52:51.760762 8343 kuberuntime_image.go:87] ImageStatus for image {"registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-02-133041@sha256:e9496cbe8432ed8378221fd940552a1160c495e937dd1b3a7f50ccd761a1c7cd"} failed: rpc error: code = Unknown desc = layer not known
abr 02 14:52:51 dell-r730-068.dsal.lab.eng.rdu2.redhat.com hyperkube[8343]: E0402 14:52:51.760827 8343 kuberuntime_manager.go:717] init container start failed: ImageInspectError: Failed to inspect image "registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-02-133041@sha256:e9496cbe8432ed8378221fd940552a1160c495e937dd1b3a7f50ccd761a1c7cd": rpc error: code = Unknown desc = layer not known
abr 02 14:52:51 dell-r730-068.dsal.lab.eng.rdu2.redhat.com hyperkube[8343]: E0402 14:52:51.760871 8343 pod_workers.go:186] Error syncing pod 99cf2943debadfe458ae24c0d96cbe01 ("bootstrap-machine-config-operator-dell-r730-068.dsal.lab.eng.rdu2.redhat.com_default(99cf2943debadfe458ae24c0d96cbe01)"), skipping: failed to "StartContainer" for "machine-config-controller" with ImageInspectError: "Failed to inspect image \"registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-02-133041@sha256:e9496cbe8432ed8378221fd940552a1160c495e937dd1b3a7f50ccd761a1c7cd\": rpc error: code = Unknown desc = layer not known"

Actual results:
Kubelet never finishes starting

Expected results:
kubelet starts correctly

Additional info:
Workaround is to delete the local images from podman and let kubelet download them again
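To see which images are affected without reading the whole journal, the "layer not known" lines can be filtered and the image references pulled out of them. The following is only a rough sketch: the function name `images_with_unknown_layers` and the regular expression are illustrative assumptions, and it only recognizes digest-pinned references of the form `repo@sha256:<digest>` as they appear in the log lines above.

```python
import re

# Assumption: affected images are referenced as repo@sha256:<64 hex digits>,
# which is the form seen in the kubelet log lines in this report.
IMAGE_REF = re.compile(r'[\w.\-/:]+@sha256:[0-9a-f]{64}')


def images_with_unknown_layers(journal_lines):
    """Return distinct image references found on 'layer not known' lines.

    The order of first appearance is preserved.
    """
    seen = []
    for line in journal_lines:
        if "layer not known" not in line:
            continue
        for ref in IMAGE_REF.findall(line):
            if ref not in seen:
                seen.append(ref)
    return seen
```

The input could be, for example, the lines of a saved `journalctl -b -u kubelet` dump read from a file.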
Created attachment 1551294 [details] journalctl -b -u bootkube.service
Same error with the latest payload image and RHCOS8

# /root/bin/openshift-install-4.0.0-0.nightly-2019-04-02-150843 --dir /root/installation/baremetal-0304 version
/root/bin/openshift-install-4.0.0-0.nightly-2019-04-02-150843 v4.0.22-201903311754-dirty
built from commit 977c4db80a8005fd0fd0cea26996a455d526201f

# cat /etc/os
os-release ostree/
[root@dell-r730-068 ~]# cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="410.8.20190402.0"
VERSION_ID="4.1"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 410.8.20190402.0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.1"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.1"
OSTREE_VERSION=410.8.20190402.0

Apr 03 10:48:19 dell-r730-068.dsal.lab.eng.rdu2.redhat.com hyperkube[8229]: E0403 10:48:19.129823 8229 pod_workers.go:190] Error syncing pod 29b9928e9160ecaeb2c9d47ad003fc6e ("bootstrap-cluster-version-operator-dell-r730-068.dsal.lab.eng.rdu2.redhat.com_openshift-cluster-version(29b9928e9160ecaeb2c9d47ad003fc6e)"), skipping: failed to "CreatePodSandbox" for "bootstrap-cluster-version-operator-dell-r730-068.dsal.lab.eng.rdu2.redhat.com_openshift-cluster-version(29b9928e9160ecaeb2c9d47ad003fc6e)" with CreatePodSandboxError: "CreatePodSandbox for pod \"bootstrap-cluster-version-operator-dell-r730-068.dsal.lab.eng.rdu2.redhat.com_openshift-cluster-version(29b9928e9160ecaeb2c9d47ad003fc6e)\" failed: rpc error: code = Unknown desc = error creating pod sandbox with name \"k8s_bootstrap-cluster-version-operator-dell-r730-068.dsal.lab.eng.rdu2.redhat.com_openshift-cluster-version_29b9928e9160ecaeb2c9d47ad003fc6e_0\": layer not known"
Seems like something is going sideways with the image storage.
Some more debug info:

If we check the podman images, we see that some of them fail to show their size:

# podman images
REPOSITORY   TAG   IMAGE ID   CREATED   SIZE
registry.svc.ci.openshift.org/openshift/origin-release   v4.0   c426ba2df891   4 hours ago   283 MB
<none>   <none>   9fde3e396130   12 hours ago   314 MB
<none>   <none>   364c73ced698   13 hours ago   311 MB
<none>   <none>   52ae493f2aec   16 hours ago   316 MB
<none>   <none>   d9b81ca3d0d1   18 hours ago   276 MB
<none>   <none>   7e830d05f7be   23 hours ago   261 MB
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-04-095952@sha256   4144c980dd8196c157f86b5e9bba5808b50e71ea9c1bb2ff4205b4031cf22bfc   1824c007ea86   23 hours ago   unable to determine size
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-04-095952@sha256   0ec087e073673f8502f2137d4cacbb989060d3149c7c9f1588e30e4614016206   15ed210b0e53   23 hours ago   unable to determine size
<none>   <none>   992d63949613   7 days ago   265 MB
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-04-095952   none   079bb1fd26e3   7 days ago   272 MB
<none>   <none>   f9f64d8dbeb1   6 weeks ago   295 MB
k8s.gcr.io/pause   3.1   da86e6ba6ca1   15 months ago   unable to determine size

After pulling an image again, its size is calculated correctly:

# podman pull k8s.gcr.io/pause:3.1
Trying to pull k8s.gcr.io/pause:3.1...Getting image source signatures
Copying blob 67ddbfb20a22: 306.02 KiB / 306.02 KiB [========================] 0s
Copying config da86e6ba6ca1: 1.57 KiB / 1.57 KiB [==========================] 0s
Writing manifest to image destination
Storing signatures
da86e6ba6ca197bf6bc5e9d900febd906b133eaa4750e6bed647b0fbe50ed43e

# podman images
REPOSITORY   TAG   IMAGE ID   CREATED   SIZE
registry.svc.ci.openshift.org/openshift/origin-release   v4.0   c426ba2df891   4 hours ago   283 MB
<none>   <none>   9fde3e396130   12 hours ago   314 MB
<none>   <none>   364c73ced698   13 hours ago   311 MB
<none>   <none>   52ae493f2aec   16 hours ago   316 MB
<none>   <none>   d9b81ca3d0d1   18 hours ago   276 MB
<none>   <none>   7e830d05f7be   23 hours ago   261 MB
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-04-095952@sha256   4144c980dd8196c157f86b5e9bba5808b50e71ea9c1bb2ff4205b4031cf22bfc   1824c007ea86   23 hours ago   unable to determine size
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-04-095952@sha256   0ec087e073673f8502f2137d4cacbb989060d3149c7c9f1588e30e4614016206   15ed210b0e53   23 hours ago   unable to determine size
<none>   <none>   992d63949613   7 days ago   265 MB
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-04-095952   none   079bb1fd26e3   7 days ago   272 MB
<none>   <none>   f9f64d8dbeb1   6 weeks ago   295 MB
k8s.gcr.io/pause   3.1   da86e6ba6ca1   15 months ago   747 kB

After re-pulling every image that showed the size calculation error, kubelet starts, the installation continues, and the cluster is finally installed correctly.
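The manual re-pull described in this comment could be scripted by scanning the `podman images` listing for rows that end in "unable to determine size" and rebuilding a pullable reference from the first two columns. This is a minimal sketch, assuming the column layout shown above (digest-pinned images appear as `repo@sha256` in REPOSITORY with the digest in TAG); `images_to_repull` is a hypothetical helper name, not part of podman.

```python
def images_to_repull(podman_images_output):
    """Return image references for rows whose size could not be determined.

    Assumes the whitespace-separated layout of the `podman images` output
    quoted in this report: REPOSITORY, TAG, IMAGE ID, CREATED, SIZE.
    """
    refs = []
    for line in podman_images_output.splitlines():
        if not line.rstrip().endswith("unable to determine size"):
            continue
        fields = line.split()
        repo, tag = fields[0], fields[1]
        if repo == "<none>" or tag == "<none>":
            # Untagged image: there is no name to pull it by.
            continue
        # "repo" + ":" + "tag" yields repo:tag for tagged images and
        # repo@sha256:<digest> for digest-pinned ones.
        refs.append(f"{repo}:{tag}")
    return refs
```

Each returned reference could then be passed to `podman pull`, which is what was done by hand here.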
RHCOS: 410.8.20190405.0

# /root/bin/openshift-install-4.0.0-0.nightly-2019-04-05-165550 --dir /root/installation/baremetal-0804 version
/root/bin/openshift-install-4.0.0-0.nightly-2019-04-05-165550 v4.0.22-201904032147-dirty
built from commit b6625b4084fb01ffbe190f1c74208ea00b4b7d9b
release image registry.svc.ci.openshift.org/openshift/origin-release:v4.0

Changes to bootstrap.ign relative to the original generated by "openshift-install-4.0.0-0.nightly-2019-04-05-165550 --dir /root/installation/baremetal-0804 create ignition-configs":

Kubelet starts with --serialize-image-pulls=true:

{"contents":"[Unit]\nDescription=Kubernetes Kubelet\nWants=rpc-statd.service\n\n[Service]\nType=notify\nExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests\nEnvironment=KUBELET_RUNTIME_REQUEST_TIMEOUT=10m\nEnvironmentFile=-/etc/kubernetes/kubelet-env\n\nExecStart=/usr/bin/hyperkube \\\n kubelet \\\n --container-runtime=remote \\\n --container-runtime-endpoint=/var/run/crio/crio.sock \\\n --runtime-request-timeout=${KUBELET_RUNTIME_REQUEST_TIMEOUT} \\\n --pod-manifest-path=/etc/kubernetes/manifests \\\n --allow-privileged \\\n --minimum-container-ttl-duration=6m0s \\\n --cluster-domain=cluster.local \\\n --cgroup-driver=systemd \\\n --serialize-image-pulls=true \\\n --v=2 \\\n\nRestart=always\nRestartSec=10\n\n[Install]\nWantedBy=multi-user.target\n","enabled":true,"name":"kubelet.service"}

Debug logging was enabled for crio by adding the following systemd service:

{"contents":"[Unit]\nDescription=CRI-O daemon\nDocumentation=https://github.com/cri-o/cri-o\n\n[Service]\nExecStart=/bin/crio --runtime /bin/runc --log /root/crio.log --log-level debug\nRestart=always\nRestartSec=10s\n\n[Install]\nWantedBy=multi-user.target\n","name":"crio.service"}

[root@dell-r730-068 ~]# ps -ef | grep kubelet
root 3237 1 1 09:53 ? 00:00:04 /usr/bin/hyperkube kubelet --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-request-timeout=10m --pod-manifest-path=/etc/kubernetes/manifests --allow-privileged --minimum-container-ttl-duration=6m0s --cluster-domain=cluster.local --cgroup-driver=systemd --serialize-image-pulls=true --v=2

ps -ef | grep crio
root 3181 1 0 09:53 ? 00:00:02 /bin/crio --runtime /bin/runc --log /root/crio.log --log-level debug

After waiting for some time, the etcd cluster had not started, so I manually pulled the images again:

# podman images
REPOSITORY   TAG   IMAGE ID   CREATED   SIZE
registry.svc.ci.openshift.org/openshift/origin-release   v4.0   a6bba4286d81   23 hours ago   283 MB
<none>   <none>   0ddf75385038   2 days ago   265 MB
<none>   <none>   0ddf83bfe39b   2 days ago   314 MB
<none>   <none>   9c4d1cb8c228   3 days ago   316 MB
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-07-112943@sha256   a11bc98182d7fdad4b71cf3a4fdddb0bc6d34cf00473d52e42186b0b6410bdd0   b206245de22d   3 days ago   unable to determine size
<none>   <none>   f6ae41ec051f   3 days ago   261 MB
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-07-112943@sha256   ad85ae157fb7f978aaed14c059772785fbcc4b76d0c503c0d9250427d1b12fa4   dd8a865bb213   3 days ago   unable to determine size
<none>   <none>   364c73ced698   4 days ago   311 MB
<none>   <none>   d9b81ca3d0d1   4 days ago   276 MB
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-07-112943   none   f9f64d8dbeb1   6 weeks ago   295 MB
k8s.gcr.io/pause   3.1   da86e6ba6ca1   15 months ago   unable to determine size

# podman pull registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-07-112943@sha256:a11bc98182d7fdad4b71cf3a4fdddb0bc6d34cf00473d52e42186b0b6410bdd0
# podman pull registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-07-112943@sha256:ad85ae157fb7f978aaed14c059772785fbcc4b76d0c503c0d9250427d1b12fa4
# podman pull k8s.gcr.io/pause:3.1

# podman images
REPOSITORY   TAG   IMAGE ID   CREATED   SIZE
registry.svc.ci.openshift.org/openshift/origin-release   v4.0   a6bba4286d81   23 hours ago   283 MB
<none>   <none>   0ddf75385038   2 days ago   265 MB
<none>   <none>   0ddf83bfe39b   2 days ago   314 MB
<none>   <none>   9c4d1cb8c228   3 days ago   316 MB
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-07-112943@sha256   a11bc98182d7fdad4b71cf3a4fdddb0bc6d34cf00473d52e42186b0b6410bdd0   b206245de22d   3 days ago   266 MB
<none>   <none>   f6ae41ec051f   3 days ago   261 MB
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-07-112943@sha256   ad85ae157fb7f978aaed14c059772785fbcc4b76d0c503c0d9250427d1b12fa4   dd8a865bb213   3 days ago   249 MB
<none>   <none>   364c73ced698   4 days ago   311 MB
<none>   <none>   d9b81ca3d0d1   4 days ago   276 MB
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-07-112943   none   f9f64d8dbeb1   6 weeks ago   295 MB
k8s.gcr.io/pause   3.1   da86e6ba6ca1   15 months ago   747 kB

After that, the API and machine-config started on the bootstrap node, the etcd cluster was finally installed, and the OpenShift installation finished correctly.

The /root/crio.log debug log is attached.
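For anyone repeating the bootstrap.ign changes from this comment, the unit entries can be added or replaced programmatically rather than by hand-editing the JSON. The sketch below assumes the Ignition layout in which units live under `systemd.units` as objects with `name`, `contents`, and an optional `enabled` field, matching the two entries quoted above; `set_unit` is a hypothetical helper, not an installer or Ignition API.

```python
import json


def set_unit(ignition, name, contents, enabled=None):
    """Add or replace a systemd unit entry in an Ignition config dict.

    `ignition` is the parsed JSON of a file such as bootstrap.ign.
    The dict is modified in place and also returned.
    """
    units = ignition.setdefault("systemd", {}).setdefault("units", [])
    unit = {"contents": contents, "name": name}
    if enabled is not None:
        unit["enabled"] = enabled
    for i, existing in enumerate(units):
        if existing.get("name") == name:
            units[i] = unit  # replace the generated unit of the same name
            break
    else:
        units.append(unit)  # no unit with that name yet
    return ignition


if __name__ == "__main__":
    # Illustrative only: a tiny config with a placeholder unit body.
    config = {"systemd": {"units": [{"name": "kubelet.service", "contents": "old"}]}}
    set_unit(config, "kubelet.service", "[Unit]\nDescription=Kubernetes Kubelet\n", enabled=True)
    print(json.dumps(config))
```

Going through `json.loads` and `json.dumps` also takes care of escaping the newlines and backslashes in the unit body, which is easy to get wrong by hand.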
Master and worker nodes are still booting with the old RHCOS version and crio.

Bootstrap node:

# cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="410.8.20190410.0"
VERSION_ID="4.1"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 410.8.20190410.0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.1"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.1"
OSTREE_VERSION=410.8.20190410.0
# crio --version
crio version 1.13.5-1.rhaos4.1.gita9d8dde.el8

Master Node:

# cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="410.8.20190408.1"
VERSION_ID="4.1"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 410.8.20190408.1 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.1"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.1"
OSTREE_VERSION=410.8.20190408.1
# crio --version
crio version 1.13.4-3.rhaos4.1.git30006b3.el8

Worker Node:

# cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="410.8.20190408.1"
VERSION_ID="4.1"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 410.8.20190408.1 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.1"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.1"
OSTREE_VERSION=410.8.20190408.1
# crio --version
crio version 1.13.4-3.rhaos4.1.git30006b3.el8
Fixed in cri-o-1.13.6-1.
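To check whether a given node already carries the fixed build, the `crio --version` output can be compared against 1.13.6-1. The sketch below is an assumption-laden shortcut: it compares only the numeric `X.Y.Z-R` prefix as a tuple and ignores the rest of the string (for example `.rhaos4.1.gita9d8dde.el8`), so it is not a full RPM version comparison. The names `parse_crio_version` and `has_fix` are illustrative.

```python
import re

# Version named in this bug as containing the fix.
FIXED = (1, 13, 6, 1)


def parse_crio_version(text):
    """Extract (major, minor, patch, release) from a version string."""
    match = re.search(r'(\d+)\.(\d+)\.(\d+)-(\d+)', text)
    if match is None:
        raise ValueError(f"no X.Y.Z-R version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(text):
    """Return True if the version in `text` is at least cri-o 1.13.6-1."""
    return parse_crio_version(text) >= FIXED
```

Applied to the versions reported earlier in this bug, both `1.13.5-1` (bootstrap) and `1.13.4-3` (master and worker) compare as older than the fixed build.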
Verified on unreleased-master-811-g5a3c57cb37b0f175c2ae33e64cd9a6947bd1d567-dirty
*** Bug 1708663 has been marked as a duplicate of this bug. ***
*** Bug 1708605 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758