Bug 1695516 - [UPI] [METAL] Kubelet not starting on bootstrap node because of failed images: "layer not known"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.1.0
Assignee: Nalin Dahyabhai
QA Contact: David Sanz
URL:
Whiteboard:
Duplicates: 1708663
Depends On:
Blocks:
Reported: 2019-04-03 09:20 UTC by David Sanz
Modified: 2019-10-22 08:18 UTC (History)

Fixed In Version: cri-o-1.13.5-1.rhaos4.1.gita9d8dde.el8
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:47:00 UTC
Target Upstream Version:


Attachments
journalctl -b -u kubelet (13.00 MB, text/plain), 2019-04-03 09:20 UTC, David Sanz
journalctl -b -u bootkube.service (451.32 KB, text/plain), 2019-04-03 09:21 UTC, David Sanz


Links
Red Hat Product Errata RHBA-2019:0758 (last updated 2019-06-04 10:47:10 UTC)

Description David Sanz 2019-04-03 09:20:59 UTC
Created attachment 1551293
journalctl -b -u kubelet

Description of problem:
The hyperkube.sh initialization process never completes because it cannot create pods using the image registry.svc.ci.openshift.org/openshift/origin.

When the image status is queried, several errors are returned (see the attached journal logs for bootkube.service and the kubelet).

Version-Release number of the following components:
# /root/bin/openshift-install-4.0.0-0.nightly-2019-04-02-081046 --dir /root/installation/baremetal-0204 version
/root/bin/openshift-install-4.0.0-0.nightly-2019-04-02-081046 v4.0.22-201903311754-dirty
built from commit 977c4db80a8005fd0fd0cea26996a455d526201f

$ cat /etc/os-release 
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="410.8.20190401.0"
VERSION_ID="4.1"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 410.8.20190401.0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.1"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.1"
OSTREE_VERSION=410.8.20190401.0


How reproducible:

Steps to Reproduce:
1. Run openshift-install following the guide for UPI on bare metal
2. Access the bootstrap node by ssh
3. Check the kubelet journal log (journalctl -b -u kubelet) and search for lines like:

abr 02 14:52:51 dell-r730-068.dsal.lab.eng.rdu2.redhat.com hyperkube[8343]: E0402 14:52:51.760701    8343 remote_image.go:87] ImageStatus "registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-02-133041@sha256:e9496cbe8432ed8378221fd940552a1160c495e937dd1b3a7f50ccd761a1c7cd" from image service failed: rpc error: code = Unknown desc = layer not known
abr 02 14:52:51 dell-r730-068.dsal.lab.eng.rdu2.redhat.com hyperkube[8343]: E0402 14:52:51.760762    8343 kuberuntime_image.go:87] ImageStatus for image {"registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-02-133041@sha256:e9496cbe8432ed8378221fd940552a1160c495e937dd1b3a7f50ccd761a1c7cd"} failed: rpc error: code = Unknown desc = layer not known
abr 02 14:52:51 dell-r730-068.dsal.lab.eng.rdu2.redhat.com hyperkube[8343]: E0402 14:52:51.760827    8343 kuberuntime_manager.go:717] init container start failed: ImageInspectError: Failed to inspect image "registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-02-133041@sha256:e9496cbe8432ed8378221fd940552a1160c495e937dd1b3a7f50ccd761a1c7cd": rpc error: code = Unknown desc = layer not known
abr 02 14:52:51 dell-r730-068.dsal.lab.eng.rdu2.redhat.com hyperkube[8343]: E0402 14:52:51.760871    8343 pod_workers.go:186] Error syncing pod 99cf2943debadfe458ae24c0d96cbe01 ("bootstrap-machine-config-operator-dell-r730-068.dsal.lab.eng.rdu2.redhat.com_default(99cf2943debadfe458ae24c0d96cbe01)"), skipping: failed to "StartContainer" for "machine-config-controller" with ImageInspectError: "Failed to inspect image \"registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-02-133041@sha256:e9496cbe8432ed8378221fd940552a1160c495e937dd1b3a7f50ccd761a1c7cd\": rpc error: code = Unknown desc = layer not known"
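
The affected image references can be collected from a saved kubelet journal instead of reading it by eye. The following is a minimal sketch, assuming the journal was exported as plain text; the function name and the regular expression are illustrative and only modeled on the log lines quoted above.

```python
import re

# Assumption: image references look like the ones in the quoted log lines,
# i.e. a registry path followed by "@sha256:" and a 64-character hex digest.
IMAGE_REF = re.compile(r"[\w./:-]+@sha256:[0-9a-f]{64}")


def images_with_unknown_layers(journal_text):
    """Return distinct image references from lines reporting 'layer not known'."""
    found = []
    for line in journal_text.splitlines():
        if "layer not known" not in line:
            continue  # only lines carrying the storage error are of interest
        for ref in IMAGE_REF.findall(line):
            if ref not in found:  # keep first-seen order, drop repeats
                found.append(ref)
    return found
```

Passing the text of `journalctl -b -u kubelet` to `images_with_unknown_layers` would list each failing image once, which is the set to pull again when applying the workaround.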




Actual results:
The kubelet never finishes starting.

Expected results:
The kubelet starts correctly.


Additional info:

The workaround is to delete the local images with podman and let the kubelet download them again.

Comment 1 David Sanz 2019-04-03 09:21:32 UTC
Created attachment 1551294
journalctl -b -u bootkube.service

Comment 2 David Sanz 2019-04-03 10:49:32 UTC
The same error occurs with the latest payload image and RHCOS8.

# /root/bin/openshift-install-4.0.0-0.nightly-2019-04-02-150843 --dir /root/installation/baremetal-0304 version
/root/bin/openshift-install-4.0.0-0.nightly-2019-04-02-150843 v4.0.22-201903311754-dirty
built from commit 977c4db80a8005fd0fd0cea26996a455d526201f


# cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="410.8.20190402.0"
VERSION_ID="4.1"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 410.8.20190402.0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.1"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.1"
OSTREE_VERSION=410.8.20190402.0





Apr 03 10:48:19 dell-r730-068.dsal.lab.eng.rdu2.redhat.com hyperkube[8229]: E0403 10:48:19.129823    8229 pod_workers.go:190] Error syncing pod 29b9928e9160ecaeb2c9d47ad003fc6e ("bootstrap-cluster-version-operator-dell-r730-068.dsal.lab.eng.rdu2.redhat.com_openshift-cluster-version(29b9928e9160ecaeb2c9d47ad003fc6e)"), skipping: failed to "CreatePodSandbox" for "bootstrap-cluster-version-operator-dell-r730-068.dsal.lab.eng.rdu2.redhat.com_openshift-cluster-version(29b9928e9160ecaeb2c9d47ad003fc6e)" with CreatePodSandboxError: "CreatePodSandbox for pod \"bootstrap-cluster-version-operator-dell-r730-068.dsal.lab.eng.rdu2.redhat.com_openshift-cluster-version(29b9928e9160ecaeb2c9d47ad003fc6e)\" failed: rpc error: code = Unknown desc = error creating pod sandbox with name \"k8s_bootstrap-cluster-version-operator-dell-r730-068.dsal.lab.eng.rdu2.redhat.com_openshift-cluster-version_29b9928e9160ecaeb2c9d47ad003fc6e_0\": layer not known"

Comment 3 Seth Jennings 2019-04-04 13:53:53 UTC
Seems like something is going sideways with the image storage.

Comment 4 David Sanz 2019-04-04 14:00:22 UTC
Some more debug info:

Listing the podman images shows that the size cannot be determined for some of them:


# podman images
REPOSITORY                                                                     TAG                                                                IMAGE ID       CREATED         SIZE
registry.svc.ci.openshift.org/openshift/origin-release                         v4.0                                                               c426ba2df891   4 hours ago     283 MB
<none>                                                                         <none>                                                             9fde3e396130   12 hours ago    314 MB
<none>                                                                         <none>                                                             364c73ced698   13 hours ago    311 MB
<none>                                                                         <none>                                                             52ae493f2aec   16 hours ago    316 MB
<none>                                                                         <none>                                                             d9b81ca3d0d1   18 hours ago    276 MB
<none>                                                                         <none>                                                             7e830d05f7be   23 hours ago    261 MB
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-04-095952@sha256   4144c980dd8196c157f86b5e9bba5808b50e71ea9c1bb2ff4205b4031cf22bfc   1824c007ea86   23 hours ago    unable to determine size
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-04-095952@sha256   0ec087e073673f8502f2137d4cacbb989060d3149c7c9f1588e30e4614016206   15ed210b0e53   23 hours ago    unable to determine size
<none>                                                                         <none>                                                             992d63949613   7 days ago      265 MB
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-04-095952          none                                                               079bb1fd26e3   7 days ago      272 MB
<none>                                                                         <none>                                                             f9f64d8dbeb1   6 weeks ago     295 MB
k8s.gcr.io/pause                                                               3.1                                                                da86e6ba6ca1   15 months ago   unable to determine size



After pulling them again, the size is calculated correctly:



# podman pull k8s.gcr.io/pause:3.1
Trying to pull k8s.gcr.io/pause:3.1...Getting image source signatures
Copying blob 67ddbfb20a22: 306.02 KiB / 306.02 KiB [========================] 0s
Copying config da86e6ba6ca1: 1.57 KiB / 1.57 KiB [==========================] 0s
Writing manifest to image destination
Storing signatures
da86e6ba6ca197bf6bc5e9d900febd906b133eaa4750e6bed647b0fbe50ed43e
# podman images
REPOSITORY                                                                     TAG                                                                IMAGE ID       CREATED         SIZE
registry.svc.ci.openshift.org/openshift/origin-release                         v4.0                                                               c426ba2df891   4 hours ago     283 MB
<none>                                                                         <none>                                                             9fde3e396130   12 hours ago    314 MB
<none>                                                                         <none>                                                             364c73ced698   13 hours ago    311 MB
<none>                                                                         <none>                                                             52ae493f2aec   16 hours ago    316 MB
<none>                                                                         <none>                                                             d9b81ca3d0d1   18 hours ago    276 MB
<none>                                                                         <none>                                                             7e830d05f7be   23 hours ago    261 MB
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-04-095952@sha256   4144c980dd8196c157f86b5e9bba5808b50e71ea9c1bb2ff4205b4031cf22bfc   1824c007ea86   23 hours ago    unable to determine size
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-04-095952@sha256   0ec087e073673f8502f2137d4cacbb989060d3149c7c9f1588e30e4614016206   15ed210b0e53   23 hours ago    unable to determine size
<none>                                                                         <none>                                                             992d63949613   7 days ago      265 MB
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-04-095952          none                                                               079bb1fd26e3   7 days ago      272 MB
<none>                                                                         <none>                                                             f9f64d8dbeb1   6 weeks ago     295 MB
k8s.gcr.io/pause                                                               3.1                                                                da86e6ba6ca1   15 months ago   747 kB



After re-pulling every image that showed the size calculation error, the kubelet starts, the installation continues, and the cluster is finally installed correctly.
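
Finding the rows to pull again can be scripted. The following is a minimal sketch, assuming the column layout of the `podman images` listings in this comment; the function name is illustrative, and joining REPOSITORY and TAG with a colon is based on how the digest rows are displayed here.

```python
UNKNOWN_SIZE = "unable to determine size"


def images_to_repull(podman_images_output):
    """Return pull references for rows whose SIZE column could not be computed."""
    refs = []
    for line in podman_images_output.splitlines():
        line = line.rstrip()
        if not line.endswith(UNKNOWN_SIZE):
            continue  # header row and healthy images are skipped
        repository, tag = line.split()[:2]
        if "<none>" in (repository, tag):
            continue  # untagged rows cannot be pulled by name
        # Digest rows are listed as "repo@sha256" plus the digest in the TAG
        # column, so a colon rebuilds both "repo:tag" and "repo@sha256:digest".
        refs.append(f"{repository}:{tag}")
    return refs
```

Each returned reference can then be passed to `podman pull`, as was done by hand for `k8s.gcr.io/pause:3.1`.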

Comment 6 David Sanz 2019-04-08 10:32:17 UTC
RHCOS: 410.8.20190405.0

# /root/bin/openshift-install-4.0.0-0.nightly-2019-04-05-165550 --dir /root/installation/baremetal-0804 version
/root/bin/openshift-install-4.0.0-0.nightly-2019-04-05-165550 v4.0.22-201904032147-dirty
built from commit b6625b4084fb01ffbe190f1c74208ea00b4b7d9b
release image registry.svc.ci.openshift.org/openshift/origin-release:v4.0


Changes to bootstrap.ign relative to the original generated by "openshift-install-4.0.0-0.nightly-2019-04-05-165550 --dir /root/installation/baremetal-0804 create ignition-configs":

Kubelet starts with --serialize-image-pulls=true:

{"contents":"[Unit]\nDescription=Kubernetes Kubelet\nWants=rpc-statd.service\n\n[Service]\nType=notify\nExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests\nEnvironment=KUBELET_RUNTIME_REQUEST_TIMEOUT=10m\nEnvironmentFile=-/etc/kubernetes/kubelet-env\n\nExecStart=/usr/bin/hyperkube \\\n  kubelet \\\n    --container-runtime=remote \\\n    --container-runtime-endpoint=/var/run/crio/crio.sock \\\n    --runtime-request-timeout=${KUBELET_RUNTIME_REQUEST_TIMEOUT} \\\n    --pod-manifest-path=/etc/kubernetes/manifests \\\n    --allow-privileged \\\n    --minimum-container-ttl-duration=6m0s \\\n    --cluster-domain=cluster.local \\\n    --cgroup-driver=systemd \\\n    --serialize-image-pulls=true \\\n    --v=2 \\\n\nRestart=always\nRestartSec=10\n\n[Install]\nWantedBy=multi-user.target\n","enabled":true,"name":"kubelet.service"}

Enabled debug logging in crio by adding the following systemd service:

{"contents":"[Unit]\nDescription=CRI-O daemon\nDocumentation=https://github.com/cri-o/cri-o\n\n[Service]\nExecStart=/bin/crio --runtime /bin/runc --log /root/crio.log --log-level debug\nRestart=always\nRestartSec=10s\n\n[Install]\nWantedBy=multi-user.target\n","name":"crio.service"}

[root@dell-r730-068 ~]# ps -ef | grep kubelet
root      3237     1  1 09:53 ?        00:00:04 /usr/bin/hyperkube kubelet --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-request-timeout=10m --pod-manifest-path=/etc/kubernetes/manifests --allow-privileged --minimum-container-ttl-duration=6m0s --cluster-domain=cluster.local --cgroup-driver=systemd --serialize-image-pulls=true --v=2

[root@dell-r730-068 ~]# ps -ef | grep crio
root      3181     1  0 09:53 ?        00:00:02 /bin/crio --runtime /bin/runc --log /root/crio.log --log-level debug
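
The unit entries shown in this comment are JSON objects whose "contents" field holds the whole unit file with newlines escaped. The following is a minimal sketch of producing such an entry from readable unit text; the helper name is illustrative, and where the entry is placed inside bootstrap.ign is not covered.

```python
import json

# The CRI-O debug unit from this comment, written as plain text.
CRIO_UNIT = """[Unit]
Description=CRI-O daemon
Documentation=https://github.com/cri-o/cri-o

[Service]
ExecStart=/bin/crio --runtime /bin/runc --log /root/crio.log --log-level debug
Restart=always
RestartSec=10s

[Install]
WantedBy=multi-user.target
"""


def ignition_unit(name, unit_text, enabled=None):
    """Serialize one systemd unit as a compact JSON object for an Ignition file."""
    entry = {"contents": unit_text, "name": name}
    if enabled is not None:
        entry["enabled"] = enabled  # omitted entirely when not specified
    # sort_keys and compact separators mirror the key order and spacing
    # of the entries quoted above; json.dumps escapes the newlines.
    return json.dumps(entry, sort_keys=True, separators=(",", ":"))


crio_entry = ignition_unit("crio.service", CRIO_UNIT)
```

Editing the unit as plain text and serializing it afterwards avoids hand-escaping mistakes in the "\n" sequences and in the doubled backslashes of the kubelet ExecStart line.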


After waiting for some time, the etcd cluster had not started, so I manually pulled the images again:

# podman images
REPOSITORY                                                                     TAG                                                                IMAGE ID       CREATED         SIZE
registry.svc.ci.openshift.org/openshift/origin-release                         v4.0                                                               a6bba4286d81   23 hours ago    283 MB
<none>                                                                         <none>                                                             0ddf75385038   2 days ago      265 MB
<none>                                                                         <none>                                                             0ddf83bfe39b   2 days ago      314 MB
<none>                                                                         <none>                                                             9c4d1cb8c228   3 days ago      316 MB
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-07-112943@sha256   a11bc98182d7fdad4b71cf3a4fdddb0bc6d34cf00473d52e42186b0b6410bdd0   b206245de22d   3 days ago      unable to determine size
<none>                                                                         <none>                                                             f6ae41ec051f   3 days ago      261 MB
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-07-112943@sha256   ad85ae157fb7f978aaed14c059772785fbcc4b76d0c503c0d9250427d1b12fa4   dd8a865bb213   3 days ago      unable to determine size
<none>                                                                         <none>                                                             364c73ced698   4 days ago      311 MB
<none>                                                                         <none>                                                             d9b81ca3d0d1   4 days ago      276 MB
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-07-112943          none                                                               f9f64d8dbeb1   6 weeks ago     295 MB
k8s.gcr.io/pause                                                               3.1                                                                da86e6ba6ca1   15 months ago   unable to determine size


# podman pull registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-07-112943@sha256:a11bc98182d7fdad4b71cf3a4fdddb0bc6d34cf00473d52e42186b0b6410bdd0
# podman pull registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-07-112943@sha256:ad85ae157fb7f978aaed14c059772785fbcc4b76d0c503c0d9250427d1b12fa4
# podman pull k8s.gcr.io/pause:3.1

# podman images
REPOSITORY                                                                     TAG                                                                IMAGE ID       CREATED         SIZE
registry.svc.ci.openshift.org/openshift/origin-release                         v4.0                                                               a6bba4286d81   23 hours ago    283 MB
<none>                                                                         <none>                                                             0ddf75385038   2 days ago      265 MB
<none>                                                                         <none>                                                             0ddf83bfe39b   2 days ago      314 MB
<none>                                                                         <none>                                                             9c4d1cb8c228   3 days ago      316 MB
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-07-112943@sha256   a11bc98182d7fdad4b71cf3a4fdddb0bc6d34cf00473d52e42186b0b6410bdd0   b206245de22d   3 days ago      266 MB
<none>                                                                         <none>                                                             f6ae41ec051f   3 days ago      261 MB
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-07-112943@sha256   ad85ae157fb7f978aaed14c059772785fbcc4b76d0c503c0d9250427d1b12fa4   dd8a865bb213   3 days ago      249 MB
<none>                                                                         <none>                                                             364c73ced698   4 days ago      311 MB
<none>                                                                         <none>                                                             d9b81ca3d0d1   4 days ago      276 MB
registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-04-07-112943          none                                                               f9f64d8dbeb1   6 weeks ago     295 MB
k8s.gcr.io/pause                                                               3.1                                                                da86e6ba6ca1   15 months ago   747 kB


After that, the API and machine-config started on the bootstrap node, the etcd cluster was finally installed, and the OpenShift installation completed correctly.


Attached is the /root/crio.log debug log.

Comment 11 David Sanz 2019-04-12 09:11:12 UTC
Master and worker nodes are still booting with the old RHCOS version and crio.

Bootstrap node:

# cat /etc/os-release 
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="410.8.20190410.0"
VERSION_ID="4.1"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 410.8.20190410.0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.1"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.1"
OSTREE_VERSION=410.8.20190410.0
# crio --version
crio version 1.13.5-1.rhaos4.1.gita9d8dde.el8


Master Node:
# cat /etc/os-release 
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="410.8.20190408.1"
VERSION_ID="4.1"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 410.8.20190408.1 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.1"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.1"
OSTREE_VERSION=410.8.20190408.1
# crio --version
crio version 1.13.4-3.rhaos4.1.git30006b3.el8


Worker Node:
# cat /etc/os-release 
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="410.8.20190408.1"
VERSION_ID="4.1"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 410.8.20190408.1 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.1"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.1"
OSTREE_VERSION=410.8.20190408.1
# crio --version
crio version 1.13.4-3.rhaos4.1.git30006b3.el8
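
The version mismatch above can be checked mechanically once the `crio --version` output of each node has been collected. The following is a minimal sketch with illustrative function names; it compares only the numeric version and release number, which is a simplification and not a full RPM version comparison.

```python
import re


def crio_version_key(version_output):
    """Extract (major, minor, patch, release) from `crio --version` output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)-(\d+)", version_output)
    if match is None:
        raise ValueError(f"no version found in {version_output!r}")
    return tuple(int(part) for part in match.groups())


def nodes_behind(versions_by_node, reference_node):
    """Return the nodes whose crio build is older than the reference node's."""
    reference = crio_version_key(versions_by_node[reference_node])
    return sorted(
        node
        for node, output in versions_by_node.items()
        if crio_version_key(output) < reference
    )
```

With the three outputs from this comment and the bootstrap node as reference, the master and worker nodes would be reported as running the older build.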

Comment 12 Mrunal Patel 2019-04-12 20:21:48 UTC
Fixed in cri-o-1.13.6-1.

Comment 15 David Sanz 2019-04-16 15:43:47 UTC
Verified on unreleased-master-811-g5a3c57cb37b0f175c2ae33e64cd9a6947bd1d567-dirty

Comment 16 Neelesh Agrawal 2019-05-10 15:41:20 UTC
*** Bug 1708663 has been marked as a duplicate of this bug. ***

Comment 17 Neelesh Agrawal 2019-05-10 15:42:13 UTC
*** Bug 1708605 has been marked as a duplicate of this bug. ***

Comment 19 errata-xmlrpc 2019-06-04 10:47:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

